* [PATCH] packfile: enhance the mtime of packfile by idx file @ 2021-07-10 19:01 Sun Chao via GitGitGadget
  2021-07-11 23:44 ` Ævar Arnfjörð Bjarmason
  2021-07-14  1:28 ` [PATCH v2] packfile: freshen the mtime of packfile by configuration Sun Chao via GitGitGadget
  0 siblings, 2 replies; 34+ messages in thread

From: Sun Chao via GitGitGadget @ 2021-07-10 19:01 UTC (permalink / raw)
To: git; +Cc: Sun Chao, Sun Chao

From: Sun Chao <16657101987@163.com>

Commit 33d4221c79 (write_sha1_file: freshen existing objects, 2014-10-15) avoids writing existing objects by freshening their mtime (especially that of the packfiles that contain them) in order to aid correct caching, so that processes like find_lru_pack can make good decisions. However, this is unfriendly to incremental backup jobs, and to services that rely on the file system cache, when large '.pack' files exist.

For example, after packing all objects, if we use 'write-tree' to create the same commit with the same tree and the same environment (such as GIT_COMMITTER_DATE and GIT_AUTHOR_DATE), we can see that the '.pack' file's mtime changes, but the '.idx' file's does not. So if we update the mtime of the packfile by updating the '.idx' file instead of the '.pack' file, then when we check the mtime of the packfile we can get it from the '.idx' file. A large git repository may contain large '.pack' files, but its '.idx' files are small enough that this avoids the file system cache reloading the large files again and speeds up git commands.

Signed-off-by: Sun Chao <16657101987@163.com>
---
packfile: enhance the mtime of packfile by idx file

Commit 33d4221c79 (write_sha1_file: freshen existing objects, 2014-10-15) avoids writing existing objects by freshening their mtime (especially that of the packfiles that contain them) in order to aid correct caching, so that processes like find_lru_pack can make good decisions. However, this is unfriendly to incremental backup jobs, and to services that rely on the file system cache, when large '.pack' files exist.
For example, after packing all objects, if we use 'write-tree' to create the same commit with the same tree and the same environment (such as GIT_COMMITTER_DATE and GIT_AUTHOR_DATE), we can see that the '.pack' file's mtime changes, but the '.idx' file's does not. So if we update the mtime of the packfile by updating the '.idx' file instead of the '.pack' file, then when we check the mtime of the packfile we can get it from the '.idx' file. A large git repository may contain large '.pack' files, but its '.idx' files are small enough that this avoids the file system cache reloading the large files again and speeds up git commands.

Signed-off-by: Sun Chao 16657101987@163.com
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1043%2Fsunchao9%2Fmaster-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1043/sunchao9/master-v1
Pull-Request: https://github.com/git/git/pull/1043

 builtin/index-pack.c | 19 +++----------------
 object-file.c | 7 ++++++-
 packfile.c | 19 ++++++++++++++++++-
 packfile.h | 7 +++++++
 t/t7701-repack-unpack-unreachable.sh | 2 +-
 5 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 3fbc5d70777..60bacc8ee7f 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1437,19 +1437,6 @@ static void fix_unresolved_deltas(struct hashfile *f) free(sorted_by_pos); } -static const char *derive_filename(const char *pack_name, const char *strip, - const char *suffix, struct strbuf *buf) -{ - size_t len; - if (!strip_suffix(pack_name, strip, &len) || !len || - pack_name[len - 1] != '.') - die(_("packfile name '%s' does not end with '.%s'"), - pack_name, strip); - strbuf_add(buf, pack_name, len); - strbuf_addstr(buf, suffix); - return buf->buf; -} - static void write_special_file(const char *suffix, const char *msg, const char *pack_name, const unsigned char *hash, const char **report) @@ -1460,7 +1447,7 @@ static void write_special_file(const char *suffix, const char *msg, int msg_len = strlen(msg); if (pack_name) - 
filename = derive_filename(pack_name, "pack", suffix, &name_buf); + filename = derive_pack_filename(pack_name, "pack", suffix, &name_buf); else filename = odb_pack_name(&name_buf, hash, suffix); @@ -1855,13 +1842,13 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix) if (from_stdin && hash_algo) die(_("--object-format cannot be used with --stdin")); if (!index_name && pack_name) - index_name = derive_filename(pack_name, "pack", "idx", &index_name_buf); + index_name = derive_pack_filename(pack_name, "pack", "idx", &index_name_buf); opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY); if (rev_index) { opts.flags |= verify ? WRITE_REV_VERIFY : WRITE_REV; if (index_name) - rev_index_name = derive_filename(index_name, + rev_index_name = derive_pack_filename(index_name, "idx", "rev", &rev_index_name_buf); } diff --git a/object-file.c b/object-file.c index f233b440b22..068ef99d461 100644 --- a/object-file.c +++ b/object-file.c @@ -1974,11 +1974,16 @@ static int freshen_loose_object(const struct object_id *oid) static int freshen_packed_object(const struct object_id *oid) { struct pack_entry e; + struct strbuf name_buf = STRBUF_INIT; + const char *filename; + if (!find_pack_entry(the_repository, oid, &e)) return 0; if (e.p->freshened) return 1; - if (!freshen_file(e.p->pack_name)) + + filename = derive_pack_filename(e.p->pack_name, "pack", "idx", &name_buf); + if (!freshen_file(filename)) return 0; e.p->freshened = 1; return 1; diff --git a/packfile.c b/packfile.c index 755aa7aec5e..46f8fb22462 100644 --- a/packfile.c +++ b/packfile.c @@ -40,6 +40,19 @@ char *sha1_pack_index_name(const unsigned char *sha1) return odb_pack_name(&buf, sha1, "idx"); } +const char *derive_pack_filename(const char *pack_name, const char *strip, + const char *suffix, struct strbuf *buf) +{ + size_t len; + if (!strip_suffix(pack_name, strip, &len) || !len || + pack_name[len - 1] != '.') + die(_("packfile name '%s' does not end with '.%s'"), + pack_name, strip); + strbuf_add(buf, 
pack_name, len); + strbuf_addstr(buf, suffix); + return buf->buf; +} + static unsigned int pack_used_ctr; static unsigned int pack_mmap_calls; static unsigned int peak_pack_open_windows; @@ -693,6 +706,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local) size_t alloc; struct packed_git *p; + if (stat(path, &st) || !S_ISREG(st.st_mode)) { + return NULL; + } + /* * Make sure a corresponding .pack file exists and that * the index looks sane. @@ -707,6 +724,7 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local) alloc = st_add3(path_len, strlen(".promisor"), 1); p = alloc_packed_git(alloc); memcpy(p->pack_name, path, path_len); + p->mtime = st.st_mtime; xsnprintf(p->pack_name + path_len, alloc - path_len, ".keep"); if (!access(p->pack_name, F_OK)) @@ -727,7 +745,6 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local) */ p->pack_size = st.st_size; p->pack_local = local; - p->mtime = st.st_mtime; if (path_len < the_hash_algo->hexsz || get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash)) hashclr(p->hash); diff --git a/packfile.h b/packfile.h index 3ae117a8aef..ff702b22e6a 100644 --- a/packfile.h +++ b/packfile.h @@ -31,6 +31,13 @@ char *sha1_pack_name(const unsigned char *sha1); */ char *sha1_pack_index_name(const unsigned char *sha1); +/* + * Return the corresponding filename with given suffix from "file_name" + * which must has "strip" suffix. + */ +const char *derive_pack_filename(const char *file_name, const char *strip, + const char *suffix, struct strbuf *buf); + /* * Return the basename of the packfile, omitting any containing directory * (e.g., "pack-1234abcd[...].pack"). 
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh index 937f89ee8c8..51c1afcbe2c 100755 --- a/t/t7701-repack-unpack-unreachable.sh +++ b/t/t7701-repack-unpack-unreachable.sh @@ -106,7 +106,7 @@ test_expect_success 'do not bother loosening old objects' ' git prune-packed && git cat-file -p $obj1 && git cat-file -p $obj2 && - test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack && + test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.idx && git repack -A -d --unpack-unreachable=1.hour.ago && git cat-file -p $obj1 && test_must_fail git cat-file -p $obj2 base-commit: d486ca60a51c9cb1fe068803c3f540724e95e83a -- gitgitgadget ^ permalink raw reply related [flat|nested] 34+ messages in thread
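The derive_filename() helper that this patch promotes to a public derive_pack_filename() can be sketched outside of Git's strbuf API roughly as follows. This is an illustration only: the function name `derive_suffix` and the malloc'd return value are inventions of this sketch; Git's version takes a strbuf and die()s on a bad suffix instead of returning NULL.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sketch of the derive_pack_filename() idea: given "pack-1234.pack",
 * strip the "pack" suffix (verifying the preceding dot) and append a
 * new suffix, e.g. "idx" -> "pack-1234.idx".  Returns a malloc'd
 * string, or NULL if the name does not end with ".<strip>".
 */
char *derive_suffix(const char *name, const char *strip, const char *suffix)
{
	size_t nlen = strlen(name), slen = strlen(strip);
	char *out;

	/* name must end with ".<strip>", with a non-empty base before it */
	if (nlen < slen + 2 || strcmp(name + nlen - slen, strip) ||
	    name[nlen - slen - 1] != '.')
		return NULL;

	out = malloc(nlen - slen + strlen(suffix) + 1);
	if (!out)
		return NULL;
	memcpy(out, name, nlen - slen);    /* copy up to and including '.' */
	strcpy(out + nlen - slen, suffix); /* then append the new suffix */
	return out;
}
```

With this shape the same helper serves all three call sites in the patch: pack → idx, pack → keep/promisor, and idx → rev.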
* Re: [PATCH] packfile: enhance the mtime of packfile by idx file 2021-07-10 19:01 [PATCH] packfile: enhance the mtime of packfile by idx file Sun Chao via GitGitGadget @ 2021-07-11 23:44 ` Ævar Arnfjörð Bjarmason 2021-07-12 16:17 ` Sun Chao 2021-07-14 1:28 ` [PATCH v2] packfile: freshen the mtime of packfile by configuration Sun Chao via GitGitGadget 1 sibling, 1 reply; 34+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-07-11 23:44 UTC (permalink / raw) To: Sun Chao via GitGitGadget; +Cc: git, Sun Chao On Sat, Jul 10 2021, Sun Chao via GitGitGadget wrote: > From: Sun Chao <16657101987@163.com> > > Commit 33d4221c79 (write_sha1_file: freshen existing objects, > 2014-10-15) avoid writing existing objects by freshen their > mtime (especially the packfiles contains them) in order to > aid the correct caching, and some process like find_lru_pack > can make good decision. However, this is unfriendly to > incremental backup jobs or services rely on file system > cache when there are large '.pack' files exists. > > For example, after packed all objects, use 'write-tree' to > create same commit with the same tree and same environments > such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can > notice the '.pack' file's mtime changed, but '.idx' file not. > > So if we update the mtime of packfile by updating the '.idx' > file instead of '.pack' file, when we check the mtime > of packfile, get it from '.idx' file instead. Large git > repository may contains large '.pack' files, but '.idx' > files are smaller enough, this can avoid file system cache > reload the large files again and speed up git commands. 
> > Signed-off-by: Sun Chao <16657101987@163.com>

Does this have the unstated trade-off that in a mixed-version environment (say two git versions coordinating writes to an NFS share), where one is old and thinks *.pack needs updating, and the other is new and thinks *.idx is what should be checked, until both are upgraded we're effectively back to pre-33d4221c79?

I don't think it's a dealbreaker, just wondering if I've got that right & if it's a trade-off you thought about; maybe we should check the mtime of both. The stat() is cheap, it's the re-sync that matters for you.

But just to run with that thought, wouldn't it be even more helpful to you to have, say, a config setting to create a *.bump file next to the *.{idx,pack}?

Then you'd have an empty file (the *.idx is smaller, but still not empty), and as a patch it seems relatively simple, i.e. some core.* or gc.* or pack.* setting changing what we touch/stat().

^ permalink raw reply [flat|nested] 34+ messages in thread
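The *.bump suggestion boils down to "freshen the side file's mtime if it exists, otherwise create it empty". A minimal POSIX sketch of that dance (the helper name is hypothetical; the v2 patch later implements essentially this inside freshen_packed_object(), with the create-then-recheck step guarding against a racing process):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <utime.h>

/*
 * Freshen path's mtime if it exists, otherwise create it empty.
 * O_CREAT|O_EXCL makes a racing creator visible: if open() fails we
 * re-stat to see whether another process created the file meanwhile.
 * Returns 0 on success, -1 on error.
 */
int freshen_or_create(const char *path)
{
	struct stat st;
	int fd;

	if (lstat(path, &st) == 0)
		return utime(path, NULL);  /* exists: set mtime to now */

	fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0664);
	if (fd < 0)
		return lstat(path, &st);   /* lost the race, or real error */
	close(fd);
	return 0;
}
```

The point of the recheck is that two writers freshening the same pack concurrently should both succeed, whichever of them ends up creating the file.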
* Re: [PATCH] packfile: enhance the mtime of packfile by idx file 2021-07-11 23:44 ` Ævar Arnfjörð Bjarmason @ 2021-07-12 16:17 ` Sun Chao 0 siblings, 0 replies; 34+ messages in thread From: Sun Chao @ 2021-07-12 16:17 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Sun Chao via GitGitGadget, git > 2021年7月12日 07:44,Ævar Arnfjörð Bjarmason <avarab@gmail.com> 写道: > > > On Sat, Jul 10 2021, Sun Chao via GitGitGadget wrote: > >> From: Sun Chao <16657101987@163.com> >> >> Commit 33d4221c79 (write_sha1_file: freshen existing objects, >> 2014-10-15) avoid writing existing objects by freshen their >> mtime (especially the packfiles contains them) in order to >> aid the correct caching, and some process like find_lru_pack >> can make good decision. However, this is unfriendly to >> incremental backup jobs or services rely on file system >> cache when there are large '.pack' files exists. >> >> For example, after packed all objects, use 'write-tree' to >> create same commit with the same tree and same environments >> such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can >> notice the '.pack' file's mtime changed, but '.idx' file not. >> >> So if we update the mtime of packfile by updating the '.idx' >> file instead of '.pack' file, when we check the mtime >> of packfile, get it from '.idx' file instead. Large git >> repository may contains large '.pack' files, but '.idx' >> files are smaller enough, this can avoid file system cache >> reload the large files again and speed up git commands. >> >> Signed-off-by: Sun Chao <16657101987@163.com> > > Does this have the unstated trade-off that in a mixed-version > environment (say two git versions coordinating writes to an NFS share) > where one is old and thinks *.pack needs updating, and the other is new > and thinks *.idx is what should be checked, that until both are upgraded > we're effectively back to pre-33d4221c79. > Thanks for your reply, I can not agree with you more. 
> I don't think it's a dealbreaker, just wondering if I've got that right > & if it is's a trade-off you thought about, maybe we should check the > mtime of both. The stat() is cheap, it's the re-sync that matters for > you. > > But just to run with that thought, wouldn't it be even more helpful to > you to have say a config setting to create a *.bump file next to the > *.{idx,pack}. > > Then you'd have an empty file (the *.idx is smaller, but still not > empty), and as a patch it seems relatively simple, i.e. some core.* or > gc.* or pack.* setting changing what we touch/stat(). Yes, thanks. This is a good idea, let me try this way. ^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-10 19:01 [PATCH] packfile: enhance the mtime of packfile by idx file Sun Chao via GitGitGadget
  2021-07-11 23:44 ` Ævar Arnfjörð Bjarmason
@ 2021-07-14  1:28 ` Sun Chao via GitGitGadget
  2021-07-14  1:39 ` Ævar Arnfjörð Bjarmason
  2021-07-19 19:53 ` [PATCH v3] " Sun Chao via GitGitGadget
  1 sibling, 2 replies; 34+ messages in thread

From: Sun Chao via GitGitGadget @ 2021-07-14 1:28 UTC (permalink / raw)
To: git; +Cc: Sun Chao, Sun Chao

From: Sun Chao <16657101987@163.com>

Commit 33d4221c79 (write_sha1_file: freshen existing objects, 2014-10-15) avoids writing existing objects by freshening their mtime (especially that of the packfiles that contain them) in order to aid correct caching, so that processes like find_lru_pack can make good decisions. However, this is unfriendly to incremental backup jobs, and to services that rely on the file system cache, when large '.pack' files exist.

For example, after packing all objects, if we use 'write-tree' to create the same commit with the same tree and the same environment (such as GIT_COMMITTER_DATE and GIT_AUTHOR_DATE), we can see that the '.pack' file's mtime changes, while the '.idx' file's does not.

If we freshen the mtime of the packfile by updating another file instead of the '.pack' file, e.g. an empty '.bump' file, then when we need to check the mtime of the packfile we can get it from that other file instead. A large git repository may contain large '.pack' files, but we can use a smaller, even empty, file for the mtime get/set operations; this avoids the file system cache re-syncing the large '.pack' files and thus speeds up most git commands.

Signed-off-by: Sun Chao <16657101987@163.com>
---
packfile: freshen the mtime of packfile by configuration

Commit 33d4221 (write_sha1_file: freshen existing objects, 2014-10-15) avoids writing existing objects by freshening their mtime (especially that of the packfiles that contain them) in order to aid correct caching, so that processes like find_lru_pack can make good decisions.
However, this is unfriendly to incremental backup jobs, and to services that rely on the file system cache, when large '.pack' files exist. For example, after packing all objects, if we use 'write-tree' to create the same commit with the same tree and the same environment (such as GIT_COMMITTER_DATE and GIT_AUTHOR_DATE), we can see that the '.pack' file's mtime changes, while the '.idx' file's does not. If we freshen the mtime of the packfile by updating another file instead of the '.pack' file, e.g. an empty '.bump' file, then when we need to check the mtime of the packfile we can get it from that other file instead. A large git repository may contain large '.pack' files, but we can use a smaller, even empty, file for the mtime get/set operations; this avoids the file system cache re-syncing the large '.pack' files and thus speeds up most git commands.

Signed-off-by: Sun Chao 16657101987@163.com
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1043%2Fsunchao9%2Fmaster-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1043/sunchao9/master-v2
Pull-Request: https://github.com/git/git/pull/1043

Range-diff vs v1:

 1: f6adbb4e90f ! 1: 943e31e8587 packfile: enhance the mtime of packfile by idx file @@ Metadata Author: Sun Chao <16657101987@163.com> ## Commit message ## - packfile: enhance the mtime of packfile by idx file + packfile: freshen the mtime of packfile by configuration Commit 33d4221c79 (write_sha1_file: freshen existing objects, 2014-10-15) avoid writing existing objects by freshen their @@ Commit message For example, after packed all objects, use 'write-tree' to create same commit with the same tree and same environments such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can - notice the '.pack' file's mtime changed, but '.idx' file not. + notice the '.pack' file's mtime changed, and '.idx' file not. - So if we update the mtime of packfile by updating the '.idx' - file instead of '.pack' file, when we check the mtime - of packfile, get it from '.idx' file instead. 
Large git - repository may contains large '.pack' files, but '.idx' - files are smaller enough, this can avoid file system cache - reload the large files again and speed up git commands. + If we freshen the mtime of packfile by updating another + file instead of '.pack' file e.g. a empty '.bump' file, + when we need to check the mtime of packfile, get it from + another file instead. Large git repository may contains + large '.pack' files, and we can use smaller files even empty + file to do the mtime get/set operation, this can avoid + file system cache re-sync large '.pack' files again and + then speed up most git commands. Signed-off-by: Sun Chao <16657101987@163.com> + ## Documentation/config/core.txt ## +@@ Documentation/config/core.txt: the largest projects. You probably do not need to adjust this value. + + + Common unit suffixes of 'k', 'm', or 'g' are supported. + ++core.packMtimeSuffix:: ++ Normally we avoid writing existing object by freshening the mtime ++ of the *.pack file which contains it in order to aid some processes ++ such like prune. Use different file instead of *.pack file will ++ avoid file system cache re-sync the large packfiles, and consequently ++ make git commands faster. +++ ++The default is 'pack' which means the *.pack file will be freshened by ++default. You can configure a different suffix to use, the file with the ++suffix will be created automatically, it's better not using any known ++suffix such like 'idx', 'keep', 'promisor'. ++ + core.deltaBaseCacheLimit:: + Maximum number of bytes per thread to reserve for caching base objects + that may be referenced by multiple deltified objects. 
By storing the + ## builtin/index-pack.c ## @@ builtin/index-pack.c: static void fix_unresolved_deltas(struct hashfile *f) free(sorted_by_pos); @@ builtin/index-pack.c: int cmd_index_pack(int argc, const char **argv, const char &rev_index_name_buf); } + ## cache.h ## +@@ cache.h: extern size_t packed_git_limit; + extern size_t delta_base_cache_limit; + extern unsigned long big_file_threshold; + extern unsigned long pack_size_limit_cfg; ++extern const char *pack_mtime_suffix; + + /* + * Accessors for the core.sharedrepository config which lazy-load the value + + ## config.c ## +@@ config.c: static int git_default_core_config(const char *var, const char *value, void *cb) + return 0; + } + ++ if (!strcmp(var, "core.packmtimesuffix")) { ++ return git_config_string(&pack_mtime_suffix, var, value); ++ } ++ + if (!strcmp(var, "core.deltabasecachelimit")) { + delta_base_cache_limit = git_config_ulong(var, value); + return 0; + + ## environment.c ## +@@ environment.c: const char *git_hooks_path; + int zlib_compression_level = Z_BEST_SPEED; + int core_compression_level; + int pack_compression_level = Z_DEFAULT_COMPRESSION; ++const char *pack_mtime_suffix = "pack"; + int fsync_object_files; + size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE; + size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT; + ## object-file.c ## @@ object-file.c: static int freshen_loose_object(const struct object_id *oid) static int freshen_packed_object(const struct object_id *oid) { struct pack_entry e; ++ struct stat st; + struct strbuf name_buf = STRBUF_INIT; + const char *filename; + @@ object-file.c: static int freshen_loose_object(const struct object_id *oid) if (e.p->freshened) return 1; - if (!freshen_file(e.p->pack_name)) +- return 0; ++ ++ filename = e.p->pack_name; ++ if (!strcasecmp(pack_mtime_suffix, "pack")) { ++ if (!freshen_file(filename)) ++ return 0; ++ e.p->freshened = 1; ++ return 1; ++ } ++ ++ /* If we want to freshen different file instead of .pack file, we need ++ * 
to make sure the file exists and create it if needed. ++ */ ++ filename = derive_pack_filename(filename, "pack", pack_mtime_suffix, &name_buf); ++ if (lstat(filename, &st) < 0) { ++ int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664); ++ if (fd < 0) { ++ // here we need to check it again because other git process may created it ++ if (lstat(filename, &st) < 0) ++ die_errno("unable to create '%s'", filename); ++ } else { ++ close(fd); ++ } ++ } else { ++ if (!freshen_file(filename)) ++ return 0; ++ } + -+ filename = derive_pack_filename(e.p->pack_name, "pack", "idx", &name_buf); -+ if (!freshen_file(filename)) - return 0; e.p->freshened = 1; return 1; + } ## packfile.c ## @@ packfile.c: char *sha1_pack_index_name(const unsigned char *sha1) @@ packfile.c: char *sha1_pack_index_name(const unsigned char *sha1) static unsigned int pack_used_ctr; static unsigned int pack_mmap_calls; static unsigned int peak_pack_open_windows; -@@ packfile.c: struct packed_git *add_packed_git(const char *path, size_t path_len, int local) - size_t alloc; - struct packed_git *p; - -+ if (stat(path, &st) || !S_ISREG(st.st_mode)) { -+ return NULL; -+ } -+ - /* - * Make sure a corresponding .pack file exists and that - * the index looks sane. -@@ packfile.c: struct packed_git *add_packed_git(const char *path, size_t path_len, int local) - alloc = st_add3(path_len, strlen(".promisor"), 1); - p = alloc_packed_git(alloc); - memcpy(p->pack_name, path, path_len); -+ p->mtime = st.st_mtime; - - xsnprintf(p->pack_name + path_len, alloc - path_len, ".keep"); - if (!access(p->pack_name, F_OK)) @@ packfile.c: struct packed_git *add_packed_git(const char *path, size_t path_len, int local) */ p->pack_size = st.st_size; p->pack_local = local; -- p->mtime = st.st_mtime; ++ ++ /* If we have different file used to freshen the mtime, we should ++ * use it at a higher priority. 
++ */ ++ if (!!strcasecmp(pack_mtime_suffix, "pack")) { ++ struct strbuf name_buf = STRBUF_INIT; ++ const char *filename; ++ ++ filename = derive_pack_filename(path, "idx", pack_mtime_suffix, &name_buf); ++ stat(filename, &st); ++ } + p->mtime = st.st_mtime; if (path_len < the_hash_algo->hexsz || get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash)) - hashclr(p->hash); ## packfile.h ## @@ packfile.h: char *sha1_pack_name(const unsigned char *sha1); @@ packfile.h: char *sha1_pack_name(const unsigned char *sha1); ## t/t7701-repack-unpack-unreachable.sh ## @@ t/t7701-repack-unpack-unreachable.sh: test_expect_success 'do not bother loosening old objects' ' - git prune-packed && - git cat-file -p $obj1 && - git cat-file -p $obj2 && -- test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack && -+ test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.idx && - git repack -A -d --unpack-unreachable=1.hour.ago && - git cat-file -p $obj1 && test_must_fail git cat-file -p $obj2 + ' + ++test_expect_success 'do not bother loosening old objects with core.packmtimesuffix config' ' ++ obj1=$(echo three | git hash-object -w --stdin) && ++ obj2=$(echo four | git hash-object -w --stdin) && ++ pack1=$(echo $obj1 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) && ++ pack2=$(echo $obj2 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) && ++ git -c core.packmtimesuffix=bump prune-packed && ++ git cat-file -p $obj1 && ++ git cat-file -p $obj2 && ++ touch .git/objects/pack/pack-$pack2.bump && ++ test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.bump && ++ git -c core.packmtimesuffix=bump repack -A -d --unpack-unreachable=1.hour.ago && ++ git cat-file -p $obj1 && ++ test_must_fail git cat-file -p $obj2 ++' ++ + test_expect_success 'keep packed objects found only in index' ' + echo my-unique-content >file && + git add file && Documentation/config/core.txt | 12 ++++++++++ builtin/index-pack.c | 19 +++------------- cache.h | 1 
+ config.c | 4 ++++ environment.c | 1 + object-file.c | 33 ++++++++++++++++++++++++++-- packfile.c | 24 ++++++++++++++++++++ packfile.h | 7 ++++++ t/t7701-repack-unpack-unreachable.sh | 15 +++++++++++++ 9 files changed, 98 insertions(+), 18 deletions(-) diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt index c04f62a54a1..9a256992d8c 100644 --- a/Documentation/config/core.txt +++ b/Documentation/config/core.txt @@ -398,6 +398,18 @@ the largest projects. You probably do not need to adjust this value. + Common unit suffixes of 'k', 'm', or 'g' are supported. +core.packMtimeSuffix:: + Normally we avoid writing existing object by freshening the mtime + of the *.pack file which contains it in order to aid some processes + such like prune. Use different file instead of *.pack file will + avoid file system cache re-sync the large packfiles, and consequently + make git commands faster. ++ +The default is 'pack' which means the *.pack file will be freshened by +default. You can configure a different suffix to use, the file with the +suffix will be created automatically, it's better not using any known +suffix such like 'idx', 'keep', 'promisor'. + core.deltaBaseCacheLimit:: Maximum number of bytes per thread to reserve for caching base objects that may be referenced by multiple deltified objects. 
By storing the diff --git a/builtin/index-pack.c b/builtin/index-pack.c index 3fbc5d70777..60bacc8ee7f 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -1437,19 +1437,6 @@ static void fix_unresolved_deltas(struct hashfile *f) free(sorted_by_pos); } -static const char *derive_filename(const char *pack_name, const char *strip, - const char *suffix, struct strbuf *buf) -{ - size_t len; - if (!strip_suffix(pack_name, strip, &len) || !len || - pack_name[len - 1] != '.') - die(_("packfile name '%s' does not end with '.%s'"), - pack_name, strip); - strbuf_add(buf, pack_name, len); - strbuf_addstr(buf, suffix); - return buf->buf; -} - static void write_special_file(const char *suffix, const char *msg, const char *pack_name, const unsigned char *hash, const char **report) @@ -1460,7 +1447,7 @@ static void write_special_file(const char *suffix, const char *msg, int msg_len = strlen(msg); if (pack_name) - filename = derive_filename(pack_name, "pack", suffix, &name_buf); + filename = derive_pack_filename(pack_name, "pack", suffix, &name_buf); else filename = odb_pack_name(&name_buf, hash, suffix); @@ -1855,13 +1842,13 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix) if (from_stdin && hash_algo) die(_("--object-format cannot be used with --stdin")); if (!index_name && pack_name) - index_name = derive_filename(pack_name, "pack", "idx", &index_name_buf); + index_name = derive_pack_filename(pack_name, "pack", "idx", &index_name_buf); opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY); if (rev_index) { opts.flags |= verify ? 
WRITE_REV_VERIFY : WRITE_REV; if (index_name) - rev_index_name = derive_filename(index_name, + rev_index_name = derive_pack_filename(index_name, "idx", "rev", &rev_index_name_buf); } diff --git a/cache.h b/cache.h index ba04ff8bd36..d95b3d127e0 100644 --- a/cache.h +++ b/cache.h @@ -956,6 +956,7 @@ extern size_t packed_git_limit; extern size_t delta_base_cache_limit; extern unsigned long big_file_threshold; extern unsigned long pack_size_limit_cfg; +extern const char *pack_mtime_suffix; /* * Accessors for the core.sharedrepository config which lazy-load the value diff --git a/config.c b/config.c index f9c400ad306..27a6b619ed8 100644 --- a/config.c +++ b/config.c @@ -1431,6 +1431,10 @@ static int git_default_core_config(const char *var, const char *value, void *cb) return 0; } + if (!strcmp(var, "core.packmtimesuffix")) { + return git_config_string(&pack_mtime_suffix, var, value); + } + if (!strcmp(var, "core.deltabasecachelimit")) { delta_base_cache_limit = git_config_ulong(var, value); return 0; diff --git a/environment.c b/environment.c index 2f27008424a..07c21e1a934 100644 --- a/environment.c +++ b/environment.c @@ -43,6 +43,7 @@ const char *git_hooks_path; int zlib_compression_level = Z_BEST_SPEED; int core_compression_level; int pack_compression_level = Z_DEFAULT_COMPRESSION; +const char *pack_mtime_suffix = "pack"; int fsync_object_files; size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE; size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT; diff --git a/object-file.c b/object-file.c index f233b440b22..b3e77213c42 100644 --- a/object-file.c +++ b/object-file.c @@ -1974,12 +1974,41 @@ static int freshen_loose_object(const struct object_id *oid) static int freshen_packed_object(const struct object_id *oid) { struct pack_entry e; + struct stat st; + struct strbuf name_buf = STRBUF_INIT; + const char *filename; + if (!find_pack_entry(the_repository, oid, &e)) return 0; if (e.p->freshened) return 1; - if (!freshen_file(e.p->pack_name)) - return 0; + + 
 	filename = e.p->pack_name;
+	if (!strcasecmp(pack_mtime_suffix, "pack")) {
+		if (!freshen_file(filename))
+			return 0;
+		e.p->freshened = 1;
+		return 1;
+	}
+
+	/* If we want to freshen different file instead of .pack file, we need
+	 * to make sure the file exists and create it if needed.
+	 */
+	filename = derive_pack_filename(filename, "pack", pack_mtime_suffix, &name_buf);
+	if (lstat(filename, &st) < 0) {
+		int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664);
+		if (fd < 0) {
+			// here we need to check it again because other git process may created it
+			if (lstat(filename, &st) < 0)
+				die_errno("unable to create '%s'", filename);
+		} else {
+			close(fd);
+		}
+	} else {
+		if (!freshen_file(filename))
+			return 0;
+	}
+
 	e.p->freshened = 1;
 	return 1;
 }
diff --git a/packfile.c b/packfile.c
index 755aa7aec5e..a607dda4e25 100644
--- a/packfile.c
+++ b/packfile.c
@@ -40,6 +40,19 @@ char *sha1_pack_index_name(const unsigned char *sha1)
 	return odb_pack_name(&buf, sha1, "idx");
 }
 
+const char *derive_pack_filename(const char *pack_name, const char *strip,
+				 const char *suffix, struct strbuf *buf)
+{
+	size_t len;
+	if (!strip_suffix(pack_name, strip, &len) || !len ||
+	    pack_name[len - 1] != '.')
+		die(_("packfile name '%s' does not end with '.%s'"),
+		    pack_name, strip);
+	strbuf_add(buf, pack_name, len);
+	strbuf_addstr(buf, suffix);
+	return buf->buf;
+}
+
 static unsigned int pack_used_ctr;
 static unsigned int pack_mmap_calls;
 static unsigned int peak_pack_open_windows;
@@ -727,6 +740,17 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	 */
 	p->pack_size = st.st_size;
 	p->pack_local = local;
+
+	/* If we have different file used to freshen the mtime, we should
+	 * use it at a higher priority.
+	 */
+	if (!!strcasecmp(pack_mtime_suffix, "pack")) {
+		struct strbuf name_buf = STRBUF_INIT;
+		const char *filename;
+
+		filename = derive_pack_filename(path, "idx", pack_mtime_suffix, &name_buf);
+		stat(filename, &st);
+	}
 	p->mtime = st.st_mtime;
 	if (path_len < the_hash_algo->hexsz ||
 	    get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash))
diff --git a/packfile.h b/packfile.h
index 3ae117a8aef..ff702b22e6a 100644
--- a/packfile.h
+++ b/packfile.h
@@ -31,6 +31,13 @@ char *sha1_pack_name(const unsigned char *sha1);
  */
 char *sha1_pack_index_name(const unsigned char *sha1);
 
+/*
+ * Return the corresponding filename with given suffix from "file_name"
+ * which must has "strip" suffix.
+ */
+const char *derive_pack_filename(const char *file_name, const char *strip,
+				 const char *suffix, struct strbuf *buf);
+
 /*
  * Return the basename of the packfile, omitting any containing directory
  * (e.g., "pack-1234abcd[...].pack").
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index 937f89ee8c8..15828a318c4 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -112,6 +112,21 @@ test_expect_success 'do not bother loosening old objects' '
 	test_must_fail git cat-file -p $obj2
 '
 
+test_expect_success 'do not bother loosening old objects with core.packmtimesuffix config' '
+	obj1=$(echo three | git hash-object -w --stdin) &&
+	obj2=$(echo four | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) &&
+	git -c core.packmtimesuffix=bump prune-packed &&
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2 &&
+	touch .git/objects/pack/pack-$pack2.bump &&
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.bump &&
+	git -c core.packmtimesuffix=bump repack -A -d --unpack-unreachable=1.hour.ago &&
+	git cat-file -p $obj1 &&
+	test_must_fail git cat-file -p $obj2
+'
+
 test_expect_success 'keep packed objects found only in index' '
 	echo my-unique-content >file &&
 	git add file &&

base-commit: d486ca60a51c9cb1fe068803c3f540724e95e83a
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 34+ messages in thread
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14  1:28 ` [PATCH v2] packfile: freshen the mtime of packfile by configuration Sun Chao via GitGitGadget
@ 2021-07-14  1:39 ` Ævar Arnfjörð Bjarmason
  2021-07-14  2:52 ` Taylor Blau
  2021-07-14 16:11 ` Sun Chao
  2021-07-19 19:53 ` [PATCH v3] " Sun Chao via GitGitGadget
  1 sibling, 2 replies; 34+ messages in thread

From: Ævar Arnfjörð Bjarmason @ 2021-07-14  1:39 UTC (permalink / raw)
To: Sun Chao via GitGitGadget; +Cc: git, Sun Chao

On Wed, Jul 14 2021, Sun Chao via GitGitGadget wrote:

> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> index c04f62a54a1..9a256992d8c 100644
> --- a/Documentation/config/core.txt
> +++ b/Documentation/config/core.txt
> @@ -398,6 +398,18 @@ the largest projects. You probably do not need to adjust this value.
>  +
>  Common unit suffixes of 'k', 'm', or 'g' are supported.
>  
> +core.packMtimeSuffix::
> +	Normally we avoid writing existing object by freshening the mtime
> +	of the *.pack file which contains it in order to aid some processes
> +	such like prune. Use different file instead of *.pack file will
> +	avoid file system cache re-sync the large packfiles, and consequently
> +	make git commands faster.
> ++
> +The default is 'pack' which means the *.pack file will be freshened by
> +default. You can configure a different suffix to use, the file with the
> +suffix will be created automatically, it's better not using any known
> +suffix such like 'idx', 'keep', 'promisor'.
> +

Hrm, per my v1 feedback (and I'm not sure if my suggestion is even good
here, there's others more familiar with this area than I am), I was
thinking of something like a *.bump file written via:

    core.packUseBumpFiles=bool

Or something like that. Anyway, the edge case in allowing the user to
pick arbitrary suffixes is that we'd get in-the-wild arbitrary user
configuration squatting on a relatively sensitive part of the object
store.

E.g. we recently added *.rev files to go with
*.{pack,idx,bitmap,keep,promisor} (and I'm probably forgetting some
suffix). What if before that a user had set:

    core.packMtimeSuffix=rev

In practice it's probably too obscure to worry about, but I think it's
still worth it to only strictly write things we decide to write into
the object store.

> core.deltaBaseCacheLimit::
> 	Maximum number of bytes per thread to reserve for caching base objects
> 	that may be referenced by multiple deltified objects. By storing the

> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index 3fbc5d70777..60bacc8ee7f 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -1437,19 +1437,6 @@ static void fix_unresolved_deltas(struct hashfile *f)
>  	free(sorted_by_pos);
>  }
>  
> -static const char *derive_filename(const char *pack_name, const char *strip,
> -				   const char *suffix, struct strbuf *buf)
> -{
> -	size_t len;
> -	if (!strip_suffix(pack_name, strip, &len) || !len ||
> -	    pack_name[len - 1] != '.')
> -		die(_("packfile name '%s' does not end with '.%s'"),
> -		    pack_name, strip);
> -	strbuf_add(buf, pack_name, len);
> -	strbuf_addstr(buf, suffix);
> -	return buf->buf;
> -}

Would be more readable to split this series up into at least two
patches, starting with just moving this function as-is to the new
location (just renaming it & moving it), and then using it. I don't
think there's changes to it, but right now I'm just eyeballing the
diff. It's more obvious if it's split up.

> +	if (!strcmp(var, "core.packmtimesuffix")) {
> +		return git_config_string(&pack_mtime_suffix, var, value);

Can drop the {} braces here, per Documentation/CodingGuidelines

> +const char *pack_mtime_suffix = "pack";

I can see how having a configurable suffix made the implementation
easier, perhaps that's how it started?
>  int fsync_object_files;
>  size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
>  size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
> diff --git a/object-file.c b/object-file.c
> index f233b440b22..b3e77213c42 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1974,12 +1974,41 @@ static int freshen_loose_object(const struct object_id *oid)
>  static int freshen_packed_object(const struct object_id *oid)
>  {
>  	struct pack_entry e;
> +	struct stat st;
> +	struct strbuf name_buf = STRBUF_INIT;
> +	const char *filename;
> +
>  	if (!find_pack_entry(the_repository, oid, &e))
>  		return 0;
>  	if (e.p->freshened)
>  		return 1;
> -	if (!freshen_file(e.p->pack_name))
> -		return 0;
> +
> +	filename = e.p->pack_name;
> +	if (!strcasecmp(pack_mtime_suffix, "pack")) {
> +		if (!freshen_file(filename))
> +			return 0;
> +		e.p->freshened = 1;
> +		return 1;
> +	}
> +
> +	/* If we want to freshen different file instead of .pack file, we need
> +	 * to make sure the file exists and create it if needed.
> +	 */
> +	filename = derive_pack_filename(filename, "pack", pack_mtime_suffix, &name_buf);

You populate name_buf here, but don't strbuf_release(&name_buf) it at
the end of this function.

> +	if (lstat(filename, &st) < 0) {
> +		int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664);
> +		if (fd < 0) {
> +			// here we need to check it again because other git process may created it

/* */ comments, not //, if it's needed at all. Covered in
CodingGuidelines.

> +			if (lstat(filename, &st) < 0)
> +				die_errno("unable to create '%s'", filename);

If we can't create this specific file here shouldn't we just continue
silently at this point? Surely if this process is screwed we're just
about to die on something more important?

And lstat() can also return transitory errors that don't indicate
"unable to create", e.g. maybe we ran out of the memory the kernel is
willing to give us or something (just skimming the lstat manpage).

> +		} else {
> +			close(fd);
> +		}
> +	} else {
> +		if (!freshen_file(filename))
> +			return 0;

Style/indentation: just do "} else if (!freshen..." ?

> +	}
> +
>  	e.p->freshened = 1;
>  	return 1;
>  }
> diff --git a/packfile.c b/packfile.c
> index 755aa7aec5e..a607dda4e25 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -40,6 +40,19 @@ char *sha1_pack_index_name(const unsigned char *sha1)
>  	return odb_pack_name(&buf, sha1, "idx");
>  }
>  
> +const char *derive_pack_filename(const char *pack_name, const char *strip,
> +				 const char *suffix, struct strbuf *buf)
> +{
> +	size_t len;
> +	if (!strip_suffix(pack_name, strip, &len) || !len ||
> +	    pack_name[len - 1] != '.')
> +		die(_("packfile name '%s' does not end with '.%s'"),
> +		    pack_name, strip);
> +	strbuf_add(buf, pack_name, len);
> +	strbuf_addstr(buf, suffix);
> +	return buf->buf;
> +}

Just have this return void?

>  static unsigned int pack_used_ctr;
>  static unsigned int pack_mmap_calls;
>  static unsigned int peak_pack_open_windows;
> @@ -727,6 +740,17 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
>  	 */
>  	p->pack_size = st.st_size;
>  	p->pack_local = local;
> +
> +	/* If we have different file used to freshen the mtime, we should
> +	 * use it at a higher priority.
> +	 */
> +	if (!!strcasecmp(pack_mtime_suffix, "pack")) {
> +		struct strbuf name_buf = STRBUF_INIT;
> +		const char *filename;
> +
> +		filename = derive_pack_filename(path, "idx", pack_mtime_suffix, &name_buf);
> +		stat(filename, &st);

I.e. the "filename" here isn't needed, just call derive_pack_filename()
and use name_buf.buf to stat.

Also: we should check the stat() return value here & report errno if
needed, no?
> +test_expect_success 'do not bother loosening old objects with core.packmtimesuffix config' ' > + obj1=$(echo three | git hash-object -w --stdin) && > + obj2=$(echo four | git hash-object -w --stdin) && > + pack1=$(echo $obj1 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) && > + pack2=$(echo $obj2 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) && > + git -c core.packmtimesuffix=bump prune-packed && > + git cat-file -p $obj1 && > + git cat-file -p $obj2 && > + touch .git/objects/pack/pack-$pack2.bump && > + test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.bump && > + git -c core.packmtimesuffix=bump repack -A -d --unpack-unreachable=1.hour.ago && On command-lines we can spell it camel-cased, e.g. -c core.packMtimeSuffix[...]. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-14 1:39 ` Ævar Arnfjörð Bjarmason @ 2021-07-14 2:52 ` Taylor Blau 2021-07-14 16:46 ` Sun Chao 2021-07-14 16:11 ` Sun Chao 1 sibling, 1 reply; 34+ messages in thread From: Taylor Blau @ 2021-07-14 2:52 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Sun Chao via GitGitGadget, git, Sun Chao On Wed, Jul 14, 2021 at 03:39:18AM +0200, Ævar Arnfjörð Bjarmason wrote: > Hrm, per my v1 feedback (and I'm not sure if my suggestion is even good > here, there's others more familiar with this area than I am), I was > thinking of something like a *.bump file written via: > > core.packUseBumpFiles=bool > > Or something like that, anyway, the edge case in allowing the user to > pick arbitrary suffixes is that we'd get in-the-wild user arbitrary > configuration squatting on a relatively sensitive part of the object > store. > > E.g. we recently added *.rev files to go with > *.{pack,idx,bitmap,keep,promisor} (and I'm probably forgetting some > suffix). What if before that a user had set: > > core.packMtimeSuffix=rev I think making the suffix configurable is probably a mistake. It seems like an unnecessary detail to expose, but it also forces us to think about cases like these where the configured suffix is already used for some other purpose. I don't think that a new ".bump" file is a bad idea, but it does seem like we have a lot of files that represent a relatively little amount of the state that a pack can be in. The ".promisor" and ".keep" files both come to mind here. Some thoughts in this direction: - Combining *all* of the pack-related files (including the index, reverse-index, bitmap, and so on) into a single "pack-meta" file seems like a mistake for caching reasons. - But a meta file that contains just the small state (like promisor information and whether or not the pack is "kept") seems like it could be OK. 
On the other hand, being able to tweak the kept state by touching or deleting a file is convenient (and having to rewrite a meta file containing other information is much less so). But a ".bump" file does seem like an awkward way to not rely on the mtime of the pack itself. And I do think it runs into compatibility issues like Ævar mentioned. Any version of Git that includes a hypothetical .bump file (or something like it) needs to also update the pack's mtime, too, so that old versions of Git can understand it. (Of course, that could be configurable, but that seems far too obscure to me). Stepping back, I'm not sure I understand why freshening a pack is so slow for you. freshen_file() just calls utime(2), and any sync back to the disk shouldn't need to update the pack itself, just a couple of fields in its inode. Maybe you could help explain further. In any case, I couldn't find a spot in your patch that updates the packed_git's 'mtime' field, which is used to (a) sort packs in the linked list of packs, and (b) for determining the least-recently used pack if it has individual windows mmapped. Thanks, Taylor ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14  2:52 ` Taylor Blau
@ 2021-07-14 16:46 ` Sun Chao
  2021-07-14 17:04 ` Taylor Blau
  0 siblings, 1 reply; 34+ messages in thread

From: Sun Chao @ 2021-07-14 16:46 UTC (permalink / raw)
To: Taylor Blau
Cc: Ævar Arnfjörð Bjarmason, Sun Chao via GitGitGadget, git

> On Jul 14, 2021, at 10:52, Taylor Blau <me@ttaylorr.com> wrote:
>
> On Wed, Jul 14, 2021 at 03:39:18AM +0200, Ævar Arnfjörð Bjarmason wrote:
>> Hrm, per my v1 feedback (and I'm not sure if my suggestion is even good
>> here, there's others more familiar with this area than I am), I was
>> thinking of something like a *.bump file written via:
>>
>>     core.packUseBumpFiles=bool
>>
>> Or something like that, anyway, the edge case in allowing the user to
>> pick arbitrary suffixes is that we'd get in-the-wild user arbitrary
>> configuration squatting on a relatively sensitive part of the object
>> store.
>>
>> E.g. we recently added *.rev files to go with
>> *.{pack,idx,bitmap,keep,promisor} (and I'm probably forgetting some
>> suffix). What if before that a user had set:
>>
>>     core.packMtimeSuffix=rev
>
> I think making the suffix configurable is probably a mistake. It seems
> like an unnecessary detail to expose, but it also forces us to think
> about cases like these where the configured suffix is already used for
> some other purpose.

Thanks, I agree with you and will fix it; just like the *.keep file, we
should not use a suffix configuration to decide which file gets created.

> I don't think that a new ".bump" file is a bad idea, but it does seem
> like we have a lot of files that represent a relatively little amount of
> the state that a pack can be in. The ".promisor" and ".keep" files both
> come to mind here. Some thoughts in this direction:
>
>  - Combining *all* of the pack-related files (including the index,
>    reverse-index, bitmap, and so on) into a single "pack-meta" file
>    seems like a mistake for caching reasons.
>
>  - But a meta file that contains just the small state (like promisor
>    information and whether or not the pack is "kept") seems like it
>    could be OK. On the other hand, being able to tweak the kept state
>    by touching or deleting a file is convenient (and having to rewrite
>    a meta file containing other information is much less so).

Yes, and reading and rewriting a meta file also means we need to lock
and unlock it, which makes those operations less convenient.

> But a ".bump" file does seem like an awkward way to not rely on the
> mtime of the pack itself. And I do think it runs into compatibility
> issues like Ævar mentioned. Any version of Git that includes a
> hypothetical .bump file (or something like it) needs to also update the
> pack's mtime, too, so that old versions of Git can understand it. (Of
> course, that could be configurable, but that seems far too obscure to
> me).

Here we will have a configuration whose default keeps backward
compatibility, and if a user decides to use the '.bump' file it means
they can no longer use old versions of Git; like the Repository Format
Version, it is limited to newer versions.

> Stepping back, I'm not sure I understand why freshening a pack is so
> slow for you. freshen_file() just calls utime(2), and any sync back to
> the disk shouldn't need to update the pack itself, just a couple of
> fields in its inode. Maybe you could help explain further.
>
> In any case, I couldn't find a spot in your patch that updates the
> packed_git's 'mtime' field, which is used to (a) sort packs in the
> linked list of packs, and (b) for determining the least-recently used
> pack if it has individual windows mmapped.

The reason why we want to avoid freshening the mtime of the ".pack"
file is to improve the read speed of our Git servers.

We have some large repositories on our Git servers (some are bigger
than 10GB), and we created '.keep' files for the large ".pack" files:
we want the big files to stay unchanged to speed up git upload-pack,
because in our experience the file system cache reduces disk IO when a
file does not change.

However, we find that the mtime of the ".pack" files changes over time,
which makes the file system reload the big files again and again. That
takes a lot of IO time, slows down git upload-pack, and can even
exhaust the disk IOPS.

> Thanks,
> Taylor
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration
  2021-07-14 16:46 ` Sun Chao
@ 2021-07-14 17:04 ` Taylor Blau
  2021-07-14 18:19 ` Ævar Arnfjörð Bjarmason
  2021-07-15 16:30 ` Sun Chao
  0 siblings, 2 replies; 34+ messages in thread

From: Taylor Blau @ 2021-07-14 17:04 UTC (permalink / raw)
To: Sun Chao
Cc: Taylor Blau, Ævar Arnfjörð Bjarmason, Sun Chao via GitGitGadget, git

On Thu, Jul 15, 2021 at 12:46:47AM +0800, Sun Chao wrote:
> > Stepping back, I'm not sure I understand why freshening a pack is so
> > slow for you. freshen_file() just calls utime(2), and any sync back to
> > the disk shouldn't need to update the pack itself, just a couple of
> > fields in its inode. Maybe you could help explain further.
> >
> > [ ... ]
>
> The reason why we want to avoid freshen the mtime of ".pack" file is to
> improve the reading speed of Git Servers.
>
> We have some large repositories in our Git Severs (some are bigger than
> 10GB), and we created '.keep' files for large ".pack" files, we want the
> big files unchanged to speed up git upload-pack, because in our mind the
> file system cache will reduce the disk IO if a file does not changed.
>
> However we find the mtime of ".pack" files changes over time which makes
> the file system always reload the big files, that takes a lot of IO time
> and result in lower speed of git upload-pack and even further the disk
> IOPS is exhausted.

That's surprising behavior to me. Are you saying that calling utime(2)
causes the *page* cache to be invalidated and that most reads are
cache-misses lowering overall IOPS?

If so, then I am quite surprised ;). The only state that should be
dirtied by calling utime(2) is the inode itself, so the blocks referred
to by the inode corresponding to a pack should be left intact.

If you're on Linux, you can try observing the behavior of evicting
inodes, blocks, or both from the disk cache by changing "2" in the
following:

    hyperfine 'git pack-objects --all --stdout --delta-base-offset >/dev/null' \
        --prepare='sync; echo 2 | sudo tee /proc/sys/vm/drop_caches'

where "1" drops the page cache, "2" drops the inodes, and "3" evicts
both.

I wonder if you could share the results of running the above varying
the value of "1", "2", and "3", as well as swapping the `--prepare` for
`--warmup=3` to warm your caches (and give us an idea of what your
expected performance is probably like).

Thanks,
Taylor
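Taylor's point that utime(2) only dirties the inode, never the file's data blocks, is easy to sanity-check from a shell. In this sketch `touch -m` stands in for git's freshen_file() since both end up in the utime()/utimensat() family (assuming a Linux box with GNU coreutils `stat`):

```shell
# Sketch: freshening only bumps the inode's mtime; the content is untouched.
# "touch -m" uses the same syscall family as git's freshen_file().
f=$(mktemp)
echo 'pretend this is a large .pack' >"$f"
before=$(stat -c %Y "$f")   # mtime in whole seconds (GNU stat)
sleep 1
touch -m "$f"               # "freshen", as write_object_file() would
after=$(stat -c %Y "$f")
test "$after" -gt "$before" && echo "mtime freshened"
```

The file's contents are bit-for-bit identical before and after; only the inode's timestamp field changed, which is why an mtime-driven backup tool (rather than the kernel's page cache) is the more plausible culprit for re-reading the pack.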
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-14 17:04 ` Taylor Blau @ 2021-07-14 18:19 ` Ævar Arnfjörð Bjarmason 2021-07-14 19:11 ` Martin Fick 2021-07-14 19:30 ` Taylor Blau 2021-07-15 16:30 ` Sun Chao 1 sibling, 2 replies; 34+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-07-14 18:19 UTC (permalink / raw) To: Taylor Blau; +Cc: Sun Chao, Taylor Blau, Sun Chao via GitGitGadget, git On Wed, Jul 14 2021, Taylor Blau wrote: > On Thu, Jul 15, 2021 at 12:46:47AM +0800, Sun Chao wrote: >> > Stepping back, I'm not sure I understand why freshening a pack is so >> > slow for you. freshen_file() just calls utime(2), and any sync back to >> > the disk shouldn't need to update the pack itself, just a couple of >> > fields in its inode. Maybe you could help explain further. >> > >> > [ ... ] >> >> The reason why we want to avoid freshen the mtime of ".pack" file is to >> improve the reading speed of Git Servers. >> >> We have some large repositories in our Git Severs (some are bigger than 10GB), >> and we created '.keep' files for large ".pack" files, we want the big files >> unchanged to speed up git upload-pack, because in our mind the file system >> cache will reduce the disk IO if a file does not changed. >> >> However we find the mtime of ".pack" files changes over time which makes the >> file system always reload the big files, that takes a lot of IO time and result >> in lower speed of git upload-pack and even further the disk IOPS is exhausted. > > That's surprising behavior to me. Are you saying that calling utime(2) > causes the *page* cache to be invalidated and that most reads are > cache-misses lowering overall IOPS? > > If so, then I am quite surprised ;). The only state that should be > dirtied by calling utime(2) is the inode itself, so the blocks referred > to by the inode corresponding to a pack should be left in-tact. 
> > If you're on Linux, you can try observing the behavior of evicting > inodes, blocks, or both from the disk cache by changing "2" in the > following: > > hyperfine 'git pack-objects --all --stdout --delta-base-offset >/dev/null' > --prepare='sync; echo 2 | sudo tee /proc/sys/vm/drop_caches' > > where "1" drops the page cache, "2" drops the inodes, and "3" evicts > both. > > I wonder if you could share the results of running the above varying > the value of "1", "2", and "3", as well as swapping the `--prepare` for > `--warmup=3` to warm your caches (and give us an idea of what your > expected performance is probably like). I think you may be right narrowly, but wrong in this context :) I.e. my understanding of this problem is that they have some incremental backup job, e.g. rsync without --checksum (not that doing that would help, chicken & egg issue).. So by changing the mtime you cause the file to be re-synced. Yes Linux (or hopefully any modern OS) isn't so dumb as to evict your FS cache because of such a metadata change, but that's besides the point. If you have a backup job like that your FS cache will get evicted or be subject to churn anyway, because you'll shortly be having to deal with the "rsync" job that's noticed the changed mtime competing for caching resources with "real" traffic. Sun: Does that summarize the problem you're having? <large digression ahead> Sun, also: Note that in general doing backups of live git repositories with rsync is a bad idea, and will lead to corruption. The most common cause of such corruption is that a tool like "rsync" will iterate recursively through say "objects" followed by "refs". So by the time it gets to the latter (or is doing a deep iteration within those dirs) git's state has changed in such a way as to yield an rsync backup in a state that the repository was never in. 
(As an aside, I've often wondered what it is about git exactly that
makes people who'd never think of doing the same thing with the FS part
of an RDBMS's data store think that implementing such an ad-hoc backup
solution for git would be a good idea, but I digress. Perhaps we need
scarier BerkeleyDB-looking names in the .git directory :)

Even if you do FS snapshots of live git repositories you're likely to
get corruption; search this mailing list for references to fsync(),
e.g. [1].

In short, git has historically been (and still is) sloppy about fsync,
and has relied on non-standard behavior such as "if I do N updates for
N=1..100, and fsync just '100', then I can assume 1..99 are fsynced"
(spoiler: you can't assume that).

Our use of fsync is still broken in that sense today; git is not a safe
place to store your data in the POSIXLY pedantic sense (and no, I don't
just mean that core.fsyncObjectFiles is `false` by default; it only
covers a small part of this, e.g. we don't fsync dir entries even with
that).

On a real live filesystem this is usually not an issue, because if
you're not dealing with yanked power cords (and even then, journals
might save you), then even if you fsync a file but don't fsync the dir
entry it's in, the FS is usually forgiving about such cases. I.e. if
someone does a concurrent request for the could-be-outdated dir entry
it'll serve the up-to-date one, even without that having been fsync'd,
because the VFS layer isn't going to the synced disk; it's checking its
current state and servicing your request from that.

But at least some FS snapshot implementations have a habit of exposing
the most pedantic interpretation possible of FS semantics, and one that
you wouldn't ever get on a live FS. I.e. you might be hooking into the
equivalent of the order in which things are written to disk, and end up
with a state that would never have been exposed to a running program
(there would be a 1=1 correspondence if we fsync'd properly, which we
don't).

The best way to get backups of git repositories you know are correct is
to use git's own transport mechanisms, i.e. fetch/pull the data, or
create bundles from it. This would be the case even if we fixed all our
fsync issues, because doing so wouldn't help you in the case of a
bit-flip, but an "index-pack" on the other end will spot such issues.

1. https://lore.kernel.org/git/20200917112830.26606-2-avarab@gmail.com/
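The bundle-based approach recommended above can be quite small in practice. The sketch below builds a throwaway repository to stand in for a live server-side one (the repo and paths are illustrative, not from the thread); the key property is that the bundle is written through git's own object machinery, so unlike an rsync snapshot it is internally consistent, and `git bundle verify` can check it before anyone trusts it:

```shell
# Sketch: back up a repository as a bundle, then restore it with clone.
tmp=$(mktemp -d)
git init -q "$tmp/src"                 # stands in for the live repository
git -C "$tmp/src" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'demo commit'

# 1. backup: all refs plus every object reachable from them
git -C "$tmp/src" bundle create "$tmp/backup.bundle" --all

# 2. integrity check, e.g. as the last step of the backup job
git -C "$tmp/src" bundle verify "$tmp/backup.bundle"

# 3. restore is just a clone from the bundle file
git clone -q "$tmp/backup.bundle" "$tmp/restored"
```

As Martin notes next, this covers the object/ref data only; config, reflogs, and alternates still need separate handling.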
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-14 18:19 ` Ævar Arnfjörð Bjarmason @ 2021-07-14 19:11 ` Martin Fick 2021-07-14 19:41 ` Ævar Arnfjörð Bjarmason 2021-07-14 19:30 ` Taylor Blau 1 sibling, 1 reply; 34+ messages in thread From: Martin Fick @ 2021-07-14 19:11 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Taylor Blau, Sun Chao, Taylor Blau, Sun Chao via GitGitGadget, git On Wednesday, July 14, 2021 8:19:15 PM MDT Ævar Arnfjörð Bjarmason wrote: > The best way to get backups of git repositories you know are correct are > is to use git's own transport mechanisms, i.e. fetch/pull the data, or > create bundles from it. I don't think this is a fair recommendation since unfortunately, this cannot be used to create a full backup. This can be used to back up the version controlled data, but not the repositories meta-data, i.e. configs, reflogs, alternate setups... -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-14 19:11 ` Martin Fick @ 2021-07-14 19:41 ` Ævar Arnfjörð Bjarmason 2021-07-14 20:20 ` Martin Fick 2021-07-15 8:23 ` Son Luong Ngoc 0 siblings, 2 replies; 34+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-07-14 19:41 UTC (permalink / raw) To: Martin Fick Cc: Taylor Blau, Sun Chao, Taylor Blau, Sun Chao via GitGitGadget, git On Wed, Jul 14 2021, Martin Fick wrote: > On Wednesday, July 14, 2021 8:19:15 PM MDT Ævar Arnfjörð Bjarmason wrote: >> The best way to get backups of git repositories you know are correct are >> is to use git's own transport mechanisms, i.e. fetch/pull the data, or >> create bundles from it. > > I don't think this is a fair recommendation since unfortunately, this cannot > be used to create a full backup. This can be used to back up the version > controlled data, but not the repositories meta-data, i.e. configs, reflogs, > alternate setups... *nod* FWIW at an ex-job I helped systems administrators who'd produced such a broken backup-via-rsync create a hybrid version as an interim solution. I.e. it would sync the objects via git transport, and do an rsync on a whitelist (or blacklist), so pickup config, but exclude objects. "Hybrid" because it was in a state of needing to deal with manual tweaking of config. But usually someone who's needing to thoroughly solve this backup problem will inevitably end up with wanting to drive everything that's not in the object or refstore from some external system, i.e. have config be generated from puppet, a database etc., ditto for alternates etc. But even if you can't get to that point (or don't want to) I'd say aim for the hybrid system. This isn't some purely theoretical concern b.t.w., the system using rsync like this was producing repos that wouldn't fsck all the time, and it wasn't such a busy site. 
I suspect (but haven't tried) that for someone who can't easily change their backup solution they'd get most of the benefits of git-native transport by having their "rsync" sync refs, then objects, not the other way around. Glob order dictates that most backup systems will do objects, then refs (which will of course, at that point, refer to nonexisting objects). It's still not safe, you'll still be subject to races, but probably a lot better in practice. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-14 19:41 ` Ævar Arnfjörð Bjarmason @ 2021-07-14 20:20 ` Martin Fick 2021-07-20 6:32 ` Ævar Arnfjörð Bjarmason 2021-07-15 8:23 ` Son Luong Ngoc 1 sibling, 1 reply; 34+ messages in thread From: Martin Fick @ 2021-07-14 20:20 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Taylor Blau, Sun Chao, Taylor Blau, Sun Chao via GitGitGadget, git On Wednesday, July 14, 2021 9:41:42 PM MDT you wrote: > On Wed, Jul 14 2021, Martin Fick wrote: > > On Wednesday, July 14, 2021 8:19:15 PM MDT Ævar Arnfjörð Bjarmason wrote: > >> The best way to get backups of git repositories you know are correct are > >> is to use git's own transport mechanisms, i.e. fetch/pull the data, or > >> create bundles from it. > > > > I don't think this is a fair recommendation since unfortunately, this > > cannot be used to create a full backup. This can be used to back up the > > version controlled data, but not the repositories meta-data, i.e. > > configs, reflogs, alternate setups... > > *nod* > > FWIW at an ex-job I helped systems administrators who'd produced such a > broken backup-via-rsync create a hybrid version as an interim > solution. I.e. it would sync the objects via git transport, and do an > rsync on a whitelist (or blacklist), so pickup config, but exclude > objects. > > "Hybrid" because it was in a state of needing to deal with manual > tweaking of config. > > But usually someone who's needing to thoroughly solve this backup > problem will inevitably end up with wanting to drive everything that's > not in the object or refstore from some external system, i.e. have > config be generated from puppet, a database etc., ditto for alternates > etc. > > But even if you can't get to that point (or don't want to) I'd say aim > for the hybrid system. 
>
> This isn't some purely theoretical concern b.t.w., the system using
> rsync like this was producing repos that wouldn't fsck all the time,
> and it wasn't such a busy site.
>
> I suspect (but haven't tried) that for someone who can't easily change
> their backup solution they'd get most of the benefits of git-native
> transport by having their "rsync" sync refs, then objects, not the
> other way around. Glob order dictates that most backup systems will do
> objects, then refs (which will of course, at that point, refer to
> nonexisting objects).
>
> It's still not safe, you'll still be subject to races, but probably a
> lot better in practice.

It would be great if git provided a command to do a reliable
incremental backup; maybe it could copy things in the order that you
mention?

However, most people will want to use the backup system they have and
not a special git tool. Maybe git fsck should gain a switch that would
rewind any refs to an older point that is not broken (using reflogs)?
That way, most backups would just work and be rewound to the point at
which the backup started?

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation
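Martin's hypothetical "rewind broken refs" mode might look something like the sketch below (purely illustrative; no such git-fsck option exists, and the function name is made up). For each branch whose tip fails a connectivity walk, it reads candidate tips straight out of the reflog file, newest first, and resets the ref to the first one that is still fully connected. It assumes the files ref backend and being run from the top of a non-bare repository:

```shell
# Hypothetical sketch of a "git fsck --rewind-broken-refs" behavior
# (not a real option): reset each broken branch tip to its newest
# reflog entry that still passes a connectivity check.
rewind_broken_refs() {
    git for-each-ref --format='%(refname)' refs/heads |
    while read -r ref; do
        # an intact tip passes a full object-connectivity walk
        git rev-list --objects --quiet "$ref" 2>/dev/null && continue
        # field 2 of each reflog line is the "new" value of that update;
        # print them newest first
        awk '{a[NR]=$2} END {for (i = NR; i >= 1; i--) print a[i]}' \
            ".git/logs/$ref" |
        while read -r old; do
            if git rev-list --objects --quiet "$old" 2>/dev/null; then
                git update-ref "$ref" "$old"
                break
            fi
        done
    done
}
```

Even as a sketch this shows why the idea is attractive: the reflog already records every state the ref has passed through, so "the point at which the backup started" is usually one of those entries.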
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-14 20:20 ` Martin Fick @ 2021-07-20 6:32 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 34+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-07-20 6:32 UTC (permalink / raw) To: Martin Fick Cc: Taylor Blau, Sun Chao, Taylor Blau, Sun Chao via GitGitGadget, git On Wed, Jul 14 2021, Martin Fick wrote: > On Wednesday, July 14, 2021 9:41:42 PM MDT you wrote: >> On Wed, Jul 14 2021, Martin Fick wrote: >> > On Wednesday, July 14, 2021 8:19:15 PM MDT Ævar Arnfjörð Bjarmason wrote: >> >> The best way to get backups of git repositories you know are correct are >> >> is to use git's own transport mechanisms, i.e. fetch/pull the data, or >> >> create bundles from it. >> > >> > I don't think this is a fair recommendation since unfortunately, this >> > cannot be used to create a full backup. This can be used to back up the >> > version controlled data, but not the repositories meta-data, i.e. >> > configs, reflogs, alternate setups... >> >> *nod* >> >> FWIW at an ex-job I helped systems administrators who'd produced such a >> broken backup-via-rsync create a hybrid version as an interim >> solution. I.e. it would sync the objects via git transport, and do an >> rsync on a whitelist (or blacklist), so pickup config, but exclude >> objects. >> >> "Hybrid" because it was in a state of needing to deal with manual >> tweaking of config. >> >> But usually someone who's needing to thoroughly solve this backup >> problem will inevitably end up with wanting to drive everything that's >> not in the object or refstore from some external system, i.e. have >> config be generated from puppet, a database etc., ditto for alternates >> etc. >> >> But even if you can't get to that point (or don't want to) I'd say aim >> for the hybrid system. 
>> >> This isn't some purely theoretical concern b.t.w., the system using >> rsync like this was producing repos that wouldn't fsck all the time, and >> it wasn't such a busy site. >> >> I suspect (but haven't tried) that for someone who can't easily change >> their backup solution they'd get most of the benefits of git-native >> transport by having their "rsync" sync refs, then objects, not the other >> way around. Glob order dictates that most backup systems will do >> objects, then refs (which will of course, at that point, refer to >> nonexisting objects). >> >> It's still not safe, you'll still be subject to races, but probably a >> lot better in practice. > > It would be great if git provided a command to do a reliable incremental > backup, maybe it could copy things in the order that you mention? I don't think we can or want to support this sort of thing ever, for the same reason that you probably won't convince MySQL,PostgreSQL etc. that they should support "cp -r" as a mode for backing up their live database services. I mean, there is the topic of git being lazy about fsync() etc, but even if all of that were 100% solved you'd still get bad things if you picked an arbitrary time to snapshot a running git directory, e.g. your "master" branch might have a "master.lock" because it was in the middle of an update. If you used "fetch/clone/bundle" etc. to get the data no problem, but if your snapshot happens then you'd need to manually clean that up, a situation which in practice wouldn't persist, but would be persistent with a snapshot approach. > However, most people will want to use the backup system they have and not a > special git tool. Maybe git fsck should gain a switch that would rewind any > refs to an older point that is no broken (using reflogs)? That way, most > backups would just work and be rewound to the point at which the backup > started? I think the main problem in the wild is not the inability of using a special tool, but one of education. 
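[Editor's sketch: the "refs first, then objects" ordering quoted above can be pictured as a two-pass copy. Plain `cp` stands in for rsync here so the demo is self-contained; all paths are fabricated, and as the thread notes this is still racy, merely better than the glob order. The point: a ref captured in pass 1 can only name objects that already existed, so pass 2 necessarily picks them up.]

```shell
set -e
src=$(mktemp -d)/src.git
dst=$(mktemp -d)/dst.git
work=$(mktemp -d)/work
git init -q --bare "$src"
git init -q "$work"
git -C "$work" config user.email backup@example.com   # demo identity
git -C "$work" config user.name backup-demo
git -C "$work" commit -q --allow-empty -m snapshot
git -C "$work" push -q "$src" HEAD:refs/heads/master

mkdir -p "$dst"
# pass 1: refs (HEAD and packed-refs are ref storage too)
cp -a "$src/HEAD" "$src/refs" "$dst/"
if [ -e "$src/packed-refs" ]; then cp -a "$src/packed-refs" "$dst/"; fi
# pass 2: objects -- everything pass 1's refs point at already exists
cp -a "$src/objects" "$dst/"
# pass 3: the remaining metadata
cp -a "$src/config" "$dst/"

git -C "$dst" rev-parse --verify refs/heads/master
```

With rsync the same idea is two invocations: one restricted to the ref files, then one for `objects/`, instead of a single recursive pass.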
Most people wouldn't think of "cp -r" as a first approach to, say, backing up a live mysql server; they'd use mysqldump and the like. But for some reason git is considered "not a database" enough that those same people would just use rsync/tar/whatever, and are then surprised when their data is corrupt or in some weird or inconsistent state... Anyway, see also my just-posted: https://lore.kernel.org/git/878s21wl4z.fsf@evledraar.gmail.com/ I.e. I'm not saying "never use rsync", there are cases where that's fine, but for a live "real" server I'd say solutions in that class shouldn't be considered, and existing setups should be actively migrated away from. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-14 19:41 ` Ævar Arnfjörð Bjarmason 2021-07-14 20:20 ` Martin Fick @ 2021-07-15 8:23 ` Son Luong Ngoc 2021-07-20 6:29 ` Ævar Arnfjörð Bjarmason 1 sibling, 1 reply; 34+ messages in thread From: Son Luong Ngoc @ 2021-07-15 8:23 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Martin Fick, Taylor Blau, Sun Chao, Taylor Blau, Sun Chao via GitGitGadget, git Hi folks, On Wed, Jul 14, 2021 at 10:03 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > *nod* > > FWIW at an ex-job I helped systems administrators who'd produced such a > broken backup-via-rsync create a hybrid version as an interim > solution. I.e. it would sync the objects via git transport, and do an > rsync on a whitelist (or blacklist), so pickup config, but exclude > objects. > > "Hybrid" because it was in a state of needing to deal with manual > tweaking of config. > > But usually someone who's needing to thoroughly solve this backup > problem will inevitably end up with wanting to drive everything that's > not in the object or refstore from some external system, i.e. have > config be generated from puppet, a database etc., ditto for alternates > etc. > > But even if you can't get to that point (or don't want to) I'd say aim > for the hybrid system. FWIW, we are running our repo on top of a somewhat flaky DRBD setup and we decided to use both `git clone --upload-pack 'git -c transfer.hiderefs="!refs" upload-pack' --mirror` and `tar` to create 2 separate snapshots for backup in parallel (full backup, not incremental). In case of recovery (manual), we first rely on the git snapshot and if there are any missing objects/refs, we will try to get them from the tarball. > > This isn't some purely theoretical concern b.t.w., the system using > rsync like this was producing repos that wouldn't fsck all the time, and > it wasn't such a busy site.
> > I suspect (but haven't tried) that for someone who can't easily change > their backup solution they'd get most of the benefits of git-native > transport by having their "rsync" sync refs, then objects, not the other > way around. Glob order dictates that most backup systems will do > objects, then refs (which will of course, at that point, refer to > nonexisting objects). > > It's still not safe, you'll still be subject to races, but probably a > lot better in practice. I would love to get some guidance in official documentation on what is the best practice around handling git data on the server side. Is git-clone + git-bundle the go-to solution? Should tar/rsync not be used completely or is there a trade-off? Thanks, Son Luong. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-15 8:23 ` Son Luong Ngoc @ 2021-07-20 6:29 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 34+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-07-20 6:29 UTC (permalink / raw) To: Son Luong Ngoc Cc: Martin Fick, Taylor Blau, Sun Chao, Taylor Blau, Sun Chao via GitGitGadget, git On Thu, Jul 15 2021, Son Luong Ngoc wrote: > Hi folks, > > On Wed, Jul 14, 2021 at 10:03 PM Ævar Arnfjörð Bjarmason > <avarab@gmail.com> wrote: >> >> *nod* >> >> FWIW at an ex-job I helped systems administrators who'd produced such a >> broken backup-via-rsync create a hybrid version as an interim >> solution. I.e. it would sync the objects via git transport, and do an >> rsync on a whitelist (or blacklist), so pickup config, but exclude >> objects. >> >> "Hybrid" because it was in a state of needing to deal with manual >> tweaking of config. >> >> But usually someone who's needing to thoroughly solve this backup >> problem will inevitably end up with wanting to drive everything that's >> not in the object or refstore from some external system, i.e. have >> config be generated from puppet, a database etc., ditto for alternates >> etc. >> >> But even if you can't get to that point (or don't want to) I'd say aim >> for the hybrid system. > > FWIW, we are running our repo on top of a some-what flickery DRBD setup and > we decided to use both > > git clone --upload-pack 'git -c transfer.hiderefs="!refs" > upload-pack' --mirror` > > and > > `tar` > > to create 2 separate snapshots for backup in parallel (full backup, > not incremental). > > In case of recovery (manual), we first rely on the git snapshot and if > there is any > missing objects/refs, we will try to get it from the tarball. That sounds good, and similar to what I described with that "hybrid" setup. 
>> >> This isn't some purely theoretical concern b.t.w., the system using >> rsync like this was producing repos that wouldn't fsck all the time, and >> it wasn't such a busy site. >> >> I suspect (but haven't tried) that for someone who can't easily change >> their backup solution they'd get most of the benefits of git-native >> transport by having their "rsync" sync refs, then objects, not the other >> way around. Glob order dictates that most backup systems will do >> objects, then refs (which will of course, at that point, refer to >> nonexisting objects). >> >> It's still not safe, you'll still be subject to races, but probably a >> lot better in practice. > > I would love to get some guidance in official documentation on what is the best > practice around handling git data on the server side. > > Is git-clone + git-bundle the go-to solution? > Should tar/rsync not be used completely or is there a trade-off? I should have tempered some of those comments, it's perfectly fine in general to use tar+rsync for "backing up" git repositories in certain contexts. E.g. when I switch laptops or whatever it's what I do to grab data. The problem is when the data isn't at rest, i.e. in the context of an active server. There you start moving towards a scale where it goes from "sure, it's fine" to "this is such a bad idea that nobody should pursue it". If you're running a setup where you're starting to submit patches to git.git you're probably at the far end of that spectrum. Whether it's clone, push, fetch, bundle etc. doesn't really matter, the important part is that you're using git's pack transport mechanism to ferry updates around, which gives you guarantees rsync+tar can't, particularly in the face of concurrently updated data. ^ permalink raw reply [flat|nested] 34+ messages in thread
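[Editor's sketch: the pack-transport route described above can be as simple as a bundle: one self-consistent file produced through git's normal object machinery, later cloned like a remote. A minimal demo with throwaway paths and an invented identity, assuming git is installed:]

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
git -C "$tmp/repo" config user.email backup@example.com  # demo identity
git -C "$tmp/repo" config user.name backup-demo
echo hello >"$tmp/repo/f"
git -C "$tmp/repo" add f
git -C "$tmp/repo" commit -q -m one
branch=$(git -C "$tmp/repo" symbolic-ref --short HEAD)

# "backup": one file holding refs plus every object they need,
# assembled via the same machinery as fetch/push
git -C "$tmp/repo" bundle create "$tmp/backup.bundle" --all

# "restore": a bundle can be cloned from (or fetched from) like a remote
git clone -q --mirror "$tmp/backup.bundle" "$tmp/restored.git"
git -C "$tmp/restored.git" symbolic-ref HEAD "refs/heads/$branch"
git -C "$tmp/restored.git" fsck
```

Unlike an rsync of a live repository, the bundle is written from a single walk of the refs, so it can never contain a ref that points at a missing object.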
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-14 18:19 ` Ævar Arnfjörð Bjarmason 2021-07-14 19:11 ` Martin Fick @ 2021-07-14 19:30 ` Taylor Blau 2021-07-14 19:32 ` Ævar Arnfjörð Bjarmason 2021-07-14 21:40 ` Junio C Hamano 1 sibling, 2 replies; 34+ messages in thread From: Taylor Blau @ 2021-07-14 19:30 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Taylor Blau, Sun Chao, Sun Chao via GitGitGadget, git On Wed, Jul 14, 2021 at 08:19:15PM +0200, Ævar Arnfjörð Bjarmason wrote: > >> The reason why we want to avoid freshen the mtime of ".pack" file is to > >> improve the reading speed of Git Servers. > > > > That's surprising behavior to me. Are you saying that calling utime(2) > > causes the *page* cache to be invalidated and that most reads are > > cache-misses lowering overall IOPS? > > I think you may be right narrowly, but wrong in this context :) > > I.e. my understanding of this problem is that they have some incremental > backup job, e.g. rsync without --checksum (not that doing that would > help, chicken & egg issue).. Ah, thanks for explaining. That's helpful, and changes my thinking. Ideally, Sun would be able to use --checksum (if they are using rsync) or some equivalent (if they are not). In other words, this seems like a problem that Git shouldn't be bending over backwards for. But if that isn't possible, then I find introducing a new file to redefine the pack's mtime just to accommodate a backup system that doesn't know better to be a poor justification for adding this complexity. Especially since we agree that rsync-ing live Git repositories is a bad idea in the first place ;). If it were me, I would probably stop here and avoid pursuing this further. But an OK middle ground might be core.freshenPackfiles=<bool> to indicate whether or not packs can be freshened, or the objects contained within them should just be rewritten loose. 
Sun could then set this configuration to "false", implying: - That they would have more random loose objects, leading to some redundant work by their backup system. - But they wouldn't have to resync their huge packfiles. ...and we wouldn't have to introduce any new formats/file types to do it. To me, that seems like a net-positive outcome. Thanks, Taylor ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-14 19:30 ` Taylor Blau @ 2021-07-14 19:32 ` Ævar Arnfjörð Bjarmason 2021-07-14 19:52 ` Taylor Blau 2021-07-14 21:40 ` Junio C Hamano 1 sibling, 1 reply; 34+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-07-14 19:32 UTC (permalink / raw) To: Taylor Blau; +Cc: Taylor Blau, Sun Chao, Sun Chao via GitGitGadget, git On Wed, Jul 14 2021, Taylor Blau wrote: > On Wed, Jul 14, 2021 at 08:19:15PM +0200, Ævar Arnfjörð Bjarmason wrote: >> >> The reason why we want to avoid freshen the mtime of ".pack" file is to >> >> improve the reading speed of Git Servers. >> > >> > That's surprising behavior to me. Are you saying that calling utime(2) >> > causes the *page* cache to be invalidated and that most reads are >> > cache-misses lowering overall IOPS? >> >> I think you may be right narrowly, but wrong in this context :) >> >> I.e. my understanding of this problem is that they have some incremental >> backup job, e.g. rsync without --checksum (not that doing that would >> help, chicken & egg issue).. > > Ah, thanks for explaining. That's helpful, and changes my thinking. > > Ideally, Sun would be able to use --checksum (if they are using rsync) > or some equivalent (if they are not). In other words, this seems like a > problem that Git shouldn't be bending over backwards for. Even with my strong opinions about rsync being bad for this use-case, in practice it does work for a lot of people, e.g. with nightly jobs etc. Not everyone's repository is insanely busy, where I've mainly seen this sort of corruption. In that case making people use --checksum is borderline inhumane :) > But if that isn't possible, then I find introducing a new file to > redefine the pack's mtime just to accommodate a backup system that > doesn't know better to be a poor justification for adding this > complexity. Especially since we agree that rsync-ing live Git > repositories is a bad idea in the first place ;). 
> > If it were me, I would probably stop here and avoid pursuing this > further. But an OK middle ground might be core.freshenPackfiles=<bool> > to indicate whether or not packs can be freshened, or the objects > contained within them should just be rewritten loose. > > Sun could then set this configuration to "false", implying: > > - That they would have more random loose objects, leading to some > redundant work by their backup system. > - But they wouldn't have to resync their huge packfiles. > > ...and we wouldn't have to introduce any new formats/file types to do > it. To me, that seems like a net-positive outcome. This approach is getting quite close to my core.checkCollisions patch, to the point of perhaps being indistinguishable in practice: https://lore.kernel.org/git/20181028225023.26427-5-avarab@gmail.com/ I.e. if you're happy to re-write out duplicate objects then you're going to be ignoring the collision check and don't need to do it. It's not the same in that you might skip writing objects you know are reachable, and with the collisions check off and not-so-thin packs you will/might get more redundancy than you asked for. But in practice with modern clients mostly/entirely sending you just the things you need in the common case it might be close enough. I mean it addresses the expiry race that a unreachable-becomes-reachable again race would be "solved" by just re-writing that data, hrm, but there's probably aspects of that race I'm not considering. Anyway, in the sense that for a lot of systems syncing file additions is a lot cheaper than rewrites it might get you what you want... ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-14 19:32 ` Ævar Arnfjörð Bjarmason @ 2021-07-14 19:52 ` Taylor Blau 0 siblings, 0 replies; 34+ messages in thread From: Taylor Blau @ 2021-07-14 19:52 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Taylor Blau, Sun Chao, Sun Chao via GitGitGadget, git On Wed, Jul 14, 2021 at 09:32:26PM +0200, Ævar Arnfjörð Bjarmason wrote: > > But if that isn't possible, then I find introducing a new file to > > redefine the pack's mtime just to accommodate a backup system that > > doesn't know better to be a poor justification for adding this > > complexity. Especially since we agree that rsync-ing live Git > > repositories is a bad idea in the first place ;). > > > > If it were me, I would probably stop here and avoid pursuing this > > further. But an OK middle ground might be core.freshenPackfiles=<bool> > > to indicate whether or not packs can be freshened, or the objects > > contained within them should just be rewritten loose. > > > > Sun could then set this configuration to "false", implying: > > > > - That they would have more random loose objects, leading to some > > redundant work by their backup system. > > - But they wouldn't have to resync their huge packfiles. > > > > ...and we wouldn't have to introduce any new formats/file types to do > > it. To me, that seems like a net-positive outcome. > > This approach is getting quite close to my core.checkCollisions patch, > to the point of perhaps being indistinguishable in practice: > https://lore.kernel.org/git/20181028225023.26427-5-avarab@gmail.com/ Hmm, I'm not sure if I understand. That collision check is only done during index-pack, and reading builtin/index-pack.c:check_collision(), it looks like we only do it for large blobs anyway. > I.e. if you're happy to re-write out duplicate objects then you're going > to be ignoring the collision check and don't need to do it. 
It's not the > same in that you might skip writing objects you know are reachable, and > with the collisions check off and not-so-thin packs you will/might get > more redundancy than you asked for. We may be talking about different things, but if users are concerned about SHA-1 collisions, then they should still be able to build with DC_SHA1=YesPlease to catch shattered-style collisions. Anyway, I think we may be a little in the weeds for what we are trying to accomplish here. I'm thinking something along the lines of the following (sans documentation and tests, of course ;)). --- >8 --- diff --git a/object-file.c b/object-file.c index f233b440b2..87c9238365 100644 --- a/object-file.c +++ b/object-file.c @@ -1971,9 +1971,22 @@ static int freshen_loose_object(const struct object_id *oid) return check_and_freshen(oid, 1); } +static int can_freshen_packs = -1; +static int get_can_freshen_packs(void) +{ + if (can_freshen_packs < 0) { + if (git_config_get_bool("core.freshenpackfiles", + &can_freshen_packs)) + can_freshen_packs = 1; + } + return can_freshen_packs; +} + static int freshen_packed_object(const struct object_id *oid) { struct pack_entry e; + if (!get_can_freshen_packs()) + return 0; if (!find_pack_entry(the_repository, oid, &e)) return 0; if (e.p->freshened) ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-14 19:30 ` Taylor Blau 2021-07-14 19:32 ` Ævar Arnfjörð Bjarmason @ 2021-07-14 21:40 ` Junio C Hamano 1 sibling, 0 replies; 34+ messages in thread From: Junio C Hamano @ 2021-07-14 21:40 UTC (permalink / raw) To: Taylor Blau Cc: Ævar Arnfjörð Bjarmason, Taylor Blau, Sun Chao, Sun Chao via GitGitGadget, git Taylor Blau <me@ttaylorr.com> writes: > But if that isn't possible, then I find introducing a new file to > redefine the pack's mtime just to accommodate a backup system that > doesn't know better to be a poor justification for adding this > complexity. Especially since we agree that rsync-ing live Git > repositories is a bad idea in the first place ;). > > If it were me, I would probably stop here and avoid pursuing this > further. ;-). ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-14 17:04 ` Taylor Blau 2021-07-14 18:19 ` Ævar Arnfjörð Bjarmason @ 2021-07-15 16:30 ` Sun Chao 2021-07-15 16:42 ` Taylor Blau 1 sibling, 1 reply; 34+ messages in thread From: Sun Chao @ 2021-07-15 16:30 UTC (permalink / raw) To: Taylor Blau Cc: Taylor Blau, Ævar Arnfjörð Bjarmason, Sun Chao via GitGitGadget, git > 2021年7月15日 01:04,Taylor Blau <ttaylorr@github.com> 写道: > > [...] >> >> However we find the mtime of ".pack" files changes over time which makes the >> file system always reload the big files, that takes a lot of IO time and result >> in lower speed of git upload-pack and even further the disk IOPS is exhausted. > > That's surprising behavior to me. Are you saying that calling utime(2) > causes the *page* cache to be invalidated and that most reads are > cache-misses lowering overall IOPS? > > If so, then I am quite surprised ;). The only state that should be > dirtied by calling utime(2) is the inode itself, so the blocks referred > to by the inode corresponding to a pack should be left in-tact. > > If you're on Linux, you can try observing the behavior of evicting > inodes, blocks, or both from the disk cache by changing "2" in the > following: > > hyperfine 'git pack-objects --all --stdout --delta-base-offset >/dev/null' > --prepare='sync; echo 2 | sudo tee /proc/sys/vm/drop_caches' > > where "1" drops the page cache, "2" drops the inodes, and "3" evicts > both. > > I wonder if you could share the results of running the above varying > the value of "1", "2", and "3", as well as swapping the `--prepare` for > `--warmup=3` to warm your caches (and give us an idea of what your > expected performance is probably like). > > Thanks, > Taylor I'm sorry to reply so late, I work long hours during the day, and the company network can not send external mail, so I can only go home late at night to reply to you. 
Thanks for your reply again. My explanation of 'why the mtime is so important' was missing some information and was not clear enough, so here are the details: Servers: - We maintain a number of servers, each mounting NFS disks that hold our git repositories; some of them are very large (we cannot reduce the size now), > 10GB. - There are many objects and large files in the git history, which results in some large '.pack' files in the '.git/objects/pack' directories. - We created '.keep' files for each large '.pack' file, hoping the disk cache can reduce the NFS IOPS and serve contents from caches. Clients: - Many CI systems keep downloading the git repositories at a very high frequency, e.g. we have seen different CI systems make 600 download requests in a short period of time via 'git fetch'. - Some developers are doing 'git push' at the same time and create Pull Requests after that (which trigger the CI), so the git servers will do some update tasks, which may cause the mtime of a '.pack' file to be freshened. So, in this case there will be many 'git-upload-pack' processes running on the git servers, and they all need to load the big '.pack' files. 'git-upload-pack' will be faster if the disk cache is warmed up, and the NFS server will not be so busy. However, we find the IOPS of the NFS server is always exhausted and 'git-upload-pack' runs for a very long time. We noticed the mtime of the '.pack' files changes over time, and one of my colleagues who is familiar with the file system told me it is the mtime that invalidates the disk caches. So we want the caches to stay valid for a long time, which can speed up the 'git-upload-pack' processes. I don't know if `/proc/sys/vm/drop_caches` can help or not, but thanks for your tips; I will try them and see if there are differences. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-15 16:30 ` Sun Chao @ 2021-07-15 16:42 ` Taylor Blau 2021-07-15 16:48 ` Sun Chao 0 siblings, 1 reply; 34+ messages in thread From: Taylor Blau @ 2021-07-15 16:42 UTC (permalink / raw) To: Sun Chao Cc: Taylor Blau, Ævar Arnfjörð Bjarmason, Sun Chao via GitGitGadget, git On Fri, Jul 16, 2021 at 12:30:18AM +0800, Sun Chao wrote: > I'm sorry to reply so late, I work long hours during the day, and the > company network can not send external mail, so I can only go home late > at night to reply to you. There's no need to apologize :-). > Thanks for your reply again, My explaination for 'why the mtime is so > important' lost some informations and it is not clear enough, I will > tell the details here: Let me see if I can summarize here. Basically: - You have a number of servers that have NFS mounts which hold large repositories with packs in excess of 10 GB in size. - You have a lot of clients that are fetching, and a smaller number of clients that are pushing, some of which happen to freshen the mtimes of the packs. ...and the mtimes being updated cause the disk cache to be invalidated? It's the last part that is so surprising to me. Ævar and I discussed earlier in the thread that their understanding was that you had a backup system which had to resynchronize an unchanged file because its metadata had changed. But this is different than that. If I understand what you're saying correctly, then you're saying that the disk caches themselves are invalidated by changing the mtime. That is highly surprising to me, since the block cache should only be invalidated if the *blocks* change, not metadata in the inode. It would be good to confirm that this is actually what's happening. Thanks, Taylor ^ permalink raw reply [flat|nested] 34+ messages in thread
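[Editor's sketch: one way to confirm this on Linux is to ask the kernel directly via mincore(2): warm a file into the page cache, bump its mtime with utime(2) the way git's freshen_file() does, and count resident pages before and after. A Python/ctypes illustration; the helper name and temp-file setup are invented for the demo, and it is Linux-specific:]

```python
import ctypes, ctypes.util, mmap, os, tempfile

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def resident_pages(path):
    """Count pages of `path` currently in the page cache (Linux mincore(2))."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDWR)
    try:
        m = mmap.mmap(fd, size)
        buf = (ctypes.c_ubyte * size).from_buffer(m)
        npages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
        vec = (ctypes.c_ubyte * npages)()
        rc = libc.mincore(ctypes.c_void_p(ctypes.addressof(buf)),
                          ctypes.c_size_t(size), vec)
        resident = sum(b & 1 for b in vec)
        del buf          # release the exported buffer so close() succeeds
        m.close()
        if rc:
            raise OSError(ctypes.get_errno(), "mincore failed")
        return resident
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (4 * mmap.PAGESIZE))
path = tmp.name

with open(path, "rb") as f:
    f.read()             # make sure all pages are in the page cache

before = resident_pages(path)
os.utime(path)           # bump mtime, as git's freshen_file() does
after = resident_pages(path)
print("resident pages before/after utime:", before, after)
os.unlink(path)
```

If Taylor's reading is right, the two counts should match: utime(2) dirties only the inode, not the cached data blocks.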
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-15 16:42 ` Taylor Blau @ 2021-07-15 16:48 ` Sun Chao 0 siblings, 0 replies; 34+ messages in thread From: Sun Chao @ 2021-07-15 16:48 UTC (permalink / raw) To: Taylor Blau Cc: Ævar Arnfjörð Bjarmason, Sun Chao via GitGitGadget, git > 2021年7月16日 00:42,Taylor Blau <me@ttaylorr.com> 写道: > > On Fri, Jul 16, 2021 at 12:30:18AM +0800, Sun Chao wrote: >> I'm sorry to reply so late, I work long hours during the day, and the >> company network can not send external mail, so I can only go home late >> at night to reply to you. > > There's no need to apologize :-). > >> Thanks for your reply again, My explaination for 'why the mtime is so >> important' lost some informations and it is not clear enough, I will >> tell the details here: > > Let me see if I can summarize here. Basically: > > - You have a number of servers that have NFS mounts which hold large > repositories with packs in excess of 10 GB in size. > - You have a lot of clients that are fetching, and a smaller number of > clients that are pushing, some of which happen to freshen the mtimes > of the packs. > > ...and the mtimes being updated cause the disk cache to be invalidated? > > It's the last part that is so surprising to me. Ævar and I discussed > earlier in the thread that their understanding was that you had a backup > system which had to resynchronize an unchanged file because its metadata > had changed. > > But this is different than that. If I understand what you're saying > correctly, then you're saying that the disk caches themselves are > invalidated by changing the mtime. > > That is highly surprising to me, since the block cache should only be > invalidated if the *blocks* change, not metadata in the inode. It would > be good to confirm that this is actually what's happening. 
> Thanks, > Taylor Oh, maybe I didn't understand the caching well enough; let me check it again. Thanks for your and Ævar's answers, they are really helpful. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v2] packfile: freshen the mtime of packfile by configuration 2021-07-14 1:39 ` Ævar Arnfjörð Bjarmason 2021-07-14 2:52 ` Taylor Blau @ 2021-07-14 16:11 ` Sun Chao 1 sibling, 0 replies; 34+ messages in thread From: Sun Chao @ 2021-07-14 16:11 UTC (permalink / raw) To: git; +Cc: Sun Chao via GitGitGadget > 2021年7月14日 09:39,Ævar Arnfjörð Bjarmason <avarab@gmail.com> 写道: > >> +The default is 'pack' which means the *.pack file will be freshened by >> +default. You can configure a different suffix to use, the file with the >> +suffix will be created automatically, it's better not using any known >> +suffix such like 'idx', 'keep', 'promisor'. >> + > > Hrm, per my v1 feedback (and I'm not sure if my suggestion is even good > here, there's others more familiar with this area than I am), I was > thinking of something like a *.bump file written via: > > core.packUseBumpFiles=bool > > Or something like that, anyway, the edge case in allowing the user to > pick arbitrary suffixes is that we'd get in-the-wild user arbitrary > configuration squatting on a relatively sensitive part of the object > store. > > E.g. we recently added *.rev files to go with > *.{pack,idx,bitmap,keep,promisor} (and I'm probably forgetting some > suffix). What if before that a user had set: > > core.packMtimeSuffix=rev > > In practice it's probably too obscure to worry about, but I think it's > still worth it to only strictly write things we decide to write into the > object store. Thanks, this makes sense, allowing user to pick arbitrary suffixes may cause some unpredictable problems, E.g. users who don’t know about the *.keep files but setting core.packMtimeSuffix=keep, so I think it’s right to restrict the suffix here. >> core.deltaBaseCacheLimit:: >> Maximum number of bytes per thread to reserve for caching base objects >> that may be referenced by multiple deltified objects. 
By storing the >> diff --git a/builtin/index-pack.c b/builtin/index-pack.c >> index 3fbc5d70777..60bacc8ee7f 100644 >> --- a/builtin/index-pack.c >> +++ b/builtin/index-pack.c >> @@ -1437,19 +1437,6 @@ static void fix_unresolved_deltas(struct hashfile *f) >> free(sorted_by_pos); >> } >> >> -static const char *derive_filename(const char *pack_name, const char *strip, >> - const char *suffix, struct strbuf *buf) >> -{ >> - size_t len; >> - if (!strip_suffix(pack_name, strip, &len) || !len || >> - pack_name[len - 1] != '.') >> - die(_("packfile name '%s' does not end with '.%s'"), >> - pack_name, strip); >> - strbuf_add(buf, pack_name, len); >> - strbuf_addstr(buf, suffix); >> - return buf->buf; >> -} > > > Would be more readable to split this series up into at least two > patches, starting with just moving this function as-is to a the new > location (just renaming it & moving it), and then using it. I don't > think there's changes to it, but right now I'm just eyeballing the > diff. It's more obvious if it's split up. Thanks, I will do it. > >> + if (!strcmp(var, "core.packmtimesuffix")) { >> + return git_config_string(&pack_mtime_suffix, var, value); > > Can drop the {} braces here, per Documentation/CodingGuidelines Thanks, I will read the CodingGuidelines again and fix the issue. > >> +const char *pack_mtime_suffix = "pack"; > > I can see how having a configurable suffix made the implementation > easier, perhaps that's how it started? > yes, here is the default value. 
> >> int fsync_object_files; >> size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE; >> size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT; >> diff --git a/object-file.c b/object-file.c >> index f233b440b22..b3e77213c42 100644 >> --- a/object-file.c >> +++ b/object-file.c >> @@ -1974,12 +1974,41 @@ static int freshen_loose_object(const struct object_id *oid) >> static int freshen_packed_object(const struct object_id *oid) >> { >> struct pack_entry e; >> + struct stat st; >> + struct strbuf name_buf = STRBUF_INIT; >> + const char *filename; >> + >> if (!find_pack_entry(the_repository, oid, &e)) >> return 0; >> if (e.p->freshened) >> return 1; >> - if (!freshen_file(e.p->pack_name)) >> - return 0; >> + >> + filename = e.p->pack_name; >> + if (!strcasecmp(pack_mtime_suffix, "pack")) { >> + if (!freshen_file(filename)) >> + return 0; >> + e.p->freshened = 1; >> + return 1; >> + } >> + >> + /* If we want to freshen different file instead of .pack file, we need >> + * to make sure the file exists and create it if needed. >> + */ >> + filename = derive_pack_filename(filename, "pack", pack_mtime_suffix, &name_buf); > > You populate name_buf here, but don't strbuf_release(&name_buf) it at the end of this function. Will fix it. > >> + if (lstat(filename, &st) < 0) { >> + int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664); >> + if (fd < 0) { >> + // here we need to check it again because other git process may created it > > /* */ comments, not //, if it's needed at all. Covered in CodingGuidelines Will fix it. > >> + if (lstat(filename, &st) < 0) >> + die_errno("unable to create '%s'", filename); > > If we can't create this specific file here shouldn't we just continue > silently at this point? Surely if this process is screwed we're just > about to die on something more important? Yes, because here we have the *.pack file exists, we can step back to freshen the *.pack file if the specific file cannot be created. 
And when we want to get the mtime, we can also read the mtime from the *.pack file if the specific file does not exist. > > And lstat() can also return transitory errors that don't indicate > "unable to create", e.g. maybe we ran out of memory the kernel is > willing to give us or something (just skimming the lstat manpage). Thanks, I will think about it. > >> + } else { >> + close(fd); >> + } >> + } else { >> + if (!freshen_file(filename)) >> + return 0; > > Style/indentation: just do "} else if (!freshen..." ? Will fix it. > >> + } >> + >> e.p->freshened = 1; >> return 1; >> } >> diff --git a/packfile.c b/packfile.c >> index 755aa7aec5e..a607dda4e25 100644 >> --- a/packfile.c >> +++ b/packfile.c >> @@ -40,6 +40,19 @@ char *sha1_pack_index_name(const unsigned char *sha1) >> return odb_pack_name(&buf, sha1, "idx"); >> } >> >> +const char *derive_pack_filename(const char *pack_name, const char *strip, >> + const char *suffix, struct strbuf *buf) >> +{ >> + size_t len; >> + if (!strip_suffix(pack_name, strip, &len) || !len || >> + pack_name[len - 1] != '.') >> + die(_("packfile name '%s' does not end with '.%s'"), >> + pack_name, strip); >> + strbuf_add(buf, pack_name, len); >> + strbuf_addstr(buf, suffix); >> + return buf->buf; >> +} > > Just have this return void? I renamed the original 'derive_filename' to the new name 'derive_pack_filename'. When it is used like this: filename = derive_filename(pack_name, "pack", suffix, &name_buf); returning the buffer is a more convenient way to use the strbuf object, so I decided to keep it the old way. > >> static unsigned int pack_used_ctr; >> static unsigned int pack_mmap_calls; >> static unsigned int peak_pack_open_windows; >> @@ -727,6 +740,17 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local) >> */ >> p->pack_size = st.st_size; >> p->pack_local = local; >> + >> + /* If we have different file used to freshen the mtime, we should >> + use it at a higher priority.
>> + */ >> + if (!!strcasecmp(pack_mtime_suffix, "pack")) { >> + struct strbuf name_buf = STRBUF_INIT; >> + const char *filename; >> + >> + filename = derive_pack_filename(path, "idx", pack_mtime_suffix, &name_buf); >> + stat(filename, &st); > > I.e. the "filename" here isn't needed, just call derive_pack_filename() > and use name.buf.buf to stat. Let me think about it; it's a good idea. > > Also: We should check the stat return value here & report errno if > needed, no? Here I think we should step back to use the mtime of the *.pack file instead, so if the specific file does not exist, there are a few possible reasons: (a) the *.pack was just created and its mtime has not been freshened again, or (b) we cannot create the specific file for some reason, in which case we will freshen the *.pack file instead, or (c) someone or some process just deleted it and we should try to go further. I don't know if it's right to do this, but I think we should avoid dying here. > >> +test_expect_success 'do not bother loosening old objects with core.packmtimesuffix config' ' >> + obj1=$(echo three | git hash-object -w --stdin) && >> + obj2=$(echo four | git hash-object -w --stdin) && >> + pack1=$(echo $obj1 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) && >> + pack2=$(echo $obj2 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) && >> + git -c core.packmtimesuffix=bump prune-packed && >> + git cat-file -p $obj1 && >> + git cat-file -p $obj2 && >> + touch .git/objects/pack/pack-$pack2.bump && >> + test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.bump && >> + git -c core.packmtimesuffix=bump repack -A -d --unpack-unreachable=1.hour.ago && > > On command-lines we can spell it camel-cased, e.g. -c > core.packMtimeSuffix[...]. > Thanks, will do it. ^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v3] packfile: freshen the mtime of packfile by configuration 2021-07-14 1:28 ` [PATCH v2] packfile: freshen the mtime of packfile by configuration Sun Chao via GitGitGadget 2021-07-14 1:39 ` Ævar Arnfjörð Bjarmason @ 2021-07-19 19:53 ` Sun Chao via GitGitGadget 2021-07-19 20:51 ` Taylor Blau 2021-08-15 17:08 ` [PATCH v4 0/2] " Sun Chao via GitGitGadget 1 sibling, 2 replies; 34+ messages in thread From: Sun Chao via GitGitGadget @ 2021-07-19 19:53 UTC (permalink / raw) To: git Cc: Ævar Arnfjörð Bjarmason, Taylor Blau, Taylor Blau, Martin Fick, Son Luong Ngoc, Sun Chao, Sun Chao From: Sun Chao <16657101987@163.com> Commit 33d4221c79 (write_sha1_file: freshen existing objects, 2014-10-15) avoid writing existing objects by freshen their mtime (especially the packfiles contains them) in order to aid the correct caching, and some process like find_lru_pack can make good decision. However, this is unfriendly to incremental backup jobs or services rely on cached file system when there are large '.pack' files exists. For example, after packed all objects, use 'write-tree' to create same commit with the same tree and same environments such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can notice the '.pack' file's mtime changed. Git servers that mount the same NFS disk will re-sync the '.pack' files to cached file system which will slow the git commands. So if add core.freshenPackfiles to indicate whether or not packs can be freshened, turning off this option on some servers can speed up the execution of some commands on servers which use NFS disk instead of local disk. 
Signed-off-by: Sun Chao <16657101987@163.com> --- packfile: freshen the mtime of packfile by configuration packfile: freshen the mtime of packfile by configuration Commit 33d4221c79 (write_sha1_file: freshen existing objects, 2014-10-15) avoid writing existing objects by freshen their mtime (especially the packfiles contains them) in order to aid the correct caching, and some process like find_lru_pack can make good decision. However, this is unfriendly to incremental backup jobs or services rely on cached file system when there are large '.pack' files exists. For example, after packed all objects, use 'write-tree' to create same commit with the same tree and same environments such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can notice the '.pack' file's mtime changed. Git servers that mount the same NFS disk will re-sync the '.pack' files to cached file system which will slow the git commands. Here we can find the description of the cached file system for NFS Client from https://www.ibm.com/docs/en/aix/7.2?topic=performance-cache-file-system: 3. To ensure that the cached directories and files are kept up to date, CacheFS periodically checks the consistency of files stored in the cache. It does this by comparing the current modification time to the previous modification time. 4. If the modification times are different, all data and attributes for the directory or file are purged from the cache, and new data and attributes are retrieved from the back file system. So if add core.freshenPackfiles to indicate whether or not packs can be freshened, turning off this option on some servers can speed up the execution of some commands on servers which use NFS disk instead of local disk. 
Signed-off-by: Sun Chao 16657101987@163.com Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1043%2Fsunchao9%2Fmaster-v3 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1043/sunchao9/master-v3 Pull-Request: https://github.com/git/git/pull/1043 Range-diff vs v2: 1: 943e31e8587 ! 1: 16c68923bea packfile: freshen the mtime of packfile by configuration @@ Commit message mtime (especially the packfiles contains them) in order to aid the correct caching, and some process like find_lru_pack can make good decision. However, this is unfriendly to - incremental backup jobs or services rely on file system - cache when there are large '.pack' files exists. + incremental backup jobs or services rely on cached file system + when there are large '.pack' files exists. For example, after packed all objects, use 'write-tree' to create same commit with the same tree and same environments such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can - notice the '.pack' file's mtime changed, and '.idx' file not. + notice the '.pack' file's mtime changed. Git servers + that mount the same NFS disk will re-sync the '.pack' files + to cached file system which will slow the git commands. - If we freshen the mtime of packfile by updating another - file instead of '.pack' file e.g. a empty '.bump' file, - when we need to check the mtime of packfile, get it from - another file instead. Large git repository may contains - large '.pack' files, and we can use smaller files even empty - file to do the mtime get/set operation, this can avoid - file system cache re-sync large '.pack' files again and - then speed up most git commands. + So if add core.freshenPackfiles to indicate whether or not + packs can be freshened, turning off this option on some + servers can speed up the execution of some commands on servers + which use NFS disk instead of local disk. Signed-off-by: Sun Chao <16657101987@163.com> @@ Documentation/config/core.txt: the largest projects. 
You probably do not need t + Common unit suffixes of 'k', 'm', or 'g' are supported. -+core.packMtimeSuffix:: ++core.freshenPackFiles:: + Normally we avoid writing existing object by freshening the mtime + of the *.pack file which contains it in order to aid some processes -+ such like prune. Use different file instead of *.pack file will -+ avoid file system cache re-sync the large packfiles, and consequently -+ make git commands faster. ++ such like prune. Turning off this option on some servers can speed ++ up the execution of some commands like 'git-upload-pack' (e.g. some ++ servers that mount the same NFS disk will re-sync the *.pack files ++ to cached file system if the mtime changes). ++ -+The default is 'pack' which means the *.pack file will be freshened by -+default. You can configure a different suffix to use, the file with the -+suffix will be created automatically, it's better not using any known -+suffix such like 'idx', 'keep', 'promisor'. ++The default is true which means the *.pack file will be freshened if we ++want to write an existing object within it. + core.deltaBaseCacheLimit:: Maximum number of bytes per thread to reserve for caching base objects that may be referenced by multiple deltified objects.
By storing the - ## builtin/index-pack.c ## -@@ builtin/index-pack.c: static void fix_unresolved_deltas(struct hashfile *f) - free(sorted_by_pos); - } - --static const char *derive_filename(const char *pack_name, const char *strip, -- const char *suffix, struct strbuf *buf) --{ -- size_t len; -- if (!strip_suffix(pack_name, strip, &len) || !len || -- pack_name[len - 1] != '.') -- die(_("packfile name '%s' does not end with '.%s'"), -- pack_name, strip); -- strbuf_add(buf, pack_name, len); -- strbuf_addstr(buf, suffix); -- return buf->buf; --} -- - static void write_special_file(const char *suffix, const char *msg, - const char *pack_name, const unsigned char *hash, - const char **report) -@@ builtin/index-pack.c: static void write_special_file(const char *suffix, const char *msg, - int msg_len = strlen(msg); - - if (pack_name) -- filename = derive_filename(pack_name, "pack", suffix, &name_buf); -+ filename = derive_pack_filename(pack_name, "pack", suffix, &name_buf); - else - filename = odb_pack_name(&name_buf, hash, suffix); - -@@ builtin/index-pack.c: int cmd_index_pack(int argc, const char **argv, const char *prefix) - if (from_stdin && hash_algo) - die(_("--object-format cannot be used with --stdin")); - if (!index_name && pack_name) -- index_name = derive_filename(pack_name, "pack", "idx", &index_name_buf); -+ index_name = derive_pack_filename(pack_name, "pack", "idx", &index_name_buf); - - opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY); - if (rev_index) { - opts.flags |= verify ? 
WRITE_REV_VERIFY : WRITE_REV; - if (index_name) -- rev_index_name = derive_filename(index_name, -+ rev_index_name = derive_pack_filename(index_name, - "idx", "rev", - &rev_index_name_buf); - } - ## cache.h ## @@ cache.h: extern size_t packed_git_limit; extern size_t delta_base_cache_limit; extern unsigned long big_file_threshold; extern unsigned long pack_size_limit_cfg; -+extern const char *pack_mtime_suffix; ++extern int core_freshen_packfiles; /* * Accessors for the core.sharedrepository config which lazy-load the value @@ config.c: static int git_default_core_config(const char *var, const char *value, return 0; } -+ if (!strcmp(var, "core.packmtimesuffix")) { -+ return git_config_string(&pack_mtime_suffix, var, value); ++ if (!strcmp(var, "core.freshenpackfiles")) { ++ core_freshen_packfiles = git_config_bool(var, value); + } + if (!strcmp(var, "core.deltabasecachelimit")) { @@ config.c: static int git_default_core_config(const char *var, const char *value, return 0; ## environment.c ## -@@ environment.c: const char *git_hooks_path; - int zlib_compression_level = Z_BEST_SPEED; - int core_compression_level; - int pack_compression_level = Z_DEFAULT_COMPRESSION; -+const char *pack_mtime_suffix = "pack"; - int fsync_object_files; - size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE; - size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT; +@@ environment.c: int core_sparse_checkout_cone; + int merge_log_config = -1; + int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */ + unsigned long pack_size_limit_cfg; ++int core_freshen_packfiles = 1; + enum log_refs_config log_all_ref_updates = LOG_REFS_UNSET; + + #ifndef PROTECT_HFS_DEFAULT ## object-file.c ## @@ object-file.c: static int freshen_loose_object(const struct object_id *oid) static int freshen_packed_object(const struct object_id *oid) { struct pack_entry e; -+ struct stat st; -+ struct strbuf name_buf = STRBUF_INIT; -+ const char *filename; ++ ++ if (!core_freshen_packfiles) ++ 
return 1; + if (!find_pack_entry(the_repository, oid, &e)) return 0; if (e.p->freshened) - return 1; -- if (!freshen_file(e.p->pack_name)) -- return 0; -+ -+ filename = e.p->pack_name; -+ if (!strcasecmp(pack_mtime_suffix, "pack")) { -+ if (!freshen_file(filename)) -+ return 0; -+ e.p->freshened = 1; -+ return 1; -+ } -+ -+ /* If we want to freshen different file instead of .pack file, we need -+ * to make sure the file exists and create it if needed. -+ */ -+ filename = derive_pack_filename(filename, "pack", pack_mtime_suffix, &name_buf); -+ if (lstat(filename, &st) < 0) { -+ int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664); -+ if (fd < 0) { -+ // here we need to check it again because other git process may created it -+ if (lstat(filename, &st) < 0) -+ die_errno("unable to create '%s'", filename); -+ } else { -+ close(fd); -+ } -+ } else { -+ if (!freshen_file(filename)) -+ return 0; -+ } -+ - e.p->freshened = 1; - return 1; - } - - ## packfile.c ## -@@ packfile.c: char *sha1_pack_index_name(const unsigned char *sha1) - return odb_pack_name(&buf, sha1, "idx"); - } - -+const char *derive_pack_filename(const char *pack_name, const char *strip, -+ const char *suffix, struct strbuf *buf) -+{ -+ size_t len; -+ if (!strip_suffix(pack_name, strip, &len) || !len || -+ pack_name[len - 1] != '.') -+ die(_("packfile name '%s' does not end with '.%s'"), -+ pack_name, strip); -+ strbuf_add(buf, pack_name, len); -+ strbuf_addstr(buf, suffix); -+ return buf->buf; -+} -+ - static unsigned int pack_used_ctr; - static unsigned int pack_mmap_calls; - static unsigned int peak_pack_open_windows; -@@ packfile.c: struct packed_git *add_packed_git(const char *path, size_t path_len, int local) - */ - p->pack_size = st.st_size; - p->pack_local = local; -+ -+ /* If we have different file used to freshen the mtime, we should -+ * use it at a higher priority. 
-+ */ -+ if (!!strcasecmp(pack_mtime_suffix, "pack")) { -+ struct strbuf name_buf = STRBUF_INIT; -+ const char *filename; -+ -+ filename = derive_pack_filename(path, "idx", pack_mtime_suffix, &name_buf); -+ stat(filename, &st); -+ } - p->mtime = st.st_mtime; - if (path_len < the_hash_algo->hexsz || - get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash)) - - ## packfile.h ## -@@ packfile.h: char *sha1_pack_name(const unsigned char *sha1); - */ - char *sha1_pack_index_name(const unsigned char *sha1); - -+/* -+ * Return the corresponding filename with given suffix from "file_name" -+ * which must has "strip" suffix. -+ */ -+const char *derive_pack_filename(const char *file_name, const char *strip, -+ const char *suffix, struct strbuf *buf); -+ - /* - * Return the basename of the packfile, omitting any containing directory - * (e.g., "pack-1234abcd[...].pack"). ## t/t7701-repack-unpack-unreachable.sh ## @@ t/t7701-repack-unpack-unreachable.sh: test_expect_success 'do not bother loosening old objects' ' test_must_fail git cat-file -p $obj2 ' -+test_expect_success 'do not bother loosening old objects with core.packmtimesuffix config' ' ++test_expect_success 'do not bother loosening old objects without freshen pack time' ' + obj1=$(echo three | git hash-object -w --stdin) && + obj2=$(echo four | git hash-object -w --stdin) && -+ pack1=$(echo $obj1 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) && -+ pack2=$(echo $obj2 | git -c core.packmtimesuffix=bump pack-objects .git/objects/pack/pack) && -+ git -c core.packmtimesuffix=bump prune-packed && ++ pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) && ++ pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) && ++ git -c core.freshenPackFiles=false prune-packed && + git cat-file -p $obj1 && + git cat-file -p $obj2 && -+ touch .git/objects/pack/pack-$pack2.bump && -+ test-tool chmtime =-86400 
.git/objects/pack/pack-$pack2.bump && -+ git -c core.packmtimesuffix=bump repack -A -d --unpack-unreachable=1.hour.ago && ++ test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack && ++ git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago && + git cat-file -p $obj1 && + test_must_fail git cat-file -p $obj2 +' Documentation/config/core.txt | 11 +++++++++++ cache.h | 1 + config.c | 4 ++++ environment.c | 1 + object-file.c | 4 ++++ t/t7701-repack-unpack-unreachable.sh | 14 ++++++++++++++ 6 files changed, 35 insertions(+) diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt index c04f62a54a1..1e7cf366628 100644 --- a/Documentation/config/core.txt +++ b/Documentation/config/core.txt @@ -398,6 +398,17 @@ the largest projects. You probably do not need to adjust this value. + Common unit suffixes of 'k', 'm', or 'g' are supported. +core.freshenPackFiles:: + Normally we avoid writing existing object by freshening the mtime + of the *.pack file which contains it in order to aid some processes + such like prune. Turning off this option on some servers can speed + up the execution of some commands like 'git-upload-pack'(e.g. some + servers that mount the same NFS disk will re-sync the *.pack files + to cached file system if the mtime cahnges). ++ +The default is true which means the *.pack file will be freshened if we +want to write a existing object whthin it. + core.deltaBaseCacheLimit:: Maximum number of bytes per thread to reserve for caching base objects that may be referenced by multiple deltified objects. 
By storing the diff --git a/cache.h b/cache.h index ba04ff8bd36..46126c6977c 100644 --- a/cache.h +++ b/cache.h @@ -956,6 +956,7 @@ extern size_t packed_git_limit; extern size_t delta_base_cache_limit; extern unsigned long big_file_threshold; extern unsigned long pack_size_limit_cfg; +extern int core_freshen_packfiles; /* * Accessors for the core.sharedrepository config which lazy-load the value diff --git a/config.c b/config.c index f9c400ad306..02dcc8a028e 100644 --- a/config.c +++ b/config.c @@ -1431,6 +1431,10 @@ static int git_default_core_config(const char *var, const char *value, void *cb) return 0; } + if (!strcmp(var, "core.freshenpackfiles")) { + core_freshen_packfiles = git_config_bool(var, value); + } + if (!strcmp(var, "core.deltabasecachelimit")) { delta_base_cache_limit = git_config_ulong(var, value); return 0; diff --git a/environment.c b/environment.c index 2f27008424a..397525609a8 100644 --- a/environment.c +++ b/environment.c @@ -73,6 +73,7 @@ int core_sparse_checkout_cone; int merge_log_config = -1; int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */ unsigned long pack_size_limit_cfg; +int core_freshen_packfiles = 1; enum log_refs_config log_all_ref_updates = LOG_REFS_UNSET; #ifndef PROTECT_HFS_DEFAULT diff --git a/object-file.c b/object-file.c index f233b440b22..884c3e92c38 100644 --- a/object-file.c +++ b/object-file.c @@ -1974,6 +1974,10 @@ static int freshen_loose_object(const struct object_id *oid) static int freshen_packed_object(const struct object_id *oid) { struct pack_entry e; + + if (!core_freshen_packfiles) + return 1; + if (!find_pack_entry(the_repository, oid, &e)) return 0; if (e.p->freshened) diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh index 937f89ee8c8..b6a0b6c9695 100755 --- a/t/t7701-repack-unpack-unreachable.sh +++ b/t/t7701-repack-unpack-unreachable.sh @@ -112,6 +112,20 @@ test_expect_success 'do not bother loosening old objects' ' test_must_fail git cat-file 
-p $obj2 ' +test_expect_success 'do not bother loosening old objects without freshen pack time' ' + obj1=$(echo three | git hash-object -w --stdin) && + obj2=$(echo four | git hash-object -w --stdin) && + pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) && + pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) && + git -c core.freshenPackFiles=false prune-packed && + git cat-file -p $obj1 && + git cat-file -p $obj2 && + test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack && + git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago && + git cat-file -p $obj1 && + test_must_fail git cat-file -p $obj2 +' + test_expect_success 'keep packed objects found only in index' ' echo my-unique-content >file && git add file && base-commit: 75ae10bc75336db031ee58d13c5037b929235912 -- gitgitgadget ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v3] packfile: freshen the mtime of packfile by configuration 2021-07-19 19:53 ` [PATCH v3] " Sun Chao via GitGitGadget @ 2021-07-19 20:51 ` Taylor Blau 2021-07-20 0:07 ` Junio C Hamano ` (2 more replies) 2021-08-15 17:08 ` [PATCH v4 0/2] " Sun Chao via GitGitGadget 1 sibling, 3 replies; 34+ messages in thread From: Taylor Blau @ 2021-07-19 20:51 UTC (permalink / raw) To: Sun Chao via GitGitGadget Cc: git, Ævar Arnfjörð Bjarmason, Taylor Blau, Martin Fick, Son Luong Ngoc, Sun Chao, Jeff King On Mon, Jul 19, 2021 at 07:53:19PM +0000, Sun Chao via GitGitGadget wrote: > From: Sun Chao <16657101987@163.com> > > Commit 33d4221c79 (write_sha1_file: freshen existing objects, > 2014-10-15) avoid writing existing objects by freshen their > mtime (especially the packfiles contains them) in order to > aid the correct caching, and some process like find_lru_pack > can make good decision. However, this is unfriendly to > incremental backup jobs or services rely on cached file system > when there are large '.pack' files exists. > > For example, after packed all objects, use 'write-tree' to > create same commit with the same tree and same environments > such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can > notice the '.pack' file's mtime changed. Git servers > that mount the same NFS disk will re-sync the '.pack' files > to cached file system which will slow the git commands. > > So if add core.freshenPackfiles to indicate whether or not > packs can be freshened, turning off this option on some > servers can speed up the execution of some commands on servers > which use NFS disk instead of local disk. Hmm. I'm still quite unconvinced that we should be taking this direction without better motivation. We talked about your assumption that NFS seems to be invalidating the block cache when updating the inodes that point at those blocks, but I don't recall seeing further evidence. 
Regardless, a couple of idle thoughts: > + if (!core_freshen_packfiles) > + return 1; It is important to still freshen the object mtimes even when we cannot update the pack mtimes. That's why we return 0 when "freshen_file" returned 0: even if there was an error calling utime, we should still freshen the object. This is important because it impacts when unreachable objects are pruned. So I would have assumed that if a user set "core.freshenPackfiles=false" that they would still want their object mtimes updated, in which case the only option we have is to write those objects out loose. ...and that happens by the caller of freshen_packed_object (either write_object_file() or hash_object_file_literally()) then calling write_loose_object() if freshen_packed_object() failed. So I would have expected to see a "return 0" in the case that packfile freshening was disabled. But that leads us to an interesting problem: how many redundant objects do we expect to see on the server? It may be a lot, in which case you may end up having the same IO problems for a different reason. Peff mentioned to me off-list that he suspected write-tree was overeager in how many trees it would try to write out. I'm not sure. 
> +test_expect_success 'do not bother loosening old objects without freshen pack time' ' > + obj1=$(echo three | git hash-object -w --stdin) && > + obj2=$(echo four | git hash-object -w --stdin) && > + pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) && > + pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) && > + git -c core.freshenPackFiles=false prune-packed && > + git cat-file -p $obj1 && > + git cat-file -p $obj2 && > + test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack && > + git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago && > + git cat-file -p $obj1 && > + test_must_fail git cat-file -p $obj2 > +' I had a little bit of a hard time following this test. AFAICT, it proceeds as follows: - Write two packs, each containing a unique unreachable blob. - Call 'git prune-packed' with packfile freshening disabled, then check that the object survived. - Then repack while in a state where one of the pack's contents would be pruned. - Make sure that one object survives and the other doesn't. This doesn't really seem to be testing the behavior of disabling packfile freshening so much as it's testing prune-packed, and repack's `--unpack-unreachable` option. I would probably have expected to see something more along the lines of: - Write an unreachable object, pack it, and then remove the loose copy you wrote in the first place. - Then roll the pack's mtime to some fixed value in the past. - Try to write the same object again with packfile freshening disabled, and verify that: - the pack's mtime was unchanged, - the object exists loose again But I'd really like to get some other opinions (especially from Peff, who brought up the potential concerns with write-tree) as to whether or not this is a direction worth pursuing. My opinion is that it is not, and that the bizarre caching behavior you are seeing is out of Git's control. 
Thanks, Taylor ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v3] packfile: freshen the mtime of packfile by configuration 2021-07-19 20:51 ` Taylor Blau @ 2021-07-20 0:07 ` Junio C Hamano 2021-07-20 15:07 ` Sun Chao 2021-07-20 6:19 ` Ævar Arnfjörð Bjarmason 2021-07-20 15:00 ` Sun Chao 2 siblings, 1 reply; 34+ messages in thread From: Junio C Hamano @ 2021-07-20 0:07 UTC (permalink / raw) To: Taylor Blau Cc: Sun Chao via GitGitGadget, git, Ævar Arnfjörð Bjarmason, Martin Fick, Son Luong Ngoc, Sun Chao, Jeff King Taylor Blau <me@ttaylorr.com> writes: > Hmm. I'm still quite unconvinced that we should be taking this direction > without better motivation. We talked about your assumption that NFS > seems to be invalidating the block cache when updating the inodes that > point at those blocks, but I don't recall seeing further evidence. Me neither. Not touching the pack and not updating the "most recently used" time of individual objects smells like a recipe for repository corruption. > My opinion is that it is not, and that the bizarre caching behavior you > are seeing is out of Git's control. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v3] packfile: freshen the mtime of packfile by configuration 2021-07-20 0:07 ` Junio C Hamano @ 2021-07-20 15:07 ` Sun Chao 0 siblings, 0 replies; 34+ messages in thread From: Sun Chao @ 2021-07-20 15:07 UTC (permalink / raw) To: Junio C Hamano Cc: Taylor Blau, Sun Chao via GitGitGadget, git, Ævar Arnfjörð Bjarmason, Martin Fick, Son Luong Ngoc, Jeff King > On 2021-07-20 at 08:07, Junio C Hamano <gitster@pobox.com> wrote: > > Taylor Blau <me@ttaylorr.com> writes: > >> Hmm. I'm still quite unconvinced that we should be taking this direction >> without better motivation. We talked about your assumption that NFS >> seems to be invalidating the block cache when updating the inodes that >> point at those blocks, but I don't recall seeing further evidence. > > Me neither. Not touching the pack and not updating the "most > recently used" time of individual objects smells like a recipe > for repository corruption. > >> My opinion is that it is not, and that the bizarre caching behavior you > Thanks Junio, I will try to get a more detailed report of the NFS caches and share it if it is valuable. Not touching the mtime of packfiles really has potential problems, just as Taylor said. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v3] packfile: freshen the mtime of packfile by configuration 2021-07-19 20:51 ` Taylor Blau 2021-07-20 0:07 ` Junio C Hamano @ 2021-07-20 6:19 ` Ævar Arnfjörð Bjarmason 2021-07-20 15:34 ` Sun Chao 2021-07-20 15:00 ` Sun Chao 2 siblings, 1 reply; 34+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-07-20 6:19 UTC (permalink / raw) To: Taylor Blau Cc: Sun Chao via GitGitGadget, git, Martin Fick, Son Luong Ngoc, Sun Chao, Jeff King On Mon, Jul 19 2021, Taylor Blau wrote: > On Mon, Jul 19, 2021 at 07:53:19PM +0000, Sun Chao via GitGitGadget wrote: >> From: Sun Chao <16657101987@163.com> >> >> Commit 33d4221c79 (write_sha1_file: freshen existing objects, >> 2014-10-15) avoid writing existing objects by freshen their >> mtime (especially the packfiles contains them) in order to >> aid the correct caching, and some process like find_lru_pack >> can make good decision. However, this is unfriendly to >> incremental backup jobs or services rely on cached file system >> when there are large '.pack' files exists. >> >> For example, after packed all objects, use 'write-tree' to >> create same commit with the same tree and same environments >> such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can >> notice the '.pack' file's mtime changed. Git servers >> that mount the same NFS disk will re-sync the '.pack' files >> to cached file system which will slow the git commands. >> >> So if add core.freshenPackfiles to indicate whether or not >> packs can be freshened, turning off this option on some >> servers can speed up the execution of some commands on servers >> which use NFS disk instead of local disk. > > Hmm. I'm still quite unconvinced that we should be taking this direction > without better motivation. We talked about your assumption that NFS > seems to be invalidating the block cache when updating the inodes that > point at those blocks, but I don't recall seeing further evidence. 
I don't know about Sun's setup, but what he's describing is consistent with how NFS works, or can commonly be made to work. See e.g. "lookupcache" in nfs(5) on Linux, but also a lot of people use some random vendor's proprietary NFS implementation, and commonly tweak various options that make it anywhere between "I guess that's not too crazy" and "are you kidding me?" levels of non-POSIX compliant. > Regardless, a couple of idle thoughts: > >> + if (!core_freshen_packfiles) >> + return 1; > > It is important to still freshen the object mtimes even when we cannot > update the pack mtimes. That's why we return 0 when "freshen_file" > returned 0: even if there was an error calling utime, we should still > freshen the object. This is important because it impacts when > unreachable objects are pruned. > > So I would have assumed that if a user set "core.freshenPackfiles=false" > that they would still want their object mtimes updated, in which case > the only option we have is to write those objects out loose. > > ...and that happens by the caller of freshen_packed_object (either > write_object_file() or hash_object_file_literally()) then calling > write_loose_object() if freshen_packed_object() failed. So I would have > expected to see a "return 0" in the case that packfile freshening was > disabled. > > But that leads us to an interesting problem: how many redundant objects > do we expect to see on the server? It may be a lot, in which case you > may end up having the same IO problems for a different reason. Peff > mentioned to me off-list that he suspected write-tree was overeager in > how many trees it would try to write out. I'm not sure. In my experience with NFS the thing that kills you is anything that needs to do iterations, i.e. recursive readdir() and the like, or to read a lot of data; throughput was excellent. It's why I hacked up that core.checkCollisions patch. Jeff improved the situation I was mainly trying to fix with the loose objects cache.
I never got around to benchmarking the two in production, and now that setup is at an ex-job... >> +test_expect_success 'do not bother loosening old objects without freshen pack time' ' >> + obj1=$(echo three | git hash-object -w --stdin) && >> + obj2=$(echo four | git hash-object -w --stdin) && >> + pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) && >> + pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) && >> + git -c core.freshenPackFiles=false prune-packed && >> + git cat-file -p $obj1 && >> + git cat-file -p $obj2 && >> + test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack && >> + git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago && >> + git cat-file -p $obj1 && >> + test_must_fail git cat-file -p $obj2 >> +' > > I had a little bit of a hard time following this test. AFAICT, it > proceeds as follows: > > - Write two packs, each containing a unique unreachable blob. > - Call 'git prune-packed' with packfile freshening disabled, then > check that the object survived. > - Then repack while in a state where one of the pack's contents would > be pruned. > - Make sure that one object survives and the other doesn't. > > This doesn't really seem to be testing the behavior of disabling > packfile freshening so much as it's testing prune-packed, and repack's > `--unpack-unreachable` option. I would probably have expected to see > something more along the lines of: > > - Write an unreachable object, pack it, and then remove the loose copy > you wrote in the first place. > - Then roll the pack's mtime to some fixed value in the past. 
> - Try to write the same object again with packfile freshening > disabled, and verify that: > - the pack's mtime was unchanged, > - the object exists loose again > > But I'd really like to get some other opinions (especially from Peff, > who brought up the potential concerns with write-tree) as to whether or > not this is a direction worth pursuing. > > My opinion is that it is not, and that the bizarre caching behavior you > are seeing is out of Git's control. Thanks for this, I found the test hard to follow too, but didn't have time to really think about it; this makes sense. Back to the topic: I share your sentiment of trying to avoid complexity in this area. Sun: Have you considered passing --keep-unreachable to "git repack"? It's orthogonal to what you're trying here, but I wonder if being more aggressive about keeping objects + some improvements to perhaps skip this "touch" altogether if we have that in effect wouldn't be more viable & something e.g. Taylor would be more comfortable having as part of git.git. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v3] packfile: freshen the mtime of packfile by configuration 2021-07-20 6:19 ` Ævar Arnfjörð Bjarmason @ 2021-07-20 15:34 ` Sun Chao 0 siblings, 0 replies; 34+ messages in thread From: Sun Chao @ 2021-07-20 15:34 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Taylor Blau, Sun Chao via GitGitGadget, git, Martin Fick, Son Luong Ngoc, Jeff King > 2021年7月20日 14:19,Ævar Arnfjörð Bjarmason <avarab@gmail.com> 写道: > > > On Mon, Jul 19 2021, Taylor Blau wrote: > >> On Mon, Jul 19, 2021 at 07:53:19PM +0000, Sun Chao via GitGitGadget wrote: >>> From: Sun Chao <16657101987@163.com> >>> >>> Commit 33d4221c79 (write_sha1_file: freshen existing objects, >>> 2014-10-15) avoid writing existing objects by freshen their >>> mtime (especially the packfiles contains them) in order to >>> aid the correct caching, and some process like find_lru_pack >>> can make good decision. However, this is unfriendly to >>> incremental backup jobs or services rely on cached file system >>> when there are large '.pack' files exists. >>> >>> For example, after packed all objects, use 'write-tree' to >>> create same commit with the same tree and same environments >>> such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can >>> notice the '.pack' file's mtime changed. Git servers >>> that mount the same NFS disk will re-sync the '.pack' files >>> to cached file system which will slow the git commands. >>> >>> So if add core.freshenPackfiles to indicate whether or not >>> packs can be freshened, turning off this option on some >>> servers can speed up the execution of some commands on servers >>> which use NFS disk instead of local disk. >> >> Hmm. I'm still quite unconvinced that we should be taking this direction >> without better motivation. We talked about your assumption that NFS >> seems to be invalidating the block cache when updating the inodes that >> point at those blocks, but I don't recall seeing further evidence. 
> > I don't know about Sun's setup, but what he's describing is consistent > with how NFS works, or can commonly be made to work. > > See e.g. "lookupcache" in nfs(5) on Linux, but also a lot of people use > some random vendor's proprietary NFS implementation, and commonly tweak > various options that make it anywhere between "I guess that's not too > crazy" and "are you kidding me?" levels of non-POSIX compliant. > >> Regardless, a couple of idle thoughts: >> >>> + if (!core_freshen_packfiles) >>> + return 1; >> >> It is important to still freshen the object mtimes even when we cannot >> update the pack mtimes. That's why we return 0 when "freshen_file" >> returned 0: even if there was an error calling utime, we should still >> freshen the object. This is important because it impacts when >> unreachable objects are pruned. >> >> So I would have assumed that if a user set "core.freshenPackfiles=false" >> that they would still want their object mtimes updated, in which case >> the only option we have is to write those objects out loose. >> >> ...and that happens by the caller of freshen_packed_object (either >> write_object_file() or hash_object_file_literally()) then calling >> write_loose_object() if freshen_packed_object() failed. So I would have >> expected to see a "return 0" in the case that packfile freshening was >> disabled. >> >> But that leads us to an interesting problem: how many redundant objects >> do we expect to see on the server? It may be a lot, in which case you >> may end up having the same IO problems for a different reason. Peff >> mentioned to me off-list that he suspected write-tree was overeager in >> how many trees it would try to write out. I'm not sure. > > In my experience with NFS the thing that kills you is anything that > needs to do iterations, i.e. recursive readdir() and the like, or to > read a lot of data, throughput was excellent. It's why I hacked core > that core.checkCollisions patch. 
I have read your patch; it looks like a good idea to reduce expensive operations like readdir(). And in my production environment, the IOPS stress on the NFS server bothers me, which makes the git commands slow. > > Jeff improved the situation I was mainly trying to fix with the > loose objects cache. I never got around to benchmarking the two in > production, and now that setup is at an ex-job... > >>> +test_expect_success 'do not bother loosening old objects without freshen pack time' ' >>> + obj1=$(echo three | git hash-object -w --stdin) && >>> + obj2=$(echo four | git hash-object -w --stdin) && >>> + pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) && >>> + pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) && >>> + git -c core.freshenPackFiles=false prune-packed && >>> + git cat-file -p $obj1 && >>> + git cat-file -p $obj2 && >>> + test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack && >>> + git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago && >>> + git cat-file -p $obj1 && >>> + test_must_fail git cat-file -p $obj2 >>> +' >> >> I had a little bit of a hard time following this test. AFAICT, it >> proceeds as follows: >> >> - Write two packs, each containing a unique unreachable blob. >> - Call 'git prune-packed' with packfile freshening disabled, then >> check that the object survived. >> - Then repack while in a state where one of the pack's contents would >> be pruned. >> - Make sure that one object survives and the other doesn't. >> >> This doesn't really seem to be testing the behavior of disabling >> packfile freshening so much as it's testing prune-packed, and repack's >> `--unpack-unreachable` option. I would probably have expected to see >> something more along the lines of: >> >> - Write an unreachable object, pack it, and then remove the loose copy >> you wrote in the first place. 
>> - Then roll the pack's mtime to some fixed value in the past. >> - Try to write the same object again with packfile freshening >> disabled, and verify that: >> - the pack's mtime was unchanged, >> - the object exists loose again >> >> But I'd really like to get some other opinions (especially from Peff, >> who brought up the potential concerns with write-tree) as to whether or >> not this is a direction worth pursuing. >> >> My opinion is that it is not, and that the bizarre caching behavior you >> are seeing is out of Git's control. > > Thanks for this, I found the test hard to follow too, but didn't have > time to really think about it; this makes sense. > > Back to the topic: I share your sentiment of trying to avoid complexity > in this area. > > Sun: Have you considered passing --keep-unreachable to "git repack"? It's > orthogonal to what you're trying here, but I wonder if being more > aggressive about keeping objects + some improvements to perhaps skip > this "touch" altogether if we have that in effect wouldn't be more viable & > something e.g. Taylor would be more comfortable having as part of git.git. Yes, I will try to create some more useful test cases with both `--keep-unreachable` and `--unpack-unreachable` if I still believe I need to do something with the mtime; either way I need to get my test reports first, thanks ;) ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v3] packfile: freshen the mtime of packfile by configuration 2021-07-19 20:51 ` Taylor Blau 2021-07-20 0:07 ` Junio C Hamano 2021-07-20 6:19 ` Ævar Arnfjörð Bjarmason @ 2021-07-20 15:00 ` Sun Chao 2021-07-20 16:53 ` Taylor Blau 2 siblings, 1 reply; 34+ messages in thread From: Sun Chao @ 2021-07-20 15:00 UTC (permalink / raw) To: Taylor Blau Cc: Sun Chao via GitGitGadget, git, Ævar Arnfjörð Bjarmason, Martin Fick, Son Luong Ngoc, Jeff King > 2021年7月20日 04:51,Taylor Blau <me@ttaylorr.com> 写道: > > On Mon, Jul 19, 2021 at 07:53:19PM +0000, Sun Chao via GitGitGadget wrote: >> From: Sun Chao <16657101987@163.com> >> >> Commit 33d4221c79 (write_sha1_file: freshen existing objects, >> 2014-10-15) avoid writing existing objects by freshen their >> mtime (especially the packfiles contains them) in order to >> aid the correct caching, and some process like find_lru_pack >> can make good decision. However, this is unfriendly to >> incremental backup jobs or services rely on cached file system >> when there are large '.pack' files exists. >> >> For example, after packed all objects, use 'write-tree' to >> create same commit with the same tree and same environments >> such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can >> notice the '.pack' file's mtime changed. Git servers >> that mount the same NFS disk will re-sync the '.pack' files >> to cached file system which will slow the git commands. >> >> So if add core.freshenPackfiles to indicate whether or not >> packs can be freshened, turning off this option on some >> servers can speed up the execution of some commands on servers >> which use NFS disk instead of local disk. > > Hmm. I'm still quite unconvinced that we should be taking this direction > without better motivation. We talked about your assumption that NFS > seems to be invalidating the block cache when updating the inodes that > point at those blocks, but I don't recall seeing further evidence. 
Yes, these days I'm trying to ask our SRE for help to do tests in our production environments (where we can get the real traffic report of the NFS server), such as: - set up a repository with some large packfiles on an NFS disk - keep running 'git pack-objects --all --stdout --delta-base-offset >/dev/null' on multiple git servers that mount the same NFS disk above - 'touch' the packfiles from another server, and check (a) whether the IOPS and IO traffic of the NFS server change and (b) whether the IO traffic of the network interfaces from the git servers to the NFS server changes I'd like to share the data when I receive the reports. Meanwhile I found the description of the cached file system for the NFS client: https://www.ibm.com/docs/en/aix/7.2?topic=performance-cache-file-system It is mentioned that: 3. To ensure that the cached directories and files are kept up to date, CacheFS periodically checks the consistency of files stored in the cache. It does this by comparing the current modification time to the previous modification time. 4. If the modification times are different, all data and attributes for the directory or file are purged from the cache, and new data and attributes are retrieved from the back file system. It looks reasonable, but perhaps I should check it against my test reports ;) > > Regardless, a couple of idle thoughts: > >> + if (!core_freshen_packfiles) >> + return 1; > > It is important to still freshen the object mtimes even when we cannot > update the pack mtimes. That's why we return 0 when "freshen_file" > returned 0: even if there was an error calling utime, we should still > freshen the object. This is important because it impacts when > unreachable objects are pruned. > > So I would have assumed that if a user set "core.freshenPackfiles=false" > that they would still want their object mtimes updated, in which case > the only option we have is to write those objects out loose. 
> > ...and that happens by the caller of freshen_packed_object (either > write_object_file() or hash_object_file_literally()) then calling > write_loose_object() if freshen_packed_object() failed. So I would have > expected to see a "return 0" in the case that packfile freshening was > disabled. > > But that leads us to an interesting problem: how many redundant objects > do we expect to see on the server? It may be a lot, in which case you > may end up having the same IO problems for a different reason. Peff > mentioned to me off-list that he suspected write-tree was overeager in > how many trees it would try to write out. I'm not sure. You are right, I hadn't thought it through in detail: if we do not update the mtime of either the packfiles or the loose files, 'prune' may delete them by accident. > >> +test_expect_success 'do not bother loosening old objects without freshen pack time' ' >> + obj1=$(echo three | git hash-object -w --stdin) && >> + obj2=$(echo four | git hash-object -w --stdin) && >> + pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) && >> + pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) && >> + git -c core.freshenPackFiles=false prune-packed && >> + git cat-file -p $obj1 && >> + git cat-file -p $obj2 && >> + test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack && >> + git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago && >> + git cat-file -p $obj1 && >> + test_must_fail git cat-file -p $obj2 >> +' > > I had a little bit of a hard time following this test. AFAICT, it > proceeds as follows: > > - Write two packs, each containing a unique unreachable blob. > - Call 'git prune-packed' with packfile freshening disabled, then > check that the object survived. > - Then repack while in a state where one of the pack's contents would > be pruned. > - Make sure that one object survives and the other doesn't. 
> > This doesn't really seem to be testing the behavior of disabling > packfile freshening so much as it's testing prune-packed, and repack's > `--unpack-unreachable` option. I would probably have expected to see > something more along the lines of: > > - Write an unreachable object, pack it, and then remove the loose copy > you wrote in the first place. > - Then roll the pack's mtime to some fixed value in the past. > - Try to write the same object again with packfile freshening > disabled, and verify that: > - the pack's mtime was unchanged, > - the object exists loose again > > But I'd really like to get some other opinions (especially from Peff, > who brought up the potential concerns with write-tree) as to whether or > not this is a direction worth pursuing. > > My opinion is that it is not, and that the bizarre caching behavior you > are seeing is out of Git's control. OK, I will try this. And I will try to get the test reports from our SRE and check how the mtime impacts the caches. > > Thanks, > Taylor ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v3] packfile: freshen the mtime of packfile by configuration 2021-07-20 15:00 ` Sun Chao @ 2021-07-20 16:53 ` Taylor Blau 0 siblings, 0 replies; 34+ messages in thread From: Taylor Blau @ 2021-07-20 16:53 UTC (permalink / raw) To: Sun Chao Cc: Taylor Blau, Sun Chao via GitGitGadget, git, Ævar Arnfjörð Bjarmason, Martin Fick, Son Luong Ngoc, Jeff King On Tue, Jul 20, 2021 at 11:00:18PM +0800, Sun Chao wrote: > Meanwhile I find the description of the cached file system for NFS Client: > https://www.ibm.com/docs/en/aix/7.2?topic=performance-cache-file-system > It is mentioned that: > > 3. To ensure that the cached directories and files are kept up to date, > CacheFS periodically checks the consistency of files stored in the cache. > It does this by comparing the current modification time to the previous > modification time. > 4. If the modification times are different, all data and attributes > for the directory or file are purged from the cache, and new data and > attributes are retrieved from the back file system. > > It looks like reasonable, but perhaps I should check it from my test reports ;) That seems reasonable, assuming that you have CacheFS mounted and that's what you're interacting with (instead of talking to NFS directly). It seems reasonable that CacheFS would also have some way to tune how often the "purge cached blocks because of stale mtimes" would kick in, and what the grace period for determining if an mtime is "stale" is. So hopefully those values are (a) configurable and (b) you can find values that result in acceptable performance. Thanks, Taylor ^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v4 0/2] packfile: freshen the mtime of packfile by configuration 2021-07-19 19:53 ` [PATCH v3] " Sun Chao via GitGitGadget 2021-07-19 20:51 ` Taylor Blau @ 2021-08-15 17:08 ` Sun Chao via GitGitGadget 2021-08-15 17:08 ` [PATCH v4 1/2] packfile: rename `derive_filename()` to `derive_pack_filename()` Sun Chao via GitGitGadget 2021-08-15 17:08 ` [PATCH v4 2/2] packfile: freshen the mtime of packfile by bump file Sun Chao via GitGitGadget 1 sibling, 2 replies; 34+ messages in thread From: Sun Chao via GitGitGadget @ 2021-08-15 17:08 UTC (permalink / raw) To: git; +Cc: Sun Chao packfile: freshen the mtime of packfile by bump file We discussed the cache reload in earlier patches, and we stopped because there was no further evidence that the NFS client will reload its page caches when the file mtime changes. So our team has done these experiments: Step1: prepare git servers which mount the NFS disk and a big repo We prepared 3 VMs named c1, s1 and s2, and we also have an NFS server named n1. s1 and s2 mount the NFS disk from n1 by: mount -t nfs -o vers=3,timeo=600,nolock,noatime,lookupcache=positive,\ actimeo=3 <n1 ip addr>:/repositories /mnt/repositories We set up git server services on s1 and s2, so we can clone repos from s1 with git commands. Then we created a repository under /mnt/repositories and pushed large files to it, so we can find a large .pack file in the repository of about 1.2 GB in size. Step2: do the first git clone from the client after dropping caches on s1 First we drop the caches on s1 by: sync; echo 3 > /proc/sys/vm/drop_caches Then we run a git command on c1 to clone the huge repository we created in Step1; at the same time we run these two commands on s1: tcpdump -nn host <n1 ip addr> -w 1st_command.pcap nfsiostat 1 -p /mnt/repositories to capture the traffic and check what happens. 
Step3: do new git clones without dropping caches on s1 After Step2, we ran new git clone commands on c1 to clone the huge repository several times, and also ran these commands at the same time: tcpdump -nn host <n1 ip addr> -w lots_of_command.pcap nfsiostat 1 -p /mnt/repositories Step4: do new git clones with the packfile mtime changed After Step2 and Step3, we touch all the ".pack" files from s2, then call a new git clone on c1 to download the huge repository again, and run the two commands on s1 at the same time: tcpdump -nn host <n1 ip addr> -w mtime_changed_command.pcap nfsiostat 1 -p /mnt/repositories Result: We got an approximately 1.4GB pcap file during each of Step2 and Step4; opening them with wireshark, we can find lots of READ requests and responses. And with the 'nfsiostat' command we can see the 'ops/s' and 'KB/s' of 'read' in the output show relatively large values for a while. But we got a 4MB pcap file in Step3, and opening it with wireshark we can only find GETATTR and FSSTAT requests and responses. And 'nfsiostat' always shows 0 in the 'ops/s' and 'KB/s' of the 'read' part of the output. We have done Step1 to Step4 several times, and each time the results are the same. So we can be sure the NFS client will reload the page cache if another NFS client changes the mtime of the large .pack files. And for git servers which use a filesystem like NFS to manage large repositories, reloading large files whose mtime alone has changed results in big NFS server IOPS pressure, and that also makes the git server slow because IO is the bottleneck when there are too many client requests for the same big repositories. And I do think the teams who manage the git servers need a configuration option which can freshen the mtime of a packfile through another file which is small or even empty. 
It should be backward compatible at its default value, but just as mentioned by Ævar before, maybe some people want to use it in a mixed-version environment; we should warn them in the documentation. Still, such a configuration is a big help for teams who run servers that mount NFS disks. Sun Chao (2): packfile: rename `derive_filename()` to `derive_pack_filename()` packfile: freshen the mtime of packfile by bump file Documentation/config/core.txt | 11 +++ builtin/index-pack.c | 19 +---- cache.h | 1 + config.c | 5 ++ environment.c | 1 + object-file.c | 30 +++++++- packfile.c | 25 ++++++- packfile.h | 7 ++ t/t5326-pack-mtime-bumpfiles.sh | 118 ++++++++++++++++++++++++++++++++ 9 files changed, 198 insertions(+), 19 deletions(-) create mode 100755 t/t5326-pack-mtime-bumpfiles.sh base-commit: 5d213e46bb7b880238ff5ea3914e940a50ae9369 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1043%2Fsunchao9%2Fmaster-v4 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1043/sunchao9/master-v4 Pull-Request: https://github.com/git/git/pull/1043 Range-diff vs v3: -: ----------- > 1: 81afc69d22c packfile: rename `derive_filename()` to `derive_pack_filename()` 1: 16c68923bea ! 2: 7166f776154 packfile: freshen the mtime of packfile by configuration @@ Metadata Author: Sun Chao <16657101987@163.com> ## Commit message ## - packfile: freshen the mtime of packfile by configuration + packfile: freshen the mtime of packfile by bump file Commit 33d4221c79 (write_sha1_file: freshen existing objects, 2014-10-15) avoid writing existing objects by freshen their @@ Commit message create same commit with the same tree and same environments such like GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, we can notice the '.pack' file's mtime changed. Git servers - that mount the same NFS disk will re-sync the '.pack' files - to cached file system which will slow the git commands. 
+ that use filesystems like NFS will reload the '.pack' files + to file system page cache, which will slow the git commands. - So if add core.freshenPackfiles to indicate whether or not - packs can be freshened, turning off this option on some - servers can speed up the execution of some commands on servers - which use NFS disk instead of local disk. + So if we freshen the mtime of packfile by updating a '.bump' + file instead, when we check the mtime of packfile, get it from + '.bump' file also. Large git repository may contains large + '.pack' files, but '.bump' files can be empty. This will avoid + file system page caches reload large files from NFS and then + make git commands faster. Signed-off-by: Sun Chao <16657101987@163.com> @@ Documentation/config/core.txt: the largest projects. You probably do not need t + Common unit suffixes of 'k', 'm', or 'g' are supported. -+core.freshenPackFiles:: ++core.packMtimeToBumpFiles:: + Normally we avoid writing existing object by freshening the mtime + of the *.pack file which contains it in order to aid some processes -+ such like prune. Turning off this option on some servers can speed -+ up the execution of some commands like 'git-upload-pack'(e.g. some -+ servers that mount the same NFS disk will re-sync the *.pack files -+ to cached file system if the mtime cahnges). ++ such like prune. Use a *.bump file instead of *.pack file will ++ avoid file system cache re-sync the large packfiles on filesystems ++ like NFS, and consequently make git commands faster. ++ -+The default is true which means the *.pack file will be freshened if we -+want to write a existing object whthin it. ++The default is 'false' which means the *.pack file will be freshened by ++default. If set to 'true', the file with the '.bump' suffix will be ++created automatically, and it's mtime will be freshened instead. 
+ core.deltaBaseCacheLimit:: Maximum number of bytes per thread to reserve for caching base objects that may be referenced by multiple deltified objects. By storing the ## cache.h ## -@@ cache.h: extern size_t packed_git_limit; +@@ cache.h: extern const char *git_hooks_path; + extern int zlib_compression_level; + extern int core_compression_level; + extern int pack_compression_level; ++extern int pack_mtime_to_bumpfiles; + extern size_t packed_git_window_size; + extern size_t packed_git_limit; extern size_t delta_base_cache_limit; - extern unsigned long big_file_threshold; - extern unsigned long pack_size_limit_cfg; -+extern int core_freshen_packfiles; - - /* - * Accessors for the core.sharedrepository config which lazy-load the value ## config.c ## @@ config.c: static int git_default_core_config(const char *var, const char *value, void *cb) return 0; } -+ if (!strcmp(var, "core.freshenpackfiles")) { -+ core_freshen_packfiles = git_config_bool(var, value); ++ if (!strcmp(var, "core.packmtimetobumpfiles")) { ++ pack_mtime_to_bumpfiles = git_config_bool(var, value); ++ return 0; + } + if (!strcmp(var, "core.deltabasecachelimit")) { @@ config.c: static int git_default_core_config(const char *var, const char *value, return 0; ## environment.c ## -@@ environment.c: int core_sparse_checkout_cone; - int merge_log_config = -1; - int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */ - unsigned long pack_size_limit_cfg; -+int core_freshen_packfiles = 1; - enum log_refs_config log_all_ref_updates = LOG_REFS_UNSET; - - #ifndef PROTECT_HFS_DEFAULT +@@ environment.c: const char *git_hooks_path; + int zlib_compression_level = Z_BEST_SPEED; + int core_compression_level; + int pack_compression_level = Z_DEFAULT_COMPRESSION; ++int pack_mtime_to_bumpfiles; + int fsync_object_files; + size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE; + size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT; ## object-file.c ## @@ object-file.c: static int 
freshen_loose_object(const struct object_id *oid) static int freshen_packed_object(const struct object_id *oid) { struct pack_entry e; -+ -+ if (!core_freshen_packfiles) -+ return 1; ++ struct stat st; ++ struct strbuf name_buf = STRBUF_INIT; ++ const char *filename; + if (!find_pack_entry(the_repository, oid, &e)) return 0; if (e.p->freshened) + return 1; +- if (!freshen_file(e.p->pack_name)) +- return 0; ++ ++ filename = e.p->pack_name; ++ if (!pack_mtime_to_bumpfiles) { ++ if (!freshen_file(filename)) ++ return 0; ++ e.p->freshened = 1; ++ return 1; ++ } ++ ++ filename = derive_pack_filename(filename, "pack", "bump", &name_buf); ++ if (lstat(filename, &st) < 0) { ++ int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664); ++ if (fd < 0) { ++ // here we need to check it again because other git process may created it ++ if (lstat(filename, &st) < 0) ++ die_errno("unable to create '%s'", filename); ++ } else { ++ close(fd); ++ } ++ } else { ++ if (!freshen_file(filename)) ++ return 0; ++ } ++ + e.p->freshened = 1; + return 1; + } - ## t/t7701-repack-unpack-unreachable.sh ## -@@ t/t7701-repack-unpack-unreachable.sh: test_expect_success 'do not bother loosening old objects' ' - test_must_fail git cat-file -p $obj2 - ' + ## packfile.c ## +@@ packfile.c: void close_object_store(struct raw_object_store *o) -+test_expect_success 'do not bother loosening old objects without freshen pack time' ' -+ obj1=$(echo three | git hash-object -w --stdin) && -+ obj2=$(echo four | git hash-object -w --stdin) && -+ pack1=$(echo $obj1 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) && -+ pack2=$(echo $obj2 | git -c core.freshenPackFiles=false pack-objects .git/objects/pack/pack) && -+ git -c core.freshenPackFiles=false prune-packed && + void unlink_pack_path(const char *pack_name, int force_delete) + { +- static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"}; ++ static const char *exts[] = {".pack", ".idx", ".rev", ".keep", 
".bitmap", ".promisor", ".bump"}; + int i; + struct strbuf buf = STRBUF_INIT; + size_t plen; +@@ packfile.c: struct packed_git *add_packed_git(const char *path, size_t path_len, int local) + p->pack_size = st.st_size; + p->pack_local = local; + p->mtime = st.st_mtime; ++ ++ if (pack_mtime_to_bumpfiles) { ++ struct strbuf name_buf = STRBUF_INIT; ++ const char *filename; ++ ++ filename = derive_pack_filename(path, "idx", "bump", &name_buf); ++ if (!stat(filename, &st)) { ++ p->mtime = st.st_mtime; ++ } ++ } + if (path_len < the_hash_algo->hexsz || + get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash)) + hashclr(p->hash); + + ## t/t5326-pack-mtime-bumpfiles.sh (new) ## +@@ ++#!/bin/sh ++ ++test_description='packfile mtime use bump files' ++. ./test-lib.sh ++ ++if stat -c %Y . >/dev/null 2>&1; then ++ get_modified_time() { stat -c %Y "$1" 2>/dev/null; } ++elif stat -f %m . >/dev/null 2>&1; then ++ get_modified_time() { stat -f %m "$1" 2>/dev/null; } ++elif date -r . +%s >/dev/null 2>&1; then ++ get_modified_time() { date -r "$1" +%s 2>/dev/null; } ++else ++ echo 'get_modified_time() is unsupported' >&2 ++ get_modified_time() { printf '%s' 0; } ++fi ++ ++test_expect_success 'freshen existing packfile without core.packMtimeToBumpFiles' ' ++ obj1=$(echo one | git hash-object -w --stdin) && ++ obj2=$(echo two | git hash-object -w --stdin) && ++ pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) && ++ pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) && ++ test-tool chmtime =-60 .git/objects/pack/pack-$pack1.* && ++ test-tool chmtime =-60 .git/objects/pack/pack-$pack2.* && ++ pack1_mtime=$(get_modified_time .git/objects/pack/pack-$pack1.pack) && ++ pack2_mtime=$(get_modified_time .git/objects/pack/pack-$pack2.pack) && ++ (echo one | git hash-object -w --stdin) && ++ ! test_path_exists .git/objects/pack/pack-$pack1.bump && ++ ! 
test_path_exists .git/objects/pack/pack-$pack2.bump && ++ pack1_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack1.pack) && ++ pack2_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack2.pack) && ++ echo "$pack1_mtime : $pack1_mtime_new" && ++ test ! "$pack1_mtime" = "$pack1_mtime_new" && ++ test "$pack2_mtime" = "$pack2_mtime_new" ++ ++' ++ ++test_expect_success 'freshen existing packfile with core.packMtimeToBumpFiles' ' ++ ++ rm -rf .git/objects && git init && ++ obj1=$(echo one | git hash-object -w --stdin) && ++ obj2=$(echo two | git hash-object -w --stdin) && ++ pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) && ++ pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) && ++ test-tool chmtime =-60 .git/objects/pack/pack-$pack1.* && ++ test-tool chmtime =-60 .git/objects/pack/pack-$pack2.* && ++ pack1_mtime=$(get_modified_time .git/objects/pack/pack-$pack1.pack) && ++ pack2_mtime=$(get_modified_time .git/objects/pack/pack-$pack2.pack) && ++ (echo one | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin) && ++ test_path_exists .git/objects/pack/pack-$pack1.bump && ++ ! test_path_exists .git/objects/pack/pack-$pack2.bump && ++ pack1_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack1.pack) && ++ pack2_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack2.pack) && ++ test "$pack1_mtime" = "$pack1_mtime_new" && ++ test "$pack2_mtime" = "$pack2_mtime_new" ++ ++' ++ ++test_expect_success 'repack prune unreachable objects without core.packMtimeToBumpFiles' ' ++ ++ rm -rf .git/objects && git init && ++ obj1=$(echo one | git hash-object -w --stdin) && ++ obj2=$(echo two | git hash-object -w --stdin) && ++ pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) && ++ pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) && ++ echo one | git hash-object -w --stdin && ++ echo two | git hash-object -w --stdin && ++ ! test_path_exists .git/objects/pack/pack-$pack1.bump && ++ ! 
test_path_exists .git/objects/pack/pack-$pack2.bump && ++ git prune-packed && + git cat-file -p $obj1 && + git cat-file -p $obj2 && + test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack && -+ git -c core.freshenPackFiles=false repack -A -d --unpack-unreachable=1.hour.ago && ++ git repack -A -d --unpack-unreachable=1.hour.ago && ++ git cat-file -p $obj1 && ++ test_must_fail git cat-file -p $obj2 ++ ++' ++ ++test_expect_success 'repack prune unreachable objects with core.packMtimeToBumpFiles and bump files' ' ++ ++ rm -rf .git/objects && git init && ++ obj1=$(echo one | git hash-object -w --stdin) && ++ obj2=$(echo two | git hash-object -w --stdin) && ++ pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) && ++ pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) && ++ echo one | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin && ++ echo two | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin && ++ test_path_exists .git/objects/pack/pack-$pack1.bump && ++ test_path_exists .git/objects/pack/pack-$pack2.bump && ++ test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack && ++ git -c core.packMtimeToBumpFiles=true repack -A -d --unpack-unreachable=1.hour.ago && ++ git cat-file -p $obj1 && ++ git cat-file -p $obj2 ++ ++' ++ ++test_expect_success 'repack prune unreachable objects with core.packMtimeToBumpFiles and old bump files' ' ++ ++ rm -rf .git/objects && git init && ++ obj1=$(echo one | git hash-object -w --stdin) && ++ obj2=$(echo two | git hash-object -w --stdin) && ++ pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) && ++ pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) && ++ echo one | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin && ++ echo two | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin && ++ test_path_exists .git/objects/pack/pack-$pack1.bump && ++ test_path_exists .git/objects/pack/pack-$pack2.bump && ++ git prune-packed && ++ git cat-file -p 
$obj1 && ++ git cat-file -p $obj2 && ++ test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.bump && ++ git -c core.packMtimeToBumpFiles=true repack -A -d --unpack-unreachable=1.hour.ago && + git cat-file -p $obj1 && + test_must_fail git cat-file -p $obj2 ++ +' + - test_expect_success 'keep packed objects found only in index' ' - echo my-unique-content >file && - git add file && ++test_done -- gitgitgadget ^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v4 1/2] packfile: rename `derive_filename()` to `derive_pack_filename()`
  2021-08-15 17:08 ` [PATCH v4 0/2] " Sun Chao via GitGitGadget
@ 2021-08-15 17:08 ` Sun Chao via GitGitGadget
  2021-08-15 17:08 ` [PATCH v4 2/2] packfile: freshen the mtime of packfile by bump file Sun Chao via GitGitGadget
  1 sibling, 0 replies; 34+ messages in thread
From: Sun Chao via GitGitGadget @ 2021-08-15 17:08 UTC (permalink / raw)
  To: git; +Cc: Sun Chao, Sun Chao

From: Sun Chao <16657101987@163.com>

To allow other code to derive a file name with a new suffix from a
`.pack` file name, move `derive_filename()` from `builtin/index-pack.c`
to `packfile.c`, rename it to `derive_pack_filename()`, and export it
from `packfile.h`.

Signed-off-by: Sun Chao <16657101987@163.com>
---
 builtin/index-pack.c | 19 +++----------------
 packfile.c           | 13 +++++++++++++
 packfile.h           |  7 +++++++
 3 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 8336466865c..3c83789ccef 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1435,19 +1435,6 @@ static void fix_unresolved_deltas(struct hashfile *f)
 	free(sorted_by_pos);
 }

-static const char *derive_filename(const char *pack_name, const char *strip,
-				   const char *suffix, struct strbuf *buf)
-{
-	size_t len;
-	if (!strip_suffix(pack_name, strip, &len) || !len ||
-	    pack_name[len - 1] != '.')
-		die(_("packfile name '%s' does not end with '.%s'"),
-		    pack_name, strip);
-	strbuf_add(buf, pack_name, len);
-	strbuf_addstr(buf, suffix);
-	return buf->buf;
-}
-
 static void write_special_file(const char *suffix, const char *msg,
 			       const char *pack_name, const unsigned char *hash,
 			       const char **report)
@@ -1458,7 +1445,7 @@ static void write_special_file(const char *suffix, const char *msg,
 	int msg_len = strlen(msg);

 	if (pack_name)
-		filename = derive_filename(pack_name, "pack", suffix, &name_buf);
+		filename = derive_pack_filename(pack_name, "pack", suffix, &name_buf);
 	else
 		filename = odb_pack_name(&name_buf, hash, suffix);
@@ -1853,13 +1840,13 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (from_stdin && hash_algo)
 		die(_("--object-format cannot be used with --stdin"));
 	if (!index_name && pack_name)
-		index_name = derive_filename(pack_name, "pack", "idx", &index_name_buf);
+		index_name = derive_pack_filename(pack_name, "pack", "idx", &index_name_buf);

 	opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY);
 	if (rev_index) {
 		opts.flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
 		if (index_name)
-			rev_index_name = derive_filename(index_name,
+			rev_index_name = derive_pack_filename(index_name,
 							 "idx", "rev",
 							 &rev_index_name_buf);
 	}
diff --git a/packfile.c b/packfile.c
index 9ef6d982928..315c3da259a 100644
--- a/packfile.c
+++ b/packfile.c
@@ -40,6 +40,19 @@ char *sha1_pack_index_name(const unsigned char *sha1)
 	return odb_pack_name(&buf, sha1, "idx");
 }

+const char *derive_pack_filename(const char *pack_name, const char *strip,
+				 const char *suffix, struct strbuf *buf)
+{
+	size_t len;
+	if (!strip_suffix(pack_name, strip, &len) || !len ||
+	    pack_name[len - 1] != '.')
+		die(_("packfile name '%s' does not end with '.%s'"),
+		    pack_name, strip);
+	strbuf_add(buf, pack_name, len);
+	strbuf_addstr(buf, suffix);
+	return buf->buf;
+}
+
 static unsigned int pack_used_ctr;
 static unsigned int pack_mmap_calls;
 static unsigned int peak_pack_open_windows;
diff --git a/packfile.h b/packfile.h
index 3ae117a8aef..ff702b22e6a 100644
--- a/packfile.h
+++ b/packfile.h
@@ -31,6 +31,13 @@ char *sha1_pack_name(const unsigned char *sha1);
  */
 char *sha1_pack_index_name(const unsigned char *sha1);

+/*
+ * Return the corresponding filename with the given suffix from
+ * "file_name", which must have the "strip" suffix.
+ */
+const char *derive_pack_filename(const char *file_name, const char *strip,
+				 const char *suffix, struct strbuf *buf);
+
 /*
  * Return the basename of the packfile, omitting any containing directory
  * (e.g., "pack-1234abcd[...].pack").
--
gitgitgadget

^ permalink raw reply related	[flat|nested] 34+ messages in thread
* [PATCH v4 2/2] packfile: freshen the mtime of packfile by bump file
  2021-08-15 17:08 ` [PATCH v4 0/2] " Sun Chao via GitGitGadget
  2021-08-15 17:08 ` [PATCH v4 1/2] packfile: rename `derive_filename()` to `derive_pack_filename()` Sun Chao via GitGitGadget
@ 2021-08-15 17:08 ` Sun Chao via GitGitGadget
  1 sibling, 0 replies; 34+ messages in thread
From: Sun Chao via GitGitGadget @ 2021-08-15 17:08 UTC (permalink / raw)
  To: git; +Cc: Sun Chao, Sun Chao

From: Sun Chao <16657101987@163.com>

Commit 33d4221c79 (write_sha1_file: freshen existing objects, 2014-10-15)
avoids writing existing objects by freshening their mtime (in particular,
the mtime of the packfiles that contain them) in order to aid correct
caching, so that processes like find_lru_pack can make good decisions.
However, this is unfriendly to incremental backup jobs and to services
that rely on the file system cache when large '.pack' files exist.

For example, after packing all objects, if we use 'write-tree' to create
the same commit with the same tree and the same environment (such as
GIT_COMMITTER_DATE and GIT_AUTHOR_DATE), we can see that the '.pack'
file's mtime changed. Git servers on filesystems like NFS will then
reload the '.pack' files into the file system page cache, which slows
down git commands.

So freshen the mtime of a packfile by updating a companion '.bump' file
instead, and read the packfile's mtime from the '.bump' file as well. A
large git repository may contain large '.pack' files, but the '.bump'
files can be empty. This avoids the file system page cache reloading
large files from NFS, and makes git commands faster.
Signed-off-by: Sun Chao <16657101987@163.com>
---
 Documentation/config/core.txt   |  11 +++
 cache.h                         |   1 +
 config.c                        |   5 ++
 environment.c                   |   1 +
 object-file.c                   |  30 +++++++-
 packfile.c                      |  12 +++-
 t/t5326-pack-mtime-bumpfiles.sh | 118 ++++++++++++++++++++++++++++++++
 7 files changed, 175 insertions(+), 3 deletions(-)
 create mode 100755 t/t5326-pack-mtime-bumpfiles.sh

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index c04f62a54a1..963d1b54e7e 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -398,6 +398,17 @@ the largest projects.  You probably do not need to adjust this value.
 +
 Common unit suffixes of 'k', 'm', or 'g' are supported.

+core.packMtimeToBumpFiles::
+	Normally we avoid writing an existing object by freshening the mtime
+	of the *.pack file which contains it, in order to aid processes
+	such as prune. Using a *.bump file instead of the *.pack file
+	avoids the file system cache re-syncing large packfiles on
+	filesystems like NFS, and consequently makes git commands faster.
++
+The default is 'false', which means the *.pack file will be freshened by
+default. If set to 'true', a file with the '.bump' suffix will be
+created automatically, and its mtime will be freshened instead.
+
 core.deltaBaseCacheLimit::
 	Maximum number of bytes per thread to reserve for caching base objects
 	that may be referenced by multiple deltified objects.  By storing the
diff --git a/cache.h b/cache.h
index bd4869beee4..a563cbacfa2 100644
--- a/cache.h
+++ b/cache.h
@@ -960,6 +960,7 @@ extern const char *git_hooks_path;
 extern int zlib_compression_level;
 extern int core_compression_level;
 extern int pack_compression_level;
+extern int pack_mtime_to_bumpfiles;
 extern size_t packed_git_window_size;
 extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
diff --git a/config.c b/config.c
index f33abeab851..10ccf7c5581 100644
--- a/config.c
+++ b/config.c
@@ -1431,6 +1431,11 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 		return 0;
 	}

+	if (!strcmp(var, "core.packmtimetobumpfiles")) {
+		pack_mtime_to_bumpfiles = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.deltabasecachelimit")) {
 		delta_base_cache_limit = git_config_ulong(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index d6b22ede7ea..5fa26cb3758 100644
--- a/environment.c
+++ b/environment.c
@@ -43,6 +43,7 @@ const char *git_hooks_path;
 int zlib_compression_level = Z_BEST_SPEED;
 int core_compression_level;
 int pack_compression_level = Z_DEFAULT_COMPRESSION;
+int pack_mtime_to_bumpfiles;
 int fsync_object_files;
 size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
 size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
diff --git a/object-file.c b/object-file.c
index a8be8994814..434073c17f1 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1994,12 +1994,38 @@ static int freshen_loose_object(const struct object_id *oid)
 static int freshen_packed_object(const struct object_id *oid)
 {
 	struct pack_entry e;
+	struct stat st;
+	struct strbuf name_buf = STRBUF_INIT;
+	const char *filename;
+
 	if (!find_pack_entry(the_repository, oid, &e))
 		return 0;
 	if (e.p->freshened)
 		return 1;
-	if (!freshen_file(e.p->pack_name))
-		return 0;
+
+	filename = e.p->pack_name;
+	if (!pack_mtime_to_bumpfiles) {
+		if (!freshen_file(filename))
+			return 0;
+		e.p->freshened = 1;
+		return 1;
+	}
+
+	filename = derive_pack_filename(filename, "pack", "bump", &name_buf);
+	if (lstat(filename, &st) < 0) {
+		int fd = open(filename, O_CREAT|O_EXCL|O_WRONLY, 0664);
+		if (fd < 0) {
+			/* check again: another git process may have created it */
+			if (lstat(filename, &st) < 0)
+				die_errno("unable to create '%s'", filename);
+		} else {
+			close(fd);
+		}
+	} else {
+		if (!freshen_file(filename))
+			return 0;
+	}
+
 	e.p->freshened = 1;
 	return 1;
 }
diff --git a/packfile.c b/packfile.c
index 315c3da259a..f5cee440601 100644
--- a/packfile.c
+++ b/packfile.c
@@ -374,7 +374,7 @@ void close_object_store(struct raw_object_store *o)
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor", ".bump"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -741,6 +741,16 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	p->pack_size = st.st_size;
 	p->pack_local = local;
 	p->mtime = st.st_mtime;
+
+	if (pack_mtime_to_bumpfiles) {
+		struct strbuf name_buf = STRBUF_INIT;
+		const char *filename;
+
+		filename = derive_pack_filename(path, "idx", "bump", &name_buf);
+		if (!stat(filename, &st)) {
+			p->mtime = st.st_mtime;
+		}
+	}
 	if (path_len < the_hash_algo->hexsz ||
 	    get_sha1_hex(path + path_len - the_hash_algo->hexsz, p->hash))
 		hashclr(p->hash);
diff --git a/t/t5326-pack-mtime-bumpfiles.sh b/t/t5326-pack-mtime-bumpfiles.sh
new file mode 100755
index 00000000000..d6d9e6dc446
--- /dev/null
+++ b/t/t5326-pack-mtime-bumpfiles.sh
@@ -0,0 +1,118 @@
+#!/bin/sh
+
+test_description='packfile mtime use bump files'
+. ./test-lib.sh
+
+if stat -c %Y . >/dev/null 2>&1; then
+	get_modified_time() { stat -c %Y "$1" 2>/dev/null; }
+elif stat -f %m . >/dev/null 2>&1; then
+	get_modified_time() { stat -f %m "$1" 2>/dev/null; }
+elif date -r . +%s >/dev/null 2>&1; then
+	get_modified_time() { date -r "$1" +%s 2>/dev/null; }
+else
+	echo 'get_modified_time() is unsupported' >&2
+	get_modified_time() { printf '%s' 0; }
+fi
+
+test_expect_success 'freshen existing packfile without core.packMtimeToBumpFiles' '
+	obj1=$(echo one | git hash-object -w --stdin) &&
+	obj2=$(echo two | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
+	test-tool chmtime =-60 .git/objects/pack/pack-$pack1.* &&
+	test-tool chmtime =-60 .git/objects/pack/pack-$pack2.* &&
+	pack1_mtime=$(get_modified_time .git/objects/pack/pack-$pack1.pack) &&
+	pack2_mtime=$(get_modified_time .git/objects/pack/pack-$pack2.pack) &&
+	(echo one | git hash-object -w --stdin) &&
+	! test_path_exists .git/objects/pack/pack-$pack1.bump &&
+	! test_path_exists .git/objects/pack/pack-$pack2.bump &&
+	pack1_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack1.pack) &&
+	pack2_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack2.pack) &&
+	echo "$pack1_mtime : $pack1_mtime_new" &&
+	test ! "$pack1_mtime" = "$pack1_mtime_new" &&
+	test "$pack2_mtime" = "$pack2_mtime_new"
+
+'
+
+test_expect_success 'freshen existing packfile with core.packMtimeToBumpFiles' '
+
+	rm -rf .git/objects && git init &&
+	obj1=$(echo one | git hash-object -w --stdin) &&
+	obj2=$(echo two | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
+	test-tool chmtime =-60 .git/objects/pack/pack-$pack1.* &&
+	test-tool chmtime =-60 .git/objects/pack/pack-$pack2.* &&
+	pack1_mtime=$(get_modified_time .git/objects/pack/pack-$pack1.pack) &&
+	pack2_mtime=$(get_modified_time .git/objects/pack/pack-$pack2.pack) &&
+	(echo one | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin) &&
+	test_path_exists .git/objects/pack/pack-$pack1.bump &&
+	! test_path_exists .git/objects/pack/pack-$pack2.bump &&
+	pack1_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack1.pack) &&
+	pack2_mtime_new=$(get_modified_time .git/objects/pack/pack-$pack2.pack) &&
+	test "$pack1_mtime" = "$pack1_mtime_new" &&
+	test "$pack2_mtime" = "$pack2_mtime_new"
+
+'
+
+test_expect_success 'repack prune unreachable objects without core.packMtimeToBumpFiles' '
+
+	rm -rf .git/objects && git init &&
+	obj1=$(echo one | git hash-object -w --stdin) &&
+	obj2=$(echo two | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
+	echo one | git hash-object -w --stdin &&
+	echo two | git hash-object -w --stdin &&
+	! test_path_exists .git/objects/pack/pack-$pack1.bump &&
+	! test_path_exists .git/objects/pack/pack-$pack2.bump &&
+	git prune-packed &&
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2 &&
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
+	git repack -A -d --unpack-unreachable=1.hour.ago &&
+	git cat-file -p $obj1 &&
+	test_must_fail git cat-file -p $obj2
+
+'
+
+test_expect_success 'repack prune unreachable objects with core.packMtimeToBumpFiles and bump files' '
+
+	rm -rf .git/objects && git init &&
+	obj1=$(echo one | git hash-object -w --stdin) &&
+	obj2=$(echo two | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
+	echo one | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin &&
+	echo two | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin &&
+	test_path_exists .git/objects/pack/pack-$pack1.bump &&
+	test_path_exists .git/objects/pack/pack-$pack2.bump &&
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.pack &&
+	git -c core.packMtimeToBumpFiles=true repack -A -d --unpack-unreachable=1.hour.ago &&
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2
+
+'
+
+test_expect_success 'repack prune unreachable objects with core.packMtimeToBumpFiles and old bump files' '
+
+	rm -rf .git/objects && git init &&
+	obj1=$(echo one | git hash-object -w --stdin) &&
+	obj2=$(echo two | git hash-object -w --stdin) &&
+	pack1=$(echo $obj1 | git pack-objects .git/objects/pack/pack) &&
+	pack2=$(echo $obj2 | git pack-objects .git/objects/pack/pack) &&
+	echo one | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin &&
+	echo two | git -c core.packMtimeToBumpFiles=true hash-object -w --stdin &&
+	test_path_exists .git/objects/pack/pack-$pack1.bump &&
+	test_path_exists .git/objects/pack/pack-$pack2.bump &&
+	git prune-packed &&
+	git cat-file -p $obj1 &&
+	git cat-file -p $obj2 &&
+	test-tool chmtime =-86400 .git/objects/pack/pack-$pack2.bump &&
+	git -c core.packMtimeToBumpFiles=true repack -A -d --unpack-unreachable=1.hour.ago &&
+	git cat-file -p $obj1 &&
+	test_must_fail git cat-file -p $obj2
+
+'
+
+test_done
--
gitgitgadget

^ permalink raw reply related	[flat|nested] 34+ messages in thread
end of thread, other threads:[~2021-08-15 17:09 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-10 19:01 [PATCH] packfile: enhance the mtime of packfile by idx file Sun Chao via GitGitGadget
2021-07-11 23:44 ` Ævar Arnfjörð Bjarmason
2021-07-12 16:17 ` Sun Chao
2021-07-14  1:28 ` [PATCH v2] packfile: freshen the mtime of packfile by configuration Sun Chao via GitGitGadget
2021-07-14  1:39 ` Ævar Arnfjörð Bjarmason
2021-07-14  2:52 ` Taylor Blau
2021-07-14 16:46 ` Sun Chao
2021-07-14 17:04 ` Taylor Blau
2021-07-14 18:19 ` Ævar Arnfjörð Bjarmason
2021-07-14 19:11 ` Martin Fick
2021-07-14 19:41 ` Ævar Arnfjörð Bjarmason
2021-07-14 20:20 ` Martin Fick
2021-07-20  6:32 ` Ævar Arnfjörð Bjarmason
2021-07-15  8:23 ` Son Luong Ngoc
2021-07-20  6:29 ` Ævar Arnfjörð Bjarmason
2021-07-14 19:30 ` Taylor Blau
2021-07-14 19:32 ` Ævar Arnfjörð Bjarmason
2021-07-14 19:52 ` Taylor Blau
2021-07-14 21:40 ` Junio C Hamano
2021-07-15 16:30 ` Sun Chao
2021-07-15 16:42 ` Taylor Blau
2021-07-15 16:48 ` Sun Chao
2021-07-14 16:11 ` Sun Chao
2021-07-19 19:53 ` [PATCH v3] " Sun Chao via GitGitGadget
2021-07-19 20:51 ` Taylor Blau
2021-07-20  0:07 ` Junio C Hamano
2021-07-20 15:07 ` Sun Chao
2021-07-20  6:19 ` Ævar Arnfjörð Bjarmason
2021-07-20 15:34 ` Sun Chao
2021-07-20 15:00 ` Sun Chao
2021-07-20 16:53 ` Taylor Blau
2021-08-15 17:08 ` [PATCH v4 0/2] " Sun Chao via GitGitGadget
2021-08-15 17:08 ` [PATCH v4 1/2] packfile: rename `derive_filename()` to `derive_pack_filename()` Sun Chao via GitGitGadget
2021-08-15 17:08 ` [PATCH v4 2/2] packfile: freshen the mtime of packfile by bump file Sun Chao via GitGitGadget