* Refreshing index timestamps without reading content @ 2017-01-05 11:23 Quentin Casasnovas 2017-01-09 12:02 ` Duy Nguyen 0 siblings, 1 reply; 7+ messages in thread From: Quentin Casasnovas @ 2017-01-05 11:23 UTC (permalink / raw) To: git [-- Attachment #1: Type: text/plain, Size: 2657 bytes --] Hi guys, Apologies if this is documented somewhere, I have fairly bad search vudu skills. I'm looking for a way to cause a full refresh of the index without causing any read of the files, basically telling git "trust me, all worktree files are matching the index, but their stat information have changed". I have read about the update-index --assume-unchanged and --skip-worktree flags in the documentation, but these do not cause any index refresh - rather, they fake that the respective worktree files are matching the index until you remove those assume-unchanged/skip-worktree bits. This might sound like a really weird thing to do, but I do have a use case for it - we have some build farm setup where the resulting objects of a compilation are stored on a shared server. The source files are not stored on the shared server, but locally on each of the build server (as to decrease network load and make good use of local storage as caches). We then use an onion filesystem to mount the compiled objects on top of the local sources - and change the modification time of the source to be older than the object files, so that on subsequent builds, make does not rebuild the whole world. This works fine except for one thing, after changing the mtime of the source files, the first subsequent git command needing to compare the tree with the index will take a LONG time since it will read all of the object content: cd linux-2.6 # Less than a second when the index is up to date time git status > /dev/null git status 0.06s user 0.09s system 172% cpu 0.087 total ~~~~~~~~~~~ # Change the mtime.. git ls-tree -r --name-only HEAD | xargs -n 1024 touch # Now 30s.. time git status > /dev/null git status 2.73s user 1.79s system 13% cpu 32.453 total ~~~~~~~~~~~~ The timing information above was captured on my laptop SSD and the penalty is obviously much higher on spinning disks - especially when this operation is done on *hundreds* of different work tree in parallel, all hosted on the same filesystem (it can take tens of minutes!). Is there any way to tell git, after the git ls-tree command above, to refresh its stat cache information and trust us that the file content has not changed, as to avoid any useless file read (though it will obviously will have to stat all of them, but that's not something we can really avoid) If not, I am willing to implement a --assume-content-unchanged to the git update-index if you guys don't see something fundamentally wrong with this approach. Thanks for any hints you can give! :) Q [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 801 bytes --] ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Refreshing index timestamps without reading content 2017-01-05 11:23 Refreshing index timestamps without reading content Quentin Casasnovas @ 2017-01-09 12:02 ` Duy Nguyen 2017-01-09 12:17 ` Quentin Casasnovas 2017-01-09 15:01 ` Junio C Hamano 0 siblings, 2 replies; 7+ messages in thread From: Duy Nguyen @ 2017-01-09 12:02 UTC (permalink / raw) To: Quentin Casasnovas; +Cc: Git Mailing List On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas <quentin.casasnovas@oracle.com> wrote: > Hi guys, > > Apologies if this is documented somewhere, I have fairly bad search vudu > skills. > > I'm looking for a way to cause a full refresh of the index without causing > any read of the files, basically telling git "trust me, all worktree files > are matching the index, but their stat information have changed". I have > read about the update-index --assume-unchanged and --skip-worktree flags in > the documentation, but these do not cause any index refresh - rather, they > fake that the respective worktree files are matching the index until you > remove those assume-unchanged/skip-worktree bits. > > This might sound like a really weird thing to do, but I do have a use case > for it - we have some build farm setup where the resulting objects of a > compilation are stored on a shared server. The source files are not stored > on the shared server, but locally on each of the build server (as to > decrease network load and make good use of local storage as caches). > > We then use an onion filesystem to mount the compiled objects on top of the > local sources - and change the modification time of the source to be older > than the object files, so that on subsequent builds, make does not rebuild > the whole world. > > This works fine except for one thing, after changing the mtime of the > source files, the first subsequent git command needing to compare the tree > with the index will take a LONG time since it will read all of the object > content: > > cd linux-2.6 > > # Less than a second when the index is up to date > time git status > /dev/null > git status 0.06s user 0.09s system 172% cpu 0.087 total > ~~~~~~~~~~~ > > # Change the mtime.. > git ls-tree -r --name-only HEAD | xargs -n 1024 touch > > # Now 30s.. > time git status > /dev/null > git status 2.73s user 1.79s system 13% cpu 32.453 total > ~~~~~~~~~~~~ > > The timing information above was captured on my laptop SSD and the penalty > is obviously much higher on spinning disks - especially when this operation > is done on *hundreds* of different work tree in parallel, all hosted on the > same filesystem (it can take tens of minutes!). > > Is there any way to tell git, after the git ls-tree command above, to > refresh its stat cache information and trust us that the file content has > not changed, as to avoid any useless file read (though it will obviously > will have to stat all of them, but that's not something we can really > avoid) I don't think there's any way to do that, unfortunately. > If not, I am willing to implement a --assume-content-unchanged to the git > update-index if you guys don't see something fundamentally wrong with this > approach. If you do that, I think you should go with either of the following options - Extend git-update-index --index-info to take stat info as well (or maybe make a new option instead). Then you can feed stat info directly to git without a use-case-specific "assume-content-unchanged". - Add "git update-index --touch" that does what "touch" does. In this case, it blindly updates stat info to latest. But like touch, we can also specify mtime from command line if we need to. It's a bit less generic than the above option, but easier to use. Caveat: The options I'm proposing can be rejected. So maybe wait a bit to see how people feel and perhaps send an RFC patch, again to gauge the reception. -- Duy ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Refreshing index timestamps without reading content 2017-01-09 12:02 ` Duy Nguyen @ 2017-01-09 12:17 ` Quentin Casasnovas 2017-01-09 12:22 ` Quentin Casasnovas 2017-01-09 15:01 ` Junio C Hamano 1 sibling, 1 reply; 7+ messages in thread From: Quentin Casasnovas @ 2017-01-09 12:17 UTC (permalink / raw) To: Duy Nguyen; +Cc: Quentin Casasnovas, Git Mailing List [-- Attachment #1.1: Type: text/plain, Size: 1736 bytes --] On Mon, Jan 09, 2017 at 07:02:45PM +0700, Duy Nguyen wrote: > On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas > <quentin.casasnovas@oracle.com> wrote: > > > If not, I am willing to implement a --assume-content-unchanged to the git > > update-index if you guys don't see something fundamentally wrong with this > > approach. > > If you do that, I think you should go with either of the following options > > - Extend git-update-index --index-info to take stat info as well (or > maybe make a new option instead). Then you can feed stat info directly > to git without a use-case-specific "assume-content-unchanged". > > - Add "git update-index --touch" that does what "touch" does. In this > case, it blindly updates stat info to latest. But like touch, we can > also specify mtime from command line if we need to. It's a bit less > generic than the above option, but easier to use. > > Caveat: The options I'm proposing can be rejected. So maybe wait a bit > to see how people feel and perhaps send an RFC patch, again to gauge > the reception. Hey Duy, Thanks for your answer. I've played with this a bit last week and added an extra command, which I called --refresh-stat, which works like your suggested --index. It works very well with my use case and improves the performances very significantly on some of our use cases. It is attached to this e-mail to gather comments, as you suggest, and is really not meant to be reviewed for inclusion yet as it lacks test cases, documentation changes, etc. It is just a convenient way to show what I need and receive comments. The logic is simple enugh and will just skip calling ie_modified() when refreshing the index. Cheers, Q [-- Attachment #1.2: 0001-git-update-index-add-refresh-stat-option.patch --] [-- Type: text/x-diff, Size: 5346 bytes --] From 461b4f75056fc476db6bbdf8bc3ff6e91a8ad08d Mon Sep 17 00:00:00 2001 From: Quentin Casasnovas <quentin.casasnovas@oracle.com> Date: Sat, 7 Jan 2017 09:26:29 +0100 Subject: [PATCH] git-update-index: add --refresh-stat option. git-update-index --refresh, contrary to what the documentation says, not only will refresh the stat information but will also update the sha1 of the objects in the working tree if the stat information is found to be out-of-date. We add a --refresh-stat option, which like --refresh will update the stat information, but will assume that any file in the working tree that is present in the index matches the index - hence prevents unecessarily reading the files in the working tree. It is different from --assume-unchanged or --skip-worktree in that the new stat information is recorded in the index, hence subsequent git update-index --refresh will not read the files if their mtime hasn't changed. This sounds like a very dangerous option to give to the user, since it could potentially cause changed files to be missed, but can be needed when we are sure the working tree files have not changed and are matching the index. Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> --- builtin/update-index.c | 10 ++++++++++ cache.h | 3 +++ read-cache.c | 7 +++++-- 3 files changed, 18 insertions(+), 2 deletions(-) diff --git a/builtin/update-index.c b/builtin/update-index.c index f3f07e7f1cb2..1215b6a09687 100644 --- a/builtin/update-index.c +++ b/builtin/update-index.c @@ -777,6 +777,12 @@ static int really_refresh_callback(const struct option *opt, return refresh(opt->value, REFRESH_REALLY); } +static int refresh_stat_callback(const struct option *opt, + const char *arg, int unset) +{ + return refresh(opt->value, REFRESH_STAT_ONLY); +} + static int chmod_callback(const struct option *opt, const char *arg, int unset) { @@ -934,6 +940,10 @@ int cmd_update_index(int argc, const char **argv, const char *prefix) N_("refresh stat information"), PARSE_OPT_NOARG | PARSE_OPT_NONEG, refresh_callback}, + {OPTION_CALLBACK, 0, "refresh-stat", &refresh_args, NULL, + N_("refresh only stat information (assume content has not changed)"), + PARSE_OPT_NOARG | PARSE_OPT_NONEG, + refresh_stat_callback}, {OPTION_CALLBACK, 0, "really-refresh", &refresh_args, NULL, N_("like --refresh, but ignore assume-unchanged setting"), PARSE_OPT_NOARG | PARSE_OPT_NONEG, diff --git a/cache.h b/cache.h index a50a61a19787..57d0f9fffbe5 100644 --- a/cache.h +++ b/cache.h @@ -611,6 +611,8 @@ extern void *read_blob_data_from_index(struct index_state *, const char *, unsig #define CE_MATCH_IGNORE_MISSING 0x08 /* enable stat refresh */ #define CE_MATCH_REFRESH 0x10 +/* only stat refresh, assume content hasn't changed */ +#define CE_MATCH_REFRESH_STAT_ONLY 0x20 extern int ie_match_stat(const struct index_state *, const struct cache_entry *, struct stat *, unsigned int); extern int ie_modified(const struct index_state *, const struct cache_entry *, struct stat *, unsigned int); @@ -643,6 +645,7 @@ extern void fill_stat_cache_info(struct cache_entry *ce, struct stat *st); #define REFRESH_IGNORE_MISSING 0x0008 /* ignore non-existent */ #define REFRESH_IGNORE_SUBMODULES 0x0010 /* ignore submodules */ #define REFRESH_IN_PORCELAIN 0x0020 /* user friendly output, not "needs update" */ +#define REFRESH_STAT_ONLY 0x0040 /* do not check content but update stat */ extern int refresh_index(struct index_state *, unsigned int flags, const struct pathspec *pathspec, char *seen, const char *header_msg); extern struct cache_entry *refresh_cache_entry(struct cache_entry *, unsigned int); diff --git a/read-cache.c b/read-cache.c index db5d91064266..e9334094c6f0 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1101,6 +1101,7 @@ static struct cache_entry *refresh_cache_ent(struct index_state *istate, int ignore_valid = options & CE_MATCH_IGNORE_VALID; int ignore_skip_worktree = options & CE_MATCH_IGNORE_SKIP_WORKTREE; int ignore_missing = options & CE_MATCH_IGNORE_MISSING; + int assume_content_unchanged = options & CE_MATCH_REFRESH_STAT_ONLY; if (!refresh || ce_uptodate(ce)) return ce; @@ -1161,7 +1162,7 @@ static struct cache_entry *refresh_cache_ent(struct index_state *istate, } } - if (ie_modified(istate, ce, &st, options)) { + if (!assume_content_unchanged && ie_modified(istate, ce, &st, options)) { if (err) *err = EINVAL; return NULL; @@ -1206,11 +1207,13 @@ int refresh_index(struct index_state *istate, unsigned int flags, int quiet = (flags & REFRESH_QUIET) != 0; int not_new = (flags & REFRESH_IGNORE_MISSING) != 0; int ignore_submodules = (flags & REFRESH_IGNORE_SUBMODULES) != 0; + int refresh_stat_only = (flags & REFRESH_STAT_ONLY) != 0; int first = 1; int in_porcelain = (flags & REFRESH_IN_PORCELAIN); unsigned int options = (CE_MATCH_REFRESH | (really ? CE_MATCH_IGNORE_VALID : 0) | - (not_new ? CE_MATCH_IGNORE_MISSING : 0)); + (not_new ? CE_MATCH_IGNORE_MISSING : 0) | + (refresh_stat_only ? CE_MATCH_REFRESH_STAT_ONLY : 0)); const char *modified_fmt; const char *deleted_fmt; const char *typechange_fmt; -- 2.11.0 [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 801 bytes --] ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: Refreshing index timestamps without reading content 2017-01-09 12:17 ` Quentin Casasnovas @ 2017-01-09 12:22 ` Quentin Casasnovas 0 siblings, 0 replies; 7+ messages in thread From: Quentin Casasnovas @ 2017-01-09 12:22 UTC (permalink / raw) To: Quentin Casasnovas; +Cc: Duy Nguyen, Git Mailing List [-- Attachment #1: Type: text/plain, Size: 1837 bytes --] On Mon, Jan 09, 2017 at 01:17:24PM +0100, Quentin Casasnovas wrote: > On Mon, Jan 09, 2017 at 07:02:45PM +0700, Duy Nguyen wrote: > > On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas > > <quentin.casasnovas@oracle.com> wrote: > > > > > If not, I am willing to implement a --assume-content-unchanged to the git > > > update-index if you guys don't see something fundamentally wrong with this > > > approach. > > > > If you do that, I think you should go with either of the following options > > > > - Extend git-update-index --index-info to take stat info as well (or > > maybe make a new option instead). Then you can feed stat info directly > > to git without a use-case-specific "assume-content-unchanged". > > > > - Add "git update-index --touch" that does what "touch" does. In this > > case, it blindly updates stat info to latest. But like touch, we can > > also specify mtime from command line if we need to. It's a bit less > > generic than the above option, but easier to use. > > > > Caveat: The options I'm proposing can be rejected. So maybe wait a bit > > to see how people feel and perhaps send an RFC patch, again to gauge > > the reception. > > > Hey Duy, > > Thanks for your answer. > > I've played with this a bit last week and added an extra command, which I > called --refresh-stat, which works like your suggested --index. It works ^^^^^^^ Whoops, sorry! [...] *a bit* like your suggested *--touch*. [...] I don't really need it to support specific mtime from the CLI for my different use cases, but if someone has some ideas in how that would be useful, I can try implementing your --index-info change to support stat information to be passed around instead of my current simple implementation. Q [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 801 bytes --] ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Refreshing index timestamps without reading content 2017-01-09 12:02 ` Duy Nguyen 2017-01-09 12:17 ` Quentin Casasnovas @ 2017-01-09 15:01 ` Junio C Hamano 2017-01-09 15:55 ` Quentin Casasnovas 1 sibling, 1 reply; 7+ messages in thread From: Junio C Hamano @ 2017-01-09 15:01 UTC (permalink / raw) To: Duy Nguyen; +Cc: Quentin Casasnovas, Git Mailing List Duy Nguyen <pclouds@gmail.com> writes: > On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas > <quentin.casasnovas@oracle.com> wrote: >> Is there any way to tell git, after the git ls-tree command above, to >> refresh its stat cache information and trust us that the file content has >> not changed, as to avoid any useless file read (though it will obviously >> will have to stat all of them, but that's not something we can really >> avoid) > > I don't think there's any way to do that, unfortunately. Lose "unfortunately". >> If not, I am willing to implement a --assume-content-unchanged to the git >> update-index if you guys don't see something fundamentally wrong with this >> approach. > > If you do that, I think you should go with either of the following options > > - Extend git-update-index --index-info to take stat info as well (or > maybe make a new option instead). Then you can feed stat info directly > to git without a use-case-specific "assume-content-unchanged". > > - Add "git update-index --touch" that does what "touch" does. In this > case, it blindly updates stat info to latest. But like touch, we can > also specify mtime from command line if we need to. It's a bit less > generic than the above option, but easier to use. Even if we assume that it is a good idea to let people muck with the index like this, either of the above would be a usable addition, because the cached stat information does not consist solely of mtime. "git update-index --index-info" was invented for the case where a user or a script _knows_ the object ID of the blob that _would_ result if a contents of a file on the filesystem were run through hash-object. So from the interface's point of view, it may make sense to teach it to take an extra/optional argument that is the path to the file and take the stat info out of the named file when the extra/optional argument was given. But that assumes that it is a good idea to do this in the first place. It was deliberate design decision that setting the cached stat info for the entry was protected behind actual content comparison, and removing that protection will open the index to abuse. The userbase of Git has grown wide enough that it is harder to say "If you lie that a file whose contents does not match the index is up to date using this mechanism, you will lose data and all bad things happen---you can keep both halves". Once we release a version of Git with such a "feature", the first bug report will be "I did not want to run 'update-index --refresh' because it takes time, and some index entries apparently did not match what is on the filesystem, and I got a corrupt working file after a merge. Git should make sure that the contents match when using the new 'path to the file' argument when updating the cached stat info!". I do not have a good answer to such a bug report. So... ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Refreshing index timestamps without reading content 2017-01-09 15:01 ` Junio C Hamano @ 2017-01-09 15:55 ` Quentin Casasnovas 2017-01-10 14:17 ` Quentin Casasnovas 0 siblings, 1 reply; 7+ messages in thread From: Quentin Casasnovas @ 2017-01-09 15:55 UTC (permalink / raw) To: Junio C Hamano; +Cc: Duy Nguyen, Quentin Casasnovas, Git Mailing List [-- Attachment #1: Type: text/plain, Size: 6095 bytes --] On Mon, Jan 09, 2017 at 07:01:36AM -0800, Junio C Hamano wrote: > Duy Nguyen <pclouds@gmail.com> writes: > > > On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas > > <quentin.casasnovas@oracle.com> wrote: > >> Is there any way to tell git, after the git ls-tree command above, to > >> refresh its stat cache information and trust us that the file content has > >> not changed, as to avoid any useless file read (though it will obviously > >> will have to stat all of them, but that's not something we can really > >> avoid) > > > > I don't think there's any way to do that, unfortunately. > > Lose "unfortunately". > > >> If not, I am willing to implement a --assume-content-unchanged to the git > >> update-index if you guys don't see something fundamentally wrong with this > >> approach. > > > > If you do that, I think you should go with either of the following options > > > > - Extend git-update-index --index-info to take stat info as well (or > > maybe make a new option instead). Then you can feed stat info directly > > to git without a use-case-specific "assume-content-unchanged". > > > > - Add "git update-index --touch" that does what "touch" does. In this > > case, it blindly updates stat info to latest. But like touch, we can > > also specify mtime from command line if we need to. It's a bit less > > generic than the above option, but easier to use. > > Even if we assume that it is a good idea to let people muck with the > index like this, either of the above would be a usable addition, > because the cached stat information does not consist solely of > mtime. > > "git update-index --index-info" was invented for the case where a > user or a script _knows_ the object ID of the blob that _would_ > result if a contents of a file on the filesystem were run through > hash-object. So from the interface's point of view, it may make > sense to teach it to take an extra/optional argument that is the > path to the file and take the stat info out of the named file when > the extra/optional argument was given. > > But that assumes that it is a good idea to do this in the first > place. It was deliberate design decision that setting the cached > stat info for the entry was protected behind actual content > comparison, and removing that protection will open the index to > abuse. > Hi Junio, Thanks for your feedback, appreciated :) I do understand how it would be possible for someone to shoot themselves in the feet with such option, but it solves real life use cases and improved build times very signficantly here. Another use case we have is setting up very lightweight linux work trees, by reflinking from a base work-tree. This allows for a completely different work-tree taking up almost no size at first, whereas using a shared clone or the recent worktree subcommand would "waste" ~500MB*: # linux-2.6 is a shared clone of a bare clone residing locally ~ $ cp --reflink -a linux-2.6 linux-2.6-reflinked # At this point, the mtime inside linux-2.6-reflinked are matching the # mtime of the source linux-2.6 (since we used the '-a' option of 'cp) ~ $ diff -u <(stat linux-2.6/README) <(stat linux-2.6-reflinked/README) --- /proc/self/fd/11 2017-01-09 16:34:04.523438942 +0100 +++ /proc/self/fd/12 2017-01-09 16:34:04.523438942 +0100 @@ -1,8 +1,8 @@ - File: 'linux-2.6/README' + File: 'linux-2.6-reflinked/README' Size: 18372 Blocks: 40 IO Block: 4096 regular file -Device: fd00h/64768d Inode: 268467090 Links: 1 +Device: fd00h/64768d Inode: 805970606 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 1000/ quentin) Gid: ( 1000/ quentin) Access: 2017-01-09 12:04:15.317758718 +0100 Modify: 2017-01-09 12:04:12.566758772 +0100 -Change: 2017-01-09 12:04:12.566758772 +0100 +Change: 2017-01-09 16:29:48.305444003 +0100 Birth: # Now let's check how long it takes to refresh the index from the source # and destination.. ~/linux-2.6 $ time git update-index --refresh git update-index --refresh 0.04s user 0.08s system 204% cpu 0.058 total ~~~~~~~~~~~ ~/linux-2.6-reflinked $ time git update-index --refresh git update-index --refresh 2.40s user 1.43s system 38% cpu 10.003 total ~~~~~~~~~~~~ This is quite a high penalty when a power user knows that his lightweight copy worktree matches the index! That's on a single worktree on a fairly decent SSD but our build farm would do this with hundreds of work trees in parallel, all residing on a spinning disks - the penalty would be much worse than 10 seconds wasted. I might have a look at adding reflinking awareness to the worktree subcommand (if it hasn't been implemented already) to avoid this index force refreshing but that still would not fix our other use case where we knowingly change the mtime of the files on our worktrees. * That's a rather low estimate: git ls-tree -zr --name-only v4.9 | du -csh --files0-from=- | tail -n1 747M total > The userbase of Git has grown wide enough that it is harder to say "If > you lie that a file whose contents does not match the index is up to date > using this mechanism, you will lose data and all bad things happen---you > can keep both halves". Once we release a version of Git with such a > "feature", the first bug report will be "I did not want to run > 'update-index --refresh' because it takes time, and some index entries > apparently did not match what is on the filesystem, and I got a corrupt > working file after a merge. Git should make sure that the contents match > when using the new 'path to the file' argument when updating the cached > stat info!". I do not have a good answer to such a bug report. > The good answer would probably be that this is intended behaviour and would explicitly be mentionned in the manual for that specific option? :) How about renaming the option (which is more of a sub-command) to --i-know-this-can-eat-my-data-but-refresh-stat-only? Then we can print a big warning saying this is generally a VERY bad idea? Q [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 801 bytes --] ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Refreshing index timestamps without reading content 2017-01-09 15:55 ` Quentin Casasnovas @ 2017-01-10 14:17 ` Quentin Casasnovas 0 siblings, 0 replies; 7+ messages in thread From: Quentin Casasnovas @ 2017-01-10 14:17 UTC (permalink / raw) To: Quentin Casasnovas Cc: Vegard Nossum, Junio C Hamano, Duy Nguyen, Git Mailing List [-- Attachment #1: Type: text/plain, Size: 5131 bytes --] On Mon, Jan 09, 2017 at 04:55:37PM +0100, Quentin Casasnovas wrote: > On Mon, Jan 09, 2017 at 07:01:36AM -0800, Junio C Hamano wrote: > > Duy Nguyen <pclouds@gmail.com> writes: > > > > > On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas > > > <quentin.casasnovas@oracle.com> wrote: > > >> Is there any way to tell git, after the git ls-tree command above, to > > >> refresh its stat cache information and trust us that the file content has > > >> not changed, as to avoid any useless file read (though it will obviously > > >> will have to stat all of them, but that's not something we can really > > >> avoid) > > > > > > I don't think there's any way to do that, unfortunately. > > > > Lose "unfortunately". > > > > >> If not, I am willing to implement a --assume-content-unchanged to the git > > >> update-index if you guys don't see something fundamentally wrong with this > > >> approach. > > > > > > If you do that, I think you should go with either of the following options > > > > > > - Extend git-update-index --index-info to take stat info as well (or > > > maybe make a new option instead). Then you can feed stat info directly > > > to git without a use-case-specific "assume-content-unchanged". > > > > > > - Add "git update-index --touch" that does what "touch" does. In this > > > case, it blindly updates stat info to latest. But like touch, we can > > > also specify mtime from command line if we need to. It's a bit less > > > generic than the above option, but easier to use. > > > > Even if we assume that it is a good idea to let people muck with the > > index like this, either of the above would be a usable addition, > > because the cached stat information does not consist solely of > > mtime. > > > > "git update-index --index-info" was invented for the case where a > > user or a script _knows_ the object ID of the blob that _would_ > > result if a contents of a file on the filesystem were run through > > hash-object. So from the interface's point of view, it may make > > sense to teach it to take an extra/optional argument that is the > > path to the file and take the stat info out of the named file when > > the extra/optional argument was given. > > > > But that assumes that it is a good idea to do this in the first > > place. It was deliberate design decision that setting the cached > > stat info for the entry was protected behind actual content > > comparison, and removing that protection will open the index to > > abuse. > > > > Hi Junio, > > Thanks for your feedback, appreciated :) > > I do understand how it would be possible for someone to shoot themselves in > the feet with such option, but it solves real life use cases and improved > build times very signficantly here. > > Another use case we have is setting up very lightweight linux work trees, > by reflinking from a base work-tree. This allows for a completely > different work-tree taking up almost no size at first, whereas using a > shared clone or the recent worktree subcommand would "waste" ~500MB*: > > # linux-2.6 is a shared clone of a bare clone residing locally > ~ $ cp --reflink -a linux-2.6 linux-2.6-reflinked > > # At this point, the mtime inside linux-2.6-reflinked are matching the > # mtime of the source linux-2.6 (since we used the '-a' option of 'cp) > ~ $ diff -u <(stat linux-2.6/README) <(stat linux-2.6-reflinked/README) > --- /proc/self/fd/11 2017-01-09 16:34:04.523438942 +0100 > +++ /proc/self/fd/12 2017-01-09 16:34:04.523438942 +0100 > @@ -1,8 +1,8 @@ > - File: 'linux-2.6/README' > + File: 'linux-2.6-reflinked/README' > Size: 18372 Blocks: 40 IO Block: 4096 regular file > -Device: fd00h/64768d Inode: 268467090 Links: 1 > +Device: fd00h/64768d Inode: 805970606 Links: 1 > Access: (0644/-rw-r--r--) Uid: ( 1000/ quentin) Gid: ( 1000/ quentin) > Access: 2017-01-09 12:04:15.317758718 +0100 > Modify: 2017-01-09 12:04:12.566758772 +0100 > -Change: 2017-01-09 12:04:12.566758772 +0100 > +Change: 2017-01-09 16:29:48.305444003 +0100 > Birth: > > # Now let's check how long it takes to refresh the index from the source > # and destination.. > ~/linux-2.6 $ time git update-index --refresh > git update-index --refresh 0.04s user 0.08s system 204% cpu 0.058 total > ~~~~~~~~~~~ > ~/linux-2.6-reflinked $ time git update-index --refresh > git update-index --refresh 2.40s user 1.43s system 38% cpu 10.003 total > ~~~~~~~~~~~~ > After discussing this with my friend Vegard, he found the core.checkStat config which, if set to 'minimal', ignores the inode number which is enough for the above use case to work just fine - so please excuse my ignorance! For the initial problem I had when changing the mtime of all the files in the tree, I should be able to change the mtime of the object files instead, hence I don't really need the patch I sent earlier. Sorry for the wasted time! :) Q [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 801 bytes --] ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2017-01-10 14:13 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-01-05 11:23 Refreshing index timestamps without reading content Quentin Casasnovas 2017-01-09 12:02 ` Duy Nguyen 2017-01-09 12:17 ` Quentin Casasnovas 2017-01-09 12:22 ` Quentin Casasnovas 2017-01-09 15:01 ` Junio C Hamano 2017-01-09 15:55 ` Quentin Casasnovas 2017-01-10 14:17 ` Quentin Casasnovas
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).