git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Refreshing index timestamps without reading content
@ 2017-01-05 11:23 Quentin Casasnovas
  2017-01-09 12:02 ` Duy Nguyen
  0 siblings, 1 reply; 7+ messages in thread
From: Quentin Casasnovas @ 2017-01-05 11:23 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 2657 bytes --]

Hi guys,

Apologies if this is documented somewhere, I have fairly bad search vudu
skills.

I'm looking for a way to cause a full refresh of the index without causing
any read of the files, basically telling git "trust me, all worktree files
are matching the index, but their stat information have changed".  I have
read about the update-index --assume-unchanged and --skip-worktree flags in
the documentation, but these do not cause any index refresh - rather, they
fake that the respective worktree files are matching the index until you
remove those assume-unchanged/skip-worktree bits.

This might sound like a really weird thing to do, but I do have a use case
for it - we have some build farm setup where the resulting objects of a
compilation are stored on a shared server.  The source files are not stored
on the shared server, but locally on each of the build server (as to
decrease network load and make good use of local storage as caches).

We then use an onion filesystem to mount the compiled objects on top of the
local sources - and change the modification time of the source to be older
than the object files, so that on subsequent builds, make does not rebuild
the whole world.

This works fine except for one thing, after changing the mtime of the
source files, the first subsequent git command needing to compare the tree
with the index will take a LONG time since it will read all of the object
content:

  cd linux-2.6

  # Less than a second  when the index is up to date
  time git status > /dev/null
  git status 0.06s user 0.09s system 172% cpu 0.087 total
                                              ~~~~~~~~~~~

  # Change the mtime..
  git ls-tree -r --name-only HEAD | xargs -n 1024 touch

  # Now 30s..
  time git status > /dev/null
  git status  2.73s user 1.79s system 13% cpu 32.453 total
                                              ~~~~~~~~~~~~

The timing information above was captured on my laptop SSD and the penalty
is obviously much higher on spinning disks - especially when this operation
is done on *hundreds* of different work tree in parallel, all hosted on the
same filesystem (it can take tens of minutes!).

Is there any way to tell git, after the git ls-tree command above, to
refresh its stat cache information and trust us that the file content has
not changed, as to avoid any useless file read (though it will obviously
will have to stat all of them, but that's not something we can really
avoid)

If not, I am willing to implement a --assume-content-unchanged to the git
update-index if you guys don't see something fundamentally wrong with this
approach.

Thanks for any hints you can give! :)

Q

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Refreshing index timestamps without reading content
  2017-01-05 11:23 Refreshing index timestamps without reading content Quentin Casasnovas
@ 2017-01-09 12:02 ` Duy Nguyen
  2017-01-09 12:17   ` Quentin Casasnovas
  2017-01-09 15:01   ` Junio C Hamano
  0 siblings, 2 replies; 7+ messages in thread
From: Duy Nguyen @ 2017-01-09 12:02 UTC (permalink / raw)
  To: Quentin Casasnovas; +Cc: Git Mailing List

On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas
<quentin.casasnovas@oracle.com> wrote:
> Hi guys,
>
> Apologies if this is documented somewhere, I have fairly bad search vudu
> skills.
>
> I'm looking for a way to cause a full refresh of the index without causing
> any read of the files, basically telling git "trust me, all worktree files
> are matching the index, but their stat information have changed".  I have
> read about the update-index --assume-unchanged and --skip-worktree flags in
> the documentation, but these do not cause any index refresh - rather, they
> fake that the respective worktree files are matching the index until you
> remove those assume-unchanged/skip-worktree bits.
>
> This might sound like a really weird thing to do, but I do have a use case
> for it - we have some build farm setup where the resulting objects of a
> compilation are stored on a shared server.  The source files are not stored
> on the shared server, but locally on each of the build server (as to
> decrease network load and make good use of local storage as caches).
>
> We then use an onion filesystem to mount the compiled objects on top of the
> local sources - and change the modification time of the source to be older
> than the object files, so that on subsequent builds, make does not rebuild
> the whole world.
>
> This works fine except for one thing, after changing the mtime of the
> source files, the first subsequent git command needing to compare the tree
> with the index will take a LONG time since it will read all of the object
> content:
>
>   cd linux-2.6
>
>   # Less than a second  when the index is up to date
>   time git status > /dev/null
>   git status 0.06s user 0.09s system 172% cpu 0.087 total
>                                               ~~~~~~~~~~~
>
>   # Change the mtime..
>   git ls-tree -r --name-only HEAD | xargs -n 1024 touch
>
>   # Now 30s..
>   time git status > /dev/null
>   git status  2.73s user 1.79s system 13% cpu 32.453 total
>                                               ~~~~~~~~~~~~
>
> The timing information above was captured on my laptop SSD and the penalty
> is obviously much higher on spinning disks - especially when this operation
> is done on *hundreds* of different work tree in parallel, all hosted on the
> same filesystem (it can take tens of minutes!).
>
> Is there any way to tell git, after the git ls-tree command above, to
> refresh its stat cache information and trust us that the file content has
> not changed, as to avoid any useless file read (though it will obviously
> will have to stat all of them, but that's not something we can really
> avoid)

I don't think there's any way to do that, unfortunately.

> If not, I am willing to implement a --assume-content-unchanged to the git
> update-index if you guys don't see something fundamentally wrong with this
> approach.

If you do that, I think you should go with either of the following options

- Extend git-update-index --index-info to take stat info as well (or
maybe make a new option instead). Then you can feed stat info directly
to git without a use-case-specific "assume-content-unchanged".

- Add "git update-index --touch" that does what "touch" does. In this
case, it blindly updates stat info to latest. But like touch, we can
also specify  mtime from command line if we need to. It's a bit less
generic than the above option, but easier to use.

Caveat: The options I'm proposing can be rejected. So maybe wait a bit
to see how people feel and perhaps send an RFC patch, again to gauge
the reception.

-- 
Duy

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Refreshing index timestamps without reading content
  2017-01-09 12:02 ` Duy Nguyen
@ 2017-01-09 12:17   ` Quentin Casasnovas
  2017-01-09 12:22     ` Quentin Casasnovas
  2017-01-09 15:01   ` Junio C Hamano
  1 sibling, 1 reply; 7+ messages in thread
From: Quentin Casasnovas @ 2017-01-09 12:17 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Quentin Casasnovas, Git Mailing List


[-- Attachment #1.1: Type: text/plain, Size: 1736 bytes --]

On Mon, Jan 09, 2017 at 07:02:45PM +0700, Duy Nguyen wrote:
> On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas
> <quentin.casasnovas@oracle.com> wrote:
> 
> > If not, I am willing to implement a --assume-content-unchanged to the git
> > update-index if you guys don't see something fundamentally wrong with this
> > approach.
> 
> If you do that, I think you should go with either of the following options
> 
> - Extend git-update-index --index-info to take stat info as well (or
> maybe make a new option instead). Then you can feed stat info directly
> to git without a use-case-specific "assume-content-unchanged".
> 
> - Add "git update-index --touch" that does what "touch" does. In this
> case, it blindly updates stat info to latest. But like touch, we can
> also specify  mtime from command line if we need to. It's a bit less
> generic than the above option, but easier to use.
> 
> Caveat: The options I'm proposing can be rejected. So maybe wait a bit
> to see how people feel and perhaps send an RFC patch, again to gauge
> the reception.


Hey Duy,

Thanks for your answer.

I've played with this a bit last week and added an extra command, which I
called --refresh-stat, which works like your suggested --index.  It works
very well with my use case and improves the performances very significantly
on some of our use cases.

It is attached to this e-mail to gather comments, as you suggest, and is
really not meant to be reviewed for inclusion yet as it lacks test cases,
documentation changes, etc.  It is just a convenient way to show what I
need and receive comments.

The logic is simple enugh and will just skip calling ie_modified() when
refreshing the index.

Cheers,
Q

[-- Attachment #1.2: 0001-git-update-index-add-refresh-stat-option.patch --]
[-- Type: text/x-diff, Size: 5346 bytes --]

From 461b4f75056fc476db6bbdf8bc3ff6e91a8ad08d Mon Sep 17 00:00:00 2001
From: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Date: Sat, 7 Jan 2017 09:26:29 +0100
Subject: [PATCH] git-update-index: add --refresh-stat option.

git-update-index --refresh, contrary to what the documentation says, not
only will refresh the stat information but will also update the sha1 of the
objects in the working tree if the stat information is found to be
out-of-date.

We add a --refresh-stat option, which like --refresh will update the stat
information, but will assume that any file in the working tree that is
present in the index matches the index - hence prevents unecessarily
reading the files in the working tree.  It is different from
--assume-unchanged or --skip-worktree in that the new stat information is
recorded in the index, hence subsequent git update-index --refresh will not
read the files if their mtime hasn't changed.

This sounds like a very dangerous option to give to the user, since it
could potentially cause changed files to be missed, but can be needed when
we are sure the working tree files have not changed and are matching the
index.

Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
---
 builtin/update-index.c | 10 ++++++++++
 cache.h                |  3 +++
 read-cache.c           |  7 +++++--
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/builtin/update-index.c b/builtin/update-index.c
index f3f07e7f1cb2..1215b6a09687 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -777,6 +777,12 @@ static int really_refresh_callback(const struct option *opt,
 	return refresh(opt->value, REFRESH_REALLY);
 }
 
+static int refresh_stat_callback(const struct option *opt,
+				const char *arg, int unset)
+{
+	return refresh(opt->value, REFRESH_STAT_ONLY);
+}
+
 static int chmod_callback(const struct option *opt,
 				const char *arg, int unset)
 {
@@ -934,6 +940,10 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 			N_("refresh stat information"),
 			PARSE_OPT_NOARG | PARSE_OPT_NONEG,
 			refresh_callback},
+		{OPTION_CALLBACK, 0, "refresh-stat", &refresh_args, NULL,
+			N_("refresh only stat information (assume content has not changed)"),
+			PARSE_OPT_NOARG | PARSE_OPT_NONEG,
+			refresh_stat_callback},
 		{OPTION_CALLBACK, 0, "really-refresh", &refresh_args, NULL,
 			N_("like --refresh, but ignore assume-unchanged setting"),
 			PARSE_OPT_NOARG | PARSE_OPT_NONEG,
diff --git a/cache.h b/cache.h
index a50a61a19787..57d0f9fffbe5 100644
--- a/cache.h
+++ b/cache.h
@@ -611,6 +611,8 @@ extern void *read_blob_data_from_index(struct index_state *, const char *, unsig
 #define CE_MATCH_IGNORE_MISSING		0x08
 /* enable stat refresh */
 #define CE_MATCH_REFRESH		0x10
+/* only stat refresh, assume content hasn't changed */
+#define CE_MATCH_REFRESH_STAT_ONLY	0x20
 extern int ie_match_stat(const struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
 extern int ie_modified(const struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
 
@@ -643,6 +645,7 @@ extern void fill_stat_cache_info(struct cache_entry *ce, struct stat *st);
 #define REFRESH_IGNORE_MISSING	0x0008	/* ignore non-existent */
 #define REFRESH_IGNORE_SUBMODULES	0x0010	/* ignore submodules */
 #define REFRESH_IN_PORCELAIN	0x0020	/* user friendly output, not "needs update" */
+#define REFRESH_STAT_ONLY	0x0040	/* do not check content but update stat */
 extern int refresh_index(struct index_state *, unsigned int flags, const struct pathspec *pathspec, char *seen, const char *header_msg);
 extern struct cache_entry *refresh_cache_entry(struct cache_entry *, unsigned int);
 
diff --git a/read-cache.c b/read-cache.c
index db5d91064266..e9334094c6f0 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1101,6 +1101,7 @@ static struct cache_entry *refresh_cache_ent(struct index_state *istate,
 	int ignore_valid = options & CE_MATCH_IGNORE_VALID;
 	int ignore_skip_worktree = options & CE_MATCH_IGNORE_SKIP_WORKTREE;
 	int ignore_missing = options & CE_MATCH_IGNORE_MISSING;
+	int assume_content_unchanged = options & CE_MATCH_REFRESH_STAT_ONLY;
 
 	if (!refresh || ce_uptodate(ce))
 		return ce;
@@ -1161,7 +1162,7 @@ static struct cache_entry *refresh_cache_ent(struct index_state *istate,
 		}
 	}
 
-	if (ie_modified(istate, ce, &st, options)) {
+	if (!assume_content_unchanged && ie_modified(istate, ce, &st, options)) {
 		if (err)
 			*err = EINVAL;
 		return NULL;
@@ -1206,11 +1207,13 @@ int refresh_index(struct index_state *istate, unsigned int flags,
 	int quiet = (flags & REFRESH_QUIET) != 0;
 	int not_new = (flags & REFRESH_IGNORE_MISSING) != 0;
 	int ignore_submodules = (flags & REFRESH_IGNORE_SUBMODULES) != 0;
+	int refresh_stat_only = (flags & REFRESH_STAT_ONLY) != 0;
 	int first = 1;
 	int in_porcelain = (flags & REFRESH_IN_PORCELAIN);
 	unsigned int options = (CE_MATCH_REFRESH |
 				(really ? CE_MATCH_IGNORE_VALID : 0) |
-				(not_new ? CE_MATCH_IGNORE_MISSING : 0));
+				(not_new ? CE_MATCH_IGNORE_MISSING : 0) |
+				(refresh_stat_only ? CE_MATCH_REFRESH_STAT_ONLY : 0));
 	const char *modified_fmt;
 	const char *deleted_fmt;
 	const char *typechange_fmt;
-- 
2.11.0


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: Refreshing index timestamps without reading content
  2017-01-09 12:17   ` Quentin Casasnovas
@ 2017-01-09 12:22     ` Quentin Casasnovas
  0 siblings, 0 replies; 7+ messages in thread
From: Quentin Casasnovas @ 2017-01-09 12:22 UTC (permalink / raw)
  To: Quentin Casasnovas; +Cc: Duy Nguyen, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 1837 bytes --]

On Mon, Jan 09, 2017 at 01:17:24PM +0100, Quentin Casasnovas wrote:
> On Mon, Jan 09, 2017 at 07:02:45PM +0700, Duy Nguyen wrote:
> > On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas
> > <quentin.casasnovas@oracle.com> wrote:
> > 
> > > If not, I am willing to implement a --assume-content-unchanged to the git
> > > update-index if you guys don't see something fundamentally wrong with this
> > > approach.
> > 
> > If you do that, I think you should go with either of the following options
> > 
> > - Extend git-update-index --index-info to take stat info as well (or
> > maybe make a new option instead). Then you can feed stat info directly
> > to git without a use-case-specific "assume-content-unchanged".
> > 
> > - Add "git update-index --touch" that does what "touch" does. In this
> > case, it blindly updates stat info to latest. But like touch, we can
> > also specify  mtime from command line if we need to. It's a bit less
> > generic than the above option, but easier to use.
> > 
> > Caveat: The options I'm proposing can be rejected. So maybe wait a bit
> > to see how people feel and perhaps send an RFC patch, again to gauge
> > the reception.
> 
> 
> Hey Duy,
> 
> Thanks for your answer.
> 
> I've played with this a bit last week and added an extra command, which I
> called --refresh-stat, which works like your suggested --index.  It works
                                                         ^^^^^^^

Whoops, sorry!

[...] *a bit* like your suggested *--touch*. [...]

I don't really need it to support specific mtime from the CLI for my
different use cases, but if someone has some ideas in how that would be
useful, I can try implementing your --index-info change to support stat
information to be passed around instead of my current simple
implementation.

Q

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Refreshing index timestamps without reading content
  2017-01-09 12:02 ` Duy Nguyen
  2017-01-09 12:17   ` Quentin Casasnovas
@ 2017-01-09 15:01   ` Junio C Hamano
  2017-01-09 15:55     ` Quentin Casasnovas
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2017-01-09 15:01 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Quentin Casasnovas, Git Mailing List

Duy Nguyen <pclouds@gmail.com> writes:

> On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas
> <quentin.casasnovas@oracle.com> wrote:
>> Is there any way to tell git, after the git ls-tree command above, to
>> refresh its stat cache information and trust us that the file content has
>> not changed, as to avoid any useless file read (though it will obviously
>> will have to stat all of them, but that's not something we can really
>> avoid)
>
> I don't think there's any way to do that, unfortunately.

Lose "unfortunately".

>> If not, I am willing to implement a --assume-content-unchanged to the git
>> update-index if you guys don't see something fundamentally wrong with this
>> approach.
>
> If you do that, I think you should go with either of the following options
>
> - Extend git-update-index --index-info to take stat info as well (or
> maybe make a new option instead). Then you can feed stat info directly
> to git without a use-case-specific "assume-content-unchanged".
>
> - Add "git update-index --touch" that does what "touch" does. In this
> case, it blindly updates stat info to latest. But like touch, we can
> also specify  mtime from command line if we need to. It's a bit less
> generic than the above option, but easier to use.

Even if we assume that it is a good idea to let people muck with the
index like this, either of the above would be a usable addition,
because the cached stat information does not consist solely of
mtime.

"git update-index --index-info" was invented for the case where a
user or a script _knows_ the object ID of the blob that _would_
result if a contents of a file on the filesystem were run through
hash-object.  So from the interface's point of view, it may make
sense to teach it to take an extra/optional argument that is the
path to the file and take the stat info out of the named file when
the extra/optional argument was given.

But that assumes that it is a good idea to do this in the first
place.  It was deliberate design decision that setting the cached
stat info for the entry was protected behind actual content
comparison, and removing that protection will open the index to
abuse.

The userbase of Git has grown wide enough that it is harder to say
"If you lie that a file whose contents does not match the index is
up to date using this mechanism, you will lose data and all bad
things happen---you can keep both halves".  Once we release a
version of Git with such a "feature", the first bug report will be
"I did not want to run 'update-index --refresh' because it takes
time, and some index entries apparently did not match what is on the
filesystem, and I got a corrupt working file after a merge.  Git
should make sure that the contents match when using the new 'path to
the file' argument when updating the cached stat info!".  I do not
have a good answer to such a bug report.

So...

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Refreshing index timestamps without reading content
  2017-01-09 15:01   ` Junio C Hamano
@ 2017-01-09 15:55     ` Quentin Casasnovas
  2017-01-10 14:17       ` Quentin Casasnovas
  0 siblings, 1 reply; 7+ messages in thread
From: Quentin Casasnovas @ 2017-01-09 15:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Quentin Casasnovas, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 6095 bytes --]

On Mon, Jan 09, 2017 at 07:01:36AM -0800, Junio C Hamano wrote:
> Duy Nguyen <pclouds@gmail.com> writes:
>
> > On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas
> > <quentin.casasnovas@oracle.com> wrote:
> >> Is there any way to tell git, after the git ls-tree command above, to
> >> refresh its stat cache information and trust us that the file content has
> >> not changed, as to avoid any useless file read (though it will obviously
> >> will have to stat all of them, but that's not something we can really
> >> avoid)
> >
> > I don't think there's any way to do that, unfortunately.
>
> Lose "unfortunately".
>
> >> If not, I am willing to implement a --assume-content-unchanged to the git
> >> update-index if you guys don't see something fundamentally wrong with this
> >> approach.
> >
> > If you do that, I think you should go with either of the following options
> >
> > - Extend git-update-index --index-info to take stat info as well (or
> > maybe make a new option instead). Then you can feed stat info directly
> > to git without a use-case-specific "assume-content-unchanged".
> >
> > - Add "git update-index --touch" that does what "touch" does. In this
> > case, it blindly updates stat info to latest. But like touch, we can
> > also specify  mtime from command line if we need to. It's a bit less
> > generic than the above option, but easier to use.
>
> Even if we assume that it is a good idea to let people muck with the
> index like this, either of the above would be a usable addition,
> because the cached stat information does not consist solely of
> mtime.
>
> "git update-index --index-info" was invented for the case where a
> user or a script _knows_ the object ID of the blob that _would_
> result if a contents of a file on the filesystem were run through
> hash-object.  So from the interface's point of view, it may make
> sense to teach it to take an extra/optional argument that is the
> path to the file and take the stat info out of the named file when
> the extra/optional argument was given.
>
> But that assumes that it is a good idea to do this in the first
> place.  It was deliberate design decision that setting the cached
> stat info for the entry was protected behind actual content
> comparison, and removing that protection will open the index to
> abuse.
>

Hi Junio,

Thanks for your feedback, appreciated :)

I do understand how it would be possible for someone to shoot themselves in
the feet with such option, but it solves real life use cases and improved
build times very signficantly here.

Another use case we have is setting up very lightweight linux work trees,
by reflinking from a base work-tree.  This allows for a completely
different work-tree taking up almost no size at first, whereas using a
shared clone or the recent worktree subcommand would "waste" ~500MB*:

 # linux-2.6 is a shared clone of a bare clone residing locally
 ~ $ cp --reflink -a linux-2.6 linux-2.6-reflinked

 # At this point, the mtime inside linux-2.6-reflinked are matching the
 # mtime of the source linux-2.6 (since we used the '-a' option of 'cp)
 ~ $ diff -u <(stat linux-2.6/README) <(stat linux-2.6-reflinked/README)
 --- /proc/self/fd/11  2017-01-09 16:34:04.523438942 +0100
 +++ /proc/self/fd/12  2017-01-09 16:34:04.523438942 +0100
 @@ -1,8 +1,8 @@
 -  File: 'linux-2.6/README'
 +  File: 'linux-2.6-reflinked/README'
    Size: 18372		Blocks: 40         IO Block: 4096   regular file
 -Device: fd00h/64768d	Inode: 268467090   Links: 1
 +Device: fd00h/64768d	Inode: 805970606   Links: 1
  Access: (0644/-rw-r--r--)  Uid: ( 1000/ quentin)   Gid: ( 1000/ quentin)
  Access: 2017-01-09 12:04:15.317758718 +0100
  Modify: 2017-01-09 12:04:12.566758772 +0100
 -Change: 2017-01-09 12:04:12.566758772 +0100
 +Change: 2017-01-09 16:29:48.305444003 +0100
   Birth:

  # Now let's check how long it takes to refresh the index from the source
  # and destination..
  ~/linux-2.6 $ time git update-index --refresh
  git update-index --refresh  0.04s user 0.08s system 204% cpu 0.058 total
                                                               ~~~~~~~~~~~
  ~/linux-2.6-reflinked $ time git update-index --refresh
  git update-index --refresh  2.40s user 1.43s system 38% cpu 10.003 total
                                                              ~~~~~~~~~~~~

This is quite a high penalty when a power user knows that his lightweight
copy worktree matches the index!  That's on a single worktree on a fairly
decent SSD but our build farm would do this with hundreds of work trees in
parallel, all residing on a spinning disks - the penalty would be much
worse than 10 seconds wasted.

I might have a look at adding reflinking awareness to the worktree
subcommand (if it hasn't been implemented already) to avoid this index
force refreshing but that still would not fix our other use case where we
knowingly change the mtime of the files on our worktrees.

* That's a rather low estimate:

  git ls-tree -zr --name-only v4.9 | du -csh --files0-from=- | tail -n1
  747M  total

> The userbase of Git has grown wide enough that it is harder to say "If
> you lie that a file whose contents does not match the index is up to date
> using this mechanism, you will lose data and all bad things happen---you
> can keep both halves".  Once we release a version of Git with such a
> "feature", the first bug report will be "I did not want to run
> 'update-index --refresh' because it takes time, and some index entries
> apparently did not match what is on the filesystem, and I got a corrupt
> working file after a merge.  Git should make sure that the contents match
> when using the new 'path to the file' argument when updating the cached
> stat info!".  I do not have a good answer to such a bug report.
>

The good answer would probably be that this is intended behaviour and would
explicitly be mentionned in the manual for that specific option?  :)

How about renaming the option (which is more of a sub-command) to
--i-know-this-can-eat-my-data-but-refresh-stat-only?  Then we can print a
big warning saying this is generally a VERY bad idea?

Q

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Refreshing index timestamps without reading content
  2017-01-09 15:55     ` Quentin Casasnovas
@ 2017-01-10 14:17       ` Quentin Casasnovas
  0 siblings, 0 replies; 7+ messages in thread
From: Quentin Casasnovas @ 2017-01-10 14:17 UTC (permalink / raw)
  To: Quentin Casasnovas
  Cc: Vegard Nossum, Junio C Hamano, Duy Nguyen, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 5131 bytes --]

On Mon, Jan 09, 2017 at 04:55:37PM +0100, Quentin Casasnovas wrote:
> On Mon, Jan 09, 2017 at 07:01:36AM -0800, Junio C Hamano wrote:
> > Duy Nguyen <pclouds@gmail.com> writes:
> >
> > > On Thu, Jan 5, 2017 at 6:23 PM, Quentin Casasnovas
> > > <quentin.casasnovas@oracle.com> wrote:
> > >> Is there any way to tell git, after the git ls-tree command above, to
> > >> refresh its stat cache information and trust us that the file content has
> > >> not changed, as to avoid any useless file read (though it will obviously
> > >> will have to stat all of them, but that's not something we can really
> > >> avoid)
> > >
> > > I don't think there's any way to do that, unfortunately.
> >
> > Lose "unfortunately".
> >
> > >> If not, I am willing to implement a --assume-content-unchanged to the git
> > >> update-index if you guys don't see something fundamentally wrong with this
> > >> approach.
> > >
> > > If you do that, I think you should go with either of the following options
> > >
> > > - Extend git-update-index --index-info to take stat info as well (or
> > > maybe make a new option instead). Then you can feed stat info directly
> > > to git without a use-case-specific "assume-content-unchanged".
> > >
> > > - Add "git update-index --touch" that does what "touch" does. In this
> > > case, it blindly updates stat info to latest. But like touch, we can
> > > also specify  mtime from command line if we need to. It's a bit less
> > > generic than the above option, but easier to use.
> >
> > Even if we assume that it is a good idea to let people muck with the
> > index like this, either of the above would be a usable addition,
> > because the cached stat information does not consist solely of
> > mtime.
> >
> > "git update-index --index-info" was invented for the case where a
> > user or a script _knows_ the object ID of the blob that _would_
> > result if a contents of a file on the filesystem were run through
> > hash-object.  So from the interface's point of view, it may make
> > sense to teach it to take an extra/optional argument that is the
> > path to the file and take the stat info out of the named file when
> > the extra/optional argument was given.
> >
> > But that assumes that it is a good idea to do this in the first
> > place.  It was deliberate design decision that setting the cached
> > stat info for the entry was protected behind actual content
> > comparison, and removing that protection will open the index to
> > abuse.
> >
> 
> Hi Junio,
> 
> Thanks for your feedback, appreciated :)
> 
> I do understand how it would be possible for someone to shoot themselves in
> the feet with such option, but it solves real life use cases and improved
> build times very signficantly here.
> 
> Another use case we have is setting up very lightweight linux work trees,
> by reflinking from a base work-tree.  This allows for a completely
> different work-tree taking up almost no size at first, whereas using a
> shared clone or the recent worktree subcommand would "waste" ~500MB*:
> 
>  # linux-2.6 is a shared clone of a bare clone residing locally
>  ~ $ cp --reflink -a linux-2.6 linux-2.6-reflinked
> 
>  # At this point, the mtime inside linux-2.6-reflinked are matching the
>  # mtime of the source linux-2.6 (since we used the '-a' option of 'cp)
>  ~ $ diff -u <(stat linux-2.6/README) <(stat linux-2.6-reflinked/README)
>  --- /proc/self/fd/11  2017-01-09 16:34:04.523438942 +0100
>  +++ /proc/self/fd/12  2017-01-09 16:34:04.523438942 +0100
>  @@ -1,8 +1,8 @@
>  -  File: 'linux-2.6/README'
>  +  File: 'linux-2.6-reflinked/README'
>     Size: 18372		Blocks: 40         IO Block: 4096   regular file
>  -Device: fd00h/64768d	Inode: 268467090   Links: 1
>  +Device: fd00h/64768d	Inode: 805970606   Links: 1
>   Access: (0644/-rw-r--r--)  Uid: ( 1000/ quentin)   Gid: ( 1000/ quentin)
>   Access: 2017-01-09 12:04:15.317758718 +0100
>   Modify: 2017-01-09 12:04:12.566758772 +0100
>  -Change: 2017-01-09 12:04:12.566758772 +0100
>  +Change: 2017-01-09 16:29:48.305444003 +0100
>    Birth:
> 
>   # Now let's check how long it takes to refresh the index from the source
>   # and destination..
>   ~/linux-2.6 $ time git update-index --refresh
>   git update-index --refresh  0.04s user 0.08s system 204% cpu 0.058 total
>                                                                ~~~~~~~~~~~
>   ~/linux-2.6-reflinked $ time git update-index --refresh
>   git update-index --refresh  2.40s user 1.43s system 38% cpu 10.003 total
>                                                               ~~~~~~~~~~~~
> 

After discussing this with my friend Vegard, he found the core.checkStat
config which, if set to 'minimal', ignores the inode number which is enough
for the above use case to work just fine - so please excuse my ignorance!

For the initial problem I had when changing the mtime of all the files in
the tree, I should be able to change the mtime of the object files instead,
hence I don't really need the patch I sent earlier.

Sorry for the wasted time! :)

Q

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-01-10 14:13 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-05 11:23 Refreshing index timestamps without reading content Quentin Casasnovas
2017-01-09 12:02 ` Duy Nguyen
2017-01-09 12:17   ` Quentin Casasnovas
2017-01-09 12:22     ` Quentin Casasnovas
2017-01-09 15:01   ` Junio C Hamano
2017-01-09 15:55     ` Quentin Casasnovas
2017-01-10 14:17       ` Quentin Casasnovas

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).