Stat cache in .git/index hinders syncing of repositories

git@vger.kernel.org mailing list mirror (one of many)

* Stat cache in .git/index hinders syncing of repositories
@ 2020-01-17 23:57 Christoph Groth
  2020-01-18 18:15 ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Groth @ 2020-01-17 23:57 UTC (permalink / raw)
  To: git

Hello,

I am using unison to sync home directories across multiple machines.
This includes a fair number of git repositories and works very well.
Unison recently acquired a new feature that allows selected
subdirectories (like .git) to be treated atomically.  This makes the
syncing perfectly safe.

Some people say that one should use git itself to sync git working
directories, but IMHO these people overlook the difference between
collaboration (using git) and being able to continue one’s own
unfinished work on a different machine, including uncommitted files,
stashes, and - if it has to be - in the middle of a merge.  Moreover, it
is simpler not to have to treat git repositories specially when syncing.
Syncing git repositories is thus clearly useful.

However, there is one problem with syncing git repositories, that has
been noticed by multiple people [1]: The file .git/index contains not
only the “git index”, but also a cache of stat-data of the files in the
working directory.  Some file synchronizers are able to sync mtimes, but
syncing ctimes would be bizarre (if it is even possible).

So, say that machines A and B are synced.  A new git repository appears
on machine A.  The synchronizer is run which results in copying all the
files of the new repo verbatim to machine B.  Note that now on machine
B the cache inside the file .git/index contains invalid stat
information.  So when "git status" is run on B, .git/index gets
rewritten, and the next sync operation copies it back to A, where again
it is rewritten even by something as harmless as "git status".  And so
on, and so forth...

In my opinion the root of this ping-pong problem is that .git/index
mixes information about the status of the repository (=what has been
staged) that should be synced with a cache of machine-specific
filesystem metadata.
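
To make the ping-pong concrete, here is a minimal Python sketch of which
stat fields survive a copy.  It assumes a POSIX-like filesystem; the file
names are made up, and shutil.copy2 merely stands in for a synchronizer
that preserves mtimes.  It does not read .git/index itself.

```python
import os
import shutil
import tempfile


def stat_summary(path):
    """Return the stat fields discussed in this thread for one file."""
    st = os.stat(path)
    return {
        "size": st.st_size,
        "mtime_sec": int(st.st_mtime),
        "ctime_ns": st.st_ctime_ns,
        "inode": st.st_ino,
    }


def copy_like_a_synchronizer(src, dst):
    """Copy contents and mtime, then report the stat data of both files."""
    shutil.copy2(src, dst)  # copy2 preserves the mtime, but not ctime or inode
    return stat_summary(src), stat_summary(dst)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "tracked.txt")       # hypothetical tracked file
        dst = os.path.join(tmp, "tracked-copy.txt")  # its "synced" counterpart
        with open(src, "w") as f:
            f.write("hello\n")
        original, copy = copy_like_a_synchronizer(src, dst)
        for key in original:
            verdict = "same" if original[key] == copy[key] else "differs"
            print(key, original[key], copy[key], verdict)
```

Size and whole-second mtime carry over, while the inode number (and
normally the ctime) belong to the destination machine, so any cache that
records them is stale after the copy.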

I am not an expert on Git internals, but perhaps it would be a good idea
to move the cache into a separate file that could be put on an "ignore"
list for synchronizers?  It seems to me that this has already been
proposed in a different context [2], and I would not be surprised if
factoring out the cache had other beneficial effects.

If it is not feasible to separate the cache, perhaps another possibility
would be to add a new possible value for core.checkStat that would
disable stat structure checking except for file sizes?

As a workaround for now, I exclude .git/index from syncing.  This seems
to work quite well, but I would be scared to sync unfinished merges like
this.

Thanks
Christoph

[1] https://stackoverflow.com/questions/12126247/why-does-git-index-change-when-i-havent-done-anything-to-my-repository
[2] https://www.mail-archive.com/git@vger.kernel.org/msg48065.html

* Re: Stat cache in .git/index hinders syncing of repositories
  2020-01-17 23:57 Stat cache in .git/index hinders syncing of repositories Christoph Groth
@ 2020-01-18 18:15 ` Junio C Hamano
  2020-01-18 19:06   ` Christoph Groth
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2020-01-18 18:15 UTC (permalink / raw)
  To: Christoph Groth; +Cc: git

Christoph Groth <christoph@grothesque.org> writes:

> However, there is one problem with syncing git repositories, that has
> been noticed by multiple people [1]: The file .git/index contains not
> only the “git index”, but also a cache of stat-data of the files in the
> working directory.  Some file synchronizers are able to sync mtimes, but
> syncing ctimes would be bizarre (if it is even possible).

The stat-data in the index file is meant to be a mere optimization,
and after copying .git/index and the working tree files to a new
box, running "git update-index --refresh" would make them in sync,
no?

* Re: Stat cache in .git/index hinders syncing of repositories
  2020-01-18 18:15 ` Junio C Hamano
@ 2020-01-18 19:06   ` Christoph Groth
  2020-01-18 19:42     ` brian m. carlson
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Groth @ 2020-01-18 19:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano wrote:

> Christoph Groth <christoph@grothesque.org> writes:
>
> > However, there is one problem with syncing git repositories, that
> > has been noticed by multiple people [1]: The file .git/index
> > contains not only the “git index”, but also a cache of stat-data of
> > the files in the working directory.  Some file synchronizers are
> > able to sync mtimes, but syncing ctimes would be bizarre (if it is
> > even possible).
>
> The stat-data in the index file is meant to be a mere optimization,
> and after copying .git/index and the working tree files to a new box,
> running "git update-index --refresh" would make them in sync, no?

Let’s assume that one somehow manages to teach the synchronizer to run
"git update-index --refresh" immediately after copying the index file
and to consider the resulting modified .git/index file as identical to
the version on the other side.  (That would already be really difficult,
because it’s against the design of synchronizers to consider differing
files to be identical.)

Even then, .git/index will change again whenever the mtime of any
tracked file changes.  That can easily happen through "touch" or
modifying a file and then reverting the change.  There’s no way for the
synchronizer to tell the difference between relevant and irrelevant
changes to .git/index.  (Short of writing a specialized tool for
comparing git index files and embedding it.)

Would it be feasible to move this optimization data to a separate file
without breaking backwards and forwards compatibility?  I guess it
would: the data format of .git/index could remain identical, but newer
versions of git could ignore the stat data in the index file and use
a separate file for that.  If an older version of git is used with the
repository, it would simply notice that the stat data in .git/index is
not up-to-date.

But if the above is not feasible for some reason, would it be possible
to provide a switch for disabling stat caching optimization?

I believe that synchronizing files between machines is something that
will become ever more important with new tools like syncthing.  So it
would be really cool if git were to support such usage.  It *almost*
does!

* Re: Stat cache in .git/index hinders syncing of repositories
  2020-01-18 19:06   ` Christoph Groth
@ 2020-01-18 19:42     ` brian m. carlson
  2020-01-18 22:04       ` Christoph Groth
  0 siblings, 1 reply; 9+ messages in thread
From: brian m. carlson @ 2020-01-18 19:42 UTC (permalink / raw)
  To: Christoph Groth; +Cc: Junio C Hamano, git

On 2020-01-18 at 19:06:21, Christoph Groth wrote:
> But if the above is not feasible for some reason, would it be possible
> to provide a switch for disabling stat caching optimization?

Git is going to perform really terribly on repositories of any size if
you disable stat caching, so we're not very likely to implement such a
feature.  Even if we did implement it, you probably wouldn't want to use
it.

However, there are the core.checkStat and core.trustctime options which
can control which information is used in the stat caching.  You can
restrict it to the whole second part of mtime and the file size if you
want.  See git-config(1) for more details.

Note that this assumes that (a) your sync tool can honor mtimes and (b)
that your sync tool syncs to another system of the same type.  You may
still run into problems if you share files between Linux and Windows
because symbolic links are different sizes there.  (This also bites
WSL.)  Since rsync can do the former, I think it's a reasonable
expectation that other tools can as well.
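
As a rough illustration of what restricting the check means, here is a
simplified Python model of the comparison.  It follows the description in
git-config(1) only loosely and is not Git's actual code; the StatInfo
type and the looks_unchanged function are made-up names, and fields such
as uid, gid and device number are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatInfo:
    """The subset of stat fields this simplified model looks at."""
    size: int
    mtime_ns: int
    ctime_ns: int
    inode: int


def looks_unchanged(cached, current, minimal=False, trust_ctime=True):
    """Decide from stat data alone whether a file looks unmodified.

    minimal=True models core.checkStat=minimal: only whole seconds of the
    timestamps and the size are compared.  trust_ctime=False models
    core.trustctime=false: the ctime is ignored entirely.
    """
    if cached.size != current.size:
        return False
    unit = 10**9 if minimal else 1  # compare whole seconds or nanoseconds
    if cached.mtime_ns // unit != current.mtime_ns // unit:
        return False
    if trust_ctime and cached.ctime_ns // unit != current.ctime_ns // unit:
        return False
    if not minimal and cached.inode != current.inode:
        return False
    return True


if __name__ == "__main__":
    # Stat data cached on machine A, and what machine B sees after a sync
    # that preserved only the whole-second mtime.
    on_a = StatInfo(size=6, mtime_ns=100 * 10**9 + 5, ctime_ns=200 * 10**9, inode=11)
    on_b = StatInfo(size=6, mtime_ns=100 * 10**9, ctime_ns=300 * 10**9, inode=42)
    print(looks_unchanged(on_a, on_b))
    print(looks_unchanged(on_a, on_b, minimal=True, trust_ctime=False))
```

In this model the synced file only looks unchanged when both settings are
relaxed, which is why the sync tool has to preserve mtimes for this to
help at all.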

One final word of caution: you probably want to activate your sync tool
only manually and only when the repository is idle.  Tools like Dropbox
that automatically sync files one by one have been known to corrupt
repositories because the way they sync data leaves the repository in an
inconsistent state and doesn't honor standard POSIX file system
semantics which Git relies on for integrity.

Hopefully with that information you can find a configuration and tool
that work for you.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

* Re: Stat cache in .git/index hinders syncing of repositories
  2020-01-18 19:42     ` brian m. carlson
@ 2020-01-18 22:04       ` Christoph Groth
  2020-01-20 12:01         ` Johannes Schindelin
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Groth @ 2020-01-18 22:04 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Junio C Hamano, git

brian m. carlson wrote:
> On 2020-01-18 at 19:06:21, Christoph Groth wrote:
> > But if the above is not feasible for some reason, would it be
> > possible to provide a switch for disabling stat caching
> > optimization?
>
> Git is going to perform really terribly on repositories of any size if
> you disable stat caching, so we're not very likely to implement such
> a feature.  Even if we did implement it, you probably wouldn't want to
> use it.

OK, I see.  But please consider (one day) to split up the index file to
separate the local stat cache from the globally valid data.

(By the way, even after 12 years of using Git intensely I am confused
about what actually is the index.  I believed that it is the "staging
area", like in "git-add - Add file contents to the index".  But then the
.git/index file reflects all the tracked files, and not just staged
ones.  This usage is also reflected by the command "git update-index".)

> However, there are the core.checkStat and core.trustctime options
> which can control which information is used in the stat caching.  You
> can restrict it to the whole second part of mtime and the file size if
> you want.  See git-config(1) for more details.

Thanks a lot, that did the trick!  I’ve been already syncing mtimes.
Setting both core.checkStat and core.trustctime to the "weak" values
made the spurious modifications go away.

Still, this is a workaround, and the price is reduced robustness of file
modification detection.  Technically, that wouldn’t be necessary...
I hope that in practice it won’t matter.

> One final word of caution: you probably want to activate your sync
> tool only manually and only when the repository is idle.  Tools like
> Dropbox that automatically sync files one by one have been known to
> corrupt repositories because the way they sync data leaves the
> repository in an inconsistent state and doesn't honor standard POSIX
> file system semantics which Git relies on for integrity.

Yes, that’s why I still prefer Unison to more automatic real-time
tools.

* Re: Stat cache in .git/index hinders syncing of repositories
  2020-01-18 22:04       ` Christoph Groth
@ 2020-01-20 12:01         ` Johannes Schindelin
  2020-01-20 23:53           ` Christoph Groth
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Schindelin @ 2020-01-20 12:01 UTC (permalink / raw)
  To: Christoph Groth; +Cc: brian m. carlson, Junio C Hamano, git

Hi Christoph,

On Sat, 18 Jan 2020, Christoph Groth wrote:

> brian m. carlson wrote:
> > On 2020-01-18 at 19:06:21, Christoph Groth wrote:
> > > But if the above is not feasible for some reason, would it be
> > > possible to provide a switch for disabling stat caching
> > > optimization?
> >
> > Git is going to perform really terribly on repositories of any size if
> > you disable stat caching, so we're not very likely to implement such
> > a feature.  Even if we did implement it, you probably wouldn't want to
> > use it.
>
> OK, I see.  But please consider (one day) to split up the index file to
> separate the local stat cache from the globally valid data.

I am sure that this has been considered even before Git was publicly
announced, and I would wager a guess that it was determined that it would
be better to keep all of Git's private data in one place.

Now, you are totally free to disagree, and even to work on a patch series
to separate the stat cache and offer a compelling argument why this change
should be made. If I were you, I would not expect any other person to be
interested in working on this.

> (By the way, even after 12 years of using Git intensely I am confused
> about what actually is the index.  I believed that it is the "staging
> area", like in "git-add - Add file contents to the index".  But then the
> .git/index file reflects all the tracked files, and not just staged
> ones.  This usage is also reflected by the command "git update-index".)

The concept of the Git index is slightly different from what is actually
stored inside `.git/index`. You should consider the latter to be an
implementation detail that is of concern only if you want to work on
internals. Otherwise the description of the index as a staging area is a
pretty good image.

The staging area contains of course more than just the stages you changed.
It contains the entire tree that is staged in order to become the next
commit.

If you asked a worker at a theater to make a minor change to the stage,
you would not expect the staging area to be empty, either.

> > However, there are the core.checkStat and core.trustctime options
> > which can control which information is used in the stat caching.  You
> > can restrict it to the whole second part of mtime and the file size if
> > you want.  See git-config(1) for more details.
>
> Thanks a lot, that did the trick!  I’ve been already syncing mtimes.
> Setting both core.checkStat and core.trustctime to the "weak" values
> made the spurious modifications go away.

And of course now you have a less performant setup because files have a
much better chance of being "racily clean", i.e. their mtime could be
identical to the `.git/index` file, in which case Git has to assume that
the file might have changed, and the index has to be refreshed.

Just saying that what you think of as a silver bullet comes at a price.

> Still, this is a workaround, and the price is reduced robustness of file
> modification detection.

You misunderstand how Git detects whether a file is modified or not.

A file is re-hashed if its mtime is newer than, _or equal to_, the mtime
of `.git/index`.

So no, it is not the robustness that is the problem. It is no less robust.
The problem is that you force re-hashing where it would not be necessary
otherwise.
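
The rule above can be sketched in a few lines of Python.  This is only a
model of the argument, not Git's implementation: the function names are
invented, and the assumption that coarser timestamps are compared at
whole-second resolution is mine.

```python
NS_PER_SEC = 10**9


def needs_content_check(file_mtime_ns, index_mtime_ns):
    """True if stat data alone cannot prove the file is unmodified.

    A file whose mtime is newer than, or equal to, the mtime of the index
    file may have been changed after the index was written.
    """
    return file_mtime_ns >= index_mtime_ns


def to_whole_seconds(timestamp_ns):
    """Drop the sub-second part, as a coarser comparison would."""
    return (timestamp_ns // NS_PER_SEC) * NS_PER_SEC


if __name__ == "__main__":
    file_mtime = 100 * NS_PER_SEC + 200_000_000   # file written at t=100.2s
    index_mtime = 100 * NS_PER_SEC + 900_000_000  # index written at t=100.9s

    # With full resolution the file is clearly older than the index.
    print(needs_content_check(file_mtime, index_mtime))
    # With whole seconds the two tie, so the contents must be compared.
    print(needs_content_check(to_whole_seconds(file_mtime),
                              to_whole_seconds(index_mtime)))
```

Under that assumption, dropping the sub-second part turns more files into
ties, and every tie costs a content comparison.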

In general, I am not sure that you are using the right tool for
synchronizing. If you cannot guarantee that a snapshot of the directory is
copied, you will always run the risk of inconsistent data, which is worse
than not having a backup at all: at least without a backup you do not have
a false sense of security.

Ciao,
Johannes

* Re: Stat cache in .git/index hinders syncing of repositories
  2020-01-20 12:01         ` Johannes Schindelin
@ 2020-01-20 23:53           ` Christoph Groth
  2020-01-21  2:53             ` brian m. carlson
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Groth @ 2020-01-20 23:53 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

Johannes Schindelin wrote:
>
> On Sat, 18 Jan 2020, Christoph Groth wrote:
>
> > OK, I see.  But please consider (one day) to split up the index file
> > to separate the local stat cache from the globally valid data.
>
> I am sure that this has been considered even before Git was publicly
> announced,

I would be very interested to hear the rationale for keeping the
information about what is staged and the stat cache together in the same
file.  I, or someone else, might actually work on a patch one day, but
before starting, it would be good to understand the reasoning behind the
current design.

> and I would wager a guess that it was determined that it would be
> better to keep all of Git's private data in one place.

My point is that it’s not just private data: When I excluded .git/index
from synchronization, staging files for a commit was no longer
synchronized.

> > (By the way, even after 12 years of using Git intensely I am
> > confused about what actually is the index.  I believed that it is
> > the "staging area", like in "git-add - Add file contents to the
> > index".  But then the .git/index file reflects all the tracked
> > files, and not just staged ones.  This usage is also reflected by
> > the command "git update-index".)
>
> The concept of the Git index is slightly different from what is
> actually stored inside `.git/index`. You should consider the latter to
> be an implementation detail that is of concern only if you want to
> work on internals. Otherwise the description of the index as a staging
> area is a pretty good image.

To me, it does not seem to be a mere implementation detail.  For example
the command ’git update-index --refresh’ is part of the "public API" and
its action is to update the stat cache.  It does not modify what is
staged or not.

> > Still, this is a workaround, and the price is reduced robustness of
> > file modification detection.
>
> You misunderstand how Git detects whether a file is modified or not.
>
> A file is re-hashed if its mtime is newer than, _or equal to_, the
> mtime of `.git/index`.

You must mean "the mtime in ’.git/index’", but OK, I see.  Makes sense
of course.  So setting core.trustctime to false and core.checkstat to
minimal only means that some avoidable rehashings may be made.  But this
would require two modifications of a file in the same second, without
a change to the file size.

> In general, I am not sure that you are using the right tool for
> synchronizing. If you cannot guarantee that a snapshot of the
> directory is copied, you will always run the risk of inconsistent
> data, which is worse than not having a backup at all: at least without
> a backup you do not have a false sense of security.

I do not understand what makes you think so.

Unison is very robust software, I never had any problems with it and
never heard of anyone having any.  Moreover, as I noted in the opening
message of this thread, it recently gained an option to treat chosen
directories as atomic.  I’m using this for ".git" subdirectories.

Christoph

* Re: Stat cache in .git/index hinders syncing of repositories
  2020-01-20 23:53           ` Christoph Groth
@ 2020-01-21  2:53             ` brian m. carlson
  2020-01-24  9:16               ` Christoph Groth
  0 siblings, 1 reply; 9+ messages in thread
From: brian m. carlson @ 2020-01-21  2:53 UTC (permalink / raw)
  To: Christoph Groth; +Cc: Johannes Schindelin, git

On 2020-01-20 at 23:53:22, Christoph Groth wrote:
> Johannes Schindelin wrote:
> >
> > On Sat, 18 Jan 2020, Christoph Groth wrote:
> >
> > > OK, I see.  But please consider (one day) to split up the index file
> > > to separate the local stat cache from the globally valid data.
> >
> > I am sure that this has been considered even before Git was publicly
> > announced,
> 
> I would be very interested to hear the rationale for keeping the
> information about what is staged and the stat cache together in the same
> file.  I, or someone else, might actually work on a patch one day, but
> before starting, it would be good to understand the reasoning behind the
> current design.
> 
> > and I would wager a guess that it was determined that it would be
> > better to keep all of Git's private data in one place.
> 
> My point is that it’s not just private data: When I excluded .git/index
> from synchronization, staging files for a commit was no longer
> synchronized.

To try to answer this question, Git stores all of its state about the
working tree in the index.  Bare repositories don't typically have an
index because they don't have a working tree.  Whether that state is
staged contents or stat information, all of it is in one file.

Storing all of this data in one file means that only one file need be
mapped into memory and rewritten.  Git writes to the index by atomically
creating a lock file along side of it and writing the new contents into
it, and then doing an atomic replace.  This approach wouldn't be
possible with multiple files, and any update to it wouldn't be atomic.
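
For readers unfamiliar with the pattern, here is a minimal Python sketch
of a lock-file-plus-rename update.  It mirrors the index.lock convention
in spirit only; write_atomically is a made-up helper, and Git's own C
implementation handles many more details (signals, stale locks, and so
on).

```python
import os


def write_atomically(path, data):
    """Replace the file at path with data, or leave it untouched on failure."""
    lock_path = path + ".lock"
    # O_EXCL makes the creation fail if another writer already holds the lock.
    fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the new contents are on disk
        # The rename is atomic: readers see either the old or the new file.
        os.replace(lock_path, path)
    except BaseException:
        # Something went wrong before the rename: drop the lock file so the
        # old contents stay in place and later writers are not blocked.
        if os.path.exists(lock_path):
            os.unlink(lock_path)
        raise


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "index")  # stand-in for .git/index
        write_atomically(target, b"version 1")
        write_atomically(target, b"version 2")
        with open(target, "rb") as f:
            print(f.read())
```

With two files there is no single rename that switches both at once,
which is the atomicity concern raised above.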

There is support for a split index mode which means that the main index
need not be rewritten as often, which is helpful when making small
updates to large trees, where the cost of rewriting the index is
significant.  I don't know how locking is handled there[0], but I assume
that it is, because the people who implemented and reviewed it are
capable and thoughtful.

However, having said that, nobody has provided a compelling case for
using multiple files for storing different types of working tree state.
The existing options are available for cases like yours and others', and
they work.  Since there are clear benefits to the current model,
including simplicity and robustness, and few downsides, nobody has
decided to change it.

I should add that even if, for some reason, we did add support for
splitting this data out, I'm not sure if we'd support syncing only part
of the repository state and blowing away other state.  We don't really
support that now (other than through tools like fetch and clone) and I
don't think we'd want to encourage that behavior in the future.

[0] And I have not had the interest to look at this present moment.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

* Re: Stat cache in .git/index hinders syncing of repositories
  2020-01-21  2:53             ` brian m. carlson
@ 2020-01-24  9:16               ` Christoph Groth
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Groth @ 2020-01-24  9:16 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Johannes Schindelin, git

brian m. carlson wrote:
> On 2020-01-20 at 23:53:22, Christoph Groth wrote:

> > My point is that it’s not just private data: When I excluded
> > .git/index from synchronization, staging files for a commit was no
> > longer synchronized.
>
> (...)
>
> Storing all of this data in one file means that only one file need be
> mapped into memory and rewritten.  Git writes to the index by
> atomically creating a lock file along side of it and writing the new
> contents into it, and then doing an atomic replace.  This approach
> wouldn't be possible with multiple files, and any update to it
> wouldn't be atomic.

Thanks a lot for the explanation.  To me, it still seems less
satisfying, from a design point of view, to mix state (=what changes
have been staged) with an ephemeral cache that is specific to
a particular file system.  Without having thought deeply about it,
I have the impression that it wouldn’t matter if the stat cache and the
“staging state” of the repository were each atomic on their own.

But I understand now that all of this hardly matters in practice (see
below), so I’m not motivated to work on this, and probably no one else
is. :-)

> However, having said that, nobody has provided a compelling case for
> using multiple files for storing different types of working tree
> state.  The existing options are available for cases like yours and
> others', and they work.  Since there are clear benefits to the current
> model, including simplicity and robustness, and few downsides, nobody
> has decided to change it.

Indeed, I see hardly any disadvantages to globally setting

	[core]
		trustctime = false
		checkstat = minimal

as I do now.  In fact, I wonder what the purpose of caching the
subsecond part of mtime and the ctime is in the first place.  Perhaps it
matters for scripted use of git where several operations can occur in
the same second, but even then only changes that keep file sizes
constant would be affected.

> I should add that even if, for some reason, we did add support for
> splitting this data out, I'm not sure if we'd support syncing only
> part of the repository state and blowing away other state.  We don't
> really support that now (other than through tools like fetch and
> clone) and I don't think we'd want to encourage that behavior in the
> future.

The stat cache file would not really be part of the state of the
repository, since deleting it would not change anything, but only slow
down the next operation.  (That’s at least my current understanding;
perhaps I’m still overlooking something.)

Brian, Johannes, Junio, thanks a lot for taking the time to clarify this
issue.

Christoph
