git@vger.kernel.org mailing list mirror (one of many)
* Protecting old temporary objects being reused from concurrent "git gc"?
@ 2016-11-15 14:13 Matt McCutchen
  2016-11-15 17:06 ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Matt McCutchen @ 2016-11-15 14:13 UTC (permalink / raw)
  To: git

The Braid subproject management tool stores the subproject content in
the main tree and is able to switch to a different upstream revision of
a subproject by doing the equivalent of "git read-tree -m" on the
superproject tree and the two upstream trees.  The tricky part is
preparing temporary trees with the upstream content moved to the path
configured for the superproject.  The usual method is "git read-tree
--prefix", but using what index file?  Braid currently uses the user's
actual worktree, which can leave a mess if it gets interrupted:

https://github.com/cristibalan/braid/blob/7d81da6e86e24de62a74f3ab8d880666cb343b04/lib/braid/commands/update.rb#L98

I want to change this to something that won't leave an inconsistent
state if interrupted.  I've written code for this kind of thing before
that sets GIT_INDEX_FILE and uses a temporary index file and "git
write-tree".  But I realized that if "git gc" runs concurrently, the
generated tree could be deleted before it is used and the tool would
fail.  If I had a need to run "git commit-tree", it seems like I might
even end up with a commit object with a broken reference to a tree.
 "git gc" normally doesn't delete objects that were created in the last
2 weeks, but if an identical tree was added to the object database more
than 2 weeks ago by another operation and is unreferenced, it could be
reused without updating its mtime and it could still get deleted.
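
To make this concrete, the temporary-index approach I have in mind is
roughly the following sketch (the index path and the $path and
$upstream_tree variables are only illustrative):

  # throwaway index, separate from the user's real one
  export GIT_INDEX_FILE="$(git rev-parse --git-dir)/braid-tmp-index.$$"
  git read-tree --prefix="$path/" "$upstream_tree"
  tree=$(git write-tree)   # may resolve to an old, unreferenced tree
  rm -f "$GIT_INDEX_FILE"; unset GIT_INDEX_FILE
  # ... then use $tree in the "git read-tree -m" merge ...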

Is there a recommended way to avoid this kind of problem in add-on
tools?  (I searched the Git documentation and the web for information
about races with "git gc" and didn't find anything useful.)  If not, it
seems to be a significant design flaw in "git gc", even if the problem
is extremely rare in practice.  I wonder if some of the built-in
commands may have the same problem, though I haven't tried to test
them.  If this is confirmed to be a known problem affecting built-in
commands, then at least I won't feel bad about introducing the
same problem into add-on tools. :/

Thanks,
Matt


* Re: Protecting old temporary objects being reused from concurrent "git gc"?
  2016-11-15 14:13 Protecting old temporary objects being reused from concurrent "git gc"? Matt McCutchen
@ 2016-11-15 17:06 ` Jeff King
  2016-11-15 17:33   ` Matt McCutchen
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2016-11-15 17:06 UTC (permalink / raw)
  To: Matt McCutchen; +Cc: git

On Tue, Nov 15, 2016 at 09:13:14AM -0500, Matt McCutchen wrote:

> I want to change this to something that won't leave an inconsistent
> state if interrupted.  I've written code for this kind of thing before
> that sets GIT_INDEX_FILE and uses a temporary index file and "git
> write-tree".  But I realized that if "git gc" runs concurrently, the
> generated tree could be deleted before it is used and the tool would
> fail.  If I had a need to run "git commit-tree", it seems like I might
> even end up with a commit object with a broken reference to a tree.
>  "git gc" normally doesn't delete objects that were created in the last
> 2 weeks, but if an identical tree was added to the object database more
> than 2 weeks ago by another operation and is unreferenced, it could be
> reused without updating its mtime and it could still get deleted.

Modern versions of git do two things to help with this:

 - any object which is referenced by a "recent" object (within the
   2-week window) is also considered recent. So if you create a new
   commit object that points to a tree, that tree is protected even
   before you reference the commit.

 - when an object write is optimized out because we already have the
   object, git will update the mtime on the file (loose object or
   packfile) to freshen it

This isn't perfect, though. You can decide to reference an existing
object just as it is being deleted. And the pruning process itself is
not atomic (and it's tricky to make it so, just because of what we're
promised by the filesystem).

> Is there a recommended way to avoid this kind of problem in add-on
> tools?  (I searched the Git documentation and the web for information
> about races with "git gc" and didn't find anything useful.)  If not, it
> seems to be a significant design flaw in "git gc", even if the problem
> is extremely rare in practice.  I wonder if some of the built-in
> commands may have the same problem, though I haven't tried to test
> them.  If this is confirmed to be a known problem affecting built-in
> commands, then at least I won't feel bad about introducing the
> same problem into add-on tools. :/

If you have long-running data (like, a temporary index file that might
literally sit around for days or weeks) I think that is a potential
problem. And the solution is probably to use refs in some way to point
to your objects. If you're worried about a short-term operation where
somebody happens to run git-gc concurrently, I agree it's a possible
problem, but I suspect something you can ignore in practice.
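
For instance (just a sketch, and the ref name is arbitrary), a tool
could do something like:

  tree=$(git write-tree)
  # a new commit object makes the tree "recent" (see above), and the
  # temporary ref keeps it reachable until we are done with it
  commit=$(git commit-tree -m 'temporary anchor' "$tree")
  git update-ref refs/tmp/my-tool-$$ "$commit"
  # ... do whatever needs $tree ...
  git update-ref -d refs/tmp/my-tool-$$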

For the most part, a lot of the client-side git tools assume that one
operation is happening at a time in the repository. And I think that
largely holds for a developer working on a single clone, and things just
work in practice.

Auto-gc makes that a little sketchier, but historically does not seem to
have really caused problems in practice.

For a busy multi-user server, I recommend turning off auto-gc entirely,
and repacking manually with "-k" to be on the safe side.
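
E.g., something along these lines (illustrative only):

  git config gc.auto 0   # no automatic gc in this repository
  git repack -a -d -k    # manual repack; -k keeps unreachable objects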

-Peff


* Re: Protecting old temporary objects being reused from concurrent "git gc"?
  2016-11-15 17:06 ` Jeff King
@ 2016-11-15 17:33   ` Matt McCutchen
  2016-11-15 17:40     ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Matt McCutchen @ 2016-11-15 17:33 UTC (permalink / raw)
  To: Jeff King; +Cc: git


On Tue, 2016-11-15 at 12:06 -0500, Jeff King wrote:
>  - when an object write is optimized out because we already have the
>    object, git will update the mtime on the file (loose object or
>    packfile) to freshen it

FWIW, I am not seeing this happen when I do "git read-tree --prefix"
followed by "git write-tree" using the current master (3ab2281).  See
the attached test script.
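
(This is not the attached script, just a rough sketch of the kind of
check I mean; tmp.index and sub/ are arbitrary names, and it assumes
the tree in question is stored loose:)

  tree=$(git write-tree)                 # some existing tree object
  obj=.git/objects/$(echo "$tree" | sed 's|^..|&/|')
  touch -d '3 weeks ago' "$obj"          # pretend it is already old
  GIT_INDEX_FILE=tmp.index git read-tree --prefix=sub/ "$tree"
  GIT_INDEX_FILE=tmp.index git write-tree
  ls -l "$obj"                           # mtime is not updated here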

> If you have long-running data (like, a temporary index file that might
> literally sit around for days or weeks) I think that is a potential
> problem. And the solution is probably to use refs in some way to point
> to your objects.

Agreed.  This is not my current scenario.

> If you're worried about a short-term operation where
> somebody happens to run git-gc concurrently, I agree it's a possible
> problem, but I suspect something you can ignore in practice.
> 
> For the most part, a lot of the client-side git tools assume that one
> operation is happening at a time in the repository. And I think that
> largely holds for a developer working on a single clone, and things just
> work in practice.
> 
> Auto-gc makes that a little sketchier, but historically does not seem to
> have really caused problems in practice.

OK.  I'll write a patch to add a summary of this information to the
git-gc man page.

Matt

[-- Attachment #2: test-git-read-tree-write-tree-touch-object.sh --]
[-- Type: application/x-shellscript, Size: 502 bytes --]


* Re: Protecting old temporary objects being reused from concurrent "git gc"?
  2016-11-15 17:33   ` Matt McCutchen
@ 2016-11-15 17:40     ` Jeff King
  2016-11-15 19:08       ` [PATCH] git-gc.txt: expand discussion of races with other processes Matt McCutchen
                         ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Jeff King @ 2016-11-15 17:40 UTC (permalink / raw)
  To: Matt McCutchen; +Cc: git

On Tue, Nov 15, 2016 at 12:33:04PM -0500, Matt McCutchen wrote:

> On Tue, 2016-11-15 at 12:06 -0500, Jeff King wrote:
> >  - when an object write is optimized out because we already have the
> >    object, git will update the mtime on the file (loose object or
> >    packfile) to freshen it
> 
> FWIW, I am not seeing this happen when I do "git read-tree --prefix"
> followed by "git write-tree" using the current master (3ab2281).  See
> the attached test script.

The optimization I'm thinking about is the one from write_sha1_file(),
which learned to freshen in 33d4221c7 (write_sha1_file: freshen existing
objects, 2014-10-15).

I suspect the issue is that read-tree populates the cache-tree index
extension, and then write-tree omits the object write before it even
gets to write_sha1_file(). The solution is that it should probably be
calling one of the freshen() functions (possibly just replacing
has_sha1_file() with check_and_freshen(), but I haven't looked).

I'd definitely welcome patches in this area.

> OK.  I'll write a patch to add a summary of this information to the
> git-gc man page.

Sounds like a good idea. Thanks.

-Peff


* [PATCH] git-gc.txt: expand discussion of races with other processes
  2016-11-15 17:40     ` Jeff King
@ 2016-11-15 19:08       ` Matt McCutchen
  2016-11-15 19:12       ` Protecting old temporary objects being reused from concurrent "git gc"? Matt McCutchen
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Matt McCutchen @ 2016-11-15 19:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King

In general, "git gc" may delete objects that another concurrent process
is using but hasn't created a reference to.  Git has some mitigations,
but they fall short of a complete solution.  Document this in the
git-gc(1) man page and add a reference from the documentation of the
gc.pruneExpire config variable.

Based on a write-up by Jeff King:

http://marc.info/?l=git&m=147922960131779&w=2

Signed-off-by: Matt McCutchen <matt@mattmccutchen.net>
---
 Documentation/config.txt |  4 +++-
 Documentation/git-gc.txt | 34 ++++++++++++++++++++++++++--------
 2 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 21fdddf..3f1d931 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1409,7 +1409,9 @@ gc.pruneExpire::
 	Override the grace period with this config variable.  The value
 	"now" may be used to disable this grace period and always prune
 	unreachable objects immediately, or "never" may be used to
-	suppress pruning.
+	suppress pruning.  This feature helps prevent corruption when
+	'git gc' runs concurrently with another process writing to the
+	repository; see the "NOTES" section of linkgit:git-gc[1].
 
 gc.worktreePruneExpire::
 	When 'git gc' is run, it calls
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index bed60f4..852b72c 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -63,11 +63,10 @@ automatic consolidation of packs.
 --prune=<date>::
 	Prune loose objects older than date (default is 2 weeks ago,
 	overridable by the config variable `gc.pruneExpire`).
-	--prune=all prunes loose objects regardless of their age (do
-	not use --prune=all unless you know exactly what you are doing.
-	Unless the repository is quiescent, you will lose newly created
-	objects that haven't been anchored with the refs and end up
-	corrupting your repository).  --prune is on by default.
+	--prune=all prunes loose objects regardless of their age and
+	increases the risk of corruption if another process is writing to
+	the repository concurrently; see "NOTES" below. --prune is on by
+	default.
 
 --no-prune::
 	Do not prune any loose objects.
@@ -138,17 +137,36 @@ default is "2 weeks ago".
 Notes
 -----
 
-'git gc' tries very hard to be safe about the garbage it collects. In
+'git gc' tries very hard not to delete objects that are referenced
+anywhere in your repository. In
 particular, it will keep not only objects referenced by your current set
 of branches and tags, but also objects referenced by the index,
 remote-tracking branches, refs saved by 'git filter-branch' in
 refs/original/, or reflogs (which may reference commits in branches
 that were later amended or rewound).
-
-If you are expecting some objects to be collected and they aren't, check
+If you are expecting some objects to be deleted and they aren't, check
 all of those locations and decide whether it makes sense in your case to
 remove those references.
 
+On the other hand, when 'git gc' runs concurrently with another process,
+there is a risk of it deleting an object that the other process is using
+but hasn't created a reference to. This may just cause the other process
+to fail or may corrupt the repository if the other process later adds a
+reference to the deleted object. Git has two features that significantly
+mitigate this problem:
+
+. Any object with modification time newer than the `--prune` date is kept,
+  along with everything reachable from it.
+
+. Most operations that add an object to the database update the
+  modification time of the object if it is already present so that #1
+  applies.
+
+However, these features fall short of a complete solution, so users who
+run commands concurrently have to live with some risk of corruption (which
+seems to be low in practice) unless they turn off automatic garbage
+collection with 'git config gc.auto 0'.
+
 HOOKS
 -----
 
-- 
2.7.4




* Re: Protecting old temporary objects being reused from concurrent "git gc"?
  2016-11-15 17:40     ` Jeff King
  2016-11-15 19:08       ` [PATCH] git-gc.txt: expand discussion of races with other processes Matt McCutchen
@ 2016-11-15 19:12       ` Matt McCutchen
  2016-11-15 20:01       ` Junio C Hamano
  2016-11-16 18:58       ` Junio C Hamano
  3 siblings, 0 replies; 13+ messages in thread
From: Matt McCutchen @ 2016-11-15 19:12 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Tue, 2016-11-15 at 12:40 -0500, Jeff King wrote:
> On Tue, Nov 15, 2016 at 12:33:04PM -0500, Matt McCutchen wrote:
> 
> > 
> > On Tue, 2016-11-15 at 12:06 -0500, Jeff King wrote:
> > > 
> > >  - when an object write is optimized out because we already have the
> > >    object, git will update the mtime on the file (loose object or
> > >    packfile) to freshen it
> > 
> > FWIW, I am not seeing this happen when I do "git read-tree --prefix"
> > followed by "git write-tree" using the current master (3ab2281).  See
> > the attached test script.
> 
> The optimization I'm thinking about is the one from write_sha1_file(),
> which learned to freshen in 33d4221c7 (write_sha1_file: freshen existing
> objects, 2014-10-15).
> 
> I suspect the issue is that read-tree populates the cache-tree index
> extension, and then write-tree omits the object write before it even
> gets to write_sha1_file(). The solution is that it should probably be
> calling one of the freshen() functions (possibly just replacing
> has_sha1_file() with check_and_freshen(), but I haven't looked).
> 
> I'd definitely welcome patches in this area.

Cool, it's nice to have an idea of what's going on.  I don't think I'm
going to try to fix it myself though.

By the way, thanks for the fast response to my original question!

Matt


* Re: Protecting old temporary objects being reused from concurrent "git gc"?
  2016-11-15 17:40     ` Jeff King
  2016-11-15 19:08       ` [PATCH] git-gc.txt: expand discussion of races with other processes Matt McCutchen
  2016-11-15 19:12       ` Protecting old temporary objects being reused from concurrent "git gc"? Matt McCutchen
@ 2016-11-15 20:01       ` Junio C Hamano
  2016-11-16  8:07         ` Jeff King
  2016-11-16 18:58       ` Junio C Hamano
  3 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2016-11-15 20:01 UTC (permalink / raw)
  To: Jeff King; +Cc: Matt McCutchen, git

Jeff King <peff@peff.net> writes:

> I suspect the issue is that read-tree populates the cache-tree index
> extension, and then write-tree omits the object write before it even
> gets to write_sha1_file(). The solution is that it should probably be
> calling one of the freshen() functions (possibly just replacing
> has_sha1_file() with check_and_freshen(), but I haven't looked).

I think the final writing always happens via write_sha1_file(), but
an earlier cache-tree update codepath that says "if we already have
a tree object, then use it; otherwise, even though we know the object
name for this subtree, do not record it in the cache-tree" may decide
to record the subtree's sha1 without freshening the referent.

A fix may look like this.

 cache-tree.c | 2 +-
 cache.h      | 1 +
 sha1_file.c  | 9 +++++++--
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 345ea35963..3ae6d056b4 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -401,7 +401,7 @@ static int update_one(struct cache_tree *it,
 	if (repair) {
 		unsigned char sha1[20];
 		hash_sha1_file(buffer.buf, buffer.len, tree_type, sha1);
-		if (has_sha1_file(sha1))
+		if (freshen_object(sha1))
 			hashcpy(it->sha1, sha1);
 		else
 			to_invalidate = 1;
diff --git a/cache.h b/cache.h
index 5cdea6833e..1f5694f308 100644
--- a/cache.h
+++ b/cache.h
@@ -1126,6 +1126,7 @@ extern int sha1_object_info(const unsigned char *, unsigned long *);
 extern int hash_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *sha1);
 extern int write_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *return_sha1);
 extern int hash_sha1_file_literally(const void *buf, unsigned long len, const char *type, unsigned char *sha1, unsigned flags);
+extern int freshen_object(const unsigned char *);
 extern int pretend_sha1_file(void *, unsigned long, enum object_type, unsigned char *);
 extern int force_object_loose(const unsigned char *sha1, time_t mtime);
 extern int git_open_cloexec(const char *name, int flags);
diff --git a/sha1_file.c b/sha1_file.c
index e030805497..1daeb05dcd 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3275,6 +3275,11 @@ static int freshen_packed_object(const unsigned char *sha1)
 	return 1;
 }
 
+int freshen_object(const unsigned char *sha1)
+{
+	return freshen_packed_object(sha1) || freshen_loose_object(sha1);
+}
+
 int write_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *sha1)
 {
 	char hdr[32];
@@ -3284,7 +3289,7 @@ int write_sha1_file(const void *buf, unsigned long len, const char *type, unsign
 	 * it out into .git/objects/??/?{38} file.
 	 */
 	write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
-	if (freshen_packed_object(sha1) || freshen_loose_object(sha1))
+	if (freshen_object(sha1))
 		return 0;
 	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
 }
@@ -3302,7 +3307,7 @@ int hash_sha1_file_literally(const void *buf, unsigned long len, const char *typ
 
 	if (!(flags & HASH_WRITE_OBJECT))
 		goto cleanup;
-	if (freshen_packed_object(sha1) || freshen_loose_object(sha1))
+	if (freshen_object(sha1))
 		goto cleanup;
 	status = write_loose_object(sha1, header, hdrlen, buf, len, 0);
 


* Re: Protecting old temporary objects being reused from concurrent "git gc"?
  2016-11-15 20:01       ` Junio C Hamano
@ 2016-11-16  8:07         ` Jeff King
  2016-11-16 18:18           ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2016-11-16  8:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Matt McCutchen, git

On Tue, Nov 15, 2016 at 12:01:35PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I suspect the issue is that read-tree populates the cache-tree index
> > extension, and then write-tree omits the object write before it even
> > gets to write_sha1_file(). The solution is that it should probably be
> > calling one of the freshen() functions (possibly just replacing
> > has_sha1_file() with check_and_freshen(), but I haven't looked).
> 
> I think the final writing always happens via write_sha1_file(), but
> an earlier cache-tree update codepath that says "if we already have
> a tree object, then use it; otherwise, even though we know the object
> name for this subtree, do not record it in the cache-tree" may decide
> to record the subtree's sha1 without freshening the referent.
> 
> A fix may look like this.

Yeah, that's along the lines I was expecting, though I'm not familiar
enough with cache-tree to say whether it's sufficient. I notice there is
a return very early on in update_one() when has_sha1_file() matches, and
it seems like that would trigger in some interesting cases, too.

-Peff


* Re: Protecting old temporary objects being reused from concurrent "git gc"?
  2016-11-16  8:07         ` Jeff King
@ 2016-11-16 18:18           ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2016-11-16 18:18 UTC (permalink / raw)
  To: Jeff King; +Cc: Matt McCutchen, git

Jeff King <peff@peff.net> writes:

> ... I notice there is
> a return very early on in update_one() when has_sha1_file() matches, and
> it seems like that would trigger in some interesting cases, too.

Yeah, I missed that.  It says "we were asked to update one
cache_tree that corresponds to this subdirectory, found that the
hashes for everything below have been rolled up and are still valid,
and we already have the right tree object in the object store".

It can simply become freshen(), which is "do we have it in the
object store?" with the side effect of touching it iff the answer is
"yes".


* Re: Protecting old temporary objects being reused from concurrent "git gc"?
  2016-11-15 17:40     ` Jeff King
                         ` (2 preceding siblings ...)
  2016-11-15 20:01       ` Junio C Hamano
@ 2016-11-16 18:58       ` Junio C Hamano
  2016-11-17  1:04         ` Jeff King
  3 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2016-11-16 18:58 UTC (permalink / raw)
  To: Jeff King; +Cc: Matt McCutchen, git

Jeff King <peff@peff.net> writes:

> I suspect the issue is that read-tree populates the cache-tree index
> extension, and then write-tree omits the object write before it even
> gets to write_sha1_file().

Wait a minute.  The entries in the index and the trees in the
cache-tree are the roots of the "still in use" traversal for the
purpose of pruning, which makes the "something like this" patch
unnecessary for the real index file.

And for a temporary index file that is kept for 6 months, touching
the tree objects that the cache-tree references is irrelevant---the
blobs recorded in the "list of objects" part of the index will go
stale, which is a lot more problematic.


* Re: Protecting old temporary objects being reused from concurrent "git gc"?
  2016-11-16 18:58       ` Junio C Hamano
@ 2016-11-17  1:04         ` Jeff King
  2016-11-17  1:35           ` Re* " Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2016-11-17  1:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Matt McCutchen, git

On Wed, Nov 16, 2016 at 10:58:30AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I suspect the issue is that read-tree populates the cache-tree index
> > extension, and then write-tree omits the object write before it even
> > gets to write_sha1_file().
> 
> Wait a minute.  The entries in the index and the trees in the
> cache-tree are the roots of the "still in use" traversal for the
> purpose of pruning, which makes the "something like this" patch
> unnecessary for the real index file.
> 
> And for a temporary index file that is kept for 6 months, touching
> the tree objects that the cache-tree references is irrelevant---the
> blobs recorded in the "list of objects" part of the index will go
> stale, which is a lot more problematic.

I think the case that is helped here is somebody who runs "git
write-tree" and expects the timestamp on those trees to be fresh. So
even for a briefly used index, like:

  export GIT_INDEX_FILE=/tmp/foo
  git read-tree ...
  git write-tree
  rm -f $GIT_INDEX_FILE

we'd expect that a "git gc" which runs immediately after would see those
trees as recent and avoid pruning them (and transitively, any blobs that
are reachable from the trees). But I don't think that write-tree
actually freshens them (it sees "oh, we already have these; there is
nothing to write").

I could actually see an argument that the read-tree operation should
freshen the blobs themselves (because we know those blobs are now in
active use, and probably shouldn't be pruned), but I am not sure I
agree there, if only because it is weird that an operation which is
otherwise read-only with respect to the repository would modify the
object database.

-Peff


* Re* Protecting old temporary objects being reused from concurrent "git gc"?
  2016-11-17  1:04         ` Jeff King
@ 2016-11-17  1:35           ` Junio C Hamano
  2016-11-17  1:43             ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2016-11-17  1:35 UTC (permalink / raw)
  To: Jeff King; +Cc: Matt McCutchen, git

Jeff King <peff@peff.net> writes:

> I think the case that is helped here is somebody who runs "git
> write-tree" and expects the timestamp on those trees to be fresh. So
> even for a briefly used index, like:
>
>   export GIT_INDEX_FILE=/tmp/foo
>   git read-tree ...
>   git write-tree
>   rm -f $GIT_INDEX_FILE
>
> we'd expect that a "git gc" which runs immediately after would see those
> trees as recent and avoid pruning them (and transitively, any blobs that
> are reachable from the trees). But I don't think that write-tree
> actually freshens them (it sees "oh, we already have these; there is
> nothing to write").

OK, here is what I have queued.

-- >8 --
Subject: cache-tree: make sure to "touch" tree objects the cache-tree records

The cache_tree_fully_valid() function is called by callers that want
to know if they need to call cache_tree_update(), i.e. as an attempt
to optimize. They all want to have a fully valid cache-tree in the
end so that they can write a tree object out.

We used to check if the cached tree object is up-to-date (i.e. the
index entries covered by the cache-tree entry haven't been changed
since the roll-up hash was computed for the cache-tree entry) and
made sure the tree object is already in the object store.  Since the
top-level tree we would soon be asked to write out (and would find
in the object store) may not be anchored to any refs or commits
immediately, freshen the tree object when that happens.

Similarly, when we actually compute the cache-tree entries in
cache_tree_update(), we refrained from writing a tree object out
when it already exists in the object store.  We would want to
freshen the tree object in that case to protect it from premature
pruning.

Strictly speaking, freshening these tree objects at each and every
level is probably unnecessary, given that anything reachable from a
young object inherits the youth of the referring object and is
protected from pruning.  It should be sufficient to freshen only the
very top-level tree instead.  Benchmarking and optimization are left
as an exercise for later days.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache-tree.c | 6 +++---
 cache.h      | 1 +
 sha1_file.c  | 9 +++++++--
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 3ebf9c3aa4..c8c74a1e07 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -225,7 +225,7 @@ int cache_tree_fully_valid(struct cache_tree *it)
 	int i;
 	if (!it)
 		return 0;
-	if (it->entry_count < 0 || !has_sha1_file(it->sha1))
+	if (it->entry_count < 0 || !freshen_object(it->sha1))
 		return 0;
 	for (i = 0; i < it->subtree_nr; i++) {
 		if (!cache_tree_fully_valid(it->down[i]->cache_tree))
@@ -253,7 +253,7 @@ static int update_one(struct cache_tree *it,
 
 	*skip_count = 0;
 
-	if (0 <= it->entry_count && has_sha1_file(it->sha1))
+	if (0 <= it->entry_count && freshen_object(it->sha1))
 		return it->entry_count;
 
 	/*
@@ -393,7 +393,7 @@ static int update_one(struct cache_tree *it,
 	if (repair) {
 		unsigned char sha1[20];
 		hash_sha1_file(buffer.buf, buffer.len, tree_type, sha1);
-		if (has_sha1_file(sha1))
+		if (freshen_object(sha1))
 			hashcpy(it->sha1, sha1);
 		else
 			to_invalidate = 1;
diff --git a/cache.h b/cache.h
index 4ff196c259..72c0b321ac 100644
--- a/cache.h
+++ b/cache.h
@@ -1077,6 +1077,7 @@ extern int sha1_object_info(const unsigned char *, unsigned long *);
 extern int hash_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *sha1);
 extern int write_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *return_sha1);
 extern int hash_sha1_file_literally(const void *buf, unsigned long len, const char *type, unsigned char *sha1, unsigned flags);
+extern int freshen_object(const unsigned char *);
 extern int pretend_sha1_file(void *, unsigned long, enum object_type, unsigned char *);
 extern int force_object_loose(const unsigned char *sha1, time_t mtime);
 extern int git_open_noatime(const char *name);
diff --git a/sha1_file.c b/sha1_file.c
index d0f2aa029b..9acce3d3b8 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3151,6 +3151,11 @@ static int freshen_packed_object(const unsigned char *sha1)
 	return 1;
 }
 
+int freshen_object(const unsigned char *sha1)
+{
+	return freshen_packed_object(sha1) || freshen_loose_object(sha1);
+}
+
 int write_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *sha1)
 {
 	char hdr[32];
@@ -3160,7 +3165,7 @@ int write_sha1_file(const void *buf, unsigned long len, const char *type, unsign
 	 * it out into .git/objects/??/?{38} file.
 	 */
 	write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
-	if (freshen_packed_object(sha1) || freshen_loose_object(sha1))
+	if (freshen_object(sha1))
 		return 0;
 	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
 }
@@ -3178,7 +3183,7 @@ int hash_sha1_file_literally(const void *buf, unsigned long len, const char *typ
 
 	if (!(flags & HASH_WRITE_OBJECT))
 		goto cleanup;
-	if (freshen_packed_object(sha1) || freshen_loose_object(sha1))
+	if (freshen_object(sha1))
 		goto cleanup;
 	status = write_loose_object(sha1, header, hdrlen, buf, len, 0);
 
-- 
2.11.0-rc1-156-g07127df5c1



* Re: Re* Protecting old temporary objects being reused from concurrent "git gc"?
  2016-11-17  1:35           ` Re* " Junio C Hamano
@ 2016-11-17  1:43             ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2016-11-17  1:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Matt McCutchen, git

On Wed, Nov 16, 2016 at 05:35:47PM -0800, Junio C Hamano wrote:

> OK, here is what I have queued.
> 
> -- >8 --
> Subject: cache-tree: make sure to "touch" tree objects the cache-tree records
> 
> The cache_tree_fully_valid() function is called by callers that want
> to know if they need to call cache_tree_update(), i.e. as an attempt
> to optimize. They all want to have a fully valid cache-tree in the
> end so that they can write a tree object out.

That makes sense. I was focusing on the cache_tree_update() call, but
we do not even get there in the fully-valid case.

So I think this approach is nice as long as there is not a caller who
asks "are we fully valid? I do not need to write, but was just
wondering". That should be a read-only operation, but the freshen calls
may fail with EPERM, for example.

I do not see any such callers, nor do I really expect any. Just trying
to think through the possible consequences.

> Strictly speaking, freshening these tree objects at each and every
> level is probably unnecessary, given that anything reachable from a
> young object inherits the youth of the referring object and is
> protected from pruning.  It should be sufficient to freshen only the
> very top-level tree instead.  Benchmarking and optimization are left
> as an exercise for later days.

Good observation, and nicely explained all around.

-Peff

