git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Order of operations at the end of fast-import?
@ 2020-04-16  4:24 Mike Hommey
  2020-04-16  5:09 ` Jeff King
  2020-04-16 11:34 ` Mike Hommey
  0 siblings, 2 replies; 5+ messages in thread
From: Mike Hommey @ 2020-04-16  4:24 UTC (permalink / raw)
  To: git

Hi,

I just noticed that the order of operations at the end of fast-import
are:
- end_packfile
- dump_branches
- dump_tags

The first may create loose objects if the pack is small (per
fastimport.unpackLimit, defaulting to 100). The latter two create refs.

There seems to be a theoretical race condition here, if something else
triggers a gc at the "wrong" time, the loose objects might be cleaned up
and the branches/tags refs become dangling.

I understand that the packfile does need to be finished before creating
the refs, and that the unpacking replaces that when there aren't enough
objects, but wouldn't it be more data-safe to actually finish the pack,
create the refs, and then unpack?

Mike

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Order of operations at the end of fast-import?
  2020-04-16  4:24 Order of operations at the end of fast-import? Mike Hommey
@ 2020-04-16  5:09 ` Jeff King
  2020-04-16 18:34   ` Johannes Sixt
  2020-04-16 11:34 ` Mike Hommey
  1 sibling, 1 reply; 5+ messages in thread
From: Jeff King @ 2020-04-16  5:09 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git

On Thu, Apr 16, 2020 at 01:24:49PM +0900, Mike Hommey wrote:

> I just noticed that the order of operations at the end of fast-import
> are:
> - end_packfile
> - dump_branches
> - dump_tags
> 
> The first may create loose objects if the pack is small (per
> fastimport.unpackLimit, defaulting to 100). The latter two create refs.
> 
> There seems to be a theoretical race condition here, if something else
> triggers a gc at the "wrong" time, the loose objects might be cleaned up
> and the branches/tags refs become dangling.
>
> I understand that the packfile does need to be finished before creating
> the refs, and that the unpacking replaces that when there aren't enough
> objects, but wouldn't it be more data-safe to actually finish the pack,
> create the refs, and then unpack?

That race is there even without the unpacking step. Another gc might
remove the pack, dropping its unreferenced objects.

We do add a ".keep" between writing the pack and updating the refs, but
it doesn't look like it's done atomically (i.e., we write the .idx file
and _then_ add the .keep). So there's a small race there.

But all of this is also true of any operation, like git-commit. It's
creating new loose objects, and then will try to reference them. In
between, a simultaneous gc will think they're unreachable. Likewise,
receiving a push may write a pack (with a .keep, though in the correct
order) or even loose objects.

This is usually handled by the gc expiration time, which is compared
against the file mtime. The default is 2 weeks, but even something short
like 5 minutes would be plenty to avoid this race (even for a long
import, we should be updating the mtime every time we call write()).

In fact, gc will use the same expiration for clearing out tempfiles. So
even before we write the final pack and its .keep, any temporary files
we're writing into would be at risk. But again, if we're actively
writing, their mtimes should save them.

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Order of operations at the end of fast-import?
  2020-04-16  4:24 Order of operations at the end of fast-import? Mike Hommey
  2020-04-16  5:09 ` Jeff King
@ 2020-04-16 11:34 ` Mike Hommey
  1 sibling, 0 replies; 5+ messages in thread
From: Mike Hommey @ 2020-04-16 11:34 UTC (permalink / raw)
  To: git

On Thu, Apr 16, 2020 at 01:24:49PM +0900, Mike Hommey wrote:
> Hi,
> 
> I just noticed that the order of operations at the end of fast-import
> are:
> - end_packfile
> - dump_branches
> - dump_tags
> 
> The first may create loose objects if the pack is small (per
> fastimport.unpackLimit, defaulting to 100). The latter two create refs.
> 
> There seems to be a theoretical race condition here, if something else
> triggers a gc at the "wrong" time, the loose objects might be cleaned up
> and the branches/tags refs become dangling.

Come to think of it, this is only a problem with --prune=now, and the
same applies to loose objects created with git add, git write-tree,
git-commit-tree.

Mike

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Order of operations at the end of fast-import?
  2020-04-16  5:09 ` Jeff King
@ 2020-04-16 18:34   ` Johannes Sixt
  2020-04-16 18:58     ` Jeff King
  0 siblings, 1 reply; 5+ messages in thread
From: Johannes Sixt @ 2020-04-16 18:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Mike Hommey, git

Am 16.04.20 um 07:09 schrieb Jeff King:
> We do add a ".keep" between writing the pack and updating the refs, but
> it doesn't look like it's done atomically (i.e., we write the .idx file
> and _then_ add the .keep). So there's a small race there.
> 
> But all of this is also true of any operation, like git-commit. It's
> creating new loose objects, and then will try to reference them. In
> between, a simultaneous gc will think they're unreachable. Likewise,
> receiving a push may write a pack (with a .keep, though in the correct
> order) or even loose objects.
> 
> This is usually handled by the gc expiration time, which is compared
> against the file mtime. The default is 2 weeks, but even something short
> like 5 minutes would be plenty to avoid this race (even for a long
> import, we should be updating the mtime every time we call write()).

Except that on Windows the times are only updated reliably when all file
handles refering to the file are closed. Would that poke a hole in your
argument? (I don't think  so, as long as expiration times are not
ridiculously short; and people who do gc --prune=now should know what
they are doing -- and when.)

> In fact, gc will use the same expiration for clearing out tempfiles. So
> even before we write the final pack and its .keep, any temporary files
> we're writing into would be at risk. But again, if we're actively
> writing, their mtimes should save them.

-- Hannes

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Order of operations at the end of fast-import?
  2020-04-16 18:34   ` Johannes Sixt
@ 2020-04-16 18:58     ` Jeff King
  0 siblings, 0 replies; 5+ messages in thread
From: Jeff King @ 2020-04-16 18:58 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Mike Hommey, git

On Thu, Apr 16, 2020 at 08:34:27PM +0200, Johannes Sixt wrote:

> > This is usually handled by the gc expiration time, which is compared
> > against the file mtime. The default is 2 weeks, but even something short
> > like 5 minutes would be plenty to avoid this race (even for a long
> > import, we should be updating the mtime every time we call write()).
> 
> Except that on Windows the times are only updated reliably when all file
> handles refering to the file are closed. Would that poke a hole in your
> argument? (I don't think  so, as long as expiration times are not
> ridiculously short; and people who do gc --prune=now should know what
> they are doing -- and when.)

Yes, if you do "git gc --prune=5.minutes" while an hour-long fast-import
is running, you might be in trouble. Don't do that.

Though on _any_ system, there are other cases where objects sit
untouched before being referenced. I guess we pull from the index for
reachability, so "git add" is safe for blobs. I don't know if we are
clever enough to use cache-tree (and it might not even be there), so
possibly those trees are at risk until you run "git commit" (or even
after you run while waiting in the editor).

Certainly I think the whole pruning system is a patchwork of assumptions
that could be violated in extreme cases. But it has been that way for 15
years. If people aren't routinely finding objects pruned out from under
them, I'm not inclined to spend a lot of time digging on it.

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-04-16 18:58 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-16  4:24 Order of operations at the end of fast-import? Mike Hommey
2020-04-16  5:09 ` Jeff King
2020-04-16 18:34   ` Johannes Sixt
2020-04-16 18:58     ` Jeff King
2020-04-16 11:34 ` Mike Hommey

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).