git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Re: Preserve/Prune Old Pack Files
       [not found] <24abd0ed58c25ce832014f9bd5bb2090@codeaurora.org>
@ 2017-01-04 16:11 ` Martin Fick
  2017-01-09  6:21   ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Martin Fick @ 2017-01-04 16:11 UTC (permalink / raw)
  To: repo-discuss; +Cc: jmelvin, jgit-dev, git

I am replying to this email across lists because I wanted to 
highlight to the git community this jgit change to repacking 
that we have up for review

 https://git.eclipse.org/r/#/c/87969/

This change introduces a new convention for how to preserve 
old pack files in a staging area 
(.git/objects/packs/preserved) before deleting them.  I 
wanted to ensure that the new proposed convention would be 
done in a way that would be satisfactory to the git 
community as a whole so that it would be more easy to 
provide the same behavior in git eventually.  The preserved 
pack files (and accompanying index and bitmap files), are not 
only moved, but they are also renamed so that they no longer 
will match recursive finds looking for pack files.

I look forward to any review (it need not happen on the 
change, replies to this email would be fine also), in 
particular with respect to the approach and naming 
conventions.

Thanks,

-Martin


On Tuesday, January 03, 2017 02:46:12 PM 
jmelvin@codeaurora.org wrote:
> We’ve noticed cases where Stale File Handle Exceptions
> occur during git operations, which can happen on users of
> NFS repos when repacking is done on them.
> 
> To address this issue, we’ve added two new options to the
> JGit GC command:
> 
> --preserve-oldpacks: moves old pack files into the
> preserved subdirectory instead of deleting them after
> repacking
> 
> --prune-preserved: prunes old pack files from the
> preserved subdirectory after repacking, but before
> potentially moving the latest old pack files to this
> subdirectory
> 
> The strategy is to preserve old pack files around until
> the next repack with the hopes that they will become
> unreferenced by then and not cause any exceptions to
> running processes when they are finally deleted (pruned).
> 
> Change is uploaded for review here:
> https://git.eclipse.org/r/#/c/87969/
> 
> Thanks,
> James

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Preserve/Prune Old Pack Files
  2017-01-04 16:11 ` Preserve/Prune Old Pack Files Martin Fick
@ 2017-01-09  6:21   ` Jeff King
  2017-01-09  7:01     ` Mike Hommey
  2017-01-09 16:17     ` Martin Fick
  0 siblings, 2 replies; 7+ messages in thread
From: Jeff King @ 2017-01-09  6:21 UTC (permalink / raw)
  To: Martin Fick; +Cc: repo-discuss, jmelvin, jgit-dev, git

On Wed, Jan 04, 2017 at 09:11:55AM -0700, Martin Fick wrote:

> I am replying to this email across lists because I wanted to 
> highlight to the git community this jgit change to repacking 
> that we have up for review
> 
>  https://git.eclipse.org/r/#/c/87969/
> 
> This change introduces a new convention for how to preserve 
> old pack files in a staging area 
> (.git/objects/packs/preserved) before deleting them.  I 
> wanted to ensure that the new proposed convention would be 
> done in a way that would be satisfactory to the git 
> community as a whole so that it would be more easy to 
> provide the same behavior in git eventually.  The preserved 
> pack files (and accompanying index and bitmap files), are not 
> only moved, but they are also renamed so that they no longer 
> will match recursive finds looking for pack files.

It looks like objects/pack/pack-123.pack becomes
objects/pack/preserved/pack-123.old-pack, and so forth.
Which seems reasonable, and I'm happy that:

  find objects/pack -name '*.pack'

would not find it. :)

I suspect the name-change will break a few tools that you might want to
use to look at a preserved pack (like verify-pack). I know that's not
your primary use case, but it seems plausible that somebody may one day
want to use a preserved pack to try to recover from corruption. I think
"git index-pack --stdin <objects/packs/preserved/pack-123.old-pack"
could always be a last-resort for re-admitting the objects to the
repository.

I notice this doesn't do anything for loose objects. I think they
technically suffer the same issue, though the race window is much
shorter (we mmap them and zlib inflate immediately, whereas packfiles
may stay mapped across many object requests).

I have one other thought that's tangentially related.

I've wondered if we could make object pruning more atomic by
speculatively moving items to be deleted into some kind of "outgoing"
object area. Right now you can have a case like:

  0. We have a pack that has commit X, which is reachable, and commit Y,
     which is not.

  1. Process A is repacking. It walks the object graph and finds that X
     is reachable. It begins creating a new pack with X and its
     dependent objects.

  2. Meanwhile, process B pushes up a merge of X and Y, and updates a
     ref to point to it.

  3. Process A finishes writing the new pack, and deletes the old one,
     removing Y. The repository is now corrupt.

I don't have a solution here.  I don't think we want to solve it by
locking the repository for updates during a repack. I have a vague sense
that a solution could be crafted around moving the old pack into a
holding area instead of deleting (during which time nobody else would
see the objects, and thus not reference them), while the repacking
process checks to see if the actual deletion would break any references
(and rolls back the deletion if it would).

That's _way_ more complicated than your problem, and as I said, I do not
have a finished solution. But it seems like they touch on a similar
concept (a post-delete holding area for objects). So I thought I'd
mention it in case if spurs any brilliance.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Preserve/Prune Old Pack Files
  2017-01-09  6:21   ` Jeff King
@ 2017-01-09  7:01     ` Mike Hommey
  2017-01-09 10:55       ` Jeff King
  2017-01-09 16:17     ` Martin Fick
  1 sibling, 1 reply; 7+ messages in thread
From: Mike Hommey @ 2017-01-09  7:01 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Jan 09, 2017 at 01:21:37AM -0500, Jeff King wrote:
> On Wed, Jan 04, 2017 at 09:11:55AM -0700, Martin Fick wrote:
> 
> > I am replying to this email across lists because I wanted to 
> > highlight to the git community this jgit change to repacking 
> > that we have up for review
> > 
> >  https://git.eclipse.org/r/#/c/87969/
> > 
> > This change introduces a new convention for how to preserve 
> > old pack files in a staging area 
> > (.git/objects/packs/preserved) before deleting them.  I 
> > wanted to ensure that the new proposed convention would be 
> > done in a way that would be satisfactory to the git 
> > community as a whole so that it would be more easy to 
> > provide the same behavior in git eventually.  The preserved 
> > pack files (and accompanying index and bitmap files), are not 
> > only moved, but they are also renamed so that they no longer 
> > will match recursive finds looking for pack files.
> 
> It looks like objects/pack/pack-123.pack becomes
> objects/pack/preserved/pack-123.old-pack, and so forth.
> Which seems reasonable, and I'm happy that:
> 
>   find objects/pack -name '*.pack'
> 
> would not find it. :)
> 
> I suspect the name-change will break a few tools that you might want to
> use to look at a preserved pack (like verify-pack). I know that's not
> your primary use case, but it seems plausible that somebody may one day
> want to use a preserved pack to try to recover from corruption. I think
> "git index-pack --stdin <objects/packs/preserved/pack-123.old-pack"
> could always be a last-resort for re-admitting the objects to the
> repository.
> 
> I notice this doesn't do anything for loose objects. I think they
> technically suffer the same issue, though the race window is much
> shorter (we mmap them and zlib inflate immediately, whereas packfiles
> may stay mapped across many object requests).
> 
> I have one other thought that's tangentially related.
> 
> I've wondered if we could make object pruning more atomic by
> speculatively moving items to be deleted into some kind of "outgoing"
> object area. Right now you can have a case like:
> 
>   0. We have a pack that has commit X, which is reachable, and commit Y,
>      which is not.
> 
>   1. Process A is repacking. It walks the object graph and finds that X
>      is reachable. It begins creating a new pack with X and its
>      dependent objects.
> 
>   2. Meanwhile, process B pushes up a merge of X and Y, and updates a
>      ref to point to it.
> 
>   3. Process A finishes writing the new pack, and deletes the old one,
>      removing Y. The repository is now corrupt.
> 
> I don't have a solution here.  I don't think we want to solve it by
> locking the repository for updates during a repack. I have a vague sense
> that a solution could be crafted around moving the old pack into a
> holding area instead of deleting (during which time nobody else would
> see the objects, and thus not reference them), while the repacking
> process checks to see if the actual deletion would break any references
> (and rolls back the deletion if it would).
> 
> That's _way_ more complicated than your problem, and as I said, I do not
> have a finished solution. But it seems like they touch on a similar
> concept (a post-delete holding area for objects). So I thought I'd
> mention it in case if spurs any brilliance.

Something that is kind-of in the same family of problems is the
"loosening" or objects on repacks, before they can be pruned.

When you have a large repository and do large rewrite operations
(extreme case, a filter-branch on a multi-hundred-thousands commits),
and you gc for the first time, git will possibly create a *lot* of loose
objects, each of which will consume an inode and a file system block. In
the extreme case, you can end up with git gc filling up multiple extra
gigabytes on your disk.

Mike

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Preserve/Prune Old Pack Files
  2017-01-09  7:01     ` Mike Hommey
@ 2017-01-09 10:55       ` Jeff King
  2017-01-09 16:20         ` Martin Fick
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2017-01-09 10:55 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git

On Mon, Jan 09, 2017 at 04:01:19PM +0900, Mike Hommey wrote:

> > That's _way_ more complicated than your problem, and as I said, I do not
> > have a finished solution. But it seems like they touch on a similar
> > concept (a post-delete holding area for objects). So I thought I'd
> > mention it in case if spurs any brilliance.
> 
> Something that is kind-of in the same family of problems is the
> "loosening" or objects on repacks, before they can be pruned.
> 
> When you have a large repository and do large rewrite operations
> (extreme case, a filter-branch on a multi-hundred-thousands commits),
> and you gc for the first time, git will possibly create a *lot* of loose
> objects, each of which will consume an inode and a file system block. In
> the extreme case, you can end up with git gc filling up multiple extra
> gigabytes on your disk.

I think we're getting pretty far off-field here. :)

Yes, this can be a problem. The repack is smart enough not to write out
objects which would just get pruned immediately, but since the grace
period is 2 weeks, that can include a lot of objects (especially with
history rewriting as you note). It would be possible to write those
loose objects to a "cruft" pack, but there are some management issues
around the cruft pack. You do not want to keep repacking them into a new
cruft pack at each repack, since then they would never expire. So you
need some way of marking the pack as cruft, letting it age out, and then
deleting it after the grace period expires.

I don't think it would be _that_ hard, but AFAIK nobody has ever made
patches.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Preserve/Prune Old Pack Files
  2017-01-09  6:21   ` Jeff King
  2017-01-09  7:01     ` Mike Hommey
@ 2017-01-09 16:17     ` Martin Fick
  2017-01-10  9:14       ` Jeff King
  1 sibling, 1 reply; 7+ messages in thread
From: Martin Fick @ 2017-01-09 16:17 UTC (permalink / raw)
  To: Jeff King; +Cc: repo-discuss, jmelvin, jgit-dev, git

On Monday, January 09, 2017 01:21:37 AM Jeff King wrote:
> On Wed, Jan 04, 2017 at 09:11:55AM -0700, Martin Fick 
wrote:
> > I am replying to this email across lists because I
> > wanted to highlight to the git community this jgit
> > change to repacking that we have up for review
> > 
> >  https://git.eclipse.org/r/#/c/87969/
> > 
> > This change introduces a new convention for how to
> > preserve old pack files in a staging area
> > (.git/objects/packs/preserved) before deleting them.  I
> > wanted to ensure that the new proposed convention would
> > be done in a way that would be satisfactory to the git
> > community as a whole so that it would be more easy to
> > provide the same behavior in git eventually.  The
> > preserved pack files (and accompanying index and bitmap
> > files), are not only moved, but they are also renamed
> > so that they no longer will match recursive finds
> > looking for pack files.
> It looks like objects/pack/pack-123.pack becomes
> objects/pack/preserved/pack-123.old-pack,

Yes, that's the idea.

> and so forth. Which seems reasonable, and I'm happy that:
> 
>   find objects/pack -name '*.pack'
> 
> would not find it. :)

Cool.

> I suspect the name-change will break a few tools that you
> might want to use to look at a preserved pack (like
> verify-pack).  I know that's not your primary use case,
> but it seems plausible that somebody may one day want to
> use a preserved pack to try to recover from corruption. I
> think "git index-pack --stdin
> <objects/packs/preserved/pack-123.old-pack" could always
> be a last-resort for re-admitting the objects to the
> repository.

or even a simple manual rename/move back to its orginal 
place?

> I notice this doesn't do anything for loose objects. I
> think they technically suffer the same issue, though the
> race window is much shorter (we mmap them and zlib
> inflate immediately, whereas packfiles may stay mapped
> across many object requests).

Hmm, yeah that's the next change, didn't you see it? :)  No, 
actually I forgot about those.  Our server tends to not have 
too many of those (loose objects), and I don't think we have 
seen any exceptions yet for them.  But, of course, you are 
right, they should get fixed too.  I will work on a followup 
change to do that.

Where would you suggest we store those?  Maybe under 
".git/objects/preserved/<xx>/<sha1>"?  Do they need to be 
renamed also somehow to avoid a find?

...
> I've wondered if we could make object pruning more atomic
> by speculatively moving items to be deleted into some
> kind of "outgoing" object area.
...
> I don't have a solution here.  I don't think we want to
> solve it by locking the repository for updates during a
> repack. I have a vague sense that a solution could be
> crafted around moving the old pack into a holding area
> instead of deleting (during which time nobody else would
> see the objects, and thus not reference them), while the
> repacking process checks to see if the actual deletion
> would break any references (and rolls back the deletion
> if it would).
> 
> That's _way_ more complicated than your problem, and as I
> said, I do not have a finished solution. But it seems
> like they touch on a similar concept (a post-delete
> holding area for objects). So I thought I'd mention it in
> case if spurs any brilliance.

I agree, this is a problem I have wanted to solve also.  I 
think having a "preserved" directory does open the door to 
such "recovery" solutions, although I think you would 
actually want to modify the many read code paths to fall 
back to looking at the preserved area and performing 
immediate "recovery" of the pack file if it ends up being 
needed.  That's a lot of work, but having the packs (and 
eventually the loose objects) preserved into a location 
where no new references will be built to depend on them is 
likely the first step.  Does the name "preserved" do well for 
that use case also, or would there be some better name, what 
would a transactional system call them?

Thanks for the review Peff!

-Martin


-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Preserve/Prune Old Pack Files
  2017-01-09 10:55       ` Jeff King
@ 2017-01-09 16:20         ` Martin Fick
  0 siblings, 0 replies; 7+ messages in thread
From: Martin Fick @ 2017-01-09 16:20 UTC (permalink / raw)
  To: Jeff King; +Cc: Mike Hommey, git

On Monday, January 09, 2017 05:55:45 AM Jeff King wrote:
> On Mon, Jan 09, 2017 at 04:01:19PM +0900, Mike Hommey 
wrote:
> > > That's _way_ more complicated than your problem, and
> > > as I said, I do not have a finished solution. But it
> > > seems like they touch on a similar concept (a
> > > post-delete holding area for objects). So I thought
> > > I'd mention it in case if spurs any brilliance.
> > 
> > Something that is kind-of in the same family of problems
> > is the "loosening" or objects on repacks, before they
> > can be pruned.
...
> Yes, this can be a problem. The repack is smart enough not
> to write out objects which would just get pruned
> immediately, but since the grace period is 2 weeks, that
> can include a lot of objects (especially with history
> rewriting as you note). It would be possible to write
> those loose objects to a "cruft" pack, but there are some
> management issues around the cruft pack. You do not want
> to keep repacking them into a new cruft pack at each
> repack, since then they would never expire. So you need
> some way of marking the pack as cruft, letting it age
> out, and then deleting it after the grace period expires.
> 
> I don't think it would be _that_ hard, but AFAIK nobody
> has ever made patches.

FYI, jgit does this,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Preserve/Prune Old Pack Files
  2017-01-09 16:17     ` Martin Fick
@ 2017-01-10  9:14       ` Jeff King
  0 siblings, 0 replies; 7+ messages in thread
From: Jeff King @ 2017-01-10  9:14 UTC (permalink / raw)
  To: Martin Fick; +Cc: repo-discuss, jmelvin, jgit-dev, git

On Mon, Jan 09, 2017 at 09:17:56AM -0700, Martin Fick wrote:

> > I suspect the name-change will break a few tools that you
> > might want to use to look at a preserved pack (like
> > verify-pack).  I know that's not your primary use case,
> > but it seems plausible that somebody may one day want to
> > use a preserved pack to try to recover from corruption. I
> > think "git index-pack --stdin
> > <objects/packs/preserved/pack-123.old-pack" could always
> > be a last-resort for re-admitting the objects to the
> > repository.
> 
> or even a simple manual rename/move back to its orginal 
> place?

Yes, that would work. There's not a tool to do it, but it's a fairly
straightforward transformation.

> [loose objects]
> Where would you suggest we store those?  Maybe under 
> ".git/objects/preserved/<xx>/<sha1>"?  Do they need to be 
> renamed also somehow to avoid a find?

It would make sense to me to have a single "preserved" root, with
"<xx>/<sha1>.old" and "packs/pack-<sha1>.old-pack" together under it.

You could also move the objects out of objects/ entirely. Say, to
".git/preserved-objects" or something. Then you could probably do away
with the filename munging altogether, and "restoring" an object or pack
would be a simple "mv" or "cp" (or you could even add preserved-objects
to $GIT_ALTERNATE_OBJECT_DIRECTORIES if you wanted to do a single
operation looking at both sets).

That's all outside the scope of your original purpose (which I think was
just to keep the files _somewhere_ so that the open descriptor stays
valid on NFS). But maybe it would make other related things more
convenient. I dunno. I'm just speaking off the top of my head.

> > That's _way_ more complicated than your problem, and as I
> > said, I do not have a finished solution. But it seems
> > like they touch on a similar concept (a post-delete
> > holding area for objects). So I thought I'd mention it in
> > case if spurs any brilliance.
> 
> I agree, this is a problem I have wanted to solve also.  I 
> think having a "preserved" directory does open the door to 
> such "recovery" solutions, although I think you would 
> actually want to modify the many read code paths to fall 
> back to looking at the preserved area and performing 
> immediate "recovery" of the pack file if it ends up being 
> needed.

In my (admittedly not very concrete) plan, the read code paths
_wouldn't_ know to look in the preserved area. It would be up to the
repacking process to rollback in case of a race. That does open a period
(between the faux delete and the rollback) where readers may be broken.
But that's much better than the state today, which is that the readers
are broken, and that breakage persists forever.

But there may be other better ways of doing it.  What we're really
talking about is a transactional system where neither side locks (or at
least not for an appreciable amount of time), and one side is capable of
falling back and modifying its operation when there's a relevant race.
There's probably some research in this area and some standard solutions,
but it's not an area I'm overly familiar with (and building any solution
on top of POSIX filesystem semantics adds an extra challenge).

> That's a lot of work, but having the packs (and 
> eventually the loose objects) preserved into a location 
> where no new references will be built to depend on them is 
> likely the first step.  Does the name "preserved" do well for 
> that use case also, or would there be some better name, what 
> would a transactional system call them?

I wasn't going to bikeshed, but since you ask...:)

"preserved" to me sounds like something we'd be keeping forever. These
objects are more in a "pending delete" state, or a purgatory. Maybe
something along those lines would be more appropriate.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-01-10  9:16 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <24abd0ed58c25ce832014f9bd5bb2090@codeaurora.org>
2017-01-04 16:11 ` Preserve/Prune Old Pack Files Martin Fick
2017-01-09  6:21   ` Jeff King
2017-01-09  7:01     ` Mike Hommey
2017-01-09 10:55       ` Jeff King
2017-01-09 16:20         ` Martin Fick
2017-01-09 16:17     ` Martin Fick
2017-01-10  9:14       ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).