git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [GSoC14][RFC] Proposal Draft: Refactor tempfile handling
@ 2014-03-01 21:04 Brian Gesiak
  2014-03-03 22:42 ` Jeff King
  2014-03-11 16:27 ` Michael Haggerty
  0 siblings, 2 replies; 7+ messages in thread
From: Brian Gesiak @ 2014-03-01 21:04 UTC (permalink / raw)
  To: git

Hello all,

My name is Brian Gesiak. I'm a research student at the University of
Tokyo, and I'm hoping to participate in this year's Google Summer of
Code by contributing to Git. I'm a longtime user, first-time
contributor--some of you may have noticed my "microproject"
patches.[1][2]

I'd like to gather some information on one of the GSoC ideas posted on
the ideas page. Namely, I'm interested in refactoring the way
tempfiles are cleaned up.

The ideas page points out that while lock files are closed and
unlinked[3] when the program exits[4], object pack files implement
their own brand of temp file creation and deletion. This
implementation doesn't share the same guarantees as lock files--it is
possible that the program terminates before the temp file is
unlinked.[5]

Lock file references are stored in a linked list. When the program
exits, this list is traversed and each file is closed and unlinked. It
seems to me that this mechanism is appropriate for temp files in
general, not just lock files. Thus, my proposal would be to extract
this logic into a separate module--tempfile.h, perhaps. Lock and
object files would share the tempfile implementation.

That is, both object and lock temp files would be stored in a linked
list, and all of these would be deleted at program exit.

I'm very enthused about this project--I think it has it all:

- Tangible benefits for the end-user
- Reduced complexity in the codebase
- Ambitious enough to be interesting
- Small enough to realistically be completed in a summer

Please let me know if this seems like it would make for an interesting
proposal, or if perhaps there is something I am overlooking. Any
feedback at all would be appreciated. Thank you!

- Brian Gesiak

[1] http://thread.gmane.org/gmane.comp.version-control.git/242891
[2] http://thread.gmane.org/gmane.comp.version-control.git/242893
[3] https://github.com/git/git/blob/v1.9.0/lockfile.c#L18
[4] https://github.com/git/git/blob/v1.9.0/lockfile.c#L143
[5] https://github.com/git/git/blob/v1.9.0/pack-write.c#L350

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GSoC14][RFC] Proposal Draft: Refactor tempfile handling
  2014-03-01 21:04 [GSoC14][RFC] Proposal Draft: Refactor tempfile handling Brian Gesiak
@ 2014-03-03 22:42 ` Jeff King
  2014-03-08 17:04   ` Brian Gesiak
  2014-03-11 16:27 ` Michael Haggerty
  1 sibling, 1 reply; 7+ messages in thread
From: Jeff King @ 2014-03-03 22:42 UTC (permalink / raw)
  To: Brian Gesiak; +Cc: git

On Sun, Mar 02, 2014 at 06:04:39AM +0900, Brian Gesiak wrote:

> My name is Brian Gesiak. I'm a research student at the University of
> Tokyo, and I'm hoping to participate in this year's Google Summer of
> Code by contributing to Git. I'm a longtime user, first-time
> contributor--some of you may have noticed my "microproject"
> patches.[1][2]

Yes, we did notice them. Thanks, and welcome. :)

> The ideas page points out that while lock files are closed and
> unlinked[3] when the program exits[4], object pack files implement
> their own brand of temp file creation and deletion. This
> implementation doesn't share the same guarantees as lock files--it is
> possible that the program terminates before the temp file is
> unlinked.[5]
> 
> Lock file references are stored in a linked list. When the program
> exits, this list is traversed and each file is closed and unlinked. It
> seems to me that this mechanism is appropriate for temp files in
> general, not just lock files. Thus, my proposal would be to extract
> this logic into a separate module--tempfile.h, perhaps. Lock and
> object files would share the tempfile implementation.
> 
> That is, both object and lock temp files would be stored in a linked
> list, and all of these would be deleted at program exit.

Yes, I think this is definitely the right way to go. We should be able
to unify the tempfile handling for all of git. Once the logic is
extracted into a nice API, there are several other places that can use
it, too:

  - the external diff code creates tempfiles and uses its own cleanup
    routines

  - the shallow_XXXXXX tempfiles (these are not cleaned right now,
    though I sent a patch recently for them to do their own cleanup)

Those are just off the top of my head. There may be other spots, too.

It is worth thinking in your proposal about some of the things that the
API will want to handle. What are the mismatches in how lockfiles and
object files are handled? E.g., how do we finalize them into place?
How should the API be designed to minimize race conditions (e.g., if we
get a signal delivered while we are committing or cleaning up a file)?

> Please let me know if this seems like it would make for an interesting
> proposal, or if perhaps there is something I am overlooking. Any
> feedback at all would be appreciated. Thank you!

You definitely have a grasp of what the project is aiming for, and which
areas need to be touched.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GSoC14][RFC] Proposal Draft: Refactor tempfile handling
  2014-03-03 22:42 ` Jeff King
@ 2014-03-08 17:04   ` Brian Gesiak
  2014-03-11  1:33     ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Gesiak @ 2014-03-08 17:04 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Excellent, thank you very much for the feedback, Jeff! It was very
helpful and encouraging. I've done some more research based on your
comments.

> Once the logic is extracted into a nice API, there are
> several other places that can use it, too: ...

I've found the following four areas so far:

1. lockfile.lock_file
2. git-compat-util.odb_mkstemp
3. git-compat-util.odb_pack_keep
4. diff.prepare_temp_file

Tons of files use (1) and (2). (3) is less common, and (4) is only
used for external diffs.

> the shallow_XXXXXX tempfiles

I'm not sure I was able to find this one. Are you referring to the
lock files used when fetching, such as in fetch-pack.c?

> What are the mismatches in how lockfiles and object files are
> handled? E.g., how do we finalize them into place? How should
> the API be designed to minimize race conditions (e.g., if we
> get a signal delivered while we are committing or cleaning up a file)?

I'd say the biggest difference between lockfiles and object files is
that tempfile methods like odb_mkstemp need to know the location of
the object directory. Aside from that, lockfiles and the external diff
files appear to be cleaned up at exit, while temporary object files
tend to have a more finely controlled lifecycle. I'm still
investigating this aspect of the proposal, though.

One question, though: the idea on the ideas page specifies that
temporary pack and object files may "optionally" be cleaned up in case
of error during program execution. How will users specify their
preference? I think the API for creating temporary files should allow
cleanup options to be specified on a per-file basis. That way each
part of the program that creates tempfiles can specify a different
config value to determine the cleanup policy.

Thanks for all your help so far!

- Brian Gesiak

PS: I'm maintaining a working draft of my proposal here, in case
anyone wants to offer any feedback prior to its submission:
https://gist.github.com/modocache/9434914


On Tue, Mar 4, 2014 at 7:42 AM, Jeff King <peff@peff.net> wrote:
> On Sun, Mar 02, 2014 at 06:04:39AM +0900, Brian Gesiak wrote:
>
>> My name is Brian Gesiak. I'm a research student at the University of
>> Tokyo, and I'm hoping to participate in this year's Google Summer of
>> Code by contributing to Git. I'm a longtime user, first-time
>> contributor--some of you may have noticed my "microproject"
>> patches.[1][2]
>
> Yes, we did notice them. Thanks, and welcome. :)
>
>> The ideas page points out that while lock files are closed and
>> unlinked[3] when the program exits[4], object pack files implement
>> their own brand of temp file creation and deletion. This
>> implementation doesn't share the same guarantees as lock files--it is
>> possible that the program terminates before the temp file is
>> unlinked.[5]
>>
>> Lock file references are stored in a linked list. When the program
>> exits, this list is traversed and each file is closed and unlinked. It
>> seems to me that this mechanism is appropriate for temp files in
>> general, not just lock files. Thus, my proposal would be to extract
>> this logic into a separate module--tempfile.h, perhaps. Lock and
>> object files would share the tempfile implementation.
>>
>> That is, both object and lock temp files would be stored in a linked
>> list, and all of these would be deleted at program exit.
>
> Yes, I think this is definitely the right way to go. We should be able
> to unify the tempfile handling for all of git. Once the logic is
> extracted into a nice API, there are several other places that can use
> it, too:
>
>   - the external diff code creates tempfiles and uses its own cleanup
>     routines
>
>   - the shallow_XXXXXX tempfiles (these are not cleaned right now,
>     though I sent a patch recently for them to do their own cleanup)
>
> Those are just off the top of my head. There may be other spots, too.
>
> It is worth thinking in your proposal about some of the things that the
> API will want to handle. What are the mismatches in how lockfiles and
> object files are handled? E.g., how do we finalize them into place?
> How should the API be designed to minimize race conditions (e.g., if we
> get a signal delivered while we are committing or cleaning up a file)?
>
>> Please let me know if this seems like it would make for an interesting
>> proposal, or if perhaps there is something I am overlooking. Any
>> feedback at all would be appreciated. Thank you!
>
> You definitely have a grasp of what the project is aiming for, and which
> areas need to be touched.
>
> -Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GSoC14][RFC] Proposal Draft: Refactor tempfile handling
  2014-03-08 17:04   ` Brian Gesiak
@ 2014-03-11  1:33     ` Jeff King
  0 siblings, 0 replies; 7+ messages in thread
From: Jeff King @ 2014-03-11  1:33 UTC (permalink / raw)
  To: Brian Gesiak; +Cc: git

On Sun, Mar 09, 2014 at 02:04:16AM +0900, Brian Gesiak wrote:

> > Once the logic is extracted into a nice API, there are
> > several other places that can use it, too: ...
> 
> I've found the following four areas so far:
> 
> 1. lockfile.lock_file
> 2. git-compat-util.odb_mkstemp
> 3. git-compat-util.odb_pack_keep
> 4. diff.prepare_temp_file
> 
> Tons of files use (1) and (2). (3) is less common, and (4) is only
> used for external diffs.

Yeah, I would expect (1) and (2) to be the most frequent. (3) gets
written on every push and fetch, but only for a short period. (4) is
also used for diff's textconv, though like external diffs, they are
relatively rare.

In my experience, most of the cruft that gets left is from (2), since a
push or fetch will spool to a tmpfile, then verify the results via "git
index-pack". Any failure there leaves the file in place.

There are a few other potential candidates we can find by grepping for
mkstemp. Not all of those might want cleanup, but it's a starting point
for investigation.

> > the shallow_XXXXXX tempfiles
> 
> I'm not sure I was able to find this one. Are you referring to the
> lock files used when fetching, such as in fetch-pack.c?

I mean the xmkstemp from setup_temporary_shallow in shallow.c.

> I'd say the biggest difference between lockfiles and object files is
> that tempfile methods like odb_mkstemp need to know the location of
> the object directory. Aside from that, lockfiles and the external diff
> files appear to be cleaned up at exit, while temporary object files
> tend to have a more finely controlled lifecycle. I'm still
> investigating this aspect of the proposal, though.

The diff tempfiles are true tempfiles; they always go away in the end
(though of course we want to clean them up as we finish with them,
rather than doing it all at the end). Lockfiles may get committed into
place (i.e., via atomic rename) or rolled back (deleted).

Object files should generally be hard-linked into place, but there is
some extra magic in move_temp_to_file to fallback to renames.  Some of
that we may be able to get rid of (e.g., we try to avoid doing
cross-directory renames at all these days, so the comment there may be
out of date).

> One question, though: the idea on the ideas page specifies that
> temporary pack and object files may "optionally" be cleaned up in case
> of error during program execution. How will users specify their
> preference? I think the API for creating temporary files should allow
> cleanup options to be specified on a per-file basis. That way each
> part of the program that creates tempfiles can specify a different
> config value to determine the cleanup policy.

That probably makes sense. I certainly had a config option in mind. I
mentioned above that the most common cruft is leftover packfiles from
pushes and fetches. We haven't deleted those historically because the
same person often controls both the client and the server, and they
would want to possibly do forensics on the packfile sent to the remote,
or even rescue objects out of it. But the remote end may simply have
rejected the pack by some policy, and has no interest in forensics.

Having a config option for each type of file may be cool, but I don't
know how useful it would be in practice. Still, it's certainly worth
thinking about and looking into.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GSoC14][RFC] Proposal Draft: Refactor tempfile handling
  2014-03-01 21:04 [GSoC14][RFC] Proposal Draft: Refactor tempfile handling Brian Gesiak
  2014-03-03 22:42 ` Jeff King
@ 2014-03-11 16:27 ` Michael Haggerty
  2014-03-11 18:05   ` Jeff King
  1 sibling, 1 reply; 7+ messages in thread
From: Michael Haggerty @ 2014-03-11 16:27 UTC (permalink / raw)
  To: Brian Gesiak; +Cc: git, Jeff King

On 03/01/2014 10:04 PM, Brian Gesiak wrote:
> Hello all,
> 
> My name is Brian Gesiak. I'm a research student at the University of
> Tokyo, and I'm hoping to participate in this year's Google Summer of
> Code by contributing to Git. I'm a longtime user, first-time
> contributor--some of you may have noticed my "microproject"
> patches.[1][2]
> 
> I'd like to gather some information on one of the GSoC ideas posted on
> the ideas page. Namely, I'm interested in refactoring the way
> tempfiles are cleaned up.
> 
> The ideas page points out that while lock files are closed and
> unlinked[3] when the program exits[4], object pack files implement
> their own brand of temp file creation and deletion. This
> implementation doesn't share the same guarantees as lock files--it is
> possible that the program terminates before the temp file is
> unlinked.[5]
> 
> Lock file references are stored in a linked list. When the program
> exits, this list is traversed and each file is closed and unlinked. It
> seems to me that this mechanism is appropriate for temp files in
> general, not just lock files. Thus, my proposal would be to extract
> this logic into a separate module--tempfile.h, perhaps. Lock and
> object files would share the tempfile implementation.
> 
> That is, both object and lock temp files would be stored in a linked
> list, and all of these would be deleted at program exit.
> 
> I'm very enthused about this project--I think it has it all:
> 
> - Tangible benefits for the end-user
> - Reduced complexity in the codebase
> - Ambitious enough to be interesting
> - Small enough to realistically be completed in a summer
> 
> Please let me know if this seems like it would make for an interesting
> proposal, or if perhaps there is something I am overlooking. Any
> feedback at all would be appreciated. Thank you!

Hi Brian,

Thanks for your proposal.  I have a technical point that I think your
proposal should address:

Currently the linked list of lockfiles only grows, never shrinks.  Once
an object has been linked into the list, there is no way to remove it
again even after the lock has been released.  So if a lock needs to be
created dynamically at a random place in the code, its memory is
unavoidably leaked.

This hasn't been much of a problem in the past because (1) the number of
locks acquired/released during a Git invocation is reasonable, and (2) a
lock object (even if it is already in the list) can be reused after the
lock has been released.  So there are many lock callsites that define
one static lock instance and use it over and over again.

But I have a feeling that if we want to use a similar mechanism to
handle all temporary files (of which there can be more), then it would
be a good idea to lift this limitation.  It will require some care,
though, to make sure that record removal is done in a way that is
threadsafe and safe in the event of all expected kinds of process death.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GSoC14][RFC] Proposal Draft: Refactor tempfile handling
  2014-03-11 16:27 ` Michael Haggerty
@ 2014-03-11 18:05   ` Jeff King
  2014-03-12 12:26     ` Brian Gesiak
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2014-03-11 18:05 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Brian Gesiak, git

On Tue, Mar 11, 2014 at 05:27:05PM +0100, Michael Haggerty wrote:

> Thanks for your proposal.  I have a technical point that I think your
> proposal should address:
> 
> Currently the linked list of lockfiles only grows, never shrinks.  Once
> an object has been linked into the list, there is no way to remove it
> again even after the lock has been released.  So if a lock needs to be
> created dynamically at a random place in the code, its memory is
> unavoidably leaked.

Thanks, I remember thinking about this when I originally conceived of
the idea, but I forgot to mention it in the idea writeup.

In most cases the potential leaks are finite and small, but object
creation and diff tempfiles could both be unbounded. So this is
definitely something to consider. In both cases we have a bounded number
of _simultaneous_ tempfiles, so one strategy could be to continue using
static objects. But it should not be hard to do it dynamically, and I
suspect the resulting API will be a lot easier to comprehend.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [GSoC14][RFC] Proposal Draft: Refactor tempfile handling
  2014-03-11 18:05   ` Jeff King
@ 2014-03-12 12:26     ` Brian Gesiak
  0 siblings, 0 replies; 7+ messages in thread
From: Brian Gesiak @ 2014-03-12 12:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael Haggerty, git

> Currently the linked list of lockfiles only grows, never shrinks.  Once
> an object has been linked into the list, there is no way to remove it
> again even after the lock has been released.  So if a lock needs to be
> created dynamically at a random place in the code, its memory is
> unavoidably leaked.

Ah yes, I see. I think a good example is
config.git_config_set_multivar_in_file, which even contains a comment
detailing the problem: "Since lockfile.c keeps a linked list of all
created lock_file structures, it isn't safe to free(lock).  It's
better to just leave it hanging around."

> But I have a feeling that if we want to use a similar mechanism to
> handle all temporary files (of which there can be more), then it would
> be a good idea to lift this limitation.  It will require some care,
> though, to make sure that record removal is done in a way that is
> threadsafe and safe in the event of all expected kinds of process death.

It sounds like a threadsafe linked-list with an interface to manually
remove elements from the list is the solution here; does that sound
reasonable? Ensuring thread safety without sacrificing readability is
probably more difficult than it sounds, but I don't think it's
impossible.

I'll add some more details on this to my proposal[1]. Thank you!

- Brian Gesiak

[1] https://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/modocache/5629499534213120

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-03-12 12:26 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-01 21:04 [GSoC14][RFC] Proposal Draft: Refactor tempfile handling Brian Gesiak
2014-03-03 22:42 ` Jeff King
2014-03-08 17:04   ` Brian Gesiak
2014-03-11  1:33     ` Jeff King
2014-03-11 16:27 ` Michael Haggerty
2014-03-11 18:05   ` Jeff King
2014-03-12 12:26     ` Brian Gesiak

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).