git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* proposal for extending smudge/clean filters with raw file access
@ 2016-05-12 18:24 Joey Hess
  2016-05-12 19:01 ` Junio C Hamano
  2016-05-12 20:14 ` Junio C Hamano
  0 siblings, 2 replies; 7+ messages in thread
From: Joey Hess @ 2016-05-12 18:24 UTC (permalink / raw)
  To: git

I'm using smudge/clean filters in git-annex now, and it's not been an
entirely smooth fit between the interface and what git-annex wants
to do.

The clean filter has to consume the whole file content on stdin;
not reading it all will make git think the clean filter failed.
But, git-annex often doesn't need to read the whole content of a
work-tree file in order to clean it.

The smudge filter has to output the whole file content to stdout. But
git-annex often has the file's content on disk already, and could just
move it into place in the working tree. This would save CPU and IO and
often disk space too. But the smudge interface doesn't let git-annex use
the efficient approach.

So I propose extending the filter driver with two more optional
commands. Call them raw-clean and raw-smudge for now.

raw-clean would be like clean, but rather than being fed the whole
content of a large file on stdin, it would be passed the filename, and
can access the file itself. Like the clean filter, it outputs the
cleaned version on stdout.

raw-smudge would be like smudge, but rather than needing to output the
whole content of a large file on stdout, it would be passed a filename,
and can create that file itself.

To keep this backwards compatible, and to handle the cases where the
object being filtered is not a file on disk, the smudge and clean
filters would be required to be configured too, in order for raw-clean
and raw-smudge to be used.

It seems fairly easy to implement raw-clean. In sha1_file.c, index_path
would use raw-clean when available, while index_fd etc keep on using
the clean filter. I have not investigated what would be needed to implement
raw-smudge yet.

-- 
see shy jo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: proposal for extending smudge/clean filters with raw file access
  2016-05-12 18:24 proposal for extending smudge/clean filters with raw file access Joey Hess
@ 2016-05-12 19:01 ` Junio C Hamano
  2016-05-12 21:02   ` Joey Hess
  2016-05-12 20:14 ` Junio C Hamano
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2016-05-12 19:01 UTC (permalink / raw)
  To: Joey Hess; +Cc: git

Joey Hess <id@joeyh.name> writes:

> I'm using smudge/clean filters in git-annex now, and it's not been an
> entirely smooth fit between the interface and what git-annex wants
> to do.
>
> The clean filter has to consume the whole file content on stdin;
> not reading it all will make git think the clean filter failed.
> But, git-annex often doesn't need to read the whole content of a
> work-tree file in order to clean it.
>
> The smudge filter has to output the whole file content to stdout. But
> git-annex often has the file's content on disk already, and could just
> move it into place in the working tree. This would save CPU and IO and
> often disk space too. But the smudge interface doesn't let git-annex use
> the efficient approach.

The smudge happens to be the last to run, so it is quite true that
it can say "Hey Git, I've written it out already".

I didn't check all codepaths to ensure that we won't need the
smudged result in core at all, but I am guessing you did so before
making this proposal, so in that case I would say this sounds fine.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: proposal for extending smudge/clean filters with raw file access
  2016-05-12 18:24 proposal for extending smudge/clean filters with raw file access Joey Hess
  2016-05-12 19:01 ` Junio C Hamano
@ 2016-05-12 20:14 ` Junio C Hamano
  2016-05-12 20:46   ` Joey Hess
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2016-05-12 20:14 UTC (permalink / raw)
  To: Joey Hess; +Cc: git

Joey Hess <id@joeyh.name> writes:

> The clean filter has to consume the whole file content on stdin;
> not reading it all will make git think the clean filter failed.
> But, git-annex often doesn't need to read the whole content of a
> work-tree file in order to clean it.

This side, I do not think we even need a new variant.  We can just
update the code to interact with "clean" so that it the writer to
the pipe ignores SIGPIPE, detects EPIPE on write(2), says "ah, the
other end does not need the full input to produce its output".  The
reader from the pipe consumes its output without failing AS LONG AS
the "clean" filter exits with zero (we do check its exit status,
right?)

And I think we should do the same for any codepath that spawns
custom script and feeds it via a pipe from us (I am talking about
hooks here).  

What may require a new variant is when your clean filter may not
even need earlier contents of the file, in which case we are better
off not opening the file ourselves and slurping it into core, only
to pipe it and earlier part discarded by the filter.  "clean-from-fs"
filter that gets a path on the working tree and feeds us via pipe
would be appropriate to deal with such a requirement.

> The smudge filter has to output the whole file content to stdout.

We cannot do a similar "we can just unconditionally update" like I
said on the "clean" side to "smudge", so it would need a new
variant.  I do not want to call it anything "raw", as there is
nothing "raw" about it.  "smudge-to-fs" would be a better name.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: proposal for extending smudge/clean filters with raw file access
  2016-05-12 20:14 ` Junio C Hamano
@ 2016-05-12 20:46   ` Joey Hess
  2016-05-12 20:55     ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Joey Hess @ 2016-05-12 20:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1335 bytes --]

Junio C Hamano wrote:
> This side, I do not think we even need a new variant.  We can just
> update the code to interact with "clean" so that it the writer to
> the pipe ignores SIGPIPE, detects EPIPE on write(2), says "ah, the
> other end does not need the full input to produce its output".  The
> reader from the pipe consumes its output without failing AS LONG AS
> the "clean" filter exits with zero (we do check its exit status,
> right?)

There are two problems with doing that. First, any clean filter that
relied on that would not work with older versions of git.

Secondly, and harder to get around, the filename passed to the clean
filter is not necessarily a path to the actual existing file that is
being cleaned. For example, git hash-object --stdin --path=whatever.
So the current clean filter can't really safely rely on accessing the
file to short-circuit its cleaning. (Although it seems to mostly work..
currently..)

> We cannot do a similar "we can just unconditionally update" like I
> said on the "clean" side to "smudge", so it would need a new
> variant.  I do not want to call it anything "raw", as there is
> nothing "raw" about it.  "smudge-to-fs" would be a better name.

"raw" was just a placeholder. "clean-from-fs" and "smudge-to-fs" are
pretty descriptive.

-- 
see shy jo

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: proposal for extending smudge/clean filters with raw file access
  2016-05-12 20:46   ` Joey Hess
@ 2016-05-12 20:55     ` Junio C Hamano
  2016-05-12 21:17       ` Joey Hess
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2016-05-12 20:55 UTC (permalink / raw)
  To: Joey Hess; +Cc: git

Joey Hess <id@joeyh.name> writes:

> Junio C Hamano wrote:
>> This side, I do not think we even need a new variant.  We can just
>> update the code to interact with "clean" so that it the writer to
>> the pipe ignores SIGPIPE, detects EPIPE on write(2), says "ah, the
>> other end does not need the full input to produce its output".  The
>> reader from the pipe consumes its output without failing AS LONG AS
>> the "clean" filter exits with zero (we do check its exit status,
>> right?)
>
> There are two problems with doing that. First, any clean filter that
> relied on that would not work with older versions of git.

That's a fair point.

> Secondly, and harder to get around, the filename passed to the clean
> filter is not necessarily a path to the actual existing file that is
> being cleaned.

Either one of us is confused.  I was talking about updating the
current "clean" implementation without changing its interface,
i.e. gets fed via its standard input, expected to respond to its
standard output.  There is no filename involved.

And yes, "clean-from-fs" that is spawned by us with the name of the
file in the filesystem that has the contents as its argument, must
be prepared to see a filename that does not have any relation to the
filename in the working tree.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: proposal for extending smudge/clean filters with raw file access
  2016-05-12 19:01 ` Junio C Hamano
@ 2016-05-12 21:02   ` Joey Hess
  0 siblings, 0 replies; 7+ messages in thread
From: Joey Hess @ 2016-05-12 21:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 670 bytes --]

Junio C Hamano wrote:
> The smudge happens to be the last to run, so it is quite true that
> it can say "Hey Git, I've written it out already".
> 
> I didn't check all codepaths to ensure that we won't need the
> smudged result in core at all, but I am guessing you did so before
> making this proposal, so in that case I would say this sounds fine.

Well, the idea is to only use smudge-to-file when the smudged content is
going to be written out to a file. Any other code paths that need to
smudge some content would use the smudge filter.

So, try_create_file would use it. Maybe some other places I have not
found yet could as well.

-- 
see shy jo

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: proposal for extending smudge/clean filters with raw file access
  2016-05-12 20:55     ` Junio C Hamano
@ 2016-05-12 21:17       ` Joey Hess
  0 siblings, 0 replies; 7+ messages in thread
From: Joey Hess @ 2016-05-12 21:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1183 bytes --]

Junio C Hamano wrote:
> > Secondly, and harder to get around, the filename passed to the clean
> > filter is not necessarily a path to the actual existing file that is
> > being cleaned.
> 
> Either one of us is confused.  I was talking about updating the
> current "clean" implementation without changing its interface,
> i.e. gets fed via its standard input, expected to respond to its
> standard output.  There is no filename involved.

I'm talking about the %f that can be passed to the clean filter.
The context that I left out is that my clean filter could avoid reading
all of stdin, and quickly produce the cleaned result, but only if it
can examine the file that's being cleaned. Which is not currently
entirely safe to use the %f for.

There may be a way to make a clean filter that can do something useful
without reading all of stdin, and without examining the file that's
being cleaned. Maybe. Hard to see how. I don't feel such a hypothetical
clean filter is worth changing the current EPIPE behavior to support.

So I think it's better to add a separate clean-from-fs and keep the
current clean filter interface as it stands.

-- 
see shy jo

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-05-12 21:18 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-12 18:24 proposal for extending smudge/clean filters with raw file access Joey Hess
2016-05-12 19:01 ` Junio C Hamano
2016-05-12 21:02   ` Joey Hess
2016-05-12 20:14 ` Junio C Hamano
2016-05-12 20:46   ` Joey Hess
2016-05-12 20:55     ` Junio C Hamano
2016-05-12 21:17       ` Joey Hess

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).