git@vger.kernel.org mailing list mirror (one of many)
* [Question] Signature calculation ignoring parts of binary files
@ 2018-09-12 19:16 Randall S. Becker
  2018-09-12 20:48 ` Johannes Sixt
  0 siblings, 1 reply; 10+ messages in thread
From: Randall S. Becker @ 2018-09-12 19:16 UTC (permalink / raw)
  To: git

I feel really bad asking this, and I should know the answer, and yet.

I have a binary file that needs to go into a repo intact (unchanged). I also
have a program that interprets the contents, like a textconv, that can
output the relevant portions of the file in whatever format I like - used
for diff typically, dumps in 1K chunks by file section. What I'm looking for
is to have the SHA1 signature calculated with just the relevant portions of
the file so that two actually different files will be considered the same by
git during a commit or status. In real terms, I'm trying to ignore the
Creator metadata of a JPG because it is mutable and irrelevant to my repo
contents.

I'm sorry to ask, but I thought this was in .gitattributes but I can't
confirm the SHA1 behaviour.

Sheepishly,
Randall


-- Brief whoami:
 NonStop developer since approximately 211288444200000000
 UNIX developer since approximately 421664400
-- In my real life, I talk too much.





* Re: [Question] Signature calculation ignoring parts of binary files
  2018-09-12 19:16 [Question] Signature calculation ignoring parts of binary files Randall S. Becker
@ 2018-09-12 20:48 ` Johannes Sixt
  2018-09-12 20:53   ` Randall S. Becker
  0 siblings, 1 reply; 10+ messages in thread
From: Johannes Sixt @ 2018-09-12 20:48 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: git

Am 12.09.18 um 21:16 schrieb Randall S. Becker:
> I feel really bad asking this, and I should know the answer, and yet.
> 
> I have a binary file that needs to go into a repo intact (unchanged). I also
> have a program that interprets the contents, like a textconv, that can
> output the relevant portions of the file in whatever format I like - used
> for diff typically, dumps in 1K chunks by file section. What I'm looking for
> is to have the SHA1 signature calculated with just the relevant portions of
> the file so that two actually different files will be considered the same by
> git during a commit or status. In real terms, I'm trying to ignore the
> Creator metadata of a JPG because it is mutable and irrelevant to my repo
> contents.
> 
> I'm sorry to ask, but I thought this was in .gitattributes but I can't
> confirm the SHA1 behaviour.

You are looking for a clean filter. See the 'filter' attribute in 
gitattributes(5). Your clean filter program or script should strip the 
unwanted metadata or set it to a constant known-good value.

(You shouldn't need a smudge filter.)

-- Hannes
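
A clean filter of the kind Hannes describes reads the working-tree file on stdin and writes the bytes to be stored on stdout. The sketch below is a minimal illustration, not a vetted tool: it assumes the unwanted Creator metadata lives in the JPEG's APP1 (Exif/XMP) segments, and the names `strip_app1` and `main` are hypothetical. It walks the marker segments that precede the scan data and drops APP1, leaving everything else untouched.

```python
import sys


def strip_app1(jpeg: bytes) -> bytes:
    """Return the JPEG with APP1 segments removed; all other bytes are kept."""
    if jpeg[:2] != b"\xff\xd8":            # no SOI marker: not a JPEG, pass through
        return jpeg
    out = bytearray(jpeg[:2])
    pos = 2
    while pos + 4 <= len(jpeg) and jpeg[pos] == 0xFF:
        marker = jpeg[pos + 1]
        if marker == 0xDA:                 # SOS: scan data follows, copy the rest as-is
            break
        # The 2-byte big-endian length counts itself but not the marker.
        length = int.from_bytes(jpeg[pos + 2:pos + 4], "big")
        end = pos + 2 + length
        if marker != 0xE1:                 # keep every segment except APP1
            out += jpeg[pos:end]
        pos = end
    out += jpeg[pos:]
    return bytes(out)


def main() -> None:
    """Filter entry point: stdin in, cleaned bytes out (assumed wiring)."""
    sys.stdout.buffer.write(strip_app1(sys.stdin.buffer.read()))
```

It could be wired up with a `.gitattributes` line such as `*.jpg filter=jpgclean` and a `filter.jpgclean.clean` config entry pointing at a script that calls `main()`; the driver name `jpgclean` is made up for this example. The blob Git stores is then the cleaned file, not the original bytes.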


* RE: [Question] Signature calculation ignoring parts of binary files
  2018-09-12 20:48 ` Johannes Sixt
@ 2018-09-12 20:53   ` Randall S. Becker
  2018-09-12 22:20     ` Randall S. Becker
  0 siblings, 1 reply; 10+ messages in thread
From: Randall S. Becker @ 2018-09-12 20:53 UTC (permalink / raw)
  To: 'Johannes Sixt'; +Cc: git

> -----Original Message-----
> From: git-owner@vger.kernel.org <git-owner@vger.kernel.org> On Behalf
> Of Johannes Sixt
> Sent: September 12, 2018 4:48 PM
> To: Randall S. Becker <rsbecker@nexbridge.com>
> Cc: git@vger.kernel.org
> Subject: Re: [Question] Signature calculation ignoring parts of binary files
> 
> Am 12.09.18 um 21:16 schrieb Randall S. Becker:
> > I feel really bad asking this, and I should know the answer, and yet.
> >
> > I have a binary file that needs to go into a repo intact (unchanged).
> > I also have a program that interprets the contents, like a textconv,
> > that can output the relevant portions of the file in whatever format I
> > like - used for diff typically, dumps in 1K chunks by file section.
> > What I'm looking for is to have the SHA1 signature calculated with
> > just the relevant portions of the file so that two actually different
> > files will be considered the same by git during a commit or status. In
> > real terms, I'm trying to ignore the Creator metadata of a JPG because
> > it is mutable and irrelevant to my repo contents.
> >
> > I'm sorry to ask, but I thought this was in .gitattributes but I can't
> > confirm the SHA1 behaviour.
> 
> You are looking for a clean filter. See the 'filter' attribute in gitattributes(5).
> Your clean filter program or script should strip the unwanted metadata or set
> it to a constant known-good value.
> 
> (You shouldn't need a smudge filter.)
> 
> -- Hannes

Thanks Hannes. I thought about the clean filter, but I don't actually want to modify the file when going into git, just for SHA calculation. I need to be able to keep some origin metadata that might change with subsequent copies, so just cleaning the origin is not going to work - actually knowing the original author is important to our process. My objective is to keep the original file 100% exact as supplied and then ignore any changes to the metadata that I don't care about (like Creator) if the remainder of the file is the same.

Regards,
Randall




* RE: [Question] Signature calculation ignoring parts of binary files
  2018-09-12 20:53   ` Randall S. Becker
@ 2018-09-12 22:20     ` Randall S. Becker
  2018-09-12 22:59       ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Randall S. Becker @ 2018-09-12 22:20 UTC (permalink / raw)
  To: 'Johannes Sixt'; +Cc: git

On September 12, 2018 4:54 PM, I wrote:
> On September 12, 2018 4:48 PM, Johannes Sixt wrote:
> > Am 12.09.18 um 21:16 schrieb Randall S. Becker:
> > > I feel really bad asking this, and I should know the answer, and yet.
> > >
> > > I have a binary file that needs to go into a repo intact (unchanged).
> > > I also have a program that interprets the contents, like a textconv,
> > > that can output the relevant portions of the file in whatever format
> > > I like - used for diff typically, dumps in 1K chunks by file section.
> > > What I'm looking for is to have the SHA1 signature calculated with
> > > just the relevant portions of the file so that two actually
> > > different files will be considered the same by git during a commit
> > > or status. In real terms, I'm trying to ignore the Creator metadata
> > > of a JPG because it is mutable and irrelevant to my repo contents.
> > >
> > > I'm sorry to ask, but I thought this was in .gitattributes but I
> > > can't confirm the SHA1 behaviour.
> >
> > You are looking for a clean filter. See the 'filter' attribute in gitattributes(5).
> > Your clean filter program or script should strip the unwanted metadata
> > or set it to a constant known-good value.
> >
> > (You shouldn't need a smudge filter.)
> >
> > -- Hannes
> 
> Thanks Hannes. I thought about the clean filter, but I don't actually want to
> modify the file when going into git, just for SHA calculation. I need to be able
> to keep some origin metadata that might change with subsequent copies, so
> just cleaning the origin is not going to work - actually knowing the original
> author is important to our process. My objective is to keep the original file
> 100% exact as supplied and then ignore any changes to the metadata that I
> don't care about (like Creator) if the remainder of the file is the same.

I had a thought that might be workable, opinions are welcome on this.

The commit of my rather weird project is done by a script so I have flexibility in my approach. What I could do is set up a diff textconv configuration so that the text diff of the two JPG files will show no differences if the immutable fields and the image are the same. I can then trigger a git add and git commit for only those files where git diff reports no differences. That way the actual original file is stored in git with 100% fidelity (no cleaning). It's not as elegant as I'd like, but it does solve what I'm trying to do. Does this sound reasonable and/or is there a better way?

Cheers,
Randall

-- Brief whoami:
 NonStop developer since approximately 211288444200000000
 UNIX developer since approximately 421664400
-- In my real life, I talk too much.





* Re: [Question] Signature calculation ignoring parts of binary files
  2018-09-12 22:20     ` Randall S. Becker
@ 2018-09-12 22:59       ` Junio C Hamano
  2018-09-13 12:19         ` Randall S. Becker
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2018-09-12 22:59 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: 'Johannes Sixt', git

"Randall S. Becker" <rsbecker@nexbridge.com> writes:

>> author is important to our process. My objective is to keep the original file
>> 100% exact as supplied and then ignore any changes to the metadata that I
>> don't care about (like Creator) if the remainder of the file is the same.

That will *not* work.  If person A gave you a version of original,
which hashes to X after you strip the cruft you do not care about,
you would register that original with person A's fingerprint on
under the name of X.  What happens when person B gives you another
version, which is not byte-for-byte identical to the one you got
earlier from person A, but does hash to the same X after you strip
the cruft?  If you are going to store it in Git, and if by SHA-1 you
are calling what we perceive as "object name" in Git land, you must
store that one with person B's fingerprint on it also under the name
of X.  Now which version will you get from Git when you ask it to
give you the object that hashes to X?  
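
To make the collision concrete: in the SHA-1 object format, Git names a blob by hashing the header `blob <size>\0` followed by the file's bytes. The sketch below reproduces that calculation and shows two bytewise different files that would share one name if the hash were taken over stripped content. The `strip_creator` helper and the sample contents are hypothetical stand-ins for the JPEG case.

```python
import hashlib


def git_blob_name(content: bytes) -> str:
    """Object name Git gives a blob in the SHA-1 object format."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def strip_creator(content: bytes) -> bytes:
    """Hypothetical 'cruft' filter: drop lines that start with 'Creator:'."""
    lines = content.splitlines(keepends=True)
    return b"".join(l for l in lines if not l.startswith(b"Creator:"))


from_a = b"Creator: A\npixels\n"
from_b = b"Creator: B\npixels\n"

# Hashing the full bytes gives each original its own name.
assert git_blob_name(from_a) != git_blob_name(from_b)

# Hashing only the stripped bytes gives both originals the same name X,
# so a lookup by X cannot say which original to return.
name_x = git_blob_name(strip_creator(from_a))
assert name_x == git_blob_name(strip_creator(from_b))
```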


* RE: [Question] Signature calculation ignoring parts of binary files
  2018-09-12 22:59       ` Junio C Hamano
@ 2018-09-13 12:19         ` Randall S. Becker
  2018-09-13 15:03           ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Randall S. Becker @ 2018-09-13 12:19 UTC (permalink / raw)
  To: 'Junio C Hamano'; +Cc: 'Johannes Sixt', git

On September 12, 2018 7:00 PM, Junio C Hamano wrote:
> "Randall S. Becker" <rsbecker@nexbridge.com> writes:
> 
> >> author is important to our process. My objective is to keep the
> >> original file 100% exact as supplied and then ignore any changes to
> >> the metadata that I don't care about (like Creator) if the remainder of
> >> the file is the same.
>
> That will *not* work.  If person A gave you a version of original, which
> hashes to X after you strip the cruft you do not care about, you would
> register that original with person A's fingerprint on under the name of X.
> What happens when person B gives you another version, which is not
> byte-for-byte identical to the one you got earlier from person A, but does
> hash to the same X after you strip the cruft?  If you are going to store it
> in Git, and if by SHA-1 you are calling what we perceive as "object name"
> in Git land, you must store that one with person B's fingerprint on it also
> under the name of X.  Now which version will you get from Git when you ask
> it to give you the object that hashes to X?

The scenario is slightly different.
1. Person A gives me a new binary file-1 with fingerprint A1. This goes into
git unchanged.
2. Person B gives me binary file-2 with fingerprint B2. This does not go
into git yet.
3. We attempt a git diff between the committed file-1 and uncommitted file-2
using a textconv implementation that strips what we don't need to compare.
4. If file-1 and file-2 have no difference when textconv is used, file-2 is
not added and not committed. It is discarded with impunity, never to be seen
again, although we might whine a lot at the user for attempting to put
file-2 in - but that's not git's issue.
5. If file-1 and file-2 have differences when textconv is used, file-2 is
committed with fingerprint B2.
6. Even if an error is made by the user and they commit file-2 with B2
regardless of textconv, there will be a human who complains about it, but
git has two unambiguous fingerprints that happen to have no diffs after
textconv is applied.
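
The decision in steps 3 to 5 can be sketched without Git at all. The code below is an illustration under assumptions: it compares digests of the reduced bytes directly instead of running `git diff` with a textconv driver, and `drop_first_line` is a hypothetical reducer for a toy format whose first line is the mutable metadata.

```python
import hashlib
from typing import Callable, Iterable

Reducer = Callable[[bytes], bytes]


def essence_digest(data: bytes, reduce: Reducer) -> str:
    """Digest of only the portion of the file that matters for comparison."""
    return hashlib.sha256(reduce(data)).hexdigest()


def should_commit(candidate: bytes, committed: Iterable[bytes], reduce: Reducer) -> bool:
    """True if no already-committed file reduces to the same bytes as the candidate."""
    seen = {essence_digest(c, reduce) for c in committed}
    return essence_digest(candidate, reduce) not in seen


def drop_first_line(data: bytes) -> bytes:
    """Hypothetical reducer: treat the first line as mutable metadata."""
    return data.partition(b"\n")[2]


file_1 = b"creator=A\nIMAGE-BYTES"
file_2 = b"creator=B\nIMAGE-BYTES"   # same image, different metadata
file_3 = b"creator=B\nOTHER-IMAGE"   # genuinely different image

print(should_commit(file_2, [file_1], drop_first_line))  # step 4: duplicate, discard
print(should_commit(file_3, [file_1], drop_first_line))  # step 5: new content, commit
```

Whatever is committed still goes in byte-for-byte; the reducer only influences the accept-or-reject decision.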

My original hope was that textconv could be used to influence the
fingerprint, but I do not think that is the case, so I went with an
alternative. In the application, I am not allowed to strip any cruft off
file-1 when it is stored - it must be byte-for-byte the original file. This
application is marginally related to a DRM-like situation where we only care
about the original image provided by a user, but any copies that are
provided by another user with modified metadata will be disallowed from
the repository.

Does that make more sense? 

Cheers,
Randall



* Re: [Question] Signature calculation ignoring parts of binary files
  2018-09-13 12:19         ` Randall S. Becker
@ 2018-09-13 15:03           ` Junio C Hamano
  2018-09-13 15:38             ` Randall S. Becker
  2018-09-13 17:51             ` Junio C Hamano
  0 siblings, 2 replies; 10+ messages in thread
From: Junio C Hamano @ 2018-09-13 15:03 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: 'Johannes Sixt', git

"Randall S. Becker" <rsbecker@nexbridge.com> writes:

> The scenario is slightly different.
> 1. Person A gives me a new binary file-1 with fingerprint A1. This goes into
> git unchanged.
> 2. Person B gives me binary file-2 with fingerprint B2. This does not go
> into git yet.
> 3. We attempt a git diff between the committed file-1 and uncommitted file-2
> using a textconv implementation that strips what we don't need to compare.
> 4. If file-1 and file-2 have no difference when textconv is used, file-2 is
> not added and not committed. It is discarded with impunity, never to be seen
> again, although we might whine a lot at the user for attempting to put
> file-2 in - but that's not git's issue.

You are forgetting that Git is a distributed version control system,
aren't you?  Person A and B can introduce their "moral equivalent
but bytewise different" copies to their repository under the same
object name, and you can pull from them--what happens?

It is fundamental that one object name given to Git identifies one
specific byte sequence contained in an object uniquely.  Once you
broke that, you no longer have Git.



* RE: [Question] Signature calculation ignoring parts of binary files
  2018-09-13 15:03           ` Junio C Hamano
@ 2018-09-13 15:38             ` Randall S. Becker
  2018-09-13 17:51             ` Junio C Hamano
  1 sibling, 0 replies; 10+ messages in thread
From: Randall S. Becker @ 2018-09-13 15:38 UTC (permalink / raw)
  To: 'Junio C Hamano'; +Cc: 'Johannes Sixt', git

On September 13, 2018 11:03 AM, Junio C Hamano wrote:
> "Randall S. Becker" <rsbecker@nexbridge.com> writes:
> 
> > The scenario is slightly different.
> > 1. Person A gives me a new binary file-1 with fingerprint A1. This
> > goes into git unchanged.
> > 2. Person B gives me binary file-2 with fingerprint B2. This does not
> > go into git yet.
> > 3. We attempt a git diff between the committed file-1 and uncommitted
> > file-2 using a textconv implementation that strips what we don't need to
> > compare.
> > 4. If file-1 and file-2 have no difference when textconv is used,
> > file-2 is not added and not committed. It is discarded with impunity,
> > never to be seen again, although we might whine a lot at the user for
> > attempting to put file-2 in - but that's not git's issue.
>
> You are forgetting that Git is a distributed version control system,
> aren't you?  Person A and B can introduce their "moral equivalent but
> bytewise different" copies to their repository under the same object name,
> and you can pull from them--what happens?
>
> It is fundamental that one object name given to Git identifies one
> specific byte sequence contained in an object uniquely.  Once you broke
> that, you no longer have Git.

At that point I have a morally questionable situation, agreed. However, both
are permitted to exist in the underlying tree without conflict in git -
which I do consider a legitimately possible situation that will not break
the application at all - although there is a semantic conflict in the
application (not in git) that requires human decision to resolve. The fact
that both objects can exist in git with different fingerprints is a good
thing because it provides immutable evidence and ownership of someone
bypassing the intent of the application.

So, rather than using textconv, I shall implement this rule in the
application rather than trying to configure git to do it. If two conflicting
objects enter the commit history, the application will have the
responsibility to resolve the semantic/legal conflict.

Thanks,
Randall




* Re: [Question] Signature calculation ignoring parts of binary files
  2018-09-13 15:03           ` Junio C Hamano
  2018-09-13 15:38             ` Randall S. Becker
@ 2018-09-13 17:51             ` Junio C Hamano
  2018-09-13 17:55               ` Randall S. Becker
  1 sibling, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2018-09-13 17:51 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: 'Johannes Sixt', git

Junio C Hamano <gitster@pobox.com> writes:

> "Randall S. Becker" <rsbecker@nexbridge.com> writes:
>
>> The scenario is slightly different.
>> 1. Person A gives me a new binary file-1 with fingerprint A1. This goes into
>> git unchanged.
>> 2. Person B gives me binary file-2 with fingerprint B2. This does not go
>> into git yet.
>> 3. We attempt a git diff between the committed file-1 and uncommitted file-2
>> using a textconv implementation that strips what we don't need to compare.
>> 4. If file-1 and file-2 have no difference when textconv is used, file-2 is
>> not added and not committed. It is discarded with impunity, never to be seen
>> again, although we might whine a lot at the user for attempting to put
>> file-2 in - but that's not git's issue.
>
> You are forgetting that Git is a distributed version control system,
> aren't you?  Person A and B can introduce their "moral equivalent
> but bytewise different" copies to their repository under the same
> object name, and you can pull from them--what happens?
>
> It is fundamental that one object name given to Git identifies one
> specific byte sequence contained in an object uniquely.  Once you
> broke that, you no longer have Git.

Having said all that, if you want to keep the original with frills
but somehow give these bytewise different things that reduce to the
same essence (e.g. when passed thru a filter like textconv), I
suspect a better approach might be to store both the "original" and
the result of passing the "original" through the filter in the
object database.  In the above example, you'll get two "original"
objects from person A and person B, plus one "canonical" object that
is bytewise different from either of these two originals, but is what
they reduce to when you use the filter on them.  Then you record the
fact that to derive the "essence" object, you can reduce either
person A's or person B's "original" through the filter, perhaps by
using "git notes" attached to the "essence" object, recording the
object names of these originals (the reason for using notes in this
direction is that you can mechanically determine which "essence"
object any given "original" object reduces to---it is just a
matter of passing it through the filter.  But there can be more than
one "original" that reduces to the same "essence").
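
A minimal in-memory model of this bookkeeping is sketched below. It is not how `git notes` is implemented; the `Store` class, its method names and the toy reducer are assumptions made for illustration. Originals and the reduced object are stored under their own blob names, and a note keyed by the essence name lists every original that reduces to it.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, Set, Tuple


def blob_name(content: bytes) -> str:
    """Blob object name as Git computes it in the SHA-1 object format."""
    return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()


class Store:
    """Toy object database plus a notes table (essence name -> original names)."""

    def __init__(self, reduce: Callable[[bytes], bytes]) -> None:
        self.reduce = reduce
        self.objects: Dict[str, bytes] = {}
        self.notes: Dict[str, Set[str]] = defaultdict(set)

    def _put(self, content: bytes) -> str:
        name = blob_name(content)
        self.objects[name] = content
        return name

    def add_original(self, original: bytes) -> Tuple[str, str]:
        """Store the untouched original and its essence; record the link in notes."""
        orig_name = self._put(original)
        essence_name = self._put(self.reduce(original))
        self.notes[essence_name].add(orig_name)
        return orig_name, essence_name


# Hypothetical reducer: everything after the first line is the "essence".
store = Store(lambda data: data.partition(b"\n")[2])
name_a, essence_a = store.add_original(b"creator=A\nIMAGE-BYTES")
name_b, essence_b = store.add_original(b"creator=B\nIMAGE-BYTES")

# Two distinct originals, one shared essence, and a note listing both originals.
print(name_a != name_b, essence_a == essence_b)
print(sorted(store.notes[essence_a]) == sorted([name_a, name_b]))
```

Each object name still identifies exactly one byte sequence; the "same image" relationship lives in the notes, outside the naming scheme.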



* RE: [Question] Signature calculation ignoring parts of binary files
  2018-09-13 17:51             ` Junio C Hamano
@ 2018-09-13 17:55               ` Randall S. Becker
  0 siblings, 0 replies; 10+ messages in thread
From: Randall S. Becker @ 2018-09-13 17:55 UTC (permalink / raw)
  To: 'Junio C Hamano'; +Cc: 'Johannes Sixt', git

On September 13, 2018 1:52 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > "Randall S. Becker" <rsbecker@nexbridge.com> writes:
> >
> >> The scenario is slightly different.
> >> 1. Person A gives me a new binary file-1 with fingerprint A1. This
> >> goes into git unchanged.
> >> 2. Person B gives me binary file-2 with fingerprint B2. This does not
> >> go into git yet.
> >> 3. We attempt a git diff between the committed file-1 and uncommitted
> >> file-2 using a textconv implementation that strips what we don't need
> >> to compare.
> >> 4. If file-1 and file-2 have no difference when textconv is used,
> >> file-2 is not added and not committed. It is discarded with impunity,
> >> never to be seen again, although we might whine a lot at the user for
> >> attempting to put file-2 in - but that's not git's issue.
> >
> > You are forgetting that Git is a distributed version control system,
> > aren't you?  Person A and B can introduce their "moral equivalent but
> > bytewise different" copies to their repository under the same object
> > name, and you can pull from them--what happens?
> >
> > It is fundamental that one object name given to Git identifies one
> > specific byte sequence contained in an object uniquely.  Once you
> > broke that, you no longer have Git.
> 
> Having said all that, if you want to keep the original with frills but
> somehow give these bytewise different things that reduce to the same
> essence (e.g. when passed thru a filter like textconv), I suspect a better
> approach might be to store both the "original" and the result of passing
> the "original" through the filter in the object database.  In the above
> example, you'll get two "original" objects from person A and person B,
> plus one "canonical" object that is bytewise different from either of
> these two originals, but is what they reduce to when you use the filter on
> them.  Then you record the fact that to derive the "essence" object, you
> can reduce either person A's or person B's "original" through the filter,
> perhaps by using "git notes" attached to the "essence" object, recording
> the object names of these originals (the reason for using notes in this
> direction is that you can mechanically determine which "essence" object
> any given "original" object reduces to---it is just a matter of passing
> it through the filter.  But there can be more than one "original" that
> reduces to the same "essence").

I like that idea. It turns the reduced object into a contract. Thanks.


