git@vger.kernel.org mailing list mirror (one of many)
* Working with zip files
@ 2016-08-16 16:25 Nikolaus Rath
  2016-08-16 16:27 ` David Lang
  0 siblings, 1 reply; 14+ messages in thread
From: Nikolaus Rath @ 2016-08-16 16:25 UTC (permalink / raw)
  To: git

Hello,

I would like to store Simulink models in a Git
repository. Unfortunately, the file format is binary. But luckily, the
binary format happens to be a zipfile containing nicely formatted XML
files.

Is there a way to teach Git to take advantage of this when storing,
diff-ing and merging these files?

Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«


* Re: Working with zip files
  2016-08-16 16:25 Working with zip files Nikolaus Rath
@ 2016-08-16 16:27 ` David Lang
  2016-08-16 16:32   ` Nikolaus Rath
                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: David Lang @ 2016-08-16 16:27 UTC (permalink / raw)
  To: Nikolaus Rath; +Cc: git

On Tue, 16 Aug 2016, Nikolaus Rath wrote:

> I would like to store Simulink models in a Git
> repository. Unfortunately, the file format is binary. But luckily, the
> binary format happens to be a zipfile containing nicely formatted XML
> files.
>
> Is there a way to teach Git to take advantage of this when storing,
> diff-ing and merging these files?

you should be able to use clean/smudge to have git store the files uncompressed, 
which will help a lot.

I think there's a way to tell it to do a xml aware diff/patch, but I don't 
remember how.

David Lang


* Re: Working with zip files
  2016-08-16 16:27 ` David Lang
@ 2016-08-16 16:32   ` Nikolaus Rath
  2016-08-16 16:48     ` David Lang
  2016-08-16 16:58   ` Junio C Hamano
  2016-08-16 21:14   ` Nikolaus Rath
  2 siblings, 1 reply; 14+ messages in thread
From: Nikolaus Rath @ 2016-08-16 16:32 UTC (permalink / raw)
  To: David Lang; +Cc: git

On Aug 16 2016, David Lang <david@lang.hm> wrote:
> On Tue, 16 Aug 2016, Nikolaus Rath wrote:
>
>> I would like to store Simulink models in a Git
>> repository. Unfortunately, the file format is binary. But luckily, the
>> binary format happens to be a zipfile containing nicely formatted XML
>> files.
>>
>> Is there a way to teach Git to take advantage of this when storing,
>> diff-ing and merging these files?
>
> you should be able to use clean/smudge to have git store the files
> uncompressed, which will help a lot.

Cool, I'll look into that.

> I think there's a way to tell it to do a xml aware diff/patch, but I
> don't remember how.

Oh, I didn't even want to go that far. I'm perfectly happy if it does a
text-based diff/patch of the contained XML files. Would clean/smudge
provide that already?


Best,
-Nikolaus



* Re: Working with zip files
  2016-08-16 16:32   ` Nikolaus Rath
@ 2016-08-16 16:48     ` David Lang
  0 siblings, 0 replies; 14+ messages in thread
From: David Lang @ 2016-08-16 16:48 UTC (permalink / raw)
  To: Nikolaus Rath; +Cc: git

On Tue, 16 Aug 2016, Nikolaus Rath wrote:

> On Aug 16 2016, David Lang <david@lang.hm> wrote:
>> On Tue, 16 Aug 2016, Nikolaus Rath wrote:
>>
>>> I would like to store Simulink models in a Git
>>> repository. Unfortunately, the file format is binary. But luckily, the
>>> binary format happens to be a zipfile containing nicely formatted XML
>>> files.
>>>
>>> Is there a way to teach Git to take advantage of this when storing,
>>> diff-ing and merging these files?
>>
>> you should be able to use clean/smudge to have git store the files
>> uncompressed, which will help a lot.
>
> Cool, I'll look into that.
>
>> I think there's a way to tell it to do a xml aware diff/patch, but I
>> don't remember how.
>
> Oh, I didn't even want to go that far. I'm perfectly happy if it does a
> text-based diff/patch of the contained XML files. Would clean/smudge
> provide that already?

yes.


* Re: Working with zip files
  2016-08-16 16:27 ` David Lang
  2016-08-16 16:32   ` Nikolaus Rath
@ 2016-08-16 16:58   ` Junio C Hamano
  2016-08-16 19:56     ` Jakub Narębski
  2016-08-16 21:14   ` Nikolaus Rath
  2 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2016-08-16 16:58 UTC (permalink / raw)
  To: David Lang; +Cc: Nikolaus Rath, git

David Lang <david@lang.hm> writes:

> you should be able to use clean/smudge to have git store the files
> uncompressed, which will help a lot.
>
> I think there's a way to tell it to do a xml aware diff/patch, but I
> don't remember how.

I do not know about "patch" (in the sense of "git apply"), but "git
diff" (and "git log -p") can take advantage of the clean/smudge
mechanism.  I used to deal with a file format that is gzipped xml so
my clean filter was "gzip -dc" while the smudge was "gzip -cn".
Essentially, this stores the xml before compression in the repository
so blobs delta well with each other and also the revisions are
made textually diff-able.

Nikolaus's case has one extra layer of complexity in that the "file"
is actually an archive of multiple files.  The clean/smudge pair he
writes needs to be a filter that flattens the archive into a single
human-readable text byte stream and its reverse.
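
As a rough illustration of the gzipped-xml case described above, the two
filters could be sketched in Python roughly as follows.  The function
names are hypothetical; a real filter would read the file content on
stdin and write the result to stdout, and would be hooked up through the
filter.<name>.clean and filter.<name>.smudge configuration variables
together with a matching .gitattributes entry.  Passing mtime=0 is an
assumption that stands in for the "-n" in "gzip -cn", so that the same
input always produces the same compressed bytes.

```python
import gzip


def gzip_clean(worktree_bytes: bytes) -> bytes:
    """Worktree -> repository: store the uncompressed xml (like "gzip -dc")."""
    return gzip.decompress(worktree_bytes)


def gzip_smudge(repo_bytes: bytes) -> bytes:
    """Repository -> worktree: recompress the xml (like "gzip -cn").

    mtime=0 keeps the timestamp out of the gzip header, so repeated
    checkouts of the same blob give identical files.
    """
    return gzip.compress(repo_bytes, mtime=0)
```

The round trip restores the xml exactly, but the recompressed file is not
guaranteed to be byte-identical to the one the original tool wrote.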


* Re: Working with zip files
  2016-08-16 16:58   ` Junio C Hamano
@ 2016-08-16 19:56     ` Jakub Narębski
  2016-08-16 20:19       ` Junio C Hamano
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Narębski @ 2016-08-16 19:56 UTC (permalink / raw)
  To: Junio C Hamano, David Lang; +Cc: Nikolaus Rath, git

W dniu 16.08.2016 o 18:58, Junio C Hamano pisze:
> David Lang <david@lang.hm> writes:
> 
>> you should be able to use clean/smudge to have git store the files
>> uncompressed, which will help a lot.

You can find rezip clean/smudge filter (originally intended for
OpenDocument Format (ODF), that is OpenOffice.org etc.) that stores
zip or zip-archive (like ODT, jar, etc.) uncompressed.  I think
you can find it on GitWiki, but I might be mistaken.

>> I think there's a way to tell it to do a xml aware diff/patch, but I
>> don't remember how.
> 
> I do not know about "patch" (in the sense of "git apply"), but "git
> diff" (and "git log -p") can take advantage of the clean/smudge
> mechanism.  I used to deal with a file format that is gzipped xml so
> my clean filter was "gzip -dc" while the smudge was "gzip -cn".
> Essentially, this stores the xml before compression in the repository
> so blobs delta well with each other and also the revisions are
> made textually diff-able.
> 
> Nikolaus's case has one extra layer of complexity in that the "file"
> is actually an archive of multiple files.  The clean/smudge pair he
> writes needs to be a filter that flattens the archive into a single
> human-readable text byte stream and its reverse.

There is also `textconv` filter that can be used instead; it might
be 'unzip -c' (extract files to stdout, with filenames), or 'unzip -p'
(same, without filenames).
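
For anyone who prefers a script over calling unzip directly, a textconv
helper along the lines of 'unzip -c' might look like the sketch below.
The name zip_to_text and the "--- member: ... ---" marker are made up
for this example; git would call such a helper with the path of a
temporary file and use whatever it prints as the text to diff
(diff.<name>.textconv).

```python
import io
import zipfile


def zip_to_text(source):
    """Return all members of a zip archive as one text string.

    `source` may be a file name or a binary file object.  Each member is
    preceded by a marker line carrying its name, similar in spirit to
    what 'unzip -c' prints.
    """
    chunks = []
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            chunks.append("--- member: %s ---\n" % name)
            # Members are assumed to be UTF-8 text (XML); undecodable
            # bytes are replaced rather than aborting the diff.
            chunks.append(archive.read(name).decode("utf-8", errors="replace"))
            chunks.append("\n")
    return "".join(chunks)


# Small demonstration with an in-memory archive.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("model.xml", "<model/>")
print(zip_to_text(demo), end="")
```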

-- 
Jakub Narębski


* Re: Working with zip files
  2016-08-16 19:56     ` Jakub Narębski
@ 2016-08-16 20:19       ` Junio C Hamano
  2016-08-18 12:16         ` Jakub Narębski
  0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2016-08-16 20:19 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: David Lang, Nikolaus Rath, git

Jakub Narębski <jnareb@gmail.com> writes:

> There is also `textconv` filter that can be used instead; it might
> be 'unzip -c' (extract files to stdout, with filenames), or 'unzip -p'
> (same, without filenames).

That assumes that the in-repository data is a zipped binary blob; the
result won't delta well, will it?




* Re: Working with zip files
  2016-08-16 16:27 ` David Lang
  2016-08-16 16:32   ` Nikolaus Rath
  2016-08-16 16:58   ` Junio C Hamano
@ 2016-08-16 21:14   ` Nikolaus Rath
  2016-08-17  5:31     ` Jacob Keller
  2016-08-17  9:58     ` David Lang
  2 siblings, 2 replies; 14+ messages in thread
From: Nikolaus Rath @ 2016-08-16 21:14 UTC (permalink / raw)
  To: git

On Aug 16 2016, David Lang <david@lang.hm> wrote:
> On Tue, 16 Aug 2016, Nikolaus Rath wrote:
>
>> I would like to store Simulink models in a Git
>> repository. Unfortunately, the file format is binary. But luckily, the
>> binary format happens to be a zipfile containing nicely formatted XML
>> files.
>>
>> Is there a way to teach Git to take advantage of this when storing,
>> diff-ing and merging these files?
>
> you should be able to use clean/smudge to have git store the files
> uncompressed, which will help a lot.

Having looked at that, I'm not sure if this really helps:

As I understand, the smudge command is run on checkout to convert the
blob in the repository to the format that is desired in the working
tree. But this is the opposite of what I need: on checkout, I need to
convert the text data in the repository to a blob in the working tree.

Furthermore, I need to convert multiple text files into one blob, while
smudge/clean seem to do just 1:1 conversions.

Am I missing something? Are there any other options?

Best,
Nikolaus


* Re: Working with zip files
  2016-08-16 21:14   ` Nikolaus Rath
@ 2016-08-17  5:31     ` Jacob Keller
  2016-08-17  9:58     ` David Lang
  1 sibling, 0 replies; 14+ messages in thread
From: Jacob Keller @ 2016-08-17  5:31 UTC (permalink / raw)
  To: Git mailing list

On Tue, Aug 16, 2016 at 2:14 PM, Nikolaus Rath <Nikolaus@rath.org> wrote:
> On Aug 16 2016, David Lang <david@lang.hm> wrote:
>> On Tue, 16 Aug 2016, Nikolaus Rath wrote:
>>
>>> I would like to store Simulink models in a Git
>>> repository. Unfortunately, the file format is binary. But luckily, the
>>> binary format happens to be a zipfile containing nicely formatted XML
>>> files.
>>>
>>> Is there a way to teach Git to take advantage of this when storing,
>>> diff-ing and merging these files?
>>
>> you should be able to use clean/smudge to have git store the files
>> uncompressed, which will help a lot.
>
> Having looked at that, I'm not sure if this really helps:
>
> As I understand, the smudge command is run on checkout to convert the
> blob in the repository to the format that is desired in the working
> tree. But this is the opposite of what I need: on checkout, I need to
> convert the text data in the repository to a blob in the working tree.
>
> Furthermore, I need to convert multiple text files into one blob, while
> smudge/clean seem to do just 1:1 conversions.
>
> Am I missing something? Are there any other options?

You want to store the contents of the zip file as *one* blob that is
the uncompressed contents of the archive somehow concatenated
together. That should still be a 1:1 relationship.

You won't store one blob per file in the zip.
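
To make the "somehow concatenated" part concrete, here is one possible
sketch in Python.  The header format ("@@ member: <name> <size>") and the
function names flatten/unflatten are invented for this example and are
not an existing tool.  flatten would be the core of the clean filter and
unflatten the core of the smudge filter.

```python
import io
import zipfile

HEADER = b"@@ member: "  # hypothetical marker line, not a standard format


def flatten(zip_bytes: bytes) -> bytes:
    """Zip archive -> one byte stream: a header line, then the member."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            payload = archive.read(info)
            name = info.filename.encode("utf-8")
            # The byte count lets unflatten() find the end of the member
            # without having to escape anything inside the XML.
            out.write(b"%s%s %d\n" % (HEADER, name, len(payload)))
            out.write(payload)
            out.write(b"\n")
    return out.getvalue()


def unflatten(flat: bytes) -> bytes:
    """Reverse of flatten(): rebuild a (recompressed) zip archive."""
    src = io.BytesIO(flat)
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as archive:
        while True:
            line = src.readline()
            if not line:
                break
            if not line.startswith(HEADER):
                raise ValueError("unexpected line: %r" % line)
            name, size = line[len(HEADER):].rstrip(b"\n").rsplit(b" ", 1)
            payload = src.read(int(size))
            src.read(1)  # skip the newline written after the payload
            # Fixed timestamp so the same blob always yields the same zip.
            info = zipfile.ZipInfo(name.decode("utf-8"),
                                   date_time=(1980, 1, 1, 0, 0, 0))
            info.compress_type = zipfile.ZIP_DEFLATED
            archive.writestr(info, payload)
    return out.getvalue()
```

Member names and contents survive the round trip, but timestamps,
compression settings and other zip metadata do not, so the checked-out
file will generally not be bit-for-bit identical to the one that was
added.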

Thanks,
Jake


* Re: Working with zip files
  2016-08-16 21:14   ` Nikolaus Rath
  2016-08-17  5:31     ` Jacob Keller
@ 2016-08-17  9:58     ` David Lang
  1 sibling, 0 replies; 14+ messages in thread
From: David Lang @ 2016-08-17  9:58 UTC (permalink / raw)
  To: Nikolaus Rath; +Cc: git

On Tue, 16 Aug 2016, Nikolaus Rath wrote:

> On Aug 16 2016, David Lang <david@lang.hm> wrote:
>> On Tue, 16 Aug 2016, Nikolaus Rath wrote:
>>
>>> I would like to store Simulink models in a Git
>>> repository. Unfortunately, the file format is binary. But luckily, the
>>> binary format happens to be a zipfile containing nicely formatted XML
>>> files.
>>>
>>> Is there a way to teach Git to take advantage of this when storing,
>>> diff-ing and merging these files?
>>
>> you should be able to use clean/smudge to have git store the files
>> uncompressed, which will help a lot.
>
> Having looked at that, I'm not sure if this really helps:
>
> As I understand, the smudge command is run on checkout to convert the
> blob in the repository to the format that is desired in the working
> tree. But this is the opposite of what I need: on checkout, I need to
> convert the text data in the repository to a blob in the working tree.
>
> Furthermore, I need to convert multiple text files into one blob, while
> smudge/clean seem to do just 1:1 conversions.
>
> Am I missing something? Are there any other options?

so the smudge command would zip the file and the clean command would unzip the 
file (assuming it's a single file, if the zip is multiple files, you will have 
to add something to combine them)

you want the working tree to have a zip file and the repository to have text.

David Lang


* Re: Working with zip files
  2016-08-16 20:19       ` Junio C Hamano
@ 2016-08-18 12:16         ` Jakub Narębski
  2016-08-18 16:56           ` David Lang
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Narębski @ 2016-08-18 12:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: David Lang, Nikolaus Rath, git

W dniu 16.08.2016 o 22:19, Junio C Hamano pisze:
> Jakub Narębski <jnareb@gmail.com> writes:
> 
>> There is also `textconv` filter that can be used instead; it might
>> be 'unzip -c' (extract files to stdout, with filenames), or 'unzip -p'
>> (same, without filenames).
> 
> That assumes that the in-repository data is a zipped binary blob; the
> result won't delta well, will it?

Full solution would involve `clean` filter to rezip with no compression
(which should delta well) and optional `smudge` filter to recompress;
if round-trip bit-for-bit equality is needed, the original zip parameters
must be saved somewhere, e.g. as ZIP archive comments.  This was mentioned
in the earlier part of my email (which might not have been clear):

JN>> You can find rezip clean/smudge filter (originally intended for
JN>> OpenDocument Format (ODF), that is OpenOffice.org etc.) that stores
JN>> zip or zip-archive (like ODT, jar, etc.) uncompressed.  I think
JN>> you can find it on GitWiki, but I might be mistaken.
 
Using 'unzip -c' as separate / additional `textconv` filter for diff
generation allows separating the problem of deltifiable storage format
from textual representation for diff-ing.

Though best results could be had with `diff` and `merge` drivers...
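
The rezip idea (store the archive uncompressed in the repository,
optionally recompress on checkout) can be sketched in Python as below.
This is not the rezip script mentioned above, only an illustration of
the same approach; the function name and its signature are made up here.

```python
import io
import zipfile


def rezip(zip_bytes: bytes, compression: int) -> bytes:
    """Rewrite a zip archive so that every member uses `compression`.

    zipfile.ZIP_STORED (no compression) would suit a clean filter, and
    zipfile.ZIP_DEFLATED an optional smudge filter.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            data = src.read(info)
            # Carry over name, timestamp and file attributes; the
            # original compression parameters are NOT preserved.
            new_info = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            new_info.external_attr = info.external_attr
            new_info.compress_type = compression
            dst.writestr(new_info, data)
    return out.getvalue()
```

With ZIP_STORED the XML appears verbatim inside the blob, which is what
lets it delta well.  As noted above, bit-for-bit round trips would need
the original zip parameters to be recorded somewhere, which this sketch
does not attempt.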

-- 
Jakub Narębski


* Re: Working with zip files
  2016-08-18 12:16         ` Jakub Narębski
@ 2016-08-18 16:56           ` David Lang
  2016-08-18 17:45             ` Jakub Narębski
  0 siblings, 1 reply; 14+ messages in thread
From: David Lang @ 2016-08-18 16:56 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Junio C Hamano, Nikolaus Rath, git

On Thu, 18 Aug 2016, Jakub Narębski wrote:

> JN>> You can find rezip clean/smudge filter (originally intended for
> JN>> OpenDocument Format (ODF), that is OpenOffice.org etc.) that stores
> JN>> zip or zip-archive (like ODT, jar, etc.) uncompressed.  I think
> JN>> you can find it on GitWiki, but I might be mistaken.
>
> Using 'unzip -c' as separate / additional `textconv` filter for diff
> generation allows separating the problem of deltifiable storage format
> from textual representation for diff-ing.
>
> Though best results could be had with `diff` and `merge` drivers...

can you point at an example of how to do this? when I went looking about a year 
ago to deal with single-line json data I wasn't able to find anything good. I 
ended up using clean/smudge to pretty-print the json so it was easier to handle.
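
For reference, the pretty-printing workaround amounts to something like
the following Python sketch (function names are hypothetical, and this
is a reconstruction of the idea rather than the exact filters that were
used): the clean side expands the single-line json into one value per
line, and the smudge side collapses it again.

```python
import json


def json_clean(worktree_text: str) -> str:
    """Worktree -> repository: indent the json so it diffs line by line."""
    return json.dumps(json.loads(worktree_text), indent=2) + "\n"


def json_smudge(repo_text: str) -> str:
    """Repository -> worktree: back to a compact single-line form."""
    return json.dumps(json.loads(repo_text), separators=(",", ":"))
```

Key order is kept as found in the input, but whitespace and number
formatting are normalized, so the checked-out file matches the original
only if the original was already written in this compact style.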

David Lang


* Re: Working with zip files
  2016-08-18 16:56           ` David Lang
@ 2016-08-18 17:45             ` Jakub Narębski
  2016-08-19  3:00               ` David Lang
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Narębski @ 2016-08-18 17:45 UTC (permalink / raw)
  To: David Lang; +Cc: Junio C Hamano, Nikolaus Rath, git

On 18 August 2016 at 18:56, David Lang <david@lang.hm> wrote:
> On Thu, 18 Aug 2016, Jakub Narębski wrote:
>
>> JN>> You can find rezip clean/smudge filter (originally intended for
>> JN>> OpenDocument Format (ODF), that is OpenOffice.org etc.) that stores
>> JN>> zip or zip-archive (like ODT, jar, etc.) uncompressed.  I think
>> JN>> you can find it on GitWiki, but I might be mistaken.
>>
>> Using 'unzip -c' as separate / additional `textconv` filter for diff
>> generation allows separating the problem of deltifiable storage format
>> from textual representation for diff-ing.
>>
>> Though best results could be had with `diff` and `merge` drivers...
>
>
> can you point at an example of how to do this? when I went looking about a
> year ago to deal with single-line json data I wasn't able to find anything
> good. I ended up using clean/smudge to pretty-print the json so it was
> easier to handle.

Pro Git has a chapter "Customizing Git - Git Attributes" about gitattributes
https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes

The section "Diffing Binary Files" has two examples: docx2txt (with wrapper)
for DOCX (MS Word) files, and exiftool for images. For JSON you could use
some prettyprinter / formatter like pp-json.

"Performing text diffs of binary files" section of gitattributes(1) manpage
covers 'textconv' vs 'diff', and uses 'exif' tool as textconv example.

HTH
-- 
Jakub Narębski


* Re: Working with zip files
  2016-08-18 17:45             ` Jakub Narębski
@ 2016-08-19  3:00               ` David Lang
  0 siblings, 0 replies; 14+ messages in thread
From: David Lang @ 2016-08-19  3:00 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Junio C Hamano, Nikolaus Rath, git

On Thu, 18 Aug 2016, Jakub Narębski wrote:

> On 18 August 2016 at 18:56, David Lang <david@lang.hm> wrote:
>> On Thu, 18 Aug 2016, Jakub Narębski wrote:
>>
>>> JN>> You can find rezip clean/smudge filter (originally intended for
>>> JN>> OpenDocument Format (ODF), that is OpenOffice.org etc.) that stores
>>> JN>> zip or zip-archive (like ODT, jar, etc.) uncompressed.  I think
>>> JN>> you can find it on GitWiki, but I might be mistaken.
>>>
>>> Using 'unzip -c' as separate / additional `textconv` filter for diff
>>> generation allows separating the problem of deltifiable storage format
>>> from textual representation for diff-ing.
>>>
>>> Though best results could be had with `diff` and `merge` drivers...
>>
>>
>> can you point at an example of how to do this? when I went looking about a
>> year ago to deal with single-line json data I wasn't able to find anything
>> good. I ended up using clean/smudge to pretty-print the json so it was
>> easier to handle.
>
> Pro Git has a chapter "Customizing Git - Git Attributes" about gitattributes
> https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes
>
> The section "Diffing Binary Files" has two examples: docx2txt (with wrapper)
> for DOCX (MS Word) files, and exiftool for images. For JSON you could use
> some prettyprinter / formatter like pp-json.
>
> "Performing text diffs of binary files" section of gitattributes(1) manpage
> covers 'textconv' vs 'diff', and uses 'exif' tool as textconv example.

As I read that section, it only applies to the human readable output of git 
diff.

And the merge section only talks about the default of using patch vs accepting a 
specific version in a merge.

It seems to me that what I'm looking for would be something to tell git to use a 
different command instead of diff/patch internally when creating and using the 
bundles.

David Lang

