git@vger.kernel.org mailing list mirror (one of many)
* Git and OpenDocument (OpenOffice.org) files
@ 2007-08-27  9:52 Matthieu Moy
  2007-08-27 10:08 ` Junio C Hamano
  2007-08-27 10:17 ` Johannes Schindelin
  0 siblings, 2 replies; 12+ messages in thread
From: Matthieu Moy @ 2007-08-27  9:52 UTC (permalink / raw)
  To: git

Hi,

I found a way to use git comfortably with OpenDocument files (that is,
what OpenOffice.org and Koffice produce. Text, Presentations and
Spreadsheets).

Briefly, you have to install odt2txt ( http://stosberg.net/odt2txt/ )
and the script below, together with GIT_EXTERNAL_DIFF and/or diff
drivers in .gitattributes. That gives you the text diff you're used to.

Everything is documented here:

  http://www-verimag.imag.fr/~moy/opendocument/

Remarks are welcome (I'll post some remarks about Git's custom diff
driver in a separate thread).
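
A minimal sketch of the diff-driver route, assuming git-oodiff is on
your PATH. The driver name "odf" is picked here for illustration only;
any name works as long as both files agree on it.

```
# .gitattributes (in the repository)
*.odt diff=odf
*.ods diff=odf
*.odp diff=odf

# .git/config (or ~/.gitconfig)
[diff "odf"]
	command = git-oodiff
```

With this in place, a plain "git diff" calls git-oodiff only for the
matching paths, instead of for every file as GIT_EXTERNAL_DIFF does.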


Script available from
http://www-verimag.imag.fr/~moy/opendocument/git-oodiff and reproduced
here :

#! /bin/sh

# Script acceptable as a value for GIT_EXTERNAL_DIFF.
# For example, you can see the changes in your working tree with
# 
# $ GIT_EXTERNAL_DIFF=git-oodiff git diff

echo $0 "$@"

if odt2txt "$2"  > /tmp/oodiff.$$.1  && \
    odt2txt "$5" > /tmp/oodiff.$$.2 ; then
    if diff -L "a/$1" -L "b/$1" -u /tmp/oodiff.$$.1 /tmp/oodiff.$$.2; then
        # no text change
        if diff -q "$2" "$5"; then
            : # no change at all
        else
            echo "OpenDocument files a/$1 and b/$1 differ (same text content)"
        fi
    fi
else
    # conversion failed. Fall back to plain diff.
    diff -L "a/$1" -L "b/$1" -u "$2" "$5"
fi

rm -f /tmp/oodiff.$$.1 /tmp/oodiff.$$.2

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Git and OpenDocument (OpenOffice.org) files
  2007-08-27  9:52 Git and OpenDocument (OpenOffice.org) files Matthieu Moy
@ 2007-08-27 10:08 ` Junio C Hamano
  2007-08-27 12:35   ` Matthieu Moy
  2007-08-27 10:17 ` Johannes Schindelin
  1 sibling, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2007-08-27 10:08 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: git

Matthieu Moy <Matthieu.Moy@imag.fr> writes:

> Remarks are welcome (I'll post some remarks about Git's custom diff
> driver in a separate thread).

Good.

I think creation/deletion will get /dev/null as the temporary
file name, so as long as odt2txt knows how to deal with
/dev/null you would not have to worry much about them.

You might want to be careful about unmerged paths, though.  They
will not get anything other than $1 (name).

You would probably not care about the mode changes for oo
documents, but they are available as $4 and $7 respectively, if
you care.
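
The cases above can be sketched as a small classifier, assuming the
seven-parameter order documented for GIT_EXTERNAL_DIFF (path old-file
old-hex old-mode new-file new-hex new-mode). The function name
classify_change is hypothetical and only serves the illustration.

```shell
# Hypothetical helper: report what kind of change the external diff
# program was called for, based only on its positional parameters.
classify_change () {
    if [ "$#" -eq 1 ]; then
        # Unmerged paths get nothing but the path name.
        echo "unmerged"
    elif [ "$2" = "/dev/null" ]; then
        # No old file: the path was created.
        echo "created"
    elif [ "$5" = "/dev/null" ]; then
        # No new file: the path was deleted.
        echo "deleted"
    elif [ "$4" != "$7" ]; then
        # Old mode ($4) and new mode ($7) differ.
        echo "mode-changed"
    else
        echo "modified"
    fi
}
```

A script would call it as: kind=$(classify_change "$@")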

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Git and OpenDocument (OpenOffice.org) files
  2007-08-27  9:52 Git and OpenDocument (OpenOffice.org) files Matthieu Moy
  2007-08-27 10:08 ` Junio C Hamano
@ 2007-08-27 10:17 ` Johannes Schindelin
  1 sibling, 0 replies; 12+ messages in thread
From: Johannes Schindelin @ 2007-08-27 10:17 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: git

Hi,

On Mon, 27 Aug 2007, Matthieu Moy wrote:

> I found a way to use git comfortably with OpenDocument files (that is, 
> what OpenOffice.org and Koffice produce. Text, Presentations and 
> Spreadsheets).

Heh.  I had that problem, too.  I added an attribute "*.odt diff=odt" and 
the diff driver unpacks the zip and executes an xmldiff on the files.  
Since at times, it is more interesting to do a word based diff, depending 
on the environment variable WORDDIFF, my diff driver executes "git diff 
--color-words" instead.
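
A rough sketch of that kind of driver follows. It is not the actual
script: the function names are hypothetical, only content.xml is
compared, and plain "diff -u" stands in for an XML-aware diff tool.

```shell
# Hypothetical helper: pick the comparison style from the WORDDIFF variable.
diff_mode () {
    if [ -n "${WORDDIFF:-}" ]; then
        echo word
    else
        echo xml
    fi
}

# Compare the content.xml members of two OpenDocument files ($1 and $2).
odt_content_diff () {
    old_xml=$(mktemp) || return 1
    new_xml=$(mktemp) || return 1
    # unzip -p writes the named archive member to stdout.
    unzip -p "$1" content.xml > "$old_xml"
    unzip -p "$2" content.xml > "$new_xml"
    case $(diff_mode) in
        word) git diff --no-index --color-words "$old_xml" "$new_xml" ;;
        xml)  diff -u "$old_xml" "$new_xml" ;;  # stand-in for an XML-aware diff
    esac
    status=$?
    rm -f "$old_xml" "$new_xml"
    return $status
}
```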

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Git and OpenDocument (OpenOffice.org) files
  2007-08-27 10:08 ` Junio C Hamano
@ 2007-08-27 12:35   ` Matthieu Moy
  2007-08-27 13:03     ` Mike Hommey
  0 siblings, 1 reply; 12+ messages in thread
From: Matthieu Moy @ 2007-08-27 12:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> Matthieu Moy <Matthieu.Moy@imag.fr> writes:
>
>> Remarks are welcome (I'll post some remarks about Git's custom diff
>> driver in a separate thread).
>
> Good.
>
> I think creation/deletion will get /dev/null as the temporary
> file name, so as long as odt2txt knows how to deal with
> /dev/null you would not have to worry much about them.

But odt2txt doesn't know how to deal with /dev/null. A new version of
git-oodiff that handles it correctly is online and below.

> You might want to be careful about unmerged paths, though.  They
> will not get anything other than $1 (name).

I don't know how to manage this correctly, so I just display a message
"Unmerged path $1" and die.

> You would probably not care about the mode changes for oo
> documents, but they are available as $4 and $7 respectively, if
> you care.

I don't care, but the new version still manages them ;-).

All this convinces me that the ability to provide a plaintext converter
(see the other thread I started) would make it much simpler to write
this kind of thing. The mode change, for example, could be managed
automatically by git, I wouldn't need to write my own 'echo "new
mode ..."'.
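
For comparison, Git releases later than this thread offer a "textconv"
setting on diff drivers that works along these lines. A minimal sketch,
assuming odt2txt prints the converted text on stdout when given a file
name, and using "odf" as an illustrative driver name:

```
# .gitattributes
*.odt diff=odf

# .git/config (or ~/.gitconfig)
[diff "odf"]
	textconv = odt2txt
```

Git then produces the diff headers, mode lines and /dev/null handling
itself and only asks the converter for plain text.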

Thanks for the advice.

-- 
Matthieu

http://www-verimag.imag.fr/~moy/opendocument/git-oodiff

#! /bin/sh

# Script acceptable as a value for GIT_EXTERNAL_DIFF.
# For example, you can see the changes in your working tree with
# 
# $ GIT_EXTERNAL_DIFF=git-oodiff git diff

convert_to_txt ()
{
    if [ x"$1" = x"/dev/null" ]; then
        printf "" > /tmp/oodiff.$$."$2"
        eval "label$2=/dev/null"
    else
        odt2txt "$1" > /tmp/oodiff.$$."$2" 2>/dev/null
    fi
}

echo "$(basename "$0")" "$2" "$5"

if [ "$#" = "1" ]; then
    echo "Unmerged path $1"
    exit 0
fi

if   [ x"$4" = x"." ]; then
    echo "new file mode $7"
elif [ x"$7" = x"." ]; then
    echo "deleted file mode $4"
elif [ x"$4" != x"$7" ]; then
    echo "old mode $4"
    echo "new mode $7"
fi

label1="a/$1"
label2="b/$1"

if convert_to_txt "$2" "1" &&
    convert_to_txt "$5" "2" ; then
    if diff -L "$label1" -L "$label2" -u /tmp/oodiff.$$.1 /tmp/oodiff.$$.2; then
        # no text change
        if diff -q "$2" "$5"; then
            : # no change at all
        else
            echo "OpenDocument files a/$1 and b/$1 differ (same text content)"
        fi
    fi
else
    # conversion failed. Fall back to plain diff.
    diff -L "$label1" -L "$label2" -u "$2" "$5"
fi

rm -f /tmp/oodiff.$$.1 /tmp/oodiff.$$.2

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Git and OpenDocument (OpenOffice.org) files
  2007-08-27 12:35   ` Matthieu Moy
@ 2007-08-27 13:03     ` Mike Hommey
  2007-08-27 13:41       ` Johannes Schindelin
  0 siblings, 1 reply; 12+ messages in thread
From: Mike Hommey @ 2007-08-27 13:03 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: Junio C Hamano, git

On Mon, Aug 27, 2007 at 02:35:14PM +0200, Matthieu Moy <Matthieu.Moy@imag.fr> wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Matthieu Moy <Matthieu.Moy@imag.fr> writes:
> >
> >> Remarks are welcome (I'll post some remarks about Git's custom diff
> >> driver in a separate thread).
> >
> > Good.
> >
> > I think creation/deletion will get /dev/null as the temporary
> > file name, so as long as odt2txt knows how to deal with
> > /dev/null you would not have to worry much about them.
> 
> But odt2txt doesn't know how to deal with /dev/null. New version of
> git-oodiff that manages it correctly online and below.

BTW, wouldn't it be more efficient to store the odf files unzipped ?

Mike

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Git and OpenDocument (OpenOffice.org) files
  2007-08-27 13:03     ` Mike Hommey
@ 2007-08-27 13:41       ` Johannes Schindelin
  2007-08-27 13:58         ` David Kastrup
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Schindelin @ 2007-08-27 13:41 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Matthieu Moy, Junio C Hamano, git

Hi,

On Mon, 27 Aug 2007, Mike Hommey wrote:

> On Mon, Aug 27, 2007 at 02:35:14PM +0200, Matthieu Moy <Matthieu.Moy@imag.fr> wrote:
> > Junio C Hamano <gitster@pobox.com> writes:
> > 
> > > Matthieu Moy <Matthieu.Moy@imag.fr> writes:
> > >
> > >> Remarks are welcome (I'll post some remarks about Git's custom diff
> > >> driver in a separate thread).
> > >
> > > Good.
> > >
> > > I think creation/deletion will get /dev/null as the temporary
> > > file name, so as long as odt2txt knows how to deal with
> > > /dev/null you would not have to worry much about them.
> > 
> > But odt2txt doesn't know how to deal with /dev/null. New version of
> > git-oodiff that manages it correctly online and below.
> 
> BTW, wouldn't it be more efficient to store the odf files unzipped ?

Efficient how?  Speed-wise: no.  Space-wise: yes.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Git and OpenDocument (OpenOffice.org) files
  2007-08-27 13:41       ` Johannes Schindelin
@ 2007-08-27 13:58         ` David Kastrup
  2007-08-27 14:06           ` Matthieu Moy
                             ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: David Kastrup @ 2007-08-27 13:58 UTC (permalink / raw)
  To: git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Mon, 27 Aug 2007, Mike Hommey wrote:
>
>> BTW, wouldn't it be more efficient to store the odf files unzipped ?
>
> Efficient how?  Speed-wise: no.  Space-wise: yes.

Huh?  At least the "Space-wise: yes" seems rather nonsensical.
"Speed-wise" is not as clear: it depends on the relation between
memory/disk bandwidth and decompression speed.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Git and OpenDocument (OpenOffice.org) files
  2007-08-27 13:58         ` David Kastrup
@ 2007-08-27 14:06           ` Matthieu Moy
  2007-08-27 14:15             ` Johannes Schindelin
  2007-08-27 14:16           ` Mike Hommey
       [not found]           ` <20070827141600.GA11000@glandium.org>
  2 siblings, 1 reply; 12+ messages in thread
From: Matthieu Moy @ 2007-08-27 14:06 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

David Kastrup <dak@gnu.org> writes:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
>> On Mon, 27 Aug 2007, Mike Hommey wrote:
>>
>>> BTW, wouldn't it be more efficient to store the odf files unzipped ?
>>
>> Efficient how?  Speed-wise: no.  Space-wise: yes.
>
> Huh?  At least the "Space-wise: yes" seems rather nonsensical.

I don't know enough about git delta-compression and OpenDocument, but
git has better chance to efficiently delta-compress different versions
of the document if they're not compressed themselves.

(but that's a necessary and not sufficient condition. line-based
delta-compression wouldn't work if the file is a one-line XML file for
example).

> "Speed-wise" is not as clear: it depends on the relation between
> memory/disk bandwidth and decompression speed.

Probably network operations would be faster, and checkout would be
slower. I wouldn't bet ;-).

-- 
Matthieu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Git and OpenDocument (OpenOffice.org) files
  2007-08-27 14:06           ` Matthieu Moy
@ 2007-08-27 14:15             ` Johannes Schindelin
  0 siblings, 0 replies; 12+ messages in thread
From: Johannes Schindelin @ 2007-08-27 14:15 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: git

Hi,

On Mon, 27 Aug 2007, Matthieu Moy wrote:

> David Kastrup <dak@gnu.org> writes:
> 
> > Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> >
> >> On Mon, 27 Aug 2007, Mike Hommey wrote:
> >>
> >>> BTW, wouldn't it be more efficient to store the odf files unzipped ?
> >>
> >> Efficient how?  Speed-wise: no.  Space-wise: yes.
> >
> > Huh?  At least the "Space-wise: yes" seems rather nonsensical.
> 
> I don't know enough about git delta-compression and OpenDocument, but 
> git has better chance to efficiently delta-compress different versions 
> of the document if they're not compressed themselves.

A standalone zip archive (which is what an .odt file is, with a defined 
file structure) cannot be as efficient in compressing text, especially if 
it is versioned text with relatively few differences between versions, as 
delta compression.

So yes, you guessed the explanation (which I omitted) correctly.

As for the speed wise: I doubt that unpacking and then repacking can be 
more efficient than not doing it -- even if the files are transmitted via 
network.  (Remember: blobs are stored compressed, be they in a pack, or 
loose.)

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Git and OpenDocument (OpenOffice.org) files
  2007-08-27 13:58         ` David Kastrup
  2007-08-27 14:06           ` Matthieu Moy
@ 2007-08-27 14:16           ` Mike Hommey
  2007-08-27 15:16             ` Sergio Callegari
       [not found]           ` <20070827141600.GA11000@glandium.org>
  2 siblings, 1 reply; 12+ messages in thread
From: Mike Hommey @ 2007-08-27 14:16 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

On Mon, Aug 27, 2007 at 03:58:04PM +0200, David Kastrup <dak@gnu.org> wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > On Mon, 27 Aug 2007, Mike Hommey wrote:
> >
> >> BTW, wouldn't it be more efficient to store the odf files unzipped ?
> >
> > Efficient how?  Speed-wise: no.  Space-wise: yes.
> 
> Huh?  At least the "Space-wise: yes" seems rather nonsensical.

A zipped file will be 100% different at each revision.
The unzipped counterpart may be similar for 90% or more between revisions.
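
A rough way to see the effect on a single compressed stream, using gzip
as a stand-in (an .odt is a multi-member zip, so this only illustrates
what happens inside one deflated member). The file names and the helper
count_diff_bytes are made up for the demonstration.

```shell
# Count the byte positions at which two files differ.
count_diff_bytes () {
    # cmp -l prints one line per differing byte; a non-zero exit status
    # only means "the files differ", so it is ignored here.
    { cmp -l "$1" "$2" 2>/dev/null || true; } | wc -l | tr -d ' '
}

demo_dir=$(mktemp -d)

# Two "revisions" of a text that differ in exactly one byte.
i=1
while [ "$i" -le 200 ]; do
    echo "paragraph $i of the document"
    i=$((i + 1))
done > "$demo_dir/v1.txt"
sed '1s/paragraph/Paragraph/' "$demo_dir/v1.txt" > "$demo_dir/v2.txt"

# -n leaves file name and timestamp out of the gzip header, so only the
# content influences the output.
gzip -nc "$demo_dir/v1.txt" > "$demo_dir/v1.gz"
gzip -nc "$demo_dir/v2.txt" > "$demo_dir/v2.gz"

raw_diff=$(count_diff_bytes "$demo_dir/v1.txt" "$demo_dir/v2.txt")
gz_diff=$(count_diff_bytes "$demo_dir/v1.gz" "$demo_dir/v2.gz")
echo "raw: $raw_diff differing byte(s); gzip: $gz_diff differing byte(s)"

rm -rf "$demo_dir"
```

The uncompressed revisions differ in a single byte, while the compressed
ones differ in more places, which is what makes them harder to deltify.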

Mike

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Git and OpenDocument (OpenOffice.org) files
       [not found]           ` <20070827141600.GA11000@glandium.org>
@ 2007-08-27 15:05             ` David Kastrup
  0 siblings, 0 replies; 12+ messages in thread
From: David Kastrup @ 2007-08-27 15:05 UTC (permalink / raw)
  To: git

Mike Hommey <mh@glandium.org> writes:

> On Mon, Aug 27, 2007 at 03:58:04PM +0200, David Kastrup <dak@gnu.org> wrote:
>> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>> 
>> > On Mon, 27 Aug 2007, Mike Hommey wrote:
>> >
>> >> BTW, wouldn't it be more efficient to store the odf files unzipped ?
>> >
>> > Efficient how?  Speed-wise: no.  Space-wise: yes.
>> 
>> Huh?  At least the "Space-wise: yes" seems rather nonsensical.
>
> A zipped file will be 100% different at each revision.
> The unzipped counterpart may be similar for 90% or more between revisions.

Ah, right.

This applies however to gzipped files or single-file zip files, and
not zipped files in general: a zip file compresses each file
individually, so unchanged single files inside of the zip will deltify
reasonably well, as opposed to unchanged single files in a .tar.gz
file.

But that's a minor point not relevant here, and of course you are
right.  I just somehow did not register that "store the odf files" was
supposed to mean "get checked into git in numerous versions".

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Git and OpenDocument (OpenOffice.org) files
  2007-08-27 14:16           ` Mike Hommey
@ 2007-08-27 15:16             ` Sergio Callegari
  0 siblings, 0 replies; 12+ messages in thread
From: Sergio Callegari @ 2007-08-27 15:16 UTC (permalink / raw)
  To: git

Mike Hommey <mh@glandium.org> writes:


> 
> A zipped file will be 100% different at each revision.
> The unzipped counterpart may be similar for 90% or more between revisions.
> 
> Mike
> 

In my (modest) experience, not really:

in fact, odf files are a zip collection of many individual files (for instance
if you have an impress presentation, the zip collection will contain all
the images that appear in the presentation...)

Now: zip is different from .tar.gz in that tar.gz first concatenates the
files and then compresses the overall thing, while zip compresses or stores
the individual files and then concatenates and indexes the result.

The difference is that in a tar.gz file, changing a single byte in one of
the internal files can lead to a completely different compressed stream,
while in a zip file, changing an internal file only affects the relevant
part of the zipped file.

This means that:
- if you have an odf document containing lots of internal objects (e.g.
images) that do not change very much from version to version, git can make
very good deltas.
- conversely if you have an odf document whose size is dominated by proper
content, then git will not be able to make good deltas.

As an example, I am finding that impress presentations (dominated by images)
can delta very well, while calc spreadsheets (dominated by content) do not.

Probably it could be nice to make a filter that takes an odf file and 
re-zips it so that the content.xml inner file is only stored, rather
than deflated.  Then this could be used with the git file filtering
machinery.
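
A minimal sketch of such a re-zipping step, assuming Info-ZIP's zip and
unzip are installed. The function name odf_store_content is
hypothetical, and wiring it into git's filter machinery (which works on
stdin/stdout) is left out.

```shell
# Hypothetical helper: copy ODF archive $1 to $2 with content.xml (and
# the leading mimetype member) stored instead of deflated.
odf_store_content () {
    odf_src=$1
    case $2 in
        /*) odf_out=$2 ;;
        *)  odf_out=$PWD/$2 ;;
    esac
    odf_work=$(mktemp -d) || return 1
    unzip -q "$odf_src" -d "$odf_work" || { rm -rf "$odf_work"; return 1; }
    rm -f "$odf_out"
    (
        cd "$odf_work" || exit 1
        # -0 stores without compression, -X drops extra file attributes.
        # ODF expects the mimetype member first and uncompressed.
        zip -q -0 -X "$odf_out" mimetype content.xml &&
        # Remaining members (a real ODF always has some, e.g. the
        # manifest) are added with normal compression.
        zip -q -r -X "$odf_out" * -x mimetype content.xml
    )
    odf_status=$?
    rm -rf "$odf_work"
    return $odf_status
}
```

Usage: odf_store_content report.odt report-stored.odt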

Sergio

^ permalink raw reply	[flat|nested] 12+ messages in thread
