git@vger.kernel.org mailing list mirror (one of many)
From: Sam Vilain <sam@vilain.net>
To: Martin Langhoff <martin.langhoff@gmail.com>
Cc: "Dmitry Potapov" <dpotapov@gmail.com>,
	"Peter Stahlir" <peter.stahlir@googlemail.com>,
	"Karl Hasselström" <kha@treskal.com>,
	"Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
	git@vger.kernel.org
Subject: Re: Git as a filesystem
Date: Sat, 22 Sep 2007 15:09:01 +1200
Message-ID: <46F4874D.8000305@vilain.net>
In-Reply-To: <46a038f90709211656n5b23783eu330e8b655cd42aa8@mail.gmail.com>

Martin Langhoff wrote:
> On 9/22/07, Dmitry Potapov <dpotapov@gmail.com> wrote:
>   
>> used to create the original file. So, if you put any .deb file in such
>> a system, you will get back a different .deb file (with a different SHA1).
>> So, aside high CPU and memory requirements, this system cannot work in
>> principle unless all users have exactly the same version of a compressor.
>>     
>
> Was thinking the same - compression machinery, ordering of the files,
> everything. It'd be a nightmare to ensure you get back the same .deb,
> without a single different bit.
>
> Debian packaging toolchain could be reworked to use a more GIT-like
> approach - off the top of my head, at least
>
>   - signing/validating the "tree" of the package rather than the
> completed package could allow the savings in distribution you mention,
> decouple the signing from the compression, and simplify things like
> debdiff
>
>   - git or git-like strategies for source packages
>   

Nightmare indeed.  I actually wrote a proof of concept for this idea for
gzip.

http://git.catalyst.net.nz/gw?p=git.git;a=shortlog;h=archive-blobs
(see also
http://planet.catalyst.net.nz/blog/2006/07/17/samv/xteddy_caught_consuming_rampant_amounts_of_disk_space)

I usually warn people that this undertaking is "slightly insane".

My implementation was designed to be called like "git-hash-object". 
What it did was look at the input stream, and detect quickly whether it
looked like a gzip stream.  If it was, it would decompress it and then
try to compress the first few blocks using different compression
libraries and settings to determine what settings were used.  If it
could find the right settings for the first meg or so, then it would
bank on the rest being identical as well, record which compressor and
what settings were used and write the uncompressed object, as well as
the information needed to reconstruct the gzip header, to a new type of
object called an "archive" object.  If the stream could not be
reproduced then it would save the raw stream instead.  For something
like a Debian archive, it is very likely that all compressed streams
will be reproducible, because they will almost all be compressed using
the same implementation of gzip.
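
A rough illustration of the detection step described above, as a minimal sketch and not the code from the archive-blobs branch. It assumes Python's zlib module is the only candidate compressor, compares the whole stream instead of just the first meg, and uses hypothetical names (store_gzip, rebuild_gzip, the "recipe" dict) that do not come from the original implementation.

```python
import struct
import zlib


def gzip_header_length(data: bytes) -> int:
    """Length of the gzip member header, following the RFC 1952 flag bits."""
    flags = data[3]
    pos = 10                                    # fixed-size part of the header
    if flags & 0x04:                            # FEXTRA: 2-byte length + payload
        pos += 2 + struct.unpack_from("<H", data, pos)[0]
    if flags & 0x08:                            # FNAME: NUL-terminated string
        pos = data.index(b"\x00", pos) + 1
    if flags & 0x10:                            # FCOMMENT: NUL-terminated string
        pos = data.index(b"\x00", pos) + 1
    if flags & 0x02:                            # FHCRC: 2-byte header CRC
        pos += 2
    return pos


def deflate(plain: bytes, level: int, mem_level: int) -> bytes:
    """Raw deflate stream (no zlib/gzip wrapper) for the given settings."""
    c = zlib.compressobj(level, zlib.DEFLATED, -15, mem_level)
    return c.compress(plain) + c.flush()


def store_gzip(data: bytes):
    """Return ("archive", recipe, plain) if reproducible, else ("raw", None, data)."""
    if len(data) < 18 or data[:3] != b"\x1f\x8b\x08":
        return "raw", None, data                # does not look like gzip
    try:
        start = gzip_header_length(data)
        d = zlib.decompressobj(-15)
        plain = d.decompress(data[start:]) + d.flush()
    except (zlib.error, ValueError, IndexError, struct.error):
        return "raw", None, data
    if not d.eof:
        return "raw", None, data                # truncated stream
    trailer = d.unused_data                     # CRC32 + ISIZE (+ anything after)
    body = data[start:len(data) - len(trailer)]
    # Guess the settings by recompressing and comparing byte for byte.
    for level in range(1, 10):
        for mem_level in (8, 9):
            if deflate(plain, level, mem_level) == body:
                recipe = {"header": data[:start], "level": level,
                          "mem_level": mem_level, "trailer": trailer}
                return "archive", recipe, plain
    return "raw", None, data                    # could not reproduce the stream


def rebuild_gzip(recipe: dict, plain: bytes) -> bytes:
    """Recreate the original gzip bytes from the stored recipe and plaintext."""
    body = deflate(plain, recipe["level"], recipe["mem_level"])
    return recipe["header"] + body + recipe["trailer"]
```

A stream written by gzip(1) itself may well fall through to "raw" here, since this sketch only tries zlib; a fuller version would need to try each candidate compressor implementation as well.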

For tar and .ar files, this can be slightly more deterministic of
course.  It doesn't even need to be particularly savvy of what all the
fields are - just locate the files in the .tar, write out a tree, and
then write a TOC that lists tree entries and contains any extra data
(i.e. headers, padding, etc.).
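
The tar case could be sketched along these lines. This is only an illustration under stated assumptions: it handles plain ustar-style 512-byte headers with octal size fields (no base-256 sizes), treats every entry's payload as a blob, and returns the blobs as a dict instead of writing a real git tree. The names split_tar, join_tar and the TOC layout are hypothetical.

```python
import hashlib

BLOCK = 512  # tar works in 512-byte blocks


def git_blob_id(data: bytes) -> str:
    """Object id git would assign to this content as a blob."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


def split_tar(archive: bytes):
    """Split a tar image into (blobs, toc, trailer).

    blobs   maps blob id -> file content
    toc     lists, per entry, the raw header, the blob id and the padding
    trailer is whatever follows the last entry (end-of-archive zero blocks)
    """
    blobs, toc, pos = {}, [], 0
    while pos + BLOCK <= len(archive):
        header = archive[pos:pos + BLOCK]
        if header == bytes(BLOCK):
            break                                # end-of-archive marker
        size_field = header[124:136].rstrip(b"\x00 ")
        size = int(size_field.decode("ascii"), 8) if size_field else 0
        pos += BLOCK
        data = archive[pos:pos + size]
        pad_len = -size % BLOCK                  # payload is padded to a full block
        padding = archive[pos + size:pos + size + pad_len]
        pos += size + pad_len
        blob_id = git_blob_id(data)
        blobs[blob_id] = data
        # The header is kept verbatim; no need to understand its fields.
        toc.append({"header": header, "blob": blob_id, "padding": padding})
    return blobs, toc, archive[pos:]


def join_tar(blobs, toc, trailer: bytes) -> bytes:
    """Reassemble the original tar image bit for bit."""
    out = bytearray()
    for entry in toc:
        out += entry["header"] + blobs[entry["blob"]] + entry["padding"]
    return bytes(out) + trailer
```

Because headers, padding and trailer are stored verbatim, join_tar(*split_tar(image)) gives back the same bytes for inputs that fit these assumptions, while identical files across archives share one blob.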

In hindsight, making a new object type was probably a mistake.  If I
were to re-undertake this I would not go down that path, though I'd
certainly consider using tag objects for the extra data, and throwing
them in the tree like submodules.  It would also be essential in a
"real" solution to bundle reference copies of the zlib and gzip
compressors (yes, their output streams differ with longer inputs and
even some short ones).

Sam.
