Re: space compression (again)

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: David Lang <david.lang@digitalinsight.com>
To: Ray Heasman <lists@mythral.org>
Cc: git@vger.kernel.org
Subject: Re: space compression (again)
Date: Sat, 16 Apr 2005 05:29:02 -0700 (PDT)	[thread overview]
Message-ID: <Pine.LNX.4.62.0504160526530.21837@qynat.qvtvafvgr.pbz> (raw)
In-Reply-To: <1113593583.29624.46.camel@maze.mythral.org>

we alrady have the concept of objects that contain objects and therefor 
don'e need to be re-checked (directories), the chunks inside a file could 
be the same type of thing.

currently we say that if the hash on the directory is the same we don't 
need to re-check each of the files in that directory, this would be that 
if the hash on the file hasn't changed we don't need to re-check the 
chunks inside that file.

David Lang


  On Fri, 15 Apr 2005, Ray Heasman wrote:

> Date: Fri, 15 Apr 2005 12:33:03 -0700
> From: Ray Heasman <lists@mythral.org>
> To: git@vger.kernel.org
> Subject: Re: space compression (again)
> 
> For for this email not threading properly, I have been lurking on the
> mail list archives and just had to reply to this message.
>
> I was planning to ask exactly this question, and Scott beat me to to. I
> even wanted to call them "chunks" too. :-)
>
> It's probably worthwhile for anyone discussing this subject to read this
> link: http://www.cs.bell-labs.com/sys/doc/venti/venti.pdf . I know it's
> been posted before, but it really is worth reading. :-)
>
> On Fri, 15 Apr 2005, Linus Torvalds wrote:
>> On Fri, 15 Apr 2005, C. Scott Ananian wrote:
>>>
>>> Why are blobs per-file?  [After all, Linus insists that files are an
>>> illusion.]  Why not just have 'chunks', and assemble *these*
>>> into blobs (read, 'files')?  A good chunk size would fit evenly into some
>>> number of disk blocks (no wasted space!).
>>
>> I actually considered that. I ended up not doing it, because it's not
>> obvious how to "block" things up (and even more so because while I like
>> the notion, it flies in the face of the other issues I had: performance
>> and simplicity).
>
> I don't think it's as bad as you think.
>
> Let's conceptually have two types of files - Pobs (Proxy Objects, or
> Pointer Objects), and chunks. Both are stored and referenced by their
> content hash, as usual. Pobs just contain a list of hashes referencing
> the chunks in a file. When a file is initially stored, we chunk it so
> each chunk fits comfortably in a block, but otherwise we aren't too
> critical about sizes. When a file is changed (say, a single line edit),
> we update the chunk that contains that line, hash it and store it with
> its new name, and update the Pob, which we rehash and restore. If a
> chunk grows to be very large (say > 2 disk blocks), we can rechunk it
> and update the Pob to include the new chunks.
>
>> The problem with chunking is:
>>  - it complicates a lot of the routines. Things like "is this file
>>    unchanged" suddenly become "is this file still the same set of chunks",
>>    which is just a _lot_ more code and a lot more likely to have bugs.
>
> You're half right; it will be more complex, but I don't think it's as
> bad as you think. Pobs are stored by hash just like anything else. If
> some chunks are different, the pob is different, which means it has a
> different hash. It's exactly the same as dealing with changed file now.
> Sure, when you have to fetch the data, you have to read the pob and get
> a list of chunks to concatenate and return, but your example given
> doesn't change.
>
>>  - you have to find a blocking factor. I thought of just going it fixed
>>    chunks, and that just doesn't help at all.
>
> Just use the block size of the filesystem. Some filesystems do tail
> packing, so space isn't an issue, though speed can be. We don't actually
> care how big a chunk is, except to make it easy on the filesystem.
> Individual chunks can be any size.
>
>>  - we already have wasted space due to the low-level filesystem (as
>>    opposed to "git") usually being block-based, which means that space
>>    utilization for small objects tends to suck. So you really want to
>>    prefer objects that are several kB (compressed), and a small block just
>>    wastes tons of space.
>
> If a chunk is smaller than a disk block, this is true. However, if we
> size it right this is no worse than any other file. Small files (less
> than a block) can't be made any larger, so they waste space anyway.
> Large files end up wasting space in one block unless they are a perfect
> multiple of the block size.
>
> When we increase the size of a chunk, it will waste space, but we would
> have created an entire new file, so we win there too.
>
> Admittedly, Pobs will be wasting space too.
>
> On the other hand, I use ReiserFS, so I don't care. ;-)
>
>>  - there _is_ a natural blocking factor already. That's what a file
>>    boundary really is within the project, and finding any other is really
>>    quite hard.
>
> Nah. I think I've made a good case it isn't.
>
>> So I'm personally 100% sure that it's not worth it. But I'm not opposed to
>> the _concept_: it makes total sense in the "filesystem" view, and is 100%
>> equivalent to having an inode with pointers to blocks. I just don't think
>> the concept plays out well in reality.
>
> Well, the reason I think this would be worth it is that you really win
> when you have multiple parallel copies of a source tree, and changes are
> cheaper too. If you store all the chunks for all your git repositories
> in one place, and otherwise treat your trees of Pobs as the real
> repository, your copied trees only cost you space for the Pobs.
> Obviously this also applies for file updates within past revisions of a
> tree, but I don't know how much it would save. It fits beautifully into
> the current abstraction, and saves space without having to resort to
> rolling hashes or xdeltas.
>
> The _real_ reason why I am excited about git is that I have a vision of
> using this as the filesystem (in a FUSE wrapper or something) for my
> home directory. MP3s and AVIs aside, it will make actual work much
> easier for me. I have a dream; a dream where I save files using the same
> name, safe in the knowledge that I can get to any version I want. I will
> live in a world of autosaves, deletes without confirmation, and /etcs
> immune from the vagaries of my package management systems, not to
> mention users not asking me leading questions about backups. *sigh*
> *sniff* Excuse me, I think I have to go now.
>
> -Ray
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare

next prev parent reply	other threads:[~2005-04-16 12:25 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-04-15 19:33 space compression (again) Ray Heasman
2005-04-16 12:29 ` David Lang [this message]
  -- strict thread matches above, loose matches on Subject: below --
2005-04-15 17:19 C. Scott Ananian
2005-04-15 18:34 ` Linus Torvalds
2005-04-15 18:45   ` C. Scott Ananian
2005-04-15 19:00     ` Derek Fawcus
2005-04-15 19:11     ` Linus Torvalds
2005-04-16 14:39       ` Martin Uecker
2005-04-16 15:11         ` C. Scott Ananian
2005-04-16 17:37           ` Martin Uecker
2005-04-19 12:39             ` Martin Uecker
2005-04-15 18:50 ` Derek Fawcus

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.62.0504160526530.21837@qynat.qvtvafvgr.pbz \
    --to=david.lang@digitalinsight.com \
    --cc=git@vger.kernel.org \
    --cc=lists@mythral.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).