From: Linus Torvalds <torvalds@osdl.org>
To: Dan Holmsand <holmsand@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] improved delta support for git
Date: Wed, 18 May 2005 08:12:26 -0700 (PDT) [thread overview]
Message-ID: <Pine.LNX.4.58.0505180754060.18337@ppc970.osdl.org> (raw)
In-Reply-To: <d6dohe$dql$1@sea.gmane.org>
On Tue, 17 May 2005, Dan Holmsand wrote:
>
> Therefore, I tried some other approaches. This one seemed to work
> best:
>
> 1) I limit the maximum size of any delta to 10% of the size of the new
> version. That guarantees a big saving, as long as any delta is
> produced.
>
> 2) If the "previous" version of a blob is a delta, I produce the new
> delta form the old deltas base version. This works surprisingly well.
> I'm guessing the reason for this is that most changes are really
> small, and they tend to be in the same area as a previous change (as
> in "Commit new feature. Commit bugfix for new feature. Commit fix for
> bugfix of new feature. Delete new feature as it doesn't work...").
>
> 3) I use the same method for all tree objects.
Has anybody tried:
4) don't limit yourself to previous-history-objects
One of the things I liked best about the delta patches was that it is
history-neutral, and can happily delta an object against any other random
object, in the same tree, in a future tree, or in a past tree.
Even without any history at all, there should be a noticeable amount of
delta opportunities, as different architectures often end up sharing files
that are quite similar, but not exactly the same.
Now, that's a very expensive thing to do, since it changes the question of
"which object should I delta against" from O(1) to O(n) (where "n" is tyhe
total number of objects), and thus the whole deltafication from O(n) to
O(n**2), but especially together with your "max 10%" rule, you should be
able to limit your choices very effectively: if you know that your delta
should be within 10% of the total size, you can limit your "let's try that
object" search to other objects that are also within 10% of your object.
That doesn't change the basic expense factor much in theory (if sizes were
truly evenly distributed in <n> it might change it, but there's probably
only a few different "classes" of file sizes, much fewer than <n>, so it's
still probably O(n**2)), but it should cut down the work by some
noticeable constant factor, making it a hopefully realistic experiment.
So your first rule makes a global deltafication cheaper, and in fact,
together with your second rule, you might even decide to make the size
differential depend on the size of the _compressed_ object, since you
don't care about objects that have already been deltafied, and if they are
within 10% of each other, then they should also likely compress similarly,
and it should thus be pretty equivalent to just compare compressed sizes.
Again, that second optimization wouldn't change the O(n**2) nature of the
expense, but should give another nice factor of speedup, maybe making the
exercise possible in the first place.
As to the long-term "O(n**2) deltafication is not practical for big
projects with lots of history" issue, doing things incrementally should
hopefully solve that, and turn it into a series of O(n) operations at the
cost of saying "we'll never re-delta an object against the future once
we've found a delta in the past or used it as a base for a delta".
The fsck "scan all objects" code could be a good starting point.
Doing this experiment at least once should be interesting. It may turn out
that the incremental space savings aren't all that noticeable, and that
the pure history-based one already finds 90% of all savings, making the
expensive version not worth it. It would be nice to _know_, though.
Linus
next prev parent reply other threads:[~2005-05-18 15:36 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2005-05-12 3:51 [PATCH] improved delta support for git Nicolas Pitre
2005-05-12 4:36 ` Junio C Hamano
2005-05-12 14:27 ` Chris Mason
[not found] ` <2cfc403205051207467755cdf@mail.gmail.com>
2005-05-12 14:47 ` Jon Seymour
2005-05-12 15:18 ` Nicolas Pitre
2005-05-12 17:16 ` Junio C Hamano
2005-05-13 11:44 ` Chris Mason
2005-05-17 18:22 ` Thomas Glanzmann
2005-05-17 19:02 ` Thomas Glanzmann
2005-05-17 19:10 ` Thomas Glanzmann
2005-05-17 21:43 ` Dan Holmsand
2005-05-18 4:32 ` Nicolas Pitre
2005-05-18 8:54 ` Dan Holmsand
2005-05-18 18:41 ` Nicolas Pitre
2005-05-18 19:32 ` Dan Holmsand
2005-05-18 15:12 ` Linus Torvalds [this message]
2005-05-18 17:15 ` Dan Holmsand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Pine.LNX.4.58.0505180754060.18337@ppc970.osdl.org \
--to=torvalds@osdl.org \
--cc=git@vger.kernel.org \
--cc=holmsand@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).