git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Linus Torvalds <torvalds@osdl.org>
To: Dan Holmsand <holmsand@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] improved delta support for git
Date: Wed, 18 May 2005 08:12:26 -0700 (PDT)	[thread overview]
Message-ID: <Pine.LNX.4.58.0505180754060.18337@ppc970.osdl.org> (raw)
In-Reply-To: <d6dohe$dql$1@sea.gmane.org>



On Tue, 17 May 2005, Dan Holmsand wrote:
> 
> Therefore, I tried some other approaches. This one seemed to work
> best:
> 
> 1) I limit the maximum size of any delta to 10% of the size of the new
> version. That guarantees a big saving, as long as any delta is
> produced.
> 
> 2) If the "previous" version of a blob is a delta, I produce the new
> delta form the old deltas base version. This works surprisingly well.
> I'm guessing the reason for this is that most changes are really
> small, and they tend to be in the same area as a previous change (as
> in "Commit new feature. Commit bugfix for new feature. Commit fix for
> bugfix of new feature. Delete new feature as it doesn't work...").
> 
> 3) I use the same method for all tree objects.

Has anybody tried:

 4) don't limit yourself to previous-history-objects

One of the things I liked best about the delta patches was that it is
history-neutral, and can happily delta an object against any other random
object, in the same tree, in a future tree, or in a past tree.

Even without any history at all, there should be a noticeable amount of 
delta opportunities, as different architectures often end up sharing files 
that are quite similar, but not exactly the same.

Now, that's a very expensive thing to do, since it changes the question of
"which object should I delta against" from O(1) to O(n) (where "n" is tyhe
total number of objects), and thus the whole deltafication from O(n) to
O(n**2), but especially together with your "max 10%" rule, you should be
able to limit your choices very effectively: if you know that your delta
should be within 10% of the total size, you can limit your "let's try that
object" search to other objects that are also within 10% of your object.

That doesn't change the basic expense factor much in theory (if sizes were
truly evenly distributed in <n> it might change it, but there's probably
only a few different "classes" of file sizes, much fewer than <n>, so it's
still probably O(n**2)), but it should cut down the work by some
noticeable constant factor, making it a hopefully realistic experiment.

So your first rule makes a global deltafication cheaper, and in fact,
together with your second rule, you might even decide to make the size 
differential depend on the size of the _compressed_ object, since you 
don't care about objects that have already been deltafied, and if they are 
within 10% of each other, then they should also likely compress similarly, 
and it should thus be pretty equivalent to just compare compressed sizes.

Again, that second optimization wouldn't change the O(n**2) nature of the
expense, but should give another nice factor of speedup, maybe making the
exercise possible in the first place.

As to the long-term "O(n**2) deltafication is not practical for big 
projects with lots of history" issue, doing things incrementally should 
hopefully solve that, and turn it into a series of O(n) operations at the 
cost of saying "we'll never re-delta an object against the future once 
we've found a delta in the past or used it as a base for a delta".

The fsck "scan all objects" code could be a good starting point.

Doing this experiment at least once should be interesting. It may turn out
that the incremental space savings aren't all that noticeable, and that
the pure history-based one already finds 90% of all savings, making the
expensive version not worth it. It would be nice to _know_, though.

		Linus

  parent reply	other threads:[~2005-05-18 15:36 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-05-12  3:51 [PATCH] improved delta support for git Nicolas Pitre
2005-05-12  4:36 ` Junio C Hamano
2005-05-12 14:27   ` Chris Mason
     [not found]     ` <2cfc403205051207467755cdf@mail.gmail.com>
2005-05-12 14:47       ` Jon Seymour
2005-05-12 15:18         ` Nicolas Pitre
2005-05-12 17:16           ` Junio C Hamano
2005-05-13 11:44             ` Chris Mason
2005-05-17 18:22 ` Thomas Glanzmann
2005-05-17 19:02   ` Thomas Glanzmann
2005-05-17 19:10   ` Thomas Glanzmann
2005-05-17 21:43 ` Dan Holmsand
2005-05-18  4:32   ` Nicolas Pitre
2005-05-18  8:54     ` Dan Holmsand
2005-05-18 18:41       ` Nicolas Pitre
2005-05-18 19:32         ` Dan Holmsand
2005-05-18 15:12   ` Linus Torvalds [this message]
2005-05-18 17:15     ` Dan Holmsand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.58.0505180754060.18337@ppc970.osdl.org \
    --to=torvalds@osdl.org \
    --cc=git@vger.kernel.org \
    --cc=holmsand@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).