git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: David Kastrup <dak@gnu.org>
To: git@vger.kernel.org
Subject: Re: [PATCH] Keep last used delta base in the delta window
Date: Mon, 27 Aug 2007 12:07:58 +0200	[thread overview]
Message-ID: <86veb1thmp.fsf@lola.quinscape.zz> (raw)
In-Reply-To: 7v3ay5l5wq.fsf@gitster.siamese.dyndns.org

Junio C Hamano <gitster@pobox.com> writes:

> Martin Koegler <mkoegler@auto.tuwien.ac.at> writes:
>
>> Keeping the last used delta base object, if it would be dropped,
>> supports creating smaller packs with shorter delta chains.
>
> Instead of treating the "the last used one happened to be on the
> horizon -- try not to drop it" special case, I wonder if it
> makes sense to float the last used delta base object early in
> the window _after_ it is used.  Wouldn't we keep more than one
> very recently used delta base objects in the window that way?

Well, given the little amount of spare time I have for personal
projects, I should not go flaunting them too much, but anyway...

I am just drafting implementing a delta-diff size estimator: for every
hash value it does not store hash chains or pointers to memory images,
but just a bit map of at most 32 (possibly 64 where ulong delivers it,
possibly even just 16 bits when one restricts oneself to 16 element
windows in order to save memory) delta window files, telling which of
the files show such a hash value anywhere.  When a file is considered
for deltaing, it is run once against this delta-diff estimator which
does not even look at the files again.  This delivers a lower bound
for the size a delta towards each of the delta candidates can take.
The candidates are then sorted in increasing size of expected delta
size and are considered in turn.  Once a delta has a lower size than
the lowest bound for further delta candidates, no further candidates
need to get considered.

Also the calculation of a delta can get aborted once it exceeds the
size of a previous delta.  Somewhat more complex and error-prone would
be to abort based on continually correcting the estimated lower bound
from the estimator while one is creating a delta and dropping out as
soon as it becomes impossible to beat a previous delta.

The nice thing about this is that one can start out with a simplistic
estimator as long as it is guaranteed never to deliver an estimate
that is too high.  As long as that is the case, the packs will never
get worse than they are now: a bad estimator will just mean that more
elements of the window need to be considered before the conclusion is
reached.

Since the estimator data structure does not point into any
memory-mapped files or hash chains scattered through memory, it should
be comparatively light (in comparison to an actual delta) on page
thrashing which appears to me one of the main performance problems,
while delivering a reasonably useful lower bound for all deltas in the
window at once.

Its size needs to be large enough so that the bit maps will be
dominated by zeros: hash collisions not actually corresponding to
matching data lead to underestimates.  If those become too many, the
pruning of comparisons will become ineffective:

A match implies that the next 2*RABIN_SIZE-1 (31) bytes at least could
possibly be expressed as a delta, and every match after another
RABIN_SIZE implies another RABIN_SIZE bytes might get folded into the
previous delta.

So false positives seriously distort the estimate.  There is still a
good chance that a good candidate will land early in the sort order
due to such an estimate and thus other candidates will have their
deltaing cut off early, but of course it is quite more effective if
their deltaing does not even have to start.

-- 
David Kastrup

  reply	other threads:[~2007-08-27 10:08 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-08-26 20:56 [PATCH] Keep last used delta base in the delta window Martin Koegler
2007-08-27  8:48 ` Junio C Hamano
2007-08-27 10:07   ` David Kastrup [this message]
2007-08-27 20:43     ` Martin Koegler
2007-08-27 21:04       ` David Kastrup
2007-08-27 12:12   ` Junio C Hamano
2007-08-29 18:21     ` Nicolas Pitre
2007-08-29 19:26       ` Junio C Hamano
2007-08-29 21:02         ` Nicolas Pitre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=86veb1thmp.fsf@lola.quinscape.zz \
    --to=dak@gnu.org \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).