Re: [PATCH] Keep last used delta base in the delta window

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: mkoegler@auto.tuwien.ac.at (Martin Koegler)
To: David Kastrup <dak@gnu.org>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] Keep last used delta base in the delta window
Date: Mon, 27 Aug 2007 22:43:20 +0200	[thread overview]
Message-ID: <20070827204320.GA17126@auto.tuwien.ac.at> (raw)
In-Reply-To: <86veb1thmp.fsf@lola.quinscape.zz>

On Mon, Aug 27, 2007 at 12:07:58PM +0200, David Kastrup wrote:
> Well, given the little amount of spare time I have for personal
> projects, I should not go flaunting them too much, but anyway...
> 
> I am just drafting implementing a delta-diff size estimator: for every
> hash value it does not store hash chains or pointers to memory images,
> but just a bit map of at most 32 (possibly 64 where ulong delivers it,
> possibly even just 16 bits when one restricts oneself to 16 element
> windows in order to save memory) delta window files, telling which of
> the files show such a hash value anywhere. 

How do you want to select these few hash values? 

I eg. dump database tables as SQL statements in per table files and
commit all changes. All lines in each file share a common prefix
(INSERT INTO tablename (COL1, COL2, ...) VALUES). Statistics showed, that up
to 50% of the hash elements of the delta index were dropped, because
they had eqal has values.

With you apropach:

If rare used values are selected as the 32 hash values, you will
probably regard files, which contain different tables (and therefore a
different prefix), as "similar", although the files have not very much
similarity otherwise.

If you select the hash values of the prefixes, you will create for each
table one cluster (if the table count is < 32/64). This already happens
by using the hash value of the path name in the sorting.

I don't belive, that your idea will improve very much in this case.
I fear, that it will be even worser than the current algorithm.

> When a file is considered
> for deltaing, it is run once against this delta-diff estimator which
> does not even look at the files again.  This delivers a lower bound
> for the size a delta towards each of the delta candidates can take.
> The candidates are then sorted in increasing size of expected delta
> size and are considered in turn.  Once a delta has a lower size than
> the lowest bound for further delta candidates, no further candidates
> need to get considered.

So this means, that we have to hash the target entry too (which could
be merged with creating the delta index). We currently only hash an
entry, if it is really needed as source in a delta-diff.

On bigger blobs, this could slow the process down.

> Also the calculation of a delta can get aborted once it exceeds the
> size of a previous delta. 

This already happens.

> Somewhat more complex and error-prone would
> be to abort based on continually correcting the estimated lower bound
> from the estimator while one is creating a delta and dropping out as
> soon as it becomes impossible to beat a previous delta.
> 
> The nice thing about this is that one can start out with a simplistic
> estimator as long as it is guaranteed never to deliver an estimate
> that is too high.  As long as that is the case, the packs will never
> get worse than they are now: a bad estimator will just mean that more
> elements of the window need to be considered before the conclusion is
> reached.
> 
> Since the estimator data structure does not point into any
> memory-mapped files or hash chains scattered through memory, it should
> be comparatively light (in comparison to an actual delta) on page
> thrashing which appears to me one of the main performance problems,
> while delivering a reasonably useful lower bound for all deltas in the
> window at once.

Do you want to do your estimations only against a small window of objects
or all objects?

mfg Martin Kögler

next prev parent reply	other threads:[~2007-08-27 20:43 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-08-26 20:56 [PATCH] Keep last used delta base in the delta window Martin Koegler
2007-08-27  8:48 ` Junio C Hamano
2007-08-27 10:07   ` David Kastrup
2007-08-27 20:43     ` Martin Koegler [this message]
2007-08-27 21:04       ` David Kastrup
2007-08-27 12:12   ` Junio C Hamano
2007-08-29 18:21     ` Nicolas Pitre
2007-08-29 19:26       ` Junio C Hamano
2007-08-29 21:02         ` Nicolas Pitre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070827204320.GA17126@auto.tuwien.ac.at \
    --to=mkoegler@auto.tuwien.ac.at \
    --cc=dak@gnu.org \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).