git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: David Kastrup <dak@gnu.org>
To: mkoegler@auto.tuwien.ac.at (Martin Koegler)
Cc: git@vger.kernel.org
Subject: Re: [PATCH] Keep last used delta base in the delta window
Date: Mon, 27 Aug 2007 23:04:20 +0200	[thread overview]
Message-ID: <8564308zaj.fsf@lola.goethe.zz> (raw)
In-Reply-To: <20070827204320.GA17126@auto.tuwien.ac.at> (Martin Koegler's message of "Mon\, 27 Aug 2007 22\:43\:20 +0200")

mkoegler@auto.tuwien.ac.at (Martin Koegler) writes:

> On Mon, Aug 27, 2007 at 12:07:58PM +0200, David Kastrup wrote:
>> Well, given the little amount of spare time I have for personal
>> projects, I should not go flaunting them too much, but anyway...
>> 
>> I am just drafting implementing a delta-diff size estimator: for every
>> hash value it does not store hash chains or pointers to memory images,
>> but just a bit map of at most 32 (possibly 64 where ulong delivers it,
>> possibly even just 16 bits when one restricts oneself to 16 element
>> windows in order to save memory) delta window files, telling which of
>> the files show such a hash value anywhere. 
>
> How do you want to select these few hash values?

Hm?  Are you familiar with diff-delta.c?  I was not going to change
the hash value computation in there.  It is a 32bit CRC over 16byte
passages: the delta source is checksummed with a stride of 16 bytes
and the resulting CRC values are masked to a convenient number of bits
and used as index into a hash table with disambiguation to actual code
passages using linked lists.  Then the destination delta is
checksummed in the same manner but with a stride of 1, and the sums
are looked up in the hash until a matching prefix is found.

So I was going to do the same calculation, but look up a bit mask
rather than a linked list (namely calculating under the assumption
that a hash hit implies an actual match).

> I eg. dump database tables as SQL statements in per table files and
> commit all changes. All lines in each file share a common prefix
> (INSERT INTO tablename (COL1, COL2, ...) VALUES). Statistics showed,
> that up to 50% of the hash elements of the delta index were dropped,
> because they had equal hash values.

Yes, I am aware of that.  One somewhat crazy consequence of that is
that git's deltaing works better on Linux style indentation than on
GNU style indentation.

> With you apropach:
>
> If rare used values are selected as the 32 hash values, you will
> probably regard files, which contain different tables (and therefore
> a different prefix), as "similar", although the files have not very
> much similarity otherwise.

It's statistics.  Random matches will tend to level out.

> If you select the hash values of the prefixes, you will create for
> each table one cluster (if the table count is < 32/64). This already
> happens by using the hash value of the path name in the sorting.
>
> I don't belive, that your idea will improve very much in this case.
> I fear, that it will be even worser than the current algorithm.

Huh?  I am not replacing the current algorithm.  I am doing some
upfront estimates for getting a good order for doing the current
algorithm, so that I might abort parts of it early when I know they
can't improve things.  If the estimates are completely random and/or
useless, it will mean that you get a slowdown by a comparatively small
constant factor.

>> When a file is considered for deltaing, it is run once against this
>> delta-diff estimator which does not even look at the files again.
>> This delivers a lower bound for the size a delta towards each of
>> the delta candidates can take.  The candidates are then sorted in
>> increasing size of expected delta size and are considered in turn.
>> Once a delta has a lower size than the lowest bound for further
>> delta candidates, no further candidates need to get considered.
>
> So this means, that we have to hash the target entry too (which
> could be merged with creating the delta index).

Huh?  I don't see what you are getting at.

> We currently only hash an entry, if it is really needed as source in
> a delta-diff.

Why would you want to hash a non-candidate?

>> Also the calculation of a delta can get aborted once it exceeds the
>> size of a previous delta.
>
> This already happens.

Ah yes.  Overlooked this.  Presorting the candidates should be helpful
right away then.

> Do you want to do your estimations only against a small window of
> objects or all objects?

The estimates are made against the candidates in the window.  When a
candidate leaves the window, its bit in the bit masks gets reused for
the next candidate entering the window.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

  reply	other threads:[~2007-08-27 21:04 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-08-26 20:56 [PATCH] Keep last used delta base in the delta window Martin Koegler
2007-08-27  8:48 ` Junio C Hamano
2007-08-27 10:07   ` David Kastrup
2007-08-27 20:43     ` Martin Koegler
2007-08-27 21:04       ` David Kastrup [this message]
2007-08-27 12:12   ` Junio C Hamano
2007-08-29 18:21     ` Nicolas Pitre
2007-08-29 19:26       ` Junio C Hamano
2007-08-29 21:02         ` Nicolas Pitre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8564308zaj.fsf@lola.goethe.zz \
    --to=dak@gnu.org \
    --cc=git@vger.kernel.org \
    --cc=mkoegler@auto.tuwien.ac.at \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).