From: Andreas Ericsson <ae@op5.se>
To: Jon Smirl <jonsmirl@gmail.com>
Cc: "Shawn O. Pearce" <spearce@spearce.org>,
David Tweed <david.tweed@gmail.com>,
Git Mailing List <git@vger.kernel.org>
Subject: Re: Calculating tree nodes
Date: Tue, 04 Sep 2007 16:41:14 +0200 [thread overview]
Message-ID: <46DD6E8A.7000802@op5.se> (raw)
In-Reply-To: <9e4733910709040719n135a1c2dw3a2d5c470b74791a@mail.gmail.com>
Jon Smirl wrote:
> On 9/4/07, Shawn O. Pearce <spearce@spearce.org> wrote:
>> Andreas Ericsson <ae@op5.se> wrote:
>>> Jon Smirl wrote:
>>>> On 9/4/07, David Tweed <david.tweed@gmail.com> wrote:
>>>>> On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
>>>>>> Git has picked up the hierarchical storage scheme since it was built
>>>>>> on a hierarchical file system.
>> ...
>>>>> One of the nice things about tree nodes is that for doing a diff
>>>>> between versions you can, to overwhelming probability, decide
>>>>> equality/inequality of two arbitrarily deep and complicated subtrees
>>>>> by comparing 40 characters, regardless of how remote and convoluted
>>>>> their common ancestry. With delta chains don't you end up having to
>>>>> trace back to a common "entry" in the history? (Of course, I don't
>>>>> know how packs affect this - presumably there's some delta chasing to
>>>>> get to the bare objects as well.)
>>>> While it is a 40 character compare, how many disk accesses were needed
>>>> to get those two SHAs into memory?
>>> One more than there would have been to read only the commit, and one more
>>> per level of recursion, assuming you never ever pack your repository.
>>>
>>> If you *do* pack it, the tree(s) needed to compare are likely already
>>> inside the sliding packfile window. In that case, there are no extra
>>> disk accesses.
>> Even better, lets do some back of the napkin math on the Linux
>> kernel tree. My local (out of date but close enough) copy has
>> 22,730 files in the tip revision. Values shown are uncompressed
>> and compressed (gzip -9 | wc -c), but are excluding deltification.
>>
>> Current Scheme Jon's Flat Scheme
>> ----------------- -----------------
>> commit raw 932 932 + 22,730*20 = 455,532
>> (compressed) 521 456,338
>>
>> root tree raw 876 0
>> (compressed) 805 0
>
> This is not a fair comparison. The current scheme is effectively
> diffed against the previous version. You aren't showing an equivalent
> diff for the flat scheme. Both schemes are dealing with the same
> 22,000 SHAs.
>
How, with your scheme, would you solve
git diff -M master pu
in the git repo?
You'd have to build both trees completely, utilizing the last known
complete tree-listing (the root commit, since you propose to do away
with trees altogether) and then applying diffs on top of that to
finally generate an in-memory tree-structure in which you will have
to compare every single file against every single other file to find
out which code has been moved/copied/renamed/whatever.
That's (n*(n+1))/2 operations for file-level diffs alone. For the
kernels 22730 files, you're looking at 258337815 file comparisons
without the tree objects.
Sure, you can probably shave away quite a few of those comparisons
at the expense of computing the tree-hashes on the fly, but in that
case, why get rid of them in the first place?
> The size win is from diffing, not compressing.
>
It was declared in May 2006 by someone insightful that diskspace
and bandwidth are cheap, while human time is priceless.
IOW, size wins had better be proportionally huge to justify slowing
git down and thereby taking more than necessary of the users' time.
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
next prev parent reply other threads:[~2007-09-04 14:41 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-09-04 2:13 Calculating tree nodes Jon Smirl
2007-09-04 2:51 ` Shawn O. Pearce
2007-09-04 3:26 ` Jon Smirl
2007-09-04 3:40 ` Johannes Schindelin
2007-09-04 3:54 ` Jon Smirl
2007-09-04 4:21 ` Martin Langhoff
2007-09-04 5:37 ` Jon Smirl
2007-09-04 5:51 ` Andreas Ericsson
2007-09-04 10:33 ` Johannes Schindelin
2007-09-04 14:31 ` Jon Smirl
2007-09-04 15:05 ` Johannes Schindelin
2007-09-04 15:14 ` Andreas Ericsson
2007-09-04 21:02 ` Martin Langhoff
2007-09-04 4:28 ` Junio C Hamano
2007-09-04 5:50 ` Jon Smirl
2007-09-04 4:19 ` David Tweed
2007-09-04 5:52 ` Jon Smirl
2007-09-04 5:55 ` Andreas Ericsson
2007-09-04 6:16 ` Shawn O. Pearce
2007-09-04 14:19 ` Jon Smirl
2007-09-04 14:41 ` Andreas Ericsson [this message]
2007-09-04 6:16 ` David Tweed
2007-09-04 6:26 ` Shawn O. Pearce
2007-09-04 17:39 ` Junio C Hamano
2007-09-06 3:20 ` Shawn O. Pearce
2007-09-06 5:21 ` Junio C Hamano
2007-09-04 16:20 ` Daniel Hulme
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=46DD6E8A.7000802@op5.se \
--to=ae@op5.se \
--cc=david.tweed@gmail.com \
--cc=git@vger.kernel.org \
--cc=jonsmirl@gmail.com \
--cc=spearce@spearce.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).