git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Andreas Ericsson <ae@op5.se>
To: Jon Smirl <jonsmirl@gmail.com>
Cc: "Shawn O. Pearce" <spearce@spearce.org>,
	David Tweed <david.tweed@gmail.com>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: Calculating tree nodes
Date: Tue, 04 Sep 2007 16:41:14 +0200	[thread overview]
Message-ID: <46DD6E8A.7000802@op5.se> (raw)
In-Reply-To: <9e4733910709040719n135a1c2dw3a2d5c470b74791a@mail.gmail.com>

Jon Smirl wrote:
> On 9/4/07, Shawn O. Pearce <spearce@spearce.org> wrote:
>> Andreas Ericsson <ae@op5.se> wrote:
>>> Jon Smirl wrote:
>>>> On 9/4/07, David Tweed <david.tweed@gmail.com> wrote:
>>>>> On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
>>>>>> Git has picked up the hierarchical storage scheme since it was built
>>>>>> on a hierarchical file system.
>> ...
>>>>> One of the nice things about tree nodes is that for doing a diff
>>>>> between versions you can, to overwhelming probability, decide
>>>>> equality/inequality of two arbitrarily deep and complicated subtrees
>>>>> by comparing 40 characters, regardless of how remote and convoluted
>>>>> their common ancestry. With delta chains don't you end up having to
>>>>> trace back to a common "entry" in the history? (Of course, I don't
>>>>> know how packs affect this - presumably there's some delta chasing to
>>>>> get to the bare objects as well.)
>>>> While it is a 40 character compare, how many disk accesses were needed
>>>> to get those two SHAs into memory?
>>> One more than there would have been to read only the commit, and one more
>>> per level of recursion, assuming you never ever pack your repository.
>>>
>>> If you *do* pack it, the tree(s) needed to compare are likely already
>>> inside the sliding packfile window. In that case, there are no extra
>>> disk accesses.
>> Even better, lets do some back of the napkin math on the Linux
>> kernel tree.  My local (out of date but close enough) copy has
>> 22,730 files in the tip revision.  Values shown are uncompressed
>> and compressed (gzip -9 | wc -c), but are excluding deltification.
>>
>>                  Current Scheme       Jon's Flat Scheme
>>                  -----------------    -----------------
>> commit raw       932                  932 + 22,730*20 = 455,532
>> (compressed)     521                  456,338
>>
>> root tree raw    876                  0
>> (compressed)     805                  0
> 
> This is not a fair comparison. The current scheme is effectively
> diffed against the previous version. You aren't showing an equivalent
> diff for the flat scheme. Both schemes are dealing with the same
> 22,000 SHAs.
> 

How, with your scheme, would you solve

	git diff -M master pu

in the git repo?

You'd have to build both trees completely, utilizing the last known
complete tree-listing (the root commit, since you propose to do away
with trees altogether) and then applying diffs on top of that to
finally generate an in-memory tree-structure in which you will have
to compare every single file against every single other file to find
out which code has been moved/copied/renamed/whatever.

That's (n*(n+1))/2 operations for file-level diffs alone. For the
kernels 22730 files, you're looking at 258337815 file comparisons
without the tree objects.

Sure, you can probably shave away quite a few of those comparisons
at the expense of computing the tree-hashes on the fly, but in that
case, why get rid of them in the first place?

> The size win is from diffing, not compressing.
> 

It was declared in May 2006 by someone insightful that diskspace
and bandwidth are cheap, while human time is priceless.

IOW, size wins had better be proportionally huge to justify slowing
git down and thereby taking more than necessary of the users' time.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

  reply	other threads:[~2007-09-04 14:41 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-09-04  2:13 Calculating tree nodes Jon Smirl
2007-09-04  2:51 ` Shawn O. Pearce
2007-09-04  3:26   ` Jon Smirl
2007-09-04  3:40     ` Johannes Schindelin
2007-09-04  3:54       ` Jon Smirl
2007-09-04  4:21         ` Martin Langhoff
2007-09-04  5:37           ` Jon Smirl
2007-09-04  5:51             ` Andreas Ericsson
2007-09-04 10:33             ` Johannes Schindelin
2007-09-04 14:31               ` Jon Smirl
2007-09-04 15:05                 ` Johannes Schindelin
2007-09-04 15:14                 ` Andreas Ericsson
2007-09-04 21:02                   ` Martin Langhoff
2007-09-04  4:28         ` Junio C Hamano
2007-09-04  5:50           ` Jon Smirl
2007-09-04  4:19     ` David Tweed
2007-09-04  5:52       ` Jon Smirl
2007-09-04  5:55         ` Andreas Ericsson
2007-09-04  6:16           ` Shawn O. Pearce
2007-09-04 14:19             ` Jon Smirl
2007-09-04 14:41               ` Andreas Ericsson [this message]
2007-09-04  6:16         ` David Tweed
2007-09-04  6:26     ` Shawn O. Pearce
2007-09-04 17:39       ` Junio C Hamano
2007-09-06  3:20         ` Shawn O. Pearce
2007-09-06  5:21           ` Junio C Hamano
2007-09-04 16:20     ` Daniel Hulme

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=46DD6E8A.7000802@op5.se \
    --to=ae@op5.se \
    --cc=david.tweed@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=jonsmirl@gmail.com \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).