git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Shawn O. Pearce" <spearce@spearce.org>
To: Andreas Ericsson <ae@op5.se>
Cc: Jon Smirl <jonsmirl@gmail.com>,
	David Tweed <david.tweed@gmail.com>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: Calculating tree nodes
Date: Tue, 4 Sep 2007 02:16:11 -0400	[thread overview]
Message-ID: <20070904061611.GY18160@spearce.org> (raw)
In-Reply-To: <46DCF361.2090402@op5.se>

Andreas Ericsson <ae@op5.se> wrote:
> Jon Smirl wrote:
> >On 9/4/07, David Tweed <david.tweed@gmail.com> wrote:
> >>On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
> >>>Git has picked up the hierarchical storage scheme since it was built
> >>>on a hierarchical file system.
...
> >>One of the nice things about tree nodes is that for doing a diff
> >>between versions you can, to overwhelming probability, decide
> >>equality/inequality of two arbitrarily deep and complicated subtrees
> >>by comparing 40 characters, regardless of how remote and convoluted
> >>their common ancestry. With delta chains don't you end up having to
> >>trace back to a common "entry" in the history? (Of course, I don't
> >>know how packs affect this - presumably there's some delta chasing to
> >>get to the bare objects as well.)
> >
> >While it is a 40 character compare, how many disk accesses were needed
> >to get those two SHAs into memory?
> 
> One more than there would have been to read only the commit, and one more
> per level of recursion, assuming you never ever pack your repository.
> 
> If you *do* pack it, the tree(s) needed to compare are likely already
> inside the sliding packfile window. In that case, there are no extra
> disk accesses.

Even better, lets do some back of the napkin math on the Linux
kernel tree.  My local (out of date but close enough) copy has
22,730 files in the tip revision.  Values shown are uncompressed
and compressed (gzip -9 | wc -c), but are excluding deltification.

                 Current Scheme       Jon's Flat Scheme
                 -----------------    -----------------
commit raw       932                  932 + 22,730*20 = 455,532
(compressed)     521                  456,338

root tree raw    876                  0
(compressed)     805                  0

I'm not even bothering with the individual subtrees.  The numbers
will fall off quickly when you start to do subtree elimination and
only load the levels you need.

You are talking about doing disk IO for less than 4KiB with
the current scheme, and almost 456 KiB for the flat scheme.
That's before deltification.  So if you also assume deltification
its going to be higher as you need to read back to a base object
that is roughly the final size and then unpack the smaller deltas
to reach the real commit.

Remember, SHA-1s can be stored as 20 bytes of binary data but they
are also generally uncompressible.  That's why the root tree does
not compress very well, the SHA-1 data inside the tree cannot be
compressed and only the filenames have any shot at being compressed.

-- 
Shawn.

  reply	other threads:[~2007-09-04  6:16 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-09-04  2:13 Calculating tree nodes Jon Smirl
2007-09-04  2:51 ` Shawn O. Pearce
2007-09-04  3:26   ` Jon Smirl
2007-09-04  3:40     ` Johannes Schindelin
2007-09-04  3:54       ` Jon Smirl
2007-09-04  4:21         ` Martin Langhoff
2007-09-04  5:37           ` Jon Smirl
2007-09-04  5:51             ` Andreas Ericsson
2007-09-04 10:33             ` Johannes Schindelin
2007-09-04 14:31               ` Jon Smirl
2007-09-04 15:05                 ` Johannes Schindelin
2007-09-04 15:14                 ` Andreas Ericsson
2007-09-04 21:02                   ` Martin Langhoff
2007-09-04  4:28         ` Junio C Hamano
2007-09-04  5:50           ` Jon Smirl
2007-09-04  4:19     ` David Tweed
2007-09-04  5:52       ` Jon Smirl
2007-09-04  5:55         ` Andreas Ericsson
2007-09-04  6:16           ` Shawn O. Pearce [this message]
2007-09-04 14:19             ` Jon Smirl
2007-09-04 14:41               ` Andreas Ericsson
2007-09-04  6:16         ` David Tweed
2007-09-04  6:26     ` Shawn O. Pearce
2007-09-04 17:39       ` Junio C Hamano
2007-09-06  3:20         ` Shawn O. Pearce
2007-09-06  5:21           ` Junio C Hamano
2007-09-04 16:20     ` Daniel Hulme

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070904061611.GY18160@spearce.org \
    --to=spearce@spearce.org \
    --cc=ae@op5.se \
    --cc=david.tweed@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=jonsmirl@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).