git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Jon Smirl" <jonsmirl@gmail.com>
To: "Junio C Hamano" <gitster@pobox.com>
Cc: "Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
	"Shawn O. Pearce" <spearce@spearce.org>,
	"Git Mailing List" <git@vger.kernel.org>
Subject: Re: Calculating tree nodes
Date: Tue, 4 Sep 2007 01:50:21 -0400	[thread overview]
Message-ID: <9e4733910709032250r1198379cmafd4e14fd330513b@mail.gmail.com> (raw)
In-Reply-To: <7vmyw3835y.fsf@gitster.siamese.dyndns.org>

On 9/4/07, Junio C Hamano <gitster@pobox.com> wrote:
> "Jon Smirl" <jonsmirl@gmail.com> writes:
>
> >> Yes.  For performance reasons, since a simple commit would kill you in any
> >> reasonably sized repo.
> >
> > That's not an obvious conclusion. A new commit is just a series of
> > edits to the previous commit. Start with the previous commit, edit it,
> > delta it and store it. Storing of the file objects is the same. Why
> > isn't this scheme fast than the current one?
>
> I think you seem to be forgetting about tree comparison.
>
> With a large project that has a reasonable directory structure
> (i.e. not insanely narrow), a commit touches isolated subparts
> of the whole tree.  Think of an architecture specific patch to
> the Linux kernel touching only include/asm-i386 and arch/i386
> directories.
>
> Being able to cull an entire subdirectory (e.g. drivers/ which
> has 5700 files underneath) by only looking at the tree SHA-1 of
> the containing tree is a _HUGE_ win.

In my scheme you have all of the SHAs for the commit in RAM because
the are contained in the commit and you have the commit in RAM. It
take microseconds to compare these two lists in RAM.

The current scheme is doing disk accesses to get those tree nodes so
of course it is a win to cull the 5700 files.

>
> And this is not just about two tree comparison.  When you say:
>
>         git log v2.6.20 -- arch/i386/
>
> what you are seeing is a simplified history that consists of
> commits that touch only these paths.  How would we determine if
> a commit touch these paths efficiently?  By comparing the "i386"
> entry in tree objects for $commit^:arch and $commit:arch.  You
> do not have to look inside arch/i386/ trees to see if any of the
> 330 files in it is different.  You just check a single SHA-1
> pair.

It's more than just comparing a SHA, you have to do disk accesses to
retrieve the SHA.

I'm proposing that we only really need commit and file objects. I also
mentioned that if you think of the file objects as a table you could
use triggers to build cached indexes. To get performance back to the
current level we may want to construct some of these indexes. We need
to explore the scheme more before we can figure out the best cached
indexes to build.

Right now we only have a single index type, the tree nodes. And it's a
permanent part of the storage not cached. A hierarchical index is not
very useful of indexing non file name attributes.

-- 
Jon Smirl
jonsmirl@gmail.com

  reply	other threads:[~2007-09-04  5:50 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-09-04  2:13 Calculating tree nodes Jon Smirl
2007-09-04  2:51 ` Shawn O. Pearce
2007-09-04  3:26   ` Jon Smirl
2007-09-04  3:40     ` Johannes Schindelin
2007-09-04  3:54       ` Jon Smirl
2007-09-04  4:21         ` Martin Langhoff
2007-09-04  5:37           ` Jon Smirl
2007-09-04  5:51             ` Andreas Ericsson
2007-09-04 10:33             ` Johannes Schindelin
2007-09-04 14:31               ` Jon Smirl
2007-09-04 15:05                 ` Johannes Schindelin
2007-09-04 15:14                 ` Andreas Ericsson
2007-09-04 21:02                   ` Martin Langhoff
2007-09-04  4:28         ` Junio C Hamano
2007-09-04  5:50           ` Jon Smirl [this message]
2007-09-04  4:19     ` David Tweed
2007-09-04  5:52       ` Jon Smirl
2007-09-04  5:55         ` Andreas Ericsson
2007-09-04  6:16           ` Shawn O. Pearce
2007-09-04 14:19             ` Jon Smirl
2007-09-04 14:41               ` Andreas Ericsson
2007-09-04  6:16         ` David Tweed
2007-09-04  6:26     ` Shawn O. Pearce
2007-09-04 17:39       ` Junio C Hamano
2007-09-06  3:20         ` Shawn O. Pearce
2007-09-06  5:21           ` Junio C Hamano
2007-09-04 16:20     ` Daniel Hulme

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9e4733910709032250r1198379cmafd4e14fd330513b@mail.gmail.com \
    --to=jonsmirl@gmail.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).