Calculating tree nodes

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Calculating tree nodes
@ 2007-09-04  2:13 Jon Smirl
  2007-09-04  2:51 ` Shawn O. Pearce
  0 siblings, 1 reply; 27+ messages in thread
From: Jon Smirl @ 2007-09-04  2:13 UTC (permalink / raw)
  To: Git Mailing List

When I change a file it creates a new object with a new SHA. This new
SHA causes the tree node pointing to it to change. Changing the tree
node forces its parent to change and so on. Of course git batches all
of the changes together into a commit so that this ripple effect
doesn't happen for every file. But every commit causes a new root tree
node to be created, right?

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  2:13 Calculating tree nodes Jon Smirl
@ 2007-09-04  2:51 ` Shawn O. Pearce
  2007-09-04  3:26   ` Jon Smirl
  0 siblings, 1 reply; 27+ messages in thread
From: Shawn O. Pearce @ 2007-09-04  2:51 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Git Mailing List

Jon Smirl <jonsmirl@gmail.com> wrote:
> When I change a file it creates a new object with a new SHA. This new
> SHA causes the tree node pointing to it to change. Changing the tree
> node forces its parent to change and so on. Of course git batches all
> of the changes together into a commit so that this ripple effect
> doesn't happen for every file. But every commit causes a new root tree
> node to be created, right?

Only if at least one file (or tree) differed.  This may not be the
case if you do a merge with the ours merge strategy, but these are
very rare.  So you can pretty much just say that yes, every commit
causes a new root tree to be created.

Usually the smallest number of objects created per commit is 3:

  - the new commit
  - the new root tree
  - the new blob for a file in the root directory

The number increases as more files are modified or if they are in
subdirectories of the root.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  2:51 ` Shawn O. Pearce
@ 2007-09-04  3:26   ` Jon Smirl
  2007-09-04  3:40     ` Johannes Schindelin
                       ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Jon Smirl @ 2007-09-04  3:26 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Git Mailing List

Are tree objects really needed?

1) Make the path an attribute of the file object.
2) Commits are simply a list of all the objects that make up the commit.
Sort the SHAs in the commit and delta them.

This is something that has always bugged me about file systems. File
systems force hierarchical naming due to their directory structure.
There is no reason they have to work that way. Google is an example of
a giant file system that works just fine without hierarchical
directories. The full path should be just another attribute on the
file. If you want a hierarchical index into the file system you can
generate it by walking the files or using triggers. But you could also
delete the hierarchical directory and replace it with something else
like a full text index. Directories would become a computationally
generated cache, not a critical part of the file system. But this is a
git list so I shouldn't go too far off into file system design.

Git has picked up the hierarchical storage scheme since it was built
on a hierarchical file system. I don't this this is necessarily a good
thing moving forward.

If we really need tree objects they could become a new class of
computationally generated objects that could be deleted out of the
database at any time and recreated. For example if you think of the
file objects as being in a table, inserting a new row into this table
would compute new tree objects (an index).

Index is the key here, we may want other kinds of indexes in the
future. It was the mail about auto-generating the Maintainers list
that caused me to think about this. If file objects are a table with
triggers, building a hierarchical index for the Maintainers field
doesn't make sense.

These are just some initial thoughts on a different way to view the
data git is storing. Thinking about it as a database with fields and
indexes built via triggers may change the way we want to structure
things.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  3:26   ` Jon Smirl
@ 2007-09-04  3:40     ` Johannes Schindelin
  2007-09-04  3:54       ` Jon Smirl
  2007-09-04  4:19     ` David Tweed
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Johannes Schindelin @ 2007-09-04  3:40 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Shawn O. Pearce, Git Mailing List

Hi,

On Mon, 3 Sep 2007, Jon Smirl wrote:

> Are tree objects really needed?

Yes.  For performance reasons, since a simple commit would kill you in any 
reasonably sized repo.

Hth,
Dscho

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  3:40     ` Johannes Schindelin
@ 2007-09-04  3:54       ` Jon Smirl
  2007-09-04  4:21         ` Martin Langhoff
  2007-09-04  4:28         ` Junio C Hamano
  0 siblings, 2 replies; 27+ messages in thread
From: Jon Smirl @ 2007-09-04  3:54 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Shawn O. Pearce, Git Mailing List

On 9/3/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Mon, 3 Sep 2007, Jon Smirl wrote:
>
> > Are tree objects really needed?
>
> Yes.  For performance reasons, since a simple commit would kill you in any
> reasonably sized repo.

That's not an obvious conclusion. A new commit is just a series of
edits to the previous commit. Start with the previous commit, edit it,
delta it and store it. Storing of the file objects is the same. Why
isn't this scheme fast than the current one?


>
> Hth,
> Dscho
>


-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  3:26   ` Jon Smirl
  2007-09-04  3:40     ` Johannes Schindelin
@ 2007-09-04  4:19     ` David Tweed
  2007-09-04  5:52       ` Jon Smirl
  2007-09-04  6:26     ` Shawn O. Pearce
  2007-09-04 16:20     ` Daniel Hulme
  3 siblings, 1 reply; 27+ messages in thread
From: David Tweed @ 2007-09-04  4:19 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Shawn O. Pearce, Git Mailing List

On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
> Git has picked up the hierarchical storage scheme since it was built
> on a hierarchical file system.

FWIW my memory is that initial git used path-to-blob lists (as you're
describing but without delta-ing) and tree nodes were added after a
couple of weeks, the motivation _at the time_ being they were a
natural way to dramatically reduce the size of repos.

One of the nice things about tree nodes is that for doing a diff
between versions you can, to overwhelming probability, decide
equality/inequality of two arbitrarily deep and complicated subtrees
by comparing 40 characters, regardless of how remote and convoluted
their common ancestry. With delta chains don't you end up having to
trace back to a common "entry" in the history? (Of course, I don't
know how packs affect this - presumably there's some delta chasing to
get to the bare objects as well.)

-- 
cheers, dave tweed__________________________
david.tweed@gmail.com
Rm 124, School of Systems Engineering, University of Reading.
"we had no idea that when we added templates we were adding a Turing-
complete compile-time language." -- C++ standardisation committee

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  3:54       ` Jon Smirl
@ 2007-09-04  4:21         ` Martin Langhoff
  2007-09-04  5:37           ` Jon Smirl
  2007-09-04  4:28         ` Junio C Hamano
  1 sibling, 1 reply; 27+ messages in thread
From: Martin Langhoff @ 2007-09-04  4:21 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Johannes Schindelin, Shawn O. Pearce, Git Mailing List

On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
> > Yes.  For performance reasons, since a simple commit would kill you in any
> > reasonably sized repo.
>
> That's not an obvious conclusion. A new commit is just a series of

Hi Jon!

If you search the archives you'll find Linus explaining that the
initial git had all the directory structure in one single "tree"
object that held all the paths, not matter how deep. The problem with
that was taht every commit generated a huge new tree object, so he
switched to the current "nested trees" structure, which also has the
nice feature of speeding up diffs/merges if whole subtrees haven't
changed.

> edits to the previous commit. Start with the previous commit, edit it,
> delta it and store it. Storing of the file objects is the same. Why
> isn't this scheme fast than the current one?

I think you're a bit confused between 2 different things:

 - git is _snapshot_ based, so every commit-tree-blob set is
completely independent. The "canonical" storage is each of those
gzipped in .git/objects
 - however, for performance and on-disk-footprint, we delta them (very
efficiently I hear)

So if you ask the GIT APIs about a tree, you end up dealing with the
nested trees I describe. Similarly, if you ask for a blob, you get the
blob. But internally git _is_ delta-compressing them.

It's not compressing them immediately -- only when you run git gc. But
from an API perspective, you don't have to worry about that.

HTH

martin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  3:54       ` Jon Smirl
  2007-09-04  4:21         ` Martin Langhoff
@ 2007-09-04  4:28         ` Junio C Hamano
  2007-09-04  5:50           ` Jon Smirl
  1 sibling, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2007-09-04  4:28 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Johannes Schindelin, Shawn O. Pearce, Git Mailing List

"Jon Smirl" <jonsmirl@gmail.com> writes:

>> Yes.  For performance reasons, since a simple commit would kill you in any
>> reasonably sized repo.
>
> That's not an obvious conclusion. A new commit is just a series of
> edits to the previous commit. Start with the previous commit, edit it,
> delta it and store it. Storing of the file objects is the same. Why
> isn't this scheme fast than the current one?

I think you seem to be forgetting about tree comparison.

With a large project that has a reasonable directory structure
(i.e. not insanely narrow), a commit touches isolated subparts
of the whole tree.  Think of an architecture specific patch to
the Linux kernel touching only include/asm-i386 and arch/i386
directories.

Being able to cull an entire subdirectory (e.g. drivers/ which
has 5700 files underneath) by only looking at the tree SHA-1 of
the containing tree is a _HUGE_ win.

And this is not just about two tree comparison.  When you say:

	git log v2.6.20 -- arch/i386/

what you are seeing is a simplified history that consists of
commits that touch only these paths.  How would we determine if
a commit touch these paths efficiently?  By comparing the "i386"
entry in tree objects for $commit^:arch and $commit:arch.  You
do not have to look inside arch/i386/ trees to see if any of the
330 files in it is different.  You just check a single SHA-1
pair.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  4:21         ` Martin Langhoff
@ 2007-09-04  5:37           ` Jon Smirl
  2007-09-04  5:51             ` Andreas Ericsson
  2007-09-04 10:33             ` Johannes Schindelin
  0 siblings, 2 replies; 27+ messages in thread
From: Jon Smirl @ 2007-09-04  5:37 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Johannes Schindelin, Shawn O. Pearce, Git Mailing List

On 9/4/07, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
> > > Yes.  For performance reasons, since a simple commit would kill you in any
> > > reasonably sized repo.
> >
> > That's not an obvious conclusion. A new commit is just a series of
>
> Hi Jon!
>
> If you search the archives you'll find Linus explaining that the
> initial git had all the directory structure in one single "tree"
> object that held all the paths, not matter how deep. The problem with
> that was taht every commit generated a huge new tree object, so he
> switched to the current "nested trees" structure, which also has the
> nice feature of speeding up diffs/merges if whole subtrees haven't
> changed.

In my scheme the commit is only a list of SHA's. The paths are stored
as attributes of the file objects. Commits are just edits to the list
of SHA's in the commit objects. If these lists are kept sorted, then
the delta should be tiny. Just the info on the adds/deletes to the
list.

This is very different that a single tree blob that contains all of the paths.

Diffing two trees in the scheme is quite fast. Just get their commit
objects into RAM and compare the lists of SHAs.

> > edits to the previous commit. Start with the previous commit, edit it,
> > delta it and store it. Storing of the file objects is the same. Why
> > isn't this scheme fast than the current one?
>
> I think you're a bit confused between 2 different things:
>
>  - git is _snapshot_ based, so every commit-tree-blob set is
> completely independent. The "canonical" storage is each of those
> gzipped in .git/objects
>  - however, for performance and on-disk-footprint, we delta them (very
> efficiently I hear)

The systems are essential the same with a little reorganization. In
the current system the paths and SHA for a commit are spread over the
tree nodes.

In my scheme the path info is moved into the file object nodes and the
SHA list is in the commit node.

git still works exactly as it has before. I just moved things around
in the storage system. The only thing that should be impacted is
performance.


>
> So if you ask the GIT APIs about a tree, you end up dealing with the
> nested trees I describe. Similarly, if you ask for a blob, you get the
> blob. But internally git _is_ delta-compressing them.
>
> It's not compressing them immediately -- only when you run git gc. But
> from an API perspective, you don't have to worry about that.
>
> HTH
>
>
> martin
>


-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  4:28         ` Junio C Hamano
@ 2007-09-04  5:50           ` Jon Smirl
  0 siblings, 0 replies; 27+ messages in thread
From: Jon Smirl @ 2007-09-04  5:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, Shawn O. Pearce, Git Mailing List

On 9/4/07, Junio C Hamano <gitster@pobox.com> wrote:
> "Jon Smirl" <jonsmirl@gmail.com> writes:
>
> >> Yes.  For performance reasons, since a simple commit would kill you in any
> >> reasonably sized repo.
> >
> > That's not an obvious conclusion. A new commit is just a series of
> > edits to the previous commit. Start with the previous commit, edit it,
> > delta it and store it. Storing of the file objects is the same. Why
> > isn't this scheme fast than the current one?
>
> I think you seem to be forgetting about tree comparison.
>
> With a large project that has a reasonable directory structure
> (i.e. not insanely narrow), a commit touches isolated subparts
> of the whole tree.  Think of an architecture specific patch to
> the Linux kernel touching only include/asm-i386 and arch/i386
> directories.
>
> Being able to cull an entire subdirectory (e.g. drivers/ which
> has 5700 files underneath) by only looking at the tree SHA-1 of
> the containing tree is a _HUGE_ win.

In my scheme you have all of the SHAs for the commit in RAM because
the are contained in the commit and you have the commit in RAM. It
take microseconds to compare these two lists in RAM.

The current scheme is doing disk accesses to get those tree nodes so
of course it is a win to cull the 5700 files.

>
> And this is not just about two tree comparison.  When you say:
>
>         git log v2.6.20 -- arch/i386/
>
> what you are seeing is a simplified history that consists of
> commits that touch only these paths.  How would we determine if
> a commit touch these paths efficiently?  By comparing the "i386"
> entry in tree objects for $commit^:arch and $commit:arch.  You
> do not have to look inside arch/i386/ trees to see if any of the
> 330 files in it is different.  You just check a single SHA-1
> pair.

It's more than just comparing a SHA, you have to do disk accesses to
retrieve the SHA.

I'm proposing that we only really need commit and file objects. I also
mentioned that if you think of the file objects as a table you could
use triggers to build cached indexes. To get performance back to the
current level we may want to construct some of these indexes. We need
to explore the scheme more before we can figure out the best cached
indexes to build.

Right now we only have a single index type, the tree nodes. And it's a
permanent part of the storage not cached. A hierarchical index is not
very useful of indexing non file name attributes.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  5:37           ` Jon Smirl
@ 2007-09-04  5:51             ` Andreas Ericsson
  2007-09-04 10:33             ` Johannes Schindelin
  1 sibling, 0 replies; 27+ messages in thread
From: Andreas Ericsson @ 2007-09-04  5:51 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Martin Langhoff, Johannes Schindelin, Shawn O. Pearce,
	Git Mailing List

Jon Smirl wrote:
> On 9/4/07, Martin Langhoff <martin.langhoff@gmail.com> wrote:
>> On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
>>>> Yes.  For performance reasons, since a simple commit would kill you in any
>>>> reasonably sized repo.
>>> That's not an obvious conclusion. A new commit is just a series of
>> Hi Jon!
>>
>> If you search the archives you'll find Linus explaining that the
>> initial git had all the directory structure in one single "tree"
>> object that held all the paths, not matter how deep. The problem with
>> that was taht every commit generated a huge new tree object, so he
>> switched to the current "nested trees" structure, which also has the
>> nice feature of speeding up diffs/merges if whole subtrees haven't
>> changed.
> 
> In my scheme the commit is only a list of SHA's. The paths are stored
> as attributes of the file objects. Commits are just edits to the list
> of SHA's in the commit objects. If these lists are kept sorted, then
> the delta should be tiny. Just the info on the adds/deletes to the
> list.
> 

It will stop being fast when you need to apply (revisions*avg_num_files_changed)
patches before you can start diffing things properly.

> This is very different that a single tree blob that contains all of the paths.
> 
> Diffing two trees in the scheme is quite fast. Just get their commit
> objects into RAM and compare the lists of SHAs.
> 

That's not a very useful diff though. I'd run, screaming, from an SCM that didn't
tell me *how* things have changed in addition to *what*.

>>> edits to the previous commit. Start with the previous commit, edit it,
>>> delta it and store it. Storing of the file objects is the same. Why
>>> isn't this scheme fast than the current one?
>> I think you're a bit confused between 2 different things:
>>
>>  - git is _snapshot_ based, so every commit-tree-blob set is
>> completely independent. The "canonical" storage is each of those
>> gzipped in .git/objects
>>  - however, for performance and on-disk-footprint, we delta them (very
>> efficiently I hear)
> 
> The systems are essential the same with a little reorganization. In
> the current system the paths and SHA for a commit are spread over the
> tree nodes.
> 
> In my scheme the path info is moved into the file object nodes and the
> SHA list is in the commit node.
> 
> git still works exactly as it has before. I just moved things around
> in the storage system. The only thing that should be impacted is
> performance.
> 

Perhaps, but negatively so. Git is fast when applying patches, primarily
because it can exclude entire subtrees. It knows it can exclude those
subtrees because their SHA1 hashes are identical. It wouldn't know that
if there weren't the tree objects (well, it could, but walking all the
commits, counting changes and considering '0' to be "no changes" doesn't
scale, so that's a moot point).

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  4:19     ` David Tweed
@ 2007-09-04  5:52       ` Jon Smirl
  2007-09-04  5:55         ` Andreas Ericsson
  2007-09-04  6:16         ` David Tweed
  0 siblings, 2 replies; 27+ messages in thread
From: Jon Smirl @ 2007-09-04  5:52 UTC (permalink / raw)
  To: David Tweed; +Cc: Shawn O. Pearce, Git Mailing List

On 9/4/07, David Tweed <david.tweed@gmail.com> wrote:
> On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
> > Git has picked up the hierarchical storage scheme since it was built
> > on a hierarchical file system.
>
> FWIW my memory is that initial git used path-to-blob lists (as you're
> describing but without delta-ing) and tree nodes were added after a
> couple of weeks, the motivation _at the time_ being they were a
> natural way to dramatically reduce the size of repos.
>
> One of the nice things about tree nodes is that for doing a diff
> between versions you can, to overwhelming probability, decide
> equality/inequality of two arbitrarily deep and complicated subtrees
> by comparing 40 characters, regardless of how remote and convoluted
> their common ancestry. With delta chains don't you end up having to
> trace back to a common "entry" in the history? (Of course, I don't
> know how packs affect this - presumably there's some delta chasing to
> get to the bare objects as well.)

While it is a 40 character compare, how many disk accesses were needed
to get those two SHAs into memory?

>
> --
> cheers, dave tweed__________________________
> david.tweed@gmail.com
> Rm 124, School of Systems Engineering, University of Reading.
> "we had no idea that when we added templates we were adding a Turing-
> complete compile-time language." -- C++ standardisation committee
>


-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  5:52       ` Jon Smirl
@ 2007-09-04  5:55         ` Andreas Ericsson
  2007-09-04  6:16           ` Shawn O. Pearce
  2007-09-04  6:16         ` David Tweed
  1 sibling, 1 reply; 27+ messages in thread
From: Andreas Ericsson @ 2007-09-04  5:55 UTC (permalink / raw)
  To: Jon Smirl; +Cc: David Tweed, Shawn O. Pearce, Git Mailing List

Jon Smirl wrote:
> On 9/4/07, David Tweed <david.tweed@gmail.com> wrote:
>> On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
>>> Git has picked up the hierarchical storage scheme since it was built
>>> on a hierarchical file system.
>> FWIW my memory is that initial git used path-to-blob lists (as you're
>> describing but without delta-ing) and tree nodes were added after a
>> couple of weeks, the motivation _at the time_ being they were a
>> natural way to dramatically reduce the size of repos.
>>
>> One of the nice things about tree nodes is that for doing a diff
>> between versions you can, to overwhelming probability, decide
>> equality/inequality of two arbitrarily deep and complicated subtrees
>> by comparing 40 characters, regardless of how remote and convoluted
>> their common ancestry. With delta chains don't you end up having to
>> trace back to a common "entry" in the history? (Of course, I don't
>> know how packs affect this - presumably there's some delta chasing to
>> get to the bare objects as well.)
> 
> While it is a 40 character compare, how many disk accesses were needed
> to get those two SHAs into memory?
> 

One more than there would have been to read only the commit, and one more
per level of recursion, assuming you never ever pack your repository.

If you *do* pack it, the tree(s) needed to compare are likely already
inside the sliding packfile window. In that case, there are no extra
disk accesses.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  5:55         ` Andreas Ericsson
@ 2007-09-04  6:16           ` Shawn O. Pearce
  2007-09-04 14:19             ` Jon Smirl
  0 siblings, 1 reply; 27+ messages in thread
From: Shawn O. Pearce @ 2007-09-04  6:16 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Jon Smirl, David Tweed, Git Mailing List

Andreas Ericsson <ae@op5.se> wrote:
> Jon Smirl wrote:
> >On 9/4/07, David Tweed <david.tweed@gmail.com> wrote:
> >>On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
> >>>Git has picked up the hierarchical storage scheme since it was built
> >>>on a hierarchical file system.
...
> >>One of the nice things about tree nodes is that for doing a diff
> >>between versions you can, to overwhelming probability, decide
> >>equality/inequality of two arbitrarily deep and complicated subtrees
> >>by comparing 40 characters, regardless of how remote and convoluted
> >>their common ancestry. With delta chains don't you end up having to
> >>trace back to a common "entry" in the history? (Of course, I don't
> >>know how packs affect this - presumably there's some delta chasing to
> >>get to the bare objects as well.)
> >
> >While it is a 40 character compare, how many disk accesses were needed
> >to get those two SHAs into memory?
> 
> One more than there would have been to read only the commit, and one more
> per level of recursion, assuming you never ever pack your repository.
> 
> If you *do* pack it, the tree(s) needed to compare are likely already
> inside the sliding packfile window. In that case, there are no extra
> disk accesses.

Even better, lets do some back of the napkin math on the Linux
kernel tree.  My local (out of date but close enough) copy has
22,730 files in the tip revision.  Values shown are uncompressed
and compressed (gzip -9 | wc -c), but are excluding deltification.

                 Current Scheme       Jon's Flat Scheme
                 -----------------    -----------------
commit raw       932                  932 + 22,730*20 = 455,532
(compressed)     521                  456,338

root tree raw    876                  0
(compressed)     805                  0

I'm not even bothering with the individual subtrees.  The numbers
will fall off quickly when you start to do subtree elimination and
only load the levels you need.

You are talking about doing disk IO for less than 4KiB with
the current scheme, and almost 456 KiB for the flat scheme.
That's before deltification.  So if you also assume deltification
its going to be higher as you need to read back to a base object
that is roughly the final size and then unpack the smaller deltas
to reach the real commit.

Remember, SHA-1s can be stored as 20 bytes of binary data but they
are also generally uncompressible.  That's why the root tree does
not compress very well, the SHA-1 data inside the tree cannot be
compressed and only the filenames have any shot at being compressed.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  5:52       ` Jon Smirl
  2007-09-04  5:55         ` Andreas Ericsson
@ 2007-09-04  6:16         ` David Tweed
  1 sibling, 0 replies; 27+ messages in thread
From: David Tweed @ 2007-09-04  6:16 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Shawn O. Pearce, Git Mailing List

On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
> On 9/4/07, David Tweed <david.tweed@gmail.com> wrote:
> > On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
> > > Git has picked up the hierarchical storage scheme since it was built
> > > on a hierarchical file system.
> >
> > FWIW my memory is that initial git used path-to-blob lists (as you're
> > describing but without delta-ing) and tree nodes were added after a
> > couple of weeks, the motivation _at the time_ being they were a
> > natural way to dramatically reduce the size of repos.
> >
> > One of the nice things about tree nodes is that for doing a diff
> > between versions you can, to overwhelming probability, decide
> > equality/inequality of two arbitrarily deep and complicated subtrees
> > by comparing 40 characters, regardless of how remote and convoluted
> > their common ancestry. With delta chains don't you end up having to
> > trace back to a common "entry" in the history? (Of course, I don't
> > know how packs affect this - presumably there's some delta chasing to
> > get to the bare objects as well.)
>
> While it is a 40 character compare, how many disk accesses were needed
> to get those two SHAs into memory?

That's a difficult question. Clearly if you're starting on a machine
without anything cached, you're probably a bit worse off. If not and
you've got a "wide, shallow tree where changes occur clustered within
directories rather than spread uniformly throughout the tree" then
it's likely already there.

But it's all I suspect it should be looked at in relative terms: with
the respect to the delta-d list of SHA's, do you need to do more or
less disk reads to be able to compare stuff? Dunno: I'd imagine this
depends on precisely what's delta'd against what and how many of those
you need to read in order to put two entries into a comparable form.
The key point I was making was not so much the 40 characters as that
(at least with loose objects) if you can are given the top-level
SHA's, you can efficiently decide equality (a key to efficient
diffing) _without having to care how those two are related in the
history or read any extra history_.

-- 
cheers, dave tweed__________________________
david.tweed@gmail.com
Rm 124, School of Systems Engineering, University of Reading.
"we had no idea that when we added templates we were adding a Turing-
complete compile-time language." -- C++ standardisation committee

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  3:26   ` Jon Smirl
  2007-09-04  3:40     ` Johannes Schindelin
  2007-09-04  4:19     ` David Tweed
@ 2007-09-04  6:26     ` Shawn O. Pearce
  2007-09-04 17:39       ` Junio C Hamano
  2007-09-04 16:20     ` Daniel Hulme
  3 siblings, 1 reply; 27+ messages in thread
From: Shawn O. Pearce @ 2007-09-04  6:26 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Git Mailing List

Jon Smirl <jonsmirl@gmail.com> wrote:
> Index is the key here, we may want other kinds of indexes in the
> future. It was the mail about auto-generating the Maintainers list
> that caused me to think about this. If file objects are a table with
> triggers, building a hierarchical index for the Maintainers field
> doesn't make sense.

There's nothing stopping us from creating additional indexes.

For example we have been kicking around this idea of a "note"
object that can be attached to commits.  Lightweight enough that one
could be attached to every commit, such as one way to implement the
"Signed-off-by" lines.  Notes can be looked up by commit SHA-1 using
a hash, giving near O(1) time to locate the note for any commit.

But we can also store the notes alongside the commits in the
packfile, so that if the data for the commit has been paged in
by the kernel then the note data is also most likely in memory,
and if not, is in the read-ahead queue.  Clustering the notes
alongside the commits makes access to them even faster, as we
don't need to consult an external hash to locate the position.

So I guess where I'm going is additional indexes can be implemented,
efficiently, without changing any of the core storage model in Git.
We just haven't made it easily user pluggable yet, because nobody
has really thought about the applications for such a function.
And code hasn't been posted for it (with the exception of the
notes prototypes).

-- 
Shawn.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  5:37           ` Jon Smirl
  2007-09-04  5:51             ` Andreas Ericsson
@ 2007-09-04 10:33             ` Johannes Schindelin
  2007-09-04 14:31               ` Jon Smirl
  1 sibling, 1 reply; 27+ messages in thread
From: Johannes Schindelin @ 2007-09-04 10:33 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Martin Langhoff, Shawn O. Pearce, Git Mailing List

Hi,

On Tue, 4 Sep 2007, Jon Smirl wrote:

> In my scheme the path info is moved into the file object nodes and the 
> SHA list is in the commit node.

And how should this "SHA list" be any different from a single tree object, 
except that you now merge it with the commit object?

Really, go back to the mail Martin mentioned.  Having all objects in one 
list kills performance.

> Diffing two trees in the scheme is quite fast. Just get their commit
> objects into RAM and compare the lists of SHAs.

No, it is not fast.  Just loading the complete list into RAM is likely 
much, much slower than a simple diff _right_ _now_.

Hth,
Dscho

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  6:16           ` Shawn O. Pearce
@ 2007-09-04 14:19             ` Jon Smirl
  2007-09-04 14:41               ` Andreas Ericsson
  0 siblings, 1 reply; 27+ messages in thread
From: Jon Smirl @ 2007-09-04 14:19 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Andreas Ericsson, David Tweed, Git Mailing List

On 9/4/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> Andreas Ericsson <ae@op5.se> wrote:
> > Jon Smirl wrote:
> > >On 9/4/07, David Tweed <david.tweed@gmail.com> wrote:
> > >>On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
> > >>>Git has picked up the hierarchical storage scheme since it was built
> > >>>on a hierarchical file system.
> ...
> > >>One of the nice things about tree nodes is that for doing a diff
> > >>between versions you can, to overwhelming probability, decide
> > >>equality/inequality of two arbitrarily deep and complicated subtrees
> > >>by comparing 40 characters, regardless of how remote and convoluted
> > >>their common ancestry. With delta chains don't you end up having to
> > >>trace back to a common "entry" in the history? (Of course, I don't
> > >>know how packs affect this - presumably there's some delta chasing to
> > >>get to the bare objects as well.)
> > >
> > >While it is a 40 character compare, how many disk accesses were needed
> > >to get those two SHAs into memory?
> >
> > One more than there would have been to read only the commit, and one more
> > per level of recursion, assuming you never ever pack your repository.
> >
> > If you *do* pack it, the tree(s) needed to compare are likely already
> > inside the sliding packfile window. In that case, there are no extra
> > disk accesses.
>
> Even better, lets do some back of the napkin math on the Linux
> kernel tree.  My local (out of date but close enough) copy has
> 22,730 files in the tip revision.  Values shown are uncompressed
> and compressed (gzip -9 | wc -c), but are excluding deltification.
>
>                  Current Scheme       Jon's Flat Scheme
>                  -----------------    -----------------
> commit raw       932                  932 + 22,730*20 = 455,532
> (compressed)     521                  456,338
>
> root tree raw    876                  0
> (compressed)     805                  0

This is not a fair comparison. The current scheme is effectively
diffed against the previous version. You aren't showing an equivalent
diff for the flat scheme. Both schemes are dealing with the same
22,000 SHAs.

The size win is from diffing, not compressing.

> I'm not even bothering with the individual subtrees.  The numbers
> will fall off quickly when you start to do subtree elimination and
> only load the levels you need.
>
> You are talking about doing disk IO for less than 4KiB with
> the current scheme, and almost 456 KiB for the flat scheme.
> That's before deltification.  So if you also assume deltification
> its going to be higher as you need to read back to a base object
> that is roughly the final size and then unpack the smaller deltas
> to reach the real commit.
>
> Remember, SHA-1s can be stored as 20 bytes of binary data but they
> are also generally uncompressible.  That's why the root tree does
> not compress very well, the SHA-1 data inside the tree cannot be
> compressed and only the filenames have any shot at being compressed.
>
> --
> Shawn.
>


-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04 10:33             ` Johannes Schindelin
@ 2007-09-04 14:31               ` Jon Smirl
  2007-09-04 15:05                 ` Johannes Schindelin
  2007-09-04 15:14                 ` Andreas Ericsson
  0 siblings, 2 replies; 27+ messages in thread
From: Jon Smirl @ 2007-09-04 14:31 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Martin Langhoff, Shawn O. Pearce, Git Mailing List

On 9/4/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Tue, 4 Sep 2007, Jon Smirl wrote:
>
> > In my scheme the path info is moved into the file object nodes and the
> > SHA list is in the commit node.
>
> And how should this "SHA list" be any different from a single tree object,
> except that you now merge it with the commit object?
>
> Really, go back to the mail Martin mentioned.  Having all objects in one
> list kills performance.

You are ignoring the fact the in this scheme temp indexes can be
created as needed. These temp indexes could look just like tree nodes.

I'm saying that it may be a mistake to be recording the indexes (aka
file names) as part of the commit when they really aren't. The
essential part of the commit is the SHA1 list. The path names belong
to the file objects and should be stored there.


>
> > Diffing two trees in the scheme is quite fast. Just get their commit
> > objects into RAM and compare the lists of SHAs.
>
> No, it is not fast.  Just loading the complete list into RAM is likely
> much, much slower than a simple diff _right_ _now_.
>
> Hth,
> Dscho
>
>


-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04 14:19             ` Jon Smirl
@ 2007-09-04 14:41               ` Andreas Ericsson
  0 siblings, 0 replies; 27+ messages in thread
From: Andreas Ericsson @ 2007-09-04 14:41 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Shawn O. Pearce, David Tweed, Git Mailing List

Jon Smirl wrote:
> On 9/4/07, Shawn O. Pearce <spearce@spearce.org> wrote:
>> Andreas Ericsson <ae@op5.se> wrote:
>>> Jon Smirl wrote:
>>>> On 9/4/07, David Tweed <david.tweed@gmail.com> wrote:
>>>>> On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
>>>>>> Git has picked up the hierarchical storage scheme since it was built
>>>>>> on a hierarchical file system.
>> ...
>>>>> One of the nice things about tree nodes is that for doing a diff
>>>>> between versions you can, to overwhelming probability, decide
>>>>> equality/inequality of two arbitrarily deep and complicated subtrees
>>>>> by comparing 40 characters, regardless of how remote and convoluted
>>>>> their common ancestry. With delta chains don't you end up having to
>>>>> trace back to a common "entry" in the history? (Of course, I don't
>>>>> know how packs affect this - presumably there's some delta chasing to
>>>>> get to the bare objects as well.)
>>>> While it is a 40 character compare, how many disk accesses were needed
>>>> to get those two SHAs into memory?
>>> One more than there would have been to read only the commit, and one more
>>> per level of recursion, assuming you never ever pack your repository.
>>>
>>> If you *do* pack it, the tree(s) needed to compare are likely already
>>> inside the sliding packfile window. In that case, there are no extra
>>> disk accesses.
>> Even better, lets do some back of the napkin math on the Linux
>> kernel tree.  My local (out of date but close enough) copy has
>> 22,730 files in the tip revision.  Values shown are uncompressed
>> and compressed (gzip -9 | wc -c), but are excluding deltification.
>>
>>                  Current Scheme       Jon's Flat Scheme
>>                  -----------------    -----------------
>> commit raw       932                  932 + 22,730*20 = 455,532
>> (compressed)     521                  456,338
>>
>> root tree raw    876                  0
>> (compressed)     805                  0
> 
> This is not a fair comparison. The current scheme is effectively
> diffed against the previous version. You aren't showing an equivalent
> diff for the flat scheme. Both schemes are dealing with the same
> 22,000 SHAs.
> 

How, with your scheme, would you solve

	git diff -M master pu

in the git repo?

You'd have to build both trees completely, utilizing the last known
complete tree-listing (the root commit, since you propose to do away
with trees altogether) and then applying diffs on top of that to
finally generate an in-memory tree-structure in which you will have
to compare every single file against every single other file to find
out which code has been moved/copied/renamed/whatever.

That's (n*(n+1))/2 operations for file-level diffs alone. For the
kernels 22730 files, you're looking at 258337815 file comparisons
without the tree objects.

Sure, you can probably shave away quite a few of those comparisons
at the expense of computing the tree-hashes on the fly, but in that
case, why get rid of them in the first place?

> The size win is from diffing, not compressing.
> 

It was declared in May 2006 by someone insightful that diskspace
and bandwidth are cheap, while human time is priceless.

IOW, size wins had better be proportionally huge to justify slowing
git down and thereby taking more than necessary of the users' time.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04 14:31               ` Jon Smirl
@ 2007-09-04 15:05                 ` Johannes Schindelin
  2007-09-04 15:14                 ` Andreas Ericsson
  1 sibling, 0 replies; 27+ messages in thread
From: Johannes Schindelin @ 2007-09-04 15:05 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Martin Langhoff, Shawn O. Pearce, Git Mailing List

Hi,

On Tue, 4 Sep 2007, Jon Smirl wrote:

> On 9/4/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>
> > On Tue, 4 Sep 2007, Jon Smirl wrote:
> >
> > > In my scheme the path info is moved into the file object nodes and 
> > > the SHA list is in the commit node.
> >
> > And how should this "SHA list" be any different from a single tree 
> > object, except that you now merge it with the commit object?
> >
> > Really, go back to the mail Martin mentioned.  Having all objects in 
> > one list kills performance.
> 
> You are ignoring the fact the in this scheme temp indexes can be
> created as needed. These temp indexes could look just like tree nodes.
> 
> I'm saying that it may be a mistake to be recording the indexes (aka
> file names) as part of the commit when they really aren't.

You still have not acknowledged that you want to do (in essence) _exactly_ 
the same.  Your "temp indexes" awfully remind me of tree objects.

And that is no wonder, since you _need_ something like that, however you 
want to avoid it.  Just admit it that you found -- again -- that flat 
directory structure does not work, and you therefore need some smaller 
structures.  We just happen to call them "tree objects", and for ease of 
concept we wrap directories in them.

Hth,
Dscho

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04 14:31               ` Jon Smirl
  2007-09-04 15:05                 ` Johannes Schindelin
@ 2007-09-04 15:14                 ` Andreas Ericsson
  2007-09-04 21:02                   ` Martin Langhoff
  1 sibling, 1 reply; 27+ messages in thread
From: Andreas Ericsson @ 2007-09-04 15:14 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Johannes Schindelin, Martin Langhoff, Shawn O. Pearce,
	Git Mailing List

Jon Smirl wrote:
> On 9/4/07, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> Hi,
>>
>> On Tue, 4 Sep 2007, Jon Smirl wrote:
>>
>>> In my scheme the path info is moved into the file object nodes and the
>>> SHA list is in the commit node.
>> And how should this "SHA list" be any different from a single tree object,
>> except that you now merge it with the commit object?
>>
>> Really, go back to the mail Martin mentioned.  Having all objects in one
>> list kills performance.
> 
> You are ignoring the fact the in this scheme temp indexes can be
> created as needed. These temp indexes could look just like tree nodes.
> 

What's the cost associated with creating such an index?
What are the estimated savings in size?
What other performance penalties will git take due to this?

While I'm certainly interested in improving git (well, discussing its
improvements anyway - I write too shabby code to participate fully),
I fail to see how this scheme of storing things could help doing that,
seeing as it's already the fastest SCM out there and does an insanely
good job at storing large repositories packed up tight.

What you're proposing would effectively shave off ~4kb or so of stored
data and save a *possible* disk access (note that there will most
likely be none, as most users pack large repos, and for small ones it
shouldn't matter much either way). The way you propose to do so
involves fundamental and backwards-incompatible alterations to the
very core of the git plumbing.

Yet you haven't shown a single estimate of how much diskspace would
be saved, or what other gains are to be had, let alone any code to 
implement even a prototype of this new storage scheme.

Note that I'm currently in a state of mind where I'm feeling inclined
to whoop and cheer at every small suggestion that sounds even the
slightest bit of fun. No, I'm not doing drugs, but I just met the 
possible love of my life, got a big fat raise and my stocks just
went ballistic. Someone less fortunate than I am right now is probably
too bored to even argue against it.

> I'm saying that it may be a mistake to be recording the indexes (aka
> file names) as part of the commit when they really aren't.

But they are, since they're part of the snapshot.

> The
> essential part of the commit is the SHA1 list. The path names belong
> to the file objects and should be stored there.
> 

Sorry, but it'll take code and benchmarks to convince me this is a
good idea. Not that it matters much what I think, but I've got an
inkling Junio will need the same before he's swayed to your point
of view.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  3:26   ` Jon Smirl
                       ` (2 preceding siblings ...)
  2007-09-04  6:26     ` Shawn O. Pearce
@ 2007-09-04 16:20     ` Daniel Hulme
  3 siblings, 0 replies; 27+ messages in thread
From: Daniel Hulme @ 2007-09-04 16:20 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Shawn O. Pearce, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 2786 bytes --]

On Mon, Sep 03, 2007 at 11:26:30PM -0400, Jon Smirl wrote:
> This is something that has always bugged me about file systems. File
> systems force hierarchical naming due to their directory structure.
> There is no reason they have to work that way. Google is an example of
> a giant file system that works just fine without hierarchical
> directories. The full path should be just another attribute on the
> file. If you want a hierarchical index into the file system you can
> generate it by walking the files or using triggers. But you could also
> delete the hierarchical directory and replace it with something else
> like a full text index. Directories would become a computationally
> generated cache, not a critical part of the file system. But this is a
> git list so I shouldn't go too far off into file system design.

Am I the only one who thinks that this idea of moving filenames from
tree objects into blobs does the *opposite* of what you're trying to
achieve?

It seems, though I could be completely misinterpreting what you're
saying, that you want to be able to get rid of directories and replace
them with some other index into your files: maybe a full-text index,
maybe a spatial index for geographic data, maybe something else
entirely. As things stand, you could do that by editing the core to
introduce a new object type 'fulltext' whose contents maybe look like

aardvark <sha1 of a blob> <sha1 of another blob>
abacus <sha1>
...
zebra <yet another sha1> <maybe the same sha1 I mentioned before>

or even something hierarchical, with each index mapping from the first
letter of the index term to the sha1 of another index, which in term
maps second letters, and so on. Whatever. The point is, it works
parallel to tree. You could have the blobs referenced by your fulltext
object also be referenced by a tree object. If you really don't like
directory trees, you can dispense with tree objects in your repo
entirely. Either way you have a mapping from keys to blobs.

Then you could have your commits and tags include sha1's of fulltext
objects rather than (or as well as) tree objects, and you get your wish.

OTOH, imagine if you move filenames into the blobs. Now, no matter what
other index types you introduce, they'll always be secondary to the
traditional, path-and-filename method of finding files. Crucially, you
can't introduce new blobs into the repo without giving them filenames.

As you said in your other thread,

> Integrating indexing into the data is not normally done in a database.

But isn't this exactly what integrating filenames into blobs would do?

-- 
There is no such thing as a small specification change.
http://surreal.istic.org/            Forcing the lines through the snow.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04  6:26     ` Shawn O. Pearce
@ 2007-09-04 17:39       ` Junio C Hamano
  2007-09-06  3:20         ` Shawn O. Pearce
  0 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2007-09-04 17:39 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Jon Smirl, Git Mailing List

"Shawn O. Pearce" <spearce@spearce.org> writes:

> There's nothing stopping us from creating additional indexes.
> ...
> But we can also store the notes alongside the commits in the
> packfile, so that if the data for the commit has been paged in
> by the kernel then the note data is also most likely in memory,
> and if not, is in the read-ahead queue.  Clustering the notes
> alongside the commits makes access to them even faster, as we
> don't need to consult an external hash to locate the position.

I would agree with your main thrust "nobody prevents you from
building additional index", but on a tangent, I am skeptical
about adding too much to pack v4.  Especially "clustering the
notes" part.

Many operations (like "git log" that is not path limited) do not
even need trees.  The current packfile format has commits at the
beginning without any associated trees, and they are stored in
traversal order (modulo delta-base requirements can move base
object earlier), which is geared toward optimizing such a common
case.

Now, hopefully many operations do not need notes either,
although notes themselves can store _anything_ so each of them
could be large and/or each commit could have large number of
them.  I suspect clustering notes along with the commit they
annotate would break the locality of access for common case.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04 15:14                 ` Andreas Ericsson
@ 2007-09-04 21:02                   ` Martin Langhoff
  0 siblings, 0 replies; 27+ messages in thread
From: Martin Langhoff @ 2007-09-04 21:02 UTC (permalink / raw)
  To: Andreas Ericsson
  Cc: Jon Smirl, Johannes Schindelin, Shawn O. Pearce, Git Mailing List

On 9/5/07, Andreas Ericsson <ae@op5.se> wrote:
> Jon Smirl wrote:
> > The
> > essential part of the commit is the SHA1 list. The path names belong
> > to the file objects and should be stored there.
>
> Sorry, but it'll take code and benchmarks to convince me this is a
> good idea.

Same here. But I do want to note that adding the pathname to the blob
(and storing them on the combined SHA1 I assume) is broken, broken,
broken. It won't support any of the great semantics that git has
today.

Jon: I think you need to think carefully about the key operations like
diff and log on paths on projects like linux or mozilla. Even better,
craft a proof-of-concept patch that shows how that'd work, and why
it's faster/smaller.

cheers,

m

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-04 17:39       ` Junio C Hamano
@ 2007-09-06  3:20         ` Shawn O. Pearce
  2007-09-06  5:21           ` Junio C Hamano
  0 siblings, 1 reply; 27+ messages in thread
From: Shawn O. Pearce @ 2007-09-06  3:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jon Smirl, Git Mailing List

Junio C Hamano <gitster@pobox.com> wrote:
> "Shawn O. Pearce" <spearce@spearce.org> writes:
> 
> > There's nothing stopping us from creating additional indexes.
> > ...
> > But we can also store the notes alongside the commits in the
> > packfile, so that if the data for the commit has been paged in
...
> 
> I would agree with your main thrust "nobody prevents you from
> building additional index", but on a tangent, I am skeptical
> about adding too much to pack v4.  Especially "clustering the
> notes" part.
...
> Now, hopefully many operations do not need notes either,
> although notes themselves can store _anything_ so each of them
> could be large and/or each commit could have large number of
> them.  I suspect clustering notes along with the commit they
> annotate would break the locality of access for common case.

I'm inclined to agree.

Its something I've thought about doing.  I haven't even prototyped
code for it.  Let alone shown numbers that say one way or the other.

One of the notes proposals was talking about having lots of different
classes of notes.  E.g. a "Signed-off-by" class and a "build and test
log results" class.

The former would generally be very small and may even want to be
shown most of the time the commit body is displayed (e.g. in gitk,
git-log).  These would be good candidates to cluster alongside the
commit.  Indeed they are clustered there today, just hung inside
of the commit object itself.  Nobody is bitching about the hit they
cause on the common case of `pack-objects`.  :)

The latter (build and test log) would generally be very large.
We would *not* want to cluster them.  But we might want to store
next to the commit a very small pointer to the note itself.  Such
as the note's SHA-1.  Or its offset within the packfile's index.
This would make locating those notes very cheap, while not having
a huge impact on the common case of commit traversal.

Likewise we might want to pack a tag's SHA-1 alongside of the commit
it points at, as parsing the commit would immediately give us all
annotated tags that refer to that commit.  Tags are (usually) few
and far between.  But tools like git-describe are commonly used and
would benefit from not needing to build the commit->tag hashtable.
OK, well, git-describe cheats and uses the struct object hashtable,
but whatever.

You get my point.  I think.  And I got yours about not making the
common case worse than it already is today.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Calculating tree nodes
  2007-09-06  3:20         ` Shawn O. Pearce
@ 2007-09-06  5:21           ` Junio C Hamano
  0 siblings, 0 replies; 27+ messages in thread
From: Junio C Hamano @ 2007-09-06  5:21 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Jon Smirl, Git Mailing List

"Shawn O. Pearce" <spearce@spearce.org> writes:

> The latter (build and test log) would generally be very large.
> We would *not* want to cluster them.  But we might want to store
> next to the commit a very small pointer to the note itself.  Such
> as the note's SHA-1.  Or its offset within the packfile's index.
> This would make locating those notes very cheap, while not having
> a huge impact on the common case of commit traversal.
>
> Likewise we might want to pack a tag's SHA-1 alongside of the commit
> it points at, as parsing the commit would immediately give us all
> annotated tags that refer to that commit.  Tags are (usually) few
> and far between.  But tools like git-describe are commonly used and
> would benefit from not needing to build the commit->tag hashtable.

I am not so sure.

It is a good idea to have an internal API that takes an object
and gives back a set of objects (be they notes to commit or tags
point at object) related to it, but you need to always deal with
loose objects and v2 packs, so you will have to traverse all the
tags and find what they point at _anyway_ to implement the API.
Having a specialized table in v4 does not mean you can look at
that table and nothing else, so I do not immediately see a gain.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2007-09-06  5:21 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-09-04  2:13 Calculating tree nodes Jon Smirl
2007-09-04  2:51 ` Shawn O. Pearce
2007-09-04  3:26   ` Jon Smirl
2007-09-04  3:40     ` Johannes Schindelin
2007-09-04  3:54       ` Jon Smirl
2007-09-04  4:21         ` Martin Langhoff
2007-09-04  5:37           ` Jon Smirl
2007-09-04  5:51             ` Andreas Ericsson
2007-09-04 10:33             ` Johannes Schindelin
2007-09-04 14:31               ` Jon Smirl
2007-09-04 15:05                 ` Johannes Schindelin
2007-09-04 15:14                 ` Andreas Ericsson
2007-09-04 21:02                   ` Martin Langhoff
2007-09-04  4:28         ` Junio C Hamano
2007-09-04  5:50           ` Jon Smirl
2007-09-04  4:19     ` David Tweed
2007-09-04  5:52       ` Jon Smirl
2007-09-04  5:55         ` Andreas Ericsson
2007-09-04  6:16           ` Shawn O. Pearce
2007-09-04 14:19             ` Jon Smirl
2007-09-04 14:41               ` Andreas Ericsson
2007-09-04  6:16         ` David Tweed
2007-09-04  6:26     ` Shawn O. Pearce
2007-09-04 17:39       ` Junio C Hamano
2007-09-06  3:20         ` Shawn O. Pearce
2007-09-06  5:21           ` Junio C Hamano
2007-09-04 16:20     ` Daniel Hulme

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).