git@vger.kernel.org list mirror (unofficial, one of many)
* Git's database structure
@ 2007-09-04 15:23 Jon Smirl
  2007-09-04 15:55 ` Andreas Ericsson
                   ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Jon Smirl @ 2007-09-04 15:23 UTC (permalink / raw)
  To: Git Mailing List

Let's back up a little bit from "Calculating tree node".  What are the
elements of git's data structures?

Right now we have an index structure (tree nodes) integrated into a
base table. Integrating indexing into the data is not normally done in
a database. Doing a normalization analysis like this may expose flaws
in the way the data is structured. Of course we may also decide to
leave everything the way it is.

What about the special status of a rename? In the current model we
effectively have three tables.

commit - a set of all SHAs in the commit, previous commit, comment, author, etc
blob - a file, permissions, etc.
file names - name, SHA

The file name table is encoded as an index and it has been
intermingled with the commit table.

Looking at this from a set theory angle brings up the question, do we
really have three tables and file names are an independent variable
from the blobs, or should file names be an attribute of the blob?

How this gets structured in the db is a question independent of
how renames get detected on a commit. The current scheme for detecting
renames by comparing diffs is working fine. The question is, once we
detect a rename, how should it be stored?

Ignoring the performance impacts and looking at the problem from the
set theory viewpoint, should:
- the pathnames be in their own table with a row for each alias, or
- the pathnames be stored as an attribute of the blob?

Both of these are the same information; we're just looking at how
things are normalized.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Git's database structure
  2007-09-04 15:23 Git's database structure Jon Smirl
@ 2007-09-04 15:55 ` Andreas Ericsson
  2007-09-04 16:07   ` Mike Hommey
                     ` (2 more replies)
  2007-09-04 16:28 ` Jon Smirl
  2007-09-04 17:19 ` Julian Phillips
  2 siblings, 3 replies; 39+ messages in thread
From: Andreas Ericsson @ 2007-09-04 15:55 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Git Mailing List

Jon Smirl wrote:
> Let's back up a little bit from "Caclulating tree node".  What are the
> elements of git's data structures?
> 
> Right now we have an index structure (tree nodes) integrated in to a
> base table. Integrating indexing into the data is not normally done in
> a database. Doing a normalization analysis like this may expose flaws
> in the way the data is structured. Of course we may also decide to
> leave everything the way it is.
> 
> What about the special status of a rename? In the current model we
> effectively have three tables.
> 
> commit - a set of all SHAs in the commit, previous commit, comment, author, etc

> blob - a file, permissions, etc.
> file names - name, SHA

commit - SHA1 of its parent(s) and its root-tree, along with
         author info and a free-form field
blob - content addressable by *multiple trees*
file names - List of path-names inside a tree object.


To draw some sort of relationship model here, you'd have

commit 1<->M roottree
tree M<->M tree
tree M<->M blob

Assuming SHA1 never collides (collisions rule out any form of storage,
so we might as well hope it never happens), that leaves us with this:

Each root tree can only ever belong to a single commit, unless you
intentionally force git to make completely empty commits. git
won't complain about this, so long as you don't make two in the
same second, because it relies more heavily on the DAG than on
developer sanity.

Each root tree can point to multiple sub-trees. The sub-trees can be
linked to any number of root-trees. 

Blobs can be linked to any number of tree objects, or even multiple
times to the same tree object. This wouldn't be possible if the
blob objects had their own pathnames stored inside them, so to speak.

> 
> The file name table is encoded as an index and it has been
> intermingled with the commit table.
> 
> Looking at this from a set theory angle brings up the question, do we
> really have three tables and file names are an independent variable
> from the blobs, or should file names be an attribute of the blob?
> 

File names are not independent variables. They belong inside the
table created for them, which is the tree objects.

> How this gets structured in the db is an independent question about
> how renames get detected on a commit. The current scheme for detecting
> renames by comparing diffs is working fine. The question is, once we
> detect a rename how should it be stored?
> 

Do you realize that you're contradicting yourself in two consecutive
sentences here?

Detecting renames after the fact works fine. Not storing them
is part of the "detect them by comparing diffs".

> Ignoring the performance impacts and looking at the problem from the
> set theory view point, should:
> the pathnames be in their own table with a row for each alias
> the pathnames be stored as an attribute of the blob
> 
> Both of these are the same information, we're just looking at how
> things are normalized.
> 

Except that

git init
echo foo > a
cp -a a b
git add .
git commit -m testing
git count-objects

yields 3 objects at the moment: a commit object, a tree object and *one*
blob object. With your scheme the 2 blob objects would differ, and there
would be 4 objects in total. If you propose to ignore the path-name you
have effectively broken support for having two identical files with
different names in the same directory.
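
To make the object count concrete, here is a minimal Python sketch of
how a blob id is derived. It assumes the usual hashing of a blob
(SHA-1 over a "blob <size>\0" header followed by the content); the
names git_blob_sha1 and worktree are made up for illustration and are
not git APIs.

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """Id of a blob: SHA-1 over 'blob <size>\\0' followed by the content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Working tree from the example above: two paths, identical content.
worktree = {"a": b"foo\n", "b": b"foo\n"}

# The tree object carries (name -> blob id); the blob itself carries no name.
tree_entries = {name: git_blob_sha1(data) for name, data in worktree.items()}
unique_blobs = set(tree_entries.values())

print(tree_entries)
print(len(unique_blobs))  # both names point at the same blob
```

Because the path never enters the hash, "a" and "b" collapse to one
blob, and only the tree needs to list both names.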

Now, can you please tell me what gains you're hoping to see with this
new layout of yours?

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: Git's database structure
  2007-09-04 15:55 ` Andreas Ericsson
@ 2007-09-04 16:07   ` Mike Hommey
  2007-09-04 16:10     ` Andreas Ericsson
  2007-09-04 16:19   ` Jon Smirl
  2007-09-04 17:21   ` Junio C Hamano
  2 siblings, 1 reply; 39+ messages in thread
From: Mike Hommey @ 2007-09-04 16:07 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Git Mailing List

On Tue, Sep 04, 2007 at 05:55:16PM +0200, Andreas Ericsson <ae@op5.se> wrote:
> Each root tree can only ever belong to a single commit, unless you
> intentionally force git to make completely empty commits. git
> won't complain about this, so long as you don't make two in the
> same second, because it relies more heavily on the DAG than on
> developer sanity.

Actually, you don't need to be insane to have multiple commits pointing
at the same root tree. It is actually very easy:
- git clone
- do some stuff on your master branch and commit
- send your changes upstream
- upstream applies as is
- git pull

You now have everything merged, and the last commit on your master branch,
while being a different commit object due to its parenting, has the same
root tree as the tip of the remote branch.
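
A small Python sketch of why that happens, assuming the usual commit
layout (a "tree" line, "parent" lines, author, committer, then the
message). The parent ids and the identity string are placeholders, and
commit_body is a helper invented for this example.

```python
import hashlib

def git_object_sha1(kind: str, body: bytes) -> str:
    """Name an object the way git does: SHA-1 over '<kind> <size>\\0' + body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

def commit_body(tree: str, parents: list, who: str, message: str) -> bytes:
    """Serialize a simplified commit object (no signatures or extra headers)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}", "", message]
    return ("\n".join(lines) + "\n").encode()

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # id of the empty tree
WHO = "A U Thor <author@example.com> 1188921600 +0000"   # placeholder identity

# Same tree and message, different (placeholder) parents.
mine = commit_body(EMPTY_TREE, ["1" * 40], WHO, "fix typo")
upstream = commit_body(EMPTY_TREE, ["2" * 40], WHO, "fix typo")

mine_id = git_object_sha1("commit", mine)
upstream_id = git_object_sha1("commit", upstream)
print(mine_id != upstream_id)  # distinct commits sharing one root tree
```

The tree id is just one field of the commit, so any difference in
parents, identity or message gives a new commit for the same tree.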

Mike


* Re: Git's database structure
  2007-09-04 16:07   ` Mike Hommey
@ 2007-09-04 16:10     ` Andreas Ericsson
  0 siblings, 0 replies; 39+ messages in thread
From: Andreas Ericsson @ 2007-09-04 16:10 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Git Mailing List

Mike Hommey wrote:
> On Tue, Sep 04, 2007 at 05:55:16PM +0200, Andreas Ericsson <ae@op5.se> wrote:
>> Each root tree can only ever belong to a single commit, unless you
>> intentionally force git to make completely empty commits. git
>> won't complain about this, so long as you don't make two in the
>> same second, because it relies more heavily on the DAG than on
>> developer sanity.
> 
> Actually, you don't need to be insane to have multiple commits pointing
> at the same root tree. It is actually very easy:
> - git clone
> - do some stuff on your master branch and commit
> - send your changes upstream
> - upstream applies as is
> - git pull
> 
> You now have everything merged, and the last commit on your master branch,
> while being a different commit object due to its parenting, has the same
> root tree as the tip of the remote branch.
> 

That explains why it felt so awkward writing that sentence. :)
Thanks for correcting me. Even so, one more M<->M relationship
certainly speaks for rather than against the current model.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: Git's database structure
  2007-09-04 15:55 ` Andreas Ericsson
  2007-09-04 16:07   ` Mike Hommey
@ 2007-09-04 16:19   ` Jon Smirl
  2007-09-04 16:29     ` Andreas Ericsson
                       ` (2 more replies)
  2007-09-04 17:21   ` Junio C Hamano
  2 siblings, 3 replies; 39+ messages in thread
From: Jon Smirl @ 2007-09-04 16:19 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Git Mailing List

On 9/4/07, Andreas Ericsson <ae@op5.se> wrote:
> Jon Smirl wrote:
> > Let's back up a little bit from "Caclulating tree node".  What are the
> > elements of git's data structures?
> >
> > Right now we have an index structure (tree nodes) integrated in to a
> > base table. Integrating indexing into the data is not normally done in
> > a database. Doing a normalization analysis like this may expose flaws
> > in the way the data is structured. Of course we may also decide to
> > leave everything the way it is.
> >
> > What about the special status of a rename? In the current model we
> > effectively have three tables.
> >
> > commit - a set of all SHAs in the commit, previous commit, comment, author, etc
>
> > blob - a file, permissions, etc.
> > file names - name, SHA
>
> commit - SHA1 of its parent(s) and its root-tree, along with
>          author info and a free-form field
> blob - content addressable by *multiple trees*
> file names - List of path-names inside a tree object.
>
>
> To draw some sort of relationship model here, you'd have
>
> commit 1<->M roottree
> tree M<->M tree
> tree M<->M blob

By introducing tree nodes you have blended a specific indexing scheme
into the data. There are many other ways the path names could be
indexed: hash tables, binary trees, etc.

This problem exists in file systems. Since the path names have been
encoded into the directory structures there is no way to query
something like "all files created yesterday" from a file system
without building another mapping table or a brute force search. I keep
using Google as an example: Google is indexing hierarchical URLs but
they do not use a hierarchical index to do it.

Databases keep the knowledge of how things are indexed out of the
data. A data structure analysis of git should remove the blended index
and start from the set theory.

> Assuming SHA1 never collides (collisions rule out any form of storage,
> so we might as well hope it never happens), that leaves us with this:
>
> Each root tree can only ever belong to a single commit, unless you
> intentionally force git to make completely empty commits. git
> won't complain about this, so long as you don't make two in the
> same second, because it relies more heavily on the DAG than on
> developer sanity.
>
> Each root tree can point to multiple sub-trees. The sub-trees can be
> linked to any number of root-trees.
>
> Blobs can be linked to any number of tree objects, or even multiple
> times to the same tree object. This wouldn't be possible if the
> blob objects had their own pathnames stored inside them, so to speak.
>
> >
> > The file name table is encoded as an index and it has been
> > intermingled with the commit table.
> >
> > Looking at this from a set theory angle brings up the question, do we
> > really have three tables and file names are an independent variable
> > from the blobs, or should file names be an attribute of the blob?
> >
>
> File names are not independant variables. They belong inside the
> table created for them, which is the tree objects.
>
> > How this gets structured in the db is an independent question about
> > how renames get detected on a commit. The current scheme for detecting
> > renames by comparing diffs is working fine. The question is, once we
> > detect a rename how should it be stored?
> >
>
> Do you realize that you're contradicting yourself in two upon each
> other following sentences here?
>
> Detecting renames after the fashion works fine. Not storing them
> is part of the "detect them by comparing diffs".
>
> > Ignoring the performance impacts and looking at the problem from the
> > set theory view point, should:
> > the pathnames be in their own table with a row for each alias
> > the pathnames be stored as an attribute of the blob
> >
> > Both of these are the same information, we're just looking at how
> > things are normalized.
> >
>
> Except that
>
> git init
> echo foo > a
> cp -a a b
> git add .
> git commit -m testing
> git count-objects
>
> yields 3 objects at the moment; A commit-object, a tree object and *one*
> blob object. With your scheme the 2 blob objects would differ, and there
> would be 4 of them. If you propose to ignore the path-name you have
> effectively broken support for having two identical files with different
> names in the same directory.
>
> Now, can you please tell me what gains you're hoping to see with this
> new layout of yours?
>
> --
> Andreas Ericsson                   andreas.ericsson@op5.se
> OP5 AB                             www.op5.se
> Tel: +46 8-230225                  Fax: +46 8-230231
>


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Git's database structure
  2007-09-04 15:23 Git's database structure Jon Smirl
  2007-09-04 15:55 ` Andreas Ericsson
@ 2007-09-04 16:28 ` Jon Smirl
  2007-09-04 16:31   ` Andreas Ericsson
  2007-09-04 17:25   ` Junio C Hamano
  2007-09-04 17:19 ` Julian Phillips
  2 siblings, 2 replies; 39+ messages in thread
From: Jon Smirl @ 2007-09-04 16:28 UTC (permalink / raw)
  To: Git Mailing List

Another way of looking at the problem:

Let's build a full-text index for git. You put a string into the index
and it returns the SHAs of all the file nodes that contain the string.
How do I recover the path names of these SHAs?

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Git's database structure
  2007-09-04 16:19   ` Jon Smirl
@ 2007-09-04 16:29     ` Andreas Ericsson
  2007-09-04 17:09     ` Jeff King
  2007-09-04 20:17     ` David Tweed
  2 siblings, 0 replies; 39+ messages in thread
From: Andreas Ericsson @ 2007-09-04 16:29 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Git Mailing List

Jon Smirl wrote:
> On 9/4/07, Andreas Ericsson <ae@op5.se> wrote:
>> Jon Smirl wrote:
>>> Let's back up a little bit from "Caclulating tree node".  What are the
>>> elements of git's data structures?
>>>
>>> Right now we have an index structure (tree nodes) integrated in to a
>>> base table. Integrating indexing into the data is not normally done in
>>> a database. Doing a normalization analysis like this may expose flaws
>>> in the way the data is structured. Of course we may also decide to
>>> leave everything the way it is.
>>>
>>> What about the special status of a rename? In the current model we
>>> effectively have three tables.
>>>
>>> commit - a set of all SHAs in the commit, previous commit, comment, author, etc
>>> blob - a file, permissions, etc.
>>> file names - name, SHA
>> commit - SHA1 of its parent(s) and its root-tree, along with
>>          author info and a free-form field
>> blob - content addressable by *multiple trees*
>> file names - List of path-names inside a tree object.
>>
>>
>> To draw some sort of relationship model here, you'd have
>>
>> commit 1<->M roottree
>> tree M<->M tree
>> tree M<->M blob
> 
> By introducing tree nodes you have blended a specific indexing scheme
> into the data. There are many other ways the path names could be
> indexed hash tables, binary trees, etc.
> 
> This problem exists in files systems. Since the path names have been
> encoded into the directory structures there is no way to query
> something like "all files created yesterday" from a file system
> without building another mapping table or a brute force search. I keep
> using Google as an example, Google is indexing hierarchical URLs but
> they do not use a hierarchical index to do it.
> 

Pathnames are by far the most common search-/delimiting criteria for
git though, so I fail to see why this is a problem for you.

> Databases keep the knowledge of how things are indexed out of the
> data. A data structure analysis of git should remove the blended index
> and start from the set theory.
> 

Why? This is the core of the problem, really. You haven't specified a
single, real-life reason *why* it should be any other way than it
already is. It sounds a bit to me as if you've been to a really
inspiring seminar about "how database-like things *should* be done"
and then decided to go berserk on your favourite database-like thing,
which is git.

Code and benchmarks or bust. In the meantime, I'll settle for an account
of what problems you're having with the current layout, or what gains
you're hoping to achieve with the new one. As it's the 3rd time I'm
asking, this'll be the last.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: Git's database structure
  2007-09-04 16:28 ` Jon Smirl
@ 2007-09-04 16:31   ` Andreas Ericsson
  2007-09-04 16:47     ` Jon Smirl
  2007-09-04 17:25   ` Junio C Hamano
  1 sibling, 1 reply; 39+ messages in thread
From: Andreas Ericsson @ 2007-09-04 16:31 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Git Mailing List

Jon Smirl wrote:
> Another way of looking at the problem,
> 
> Let's build a full-text index for git. You put a string into the index
> and it returns the SHAs of all the file nodes that contain the string.
> How do I recover the path names of these SHAs?
> 

I wouldn't know, but presumably any table can have more than one column.

Is this a problem you face with git so often that it requires a complete
re-design of its very core?

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: Git's database structure
  2007-09-04 16:31   ` Andreas Ericsson
@ 2007-09-04 16:47     ` Jon Smirl
  2007-09-04 16:51       ` Andreas Ericsson
  0 siblings, 1 reply; 39+ messages in thread
From: Jon Smirl @ 2007-09-04 16:47 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Git Mailing List

On 9/4/07, Andreas Ericsson <ae@op5.se> wrote:
> Jon Smirl wrote:
> > Another way of looking at the problem,
> >
> > Let's build a full-text index for git. You put a string into the index
> > and it returns the SHAs of all the file nodes that contain the string.
> > How do I recover the path names of these SHAs?
> >
>
> I wouldn't know, but presumably any table can have more than one column.
>
> Is this a problem you face with git so often that it requires a complete
> re-design of its very core?

That's the whole point. We need to discuss the impact that merging a
field (path names) with an index (tree nodes) has on future things we
may want to do with the data stored in git.

Databases don't usually blend fields/indexes without also duplicating
the field in the table. You need all the fields in the table so that
it is possible to create indexes on other fields.


>
> --
> Andreas Ericsson                   andreas.ericsson@op5.se
> OP5 AB                             www.op5.se
> Tel: +46 8-230225                  Fax: +46 8-230231
>


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Git's database structure
  2007-09-04 16:47     ` Jon Smirl
@ 2007-09-04 16:51       ` Andreas Ericsson
  0 siblings, 0 replies; 39+ messages in thread
From: Andreas Ericsson @ 2007-09-04 16:51 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Git Mailing List

Jon Smirl wrote:
> On 9/4/07, Andreas Ericsson <ae@op5.se> wrote:
>> Jon Smirl wrote:
>>> Another way of looking at the problem,
>>>
>>> Let's build a full-text index for git. You put a string into the index
>>> and it returns the SHAs of all the file nodes that contain the string.
>>> How do I recover the path names of these SHAs?
>>>
>> I wouldn't know, but presumably any table can have more than one column.
>>
>> Is this a problem you face with git so often that it requires a complete
>> re-design of its very core?
> 
> That's the whole point. We need to discuss the impact of merging a
> field (path names) with an index (tree nodes) has on future things we
> may want to do with the data stored in git.
> 

Yes, but as nobody seems to know what those future things are, it feels
rather pointless speculating about adding support to git for them. git
is a tool. It's a great one at that, because it was built to solve a
particular problem, which it does an amazing job at.

Other SCMs that had the potential to become amazingly good tools too
drowned somewhere between prototype and product in a sea of intellectual
masturbation, which had little to do with solving real-world problems.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: Git's database structure
  2007-09-04 16:19   ` Jon Smirl
  2007-09-04 16:29     ` Andreas Ericsson
@ 2007-09-04 17:09     ` Jeff King
  2007-09-04 20:17     ` David Tweed
  2 siblings, 0 replies; 39+ messages in thread
From: Jeff King @ 2007-09-04 17:09 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Andreas Ericsson, Git Mailing List

On Tue, Sep 04, 2007 at 12:19:33PM -0400, Jon Smirl wrote:

> By introducing tree nodes you have blended a specific indexing scheme
> into the data. There are many other ways the path names could be
> indexed hash tables, binary trees, etc.

That is correct. However, given that indexing scheme, many of the common
operations just "fall out" simply and efficiently, without the need to
keep separate indices. So yes, git is geared towards a particular set of
operations.

Your complaint seems to be two-fold:

 1. there is an inelegance in the blending of data and indexing. The
    problem with changing this is:
      a. we are all already using git, and it would require completely
         re-vamping the core data structure
      b. there is some feeling that the blending is necessary for
         performance. Given the difficulty of (a), I think you would
         have to provide compelling evidence (i.e., numbers) that a
         git-like system based around set theory with separate indices
         would perform as well.

 2. you want to perform some operations to which the hierarchy is not
    well-suited. In this case, I think you can get by with the same
    solution you have proposed already: indices external to the data
    structure (in fact, this is exactly what Google is doing: taking
    hierarchical URLs and indexing them in different ways).

    Have you taken a look at the pack v4 work by Shawn and Nicolas? It
    is an attempt to build such indices at pack time (but keeping the
    core git data structure intact).
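
As a rough illustration of point 2, the following Python sketch builds
secondary indices on the side while leaving the hierarchy untouched.
The nested-dict tree, the blob ids and the helper names (walk,
build_index) are all hypothetical; this is not how pack v4 or git
itself stores anything.

```python
from collections import defaultdict

# Hypothetical nested tree: directories are dicts, files map to a blob id.
root_tree = {"bar": {"foo": "blob-1"}, "foo": "blob-1", "foo2": "blob-2"}

def walk(tree, prefix=""):
    """Yield (path, blob_id) rows by walking the hierarchy once."""
    for name, entry in sorted(tree.items()):
        path = prefix + name
        if isinstance(entry, dict):
            yield from walk(entry, path + "/")
        else:
            yield path, entry

def build_index(rows, key):
    """Group paths under an arbitrary key computed from each row."""
    index = defaultdict(list)
    for path, blob_id in rows:
        index[key(path, blob_id)].append(path)
    return index

rows = list(walk(root_tree))
by_blob = build_index(rows, lambda path, blob_id: blob_id)
by_basename = build_index(rows, lambda path, blob_id: path.rsplit("/", 1)[-1])

print(by_blob["blob-1"])     # ['bar/foo', 'foo']
print(by_basename["foo"])    # ['bar/foo', 'foo']
```

The indices are derived data: they can be thrown away and rebuilt from
the trees at any time, which is why they do not need to live in the
core objects.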

-Peff


* Re: Git's database structure
  2007-09-04 15:23 Git's database structure Jon Smirl
  2007-09-04 15:55 ` Andreas Ericsson
  2007-09-04 16:28 ` Jon Smirl
@ 2007-09-04 17:19 ` Julian Phillips
  2007-09-04 17:30   ` Jon Smirl
  2 siblings, 1 reply; 39+ messages in thread
From: Julian Phillips @ 2007-09-04 17:19 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Git Mailing List

On Tue, 4 Sep 2007, Jon Smirl wrote:

> Let's back up a little bit from "Caclulating tree node".  What are the
> elements of git's data structures?
>
> Right now we have an index structure (tree nodes) integrated in to a
> base table. Integrating indexing into the data is not normally done in
> a database. Doing a normalization analysis like this may expose flaws
> in the way the data is structured. Of course we may also decide to
> leave everything the way it is.
>
> What about the special status of a rename? In the current model we
> effectively have three tables.
>
> commit - a set of all SHAs in the commit, previous commit, comment, author, etc
> blob - a file, permissions, etc.
> file names - name, SHA
>
> The file name table is encoded as an index and it has been
> intermingled with the commit table.
>
> Looking at this from a set theory angle brings up the question, do we
> really have three tables and file names are an independent variable
> from the blobs, or should file names be an attribute of the blob?

There isn't a one-to-one mapping of file names to blobs.  The blob only
describes the contents of the file.  In the extreme case you could have
a single blob shared by every file in your tree.  For example:

# git ls-tree -r HEAD
100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    bar/foo
100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo
100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo2
100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo3
100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo4
100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo5
100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo6

>
> How this gets structured in the db is an independent question about
> how renames get detected on a commit. The current scheme for detecting
> renames by comparing diffs is working fine. The question is, once we
> detect a rename how should it be stored?
>
> Ignoring the performance impacts and looking at the problem from the
> set theory view point, should:
> the pathnames be in their own table with a row for each alias
> the pathnames be stored as an attribute of the blob
>
> Both of these are the same information, we're just looking at how
> things are normalized.
>
>

-- 
Julian

  ---
"You shouldn't make my toaster angry."
-- Household security explained in "Johnny Quest"


* Re: Git's database structure
  2007-09-04 15:55 ` Andreas Ericsson
  2007-09-04 16:07   ` Mike Hommey
  2007-09-04 16:19   ` Jon Smirl
@ 2007-09-04 17:21   ` Junio C Hamano
  2 siblings, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2007-09-04 17:21 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Jon Smirl, Git Mailing List

Andreas Ericsson <ae@op5.se> writes:

> Each root tree can only ever belong to a single commit, unless you
> intentionally force git to make completely empty commits. git
> won't complain about this, so long as you don't make two in the
> same second, because it relies more heavily on the DAG than on
> developer sanity.

This actually can happen without even using the 'ours' strategy.

It happens if two people independently apply the same patch on their
branches and their results are later merged.  And "the same
second" requirement is not even there and not interesting.
There are other things like developer identity, log message, and
their ancestry that would make the resulting commit object
distinct.

> Each root tree can point to multiple sub-trees. The sub-trees can be
> linked to any number of root-trees.
>
> Blobs can be linked to any number of tree objects, or even multiple
> times to the same tree object. This wouldn't be possible if the
> blob objects had their own pathnames stored inside them, so to speak.

More importantly, in git, filenames and modes are not considered
part of "contents", which git tracks.  Although it is an
entirely possible and valid alternate design to move that as
part of "blob" to build a system that is different from git,
which Jon seems to be aiming at, the benefit of such a design is
unclear to me, both from a theoretical point of view (now blobs
are not about pure contents anymore) and from a performance point
of view (Linus tried a flat tree object in an early stage of git,
and it was not nice), as other people explained.
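
A quick Python sketch of the difference, under stated assumptions:
content_id follows the normal blob hashing, while contextual_id is a
made-up stand-in for the alternate design in which name and mode are
hashed together with the contents. The payload layout in
contextual_id is invented for this example only.

```python
import hashlib

def content_id(data: bytes) -> str:
    """Blob id as git computes it: names and modes are not hashed."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()

def contextual_id(path: str, mode: str, data: bytes) -> str:
    """Hypothetical alternate design: path and mode are part of the blob."""
    payload = f"{mode} {path}\0".encode() + data
    return hashlib.sha1(b"blob %d\0" % len(payload) + payload).hexdigest()

# Two files with identical content and mode but different names.
files = [("a", "100644", b"foo\n"), ("b", "100644", b"foo\n")]

pure = {content_id(data) for _, _, data in files}
mixed = {contextual_id(path, mode, data) for path, mode, data in files}

print(len(pure))   # 1: identical content is stored once
print(len(mixed))  # 2: the name leaks into the id, so sharing is lost
```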


* Re: Git's database structure
  2007-09-04 16:28 ` Jon Smirl
  2007-09-04 16:31   ` Andreas Ericsson
@ 2007-09-04 17:25   ` Junio C Hamano
  2007-09-04 17:44     ` Jon Smirl
  1 sibling, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2007-09-04 17:25 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Git Mailing List

"Jon Smirl" <jonsmirl@gmail.com> writes:

> Another way of looking at the problem,
>
> Let's build a full-text index for git. You put a string into the index
> and it returns the SHAs of all the file nodes that contain the string.
> How do I recover the path names of these SHAs?

That question does not make much sense without specifying "which
commit's path you are talking about".

If you want to encode such "contextual information" in addition
to "contents", you could do so, but you essentially need to
record commit + pathname + mode bits + contents as "blob" and
hash that to come up with a name.
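
To illustrate why the commit has to be part of the question, here is
a minimal Python sketch of a reverse lookup from blob to path. The
history, commit labels and blob ids are invented; paths_for_blob is a
hypothetical helper, not a git command.

```python
from collections import defaultdict

# Hypothetical history: commit label -> {path: blob id}.
history = {
    "commit-1": {"README": "blob-A", "src/main.c": "blob-B"},
    "commit-2": {"README.txt": "blob-A", "src/main.c": "blob-C"},
}

def paths_for_blob(history):
    """Map each blob id to the set of (commit, path) pairs that reference it."""
    index = defaultdict(set)
    for commit, paths in history.items():
        for path, blob in paths.items():
            index[blob].add((commit, path))
    return index

index = paths_for_blob(history)
print(sorted(index["blob-A"]))
# [('commit-1', 'README'), ('commit-2', 'README.txt')]
```

The same blob answers to a different name in each commit, so a
full-text hit on a blob id only turns into a path once a commit is
chosen.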


* Re: Git's database structure
  2007-09-04 17:19 ` Julian Phillips
@ 2007-09-04 17:30   ` Jon Smirl
  2007-09-04 18:51     ` Andreas Ericsson
  0 siblings, 1 reply; 39+ messages in thread
From: Jon Smirl @ 2007-09-04 17:30 UTC (permalink / raw)
  To: Julian Phillips; +Cc: Git Mailing List

On 9/4/07, Julian Phillips <julian@quantumfyre.co.uk> wrote:
> On Tue, 4 Sep 2007, Jon Smirl wrote:
>
> > Let's back up a little bit from "Caclulating tree node".  What are the
> > elements of git's data structures?
> >
> > Right now we have an index structure (tree nodes) integrated in to a
> > base table. Integrating indexing into the data is not normally done in
> > a database. Doing a normalization analysis like this may expose flaws
> > in the way the data is structured. Of course we may also decide to
> > leave everything the way it is.
> >
> > What about the special status of a rename? In the current model we
> > effectively have three tables.
> >
> > commit - a set of all SHAs in the commit, previous commit, comment, author, etc
> > blob - a file, permissions, etc.
> > file names - name, SHA
> >
> > The file name table is encoded as an index and it has been
> > intermingled with the commit table.
> >
> > Looking at this from a set theory angle brings up the question, do we
> > really have three tables and file names are an independent variable
> > from the blobs, or should file names be an attribute of the blob?
>
> There isn't a one-to-one mapping of file names to blobs.  The blob only
> describes the contents of the file.  In the extreme case you could have
> one blob for every single file in your tree.  For example:
>
> # git ls-tree -r HEAD
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    bar/foo
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo2
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo3
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo4
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo5
> 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c    foo6

Both schemes support aliasing. In the flat scheme you would create a
second blob which contains the file and the aliased path name. When
the blob gets delta'd the second copy of the file will disappear.

I'm not proposing a change to the data being stored in git; it is a
proposal to consider the impacts of how this data has been normalized
in the data store.
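
To make the two normalizations concrete, here is a minimal Python
sketch. git_blob_id follows how git names a blob today: SHA-1 over a
"blob <size>\0" header plus the contents, with no path involved.
flat_blob_id is purely hypothetical, my guess at what a "flat scheme"
name could look like if the path were hashed along with the contents;
the working_tree dictionary is made-up data.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 name git gives a blob: header + contents, no path."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def flat_blob_id(path: str, content: bytes) -> str:
    """Hypothetical 'flat scheme' name: the path is hashed with the contents."""
    return git_blob_id(path.encode("utf-8") + b"\0" + content)


# Made-up working tree: two paths share identical bytes.
working_tree = {
    "foo": b"hello\n",
    "bar/foo": b"hello\n",
    "README": b"something else\n",
}

current_ids = {p: git_blob_id(c) for p, c in working_tree.items()}
flat_ids = {p: flat_blob_id(p, c) for p, c in working_tree.items()}

# Current scheme: one blob serves both aliases.
print(current_ids["foo"] == current_ids["bar/foo"])
# Hypothetical flat scheme: each alias becomes its own object.
print(flat_ids["foo"] == flat_ids["bar/foo"])
```

In the current scheme the two aliases collapse to one object name; in
the sketched flat scheme they do not, and any space saving would have
to come from delta compression instead.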

> > How this gets structured in the db is an independent question about
> > how renames get detected on a commit. The current scheme for detecting
> > renames by comparing diffs is working fine. The question is, once we
> > detect a rename how should it be stored?
> >
> > Ignoring the performance impacts and looking at the problem from the
> > set theory view point, should:
> > the pathnames be in their own table with a row for each alias
> > the pathnames be stored as an attribute of the blob
> >
> > Both of these are the same information, we're just looking at how
> > things are normalized.
> >
> >
>
> --
> Julian
>
>   ---
> "You shouldn't make my toaster angry."
> -- Household security explained in "Johnny Quest"
>


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Git's database structure
  2007-09-04 17:25   ` Junio C Hamano
@ 2007-09-04 17:44     ` Jon Smirl
  2007-09-04 18:04       ` Mike Hommey
                         ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Jon Smirl @ 2007-09-04 17:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

On 9/4/07, Junio C Hamano <gitster@pobox.com> wrote:
> "Jon Smirl" <jonsmirl@gmail.com> writes:
>
> > Another way of looking at the problem,
> >
> > Let's build a full-text index for git. You put a string into the index
> > and it returns the SHAs of all the file nodes that contain the string.
> > How do I recover the path names of these SHAs?
>
> That question does not make much sense without specifying "which
> commit's path you are talking about".
>
> If you want to encode such "contextual information" in addition
> to "contents", you could do so, but you essentially need to
> record commit + pathname + mode bits + contents as "blob" and
> hash that to come up with a name.

I left the details out of the full-text example to make it more
obvious that we can't recover the path names.

Doing this type of analysis may point out that even more fields are
missing from the blob table such as commit id.

The current data store design is not very flexible. Databases solved
the flexibility problem long ago. I'm just wondering if we should
steal some good ideas out of the database world and apply them to git.
Ten years from now we may have 100GB git databases and really wish we
had more flexible ways of querying them.

The reason databases don't encode the fields into the index is that
you can only have a single index on the table if you do that.
Databases do sometimes duplicate the field in both the index and the
table. Databases also have the property that indexes are just a cache
and can be dropped at any time.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Git's database structure
  2007-09-04 17:44     ` Jon Smirl
@ 2007-09-04 18:04       ` Mike Hommey
  2007-09-04 19:44         ` Reece Dunn
  2007-09-04 18:06       ` Junio C Hamano
  2007-09-04 21:25       ` Theodore Tso
  2 siblings, 1 reply; 39+ messages in thread
From: Mike Hommey @ 2007-09-04 18:04 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Junio C Hamano, Git Mailing List

On Tue, Sep 04, 2007 at 01:44:47PM -0400, Jon Smirl <jonsmirl@gmail.com> wrote:
> On 9/4/07, Junio C Hamano <gitster@pobox.com> wrote:
> > "Jon Smirl" <jonsmirl@gmail.com> writes:
> >
> > > Another way of looking at the problem,
> > >
> > > Let's build a full-text index for git. You put a string into the index
> > > and it returns the SHAs of all the file nodes that contain the string.
> > > How do I recover the path names of these SHAs?
> >
> > That question does not make much sense without specifying "which
> > commit's path you are talking about".
> >
> > If you want to encode such "contextual information" in addition
> > to "contents", you could do so, but you essentially need to
> > record commit + pathname + mode bits + contents as "blob" and
> > hash that to come up with a name.
> 
> I left the details out of the full-text example to make it more
> obvious that we can't recover the path names.
> 
> Doing this type of analysis may point out that even more fields are
> missing from the blob table such as commit id.
> 
> The current data store design is not very flexible. Databases solved
> the flexibility problem long ago. I'm just wondering if we should
> steal some good ideas out of the database world and apply them to git.
> Ten years from now we may have 100GB git databases and really wish we
> had more flexible ways of querying them.
> 
> The reason databases don't encode the fields into the index is that
> you can only have a single index on the table if you do that.
> Databases do sometimes duplicate the field in both the index and the
> table. Databases also have the property that indexes are just a cache
> and can be dropped at any time.

The big difference between a database and git is that a database is a
general purpose tool. git has a much more restricted scope. As such, it
doesn't need *that much* flexibility.

Mike


* Re: Git's database structure
  2007-09-04 17:44     ` Jon Smirl
  2007-09-04 18:04       ` Mike Hommey
@ 2007-09-04 18:06       ` Junio C Hamano
  2007-09-04 21:25       ` Theodore Tso
  2 siblings, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2007-09-04 18:06 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Git Mailing List

"Jon Smirl" <jonsmirl@gmail.com> writes:

> On 9/4/07, Junio C Hamano <gitster@pobox.com> wrote:
>> "Jon Smirl" <jonsmirl@gmail.com> writes:
>>
>> > Another way of looking at the problem,
>> >
>> > Let's build a full-text index for git. You put a string into the index
>> > and it returns the SHAs of all the file nodes that contain the string.
>> > How do I recover the path names of these SHAs?
>>
>> That question does not make much sense without specifying "which
>> commit's path you are talking about".
>>
>> If you want to encode such "contextual information" in addition
>> to "contents", you could do so, but you essentially need to
>> record commit + pathname + mode bits + contents as "blob" and
>> hash that to come up with a name.
>
> I left the details out of the full-text example to make it more
> obvious that we can't recover the path names.
>
> Doing this type of analysis may point out that even more fields are
> missing from the blob table such as commit id.

Quite the contrary.  You just illustrated why it is wrong to put
anything but contents in the blob.

The specialized indexing is a different issue.  If you want to
have a full text index to answer "what paths in which commits
had this string?", then your database table would have columns
such as commit (sha-1), path (string) as values, indexed with
the search string.

Now the current set of "git" operations does not need to answer
that query, so we neither build nor maintain an index that
nobody uses.  But your application may benefit from such an
index, and as others said, nobody prevents you from building
one.
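
A minimal sketch of such a side index, in Python. It assumes the
application has already walked the commits and trees; the history
dictionary below is made-up data standing in for that walk, and
build_text_index is a name invented for this example, not anything git
provides.

```python
import re
from collections import defaultdict

# Made-up history: commit id -> {path: file contents}.
history = {
    "c1": {"README": "git stores content", "src/main.c": "int main(void)"},
    "c2": {"README": "git stores content", "src/main.c": "int main(int argc)"},
}


def build_text_index(commits):
    """Map each word to the set of (commit, path) pairs that contain it."""
    index = defaultdict(set)
    for commit_id, files in commits.items():
        for path, text in files.items():
            for word in set(re.findall(r"\w+", text.lower())):
                index[word].add((commit_id, path))
    return index


text_index = build_text_index(history)
print(sorted(text_index["argc"]))  # [('c2', 'src/main.c')]
print(sorted(text_index["git"]))   # [('c1', 'README'), ('c2', 'README')]
```

The values are (commit, path) pairs rather than blob names, so the
"which commit's path?" question is answered by the index itself and
the blobs stay contents-only.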


* Re: Git's database structure
  2007-09-04 17:30   ` Jon Smirl
@ 2007-09-04 18:51     ` Andreas Ericsson
  0 siblings, 0 replies; 39+ messages in thread
From: Andreas Ericsson @ 2007-09-04 18:51 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Julian Phillips, Git Mailing List

Jon Smirl wrote:
> 
> I'm not proposing a change to the data being stored in git; it is a
> proposal to consider the impacts of how this data has been normalized
> in the data store.
> 

But to what end?

We all *know* the impacts:
* Excellent performance at what it does now.
* Currently zero capability to replace Google as the #1 search engine.

Since replacing Google's db was never, and never will be, the goal of
git, what is it you wish to achieve? Seriously, I'm dying to know, so
please tell me. If you have already and I'm too daft to understand it,
humor me and reiterate :-)

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: Git's database structure
  2007-09-04 18:04       ` Mike Hommey
@ 2007-09-04 19:44         ` Reece Dunn
  0 siblings, 0 replies; 39+ messages in thread
From: Reece Dunn @ 2007-09-04 19:44 UTC (permalink / raw)
  To: Mike Hommey, Jon Smirl, Junio C Hamano, Git Mailing List

On 04/09/07, Mike Hommey <mh@glandium.org> wrote:
> On Tue, Sep 04, 2007 at 01:44:47PM -0400, Jon Smirl <jonsmirl@gmail.com> wrote:
> > The reason databases don't encode the fields into the index is that
> > you can only have a single index on the table if you do that.
> > Databases do sometimes duplicate the field in both the index and the
> > table. Databases also have the property that indexes are just a cache
> > and can be dropped at any time.
>
> The big difference between a database and git is that a database is a
> general purpose tool. git has a much more restricted scope. As such, it
> doesn't need *that much* flexibility.

Databases are designed to be efficient at storing and accessing large
amounts of data. The key thing about a database is that it does not
track the *history* of the data it is storing. This is the main
problem with using a database as a metadata storage facility.

Modern source control systems such as Perforce (and possibly
Subversion), use a database to track metadata such as branch/merge
history, user data and so on. This, IMHO, is a huge weakness of these
SCM systems. It is impossible to fully roll back to a given point in
time, because that metadata is stored independently of the file
content tracking.

Git *is not a database*. This is fundamental to understanding how git
works. Git stores *all* of its data in a Directed Acyclic Graph (with
the exception of the pointers to tags and the current head of each
branch, which it stores locally in the .git directory). Read
http://eagain.net/articles/git-for-computer-scientists/ for more
information on this.

What this means is that for any commit, git has all the information it
needs about the repository at that point in time. It doesn't need
anything else. If you then store information in a database, you lose
having the complete picture at any point in the history of the
repository.
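
As a rough illustration, here is a toy content-addressed store in
Python. It is only a sketch: the JSON encoding, the put and snapshot
helpers, and the sample objects are all invented for this example and
are not git's real object format. What it shows is that a commit id is
enough to rebuild the whole tree, using nothing but objects reachable
from that commit.

```python
import hashlib
import json

store = {}  # object id -> (type, payload); a stand-in for an object database


def put(obj_type, payload):
    """Store an object under the SHA-1 of its type and serialized payload."""
    data = json.dumps([obj_type, payload], sort_keys=True).encode()
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = (obj_type, payload)
    return oid


def snapshot(commit_id):
    """Rebuild {path: content} from the objects reachable from one commit."""
    _, commit = store[commit_id]
    files = {}

    def walk(tree_id, prefix):
        _, entries = store[tree_id]
        for name, (kind, oid) in sorted(entries.items()):
            if kind == "tree":
                walk(oid, prefix + name + "/")
            else:
                files[prefix + name] = store[oid][1]

    walk(commit["tree"], "")
    return files


# Two commits; the second changes only the top-level "foo".
blob = put("blob", "hello\n")
sub = put("tree", {"foo": ("blob", blob)})
root = put("tree", {"foo": ("blob", blob), "bar": ("tree", sub)})
first = put("commit", {"tree": root, "parents": [], "message": "initial"})

blob2 = put("blob", "changed\n")
root2 = put("tree", {"foo": ("blob", blob2), "bar": ("tree", sub)})
second = put("commit", {"tree": root2, "parents": [first], "message": "edit foo"})

print(snapshot(first))
print(snapshot(second))
```

Both snapshots share the unchanged "bar" subtree, and neither needs
any state outside the graph to be reconstructed.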

- Reece


* Re: Git's database structure
  2007-09-04 16:19   ` Jon Smirl
  2007-09-04 16:29     ` Andreas Ericsson
  2007-09-04 17:09     ` Jeff King
@ 2007-09-04 20:17     ` David Tweed
  2 siblings, 0 replies; 39+ messages in thread
From: David Tweed @ 2007-09-04 20:17 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Andreas Ericsson, Git Mailing List

On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote:
> without building another mapping table or a brute force search. I keep
> using Google as an example, Google is indexing hierarchical URLs but
> they do not use a hierarchical index to do it.

It might help the discussion if you could point to a reference,
preferably one that discusses the trade-offs in the design, with more
concrete details about what Google or other search engines actually
do. It would be particularly useful if it addressed issues of

1. the type of queries the representation is optimised for.
2. consistency requirements. (Can a search engine use different data
structures if they improve average performance at the cost of
occasional inconsistency/lossage?)

Finally, this design space is not totally unexplored, for example,

http://plan9.bell-labs.com/sys/doc/venti/venti.html

AFAICS they only use SHA-1 for blocks within files (although this
might be misreading the paper) so presumably they'd have knowledge
about the trade-offs.

-- 
cheers, dave tweed__________________________
david.tweed@gmail.com
Rm 124, School of Systems Engineering, University of Reading.
"we had no idea that when we added templates we were adding a Turing-
complete compile-time language." -- C++ standardisation committee


* Re: Git's database structure
  2007-09-04 17:44     ` Jon Smirl
  2007-09-04 18:04       ` Mike Hommey
  2007-09-04 18:06       ` Junio C Hamano
@ 2007-09-04 21:25       ` Theodore Tso
  2007-09-04 21:54         ` Jon Smirl
  2 siblings, 1 reply; 39+ messages in thread
From: Theodore Tso @ 2007-09-04 21:25 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Junio C Hamano, Git Mailing List

On Tue, Sep 04, 2007 at 01:44:47PM -0400, Jon Smirl wrote:
> The current data store design is not very flexible. Databases solved
> the flexibility problem long ago. I'm just wondering if we should
> steal some good ideas out of the database world and apply them to git.
> Ten years from now we may have 100GB git databases and really wish we
> had more flexible ways of querying them.

Databases solved the flexibility problem, at the cost of performance.
And if you use full normalized form in your database scheme, it costs
you even more in performance, because of all of the joins that you
need in order to get the information you need to do, you know, useful
work as opposed to database wanking.
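
To picture the joins in question, here is a small sketch using
Python's sqlite3 module. The schema (blobs, paths, commits, entries)
and the checkout helper are hypothetical, invented for this example;
it is not how any real SCM lays out its tables, and it measures
nothing. It only shows that reading one commit's files back out of a
normalized layout already takes two joins.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE blobs   (blob_id TEXT PRIMARY KEY, content TEXT);
    CREATE TABLE paths   (path_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE commits (commit_id TEXT PRIMARY KEY, message TEXT);
    CREATE TABLE entries (commit_id TEXT REFERENCES commits,
                          path_id   INTEGER REFERENCES paths,
                          blob_id   TEXT REFERENCES blobs);
""")
db.executemany("INSERT INTO blobs VALUES (?, ?)",
               [("b1", "hello\n"), ("b2", "changed\n")])
db.executemany("INSERT INTO paths VALUES (?, ?)",
               [(1, "foo"), (2, "bar/foo")])
db.executemany("INSERT INTO commits VALUES (?, ?)",
               [("c1", "initial"), ("c2", "edit foo")])
db.executemany("INSERT INTO entries VALUES (?, ?, ?)",
               [("c1", 1, "b1"), ("c1", 2, "b1"),
                ("c2", 1, "b2"), ("c2", 2, "b1")])


def checkout(commit_id):
    """Return {path: content} for one commit; needs two joins per lookup."""
    rows = db.execute("""
        SELECT p.name, b.content
        FROM entries e
        JOIN paths p ON p.path_id = e.path_id
        JOIN blobs b ON b.blob_id = e.blob_id
        WHERE e.commit_id = ?
        ORDER BY p.name
    """, (commit_id,))
    return dict(rows.fetchall())


print(checkout("c2"))
```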

If you take a look at the really big databases with super high
performance requirements, say like those used to manage airline
tickets/reservations/fares, you will find that they are not normalized,
and they are not relational; they can't afford to be.  And if you take
a look at some of git's competition that use relational databases to
store their SCM data, and take a look at how loooooong they take
to do even basic operations, I would say that the onus is on you to
prove that normalization is actually a win in terms of real (not
theoretical) advantages, and that it doesn't cause performance to go
into the toilet.

I think the fundamental disconnect here is that no one is buying your
claim that just because the data design is "more flexible" that this
is automatically a good thing in and of itself, and we should even for
a moment, "put performance aside".  

I also don't think that attempting to force git's data structures into
database terms makes sense; it is much closer to a filesystem using
an object based store --- and very few people except for folks like
Hans Reiser believe that filesystems and databases should be
unified....

						- Ted


* Re: Git's database structure
  2007-09-04 21:25       ` Theodore Tso
@ 2007-09-04 21:54         ` Jon Smirl
  2007-09-05  7:18           ` Andreas Ericsson
  0 siblings, 1 reply; 39+ messages in thread
From: Jon Smirl @ 2007-09-04 21:54 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Junio C Hamano, Git Mailing List

On 9/4/07, Theodore Tso <tytso@mit.edu> wrote:
> On Tue, Sep 04, 2007 at 01:44:47PM -0400, Jon Smirl wrote:
> > The current data store design is not very flexible. Databases solved
> > the flexibility problem long ago. I'm just wondering if we should
> > steal some good ideas out of the database world and apply them to git.
> > Ten years from now we may have 100GB git databases and really wish we
> > had more flexible ways of querying them.
>
> Databases solved the flexibility problem, at the cost of performance.
> And if you use full normalized form in your database scheme, it costs
> you even more in performance, because of all of the joins that you
> need in order to get the information you need to do, you know, useful
> work as opposed to database wanking.
>
> If you take a look at the really big databases with super high
> performance requirements, say like those used to manage airline
> tickets/reservations/fares, you will find that they are not normalized,
> and they are not relational; they can't afford to be.  And if you take
> a look at some of git's competition that use relational databases to
> store their SCM data, and take a look at how loooooong they take
> to do even basic operations, I would say that the onus is on you to
> prove that normalization is actually a win in terms of real (not
> theoretical) advantages, and that it doesn't cause performance to go
> into the toilet.
>
> I think the fundamental disconnect here is that no one is buying your
> claim that just because the data design is "more flexible" that this
> is automatically a good thing in and of itself, and we should even for
> a moment, "put performance aside".

It is very easy to get bogged down in performance arguments on
database design when the correct answer is that there are always lots
of different ways to achieve the same goal. I wanted to defer debating
performance until we closely looked at the relationships between the
data at an abstract level.

Since git hasn't stored all of the fields in the object table (the
path is encoded in the index) we are never going to be able to build
an alternative way of indexing the object table. Not being able to
build alternative indexes is likely to cause problems when the
database starts getting really big. Without an index every query that
can't use the path name index is reduced to doing full table scans.

A few things could benefit from alternative indexing: blame,
full-text search, automating the Maintainers file, etc.

I'm just asking if we really want to make full table scans the only
possible way to implement these types of queries. If the answer is no,
then let's first explore how to fix things at an abstract level before
diving into the performance arguments.

An obvious parallel from the file system world is the locate database
and how it is forced to continuously rescan the file system and store
full path names.


>
> I also don't think that attempting to force git's data structures into
> database terms makes sense; it is much closer to a filesystem using
> an object based store --- and very few people except for folks like
> Hans Reiser believe that filesystems and databases should be
> unified....
>
>                                                 - Ted
>


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Git's database structure
  2007-09-04 21:54         ` Jon Smirl
@ 2007-09-05  7:18           ` Andreas Ericsson
  2007-09-05 13:41             ` Jon Smirl
  0 siblings, 1 reply; 39+ messages in thread
From: Andreas Ericsson @ 2007-09-05  7:18 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Theodore Tso, Junio C Hamano, Git Mailing List

Jon Smirl wrote:
> On 9/4/07, Theodore Tso <tytso@mit.edu> wrote:
>> On Tue, Sep 04, 2007 at 01:44:47PM -0400, Jon Smirl wrote:
>>> The current data store design is not very flexible. Databases solved
>>> the flexibility problem long ago. I'm just wondering if we should
>>> steal some good ideas out of the database world and apply them to git.
>>> Ten years from now we may have 100GB git databases and really wish we
>>> had more flexible ways of querying them.
>> Databases solved the flexibility problem, at the cost of performance.
>> And if you use full normalized form in your database scheme, it costs
>> you even more in performance, because of all of the joins that you
>> need in order to get the information you need to do, you know, useful
>> work as opposed to database wanking.
>>
>> If you take a look at the really big databases with super high
>> performance requirements, say like those used to manage airline
>> tickets/reservations/fares, you will find that they are not normalized,
>> and they are not relational; they can't afford to be.  And if you take
>> a look at some of git's competition that use relational databases to
>> store their SCM data, and take a look at how loooooong they take
>> to do even basic operations, I would say that the onus is on you to
>> prove that normalization is actually a win in terms of real (not
>> theoretical) advantages, and that it doesn't cause performance to go
>> into the toilet.
>>
>> I think the fundamental disconnect here is that no one is buying your
>> claim that just because the data design is "more flexible" that this
>> is automatically a good thing in and of itself, and we should even for
>> a moment, "put performance aside".
> 
> It is very easy to get bogged down in performance arguments on
> database design when the correct answer is that there are always lots
> of different ways to achieve the same goal. I wanted to defer debating
> performance until we closely looked at the relationships between the
> data at an abstract level.
> 

But you cannot. Git is performance-critical, for the same reason every
other performance-critical application is: It's a tool to save human
time. Linux development *could* be done using patchfiles by the bundle
and masses of tarballs. It's just not the fastest way to do it, so enter
git, and lots of problems just go away. It's not the only way of doing
it, but it saves time. If you were to add 2 seconds to each commit,
that's several months of developer time that is lost every day!


> Since git hasn't stored all of the fields in the object table (the
> path is encoded in the index) we are never going to be able to build
> an alternative way of indexing the object table.

We can still build alternative indexes. They just have to be separate
from the DAG and the current indexing scheme. Junio has pointed out
ways of doing this already.
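
A minimal sketch of what "separate from the DAG" could look like, in
Python. The history dictionary is made-up data standing in for a walk
over commits and trees, and build_blob_index is a name invented for
this example. The index maps a blob id back to every (commit, path)
where it appears, and it is purely derived data.

```python
from collections import defaultdict

# Made-up history: commit id -> {path: blob id}, as a tree walk would yield.
history = {
    "c1": {"foo": "b1", "bar/foo": "b1"},
    "c2": {"foo": "b2", "bar/foo": "b1"},
}


def build_blob_index(commits):
    """Map each blob id to the (commit, path) pairs that reference it."""
    index = defaultdict(list)
    for commit_id in sorted(commits):
        for path, blob_id in sorted(commits[commit_id].items()):
            index[blob_id].append((commit_id, path))
    return dict(index)


blob_index = build_blob_index(history)
print(blob_index["b2"])  # [('c2', 'foo')]

# The index is only a cache: dropping it loses nothing,
# because it can be rebuilt from the history at any time.
del blob_index
blob_index = build_blob_index(history)
print(blob_index["b1"])
```

Nothing in the object store has to change for this to work; the cost
is one full walk to build the index and some bookkeeping to keep it
current.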

> Not being able to
> build alternative indexes is likely to cause problems when the
> database starts getting really big. Without an index every query that
> can't use the path name index is reduced to doing full table scans.
> 

I've said it before: the most common delimiter used today is paths. It's
a behaviour git was designed to handle well, because it *is* the most
common way of limiting and separating content. It's not some random
fluke that has made git perform very well on actions that are commonly
performed in large scale software projects; Linus designed it that way
from the start, and kudos to him for a job well done.

> A few things could benefit from alternative indexing: blame,
> full-text search, automating the Maintainers file, etc.
> 

Yes, but getting rid of the tree objects and storing pathnames in
blob objects would penalize log-viewing, diffs and merges, which
are far more common operations than full-text searches in a software
project.

> I'm just asking if we really want to make full table scans the only
> possible way to implement these types of queries. If the answer is no,
> then let's first explore how to fix things at an abstract level before
> diving into the performance arguments.
> 

Personally, I really don't care. But you should really have read Junio's
mail a bit more carefully. He explained about 'notes' that can be attached
to commits and contain arbitrary data. By all means, create your indexes
there and use them for whatever you like, but leave the foundation on which
git was built *alone*. The design hasn't changed since April 2006 (subtrees
were introduced April 26, I think), because it's a *good* design.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: Git's database structure
  2007-09-05  7:18           ` Andreas Ericsson
@ 2007-09-05 13:41             ` Jon Smirl
  2007-09-05 14:51               ` Andreas Ericsson
  2007-09-05 19:52               ` Andy Parkins
  0 siblings, 2 replies; 39+ messages in thread
From: Jon Smirl @ 2007-09-05 13:41 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Theodore Tso, Junio C Hamano, Git Mailing List

On 9/5/07, Andreas Ericsson <ae@op5.se> wrote:
> Jon Smirl wrote:
> > On 9/4/07, Theodore Tso <tytso@mit.edu> wrote:
> >> On Tue, Sep 04, 2007 at 01:44:47PM -0400, Jon Smirl wrote:
> >>> The current data store design is not very flexible. Databases solved
> >>> the flexibility problem long ago. I'm just wondering if we should
> >>> steal some good ideas out of the database world and apply them to git.
> >>> Ten years from now we may have 100GB git databases and really wish we
> >>> had more flexible ways of querying them.
> >> Databases solved the flexibility problem, at the cost of performance.
> >> And if you use full normalized form in your database scheme, it costs
> >> you even more in performance, because of all of the joins that you
> >> need in order to get the information you need to do, you know, useful
> >> work as opposed to database wanking.
> >>
> >> If you take a look at the really big databases with super high
> >> performance requirements, say like those used to manage airline
> >> tickets/reservations/fares, you will find that they are not normalized,
> >> and they are not relational; they can't afford to be.  And if you take
> >> a look at some of git's competition that use relational databases to
> >> store their SCM data, and take a look at how loooooong they take
> >> to do even basic operations, I would say that the onus is on you to
> >> prove that normalization is actually a win in terms of real (not
> >> theoretical) advantages, and that it doesn't cause performance to go
> >> into the toilet.
> >>
> >> I think the fundamental disconnect here is that no one is buying your
> >> claim that just because the data design is "more flexible" that this
> >> is automatically a good thing in and of itself, and we should even for
> >> a moment, "put performance aside".
> >
> > It is very easy to get bogged down in performance arguments on
> > database design when the correct answer is that there are always lots
> > of different ways to achieve the same goal. I wanted to defer debating
> > performance until we closely looked at the relationships between the
> > data at an abstract level.
> >
>
> But you cannot. Git is performance-critical, for the same reason every
> other performance-critical application is: It's a tool to save human
> time. Linux development *could* be done using patchfiles by the bundle
> and masses of tarballs. It's just not the fastest way to do it, so enter
> git, and lots of problems just go away. It's not the only way of doing
> it, but it saves time. If you were to add 2 seconds to each commit,
> that's several months of developer time that is lost every day!
>
>
> > Since git hasn't stored all of the fields in the object table (the
> > path is encoded in the index) we are never going to be able to build
> > an alternative way of indexing the object table.
>
> We can still build alternative indexes. They just have to be separate
> from the DAG and the current indexing scheme. Junio has pointed out
> ways of doing this already.
>
> > Not being able to
> > build alternative indexes is likely to cause problems when the
> > database starts getting really big. Without an index every query that
> > can't use the path name index is reduced to doing full table scans.
> >
>
> I've said it before: the most common delimiter used today is paths. It's
> a behaviour git was designed to handle well, because it *is* the most
> common way of limiting and separating content. It's not some random
> fluke that has made git perform very well on actions that are commonly
> performed in large scale software projects; Linus designed it that way
> from the start, and kudos to him for a job well done.


This is why I wanted to separate the abstract data structure design
discussion from the performance one. In the flat design indexes are
like caches and can be created and destroyed. There will definitely be
an index created on the paths. This index will work like the
current tree nodes. The difference is that this index is a cache,
unlike the current tree nodes, which are an immutable part of the
database.

The path name field needs to be moved back into the blobs to support
alternative indexes. For example I want an index on the Signed-off-by
field. I use this index to give me the SHAs for the blobs
Signed-off-by a particular person. In the current design I have no way
of recovering the path name for these blobs other than a brute force
search following every path looking for the right SHA.
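
Here is roughly the index I have in mind, as a Python sketch. The
commits list is made-up data (a linear history, oldest first), the
build_signer_index helper is invented for this example, and "blobs
signed off by a person" is taken to mean the blobs a commit introduced
or changed relative to the previous commit. Note that the sketch has
to carry the path alongside each blob id, since the blob alone cannot
supply it.

```python
import re
from collections import defaultdict

# Made-up linear history, oldest first: (commit id, message, {path: blob id}).
commits = [
    ("c1", "initial\n\nSigned-off-by: Alice <alice@example.com>\n",
     {"foo": "b1", "bar/foo": "b1"}),
    ("c2", "edit foo\n\nSigned-off-by: Bob <bob@example.com>\n",
     {"foo": "b2", "bar/foo": "b1"}),
]

SIGNOFF = re.compile(r"^Signed-off-by:\s*(.+)$", re.MULTILINE)


def build_signer_index(history):
    """Map each signer to the (commit, path, blob) triples they signed off."""
    index = defaultdict(set)
    previous = {}
    for commit_id, message, tree in history:
        # Entries that are new or changed compared with the previous commit.
        changed = {(path, blob) for path, blob in tree.items()
                   if previous.get(path) != blob}
        for signer in SIGNOFF.findall(message):
            for path, blob in changed:
                index[signer.strip()].add((commit_id, path, blob))
        previous = tree
    return index


signer_index = build_signer_index(commits)
print(sorted(signer_index["Bob <bob@example.com>"]))
```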



>
> > A few things could benefit from alternative indexing: blame,
> > full-text search, automating the Maintainers file, etc.
> >
>
> Yes, but getting rid of the tree objects and storing pathnames in
> blob objects would penalize log-viewing, diffs and merges, which
> are far more common operations than full-text searches in a software
> project.
>
> > I'm just asking if we really want to make full table scans the only
> > possible way to implement these types of queries. If the answer is no,
> > then let's first explore how to fix things at an abstract level before
> > diving into the performance arguments.
> >
>
> Personally, I really don't care. But you should really have read Junio's
> mail a bit more carefully. He explained about 'notes' that can be attached
> to commits and contain arbitrary data. By all means, create your indexes
> there and use them for whatever you like, but leave the foundation on which
> git was built *alone*. The design hasn't changed since April 2006 (subtrees
> were introduced April 26, I think), because it's a *good* design.
>
> --
> Andreas Ericsson                   andreas.ericsson@op5.se
> OP5 AB                             www.op5.se
> Tel: +46 8-230225                  Fax: +46 8-230231
>
>


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Git's database structure
  2007-09-05 13:41             ` Jon Smirl
@ 2007-09-05 14:51               ` Andreas Ericsson
  2007-09-05 15:37                 ` Jon Smirl
  2007-09-05 19:52               ` Andy Parkins
  1 sibling, 1 reply; 39+ messages in thread
From: Andreas Ericsson @ 2007-09-05 14:51 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Theodore Tso, Junio C Hamano, Git Mailing List

Jon Smirl wrote:
> 
> The path name field needs to be moved back into the blobs to support
> alternative indexes. For example I want an index on the Signed-off-by
> field. I use this index to give me the SHAs for the blobs
> Signed-off-by a particular person. In the current design I have no way
> of recovering the path name for these blobs other than a brute force
> search following every path looking for the right SHA.
> 

Ah, there we go. A use-case at last :)

So now we have a concrete problem that we can formulate thus:
"How can one create a database listing the relationship between 'signers'
and blobs?"

So the second question: Do you seriously argue that git should take a
huge performance loss on its common operations to accommodate a need that
I suspect very few people have?

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: Git's database structure
  2007-09-05 14:51               ` Andreas Ericsson
@ 2007-09-05 15:37                 ` Jon Smirl
  2007-09-05 15:54                   ` Julian Phillips
  0 siblings, 1 reply; 39+ messages in thread
From: Jon Smirl @ 2007-09-05 15:37 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: Theodore Tso, Junio C Hamano, Git Mailing List

On 9/5/07, Andreas Ericsson <ae@op5.se> wrote:
> Jon Smirl wrote:
> >
> > The path name field needs to be moved back into the blobs to support
> > alternative indexes. For example I want an index on the Signed-off-by
> > field. I use this index to give me the SHAs for the blobs
> > Signed-off-by a particular person. In the current design I have no way
> > of recovering the path name for these blobs other than a brute force
> > search following every path looking for the right SHA.
> >
>
> Ah, there we go. A use-case at last :)
>
> So now we have a concrete problem that we can formulate thus:
> "How can one create a database listing the relationship between 'signers'
> and blobs?"
>
> So the second question: Do you seriously argue that git should take a
> huge performance loss on its common operations to accommodate a need that
> I suspect very few people have?

Why do you keep jumping to a performance loss? Both schemes will have
an index based on paths. The problem is how those indexes are
constructed, not the existence of the index. Moving the paths into the
blobs in no way prevents you from creating an index on that field.

The problem is that the SHAs have been intertwined with the tree
nodes. This blending has made it impossible to create other indexes on
the blobs.

The path index in the flat scheme will probably look just like tree
nodes do today but these new tree nodes won't be intertwined with the
SHAs.


>
> --
> Andreas Ericsson                   andreas.ericsson@op5.se
> OP5 AB                             www.op5.se
> Tel: +46 8-230225                  Fax: +46 8-230231
>


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Git's database structure
  2007-09-05 15:37                 ` Jon Smirl
@ 2007-09-05 15:54                   ` Julian Phillips
  2007-09-05 16:12                     ` Jon Smirl
  0 siblings, 1 reply; 39+ messages in thread
From: Julian Phillips @ 2007-09-05 15:54 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Andreas Ericsson, Theodore Tso, Junio C Hamano, Git Mailing List

On Wed, 5 Sep 2007, Jon Smirl wrote:

> On 9/5/07, Andreas Ericsson <ae@op5.se> wrote:
>> Jon Smirl wrote:
>>>
>>> The path name field needs to be moved back into the blobs to support
>>> alternative indexes. For example I want an index on the Signed-off-by
>>> field. I use this index to give me the SHAs for the blobs
>>> Signed-off-by a particular person. In the current design I have no way
>>> of recovering the path name for these blobs other than a brute force
>>> search following every path looking for the right SHA.
>>>
>>
>> Ah, there we go. A use-case at last :)

But not a brilliant one.  You sign off on commits not blobs.  So you go
from the sign-off to paths, then to blobs.  There is no need to go from
blob to path unless you deliberately introduce such a need.

>>
>> So now we have a concrete problem that we can formulate thus:
>> "How can one create a database listing the relationship between 'signers'
>> and blobs?"
>>
>> So the second question: Do you seriously argue that git should take a
>> huge performance loss on its common operations to accommodate a need that
>> I suspect very few people have?
>
> Why do you keep jumping to a performance loss? Both schemes will have
> an index based on paths. The problem is how those indexes are
> constructed, not the existence of the index. Moving the paths into the
> blobs in no way prevents you from creating an index on that field.

But moving the path into the blob _IS_ the performance hit.  You lose the
ability to tell the two files have the same content _without even looking 
at the blob_.  This is one of the core parts of making git operations 
blindingly fast.  You can't throw that out, and then say that there is no 
performance hit.
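To make the point above concrete, here is a minimal sketch (not taken from git's source). The first function follows git's blob naming, SHA-1 over a "blob <size>" header, a NUL byte, and the content. The second, `path_in_blob_id`, is a hypothetical stand-in for the proposed scheme and exists only for illustration.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Object name of a git blob: SHA-1 over 'blob <size>', a NUL byte, then the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def path_in_blob_id(path: str, content: bytes) -> str:
    """Hypothetical variant of the proposed scheme: the path is hashed along with the content."""
    payload = path.encode("utf-8") + b"\0" + content
    header = b"blob " + str(len(payload)).encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()

content = b"hello world\n"

# Today: the id depends on content only, so two paths with the same
# content compare equal by id, without reading either blob.
same_today = git_blob_id(content) == git_blob_id(content)

# With the path hashed in, identical content at two paths gets two ids,
# and equality can only be established by reading the content.
same_with_path = path_in_blob_id("a/README", content) == path_in_blob_id("b/README", content)

print(same_today, same_with_path)
```

In the second scheme the comparison on ids alone no longer says anything about whether the contents match.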

You keep talking about abstract database performance - but git is not an 
abstract database.  It has very specific common usage patterns, and is 
optimised to handle them.

>
> The problem is that the SHAs have been intertwined with the tree
> nodes. This blending has made it impossible to create other indexes on
> the blobs.
>
> The path index in the flat scheme will probably look just like tree
> nodes do today but these new tree nodes won't be intertwined with the
> SHAs.

And you will have to prove that diff/merge etc. don't become very much 
slower before you get buy in.

-- 
Julian

  ---
Many receive advice, few profit by it.
 		-- Publilius Syrus


* Re: Git's database structure
  2007-09-05 15:54                   ` Julian Phillips
@ 2007-09-05 16:12                     ` Jon Smirl
  2007-09-05 17:31                       ` Julian Phillips
                                         ` (4 more replies)
  0 siblings, 5 replies; 39+ messages in thread
From: Jon Smirl @ 2007-09-05 16:12 UTC (permalink / raw)
  To: Julian Phillips
  Cc: Andreas Ericsson, Theodore Tso, Junio C Hamano, Git Mailing List

On 9/5/07, Julian Phillips <julian@quantumfyre.co.uk> wrote:
> On Wed, 5 Sep 2007, Jon Smirl wrote:
>
> > On 9/5/07, Andreas Ericsson <ae@op5.se> wrote:
> >> Jon Smirl wrote:
> >>>
> >>> The path name field needs to be moved back into the blobs to support
> >>> alternative indexes. For example I want an index on the Signed-off-by
> >>> field. I use this index to give me the SHAs for the blobs
> >>> Signed-off-by a particular person. In the current design I have no way
> >>> of recovering the path name for these blobs other than a brute force
> >>> search following every path looking for the right SHA.
> >>>
> >>
> >> Ah, there we go. A use-case at last :)
>
> But not a brilliant one.  You sign off on commits not blobs.  So you go
> from the sign-off to paths, then to blobs.  There is no need to go from
> blob to path unless you deliberately introduce such a need.

Use blame for an example. Blame has to crawl every commit to see if it
touched the file. It keeps doing this until it figures out the last
author for every line in the file. Worst case blame has to crawl every
commit in the data store.

> >>
> >> So now we have a concrete problem that we can formulate thus:
> >> "How can one create a database listing the relationship between 'signers'
> >> and blobs?"
> >>
> >> So the second question: Do you seriously argue that git should take a
> >> huge performance loss on its common operations to accommodate a need that
> >> I suspect very few people have?
> >
> > Why do you keep jumping to a performance loss? Both schemes will have
> > an index based on paths. The problem is how those indexes are
> > constructed, not the existence of the index. Moving the paths into the
> > blobs in no way prevents you from creating an index on that field.
>
> But moving the path into the blob _IS_ the performance hit.  You lose the
> ability to tell the two files have the same content _without even looking
> at the blob_.  This is one of the core parts of making git operations
> blindingly fast.  You can't throw that out, and then say that there is no
> performance hit.
>
> You keep talking about abstract database performance - but git is not an
> abstract database.  It has very specific common usage patterns, and is
> optimised to handle them.
>
> >
> > The problem is that the SHAs have been intertwined with the tree
> > nodes. This blending has made it impossible to create other indexes on
> > the blobs.
> >
> > The path index in the flat scheme will probably look just like tree
> > nodes do today but these new tree nodes won't be intertwined with the
> > SHAs.
>
> And you will have to prove that diff/merge etc. don't become very much
> slower before you get buy in.
>
> --
> Julian
>
>   ---
> Many receive advice, few profit by it.
>                 -- Publilius Syrus
>


-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Git's database structure
  2007-09-05 16:12                     ` Jon Smirl
@ 2007-09-05 17:31                       ` Julian Phillips
  2007-09-06  1:27                         ` Kyle Moffett
  2007-09-05 17:39                       ` Mike Hommey
                                         ` (3 subsequent siblings)
  4 siblings, 1 reply; 39+ messages in thread
From: Julian Phillips @ 2007-09-05 17:31 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Andreas Ericsson, Theodore Tso, Junio C Hamano, Git Mailing List

On Wed, 5 Sep 2007, Jon Smirl wrote:

> On 9/5/07, Julian Phillips <julian@quantumfyre.co.uk> wrote:
>> On Wed, 5 Sep 2007, Jon Smirl wrote:
>>
>>> On 9/5/07, Andreas Ericsson <ae@op5.se> wrote:
>>>> Jon Smirl wrote:
>>>>>
>>>>> The path name field needs to be moved back into the blobs to support
>>>>> alternative indexes. For example I want an index on the Signed-off-by
>>>>> field. I use this index to give me the SHAs for the blobs
>>>>> Signed-off-by a particular person. In the current design I have no way
>>>>> of recovering the path name for these blobs other than a brute force
>>>>> search following every path looking for the right SHA.
>>>>>
>>>>
>>>> Ah, there we go. A use-case at last :)
>>
>> But not a brilliant one.  You sign off on commits not blobs.  So you go
>> from the sign-off to paths, then to blobs.  There is no need to go from
>> blob to path unless you deliberately introduce such a need.
>
> Use blame for an example. Blame has to crawl every commit to see if it
> touched the file. It keeps doing this until it figures out the last
> author for every line in the file. Worst case blame has to crawl every
> commit in the data store.

And this is advantaged by having the path in the blob how?  The important 
information here is knowing which commits touched the file - this 
information is expensive in git because it is snapshot based.  You have to 
go back through all the commits looking for changes to the given path. 
The information you might want to cache is which commits touched the file, 
which you could do without changing the current data storage. Presumably 
you are suggesting that such a cache would be cleaner with the filename in 
the blob?  Or do you think that it would somehow be faster to create?  If 
so, how?
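A minimal sketch of the kind of cache described above, assuming a linear history and a toy model in which each commit is a full snapshot mapping path to blob id. The names `Snapshot`, `build_touch_index` and the sample ids are made up for illustration; nothing here reads a real repository.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[str, str]  # path -> blob id (toy model of one commit's tree)

def build_touch_index(history: List[Tuple[str, Snapshot]]) -> Dict[str, List[str]]:
    """Map each path to the commits that changed it, oldest first.

    Assumes `history` is linear and ordered oldest to newest.
    """
    index: Dict[str, List[str]] = {}
    previous: Snapshot = {}
    for commit_id, snapshot in history:
        # A path was touched if its blob id differs from the previous
        # snapshot (this also covers additions and deletions).
        for path in set(previous) | set(snapshot):
            if previous.get(path) != snapshot.get(path):
                index.setdefault(path, []).append(commit_id)
        previous = snapshot
    return index

history = [
    ("c1", {"main.c": "b1", "Makefile": "m1"}),
    ("c2", {"main.c": "b2", "Makefile": "m1"}),
    ("c3", {"main.c": "b2", "Makefile": "m2"}),
    ("c4", {"main.c": "b3", "Makefile": "m2"}),
]
index = build_touch_index(history)
print(index["main.c"])
```

The index is derived entirely from the existing snapshots, so it can be kept beside the object store without changing how blobs are stored.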

-- 
Julian

  ---
Humor in the Court:
Q: (Showing man picture.) That's you?
A: Yes, sir.
Q: And you were present when the picture was taken, right?


* Re: Git's database structure
  2007-09-05 16:12                     ` Jon Smirl
  2007-09-05 17:31                       ` Julian Phillips
@ 2007-09-05 17:39                       ` Mike Hommey
  2007-09-06  8:49                       ` Andreas Ericsson
                                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 39+ messages in thread
From: Mike Hommey @ 2007-09-05 17:39 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Julian Phillips, Andreas Ericsson, Theodore Tso, Junio C Hamano,
	Git Mailing List

On Wed, Sep 05, 2007 at 12:12:28PM -0400, Jon Smirl <jonsmirl@gmail.com> wrote:
> On 9/5/07, Julian Phillips <julian@quantumfyre.co.uk> wrote:
> > On Wed, 5 Sep 2007, Jon Smirl wrote:
> >
> > > On 9/5/07, Andreas Ericsson <ae@op5.se> wrote:
> > >> Jon Smirl wrote:
> > >>>
> > >>> The path name field needs to be moved back into the blobs to support
> > >>> alternative indexes. For example I want an index on the Signed-off-by
> > >>> field. I use this index to give me the SHAs for the blobs
> > >>> Signed-off-by a particular person. In the current design I have no way
> > >>> of recovering the path name for these blobs other than a brute force
> > >>> search following every path looking for the right SHA.
> > >>>
> > >>
> > >> Ah, there we go. A use-case at last :)
> >
> > But not a brilliant one.  You sign off on commits not blobs.  So you go
> > from the sign-off to paths, then to blobs.  There is no need to go from
> > blob to path unless you deliberately introduce such a need.
> 
> Use blame for an example. Blame has to crawl every commit to see if it
> touched the file. It keeps doing this until it figures out the last
> > author for every line in the file. Worst case blame has to crawl every
> commit in the data store.

And why exactly would you need to change blobs to contain path for blame
to be faster ?

Or more generally, what, in the current way of git doing things,
prevents you from adding an index to $THE_DATA_YOU_LIKE, exactly ?

From the very few use cases you've given, I see nothing preventing you from
creating an additional index from the data git currently uses.

Mike


* Re: Git's database structure
  2007-09-05 13:41             ` Jon Smirl
  2007-09-05 14:51               ` Andreas Ericsson
@ 2007-09-05 19:52               ` Andy Parkins
  1 sibling, 0 replies; 39+ messages in thread
From: Andy Parkins @ 2007-09-05 19:52 UTC (permalink / raw)
  To: git; +Cc: Jon Smirl, Andreas Ericsson, Theodore Tso, Junio C Hamano

On Wednesday 2007, September 05, Jon Smirl wrote:

> The path name field needs to be moved back into the blobs to support
> alternative indexes. For example I want an index on the Signed-off-by
> field. I use this index to give me the SHAs for the blobs
> Signed-off-by a particular person. In the current design I have no way
> of recovering the path name for these blobs other than a brute force
> search following every path looking for the right SHA.

Erm, if that's your only way then you designed your index incorrectly.

 1. Signed-Off-By lines appear in commits, so your index should be an index
    of SOB name against commit hash
 2. Look up the commit for that commit hash.  As usual this is blindingly
    git-fastic.
 3. That commit object contains a tree hash.  Look it up.  As usual this is
    blindingly git-fastic.
 4. Start gathering blobs for that tree.  Fast, fast, fast.
 5. Any subtree objects you come across, goto 4.

This is not a brute force lookup and it's stuff that git is really good at 
anyway.
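The five steps can be sketched roughly as follows. This is a toy in-memory object store, not git's API: the dictionaries `commits` and `trees`, the ids, and the helper names are all assumptions made for illustration.

```python
from typing import Dict, Iterator, List, Tuple

# Toy object store. All ids and contents are made up.
commits = {
    "c1": {"tree": "t1", "message": "Add docs\n\nSigned-off-by: Alice <alice@example.com>\n"},
    "c2": {"tree": "t2", "message": "Fix bug\n\nSigned-off-by: Bob <bob@example.com>\n"},
}
# tree id -> list of (name, kind, object id)
trees = {
    "t1": [("README", "blob", "b1"), ("src", "tree", "t3")],
    "t2": [("README", "blob", "b1"), ("src", "tree", "t4")],
    "t3": [("main.c", "blob", "b2")],
    "t4": [("main.c", "blob", "b3")],
}

def build_signoff_index() -> Dict[str, List[str]]:
    """Step 1: index of sign-off name -> commit ids."""
    index: Dict[str, List[str]] = {}
    prefix = "Signed-off-by: "
    for commit_id, commit in commits.items():
        for line in commit["message"].splitlines():
            if line.startswith(prefix):
                index.setdefault(line[len(prefix):], []).append(commit_id)
    return index

def walk_tree(tree_id: str, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Steps 4 and 5: yield (path, blob id), recursing into subtrees."""
    for name, kind, object_id in trees[tree_id]:
        if kind == "tree":
            yield from walk_tree(object_id, prefix + name + "/")
        else:
            yield prefix + name, object_id

def blobs_signed_off_by(signer: str) -> List[Tuple[str, str]]:
    """Steps 2 and 3: commit -> tree, then collect that tree's blobs with paths."""
    result: List[Tuple[str, str]] = []
    for commit_id in build_signoff_index().get(signer, []):
        result.extend(walk_tree(commits[commit_id]["tree"]))
    return result

print(blobs_signed_off_by("Bob <bob@example.com>"))
```

The paths come out of the tree walk for free, so nothing ever has to go from a blob back to its path.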

I'm really not sure I see what problem you're trying to solve.  Whatever 
index you want, you could keep and maintain if you wanted to without 
impacting git's core storage at all.



Andy

-- 
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com


* Re: Git's database structure
  2007-09-05 17:31                       ` Julian Phillips
@ 2007-09-06  1:27                         ` Kyle Moffett
  0 siblings, 0 replies; 39+ messages in thread
From: Kyle Moffett @ 2007-09-06  1:27 UTC (permalink / raw)
  To: Julian Phillips
  Cc: Jon Smirl, Andreas Ericsson, Theodore Tso, Junio C Hamano,
	Git Mailing List

On Sep 05, 2007, at 13:31:43, Julian Phillips wrote:
> And this is advantaged by having the path in the blob how?  The  
> important information here is knowing which commits touched the  
> file - this information is expensive in git because it is snapshot  
> based.  You have to go back through all the commits looking for  
> changes to the given path. The information you might want to cache  
> is which commits touched the file, which you could do without  
> changing the current data storage. Presumably you are suggesting  
> that such a cache would be cleaner with the filename in the blob?   
> Or do you think that it would somehow be faster to create?  If so,  
> how?

The only possible reason I can think of for moving data into the blob  
would be to make a POSIX-compliant git-like filesystem, and EVEN THEN  
you would NOT move the path out of the tree objects.  In order to  
have somewhat consistent inodes (and also for performance when  
changing 4 bytes in a 40GB file) you would want to have 3 different  
types of "inode" objects:

1)  4-64k of (metadata + filedata)
2)  4-64k of (metadata + list of 4-64k filedata blobs)
3)  4-64k of (metadata + list of 4-64k lists of filedata blobs)

On the other hand... that isn't GIT, it's something completely  
different with a very different usage pattern and set of  
requirements.  And you still don't put the path name in the objects,  
just the permissions and other attributes/metadata.

<Random Thought Experiment>
You would of course want to better define those 4-64k limits for  
allocation and performance reasons, but a double-indirect table of  
SHA128s with 64kb chunks lets you address up to 1TB of file data, and  
for each additional power-of-two increase in the chunk size you get 8  
times the storage space.  Furthermore, the actual double-indirect  
tables for an 8TB file using 128k chunks would be all of 64MB, for a  
more reasonable 4GB file with 32k tables (max of 128GB) it would be  
maybe 128kB of indirect SHA1 hash tables.
</Random Thought Experiment>
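The addressable-size figures in the thought experiment can be checked with a short calculation. This sketch assumes 16-byte (128-bit) hashes, which is what the 1TB figure implies, and reads kB/GB/TB as binary units; the function name is made up. It covers only the maximum file size, not the size of the tables themselves.

```python
def double_indirect_capacity(chunk_bytes: int, hash_bytes: int = 16) -> int:
    """Maximum file size reachable through a double-indirect table.

    One table chunk holds chunk_bytes // hash_bytes hashes; the top-level
    table points at that many second-level tables, each of which points at
    that many data chunks of chunk_bytes each.
    """
    entries_per_table = chunk_bytes // hash_bytes
    return entries_per_table * entries_per_table * chunk_bytes

KiB, GiB = 2**10, 2**30

for chunk in (32 * KiB, 64 * KiB, 128 * KiB):
    capacity = double_indirect_capacity(chunk)
    print(chunk // KiB, "KiB chunks ->", capacity // GiB, "GiB")
```

Doubling the chunk size doubles the entries at both table levels and doubles the data chunk, which is where the factor of 8 comes from.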

Cheers,
Kyle Moffett


* Re: Git's database structure
  2007-09-05 16:12                     ` Jon Smirl
  2007-09-05 17:31                       ` Julian Phillips
  2007-09-05 17:39                       ` Mike Hommey
@ 2007-09-06  8:49                       ` Andreas Ericsson
  2007-09-06  9:09                         ` Junio C Hamano
  2007-09-06 12:56                       ` Johannes Schindelin
  2007-09-07  0:33                       ` Martin Langhoff
  4 siblings, 1 reply; 39+ messages in thread
From: Andreas Ericsson @ 2007-09-06  8:49 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Julian Phillips, Theodore Tso, Junio C Hamano, Git Mailing List

Jon Smirl wrote:
> On 9/5/07, Julian Phillips <julian@quantumfyre.co.uk> wrote:
>> On Wed, 5 Sep 2007, Jon Smirl wrote:
>>
>>> On 9/5/07, Andreas Ericsson <ae@op5.se> wrote:
>>>> Jon Smirl wrote:
>>>>> The path name field needs to be moved back into the blobs to support
>>>>> alternative indexes. For example I want an index on the Signed-off-by
>>>>> field. I use this index to give me the SHAs for the blobs
>>>>> Signed-off-by a particular person. In the current design I have no way
>>>>> of recovering the path name for these blobs other than a brute force
>>>>> search following every path looking for the right SHA.
>>>>>
>>>> Ah, there we go. A use-case at last :)
>> But not a brilliant one.  You sign off on commits not blobs.  So you go
>> from the sign-off to paths, then to blobs.  There is no need to go from
>> blob to path unless you deliberately introduce such a need.
> 
> Use blame for an example. Blame has to crawl every commit to see if it
> touched the file. It keeps doing this until it figures out the last
> author for every line in the file. Worst case blame has to crawl every
> commit in the data store.
> 

Estimated daily uses of git-blame, world-wide: few
Estimated daily uses of git-{merge,diff}, worldwide: lots

Code and benchmarks, or I'm not buying it.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: Git's database structure
  2007-09-06  8:49                       ` Andreas Ericsson
@ 2007-09-06  9:09                         ` Junio C Hamano
  2007-09-06 11:03                           ` Wincent Colaiuta
  0 siblings, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2007-09-06  9:09 UTC (permalink / raw)
  To: Andreas Ericsson
  Cc: Jon Smirl, Julian Phillips, Theodore Tso, Git Mailing List

Andreas Ericsson <ae@op5.se> writes:

> Estimated daily uses of git-blame, world-wide: few
> Estimated daily uses of git-{merge,diff}, worldwide: lots

Which makes the author of git-blame weep X-<.

The real issue is that embedding pathname in blob does _not_
help "git blame" but would actively hurt it.  A file with the
identical contents moved between the parent to child commit
shares the same blob object and same object name in the real
git.  Jon's modified system that hashes pathname together with
the contents would have them as two completely unrelated objects
with different object names, which only means that even 100%
similarity rename case becomes as expensive to find as renames
of lower similarity, which needs to expand and look into blob
contents.
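A rough sketch of why the 100% similarity case is cheap when object names depend on content only. It is a simplified model, not git's actual rename detection: trees are flattened to dictionaries of path to blob id, and `exact_renames` is a made-up helper.

```python
from typing import Dict, List, Tuple

def exact_renames(parent: Dict[str, str], child: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair removed and added paths that carry the same blob id.

    Only object names are compared; no blob content is read.
    """
    removed = {path: blob for path, blob in parent.items() if path not in child}
    added = {path: blob for path, blob in child.items() if path not in parent}

    # Group removed paths by blob id so each added path is matched by a lookup.
    by_blob: Dict[str, List[str]] = {}
    for path, blob in sorted(removed.items()):
        by_blob.setdefault(blob, []).append(path)

    renames: List[Tuple[str, str]] = []
    for new_path, blob in sorted(added.items()):
        candidates = by_blob.get(blob)
        if candidates:
            renames.append((candidates.pop(0), new_path))
    return renames

parent = {"old/util.c": "b7", "main.c": "b1"}
child = {"lib/util.c": "b7", "main.c": "b2"}
print(exact_renames(parent, child))
```

If the path were hashed into the object name, `old/util.c` and `lib/util.c` would no longer share an id, and this lookup would find nothing without opening the blobs.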


* Re: Git's database structure
  2007-09-06  9:09                         ` Junio C Hamano
@ 2007-09-06 11:03                           ` Wincent Colaiuta
  0 siblings, 0 replies; 39+ messages in thread
From: Wincent Colaiuta @ 2007-09-06 11:03 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Andreas Ericsson, Jon Smirl, Julian Phillips, Theodore Tso,
	Git Mailing List

El 6/9/2007, a las 11:09, Junio C Hamano escribió:

> Andreas Ericsson <ae@op5.se> writes:
>
>> Estimated daily uses of git-blame, world-wide: few
>> Estimated daily uses of git-{merge,diff}, worldwide: lots
>
> Which makes the author of git-blame weep X-<.

But the few times when you do use git-blame (apart from when you use
it out of sheer curiosity) it usually saves your backside (i.e. when
you've located a problem in the code and you want to know the
who/what/when/why of the offending commit).

Cheers,
Wincent


* Re: Git's database structure
  2007-09-05 16:12                     ` Jon Smirl
                                         ` (2 preceding siblings ...)
  2007-09-06  8:49                       ` Andreas Ericsson
@ 2007-09-06 12:56                       ` Johannes Schindelin
  2007-09-06 18:14                         ` Steven Grimm
  2007-09-07  0:33                       ` Martin Langhoff
  4 siblings, 1 reply; 39+ messages in thread
From: Johannes Schindelin @ 2007-09-06 12:56 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Julian Phillips, Andreas Ericsson, Theodore Tso, Junio C Hamano,
	Git Mailing List

Hi,

On Wed, 5 Sep 2007, Jon Smirl wrote:

> On 9/5/07, Julian Phillips <julian@quantumfyre.co.uk> wrote:
> > On Wed, 5 Sep 2007, Jon Smirl wrote:
> >
> > >> Ah, there we go. A use-case at last :)
> >
> > But not a brilliant one.  You sign off on commits not blobs.  So you 
> > go from the sign-off to paths, then to blobs.  There is no need to go 
> > from blob to path unless you deliberately introduce such a need.
> 
> Use blame for an example. Blame has to crawl every commit to see if it 
> touched the file. It keeps doing this until it figures out the last 
> author for every line in the file. Worst case blame has to crawl every
> commit in the data store.

But you can add _yet another_ index to it, which can be generated on the 
fly, so that Git only has to generate the information once, and then reuse 
it later.  As a benefit of this method, the underlying well-tested 
structure needs no change at all.

BTW could you please, please, please cut the quoted message that you are 
_not_ responding to?  It really _wastes_ my time.

Ciao,
Dscho


* Re: Git's database structure
  2007-09-06 12:56                       ` Johannes Schindelin
@ 2007-09-06 18:14                         ` Steven Grimm
  0 siblings, 0 replies; 39+ messages in thread
From: Steven Grimm @ 2007-09-06 18:14 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Jon Smirl, Julian Phillips, Andreas Ericsson, Theodore Tso,
	Junio C Hamano, Git Mailing List

Johannes Schindelin wrote:
> But you can add _yet another_ index to it, which can be generated on the 
> fly, so that Git only has to generate the information once, and then reuse 
> it later.  As a benefit of this method, the underlying well-tested 
> structure needs no change at all.
>   

And in fact, you can do this today, without modifying git-blame at all, 
by (ab)using its "-S" option (which lets you specify a custom ancestry 
chain to search). By coincidence, I was just showing some people at my 
office how to do this yesterday. I'll cut-and-paste from the email I 
sent them. I am not claiming this is nearly as desirable as a built-in, 
auto-updated secondary index, but it proves the concept, anyway.

Fast-to-generate version:

git rev-list HEAD -- main.c | \
   awk '{if (last) print last " " $0; last=$0;}' > /tmp/revlist

This speeds things up a lot, because git blame doesn't have to examine 
other revisions:

time git blame main.c
   1.56s user 0.30s system 99% cpu 1.868 total
time git blame -S /tmp/revlist main.c
   0.21s user 0.03s system 96% cpu 0.249 total
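For readers who don't speak awk: the awk step in the fast-to-generate version joins each revision with the one listed after it. A minimal Python equivalent, assuming the input is ordered newest first (as rev-list prints it) and using made-up short ids:

```python
from typing import List

def pair_with_next(revs: List[str]) -> List[str]:
    """Emit 'rev next-rev' for every consecutive pair of lines.

    Mirrors the awk script: nothing is printed for the first line, and
    every later line is printed after the line that preceded it.
    """
    return [f"{newer} {older}" for newer, older in zip(revs, revs[1:])]

revs = ["cccc", "bbbb", "aaaa"]  # newest first, ids made up
print("\n".join(pair_with_next(revs)))
```

Each output line is a revision followed by the next-older revision that touched the file, so N input revisions give N-1 lines.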

The bad news is that generating that revision list is a bit slow, and if 
you do it the naive way I suggested above, you can't use the rev list 
with the -M option (to follow renames). The good news is that it's 
possible to have that too if you generate a list of revisions that 
includes the renames:

# Generate a list of all revisions in the right order (only need to do
# this once, not once per file)
git rev-list HEAD > /tmp/all-revs
# Generate a list of the revisions that touched this file, following
# copies/renames.
# Could do this in fewer commands but this is hopefully easier to follow.
git blame --porcelain -M main.c | \
   egrep '^[0-9a-f]{40}' | \
   cut -d' ' -f1 | \
   fgrep -f - /tmp/all-revs | \
   awk '{if (last) print last " " $0; last=$0;}' > /tmp/revlist

Then -M is fast too:

time git blame -M main.c
   1.72s user 0.27s system 89% cpu 2.219 total
time git blame -M -S /tmp/revlist main.c
   0.29s user 0.03s system 93% cpu 0.341 total

Oddly, if you use the -S option, "git blame -C" actually gets 
significantly *slower*. I am not sure why.

-Steve


* Re: Git's database structure
  2007-09-05 16:12                     ` Jon Smirl
                                         ` (3 preceding siblings ...)
  2007-09-06 12:56                       ` Johannes Schindelin
@ 2007-09-07  0:33                       ` Martin Langhoff
  4 siblings, 0 replies; 39+ messages in thread
From: Martin Langhoff @ 2007-09-07  0:33 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Julian Phillips, Andreas Ericsson, Theodore Tso, Junio C Hamano,
	Git Mailing List

On 9/6/07, Jon Smirl <jonsmirl@gmail.com> wrote:
> Use blame for an example. Blame has to crawl every commit to see if it

Sure. Build a quick dedicated index for that and measure

 - cost (size and commit/fetch costs)
 - benefit
 - frequency of usage

git is a special-purpose DB that does great for certain access
patterns. Have a look at monotone for a design that looks a lot like
git but is backed by a general-purpose DB and does equally poorly for
all access patterns ;-)

> It keeps doing this until it figures out the last
> author for every line in the file. Worst case blame has to crawl every
> commit in the data store.

Yep. Can we get a minimal-cost index with just enough hints that can
speed up blame, and perhaps git log with/very/deep/path? Probably!

That's worth pursuing, sure.


martin


end of thread, other threads:[~2007-09-07  0:34 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-09-04 15:23 Git's database structure Jon Smirl
2007-09-04 15:55 ` Andreas Ericsson
2007-09-04 16:07   ` Mike Hommey
2007-09-04 16:10     ` Andreas Ericsson
2007-09-04 16:19   ` Jon Smirl
2007-09-04 16:29     ` Andreas Ericsson
2007-09-04 17:09     ` Jeff King
2007-09-04 20:17     ` David Tweed
2007-09-04 17:21   ` Junio C Hamano
2007-09-04 16:28 ` Jon Smirl
2007-09-04 16:31   ` Andreas Ericsson
2007-09-04 16:47     ` Jon Smirl
2007-09-04 16:51       ` Andreas Ericsson
2007-09-04 17:25   ` Junio C Hamano
2007-09-04 17:44     ` Jon Smirl
2007-09-04 18:04       ` Mike Hommey
2007-09-04 19:44         ` Reece Dunn
2007-09-04 18:06       ` Junio C Hamano
2007-09-04 21:25       ` Theodore Tso
2007-09-04 21:54         ` Jon Smirl
2007-09-05  7:18           ` Andreas Ericsson
2007-09-05 13:41             ` Jon Smirl
2007-09-05 14:51               ` Andreas Ericsson
2007-09-05 15:37                 ` Jon Smirl
2007-09-05 15:54                   ` Julian Phillips
2007-09-05 16:12                     ` Jon Smirl
2007-09-05 17:31                       ` Julian Phillips
2007-09-06  1:27                         ` Kyle Moffett
2007-09-05 17:39                       ` Mike Hommey
2007-09-06  8:49                       ` Andreas Ericsson
2007-09-06  9:09                         ` Junio C Hamano
2007-09-06 11:03                           ` Wincent Colaiuta
2007-09-06 12:56                       ` Johannes Schindelin
2007-09-06 18:14                         ` Steven Grimm
2007-09-07  0:33                       ` Martin Langhoff
2007-09-05 19:52               ` Andy Parkins
2007-09-04 17:19 ` Julian Phillips
2007-09-04 17:30   ` Jon Smirl
2007-09-04 18:51     ` Andreas Ericsson

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git
