* Git's database structure @ 2007-09-04 15:23 Jon Smirl 2007-09-04 15:55 ` Andreas Ericsson ` (2 more replies) 0 siblings, 3 replies; 39+ messages in thread From: Jon Smirl @ 2007-09-04 15:23 UTC (permalink / raw) To: Git Mailing List Let's back up a little bit from "Caclulating tree node". What are the elements of git's data structures? Right now we have an index structure (tree nodes) integrated in to a base table. Integrating indexing into the data is not normally done in a database. Doing a normalization analysis like this may expose flaws in the way the data is structured. Of course we may also decide to leave everything the way it is. What about the special status of a rename? In the current model we effectively have three tables. commit - a set of all SHAs in the commit, previous commit, comment, author, etc blob - a file, permissions, etc. file names - name, SHA The file name table is encoded as an index and it has been intermingled with the commit table. Looking at this from a set theory angle brings up the question, do we really have three tables and file names are an independent variable from the blobs, or should file names be an attribute of the blob? How this gets structured in the db is an independent question about how renames get detected on a commit. The current scheme for detecting renames by comparing diffs is working fine. The question is, once we detect a rename how should it be stored? Ignoring the performance impacts and looking at the problem from the set theory view point, should: the pathnames be in their own table with a row for each alias the pathnames be stored as an attribute of the blob Both of these are the same information, we're just looking at how things are normalized. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 15:23 Git's database structure Jon Smirl @ 2007-09-04 15:55 ` Andreas Ericsson 2007-09-04 16:07 ` Mike Hommey ` (2 more replies) 2007-09-04 16:28 ` Jon Smirl 2007-09-04 17:19 ` Julian Phillips 2 siblings, 3 replies; 39+ messages in thread From: Andreas Ericsson @ 2007-09-04 15:55 UTC (permalink / raw) To: Jon Smirl; +Cc: Git Mailing List Jon Smirl wrote: > Let's back up a little bit from "Caclulating tree node". What are the > elements of git's data structures? > > Right now we have an index structure (tree nodes) integrated in to a > base table. Integrating indexing into the data is not normally done in > a database. Doing a normalization analysis like this may expose flaws > in the way the data is structured. Of course we may also decide to > leave everything the way it is. > > What about the special status of a rename? In the current model we > effectively have three tables. > > commit - a set of all SHAs in the commit, previous commit, comment, author, etc > blob - a file, permissions, etc. > file names - name, SHA commit - SHA1 of its parent(s) and its root-tree, along with author info and a free-form field blob - content addressable by *multiple trees* file names - List of path-names inside a tree object. To draw some sort of relationship model here, you'd have commit 1<->M roottree tree M<->M tree tree M<->M blob Assuming SHA1 never collides (collisions rule out any form of storage, so we might as well hope it never happens), that leaves us with this: Each root tree can only ever belong to a single commit, unless you intentionally force git to make completely empty commits. git won't complain about this, so long as you don't make two in the same second, because it relies more heavily on the DAG than on developer sanity. Each root tree can point to multiple sub-trees. The sub-trees can be linked to any number of root-trees. Blobs can be linked to any number of tree objects, or even multiple times to the same tree object. This wouldn't be possible if the blob objects had their own pathnames stored inside them, so to speak. > > The file name table is encoded as an index and it has been > intermingled with the commit table. > > Looking at this from a set theory angle brings up the question, do we > really have three tables and file names are an independent variable > from the blobs, or should file names be an attribute of the blob? > File names are not independant variables. They belong inside the table created for them, which is the tree objects. > How this gets structured in the db is an independent question about > how renames get detected on a commit. The current scheme for detecting > renames by comparing diffs is working fine. The question is, once we > detect a rename how should it be stored? > Do you realize that you're contradicting yourself in two upon each other following sentences here? Detecting renames after the fashion works fine. Not storing them is part of the "detect them by comparing diffs". > Ignoring the performance impacts and looking at the problem from the > set theory view point, should: > the pathnames be in their own table with a row for each alias > the pathnames be stored as an attribute of the blob > > Both of these are the same information, we're just looking at how > things are normalized. > Except that git init echo foo > a cp -a a b git add . git commit -m testing git count-objects yields 3 objects at the moment; A commit-object, a tree object and *one* blob object. With your scheme the 2 blob objects would differ, and there would be 4 of them. If you propose to ignore the path-name you have effectively broken support for having two identical files with different names in the same directory. Now, can you please tell me what gains you're hoping to see with this new layout of yours? -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 15:55 ` Andreas Ericsson @ 2007-09-04 16:07 ` Mike Hommey 2007-09-04 16:10 ` Andreas Ericsson 2007-09-04 16:19 ` Jon Smirl 2007-09-04 17:21 ` Junio C Hamano 2 siblings, 1 reply; 39+ messages in thread From: Mike Hommey @ 2007-09-04 16:07 UTC (permalink / raw) To: Andreas Ericsson; +Cc: Git Mailing List On Tue, Sep 04, 2007 at 05:55:16PM +0200, Andreas Ericsson <ae@op5.se> wrote: > Each root tree can only ever belong to a single commit, unless you > intentionally force git to make completely empty commits. git > won't complain about this, so long as you don't make two in the > same second, because it relies more heavily on the DAG than on > developer sanity. Actually, you don't need to be insane to have multiple commits pointing at the same root tree. It is actually very easy: - git clone - do some stuff on your master branch and commit - send your changes upstream - upstream applies as is - git pull You now have everything merged, and the last commit on your master branch, while being a different commit object due to its parenting, has the same root tree as the tip of the remote branch. Mike ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 16:07 ` Mike Hommey @ 2007-09-04 16:10 ` Andreas Ericsson 0 siblings, 0 replies; 39+ messages in thread From: Andreas Ericsson @ 2007-09-04 16:10 UTC (permalink / raw) To: Mike Hommey; +Cc: Git Mailing List Mike Hommey wrote: > On Tue, Sep 04, 2007 at 05:55:16PM +0200, Andreas Ericsson <ae@op5.se> wrote: >> Each root tree can only ever belong to a single commit, unless you >> intentionally force git to make completely empty commits. git >> won't complain about this, so long as you don't make two in the >> same second, because it relies more heavily on the DAG than on >> developer sanity. > > Actually, you don't need to be insane to have multiple commits pointing > at the same root tree. It is actually very easy: > - git clone > - do some stuff on your master branch and commit > - send your changes upstream > - upstream applies as is > - git pull > > You now have everything merged, and the last commit on your master branch, > while being a different commit object due to its parenting, has the same > root tree as the tip of the remote branch. > That explains why it felt so awkward writing that sentence. :) Thanks for correcting me. Even so, one more M<->M relation-ship certainly speaks for rather than against the current model. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 15:55 ` Andreas Ericsson 2007-09-04 16:07 ` Mike Hommey @ 2007-09-04 16:19 ` Jon Smirl 2007-09-04 16:29 ` Andreas Ericsson ` (2 more replies) 2007-09-04 17:21 ` Junio C Hamano 2 siblings, 3 replies; 39+ messages in thread From: Jon Smirl @ 2007-09-04 16:19 UTC (permalink / raw) To: Andreas Ericsson; +Cc: Git Mailing List On 9/4/07, Andreas Ericsson <ae@op5.se> wrote: > Jon Smirl wrote: > > Let's back up a little bit from "Caclulating tree node". What are the > > elements of git's data structures? > > > > Right now we have an index structure (tree nodes) integrated in to a > > base table. Integrating indexing into the data is not normally done in > > a database. Doing a normalization analysis like this may expose flaws > > in the way the data is structured. Of course we may also decide to > > leave everything the way it is. > > > > What about the special status of a rename? In the current model we > > effectively have three tables. > > > > commit - a set of all SHAs in the commit, previous commit, comment, author, etc > > > blob - a file, permissions, etc. > > file names - name, SHA > > commit - SHA1 of its parent(s) and its root-tree, along with > author info and a free-form field > blob - content addressable by *multiple trees* > file names - List of path-names inside a tree object. > > > To draw some sort of relationship model here, you'd have > > commit 1<->M roottree > tree M<->M tree > tree M<->M blob By introducing tree nodes you have blended a specific indexing scheme into the data. There are many other ways the path names could be indexed hash tables, binary trees, etc. This problem exists in files systems. Since the path names have been encoded into the directory structures there is no way to query something like "all files created yesterday" from a file system without building another mapping table or a brute force search. I keep using Google as an example, Google is indexing hierarchical URLs but they do not use a hierarchical index to do it. Databases keep the knowledge of how things are indexed out of the data. A data structure analysis of git should remove the blended index and start from the set theory. > Assuming SHA1 never collides (collisions rule out any form of storage, > so we might as well hope it never happens), that leaves us with this: > > Each root tree can only ever belong to a single commit, unless you > intentionally force git to make completely empty commits. git > won't complain about this, so long as you don't make two in the > same second, because it relies more heavily on the DAG than on > developer sanity. > > Each root tree can point to multiple sub-trees. The sub-trees can be > linked to any number of root-trees. > > Blobs can be linked to any number of tree objects, or even multiple > times to the same tree object. This wouldn't be possible if the > blob objects had their own pathnames stored inside them, so to speak. > > > > > The file name table is encoded as an index and it has been > > intermingled with the commit table. > > > > Looking at this from a set theory angle brings up the question, do we > > really have three tables and file names are an independent variable > > from the blobs, or should file names be an attribute of the blob? > > > > File names are not independant variables. They belong inside the > table created for them, which is the tree objects. > > > How this gets structured in the db is an independent question about > > how renames get detected on a commit. The current scheme for detecting > > renames by comparing diffs is working fine. The question is, once we > > detect a rename how should it be stored? > > > > Do you realize that you're contradicting yourself in two upon each > other following sentences here? > > Detecting renames after the fashion works fine. Not storing them > is part of the "detect them by comparing diffs". > > > Ignoring the performance impacts and looking at the problem from the > > set theory view point, should: > > the pathnames be in their own table with a row for each alias > > the pathnames be stored as an attribute of the blob > > > > Both of these are the same information, we're just looking at how > > things are normalized. > > > > Except that > > git init > echo foo > a > cp -a a b > git add . > git commit -m testing > git count-objects > > yields 3 objects at the moment; A commit-object, a tree object and *one* > blob object. With your scheme the 2 blob objects would differ, and there > would be 4 of them. If you propose to ignore the path-name you have > effectively broken support for having two identical files with different > names in the same directory. > > Now, can you please tell me what gains you're hoping to see with this > new layout of yours? > > -- > Andreas Ericsson andreas.ericsson@op5.se > OP5 AB www.op5.se > Tel: +46 8-230225 Fax: +46 8-230231 > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 16:19 ` Jon Smirl @ 2007-09-04 16:29 ` Andreas Ericsson 2007-09-04 17:09 ` Jeff King 2007-09-04 20:17 ` David Tweed 2 siblings, 0 replies; 39+ messages in thread From: Andreas Ericsson @ 2007-09-04 16:29 UTC (permalink / raw) To: Jon Smirl; +Cc: Git Mailing List Jon Smirl wrote: > On 9/4/07, Andreas Ericsson <ae@op5.se> wrote: >> Jon Smirl wrote: >>> Let's back up a little bit from "Caclulating tree node". What are the >>> elements of git's data structures? >>> >>> Right now we have an index structure (tree nodes) integrated in to a >>> base table. Integrating indexing into the data is not normally done in >>> a database. Doing a normalization analysis like this may expose flaws >>> in the way the data is structured. Of course we may also decide to >>> leave everything the way it is. >>> >>> What about the special status of a rename? In the current model we >>> effectively have three tables. >>> >>> commit - a set of all SHAs in the commit, previous commit, comment, author, etc >>> blob - a file, permissions, etc. >>> file names - name, SHA >> commit - SHA1 of its parent(s) and its root-tree, along with >> author info and a free-form field >> blob - content addressable by *multiple trees* >> file names - List of path-names inside a tree object. >> >> >> To draw some sort of relationship model here, you'd have >> >> commit 1<->M roottree >> tree M<->M tree >> tree M<->M blob > > By introducing tree nodes you have blended a specific indexing scheme > into the data. There are many other ways the path names could be > indexed hash tables, binary trees, etc. > > This problem exists in files systems. Since the path names have been > encoded into the directory structures there is no way to query > something like "all files created yesterday" from a file system > without building another mapping table or a brute force search. I keep > using Google as an example, Google is indexing hierarchical URLs but > they do not use a hierarchical index to do it. > Pathnames are by far the most common search-/delimiting criteria for git though, so I fail to see why this is a problem for you. > Databases keep the knowledge of how things are indexed out of the > data. A data structure analysis of git should remove the blended index > and start from the set theory. > Why? This is the core of the problem, really. You haven't specified a single, real-life reason *why* it should be any other way than it already is. It sounds a bit to me as if you've been to a really inspiring seminar about "how database-like things *should* be done" and then decided to go berserk on your favourite database-like thing, which is git. Code and benchmarks or bust. In the meantime, I'll settle for a recount of what problems you're having with the current layout, or what gains you're hoping to achieve with the new one. As it's the 3rd time I'm asking, this'll be the last. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 16:19 ` Jon Smirl 2007-09-04 16:29 ` Andreas Ericsson @ 2007-09-04 17:09 ` Jeff King 2007-09-04 20:17 ` David Tweed 2 siblings, 0 replies; 39+ messages in thread From: Jeff King @ 2007-09-04 17:09 UTC (permalink / raw) To: Jon Smirl; +Cc: Andreas Ericsson, Git Mailing List On Tue, Sep 04, 2007 at 12:19:33PM -0400, Jon Smirl wrote: > By introducing tree nodes you have blended a specific indexing scheme > into the data. There are many other ways the path names could be > indexed hash tables, binary trees, etc. That is correct. However, given that indexing scheme, many of the common operations just "fall out" simply and efficiently, without the need to keep separate indices. So yes, git is geared towards a particular set of operations. Your complaint seems to be two-fold: 1. there is an inelegance in the blending of data and indexing. The problem with changing this is: a. we are all already using git, and it would require completely re-vamping the core data structure b. there is some feeling that the blending is necessary for performance. Given the difficulty of (a), I think you would have to provide compelling evidence (i.e., numbers) that a git-like system based around set theory with separate indices would perform as well. 2. you want perform some operations to which the hierarchy is not well-suited. In this case, I think you can get by with the same solution you have proposed already: indices external to the data structure (in fact, this is exactly what Google is doing: taking hierarchical URLs and indexing them in different ways). Have you taken a look at the pack v4 work by Shawn and Nicolas? It is an attempt to build such indices at pack time (but keeping the core git data structure intact). -Peff ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 16:19 ` Jon Smirl 2007-09-04 16:29 ` Andreas Ericsson 2007-09-04 17:09 ` Jeff King @ 2007-09-04 20:17 ` David Tweed 2 siblings, 0 replies; 39+ messages in thread From: David Tweed @ 2007-09-04 20:17 UTC (permalink / raw) To: Jon Smirl; +Cc: Andreas Ericsson, Git Mailing List On 9/4/07, Jon Smirl <jonsmirl@gmail.com> wrote: > without building another mapping table or a brute force search. I keep > using Google as an example, Google is indexing hierarchical URLs but > they do not use a hierarchical index to do it. It might help the discussion if you could point to a reference, preferably one that discusses the trade-offs in the design, with more concrete details about what google or other search engines actually do. It would be particularly useful if it addressed issues of 1. the type of queries the representation is optimised for. 2. consistency requirements. (Can a search engine use different data structures if they improve average performance at the cost of occasional inconsistency/lossage?) Finally, this design space is not totally unexplored, for example, http://plan9.bell-labs.com/sys/doc/venti/venti.html AFAICS they only use SHA-1 for blocks within files (although this might be misreading the paper) so presumably they'd have knowledge about the trade-offs. -- cheers, dave tweed__________________________ david.tweed@gmail.com Rm 124, School of Systems Engineering, University of Reading. "we had no idea that when we added templates we were adding a Turing- complete compile-time language." -- C++ standardisation committee ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 15:55 ` Andreas Ericsson 2007-09-04 16:07 ` Mike Hommey 2007-09-04 16:19 ` Jon Smirl @ 2007-09-04 17:21 ` Junio C Hamano 2 siblings, 0 replies; 39+ messages in thread From: Junio C Hamano @ 2007-09-04 17:21 UTC (permalink / raw) To: Andreas Ericsson; +Cc: Jon Smirl, Git Mailing List Andreas Ericsson <ae@op5.se> writes: > Each root tree can only ever belong to a single commit, unless you > intentionally force git to make completely empty commits. git > won't complain about this, so long as you don't make two in the > same second, because it relies more heavily on the DAG than on > developer sanity. This actually can happen without even using 'ours' strategy. If two people independently applied the same patch on their branches and later their results were merged. And "the same second" requirement is not even there and not interesting. There are other things like developer identity, log message, and their ancestry that would make the resulting commit object distinct. > Each root tree can point to multiple sub-trees. The sub-trees can be > linked to any number of root-trees. > > Blobs can be linked to any number of tree objects, or even multiple > times to the same tree object. This wouldn't be possible if the > blob objects had their own pathnames stored inside them, so to speak. More importantly, in git, filenames and modes are not considered part of "contents", which git tracks. Although it is an entirely possible and valid alternate design to move that as part of "blob" to build a system that is different from git, which Jon seems to be aiming at, the benefit of such a design is unclear to me, both from theoretical point of view (now blobs are not about pure contents anymore) nor performance point of view (Linus's done flat tree object in an early stage of git, and it was not nice) as other people explained. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 15:23 Git's database structure Jon Smirl 2007-09-04 15:55 ` Andreas Ericsson @ 2007-09-04 16:28 ` Jon Smirl 2007-09-04 16:31 ` Andreas Ericsson 2007-09-04 17:25 ` Junio C Hamano 2007-09-04 17:19 ` Julian Phillips 2 siblings, 2 replies; 39+ messages in thread From: Jon Smirl @ 2007-09-04 16:28 UTC (permalink / raw) To: Git Mailing List Another way of looking at the problem, Let's build a full-text index for git. You put a string into the index and it returns the SHAs of all the file nodes that contain the string. How do I recover the path names of these SHAs? -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 16:28 ` Jon Smirl @ 2007-09-04 16:31 ` Andreas Ericsson 2007-09-04 16:47 ` Jon Smirl 2007-09-04 17:25 ` Junio C Hamano 1 sibling, 1 reply; 39+ messages in thread From: Andreas Ericsson @ 2007-09-04 16:31 UTC (permalink / raw) To: Jon Smirl; +Cc: Git Mailing List Jon Smirl wrote: > Another way of looking at the problem, > > Let's build a full-text index for git. You put a string into the index > and it returns the SHAs of all the file nodes that contain the string. > How do I recover the path names of these SHAs? > I wouldn't know, but presumably any table can have more than one column. Is this a problem you face with git so often that it requires a complete re-design of its very core? -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 16:31 ` Andreas Ericsson @ 2007-09-04 16:47 ` Jon Smirl 2007-09-04 16:51 ` Andreas Ericsson 0 siblings, 1 reply; 39+ messages in thread From: Jon Smirl @ 2007-09-04 16:47 UTC (permalink / raw) To: Andreas Ericsson; +Cc: Git Mailing List On 9/4/07, Andreas Ericsson <ae@op5.se> wrote: > Jon Smirl wrote: > > Another way of looking at the problem, > > > > Let's build a full-text index for git. You put a string into the index > > and it returns the SHAs of all the file nodes that contain the string. > > How do I recover the path names of these SHAs? > > > > I wouldn't know, but presumably any table can have more than one column. > > Is this a problem you face with git so often that it requires a complete > re-design of its very core? That's the whole point. We need to discuss the impact of merging a field (path names) with an index (tree nodes) has on future things we may want to do with the data stored in git. Databases don't usually blend fields/indexes without also duplicating the field in the table. You need all the fields in the table so that it is possible to create indexes on other fields. > > -- > Andreas Ericsson andreas.ericsson@op5.se > OP5 AB www.op5.se > Tel: +46 8-230225 Fax: +46 8-230231 > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 16:47 ` Jon Smirl @ 2007-09-04 16:51 ` Andreas Ericsson 0 siblings, 0 replies; 39+ messages in thread From: Andreas Ericsson @ 2007-09-04 16:51 UTC (permalink / raw) To: Jon Smirl; +Cc: Git Mailing List Jon Smirl wrote: > On 9/4/07, Andreas Ericsson <ae@op5.se> wrote: >> Jon Smirl wrote: >>> Another way of looking at the problem, >>> >>> Let's build a full-text index for git. You put a string into the index >>> and it returns the SHAs of all the file nodes that contain the string. >>> How do I recover the path names of these SHAs? >>> >> I wouldn't know, but presumably any table can have more than one column. >> >> Is this a problem you face with git so often that it requires a complete >> re-design of its very core? > > That's the whole point. We need to discuss the impact of merging a > field (path names) with an index (tree nodes) has on future things we > may want to do with the data stored in git. > Yes, but as nobody seems to know what those future things are, it feels rather pointless speculating about adding support to git for them. git is a tool. It's a great one at that, because it was built to solve a particular problem, which it does an amazing job at. Other SCM's which had the potential to become amazingly good tools too drowned somewhere between prototype and product in a sea of intellectual masturbation, which had little to do with solving real-world problems. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 16:28 ` Jon Smirl 2007-09-04 16:31 ` Andreas Ericsson @ 2007-09-04 17:25 ` Junio C Hamano 2007-09-04 17:44 ` Jon Smirl 1 sibling, 1 reply; 39+ messages in thread From: Junio C Hamano @ 2007-09-04 17:25 UTC (permalink / raw) To: Jon Smirl; +Cc: Git Mailing List "Jon Smirl" <jonsmirl@gmail.com> writes: > Another way of looking at the problem, > > Let's build a full-text index for git. You put a string into the index > and it returns the SHAs of all the file nodes that contain the string. > How do I recover the path names of these SHAs? That question does not make much sense without specifying "which commit's path you are talking about". If you want to encode such "contextual information" in addition to "contents", you could do so, but you essentially need to record commit + pathname + mode bits + contents as "blob" and hash that to come up with a name. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 17:25 ` Junio C Hamano @ 2007-09-04 17:44 ` Jon Smirl 2007-09-04 18:04 ` Mike Hommey ` (2 more replies) 0 siblings, 3 replies; 39+ messages in thread From: Jon Smirl @ 2007-09-04 17:44 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List On 9/4/07, Junio C Hamano <gitster@pobox.com> wrote: > "Jon Smirl" <jonsmirl@gmail.com> writes: > > > Another way of looking at the problem, > > > > Let's build a full-text index for git. You put a string into the index > > and it returns the SHAs of all the file nodes that contain the string. > > How do I recover the path names of these SHAs? > > That question does not make much sense without specifying "which > commit's path you are talking about". > > If you want to encode such "contextual information" in addition > to "contents", you could do so, but you essentially need to > record commit + pathname + mode bits + contents as "blob" and > hash that to come up with a name. I left the details out of the full-text example to make it more obvious that we can't recover the path names. Doing this type of analysis may point out that even more fields are missing from the blob table such as commit id. The current data store design is not very flexible. Databases solved the flexibility problem long ago. I'm just wondering if we should steal some good ideas out of the database world and apply them to git. Ten years from now we may have 100GB git databases and really wish we had more flexible ways of querying them. The reason databases don't encode the fields into the index is that you can only have a single index on the table if you do that. Databases do sometimes duplicate the field in both the index and the table. Databases also have the property that indexes are just a cache and can be dropped at any time. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 17:44 ` Jon Smirl @ 2007-09-04 18:04 ` Mike Hommey 2007-09-04 19:44 ` Reece Dunn 2007-09-04 18:06 ` Junio C Hamano 2007-09-04 21:25 ` Theodore Tso 2 siblings, 1 reply; 39+ messages in thread From: Mike Hommey @ 2007-09-04 18:04 UTC (permalink / raw) To: Jon Smirl; +Cc: Junio C Hamano, Git Mailing List On Tue, Sep 04, 2007 at 01:44:47PM -0400, Jon Smirl <jonsmirl@gmail.com> wrote: > On 9/4/07, Junio C Hamano <gitster@pobox.com> wrote: > > "Jon Smirl" <jonsmirl@gmail.com> writes: > > > > > Another way of looking at the problem, > > > > > > Let's build a full-text index for git. You put a string into the index > > > and it returns the SHAs of all the file nodes that contain the string. > > > How do I recover the path names of these SHAs? > > > > That question does not make much sense without specifying "which > > commit's path you are talking about". > > > > If you want to encode such "contextual information" in addition > > to "contents", you could do so, but you essentially need to > > record commit + pathname + mode bits + contents as "blob" and > > hash that to come up with a name. > > I left the details out of the full-text example to make it more > obvious that we can't recover the path names. > > Doing this type of analysis may point out that even more fields are > missing from the blob table such as commit id. > > The current data store design is not very flexible. Databases solved > the flexibility problem long ago. I'm just wondering if we should > steal some good ideas out of the database world and apply them to git. > Ten years from now we may have 100GB git databases and really wish we > had more flexible ways of querying them. > > The reason databases don't encode the fields into the index is that > you can only have a single index on the table if you do that. > Databases do sometimes duplicate the field in both the index and the > table. Databases also have the property that indexes are just a cache > and can be dropped at any time. The big difference between a database and git is that a database is a general purpose tool. git has a much more restricted scope. As such, it doesn't need *that much* flexibility. Mike ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 18:04 ` Mike Hommey @ 2007-09-04 19:44 ` Reece Dunn 0 siblings, 0 replies; 39+ messages in thread From: Reece Dunn @ 2007-09-04 19:44 UTC (permalink / raw) To: Mike Hommey, Jon Smirl, Junio C Hamano, Git Mailing List On 04/09/07, Mike Hommey <mh@glandium.org> wrote: > On Tue, Sep 04, 2007 at 01:44:47PM -0400, Jon Smirl <jonsmirl@gmail.com> wrote: > > The reason databases don't encode the fields into the index is that > > you can only have a single index on the table if you do that. > > Databases do sometimes duplicate the field in both the index and the > > table. Databases also have the property that indexes are just a cache > > and can be dropped at any time. > > The big difference between a database and git is that a database is a > general purpose tool. git has a much more restricted scope. As such, it > doesn't need *that much* flexibility. Databases are designed to be efficient at storing and accessing large amounts of data. The key thing about a database is that it does not track the *history* of the data it is storing. This is the main problem with using a database as a metadata storage facility. Modern source control systems such as Perforce (and possibly Subversion), use a database to track metadata such as branch/merge history, user data and so on. This, IMHO is a huge weakness of these SCM systems. It is impossible to fully roll back to a given point in time, because that metadata is stored independently of the file content tracking. Git *is not a database*. This is fundamental to understanding how git works. Git stores *all* of its data in a Directed Acyclic Graph (with the exception of the pointers to tag and the current head of each branch, that it stores locally in the .git directory). Read http://eagain.net/articles/git-for-computer-scientists/ for more information on this. What this means is that for any commit, git has all the information it needs about the repository at that point in time. It doesn't need anything else. If you then store information in a database, you lose having the complete picture at any point in the history of the repository. - Reece ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 17:44 ` Jon Smirl 2007-09-04 18:04 ` Mike Hommey @ 2007-09-04 18:06 ` Junio C Hamano 2007-09-04 21:25 ` Theodore Tso 2 siblings, 0 replies; 39+ messages in thread From: Junio C Hamano @ 2007-09-04 18:06 UTC (permalink / raw) To: Jon Smirl; +Cc: Git Mailing List "Jon Smirl" <jonsmirl@gmail.com> writes: > On 9/4/07, Junio C Hamano <gitster@pobox.com> wrote: >> "Jon Smirl" <jonsmirl@gmail.com> writes: >> >> > Another way of looking at the problem, >> > >> > Let's build a full-text index for git. You put a string into the index >> > and it returns the SHAs of all the file nodes that contain the string. >> > How do I recover the path names of these SHAs? >> >> That question does not make much sense without specifying "which >> commit's path you are talking about". >> >> If you want to encode such "contextual information" in addition >> to "contents", you could do so, but you essentially need to >> record commit + pathname + mode bits + contents as "blob" and >> hash that to come up with a name. > > I left the details out of the full-text example to make it more > obvious that we can't recover the path names. > > Doing this type of analysis may point out that even more fields are > missing from the blob table such as commit id. Quite the contrary. You just illustrated why it is wrong to put anything but contents in the blob. The specialized indexing is a different issue. If you want to have a full text index to answer "what paths in which commits had this string?", then your database table would have columns such as commit (sha-1), path (string) as values, indexed with the search string. Now the current set of "git" operation does not need to answer that query, so we do not build nor maintain such an index that nobody uses. But your application may benefit from such an index, and as others said, nobody prevents you from building one. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 17:44 ` Jon Smirl 2007-09-04 18:04 ` Mike Hommey 2007-09-04 18:06 ` Junio C Hamano @ 2007-09-04 21:25 ` Theodore Tso 2007-09-04 21:54 ` Jon Smirl 2 siblings, 1 reply; 39+ messages in thread From: Theodore Tso @ 2007-09-04 21:25 UTC (permalink / raw) To: Jon Smirl; +Cc: Junio C Hamano, Git Mailing List On Tue, Sep 04, 2007 at 01:44:47PM -0400, Jon Smirl wrote: > The current data store design is not very flexible. Databases solved > the flexibility problem long ago. I'm just wondering if we should > steal some good ideas out of the database world and apply them to git. > Ten years from now we may have 100GB git databases and really wish we > had more flexible ways of querying them. Databases solved the flexibility problem, at the cost of performance. And if you use full normalized form in your database scheme, it costs you even more in performance, because of all of the joins that you need in order get the information you need to do, you know, useful work as opposed to database wanking. If you take a look at the really big databases with super high performance requirements, say like those used to managed airline tickets/reservation/fares, you will find that they are not normalized, and they are not relational; they can't afford to be. And if you take a look at some of git competition that use relational databases to store their SCM data, and take a look at how loooooong they they take to do even basic operations, I would say that the onus is on you to prove that normalization is actually a win in terms of real (not theoretical) advantages, and that it doesn't cause performance to go into the toilet. I think the fundamental disconnect here is that no one is buying your claim that just because the data design is "more flexible" that this is automatically a good thing in and of itself, and we should even for a moment, "put performance aside". I also don't think that attempting to force git's data structures into database terms makes sense; it is much closer to an filesystem using an object based store --- and very few people except for folks like Hans Resiers believes that Filesystems and Database should be unified.... - Ted ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 21:25 ` Theodore Tso @ 2007-09-04 21:54 ` Jon Smirl 2007-09-05 7:18 ` Andreas Ericsson 0 siblings, 1 reply; 39+ messages in thread From: Jon Smirl @ 2007-09-04 21:54 UTC (permalink / raw) To: Theodore Tso; +Cc: Junio C Hamano, Git Mailing List On 9/4/07, Theodore Tso <tytso@mit.edu> wrote: > On Tue, Sep 04, 2007 at 01:44:47PM -0400, Jon Smirl wrote: > > The current data store design is not very flexible. Databases solved > > the flexibility problem long ago. I'm just wondering if we should > > steal some good ideas out of the database world and apply them to git. > > Ten years from now we may have 100GB git databases and really wish we > > had more flexible ways of querying them. > > Databases solved the flexibility problem, at the cost of performance. > And if you use full normalized form in your database scheme, it costs > you even more in performance, because of all of the joins that you > need in order get the information you need to do, you know, useful > work as opposed to database wanking. > > If you take a look at the really big databases with super high > performance requirements, say like those used to managed airline > tickets/reservation/fares, you will find that they are not normalized, > and they are not relational; they can't afford to be. And if you take > a look at some of git competition that use relational databases to > store their SCM data, and take a look at how loooooong they they take > to do even basic operations, I would say that the onus is on you to > prove that normalization is actually a win in terms of real (not > theoretical) advantages, and that it doesn't cause performance to go > into the toilet. > > I think the fundamental disconnect here is that no one is buying your > claim that just because the data design is "more flexible" that this > is automatically a good thing in and of itself, and we should even for > a moment, "put performance aside". It is very easy to get bogged down in performance arguments on database design when the correct answer is that there are always lots of different ways to achieve the same goal. I wanted to defer debating performance until we closely looked at the relationships between the data at an abstract level. Since git hasn't stored all of the fields in the object table (the path is encoded in the index) we are never going to be able to build an alternative way of indexing the object table. Not being able to build alternative indexes is likely to cause problems when the database starts getting really big. Without an index every query that can't use the path name index is reduced to doing full table scans. A few things that could benefit from alternative indexing, blame, full-text search, automating the Maintainers file, etc. I'm just asking if we really want to make full table scans the only possible way to implement these types of queries. If the answer is no, then let's first explore how to fix things at an abstract level before diving into the performance arguments. An obvious parallel from the file system world is the locate database and how it is forced to continuously rescan the file system and store full path names. > > I also don't think that attempting to force git's data structures into > database terms makes sense; it is much closer to an filesystem using > an object based store --- and very few people except for folks like > Hans Resiers believes that Filesystems and Database should be > unified.... > > - Ted > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 21:54 ` Jon Smirl @ 2007-09-05 7:18 ` Andreas Ericsson 2007-09-05 13:41 ` Jon Smirl 0 siblings, 1 reply; 39+ messages in thread From: Andreas Ericsson @ 2007-09-05 7:18 UTC (permalink / raw) To: Jon Smirl; +Cc: Theodore Tso, Junio C Hamano, Git Mailing List Jon Smirl wrote: > On 9/4/07, Theodore Tso <tytso@mit.edu> wrote: >> On Tue, Sep 04, 2007 at 01:44:47PM -0400, Jon Smirl wrote: >>> The current data store design is not very flexible. Databases solved >>> the flexibility problem long ago. I'm just wondering if we should >>> steal some good ideas out of the database world and apply them to git. >>> Ten years from now we may have 100GB git databases and really wish we >>> had more flexible ways of querying them. >> Databases solved the flexibility problem, at the cost of performance. >> And if you use full normalized form in your database scheme, it costs >> you even more in performance, because of all of the joins that you >> need in order get the information you need to do, you know, useful >> work as opposed to database wanking. >> >> If you take a look at the really big databases with super high >> performance requirements, say like those used to managed airline >> tickets/reservation/fares, you will find that they are not normalized, >> and they are not relational; they can't afford to be. And if you take >> a look at some of git competition that use relational databases to >> store their SCM data, and take a look at how loooooong they they take >> to do even basic operations, I would say that the onus is on you to >> prove that normalization is actually a win in terms of real (not >> theoretical) advantages, and that it doesn't cause performance to go >> into the toilet. >> >> I think the fundamental disconnect here is that no one is buying your >> claim that just because the data design is "more flexible" that this >> is automatically a good thing in and of itself, and we should even for >> a moment, "put performance aside". > > It is very easy to get bogged down in performance arguments on > database design when the correct answer is that there are always lots > of different ways to achieve the same goal. I wanted to defer debating > performance until we closely looked at the relationships between the > data at an abstract level. > But you cannot. Git is performance-critical, for the same reason every other performance-critical application is: It's a tool to save human time. Linux development *could* be done using patchfiles by the bundle and masses of tarballs. It's just not the fastest way to do it, so enter git, and lots of problems just go away. It's not the only way of doing it, but it saves time. If you were to add 2 seconds to each commit, that's several months of developer time that is lost every day! > Since git hasn't stored all of the fields in the object table (the > path is encoded in the index) we are never going to be able to build > an alternative way of indexing the object table. We can still build alternative indexes. They just have to be separate from the DAG and the current indexing scheme. Junio has pointed out ways of doing this already. > Not being able to > build alternative indexes is likely to cause problems when the > database starts getting really big. Without an index every query that > can't use the path name index is reduced to doing full table scans. > I've said it before; The most common delimiter used today is paths. It's a behaviour git was designed to handle well, because it *is* the most common way of limiting and separating content. It's not some random fluke that has made git perform very well on actions that commonly performed in large scale software projects; Linus designed it that way from the start, and kudos to him for a job well done. > A few things that could benefit from alternative indexing, blame, > full-text search, automating the Maintainers file, etc. > Yes, but getting rid of the tree objects and storing pathnames in blob objects would penalize log-viewing, diffs and merges, which are far more common operations than full-text searches in a software project. > I'm just asking if we really want to make full table scans the only > possible way to implement these types of queries. If the answer is no, > then let's first explore how to fix things at an abstract level before > diving into the performance arguments. > Personally, I really don't care. But you should really have read Junio's mail a bit more carefully. He explained about 'notes' that can be attached to commits and contain arbitrary data. By all means, create your indexes there and use them for whatever you like, but leave the foundation on which git was built *alone*. The design hasn't changed since April 2006 (subtrees were introduced April 26, I think), because it's a *good* design. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-05 7:18 ` Andreas Ericsson @ 2007-09-05 13:41 ` Jon Smirl 2007-09-05 14:51 ` Andreas Ericsson 2007-09-05 19:52 ` Andy Parkins 0 siblings, 2 replies; 39+ messages in thread From: Jon Smirl @ 2007-09-05 13:41 UTC (permalink / raw) To: Andreas Ericsson; +Cc: Theodore Tso, Junio C Hamano, Git Mailing List On 9/5/07, Andreas Ericsson <ae@op5.se> wrote: > Jon Smirl wrote: > > On 9/4/07, Theodore Tso <tytso@mit.edu> wrote: > >> On Tue, Sep 04, 2007 at 01:44:47PM -0400, Jon Smirl wrote: > >>> The current data store design is not very flexible. Databases solved > >>> the flexibility problem long ago. I'm just wondering if we should > >>> steal some good ideas out of the database world and apply them to git. > >>> Ten years from now we may have 100GB git databases and really wish we > >>> had more flexible ways of querying them. > >> Databases solved the flexibility problem, at the cost of performance. > >> And if you use full normalized form in your database scheme, it costs > >> you even more in performance, because of all of the joins that you > >> need in order get the information you need to do, you know, useful > >> work as opposed to database wanking. > >> > >> If you take a look at the really big databases with super high > >> performance requirements, say like those used to managed airline > >> tickets/reservation/fares, you will find that they are not normalized, > >> and they are not relational; they can't afford to be. And if you take > >> a look at some of git competition that use relational databases to > >> store their SCM data, and take a look at how loooooong they they take > >> to do even basic operations, I would say that the onus is on you to > >> prove that normalization is actually a win in terms of real (not > >> theoretical) advantages, and that it doesn't cause performance to go > >> into the toilet. > >> > >> I think the fundamental disconnect here is that no one is buying your > >> claim that just because the data design is "more flexible" that this > >> is automatically a good thing in and of itself, and we should even for > >> a moment, "put performance aside". > > > > It is very easy to get bogged down in performance arguments on > > database design when the correct answer is that there are always lots > > of different ways to achieve the same goal. I wanted to defer debating > > performance until we closely looked at the relationships between the > > data at an abstract level. > > > > But you cannot. Git is performance-critical, for the same reason every > other performance-critical application is: It's a tool to save human > time. Linux development *could* be done using patchfiles by the bundle > and masses of tarballs. It's just not the fastest way to do it, so enter > git, and lots of problems just go away. It's not the only way of doing > it, but it saves time. If you were to add 2 seconds to each commit, > that's several months of developer time that is lost every day! > > > > Since git hasn't stored all of the fields in the object table (the > > path is encoded in the index) we are never going to be able to build > > an alternative way of indexing the object table. > > We can still build alternative indexes. They just have to be separate > from the DAG and the current indexing scheme. Junio has pointed out > ways of doing this already. > > > Not being able to > > build alternative indexes is likely to cause problems when the > > database starts getting really big. Without an index every query that > > can't use the path name index is reduced to doing full table scans. > > > > I've said it before; The most common delimiter used today is paths. It's > a behaviour git was designed to handle well, because it *is* the most > common way of limiting and separating content. It's not some random > fluke that has made git perform very well on actions that commonly > performed in large scale software projects; Linus designed it that way > from the start, and kudos to him for a job well done. This is why I wanted to separate the abstract data structure design discussion from the performance one. In the flat design indexes are like caches and can be created and destroyed. There will definitely be an index created on the the paths. This index will work like the current tree nodes. The difference is that this index is a cache unlike the current tree nodes which are an immutable part of the the data base. The path name field needs to be moved back into the blobs to support alternative indexes. For example I want an index on the Signed-off-by field. I use this index to give me the SHAs for the blobs Signed-off-by a particular person. In the current design I have no way of recovering the path name for these blobs other than a brute force search following every path looking for the right SHA. > > > A few things that could benefit from alternative indexing, blame, > > full-text search, automating the Maintainers file, etc. > > > > Yes, but getting rid of the tree objects and storing pathnames in > blob objects would penalize log-viewing, diffs and merges, which > are far more common operations than full-text searches in a software > project. > > > I'm just asking if we really want to make full table scans the only > > possible way to implement these types of queries. If the answer is no, > > then let's first explore how to fix things at an abstract level before > > diving into the performance arguments. > > > > Personally, I really don't care. But you should really have read Junio's > mail a bit more carefully. He explained about 'notes' that can be attached > to commits and contain arbitrary data. By all means, create your indexes > there and use them for whatever you like, but leave the foundation on which > git was built *alone*. The design hasn't changed since April 2006 (subtrees > were introduced April 26, I think), because it's a *good* design. > > -- > Andreas Ericsson andreas.ericsson@op5.se > OP5 AB www.op5.se > Tel: +46 8-230225 Fax: +46 8-230231 > > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-05 13:41 ` Jon Smirl @ 2007-09-05 14:51 ` Andreas Ericsson 2007-09-05 15:37 ` Jon Smirl 2007-09-05 19:52 ` Andy Parkins 1 sibling, 1 reply; 39+ messages in thread From: Andreas Ericsson @ 2007-09-05 14:51 UTC (permalink / raw) To: Jon Smirl; +Cc: Theodore Tso, Junio C Hamano, Git Mailing List Jon Smirl wrote: > > The path name field needs to be moved back into the blobs to support > alternative indexes. For example I want an index on the Signed-off-by > field. I use this index to give me the SHAs for the blobs > Signed-off-by a particular person. In the current design I have no way > of recovering the path name for these blobs other than a brute force > search following every path looking for the right SHA. > Ah, there we go. A use-case at last :) So now we have a concrete problem that we can formulate thus: "How can one create a database listing the relationship between 'signers' and blobs?" So the second question: Do you seriously argue that git should take a huge performance loss on its common operations to accommodate a need that I suspect very few people have? -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-05 14:51 ` Andreas Ericsson @ 2007-09-05 15:37 ` Jon Smirl 2007-09-05 15:54 ` Julian Phillips 0 siblings, 1 reply; 39+ messages in thread From: Jon Smirl @ 2007-09-05 15:37 UTC (permalink / raw) To: Andreas Ericsson; +Cc: Theodore Tso, Junio C Hamano, Git Mailing List On 9/5/07, Andreas Ericsson <ae@op5.se> wrote: > Jon Smirl wrote: > > > > The path name field needs to be moved back into the blobs to support > > alternative indexes. For example I want an index on the Signed-off-by > > field. I use this index to give me the SHAs for the blobs > > Signed-off-by a particular person. In the current design I have no way > > of recovering the path name for these blobs other than a brute force > > search following every path looking for the right SHA. > > > > Ah, there we go. A use-case at last :) > > So now we have a concrete problem that we can formulate thus: > "How can one create a database listing the relationship between 'signers' > and blobs?" > > So the second question: Do you seriously argue that git should take a > huge performance loss on its common operations to accommodate a need that > I suspect very few people have? Why do you keep jumping to a performance loss? Both schemes will have an index based on paths. The problem is how those indexes are constructed, not the existence of the index. Moving the paths into the blobs in no way prevents you from creating an index on that field. The problem is that the SHAs have been intertwined with the tree nodes. This blending has made it impossible to create other indexes on the blobs. The path index in the flat scheme will probably look just like tree nodes do today but these new tree nodes won't be intertwined with the SHAs. > > -- > Andreas Ericsson andreas.ericsson@op5.se > OP5 AB www.op5.se > Tel: +46 8-230225 Fax: +46 8-230231 > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-05 15:37 ` Jon Smirl @ 2007-09-05 15:54 ` Julian Phillips 2007-09-05 16:12 ` Jon Smirl 0 siblings, 1 reply; 39+ messages in thread From: Julian Phillips @ 2007-09-05 15:54 UTC (permalink / raw) To: Jon Smirl Cc: Andreas Ericsson, Theodore Tso, Junio C Hamano, Git Mailing List On Wed, 5 Sep 2007, Jon Smirl wrote: > On 9/5/07, Andreas Ericsson <ae@op5.se> wrote: >> Jon Smirl wrote: >>> >>> The path name field needs to be moved back into the blobs to support >>> alternative indexes. For example I want an index on the Signed-off-by >>> field. I use this index to give me the SHAs for the blobs >>> Signed-off-by a particular person. In the current design I have no way >>> of recovering the path name for these blobs other than a brute force >>> search following every path looking for the right SHA. >>> >> >> Ah, there we go. A use-case at last :) But not a brilliant one. You sign off on commits not blobs. So you go from the sign-off to paths, then to blobs. There is no need to go from blob to path unless you deliberately introduce such a need. >> >> So now we have a concrete problem that we can formulate thus: >> "How can one create a database listing the relationship between 'signers' >> and blobs?" >> >> So the second question: Do you seriously argue that git should take a >> huge performance loss on its common operations to accommodate a need that >> I suspect very few people have? > > Why do you keep jumping to a performance loss? Both schemes will have > an index based on paths. The problem is how those indexes are > constructed, not the existence of the index. Moving the paths into the > blobs in no way prevents you from creating an index on that field. But moving the path into the blob _IS_ the perfomance hit. You lose the ability to tell the two files have the same content _without even looking at the blob_. This is one of the core parts of making git operations blindingly fast. You can't throw that out, and then say that there is no performance hit. You keep talking about abstract database performance - but git is not an abstract database. It has very specific common usage patterns, and is optomisied to handle them. > > The problem is that the SHAs have been intertwined with the tree > nodes. This blending has made it impossible to create other indexes on > the blobs. > > The path index in the flat scheme will probably look just like tree > nodes do today but these new tree nodes won't be intertwined with the > SHAs. And you will have to prove that diff/merge etc. don't become very much slower before you get buy in. -- Julian --- Many receive advice, few profit by it. -- Publilius Syrus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-05 15:54 ` Julian Phillips @ 2007-09-05 16:12 ` Jon Smirl 2007-09-05 17:31 ` Julian Phillips ` (4 more replies) 0 siblings, 5 replies; 39+ messages in thread From: Jon Smirl @ 2007-09-05 16:12 UTC (permalink / raw) To: Julian Phillips Cc: Andreas Ericsson, Theodore Tso, Junio C Hamano, Git Mailing List On 9/5/07, Julian Phillips <julian@quantumfyre.co.uk> wrote: > On Wed, 5 Sep 2007, Jon Smirl wrote: > > > On 9/5/07, Andreas Ericsson <ae@op5.se> wrote: > >> Jon Smirl wrote: > >>> > >>> The path name field needs to be moved back into the blobs to support > >>> alternative indexes. For example I want an index on the Signed-off-by > >>> field. I use this index to give me the SHAs for the blobs > >>> Signed-off-by a particular person. In the current design I have no way > >>> of recovering the path name for these blobs other than a brute force > >>> search following every path looking for the right SHA. > >>> > >> > >> Ah, there we go. A use-case at last :) > > But not a brilliant one. You sign off on commits not blobs. So you go > from the sign-off to paths, then to blobs. There is no need to go from > blob to path unless you deliberately introduce such a need. Use blame for an example. Blame has to crawl every commit to see if it touched the file. It keeps doing this until it figures out the last author for every line in the file. Worse case blame has to crawl every commit in the data store. > >> > >> So now we have a concrete problem that we can formulate thus: > >> "How can one create a database listing the relationship between 'signers' > >> and blobs?" > >> > >> So the second question: Do you seriously argue that git should take a > >> huge performance loss on its common operations to accommodate a need that > >> I suspect very few people have? > > > > Why do you keep jumping to a performance loss? Both schemes will have > > an index based on paths. The problem is how those indexes are > > constructed, not the existence of the index. Moving the paths into the > > blobs in no way prevents you from creating an index on that field. > > But moving the path into the blob _IS_ the perfomance hit. You lose the > ability to tell the two files have the same content _without even looking > at the blob_. This is one of the core parts of making git operations > blindingly fast. You can't throw that out, and then say that there is no > performance hit. > > You keep talking about abstract database performance - but git is not an > abstract database. It has very specific common usage patterns, and is > optomisied to handle them. > > > > > The problem is that the SHAs have been intertwined with the tree > > nodes. This blending has made it impossible to create other indexes on > > the blobs. > > > > The path index in the flat scheme will probably look just like tree > > nodes do today but these new tree nodes won't be intertwined with the > > SHAs. > > And you will have to prove that diff/merge etc. don't become very much > slower before you get buy in. > > -- > Julian > > --- > Many receive advice, few profit by it. > -- Publilius Syrus > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-05 16:12 ` Jon Smirl @ 2007-09-05 17:31 ` Julian Phillips 2007-09-06 1:27 ` Kyle Moffett 2007-09-05 17:39 ` Mike Hommey ` (3 subsequent siblings) 4 siblings, 1 reply; 39+ messages in thread From: Julian Phillips @ 2007-09-05 17:31 UTC (permalink / raw) To: Jon Smirl Cc: Andreas Ericsson, Theodore Tso, Junio C Hamano, Git Mailing List On Wed, 5 Sep 2007, Jon Smirl wrote: > On 9/5/07, Julian Phillips <julian@quantumfyre.co.uk> wrote: >> On Wed, 5 Sep 2007, Jon Smirl wrote: >> >>> On 9/5/07, Andreas Ericsson <ae@op5.se> wrote: >>>> Jon Smirl wrote: >>>>> >>>>> The path name field needs to be moved back into the blobs to support >>>>> alternative indexes. For example I want an index on the Signed-off-by >>>>> field. I use this index to give me the SHAs for the blobs >>>>> Signed-off-by a particular person. In the current design I have no way >>>>> of recovering the path name for these blobs other than a brute force >>>>> search following every path looking for the right SHA. >>>>> >>>> >>>> Ah, there we go. A use-case at last :) >> >> But not a brilliant one. You sign off on commits not blobs. So you go >> from the sign-off to paths, then to blobs. There is no need to go from >> blob to path unless you deliberately introduce such a need. > > Use blame for an example. Blame has to crawl every commit to see if it > touched the file. It keeps doing this until it figures out the last > author for every line in the file. Worse case blame has to crawl every > commit in the data store. And this is advantaged by having the path in the blob how? The important information here is knowing which commits touched the file - this information is expensive in git because it is snapshot based. You have to go back through all the commits looking for changes to the given path. The information you might want to cache is which commits touched the file, which you could do without changing the current data storage. Presumably you are suggesting that such a cache would be cleaner with the filename in the blob? Or do you think that it would somehow be faster to create? If so, how? -- Julian --- Humor in the Court: Q: (Showing man picture.) That's you? A: Yes, sir. Q: And you were present when the picture was taken, right? ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-05 17:31 ` Julian Phillips @ 2007-09-06 1:27 ` Kyle Moffett 0 siblings, 0 replies; 39+ messages in thread From: Kyle Moffett @ 2007-09-06 1:27 UTC (permalink / raw) To: Julian Phillips Cc: Jon Smirl, Andreas Ericsson, Theodore Tso, Junio C Hamano, Git Mailing List On Sep 05, 2007, at 13:31:43, Julian Phillips wrote: > And this is advantaged by having the path in the blob how? The > important information here is knowing which commits touched the > file - this information is expensive in git because it is snapshot > based. You have to go back through all the commits looking for > changes to the given path. The information you might want to cache > is which commits touched the file, which you could do without > changing the current data storage. Presumably you are suggesting > that such a cache would be cleaner with the filename in the blob? > Or do you think that it would somehow be faster to create? If so, > how? The only possible reason I can think of for moving data into the blob would be to make a POSIX-compliant git-like filesystem, and EVEN THEN you would NOT move the path out of the tree objects. In order to have somewhat consistent inodes (and also for performance when changing 4 bytes in a 40GB file) you would want to have 3 different types of "inode" objects: 1) 4-64k of (metadata + filedata) 2) 4-64k of (metadata + list of 4-64k filedata blobs) 3) 4-64k of (metadata + list of 4-64k lists of filedata blobs) On the other hand... that isn't GIT, it's something completely different with a very different usage pattern and set of requirements. And you still don't put the path name in the objects, just the permissions and other attributes/metadata. <Random Thought Experiment> You would of course want to better define those 4-64k limits for allocation and performance reasons, but a double-indirect table of SHA128s with 64kb chunks lets you address up to 1TB of file data, and for each additional power-of-two increase in the chunk size you get 8 times the storage space. Furthermore, the actual double-indirect tables for an 8TB file using 128k chunks would be all of 64MB, for a more reasonable 4GB file with 32k tables (max of 128GB) it would be maybe 128kB of indirect SHA1 hash tables. </Random Thought Experiment> Cheers, Kyle Moffett ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-05 16:12 ` Jon Smirl 2007-09-05 17:31 ` Julian Phillips @ 2007-09-05 17:39 ` Mike Hommey 2007-09-06 8:49 ` Andreas Ericsson ` (2 subsequent siblings) 4 siblings, 0 replies; 39+ messages in thread From: Mike Hommey @ 2007-09-05 17:39 UTC (permalink / raw) To: Jon Smirl Cc: Julian Phillips, Andreas Ericsson, Theodore Tso, Junio C Hamano, Git Mailing List On Wed, Sep 05, 2007 at 12:12:28PM -0400, Jon Smirl <jonsmirl@gmail.com> wrote: > On 9/5/07, Julian Phillips <julian@quantumfyre.co.uk> wrote: > > On Wed, 5 Sep 2007, Jon Smirl wrote: > > > > > On 9/5/07, Andreas Ericsson <ae@op5.se> wrote: > > >> Jon Smirl wrote: > > >>> > > >>> The path name field needs to be moved back into the blobs to support > > >>> alternative indexes. For example I want an index on the Signed-off-by > > >>> field. I use this index to give me the SHAs for the blobs > > >>> Signed-off-by a particular person. In the current design I have no way > > >>> of recovering the path name for these blobs other than a brute force > > >>> search following every path looking for the right SHA. > > >>> > > >> > > >> Ah, there we go. A use-case at last :) > > > > But not a brilliant one. You sign off on commits not blobs. So you go > > from the sign-off to paths, then to blobs. There is no need to go from > > blob to path unless you deliberately introduce such a need. > > Use blame for an example. Blame has to crawl every commit to see if it > touched the file. It keeps doing this until it figures out the last > author for every line in the file. Worse case blame has to crawl every > commit in the data store. And why exactly would you need to change blobs to contain path for blame to be faster ? Or more generally, what, in the current way of git doing things, prevents you from adding an index to $THE_DATA_YOU_LIKE, exactly ? >From the very few use cases you've given, I see nothing preventing to create an additional index from the data git currently uses. Mike ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-05 16:12 ` Jon Smirl 2007-09-05 17:31 ` Julian Phillips 2007-09-05 17:39 ` Mike Hommey @ 2007-09-06 8:49 ` Andreas Ericsson 2007-09-06 9:09 ` Junio C Hamano 2007-09-06 12:56 ` Johannes Schindelin 2007-09-07 0:33 ` Martin Langhoff 4 siblings, 1 reply; 39+ messages in thread From: Andreas Ericsson @ 2007-09-06 8:49 UTC (permalink / raw) To: Jon Smirl; +Cc: Julian Phillips, Theodore Tso, Junio C Hamano, Git Mailing List Jon Smirl wrote: > On 9/5/07, Julian Phillips <julian@quantumfyre.co.uk> wrote: >> On Wed, 5 Sep 2007, Jon Smirl wrote: >> >>> On 9/5/07, Andreas Ericsson <ae@op5.se> wrote: >>>> Jon Smirl wrote: >>>>> The path name field needs to be moved back into the blobs to support >>>>> alternative indexes. For example I want an index on the Signed-off-by >>>>> field. I use this index to give me the SHAs for the blobs >>>>> Signed-off-by a particular person. In the current design I have no way >>>>> of recovering the path name for these blobs other than a brute force >>>>> search following every path looking for the right SHA. >>>>> >>>> Ah, there we go. A use-case at last :) >> But not a brilliant one. You sign off on commits not blobs. So you go >> from the sign-off to paths, then to blobs. There is no need to go from >> blob to path unless you deliberately introduce such a need. > > Use blame for an example. Blame has to crawl every commit to see if it > touched the file. It keeps doing this until it figures out the last > author for every line in the file. Worse case blame has to crawl every > commit in the data store. > Estimated daily uses of git-blame, world-wide: few Estimated daily uses of git-{merge,diff}, worldwide: lots Code and benchmarks, or I'm not buying it. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-06 8:49 ` Andreas Ericsson @ 2007-09-06 9:09 ` Junio C Hamano 2007-09-06 11:03 ` Wincent Colaiuta 0 siblings, 1 reply; 39+ messages in thread From: Junio C Hamano @ 2007-09-06 9:09 UTC (permalink / raw) To: Andreas Ericsson Cc: Jon Smirl, Julian Phillips, Theodore Tso, Git Mailing List Andreas Ericsson <ae@op5.se> writes: > Estimated daily uses of git-blame, world-wide: few > Estimated daily uses of git-{merge,diff}, worldwide: lots Which makes the author of git-blame weep X-<. The real issue is that embedding pathname in blob does _not_ help "git blame" but would actively hurt it. A file with the identical contents moved between the parent to child commit shares the same blob object and same object name in the real git. Jon's modified system that hashes pathname together with the contents would have them as two completely unrelated objects with different object names, which only means that even 100% similarity rename case becomes as expensive to find as renames of lower similarity, which needs to expand and look into blob contents. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-06 9:09 ` Junio C Hamano @ 2007-09-06 11:03 ` Wincent Colaiuta 0 siblings, 0 replies; 39+ messages in thread From: Wincent Colaiuta @ 2007-09-06 11:03 UTC (permalink / raw) To: Junio C Hamano Cc: Andreas Ericsson, Jon Smirl, Julian Phillips, Theodore Tso, Git Mailing List El 6/9/2007, a las 11:09, Junio C Hamano escribió: > Andreas Ericsson <ae@op5.se> writes: > >> Estimated daily uses of git-blame, world-wide: few >> Estimated daily uses of git-{merge,diff}, worldwide: lots > > Which makes the author of git-blame weep X-<. But the few times when you do use git-blame (apart from when you use it out of sheer curiosity) it usually saves you backside (ie. when you've located a problem in the code and you want to know the who/ what/when/why of the offending commit). Cheers, Wincent ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-05 16:12 ` Jon Smirl ` (2 preceding siblings ...) 2007-09-06 8:49 ` Andreas Ericsson @ 2007-09-06 12:56 ` Johannes Schindelin 2007-09-06 18:14 ` Steven Grimm 2007-09-07 0:33 ` Martin Langhoff 4 siblings, 1 reply; 39+ messages in thread From: Johannes Schindelin @ 2007-09-06 12:56 UTC (permalink / raw) To: Jon Smirl Cc: Julian Phillips, Andreas Ericsson, Theodore Tso, Junio C Hamano, Git Mailing List Hi, On Wed, 5 Sep 2007, Jon Smirl wrote: > On 9/5/07, Julian Phillips <julian@quantumfyre.co.uk> wrote: > > On Wed, 5 Sep 2007, Jon Smirl wrote: > > > > >> Ah, there we go. A use-case at last :) > > > > But not a brilliant one. You sign off on commits not blobs. So you > > go from the sign-off to paths, then to blobs. There is no need to go > > from blob to path unless you deliberately introduce such a need. > > Use blame for an example. Blame has to crawl every commit to see if it > touched the file. It keeps doing this until it figures out the last > author for every line in the file. Worse case blame has to crawl every > commit in the data store. But you can add _yet another_ index to it, which can be generated on the fly, so that Git only has to generate the information once, and then reuse it later. As a benefit of this method, the underlying well-tested structure needs no change at all. BTW could you please, please, please cut the quoted message that you are _not_ responding to? It really _wastes_ my time. Ciao, Dscho ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-06 12:56 ` Johannes Schindelin @ 2007-09-06 18:14 ` Steven Grimm 0 siblings, 0 replies; 39+ messages in thread From: Steven Grimm @ 2007-09-06 18:14 UTC (permalink / raw) To: Johannes Schindelin Cc: Jon Smirl, Julian Phillips, Andreas Ericsson, Theodore Tso, Junio C Hamano, Git Mailing List Johannes Schindelin wrote: > But you can add _yet another_ index to it, which can be generated on the > fly, so that Git only has to generate the information once, and then reuse > it later. As a benefit of this method, the underlying well-tested > structure needs no change at all. > And in fact, you can do this today, without modifying git-blame at all, by (ab)using its "-S" option (which lets you specify a custom ancestry chain to search). By coincidence, I was just showing some people at my office how to do this yesterday. I'll cut-and-paste from the email I sent them. I am not claiming this is nearly as desirable as a built-in, auto-updated secondary index, but it proves the concept, anyway. Fast-to-generate version: git-rev-list HEAD -- main.c | awk '{if (last) print last " " $0; last=$0;}' > /tmp/revlist This speeds things up a lot, because git blame doesn't have to examine other revisions: time git blame main.c 1.56s user 0.30s system 99% cpu 1.868 total time git blame -S /tmp/revlist main.c 0.21s user 0.03s system 96% cpu 0.249 total The bad news is that generating that revision list is a bit slow, and if you do it the naive way I suggested above, you can't use the rev list with the -M option (to follow renames). The good news is that it's possible to have that too if you generate a list of revisions that includes the renames: # Generate a list of all revisions in the right order (only need to do this once, not once per file) git rev-list HEAD > /tmp/all-revs # Generate a list of the revisions that touched this file, following copies/renames. # Could do this in fewer commands but this is hopefully easier to follow. git blame --porcelain -M main.c | \ egrep '^[0-9a-f]{40}' | \ cut -d' ' -f1 | \ fgrep -f - /tmp/all-revs | \ awk '{if (last) print last " " $0; last=$0;}' > /tmp/revlist Then -M is fast too: time git blame -M main.c 1.72s user 0.27s system 89% cpu 2.219 total time git blame -M -S /tmp/revlist main.c 0.29s user 0.03s system 93% cpu 0.341 total Oddly, if you use the -S option, "git blame -C" actually gets significantly *slower*. I am not sure why. -Steve ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-05 16:12 ` Jon Smirl ` (3 preceding siblings ...) 2007-09-06 12:56 ` Johannes Schindelin @ 2007-09-07 0:33 ` Martin Langhoff 4 siblings, 0 replies; 39+ messages in thread From: Martin Langhoff @ 2007-09-07 0:33 UTC (permalink / raw) To: Jon Smirl Cc: Julian Phillips, Andreas Ericsson, Theodore Tso, Junio C Hamano, Git Mailing List On 9/6/07, Jon Smirl <jonsmirl@gmail.com> wrote: > Use blame for an example. Blame has to crawl every commit to see if it Sure. Build a quick dedicated index for that and measure - cost (size and commit/fetch costs) - benefit - frequency of usage git is a special-purpouse DB that does great for certain access patterns. Have a look at monotone for a design that looks a lot like git but is backed by a general purpouse DB and does equally poorly for all access patterns ;-) > It keeps doing this until it figures out the last > author for every line in the file. Worse case blame has to crawl every > commit in the data store. Yep. Can we get a minimal-cost index with just enough hints that can speed up blame, and perhaps git log with/very/deep/path? Probably! That's worth pursuing sure. martin ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-05 13:41 ` Jon Smirl 2007-09-05 14:51 ` Andreas Ericsson @ 2007-09-05 19:52 ` Andy Parkins 1 sibling, 0 replies; 39+ messages in thread From: Andy Parkins @ 2007-09-05 19:52 UTC (permalink / raw) To: git; +Cc: Jon Smirl, Andreas Ericsson, Theodore Tso, Junio C Hamano On Wednesday 2007, September 05, Jon Smirl wrote: > The path name field needs to be moved back into the blobs to support > alternative indexes. For example I want an index on the Signed-off-by > field. I use this index to give me the SHAs for the blobs > Signed-off-by a particular person. In the current design I have no way > of recovering the path name for these blobs other than a brute force > search following every path looking for the right SHA. Erm, if that's your only way then you designed your index incorrectly. 1. Signed-Off-By lines appear in commits, so your index should be an index of SOB name against commit hash 2. Lookup the commit for that commit hash. As usual this is blindlingly git-fastic. 3. That commit blob contains a tree hash. Look it up. As usual this is blindingly git-fastic 4. Start gathering blobs for that tree. Fast, fast, fast. 5. Any subtree objects you come across, goto 4. This is not a brute force lookup and it's stuff that git is really good at anyway. I'm really not sure I see what problem you're trying to solve. Whatever index you want, you could keep and maintain if you wanted to without impacting git's core storage at all. Andy -- Dr Andy Parkins, M Eng (hons), MIET andyparkins@gmail.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 15:23 Git's database structure Jon Smirl 2007-09-04 15:55 ` Andreas Ericsson 2007-09-04 16:28 ` Jon Smirl @ 2007-09-04 17:19 ` Julian Phillips 2007-09-04 17:30 ` Jon Smirl 2 siblings, 1 reply; 39+ messages in thread From: Julian Phillips @ 2007-09-04 17:19 UTC (permalink / raw) To: Jon Smirl; +Cc: Git Mailing List On Tue, 4 Sep 2007, Jon Smirl wrote: > Let's back up a little bit from "Caclulating tree node". What are the > elements of git's data structures? > > Right now we have an index structure (tree nodes) integrated in to a > base table. Integrating indexing into the data is not normally done in > a database. Doing a normalization analysis like this may expose flaws > in the way the data is structured. Of course we may also decide to > leave everything the way it is. > > What about the special status of a rename? In the current model we > effectively have three tables. > > commit - a set of all SHAs in the commit, previous commit, comment, author, etc > blob - a file, permissions, etc. > file names - name, SHA > > The file name table is encoded as an index and it has been > intermingled with the commit table. > > Looking at this from a set theory angle brings up the question, do we > really have three tables and file names are an independent variable > from the blobs, or should file names be an attribute of the blob? There isn't a one-to-one mapping of file names to blobs. The blob only describes the contents of the file. In the extreme case you could have one blob for every single file in your tree. For example: # git ls-tree -r HEAD 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c bar/foo 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo2 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo3 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo4 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo5 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo6 > > How this gets structured in the db is an independent question about > how renames get detected on a commit. The current scheme for detecting > renames by comparing diffs is working fine. The question is, once we > detect a rename how should it be stored? > > Ignoring the performance impacts and looking at the problem from the > set theory view point, should: > the pathnames be in their own table with a row for each alias > the pathnames be stored as an attribute of the blob > > Both of these are the same information, we're just looking at how > things are normalized. > > -- Julian --- "You shouldn't make my toaster angry." -- Household security explained in "Johnny Quest" ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 17:19 ` Julian Phillips @ 2007-09-04 17:30 ` Jon Smirl 2007-09-04 18:51 ` Andreas Ericsson 0 siblings, 1 reply; 39+ messages in thread From: Jon Smirl @ 2007-09-04 17:30 UTC (permalink / raw) To: Julian Phillips; +Cc: Git Mailing List On 9/4/07, Julian Phillips <julian@quantumfyre.co.uk> wrote: > On Tue, 4 Sep 2007, Jon Smirl wrote: > > > Let's back up a little bit from "Caclulating tree node". What are the > > elements of git's data structures? > > > > Right now we have an index structure (tree nodes) integrated in to a > > base table. Integrating indexing into the data is not normally done in > > a database. Doing a normalization analysis like this may expose flaws > > in the way the data is structured. Of course we may also decide to > > leave everything the way it is. > > > > What about the special status of a rename? In the current model we > > effectively have three tables. > > > > commit - a set of all SHAs in the commit, previous commit, comment, author, etc > > blob - a file, permissions, etc. > > file names - name, SHA > > > > The file name table is encoded as an index and it has been > > intermingled with the commit table. > > > > Looking at this from a set theory angle brings up the question, do we > > really have three tables and file names are an independent variable > > from the blobs, or should file names be an attribute of the blob? > > There isn't a one-to-one mapping of file names to blobs. The blob only > describes the contents of the file. In the extreme case you could have > one blob for every single file in your tree. For example: > > # git ls-tree -r HEAD > 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c bar/foo > 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo > 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo2 > 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo3 > 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo4 > 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo5 > 100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo6 Both schemes support aliasing. In the flat scheme you would create a second blob which contains the file and the aliased path name. When the blob gets delta'd the second copy of the file will disappear. I'm not proposing a change to data being stored in git, it is a proposal to consider the impacts of how this data has been normalized in the data store. > > How this gets structured in the db is an independent question about > > how renames get detected on a commit. The current scheme for detecting > > renames by comparing diffs is working fine. The question is, once we > > detect a rename how should it be stored? > > > > Ignoring the performance impacts and looking at the problem from the > > set theory view point, should: > > the pathnames be in their own table with a row for each alias > > the pathnames be stored as an attribute of the blob > > > > Both of these are the same information, we're just looking at how > > things are normalized. > > > > > > -- > Julian > > --- > "You shouldn't make my toaster angry." > -- Household security explained in "Johnny Quest" > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Git's database structure 2007-09-04 17:30 ` Jon Smirl @ 2007-09-04 18:51 ` Andreas Ericsson 0 siblings, 0 replies; 39+ messages in thread From: Andreas Ericsson @ 2007-09-04 18:51 UTC (permalink / raw) To: Jon Smirl; +Cc: Julian Phillips, Git Mailing List Jon Smirl wrote: > > I'm not proposing a change to data being stored in git, it is a > proposal to consider the impacts of how this data has been normalized > in the data store. > But to what end? We all *know* the impacts: * Excellent performance at what it does now. * Currently zero capability to replace google as the #1 search engine. Since replacing google's db was never, and will never, be the goal of git, what is it you wish to achieve? Seriously, I'm dying to know, so please tell me. If you have already and I'm too daft to understand it, humor me and reiterate :-) -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2007-09-07 0:34 UTC | newest] Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2007-09-04 15:23 Git's database structure Jon Smirl 2007-09-04 15:55 ` Andreas Ericsson 2007-09-04 16:07 ` Mike Hommey 2007-09-04 16:10 ` Andreas Ericsson 2007-09-04 16:19 ` Jon Smirl 2007-09-04 16:29 ` Andreas Ericsson 2007-09-04 17:09 ` Jeff King 2007-09-04 20:17 ` David Tweed 2007-09-04 17:21 ` Junio C Hamano 2007-09-04 16:28 ` Jon Smirl 2007-09-04 16:31 ` Andreas Ericsson 2007-09-04 16:47 ` Jon Smirl 2007-09-04 16:51 ` Andreas Ericsson 2007-09-04 17:25 ` Junio C Hamano 2007-09-04 17:44 ` Jon Smirl 2007-09-04 18:04 ` Mike Hommey 2007-09-04 19:44 ` Reece Dunn 2007-09-04 18:06 ` Junio C Hamano 2007-09-04 21:25 ` Theodore Tso 2007-09-04 21:54 ` Jon Smirl 2007-09-05 7:18 ` Andreas Ericsson 2007-09-05 13:41 ` Jon Smirl 2007-09-05 14:51 ` Andreas Ericsson 2007-09-05 15:37 ` Jon Smirl 2007-09-05 15:54 ` Julian Phillips 2007-09-05 16:12 ` Jon Smirl 2007-09-05 17:31 ` Julian Phillips 2007-09-06 1:27 ` Kyle Moffett 2007-09-05 17:39 ` Mike Hommey 2007-09-06 8:49 ` Andreas Ericsson 2007-09-06 9:09 ` Junio C Hamano 2007-09-06 11:03 ` Wincent Colaiuta 2007-09-06 12:56 ` Johannes Schindelin 2007-09-06 18:14 ` Steven Grimm 2007-09-07 0:33 ` Martin Langhoff 2007-09-05 19:52 ` Andy Parkins 2007-09-04 17:19 ` Julian Phillips 2007-09-04 17:30 ` Jon Smirl 2007-09-04 18:51 ` Andreas Ericsson
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).