From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff King Subject: Re: Git's database structure Date: Tue, 4 Sep 2007 13:09:22 -0400 Message-ID: <20070904170921.GA31300@coredump.intra.peff.net> References: <9e4733910709040823k731f0ffchba1f93bdb4a8373d@mail.gmail.com> <46DD7FE4.1060908@op5.se> <9e4733910709040919u3d252b91s2785ed4d20086c88@mail.gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Andreas Ericsson , Git Mailing List To: Jon Smirl X-From: git-owner@vger.kernel.org Tue Sep 04 19:13:47 2007 Return-path: Envelope-to: gcvg-git@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1ISbu5-0002gx-1G for gcvg-git@gmane.org; Tue, 04 Sep 2007 19:09:29 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754618AbXIDRJY (ORCPT ); Tue, 4 Sep 2007 13:09:24 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752160AbXIDRJY (ORCPT ); Tue, 4 Sep 2007 13:09:24 -0400 Received: from 66-23-211-5.clients.speedfactory.net ([66.23.211.5]:4990 "EHLO peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751776AbXIDRJY (ORCPT ); Tue, 4 Sep 2007 13:09:24 -0400 Received: (qmail 30362 invoked by uid 111); 4 Sep 2007 17:09:23 -0000 Received: from coredump.intra.peff.net (HELO coredump.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.32) with SMTP; Tue, 04 Sep 2007 13:09:23 -0400 Received: by coredump.intra.peff.net (sSMTP sendmail emulation); Tue, 04 Sep 2007 13:09:22 -0400 Content-Disposition: inline In-Reply-To: <9e4733910709040919u3d252b91s2785ed4d20086c88@mail.gmail.com> Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org Archived-At: On Tue, Sep 04, 2007 at 12:19:33PM -0400, Jon Smirl wrote: > By introducing tree nodes you have blended a specific indexing scheme > into the data. There are many other ways the path names could be > indexed hash tables, binary trees, etc. That is correct. However, given that indexing scheme, many of the common operations just "fall out" simply and efficiently, without the need to keep separate indices. So yes, git is geared towards a particular set of operations. Your complaint seems to be two-fold: 1. there is an inelegance in the blending of data and indexing. The problem with changing this is: a. we are all already using git, and it would require completely re-vamping the core data structure b. there is some feeling that the blending is necessary for performance. Given the difficulty of (a), I think you would have to provide compelling evidence (i.e., numbers) that a git-like system based around set theory with separate indices would perform as well. 2. you want perform some operations to which the hierarchy is not well-suited. In this case, I think you can get by with the same solution you have proposed already: indices external to the data structure (in fact, this is exactly what Google is doing: taking hierarchical URLs and indexing them in different ways). Have you taken a look at the pack v4 work by Shawn and Nicolas? It is an attempt to build such indices at pack time (but keeping the core git data structure intact). -Peff