git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Joshua Juran <jjuran@gmail.com>
To: Avery Pennarun <apenwarr@gmail.com>
Cc: "Shawn O. Pearce" <spearce@spearce.org>,
	Finn Arne Gangstad <finnag@pvv.org>,
	git@vger.kernel.org
Subject: Re: inotify daemon speedup for git [POC/HACK]
Date: Tue, 27 Jul 2010 18:14:47 -0700	[thread overview]
Message-ID: <52EDBD9A-2961-4F66-88B3-07BF873FA994@gmail.com> (raw)
In-Reply-To: <AANLkTimkLrTwavErFkyaUTSVU-2s3me5f+cyqNFp7n+D@mail.gmail.com>

On Jul 27, 2010, at 5:18 PM, Avery Pennarun wrote:

> On Tue, Jul 27, 2010 at 8:00 PM, Shawn O. Pearce  
> <spearce@spearce.org> wrote:
>> Avery Pennarun <apenwarr@gmail.com> wrote:
>>> While we're here, it's probably worth mentioning that git's index  
>>> file
>>> format (which stores a sequential list of full paths in alphabetical
>>> order, instead of an actual hierarchy) does become a bottleneck when
>>> you actually have a huge number of files in your repo (like  
>>> literally
>>> a million).  You can't actually binary search through the index!   
>>> The
>>> current implementation of submodules allows you to dodge that
>>> scalability problem since you end up with multiple smaller index
>>> files.  Anyway, that's fixable too.
>>
>> Yes.
>>
>> More than once I've been tempted to rewrite the on-disk (and I guess
>> in-memory) format of the index.  And then I remember how painful that
>> stuff is in either C git.git or JGit, and I back away slowly.  :-)
>>
>> Ideally the index is organized the same way the trees are, but
>> you still can't do a really good binary search because of the
>> ass-backwards name sorting rule for trees.  But for performance
>> reasons you still want to keep the entire index in a single file,
>> an index per directory (aka SVN/CVS) is too slow for the common
>> case of <30k files.
>
> Really?  What's wrong with the name sorting rule?  I kind of like it.
>
> bup's current index - after I abandoned my clone of the git one since
> it was too slow with insane numbers of files - is very fast for reads
> and in-place updates using mmap.
>
> Essentially, it's a tree, starting from the outermost leafs and
> leading toward the entry at the very end of the file, which is the
> root.  (The idea of doing it backwards was that I could write the file
> sequentially.  In retrospect, that was probably an unnecessarily
> brain-bending waste of time and the root should have been the first
> entry instead.)
>
> For speed, the bup index can just mark entries as deleted using a flag
> rather than actually rewriting the whole indexfile.  Unfortunately, I
> failed to make it sufficiently flexible to *add* new entries without
> needing to rewrite the whole thing.  In bup, that's a big deal
> (especially since python is kind of slow and there are typically >1
> million files in the index).  In git, it's maybe not so bad; after
> all, the current implementation rewrites the index *every* time and
> nobody notices.

Okay, I have an idea.  If I understand correctly, the index is a flat  
database of records including a pathname and several fixed-length  
fields.  Since the records are not fixed-length, only sequential  
search is possible, even though the records are sorted by pathname.

Here's the idea:  Divide the database into blocks.  Each block  
contains a block header and the records belonging to a single  
directory.  The block header contains the length of the block and also  
the offset to the next block, in bytes.  In addition to a record for  
each indexed file in a directory, a directory's block also contains  
records for subdirectories. The mode flags in a record indicate the  
record type.  Directory records contain an offset in bytes to the  
block for that directory (in place of the SHA-1 hash).  The block list  
is preceded by a file header, which includes the offset in bytes of  
the root block.  All offsets are from the beginning of the file.

Instead of having to search among every file in the repository, the  
search space now includes only the immediate descendants of each  
directory in the target file's path.  If a directory is modified then  
it can either be rewritten in place (if there's sufficient room) or  
appended to the end of the file (requiring the old and new  
sequentially preceding blocks and the parent directory's block to  
update their offsets).

Is this useful?

Josh

  reply	other threads:[~2010-07-28  1:14 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-07-27 12:20 inotify daemon speedup for git [POC/HACK] Finn Arne Gangstad
2010-07-27 23:29 ` Avery Pennarun
2010-07-27 23:39   ` Joshua Juran
2010-07-27 23:51     ` Avery Pennarun
2010-07-28  0:00       ` Shawn O. Pearce
2010-07-28  0:18         ` Avery Pennarun
2010-07-28  1:14           ` Joshua Juran [this message]
2010-07-28  1:31             ` Avery Pennarun
2010-07-28  6:03               ` Sverre Rabbelier
2010-07-28  6:06                 ` Jonathan Nieder
2010-07-28  7:44                   ` Ævar Arnfjörð Bjarmason
2010-07-28 11:08                     ` Theodore Tso
2010-07-28  8:20                 ` Nguyen Thai Ngoc Duy
2010-08-13 17:53                   ` Enrico Weigelt
2010-07-28 13:09           ` Jakub Narebski
2010-07-28 13:06         ` Jakub Narebski
2010-08-13 17:58           ` Enrico Weigelt
2010-07-27 23:58 ` Sverre Rabbelier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=52EDBD9A-2961-4F66-88B3-07BF873FA994@gmail.com \
    --to=jjuran@gmail.com \
    --cc=apenwarr@gmail.com \
    --cc=finnag@pvv.org \
    --cc=git@vger.kernel.org \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).