git@vger.kernel.org mailing list mirror (one of many)
* inotify daemon speedup for git [POC/HACK]
@ 2010-07-27 12:20 Finn Arne Gangstad
  2010-07-27 23:29 ` Avery Pennarun
  2010-07-27 23:58 ` Sverre Rabbelier
  0 siblings, 2 replies; 18+ messages in thread
From: Finn Arne Gangstad @ 2010-07-27 12:20 UTC (permalink / raw)
  To: git; +Cc: Avery Pennarun

Reading through the thread about subtree I noticed Avery mentioning
using inotify to speed up git status & co.

Here is a quick hack I did some time ago to test this out. To use it,
call "igit" instead of "git" for all commands you want to speed up.

There is one minor nit: The speedup gain is zero :) git still
traverses all directories to look for .gitignore files, which seems to
totally kill the optimisation.

To use it, put igit and git-inotify-daemon.pl in your PATH, and run
"git config core.ignorestat true" in the repositories you want to test
it with. The igit wrapper will run git update-index
--no-assume-unchanged on all modified files before running any real
git command.

To get inotify to ignore all changes that the git commands themselves
perform, the "igit" wrapper kills the currently running daemon. It
then reads the list of updated files and runs git update-index
--no-assume-unchanged on them. Then the git command is run, and
finally the daemon is started again.

I had to do one tiny modification to git to make update-index ignore
bad paths.


igit - a git wrapper with an inotify daemon

Linux only - requires inotify-tools to be installed. This is just a
quick hack/proof of concept!
update-index: Do not error out on bad paths, just warn
---
 .gitignore             |    1 +
 builtin/update-index.c |    2 +-
 git-inotify-daemon.pl  |   28 ++++++++++++++++++++++++++++
 igit                   |   22 ++++++++++++++++++++++
 4 files changed, 52 insertions(+), 1 deletions(-)
 create mode 100755 git-inotify-daemon.pl
 create mode 100755 igit

diff --git a/.gitignore b/.gitignore
index 14e2b6b..fa67132 100644
--- a/.gitignore
+++ b/.gitignore
@@ -204,3 +204,4 @@
 *.pdb
 /Debug/
 /Release/
+.igit-*
diff --git a/builtin/update-index.c b/builtin/update-index.c
index 3ab214d..c905d78 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -282,7 +282,7 @@ static void update_one(const char *path, const char *prefix, int prefix_length)
 	}
 	if (mark_valid_only) {
 		if (mark_ce_flags(p, CE_VALID, mark_valid_only == MARK_FLAG))
-			die("Unable to mark file %s", path);
+			fprintf(stderr, "Unable to mark file %s\n", path);
 		goto free_return;
 	}
 	if (mark_skip_worktree_only) {
diff --git a/git-inotify-daemon.pl b/git-inotify-daemon.pl
new file mode 100755
index 0000000..a57ceef
--- /dev/null
+++ b/git-inotify-daemon.pl
@@ -0,0 +1,28 @@
+#!/usr/bin/env perl
+# Run from igit
+
+use warnings;
+use strict;
+
+die "Usage: $0 <output-file>" unless $#ARGV == 0;
+my $output = $ARGV[0];
+my $pid = open(INOTIFY, "exec inotifywait -q --monitor --recursive --exclude .git -e attrib,moved_to,moved_from,move,create,delete,modify --format '%w%f' .|") or die "Cannot run inotifywait: $!\n";
+
+$| = 1;
+print "$pid\n";
+
+my %modified_files;
+while (<INOTIFY>) {
+    s=^\./==;
+    chomp;
+    $modified_files{$_} = 1;
+}
+
+# Output file must be opened as late as possible, it is a named pipe
+# and the listener won't be here before inotifywait exits.
+# open would just hang if it was done earlier.
+open(OUT, ">$output") or die "Cannot open $output: $!\n";
+foreach my $key (sort keys %modified_files) {
+    print OUT "$key\000";
+}
+exit 0;
diff --git a/igit b/igit
new file mode 100755
index 0000000..60c5bb2
--- /dev/null
+++ b/igit
@@ -0,0 +1,22 @@
+#!/bin/sh
+
+TOPDIR=`git rev-parse --show-cdup` || exit 1
+
+if [ ! "$TOPDIR" ]; then
+    TOPDIR="./"
+fi
+
+PIPE=.igit-pipe
+PIDFILE=.igit-pid
+
+if [ -p ${TOPDIR}${PIPE} ] && kill -TERM `cat ${TOPDIR}${PIDFILE}`; then
+    ( cd $TOPDIR && git update-index --verbose --no-assume-unchanged -z --stdin < $PIPE )
+fi
+
+git "$@"
+
+cd $TOPDIR
+rm -f $PIPE
+mkfifo $PIPE
+git-inotify-daemon.pl $PIPE > $PIDFILE 2>> .igit-errors </dev/null &
+
-- 
1.7.2.rc0


- Finn Arne

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-27 12:20 inotify daemon speedup for git [POC/HACK] Finn Arne Gangstad
@ 2010-07-27 23:29 ` Avery Pennarun
  2010-07-27 23:39   ` Joshua Juran
  2010-07-27 23:58 ` Sverre Rabbelier
  1 sibling, 1 reply; 18+ messages in thread
From: Avery Pennarun @ 2010-07-27 23:29 UTC (permalink / raw)
  To: Finn Arne Gangstad; +Cc: git

On Tue, Jul 27, 2010 at 8:20 AM, Finn Arne Gangstad <finnag@pvv.org> wrote:
> Reading through the thread about subtree I noticed Avery mentioning
> using inotify to speed up git status & co.
>
> Here is a quick hack I did some time ago to test this out. To use it,
> call "igit" instead of "git" for all commands you want to speed up.
>
> There is one minor nit: The speedup gain is zero :) git still
> traverses all directories to look for .gitignore files, which seems to
> totally kill the optimisation.

Hey, this is kind of cool.  Except for that last part :)

Actually I think the problem is a little worse than .gitignore files.
'git status', for example (which is called by git commit), wants to
generate a list of the files it *doesn't* know about.  Unfortunately,
those files aren't in the index at all.  So it resorts to doing
recursive readdir() across the entire repository.  The net result is
about as slow as doing that plus one stat() per file in the index.

An inotify daemon could easily keep track of which files have been
added that aren't in the index... but where would it put the list of
files git doesn't know about?  Do they go in the index with a special
NOT_REALLY_INDEXED flag?

This is the main question that has so far prevented me from trying to
solve the problem myself.
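The daemon-side bookkeeping, at least, can be sketched like this (a
toy Python simulation; the paths are invented and the events are fed
in by hand instead of being read from inotify):

```python
# Maintain the set of untracked files by applying filesystem events
# against the set of paths present in the index.  A real daemon would
# read these events from inotify; here they are simulated.
indexed = {"Makefile", "src/main.c"}
untracked = set()

def on_event(kind, path):
    if kind == "create" and path not in indexed:
        untracked.add(path)        # new file git doesn't know about
    elif kind == "delete":
        untracked.discard(path)    # it's gone; stop reporting it

for ev in [("create", "src/main.o"),
           ("create", "src/util.c"),
           ("delete", "src/main.o")]:
    on_event(*ev)
```

The open question from the mail remains where such a set would live;
the sketch only shows that the incremental maintenance itself is cheap.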

Thanks,

Avery


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-27 23:29 ` Avery Pennarun
@ 2010-07-27 23:39   ` Joshua Juran
  2010-07-27 23:51     ` Avery Pennarun
  0 siblings, 1 reply; 18+ messages in thread
From: Joshua Juran @ 2010-07-27 23:39 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Finn Arne Gangstad, git

On Jul 27, 2010, at 4:29 PM, Avery Pennarun wrote:

> An inotify daemon could easily keep track of which files have been
> added that aren't in the index... but where would it put the list of
> files git doesn't know about?  Do they go in the index with a special
> NOT_REALLY_INDEXED flag?

One option is not to write it to disk at all.  The client could  
consult the daemon directly.

Josh


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-27 23:39   ` Joshua Juran
@ 2010-07-27 23:51     ` Avery Pennarun
  2010-07-28  0:00       ` Shawn O. Pearce
  0 siblings, 1 reply; 18+ messages in thread
From: Avery Pennarun @ 2010-07-27 23:51 UTC (permalink / raw)
  To: Joshua Juran; +Cc: Finn Arne Gangstad, git

On Tue, Jul 27, 2010 at 7:39 PM, Joshua Juran <jjuran@gmail.com> wrote:
> On Jul 27, 2010, at 4:29 PM, Avery Pennarun wrote:
>
>> An inotify daemon could easily keep track of which files have been
>> added that aren't in the index... but where would it put the list of
>> files git doesn't know about?  Do they go in the index with a special
>> NOT_REALLY_INDEXED flag?
>
> One option is not to write it to disk at all.  The client could consult the
> daemon directly.

True.  What would the client-server protocol look like, though?  "Give
me the list of unknown files?"  Does the daemon need to understand
.gitignore or will it send back a list of all my million *.o files
every time?  etc.

Offhandedly, I think it would be nice to have an inotify daemon just
maintain (something like) the git index file where it just has a list
of *all* the files in a form that's a) random access, not just
sequential, and b) really fast when accessed sequentially.

Knowing that large numbers of files can cause slowness, I was planning
ahead for inotify when I designed bup's index file format, and it
meets the above criteria.  Unfortunately I screwed up other stuff
(adding new files is too slow) and it still needs to be rewritten
anyway.  Oh well.

While we're here, it's probably worth mentioning that git's index file
format (which stores a sequential list of full paths in alphabetical
order, instead of an actual hierarchy) does become a bottleneck when
you actually have a huge number of files in your repo (like literally
a million).  You can't actually binary search through the index!  The
current implementation of submodules allows you to dodge that
scalability problem since you end up with multiple smaller index
files.  Anyway, that's fixable too.
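The variable-length records are exactly what blocks the binary search:
without fixed-size entries there is no way to seek to the k-th record.
A toy sketch (an invented record format, not git's actual index
layout) of the sequential scan, and of how a fixed-width offset table
would restore O(log n) lookup:

```python
import struct

# Invented format: each record is <u16 pathlen><path bytes>, records
# sorted by path.  Variable length means you cannot compute where the
# k-th record starts, so lookup degrades to a sequential scan.
def scan_lookup(buf, target):
    off, idx = 0, 0
    while off < len(buf):
        (n,) = struct.unpack_from(">H", buf, off)
        if buf[off + 2 : off + 2 + n].decode() == target:
            return idx
        off += 2 + n
        idx += 1
    return -1

# A side table of fixed-width record offsets makes the sorted order
# usable again: binary search over offsets instead of scanning bytes.
def build_offsets(buf):
    offs, off = [], 0
    while off < len(buf):
        offs.append(off)
        (n,) = struct.unpack_from(">H", buf, off)
        off += 2 + n
    return offs

def bsearch_lookup(buf, offs, target):
    lo, hi = 0, len(offs)
    while lo < hi:
        mid = (lo + hi) // 2
        (n,) = struct.unpack_from(">H", buf, offs[mid])
        path = buf[offs[mid] + 2 : offs[mid] + 2 + n].decode()
        if path == target:
            return mid
        if path < target:
            lo = mid + 1
        else:
            hi = mid
    return -1
```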

Have fun,

Avery


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-27 12:20 inotify daemon speedup for git [POC/HACK] Finn Arne Gangstad
  2010-07-27 23:29 ` Avery Pennarun
@ 2010-07-27 23:58 ` Sverre Rabbelier
  1 sibling, 0 replies; 18+ messages in thread
From: Sverre Rabbelier @ 2010-07-27 23:58 UTC (permalink / raw)
  To: Finn Arne Gangstad; +Cc: git, Avery Pennarun

Heya,

On Tue, Jul 27, 2010 at 07:20, Finn Arne Gangstad <finnag@pvv.org> wrote:
> There is one minor nit: The speedup gain is zero :) git still
> traverses all directories to look for .gitignore files, which seems to
> totally kill the optimisation.

This is very true. In my experience with ginormous trees, even if you
'git update-index --assume-unchanged' every file and directory, it's
still unbearably slow because of the .gitignore files. Any solution
that aims to solve this problem should also address the .gitignore
problem. Note: a safe assumption here is that any fix needs to keep
working even when there are more .gitignore files than regular files
:).

-- 
Cheers,

Sverre Rabbelier


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-27 23:51     ` Avery Pennarun
@ 2010-07-28  0:00       ` Shawn O. Pearce
  2010-07-28  0:18         ` Avery Pennarun
  2010-07-28 13:06         ` Jakub Narebski
  0 siblings, 2 replies; 18+ messages in thread
From: Shawn O. Pearce @ 2010-07-28  0:00 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Joshua Juran, Finn Arne Gangstad, git

Avery Pennarun <apenwarr@gmail.com> wrote:
> 
> While we're here, it's probably worth mentioning that git's index file
> format (which stores a sequential list of full paths in alphabetical
> order, instead of an actual hierarchy) does become a bottleneck when
> you actually have a huge number of files in your repo (like literally
> a million).  You can't actually binary search through the index!  The
> current implementation of submodules allows you to dodge that
> scalability problem since you end up with multiple smaller index
> files.  Anyway, that's fixable too.

Yes.

More than once I've been tempted to rewrite the on-disk (and I guess
in-memory) format of the index.  And then I remember how painful that
stuff is in either C git.git or JGit, and I back away slowly.  :-)

Ideally the index is organized the same way the trees are, but
you still can't do a really good binary search because of the
ass-backwards name sorting rule for trees.  But for performance
reasons you still want to keep the entire index in a single file,
an index per directory (aka SVN/CVS) is too slow for the common
case of <30k files.
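The sorting rule in question is git's tree-entry order: a directory's
name is compared as if it had a trailing '/', so a directory "foo"
sorts after a file "foo.c". A rough Python paraphrase (not the actual
C comparator):

```python
def tree_sort_key(name, is_dir):
    # Git compares tree entries as if a directory name carried a
    # trailing '/': 'foo/' sorts after 'foo.c' because '/' (0x2f)
    # compares greater than '.' (0x2e).
    return name + "/" if is_dir else name

entries = [("foo", True), ("foo.c", False), ("foo-bar", False)]
ordered = sorted(entries, key=lambda e: tree_sort_key(*e))
# ordered: foo-bar, foo.c, then the directory foo
```

This is also why a plain byte-sorted list of full paths (the index)
and tree order don't line up, which makes a naive binary search over
the index awkward even on top of fixed-size records.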

-- 
Shawn.


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-28  0:00       ` Shawn O. Pearce
@ 2010-07-28  0:18         ` Avery Pennarun
  2010-07-28  1:14           ` Joshua Juran
  2010-07-28 13:09           ` Jakub Narebski
  2010-07-28 13:06         ` Jakub Narebski
  1 sibling, 2 replies; 18+ messages in thread
From: Avery Pennarun @ 2010-07-28  0:18 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Joshua Juran, Finn Arne Gangstad, git

On Tue, Jul 27, 2010 at 8:00 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
> Avery Pennarun <apenwarr@gmail.com> wrote:
>> While we're here, it's probably worth mentioning that git's index file
>> format (which stores a sequential list of full paths in alphabetical
>> order, instead of an actual hierarchy) does become a bottleneck when
>> you actually have a huge number of files in your repo (like literally
>> a million).  You can't actually binary search through the index!  The
>> current implementation of submodules allows you to dodge that
>> scalability problem since you end up with multiple smaller index
>> files.  Anyway, that's fixable too.
>
> Yes.
>
> More than once I've been tempted to rewrite the on-disk (and I guess
> in-memory) format of the index.  And then I remember how painful that
> stuff is in either C git.git or JGit, and I back away slowly.  :-)
>
> Ideally the index is organized the same way the trees are, but
> you still can't do a really good binary search because of the
> ass-backwards name sorting rule for trees.  But for performance
> reasons you still want to keep the entire index in a single file,
> an index per directory (aka SVN/CVS) is too slow for the common
> case of <30k files.

Really?  What's wrong with the name sorting rule?  I kind of like it.

bup's current index - after I abandoned my clone of the git one since
it was too slow with insane numbers of files - is very fast for reads
and in-place updates using mmap.

Essentially, it's a tree, starting from the outermost leaves and
leading toward the entry at the very end of the file, which is the
root.  (The idea of doing it backwards was that I could write the file
sequentially.  In retrospect, that was probably an unnecessarily
brain-bending waste of time and the root should have been the first
entry instead.)

For speed, the bup index can just mark entries as deleted using a flag
rather than actually rewriting the whole indexfile.  Unfortunately, I
failed to make it sufficiently flexible to *add* new entries without
needing to rewrite the whole thing.  In bup, that's a big deal
(especially since python is kind of slow and there are typically >1
million files in the index).  In git, it's maybe not so bad; after
all, the current implementation rewrites the index *every* time and
nobody notices.
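The deleted-flag trick is easy to picture with a toy fixed-width index
(an invented layout, not bup's real format): removing an entry becomes
a one-byte in-place write through mmap rather than a rewrite of the
file:

```python
import mmap
import os
import struct
import tempfile

REC = struct.Struct("=16sB")   # 16-byte name field + 1 flag byte
DELETED = 0x01

# Write a tiny fixed-width index file with three live entries.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    for name in (b"a.txt", b"b.txt", b"c.txt"):
        f.write(REC.pack(name.ljust(16, b"\0"), 0))

# "Deleting" b.txt is a single flag-byte store through the mapping;
# the rest of the file is untouched.
with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as m:
    m[1 * REC.size + 16] = DELETED

# Readers simply skip tombstoned records.
with open(path, "rb") as f:
    recs = [REC.unpack(f.read(REC.size)) for _ in range(3)]
live_names = [n.rstrip(b"\0") for n, flags in recs if not flags & DELETED]
os.unlink(path)
```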

Anyway, the code for it isn't too hairy, in case you want to steal some ideas:
http://github.com/apenwarr/bup/blob/master/lib/bup/index.py

(Disclaimer: I say this after actually spending a couple of late
nights pulling my hair out over it.  So I'm not so hairy anymore
either, but that doesn't prove much.)

I've considered just tossing the whole thing and using sqlite instead.
Eventually I'll do it as a benchmark to see what happens.  My past
experiments with sqlite have demonstrated that its performance is
rather mind boggling (> 100k rows inserted per second as long as you
prepare() your SQL statements).  Reading from the index would be fast,
adding entries would be much faster than presently, but I'm not sure
about mass updates.  For bup sqlite would be okay, though I doubt git
wants to take on a whole sqlite dependency.  Then again, you never
know.
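That throughput figure is plausible as long as the inserts share one
transaction and one prepared statement; a sketch with Python's sqlite3
module (the table schema is invented for illustration):

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")   # a file-backed db in real use
con.execute("CREATE TABLE idx (path TEXT PRIMARY KEY, mtime INT, sha TEXT)")

rows = [(f"dir/file{i}.c", i, "0" * 40) for i in range(100_000)]

# executemany reuses one compiled statement, and the with-block wraps
# everything in a single transaction; committing per row instead is
# what kills insert throughput.
t0 = time.perf_counter()
with con:
    con.executemany("INSERT INTO idx VALUES (?, ?, ?)", rows)
elapsed = time.perf_counter() - t0

(count,) = con.execute("SELECT COUNT(*) FROM idx").fetchone()
```

The exact rate will vary with the hardware; the point is the shape of
the loop, not the number.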

Have fun,

Avery


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-28  0:18         ` Avery Pennarun
@ 2010-07-28  1:14           ` Joshua Juran
  2010-07-28  1:31             ` Avery Pennarun
  2010-07-28 13:09           ` Jakub Narebski
  1 sibling, 1 reply; 18+ messages in thread
From: Joshua Juran @ 2010-07-28  1:14 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Shawn O. Pearce, Finn Arne Gangstad, git

On Jul 27, 2010, at 5:18 PM, Avery Pennarun wrote:

> On Tue, Jul 27, 2010 at 8:00 PM, Shawn O. Pearce  
> <spearce@spearce.org> wrote:
>> Avery Pennarun <apenwarr@gmail.com> wrote:
>>> While we're here, it's probably worth mentioning that git's index  
>>> file
>>> format (which stores a sequential list of full paths in alphabetical
>>> order, instead of an actual hierarchy) does become a bottleneck when
>>> you actually have a huge number of files in your repo (like  
>>> literally
>>> a million).  You can't actually binary search through the index!   
>>> The
>>> current implementation of submodules allows you to dodge that
>>> scalability problem since you end up with multiple smaller index
>>> files.  Anyway, that's fixable too.
>>
>> Yes.
>>
>> More than once I've been tempted to rewrite the on-disk (and I guess
>> in-memory) format of the index.  And then I remember how painful that
>> stuff is in either C git.git or JGit, and I back away slowly.  :-)
>>
>> Ideally the index is organized the same way the trees are, but
>> you still can't do a really good binary search because of the
>> ass-backwards name sorting rule for trees.  But for performance
>> reasons you still want to keep the entire index in a single file,
>> an index per directory (aka SVN/CVS) is too slow for the common
>> case of <30k files.
>
> Really?  What's wrong with the name sorting rule?  I kind of like it.
>
> bup's current index - after I abandoned my clone of the git one since
> it was too slow with insane numbers of files - is very fast for reads
> and in-place updates using mmap.
>
> Essentially, it's a tree, starting from the outermost leaves and
> leading toward the entry at the very end of the file, which is the
> root.  (The idea of doing it backwards was that I could write the file
> sequentially.  In retrospect, that was probably an unnecessarily
> brain-bending waste of time and the root should have been the first
> entry instead.)
>
> For speed, the bup index can just mark entries as deleted using a flag
> rather than actually rewriting the whole indexfile.  Unfortunately, I
> failed to make it sufficiently flexible to *add* new entries without
> needing to rewrite the whole thing.  In bup, that's a big deal
> (especially since python is kind of slow and there are typically >1
> million files in the index).  In git, it's maybe not so bad; after
> all, the current implementation rewrites the index *every* time and
> nobody notices.

Okay, I have an idea.  If I understand correctly, the index is a flat  
database of records including a pathname and several fixed-length  
fields.  Since the records are not fixed-length, only sequential  
search is possible, even though the records are sorted by pathname.

Here's the idea:  Divide the database into blocks.  Each block  
contains a block header and the records belonging to a single  
directory.  The block header contains the length of the block and also  
the offset to the next block, in bytes.  In addition to a record for  
each indexed file in a directory, a directory's block also contains  
records for subdirectories. The mode flags in a record indicate the  
record type.  Directory records contain an offset in bytes to the  
block for that directory (in place of the SHA-1 hash).  The block list  
is preceded by a file header, which includes the offset in bytes of  
the root block.  All offsets are from the beginning of the file.

Instead of having to search among every file in the repository, the  
search space now includes only the immediate descendants of each  
directory in the target file's path.  If a directory is modified then  
it can either be rewritten in place (if there's sufficient room) or  
appended to the end of the file (requiring the old and new  
sequentially preceding blocks and the parent directory's block to  
update their offsets).
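A toy rendering of the proposed layout (simplified: each record is a
kind byte, a length-prefixed name, and a 4-byte payload that holds
either file metadata or the child block's offset; child blocks are
written first here so parent records can point at them):

```python
import struct

REC_HDR = struct.Struct(">BH")   # kind (0=file, 1=dir, 0xFF=end), namelen
U32 = struct.Struct(">I")        # payload: file metadata or child offset

def build_block(tree, out):
    """tree maps name -> None for files, name -> subtree for dirs.
    Child blocks are emitted first so their offsets are known."""
    child_off = {}
    for name, sub in tree.items():
        if sub is not None:
            child_off[name] = build_block(sub, out)
    my_off = len(out)
    for name, sub in sorted(tree.items()):
        kind = 0 if sub is None else 1
        payload = child_off.get(name, 0)
        out += REC_HDR.pack(kind, len(name)) + name.encode() + U32.pack(payload)
    out += REC_HDR.pack(0xFF, 0)             # end-of-block sentinel
    return my_off

def lookup(buf, root_off, path):
    """Walk one block per path component instead of scanning every
    record in the repository.  Returns 0 (file), 1 (dir) or None."""
    off = root_off
    parts = path.split("/")
    for depth, part in enumerate(parts):
        p, hit = off, None
        while hit is None:
            kind, n = REC_HDR.unpack_from(buf, p)
            if kind == 0xFF:
                return None                  # name absent in this block
            name = buf[p + 3 : p + 3 + n].decode()
            (payload,) = U32.unpack_from(buf, p + 3 + n)
            p += 3 + n + 4
            if name == part:
                hit = (kind, payload)
        kind, payload = hit
        if depth == len(parts) - 1:
            return kind
        if kind != 1:
            return None                      # tried to descend into a file
        off = payload                        # jump to the child block
    return None

buf = bytearray()
root = build_block({"a.c": None, "sub": {"b.c": None}}, buf)
```

Rewrite-in-place versus append-at-end, which the mail raises, is then
a policy question layered on top of this structure.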

Is this useful?

Josh


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-28  1:14           ` Joshua Juran
@ 2010-07-28  1:31             ` Avery Pennarun
  2010-07-28  6:03               ` Sverre Rabbelier
  0 siblings, 1 reply; 18+ messages in thread
From: Avery Pennarun @ 2010-07-28  1:31 UTC (permalink / raw)
  To: Joshua Juran; +Cc: Shawn O. Pearce, Finn Arne Gangstad, git

On Tue, Jul 27, 2010 at 9:14 PM, Joshua Juran <jjuran@gmail.com> wrote:
> Okay, I have an idea.  If I understand correctly, the index is a flat
> database of records including a pathname and several fixed-length fields.
>  Since the records are not fixed-length, only sequential search is possible,
> even though the records are sorted by pathname.
>
> Here's the idea:  Divide the database into blocks.  Each block contains a
> block header and the records belonging to a single directory.  The block
> header contains the length of the block and also the offset to the next
> block, in bytes.  In addition to a record for each indexed file in a
> directory, a directory's block also contains records for subdirectories. The
> mode flags in a record indicate the record type.  Directory records contain
> an offset in bytes to the block for that directory (in place of the SHA-1
> hash).  The block list is preceded by a file header, which includes the
> offset in bytes of the root block.  All offsets are from the beginning of
> the file.
>
> Instead of having to search among every file in the repository, the search
> space now includes only the immediate descendants of each directory in the
> target file's path.  If a directory is modified then it can either be
> rewritten in place (if there's sufficient room) or appended to the end of
> the file (requiring the old and new sequentially preceding blocks and the
> parent directory's block to update their offsets).

Yeah, that's pretty much what bup's current format does, minus
appending rewritten dirs at the end when files are added.  I've
thought of that, but sooner or later, the file would need to be
rewritten anyway, and then you end up with odd performance
characteristics where the file expands in random ways and then shrinks
again when you decide it's gotten too big.  And if you do try to reuse
empty blocks - which should mostly avoid the endless growth problem -
you basically just have a database, including fragmentation problems
and multi-user concerns and all.  That's what made me think that
sqlite might be a sensible choice, since it's already a database :)

But maybe there's some simpler way.

Have fun,

Avery


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-28  1:31             ` Avery Pennarun
@ 2010-07-28  6:03               ` Sverre Rabbelier
  2010-07-28  6:06                 ` Jonathan Nieder
  2010-07-28  8:20                 ` Nguyen Thai Ngoc Duy
  0 siblings, 2 replies; 18+ messages in thread
From: Sverre Rabbelier @ 2010-07-28  6:03 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Joshua Juran, Shawn O. Pearce, Finn Arne Gangstad, git

Heya,

On Tue, Jul 27, 2010 at 20:31, Avery Pennarun <apenwarr@gmail.com> wrote:
> That's what made me think that
> sqlite might be a sensible choice, since it's already a database :)

Sounds very sensible to me, especially the fact that (if it is indeed
fast enough, which I can't imagine it not being) it would make
development so much easier. At least, I think that having sqlite deal
with backwards comparability of your schema is easier than having to
manually do that? Also, sqlite is known to scale, is exactly one file
worth of dependency, what's not to love (other than having to support
upgrading to 'index vSqlite').

-- 
Cheers,

Sverre Rabbelier


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-28  6:03               ` Sverre Rabbelier
@ 2010-07-28  6:06                 ` Jonathan Nieder
  2010-07-28  7:44                   ` Ævar Arnfjörð Bjarmason
  2010-07-28  8:20                 ` Nguyen Thai Ngoc Duy
  1 sibling, 1 reply; 18+ messages in thread
From: Jonathan Nieder @ 2010-07-28  6:06 UTC (permalink / raw)
  To: Sverre Rabbelier
  Cc: Avery Pennarun, Joshua Juran, Shawn O. Pearce, Finn Arne Gangstad,
	git

Sverre Rabbelier wrote:

> Also, sqlite is known to scale, is exactly one file
> worth of dependency, what's not to love (other than having to support
> upgrading to 'index vSqlite').

The frequent fsync()-ing.  Though that seems to be a problem with
pretty much anything that does not involve rewriting the index
with each change.

Maybe filesystems will cope better soon. :)


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-28  6:06                 ` Jonathan Nieder
@ 2010-07-28  7:44                   ` Ævar Arnfjörð Bjarmason
  2010-07-28 11:08                     ` Theodore Tso
  0 siblings, 1 reply; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2010-07-28  7:44 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Sverre Rabbelier, Avery Pennarun, Joshua Juran, Shawn O. Pearce,
	Finn Arne Gangstad, git

On Wed, Jul 28, 2010 at 06:06, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Sverre Rabbelier wrote:
>
>> Also, sqlite is known to scale, is exactly one file
>> worth of dependency, what's not to love (other than having to support
>> upgrading to 'index vSqlite').
>
> The frequent fsync()-ing.  Though that seems to be a problem with
> pretty much anything that does not involve rewriting the index
> with each change.

SQLite has an option to turn that off [1], but I don't know if it has
an equivalent feature to manually call fsync when you need that.

Anyway, I've been very impressed by SQLite in every way. I'd try it
before designing my own fileformat, especially something involving
binary/sequential search. It's not a large dependency, and can easily
be bundled in compat/.

1. http://www.sqlite.org/pragma.html#pragma_synchronous
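For reference, the pragma is set per connection and combines with
explicit transactions; a minimal sqlite3 sketch (Python for brevity,
and an on-disk path in real use):

```python
import sqlite3

con = sqlite3.connect(":memory:")      # a file path in real use
# Turn off the per-transaction fsync(); durability then becomes the
# application's problem (e.g. rebuild the index after a crash), which
# is acceptable for a cache that can be regenerated from the worktree.
con.execute("PRAGMA synchronous = OFF")
con.execute("CREATE TABLE t (path TEXT)")
with con:
    con.execute("INSERT INTO t VALUES ('Makefile')")
(path,) = con.execute("SELECT path FROM t").fetchone()
```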


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-28  6:03               ` Sverre Rabbelier
  2010-07-28  6:06                 ` Jonathan Nieder
@ 2010-07-28  8:20                 ` Nguyen Thai Ngoc Duy
  2010-08-13 17:53                   ` Enrico Weigelt
  1 sibling, 1 reply; 18+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-07-28  8:20 UTC (permalink / raw)
  To: Sverre Rabbelier
  Cc: Avery Pennarun, Joshua Juran, Shawn O. Pearce, Finn Arne Gangstad,
	git

On Wed, Jul 28, 2010 at 4:03 PM, Sverre Rabbelier <srabbelier@gmail.com> wrote:
> Heya,
>
> On Tue, Jul 27, 2010 at 20:31, Avery Pennarun <apenwarr@gmail.com> wrote:
>> That's what made me think that
>> sqlite might be a sensible choice, since it's already a database :)
>
> Sounds very sensible to me, especially the fact that (if it is indeed
> fast enough, which I can't imagine it not being) it would make
> development so much easier. At least, I think that having sqlite deal
> with backwards compatibility of your schema is easier than having to
> manually do that? Also, sqlite is known to scale, is exactly one file
> worth of dependency, what's not to love (other than having to support
> upgrading to 'index vSqlite').

It would be even more sensible to replace all the pack indexes with a
single database. But then we could just as well drop the git object
store in favor of Fossil (OK, I'm going too far).
-- 
Duy


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-28  7:44                   ` Ævar Arnfjörð Bjarmason
@ 2010-07-28 11:08                     ` Theodore Tso
  0 siblings, 0 replies; 18+ messages in thread
From: Theodore Tso @ 2010-07-28 11:08 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jonathan Nieder, Sverre Rabbelier, Avery Pennarun, Joshua Juran,
	Shawn O. Pearce, Finn Arne Gangstad, git


On Jul 28, 2010, at 3:44 AM, Ævar Arnfjörð Bjarmason wrote:

> SQLite has an option to turn that off [1], but I don't know if it has
> an equivalent feature to manually call fsync when you need that.

The right way to use SQLite is to have a memory-backed database which
you check first, and where you do all of your work.  Then once you hit
a stable stopping point, you commit those changes to your on-disk
SQLite database, which can have proper transaction support.  That way
you don't lose your database when your crappy binary-only video driver
crashes on you, but you don't trash your disk performance because of
the fsync() calls....

It only took a few years for the firefox developers to figure this
out, but the next version is supposed to finally get it right....
It'll be nice to have it NOT chewing up a third of a megabyte of SSD
write endurance on every URL click....
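One way to realize that pattern with SQLite's online backup API
(sketched with Python's sqlite3 module; the schema is invented): do
all the work against a :memory: database and flush it to disk in one
sweep at a stable point:

```python
import os
import sqlite3
import tempfile

# All reads and writes hit the in-memory database; no fsync() anywhere.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE idx (path TEXT PRIMARY KEY, mtime INT)")
with mem:
    mem.executemany("INSERT INTO idx VALUES (?, ?)",
                    [("a.c", 1), ("b.c", 2)])

# At a stable stopping point, persist the whole thing in one
# transactional sweep, paying the fsync() cost once.
db_path = os.path.join(tempfile.mkdtemp(), "index.db")
disk = sqlite3.connect(db_path)
mem.backup(disk)
disk.close()

(count,) = sqlite3.connect(db_path).execute(
    "SELECT COUNT(*) FROM idx").fetchone()
```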

-- Ted


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-28  0:00       ` Shawn O. Pearce
  2010-07-28  0:18         ` Avery Pennarun
@ 2010-07-28 13:06         ` Jakub Narebski
  2010-08-13 17:58           ` Enrico Weigelt
  1 sibling, 1 reply; 18+ messages in thread
From: Jakub Narebski @ 2010-07-28 13:06 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Avery Pennarun, Joshua Juran, Finn Arne Gangstad, git

"Shawn O. Pearce" <spearce@spearce.org> writes:

> Avery Pennarun <apenwarr@gmail.com> wrote:
> > 
> > While we're here, it's probably worth mentioning that git's index file
> > format (which stores a sequential list of full paths in alphabetical
> > order, instead of an actual hierarchy) does become a bottleneck when
> > you actually have a huge number of files in your repo (like literally
> > a million).  You can't actually binary search through the index!  The
> > current implementation of submodules allows you to dodge that
> > scalability problem since you end up with multiple smaller index
> > files.  Anyway, that's fixable too.
> 
> Yes.
> 
> More than once I've been tempted to rewrite the on-disk (and I guess
> in-memory) format of the index.  And then I remember how painful that
> stuff is in either C git.git or JGit, and I back away slowly.  :-)
> 
> Ideally the index is organized the same way the trees are, but
> you still can't do a really good binary search because of the
> ass-backwards name sorting rule for trees.  But for performance
> reasons you still want to keep the entire index in a single file,
> an index per directory (aka SVN/CVS) is too slow for the common
> case of <30k files.

I guess that modern filesystems solve the problem of very many files
in a single directory somehow (hash tables?).  Perhaps the index file
could borrow some such mechanism as an extension.

Index for index?
-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-28  0:18         ` Avery Pennarun
  2010-07-28  1:14           ` Joshua Juran
@ 2010-07-28 13:09           ` Jakub Narebski
  1 sibling, 0 replies; 18+ messages in thread
From: Jakub Narebski @ 2010-07-28 13:09 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Shawn O. Pearce, Joshua Juran, Finn Arne Gangstad, git

Avery Pennarun <apenwarr@gmail.com> writes:

> For speed, the bup index can just mark entries as deleted using a flag
> rather than actually rewriting the whole indexfile.  Unfortunately, I
> failed to make it sufficiently flexible to *add* new entries without
> needing to rewrite the whole thing.  In bup, that's a big deal
> (especially since python is kind of slow and there are typically >1
> million files in the index).  In git, it's maybe not so bad; after
> all, the current implementation rewrites the index *every* time and
> nobody notices.

Sidenote: couldn't you do what e.g. Mercurial did, i.e. rewrite the
performance-critical parts in C?

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-28  8:20                 ` Nguyen Thai Ngoc Duy
@ 2010-08-13 17:53                   ` Enrico Weigelt
  0 siblings, 0 replies; 18+ messages in thread
From: Enrico Weigelt @ 2010-08-13 17:53 UTC (permalink / raw)
  To: git

* Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:

> But then we could as well drop git object store in favor of Fossil (OK
> I'm going too far).

You mean venti?

Actually, that's an idea I've been thinking about for quite a while :)

But venti still lacks delete operations and differential
compression. The first is unproblematic (even though it would require
rewriting the log areas in some way to reclaim space), but
for differential compression, the venti store would have to
know a lot about the objects' internal structure.

I'm doing a bit of research in the area of distributed
content-addressed object stores, designing a superstore
called "Nebulon" [1] with things like strong encryption and
on-demand fetching/syncing. But getting git into it seems
to be a bit tricky; at the least, the hashes would change ...


cu

[1] http://www.metux.de/index.php/de/nebulon-storage-cloud.html
-- 
----------------------------------------------------------------------
 Enrico Weigelt, metux IT service -- http://www.metux.de/

 phone:  +49 36207 519931  email: weigelt@metux.de
 mobile: +49 151 27565287  icq:   210169427         skype: nekrad666
----------------------------------------------------------------------
 Embedded-Linux / Portierung / Opensource-QM / Verteilte Systeme
----------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: inotify daemon speedup for git [POC/HACK]
  2010-07-28 13:06         ` Jakub Narebski
@ 2010-08-13 17:58           ` Enrico Weigelt
  0 siblings, 0 replies; 18+ messages in thread
From: Enrico Weigelt @ 2010-08-13 17:58 UTC (permalink / raw)
  To: git

* Jakub Narebski <jnareb@gmail.com> wrote:

> I guess that modern filesystems solve the problem of very many files
> in a single directory somehow (hash tables?).  Perhaps the index file
> could borrow some such mechanism as an extension.
> 
> Index for index?

hmm, if an index gets too large, it could be split into several
ones by pathname prefix (but not necessarily one per directory).
So with the subdirs "a", "b", "c", we'd have three separate
index files and a master index saying:

    a/	index.001
    b/	index.002
    c/	index.003

or even:

    a/		index.001
    b/		index.002
    b/foo	index.004
    c/		index.005

this would just add one indirection to the index lookup:
comparing the key with each index's prefix.
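The lookup described above amounts to a longest-prefix match over the master index. A rough sketch (the trailing-slash convention and file names are assumptions for illustration, not anything git actually does):

```python
# Hypothetical master index: pathname prefix -> per-prefix index file.
master = {
    "a/":     "index.001",
    "b/":     "index.002",
    "b/foo/": "index.004",
    "c/":     "index.005",
}

def lookup_index(master, path):
    """Return the index file responsible for 'path', or None.

    One extra indirection: compare the key against each prefix
    and keep the longest match.
    """
    best = None
    for prefix in master:
        if path.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return master[best] if best is not None else None
```

So `lookup_index(master, "b/foo/bar.c")` lands in index.004 rather than index.002, because "b/foo/" is the longer matching prefix.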


cu
-- 
----------------------------------------------------------------------
 Enrico Weigelt, metux IT service -- http://www.metux.de/

 phone:  +49 36207 519931  email: weigelt@metux.de
 mobile: +49 151 27565287  icq:   210169427         skype: nekrad666
----------------------------------------------------------------------
 Embedded-Linux / Portierung / Opensource-QM / Verteilte Systeme
----------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2010-08-13 18:06 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-27 12:20 inotify daemon speedup for git [POC/HACK] Finn Arne Gangstad
2010-07-27 23:29 ` Avery Pennarun
2010-07-27 23:39   ` Joshua Juran
2010-07-27 23:51     ` Avery Pennarun
2010-07-28  0:00       ` Shawn O. Pearce
2010-07-28  0:18         ` Avery Pennarun
2010-07-28  1:14           ` Joshua Juran
2010-07-28  1:31             ` Avery Pennarun
2010-07-28  6:03               ` Sverre Rabbelier
2010-07-28  6:06                 ` Jonathan Nieder
2010-07-28  7:44                   ` Ævar Arnfjörð Bjarmason
2010-07-28 11:08                     ` Theodore Tso
2010-07-28  8:20                 ` Nguyen Thai Ngoc Duy
2010-08-13 17:53                   ` Enrico Weigelt
2010-07-28 13:09           ` Jakub Narebski
2010-07-28 13:06         ` Jakub Narebski
2010-08-13 17:58           ` Enrico Weigelt
2010-07-27 23:58 ` Sverre Rabbelier

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).