git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Thomas Rast <trast@student.ethz.ch>
To: Jeff King <peff@peff.net>
Cc: <git@vger.kernel.org>
Subject: [git wiki PATCH 2/3] "Designing a faster index format" project
Date: Fri, 2 Mar 2012 12:05:46 +0100	[thread overview]
Message-ID: <afdfa68348d8d98f2cb604d9c17dad6cd764066e.1330686331.git.trast@student.ethz.ch> (raw)
In-Reply-To: <57e8b4eb7a98af33982c2f3a763e18f62b1d6d6d.1330686331.git.trast@student.ethz.ch>

---
 SoC-2012-Ideas.md |   41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/SoC-2012-Ideas.md b/SoC-2012-Ideas.md
index 145b379..59d1baf 100644
--- a/SoC-2012-Ideas.md
+++ b/SoC-2012-Ideas.md
@@ -99,3 +99,44 @@ computers is a plus.
 
 Proposed by: Thomas Rast
 Possible mentor(s): Thomas Rast
+
+Designing a faster index format
+-------------------------------
+
+Git is pretty slow when managing huge repositories in terms of files
+in any given tree, as it needs to rewrite the index (in full) on
+pretty much every operation.  For example, even though _logically_
+`git add already_tracked_file` only changes a single blob SHA-1 in the
+index, Git will verify index correctness during loading and recompute
+the new hash during writing _over the whole index_.  It thus ends up
+spending a large amount of time simply on hashing the index.
+
+A carefully designed index format could help in several ways.  (For the
+complexity estimates below, let n be the number of index entries or
+the size of the index, which is roughly the same.)
+
+ * The work needed for something as simple as entering a new blob into
+   the index, which is possibly the most common operation in git
+   (think `git add -p` etc.) should be at most log(n).
+
+ * The work needed for a more complex operation that changes the
+   number of index entries will have to be larger unless we get into
+   database land.  However the amount of data that we SHA-1 over
+   should still be log(n).
+
+ * It may be possible to store the cache-tree data directly as part of
+   the index, always keeping it valid, and using that to validate
+   index consistency throughout.  If so, this would be a big boost to
+   other git operations that currently suffer from frequent cache-tree
+   invalidation.
+
+Note that there are other criteria than speed: the format should also
+be as easy to parse as possible, so as to simplify work for the other
+.git-reading programs (such as jgit and libgit2).  For the same
+reason, you will also have to show a significant speed boost as
+otherwise the break in compatibility is not worth the fallout.
+
+The programming work will be in C, as it replaces a core part of git.
+
+Proposed by: Thomas Rast
+Possible mentor(s): Thomas Rast
-- 
1.7.9.2.467.g7fee4

  reply	other threads:[~2012-03-02 11:06 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-03-02  9:11 GSoC 2012 application process Jeff King
2012-03-02 11:05 ` [git wiki PATCH 1/3] "Improving parallelism in various commands" project Thomas Rast
2012-03-02 11:05   ` Thomas Rast [this message]
2012-03-02 11:08     ` [git wiki PATCH 2/3] "Designing a faster index format" project Jeff King
2012-03-02 18:24     ` Junio C Hamano
2012-03-03  3:30       ` Nguyen Thai Ngoc Duy
2012-03-02 11:05   ` [git wiki PATCH 3/3] "Improving the `git add -p` interface" project Thomas Rast
2012-03-02 14:29   ` [git wiki PATCH 1/3] "Improving parallelism in various commands" project Nguyen Thai Ngoc Duy
2012-03-02 17:35   ` James Pickens
2012-03-02 14:52 ` GSoC 2012 application process Nguyen Thai Ngoc Duy
2012-03-02 21:48 ` Junio C Hamano
2012-03-03  0:09   ` Jeff King
2012-03-03 21:14 ` [git wiki PATCH] "Modernizing and expanding Git.pm" project Jakub Narebski
2012-03-03 22:23   ` Jeff King
2012-03-04 23:35 ` [git wiki PATCH] Teaching "--3way" to "git apply" Junio C Hamano
2012-03-05  5:33   ` Jeff King
2012-03-05  8:05     ` Thomas Rast
2012-03-05 10:02       ` Jeff King
2012-03-05 13:38 ` GSoC 2012 application process Matthieu Moy
2012-03-05 13:58   ` Jeff King
2012-03-05 14:42     ` Matthieu Moy
2012-03-05 23:53       ` Jeff King
2012-03-07 14:36 ` GSoC backup admin Jeff King
2012-03-07 15:27   ` Shawn Pearce
2012-03-08 21:18 ` [gsoc2012 wiki PATCH] "Use JavaScript library / framework in gitweb" project Jakub Narebski
2012-03-09  7:24   ` Jeff King
2012-03-10  0:46 ` [gsoc2012 wiki PATCH] "`git instaweb --serve`" project Jakub Narebski
2012-03-11 22:30 ` [gsoc2012 wiki PATCH] "Graphical diff in git-gui" project Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=afdfa68348d8d98f2cb604d9c17dad6cd764066e.1330686331.git.trast@student.ethz.ch \
    --to=trast@student.ethz.ch \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).