git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Karsten Blees <karsten.blees@gmail.com>
To: Duy Nguyen <pclouds@gmail.com>
Cc: kusmabite@gmail.com, Ramkumar Ramachandra <artagnon@gmail.com>,
	Robert Zeh <robert.allan.zeh@gmail.com>,
	Junio C Hamano <gitster@pobox.com>,
	Git List <git@vger.kernel.org>,
	finnag@pvv.org
Subject: Re: inotify to minimize stat() calls
Date: Tue, 12 Feb 2013 21:48:18 +0100	[thread overview]
Message-ID: <511AAA92.4030508@gmail.com> (raw)
In-Reply-To: <CACsJy8AWyJ=dW5f44huWyPPe4X62xyi+R9CNM5Tg6u6TYf+thQ@mail.gmail.com>

Am 11.02.2013 04:53, schrieb Duy Nguyen:
> On Sun, Feb 10, 2013 at 11:58 PM, Erik Faye-Lund <kusmabite@gmail.com> wrote:
>> Karsten Blees has done something similar-ish on Windows, and he posted
>> the results here:
>>
>> https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion
>>

The new hashtable implementation in fscache [1] supports O(1) removal and has no mingw dependencies - might come in handy for anyone trying to implement an inotify daemon.

[1] https://github.com/kblees/git/commit/f7eb85c2

>> I also seem to remember he doing a ReadDirectoryChangesW version, but
>> I don't remember what happened with that.
> 
> Thanks. I came across that but did not remember. For one thing, we
> know the inotify alternative for Windows: ReadDirectoryChangesW.
> 

I dropped ReadDirectoryChangesW because maintaining a 'live' file system cache became more and more complicated. For example, according to MSDN docs, ReadDirectoryChangesW *may* report short DOS 8.3 names (i.e. "PROGRA~1" instead of "Program Files"), so a correct and fast cache implementation would have to be indexed by long *and* short names...

Another problem was that the 'live' cache had quite negative performance impact on mutating git commands (checkout, reset...). An inotify daemon running as a background process (not in-process as fscache) will probably affect everyone that modifies the working copy, e.g. running 'make' or the test-suite. This should be considered in the design.

> I copy "git status"'s (impressive) numbers from fscache-v0 for those
> who are interested in:
> 
> preload | -u  | normal | cached | gain
> --------+-----+--------+--------+------
> false   | all | 25.144 | 3.055  |  8.2
> false   | no  | 22.822 | 1.748  | 12.8
> true    | all |  9.234 | 2.179  |  4.2
> true    | no  |  6.833 | 0.955  |  7.2
> 

Note that I wasn't able to reproduce such bad 'normal' values in later tests, I guess disk fragmentation and/or virus scanner must have tricked me on that day...gain factors of 2.5 - 5 are more appropriate.


However, the difference between git status -uall and -uno was always about 1.3 s in all fscache versions, even though opendir/readdir/closedir was served entirely from the cache. I added a bit of performance tracing to find the cause, and I think most of the time spent in wt_status_collect_untracked can be eliminated:

1.) 0.939 s is spent in dir.c/excluded (i.e. checking .gitignore). This check is done for *every* file in the working copy, including files in the index. Checking the index first could eliminate most of that, i.e.:

(Note: patches are for discussion only, I'm aware that they may have unintended side effects...)

@@ -1097,6 +1097,8 @@ static enum path_treatment treat_path(struct dir_struct *dir,
                return path_ignored;
        strbuf_setlen(path, baselen);
        strbuf_addstr(path, de->d_name);
+       if (cache_name_exists(path->buf, path->len, ignore_case))
+               return path_ignored;
        if (simplify_away(path->buf, path->len, simplify))
                return path_ignored;
---


2.) 0.135 s is spent in name-hash.c/hash_index_entry_directories, reindexing the same directories over and over again. In the end, the hashtable contains 939k directory entries, even though the WebKit test repo only has 7k directories. Checking if a directory entry already exists could reduce that, i.e.:

@@ -53,14 +55,23 @@ static void hash_index_entry_directories(struct index_state *istate, struct cach
 	unsigned int hash;
 	void **pos;
 	double t = ticks();
+	struct cache_entry *ce2;
+	int len = ce_namelen(ce);
 
-	const char *ptr = ce->name;
-	while (*ptr) {
-		while (*ptr && *ptr != '/')
-			++ptr;
-		if (*ptr == '/') {
-			++ptr;
-			hash = hash_name(ce->name, ptr - ce->name);
+	while (len > 0) {
+		while (len > 0 && ce->name[len - 1] != '/')
+			len--;
+		if (len > 0) {
+			hash = hash_name(ce->name, len);
+			ce2 = lookup_hash(hash, &istate->name_hash);
+			while (ce2) {
+				if (same_name(ce2, ce->name, len, ignore_case)) {
+					add_since(t, &hash_dirs);
+					return;
+				}
+				ce2 = ce2->dir_next;
+			}
+			len--;
 			pos = insert_hash(hash, ce, &istate->name_hash);
 			if (pos) {
 				ce->dir_next = *pos;
---


Tests were done with the WebKit repo (~200k files, ~7k directories, 15 .gitignore files, ~100 entries in root .gitignore). Instrumented code can be found here: https://github.com/kblees/git/tree/kb/git-status-performance-tracing

Here's the performance traces of 'git status -s -uall'

Before patches:

trace: at builtin/commit.c:1221, time: 0.523429 s: cmd_status/read_cache_preload
trace: at builtin/commit.c:1223, time: 0.00403477 s: cmd_status/refresh_index
trace: at builtin/commit.c:1231, time: 0.00318494 s: cmd_status/hold_locked_index
trace: at wt-status.c:539, time: 0.00527396 s: wt_status_collect_changes_worktree
trace: at wt-status.c:544, time: 0.00545771 s: wt_status_collect_changes
trace: at wt-status.c:546, time: 1.286 s: wt_status_collect_untracked
trace: at builtin/commit.c:1233, time: 1.29852 s: cmd_status/wt_status_collect
trace: at dir.c:1540, time: 0.00170986 s: read_directory_recursive/strbuf_add
trace: at dir.c:1541, time: 0.00623972 s: read_directory_recursive/opendir
trace: at dir.c:1542, time: 0.00517881 s: read_directory_recursive/readdir
trace: at dir.c:1543, time: 0.992936 s: read_directory_recursive/treat_path
trace: at dir.c:1544, time: 0.277942 s: read_directory_recursive/dir_add_name
trace: at dir.c:1545, time: 0.0014594 s: read_directory_recursive/close
trace: at dir.c:1546, time: 0.939349 s: treat_one_path/excluded
trace: at dir.c:1547, time: 0.0050811 s: treat_one_path/dir_add_ignored
trace: at dir.c:1548, time: 0.00515875 s: treat_one_path/get_dtype
trace: at dir.c:1549, time: 0.00329322 s: treat_one_path/treat_directory
trace: at dir.c:1550, time: 0.222969 s: excluded/prep_exclude
trace: at dir.c:1551, time: 0.00443398 s: excluded/excluded_from_list[EXC_CMDL]
trace: at dir.c:1552, time: 0.699602 s: excluded/excluded_from_list[EXC_DIRS]
trace: at dir.c:1553, time: 0.00475736 s: excluded/excluded_from_list[EXC_FILE]
trace: at read-cache.c:460, time: 0.00967987 s: index_name_pos
trace: at name-hash.c:213, time: 0.190481 s: lazy_init_name_hash
trace: at name-hash.c:216, time: 0.135248 s: hash_index_entry_directories (938865 entries)
trace: at name-hash.c:217, time: 0.0806647 s: index_name_exists
trace: at compat/mingw.c:2137, time: 1.97424 s: command: c:\git\msysgit\git\git-status.exe -s -uall


After patches:

trace: at builtin/commit.c:1221, time: 0.517511 s: cmd_status/read_cache_preload
trace: at builtin/commit.c:1223, time: 0.00405227 s: cmd_status/refresh_index
trace: at builtin/commit.c:1231, time: 0.00322796 s: cmd_status/hold_locked_index
trace: at wt-status.c:539, time: 0.00530057 s: wt_status_collect_changes_worktree
trace: at wt-status.c:544, time: 0.00546062 s: wt_status_collect_changes
trace: at wt-status.c:546, time: 0.322799 s: wt_status_collect_untracked
trace: at builtin/commit.c:1233, time: 0.33536 s: cmd_status/wt_status_collect
trace: at dir.c:1542, time: 0.00120529 s: read_directory_recursive/strbuf_add
trace: at dir.c:1543, time: 0.00476647 s: read_directory_recursive/opendir
trace: at dir.c:1544, time: 0.00502022 s: read_directory_recursive/readdir
trace: at dir.c:1545, time: 0.310515 s: read_directory_recursive/treat_path
trace: at dir.c:1546, time: 0 s: read_directory_recursive/dir_add_name
trace: at dir.c:1547, time: 0.000831234 s: read_directory_recursive/close
trace: at dir.c:1548, time: 0.0668582 s: treat_one_path/excluded
trace: at dir.c:1549, time: 0.000173174 s: treat_one_path/dir_add_ignored
trace: at dir.c:1550, time: 0.000174267 s: treat_one_path/get_dtype
trace: at dir.c:1551, time: 0.00315468 s: treat_one_path/treat_directory
trace: at dir.c:1552, time: 0.039733 s: excluded/prep_exclude
trace: at dir.c:1553, time: 0.000185205 s: excluded/excluded_from_list[EXC_CMDL]
trace: at dir.c:1554, time: 0.0264496 s: excluded/excluded_from_list[EXC_DIRS]
trace: at dir.c:1555, time: 0.000170622 s: excluded/excluded_from_list[EXC_FILE]
trace: at read-cache.c:460, time: 0.00260636 s: index_name_pos
trace: at name-hash.c:224, time: 0.126637 s: lazy_init_name_hash
trace: at name-hash.c:227, time: 0.0500866 s: hash_index_entry_directories (7152 entries)
trace: at name-hash.c:228, time: 0.0790143 s: index_name_exists
trace: at compat/mingw.c:2137, time: 1.00595 s: command: c:\git\msysgit\git\git-status.exe -s -uall

  reply	other threads:[~2013-02-12 20:48 UTC|newest]

Thread overview: 88+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-02-08 21:10 inotify to minimize stat() calls Ramkumar Ramachandra
2013-02-08 22:15 ` Junio C Hamano
2013-02-08 22:45   ` Junio C Hamano
2013-02-09  2:10     ` Duy Nguyen
2013-02-09  2:37       ` Junio C Hamano
2013-02-09  2:56     ` Junio C Hamano
2013-02-09  3:36       ` Robert Zeh
2013-02-09 12:05         ` Ramkumar Ramachandra
2013-02-09 12:11           ` Ramkumar Ramachandra
2013-02-09 12:53           ` Ramkumar Ramachandra
2013-02-09 12:59             ` Duy Nguyen
2013-02-09 17:10               ` Ramkumar Ramachandra
2013-02-09 18:56                 ` Ramkumar Ramachandra
2013-02-10  5:24                 ` Duy Nguyen
2013-02-10 11:17                   ` Duy Nguyen
2013-02-10 11:22                     ` Duy Nguyen
2013-02-10 20:16                       ` Junio C Hamano
2013-02-11  2:56                         ` Duy Nguyen
2013-02-11 11:12                           ` Duy Nguyen
2013-03-07 22:16                           ` Torsten Bögershausen
2013-03-08  0:04                             ` Junio C Hamano
2013-03-08  7:01                               ` Torsten Bögershausen
2013-03-08  8:15                                 ` Junio C Hamano
2013-03-08  9:24                                   ` Torsten Bögershausen
2013-03-08 10:53                                   ` Duy Nguyen
2013-03-10  8:23                                     ` Ramkumar Ramachandra
2013-03-13 12:59                                     ` [PATCH] status: hint the user about -uno if read_directory takes too long Nguyễn Thái Ngọc Duy
2013-03-13 15:21                                       ` Torsten Bögershausen
2013-03-13 16:16                                       ` Junio C Hamano
2013-03-14 10:22                                         ` Duy Nguyen
2013-03-14 15:05                                           ` Junio C Hamano
2013-03-15 12:30                                             ` Duy Nguyen
2013-03-15 15:52                                               ` Torsten Bögershausen
2013-03-15 15:57                                                 ` Ramkumar Ramachandra
2013-03-15 16:53                                                 ` Junio C Hamano
2013-03-15 17:41                                                   ` Torsten Bögershausen
2013-03-15 20:06                                                     ` Junio C Hamano
2013-03-15 21:14                                                       ` Torsten Bögershausen
2013-03-15 21:59                                                         ` Junio C Hamano
2013-03-16  7:21                                                           ` Torsten Bögershausen
2013-03-17  4:47                                                             ` Junio C Hamano
2013-03-16  1:51                                           ` Duy Nguyen
2013-02-10 13:26                     ` inotify to minimize stat() calls demerphq
2013-02-10 15:35                       ` Duy Nguyen
2013-02-14 14:36                       ` Magnus Bäck
2013-02-10 16:45                     ` Ramkumar Ramachandra
2013-02-11  3:03                       ` Duy Nguyen
2013-02-10 16:58                     ` Erik Faye-Lund
2013-02-11  3:53                       ` Duy Nguyen
2013-02-12 20:48                         ` Karsten Blees [this message]
2013-02-13 10:06                           ` Duy Nguyen
2013-02-13 12:15                           ` Duy Nguyen
2013-02-13 18:18                             ` Jeff King
2013-02-13 19:47                               ` Jeff King
2013-02-13 20:25                               ` Karsten Blees
2013-02-13 22:55                                 ` Jeff King
2013-02-14  0:48                                   ` Karsten Blees
2013-02-27 14:45                                     ` [PATCH] name-hash.c: fix endless loop with core.ignorecase=true Karsten Blees
2013-02-27 16:53                                       ` Junio C Hamano
2013-02-27 21:52                                         ` Karsten Blees
2013-02-27 23:57                                           ` [PATCH v2] " Karsten Blees
2013-02-28  0:27                                             ` Junio C Hamano
2013-02-19  9:49                           ` inotify to minimize stat() calls Ramkumar Ramachandra
2013-02-19 14:25                             ` Karsten Blees
2013-02-19 13:16                   ` Drew Northup
2013-02-19 13:47                     ` Duy Nguyen
2013-02-09 19:35           ` Junio C Hamano
2013-02-10 19:03             ` Robert Zeh
2013-02-10 19:26               ` Martin Fick
2013-02-10 20:18                 ` Robert Zeh
2013-02-11  3:21               ` Duy Nguyen
2013-02-11 14:13                 ` Robert Zeh
2013-02-19  9:57                   ` Ramkumar Ramachandra
2013-04-24 17:20               ` [PATCH] " Robert Zeh
2013-04-24 21:32                 ` Duy Nguyen
2013-04-25 19:44                   ` Robert Zeh
2013-04-25 21:20                     ` Duy Nguyen
2013-04-26 15:35                       ` Robert Zeh
2013-04-25  8:18                 ` Thomas Rast
2013-04-25 19:37                   ` Robert Zeh
2013-04-25 19:59                     ` Thomas Rast
2013-04-27 13:51                       ` Thomas Rast
2013-04-27 23:56                 ` Duy Nguyen
     [not found]                   ` <CAKXa9=r2A7UeBV2s2H3wVGdPkS1zZ9huNJhtvTC-p0S5Ed12xA@mail.gmail.com>
2013-04-30  0:27                     ` Duy Nguyen
2013-02-09 11:32       ` Ramkumar Ramachandra
2013-02-14 15:16 ` Ævar Arnfjörð Bjarmason
2013-02-14 16:31   ` Junio C Hamano
2013-02-19  9:40   ` Ramkumar Ramachandra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=511AAA92.4030508@gmail.com \
    --to=karsten.blees@gmail.com \
    --cc=artagnon@gmail.com \
    --cc=blees@dcon.de \
    --cc=finnag@pvv.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=kusmabite@gmail.com \
    --cc=pclouds@gmail.com \
    --cc=robert.allan.zeh@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).