Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Ben Peart <peartben@gmail.com>
To: Duy Nguyen <pclouds@gmail.com>
Cc: "Brandon Williams" <bmwill@google.com>,
	"Ben Peart" <benpeart@microsoft.com>,
	"Git Mailing List" <git@vger.kernel.org>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
Date: Mon, 12 Feb 2018 12:57:51 -0500	[thread overview]
Message-ID: <d1371a6c-2b30-515a-372e-4fea9bc09c43@gmail.com> (raw)
In-Reply-To: <CACsJy8A2=tWpiBOBxmTLHXm6bvjGCdoDEuJEy7PewvnzEQi2Qg@mail.gmail.com>

On 2/12/2018 5:20 AM, Duy Nguyen wrote:
> On Wed, Feb 7, 2018 at 9:13 PM, Ben Peart <peartben@gmail.com> wrote:
>>
>> On 2/6/2018 7:27 AM, Duy Nguyen wrote:
>>>
>>> This is another thing that bugs me. I know you're talking about huge
>>> index files, but at what size should we start this sort of
>>> optimization? Writing down a few MBs on linux is cheap enough that I
>>> won't bother optimizing (and I get my UNTR extension repaired all the
>>> time, so reduced lstat calls and stuff). This "typically" only comes
>>> at certain size, what size?
>>>
>> It's important to identify what we're trading off here.  With the proposed
>> optimization, we'll end up doing less writes of the index but potentially
>> more lstat calls.  Even with a small index, writing the index is much more
>> expensive than calling lstat on a file.  Exactly how much more expensive
>> depends on a lot of variables but even with a SSD its still orders of
>> magnitude difference.
> Keep in mind it's not just about lstat() calls. Processing ignore
> patterns also takes a big chunk of CPU, especially when you have more
> than a couple ignore rules.

Yes, I'm very familiar with the cost of the exclude list pattern 
matching code.  I've rewritten it to use a hashmap (for those patterns 
where it is possible) that dramatically speeds that aspect up as we tend 
to abuse it pretty heavily (~60K entries on average).

>
> I'm not convinced that writing index files is that expensive for small
> files. I don't know about Windows, with Linux it seems fast to me.
> Actually, for small scale repos, performance probably does not differ
> much either.

I agree.  For small scale repos, none of this is significant enough to 
matter.  Untracked caching should help most as your working directory 
starts to get large.

>> That means we could potentially lstat hundreds or thousands of files and
>> still come out ahead.  Given the untracked cache works at the directory
>> level, the lstat cost actually scales with the number of directories that
>> have had changes (ie changing a 2nd file in the same directory doesn't add
>> any additional cost).  In "typical" workflows, developers don't often change
>> hundreds of files across different directories so we'd "typically" still
>> come out ahead.
> I agree. But we must deal with the bad case when someone
> "accidentally" does that. We should not wait until them yell up "it's
> too slow now" and tell them to disable/enable untracked cache again.
>
> Another case that could touches a lot of directories over time is
> switch trees (e.g. "git checkout master" then "checkout next" or worse
> "checkout v1.0").

You're example above makes me wonder if you understand what my patch is 
doing.  If the index is flagged as dirty for _any_ reason, the entire 
index is written to disk with the latest information - including the 
updated untracked cache and all other extensions.  So in your checkout 
examples above, the index will still get written to disk with the 
updated untracked cache extension.  There would be zero change in 
behavior or performance.  The _only_ time the index is not written to 
disk after my patch is if there were no other changes to the index.  In 
my experience, that is only status calls.

To suffer any degradation in the untracked cache would be if a user 
edited a bunch of files across multiple directories and called status 
repeatedly.  As soon as they did add, checkout, rm, rebase, etc (ie most 
git commands other than status) the index would get flagged as dirty and 
the latest untracked cache extension would get written to disk.

>> We have internal performance tests based on common developer sequences of
>> commands (status, edit a file, status, add, status, commit, log, push, etc)
>> that show that having the untracked cache turned on actually makes these
>> sequences slower.  At the time, we didn't have time to investigate why so we
>> just turned off the untracked cache.
>>
>> When writing the fsmonitor patch series which can utilize the untracked
>> cache, I did some investigation into why the untracked cache was slowing
>> things down in these tests and discovered the cause was the additional index
>> writes.  That is what led to this patch.
>>
>> I've been sitting on it for months now because I didn't have the time to
>> write some performance tests for the git perf framework to demonstrate the
>> problem and how this helps solve it.  With the discussion about the
>> fsmonitor extension, I thought this might be a good time to send it out
>> there.
> Hopefully you have time to get some of those numbers :) A patch is
> more convincing when it's backed by numbers. And I'm still not
> convinced that never ever letting untracked cache be written to the
> index again is a right thing to do [1].

I agree that any performance patch should be accompanied by a 
performance test to demonstrate it actually performs as promised.  I 
looked for but didn't find a performance test for the untracked cache so 
will have to create one from scratch.  In order to make that as 
realistic as possible, it needs to simulate (as much as possible) 
typical developer workflows, ie  create/edit files across multiple 
directories and then execute various common commands (status, add, 
status, add, status, rm, status, commit, log, rebase, etc) and time the 
performance of that sequence of commands.  All doable, that just isn't 
supported well in the performance framework yet so will take some time 
(much more than the actual patch :)).

>> If you have the cycles, time a typical series of commands with and without
>> the untracked cache (ie don't just call status over and over in a loop) and
>> you should see the same results.  Given my limited time right now, I'm OK
>> putting this on the back burner again until a git perf test can be written
>> to ensure it actually speeds things up as advertised.
> [1] Another case that we must _sometimes_ let untracked cache be
> written down is to correct its data. My last invalidation bug, for
> example, could leave the untracked cache in a bad state, when you run
> with new git then it should be able to correct the data and write
> down. But if you don't allow writing down, the new 'git' will keep
> seeing the old errors and keep complaining.

next prev parent reply	other threads:[~2018-02-12 17:58 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-05 19:56 [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache Ben Peart
2018-02-05 20:55 ` Junio C Hamano
2018-02-06  1:39   ` Ben Peart
2018-02-05 21:58 ` Brandon Williams
2018-02-06  1:48   ` Ben Peart
2018-02-06 12:27     ` Duy Nguyen
2018-02-06 12:55       ` Duy Nguyen
2018-02-07 10:59         ` Duy Nguyen
2018-02-07 13:46           ` Ben Peart
2018-02-06 14:50       ` Junio C Hamano
2018-02-07 14:13       ` Ben Peart
2018-02-12 10:20         ` Duy Nguyen
2018-02-12 17:57           ` Ben Peart [this message]
2018-02-13  9:57             ` Duy Nguyen
2018-02-08 10:33 ` Jeff King
2018-02-28 21:27   ` Junio C Hamano
2018-03-01  7:42     ` Jeff King
2018-03-01 12:35       ` Ben Peart

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d1371a6c-2b30-515a-372e-4fea9bc09c43@gmail.com \
    --to=peartben@gmail.com \
    --cc=avarab@gmail.com \
    --cc=benpeart@microsoft.com \
    --cc=bmwill@google.com \
    --cc=git@vger.kernel.org \
    --cc=pclouds@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).