git@vger.kernel.org mailing list mirror (one of many)
 help / Atom feed
* [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
@ 2018-02-05 19:56 Ben Peart
  2018-02-05 20:55 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Ben Peart @ 2018-02-05 19:56 UTC (permalink / raw)
  To: git; +Cc: pclouds, bmwill, avarab, benpeart

The untracked cache saves its current state in the UNTR index extension.
Currently, _any_ change to that state causes the index to be flagged as dirty
and written out to disk.  Unfortunately, the cost to write out the index can
exceed the savings gained by using the untracked cache.  Since it is a cache
that can be updated from the current state of the working directory, there is
no functional requirement that the index be written out for every change to the
untracked cache.

Update the untracked cache logic so that it no longer forces the index to be
written to disk except in the case where the extension is being turned on or
off.  When some other git command requires the index to be written to disk, the
untracked cache will take advantage of that to save it's updated state as well.
This results in a performance win when looked at over common sequences of git
commands (ie such as a status followed by add, commit, etc).

After this patch, all the logic to track statistics for the untracked cache
could be removed as it is only used by debug tracing used to debug the untracked
cache.

Signed-off-by: Ben Peart <benpeart@microsoft.com>
---

Notes:
    Base Ref: master
    Web-Diff: https://github.com/benpeart/git/commit/20c2e8d787
    Checkout: git fetch https://github.com/benpeart/git untracked-cache-v1 && git checkout 20c2e8d787

 dir.c                             | 3 ++-
 t/t7063-status-untracked-cache.sh | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/dir.c b/dir.c
index 7c4b45e30e..da93374f0c 100644
--- a/dir.c
+++ b/dir.c
@@ -2297,7 +2297,8 @@ int read_directory(struct dir_struct *dir, struct index_state *istate,
 				 dir->untracked->gitignore_invalidated,
 				 dir->untracked->dir_invalidated,
 				 dir->untracked->dir_opened);
-		if (dir->untracked == istate->untracked &&
+		if (getenv("GIT_TEST_UNTRACKED_CACHE") &&
+			dir->untracked == istate->untracked &&
 		    (dir->untracked->dir_opened ||
 		     dir->untracked->gitignore_invalidated ||
 		     dir->untracked->dir_invalidated))
diff --git a/t/t7063-status-untracked-cache.sh b/t/t7063-status-untracked-cache.sh
index e5fb892f95..6ef520e823 100755
--- a/t/t7063-status-untracked-cache.sh
+++ b/t/t7063-status-untracked-cache.sh
@@ -14,6 +14,9 @@ test_description='test untracked cache'
 # See <20160803174522.5571-1-pclouds@gmail.com> if you want to know
 # more.
 
+GIT_TEST_UNTRACKED_CACHE=true
+export GIT_TEST_UNTRACKED_CACHE
+
 sync_mtime () {
 	find . -type d -ls >/dev/null
 }

base-commit: 5be1f00a9a701532232f57958efab4be8c959a29
-- 
2.15.0.windows.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
  2018-02-05 19:56 [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache Ben Peart
@ 2018-02-05 20:55 ` Junio C Hamano
  2018-02-06  1:39   ` Ben Peart
  2018-02-05 21:58 ` Brandon Williams
  2018-02-08 10:33 ` Jeff King
  2 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2018-02-05 20:55 UTC (permalink / raw)
  To: Ben Peart; +Cc: git, pclouds, bmwill, avarab

Ben Peart <benpeart@microsoft.com> writes:

> The untracked cache saves its current state in the UNTR index extension.
> Currently, _any_ change to that state causes the index to be flagged as dirty
> and written out to disk.  Unfortunately, the cost to write out the index can
> exceed the savings gained by using the untracked cache.  Since it is a cache
> that can be updated from the current state of the working directory, there is
> no functional requirement that the index be written out for every change to the
> untracked cache.
>
> Update the untracked cache logic so that it no longer forces the index to be
> written to disk except in the case where the extension is being turned on or
> off.  When some other git command requires the index to be written to disk, the
> untracked cache will take advantage of that to save it's updated state as well.
> This results in a performance win when looked at over common sequences of git
> commands (ie such as a status followed by add, commit, etc).
>
> After this patch, all the logic to track statistics for the untracked cache
> could be removed as it is only used by debug tracing used to debug the untracked
> cache.
>
> Signed-off-by: Ben Peart <benpeart@microsoft.com>
> ---
>

OK, so in other words (note: not a suggestion to use different
wording in the log message; just making sure I got the motivation
behind this change correctly), without this new environment
variable, changes to untracked cache alone (due to observed changes
in the filesystem) does not count as "in-core index changed so we
need to write it back to the disk".

That makes sense to me.

Is it envisioned that we want to have similar but different "testing
only" behaviour around this area?  If not, this environment variable
sounds more like "force-flush untracked cache", not "test untracked
cache", to me.

> +GIT_TEST_UNTRACKED_CACHE=true
> +export GIT_TEST_UNTRACKED_CACHE
> +
>  sync_mtime () {
>  	find . -type d -ls >/dev/null
>  }
>
> base-commit: 5be1f00a9a701532232f57958efab4be8c959a29

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
  2018-02-05 19:56 [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache Ben Peart
  2018-02-05 20:55 ` Junio C Hamano
@ 2018-02-05 21:58 ` Brandon Williams
  2018-02-06  1:48   ` Ben Peart
  2018-02-08 10:33 ` Jeff King
  2 siblings, 1 reply; 15+ messages in thread
From: Brandon Williams @ 2018-02-05 21:58 UTC (permalink / raw)
  To: Ben Peart; +Cc: git, pclouds, avarab

On 02/05, Ben Peart wrote:
> The untracked cache saves its current state in the UNTR index extension.
> Currently, _any_ change to that state causes the index to be flagged as dirty
> and written out to disk.  Unfortunately, the cost to write out the index can
> exceed the savings gained by using the untracked cache.  Since it is a cache
> that can be updated from the current state of the working directory, there is
> no functional requirement that the index be written out for every change to the
> untracked cache.
> 
> Update the untracked cache logic so that it no longer forces the index to be
> written to disk except in the case where the extension is being turned on or
> off.  When some other git command requires the index to be written to disk, the
> untracked cache will take advantage of that to save it's updated state as well.
> This results in a performance win when looked at over common sequences of git
> commands (ie such as a status followed by add, commit, etc).
> 
> After this patch, all the logic to track statistics for the untracked cache
> could be removed as it is only used by debug tracing used to debug the untracked
> cache.

So we don't need to update it every time because its just a cache
and if its inaccurate between status calls that's ok?  So only
operations like add and commit will actually write out the untracked
cache (as a part of writing out the index).  Sounds ok.

What benefit is there to using the untracked cache then?  Sounds like
you should just turn it off instead?
(I'm sure this is a naive question :D )

> 
> Signed-off-by: Ben Peart <benpeart@microsoft.com>
> ---
> 
> Notes:
>     Base Ref: master
>     Web-Diff: https://github.com/benpeart/git/commit/20c2e8d787
>     Checkout: git fetch https://github.com/benpeart/git untracked-cache-v1 && git checkout 20c2e8d787
> 
>  dir.c                             | 3 ++-
>  t/t7063-status-untracked-cache.sh | 3 +++
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/dir.c b/dir.c
> index 7c4b45e30e..da93374f0c 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -2297,7 +2297,8 @@ int read_directory(struct dir_struct *dir, struct index_state *istate,
>  				 dir->untracked->gitignore_invalidated,
>  				 dir->untracked->dir_invalidated,
>  				 dir->untracked->dir_opened);
> -		if (dir->untracked == istate->untracked &&
> +		if (getenv("GIT_TEST_UNTRACKED_CACHE") &&
> +			dir->untracked == istate->untracked &&
>  		    (dir->untracked->dir_opened ||
>  		     dir->untracked->gitignore_invalidated ||
>  		     dir->untracked->dir_invalidated))
> diff --git a/t/t7063-status-untracked-cache.sh b/t/t7063-status-untracked-cache.sh
> index e5fb892f95..6ef520e823 100755
> --- a/t/t7063-status-untracked-cache.sh
> +++ b/t/t7063-status-untracked-cache.sh
> @@ -14,6 +14,9 @@ test_description='test untracked cache'
>  # See <20160803174522.5571-1-pclouds@gmail.com> if you want to know
>  # more.
>  
> +GIT_TEST_UNTRACKED_CACHE=true
> +export GIT_TEST_UNTRACKED_CACHE
> +
>  sync_mtime () {
>  	find . -type d -ls >/dev/null
>  }
> 
> base-commit: 5be1f00a9a701532232f57958efab4be8c959a29
> -- 
> 2.15.0.windows.1
> 

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
  2018-02-05 20:55 ` Junio C Hamano
@ 2018-02-06  1:39   ` Ben Peart
  0 siblings, 0 replies; 15+ messages in thread
From: Ben Peart @ 2018-02-06  1:39 UTC (permalink / raw)
  To: Junio C Hamano, Ben Peart; +Cc: git, pclouds, bmwill, avarab



On 2/5/2018 3:55 PM, Junio C Hamano wrote:
> Ben Peart <benpeart@microsoft.com> writes:
> 
>> The untracked cache saves its current state in the UNTR index extension.
>> Currently, _any_ change to that state causes the index to be flagged as dirty
>> and written out to disk.  Unfortunately, the cost to write out the index can
>> exceed the savings gained by using the untracked cache.  Since it is a cache
>> that can be updated from the current state of the working directory, there is
>> no functional requirement that the index be written out for every change to the
>> untracked cache.
>>
>> Update the untracked cache logic so that it no longer forces the index to be
>> written to disk except in the case where the extension is being turned on or
>> off.  When some other git command requires the index to be written to disk, the
>> untracked cache will take advantage of that to save it's updated state as well.
>> This results in a performance win when looked at over common sequences of git
>> commands (ie such as a status followed by add, commit, etc).
>>
>> After this patch, all the logic to track statistics for the untracked cache
>> could be removed as it is only used by debug tracing used to debug the untracked
>> cache.
>>
>> Signed-off-by: Ben Peart <benpeart@microsoft.com>
>> ---
>>
> 
> OK, so in other words (note: not a suggestion to use different
> wording in the log message; just making sure I got the motivation
> behind this change correctly), without this new environment
> variable, changes to untracked cache alone (due to observed changes
> in the filesystem) does not count as "in-core index changed so we
> need to write it back to the disk".
> 

Correct

> That makes sense to me.
> 
> Is it envisioned that we want to have similar but different "testing
> only" behaviour around this area?  If not, this environment variable
> sounds more like "force-flush untracked cache", not "test untracked
> cache", to me.
> 

Many of the tests make a change and then verify that the on disk 
structure was updated correctly.  This was the simplest way to keep 
those tests functioning. I don't imagine this would be used for anything 
other than enabling the tests.  I hate naming so am happy to name it 
which ever you think is best. :)

>> +GIT_TEST_UNTRACKED_CACHE=true
>> +export GIT_TEST_UNTRACKED_CACHE
>> +
>>   sync_mtime () {
>>   	find . -type d -ls >/dev/null
>>   }
>>
>> base-commit: 5be1f00a9a701532232f57958efab4be8c959a29

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
  2018-02-05 21:58 ` Brandon Williams
@ 2018-02-06  1:48   ` Ben Peart
  2018-02-06 12:27     ` Duy Nguyen
  0 siblings, 1 reply; 15+ messages in thread
From: Ben Peart @ 2018-02-06  1:48 UTC (permalink / raw)
  To: Brandon Williams, Ben Peart; +Cc: git, pclouds, avarab



On 2/5/2018 4:58 PM, Brandon Williams wrote:
> On 02/05, Ben Peart wrote:
>> The untracked cache saves its current state in the UNTR index extension.
>> Currently, _any_ change to that state causes the index to be flagged as dirty
>> and written out to disk.  Unfortunately, the cost to write out the index can
>> exceed the savings gained by using the untracked cache.  Since it is a cache
>> that can be updated from the current state of the working directory, there is
>> no functional requirement that the index be written out for every change to the
>> untracked cache.
>>
>> Update the untracked cache logic so that it no longer forces the index to be
>> written to disk except in the case where the extension is being turned on or
>> off.  When some other git command requires the index to be written to disk, the
>> untracked cache will take advantage of that to save it's updated state as well.
>> This results in a performance win when looked at over common sequences of git
>> commands (ie such as a status followed by add, commit, etc).
>>
>> After this patch, all the logic to track statistics for the untracked cache
>> could be removed as it is only used by debug tracing used to debug the untracked
>> cache.
> 
> So we don't need to update it every time because its just a cache
> and if its inaccurate between status calls that's ok?  So only
> operations like add and commit will actually write out the untracked
> cache (as a part of writing out the index).  Sounds ok.
> 
> What benefit is there to using the untracked cache then?  Sounds like
> you should just turn it off instead?
> (I'm sure this is a naive question :D )

The parts of the untracked cache that have not changed since the 
extension was updated are still cached and valid.  Only those 
directories that have changes will need to be checked.

With the old behavior, making a change in dir1/, then calling status 
would update the dir1/ untracked cache entry, flag the index as dirty 
and write it out.  On the next status, git would detect that no changes 
have been made and use the cached data for dir1/.

With the new behavior, making a change in dir1/, then calling status 
would update the dir1/ untracked cache entry but not write it out. On 
the next status, git would detect the change in dir1/ again and update 
the untracked cache.  All of the other cached entries are still valid 
and the cache would be used for them.  The updated cache entry for dir1/ 
would not get persisted to disk until something that required the index 
to be written out.

The behavior is correct in both cases.  You just don't get the benefit 
of the updated cache for the dir1/ entry until the index is persisted 
again.  What you gain in exchange is that you don't have to write out 
the index which is (typically) a lot more expensive than checking dir1/ 
for changes.

> 
>>
>> Signed-off-by: Ben Peart <benpeart@microsoft.com>
>> ---
>>
>> Notes:
>>      Base Ref: master
>>      Web-Diff: https://github.com/benpeart/git/commit/20c2e8d787
>>      Checkout: git fetch https://github.com/benpeart/git untracked-cache-v1 && git checkout 20c2e8d787
>>
>>   dir.c                             | 3 ++-
>>   t/t7063-status-untracked-cache.sh | 3 +++
>>   2 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/dir.c b/dir.c
>> index 7c4b45e30e..da93374f0c 100644
>> --- a/dir.c
>> +++ b/dir.c
>> @@ -2297,7 +2297,8 @@ int read_directory(struct dir_struct *dir, struct index_state *istate,
>>   				 dir->untracked->gitignore_invalidated,
>>   				 dir->untracked->dir_invalidated,
>>   				 dir->untracked->dir_opened);
>> -		if (dir->untracked == istate->untracked &&
>> +		if (getenv("GIT_TEST_UNTRACKED_CACHE") &&
>> +			dir->untracked == istate->untracked &&
>>   		    (dir->untracked->dir_opened ||
>>   		     dir->untracked->gitignore_invalidated ||
>>   		     dir->untracked->dir_invalidated))
>> diff --git a/t/t7063-status-untracked-cache.sh b/t/t7063-status-untracked-cache.sh
>> index e5fb892f95..6ef520e823 100755
>> --- a/t/t7063-status-untracked-cache.sh
>> +++ b/t/t7063-status-untracked-cache.sh
>> @@ -14,6 +14,9 @@ test_description='test untracked cache'
>>   # See <20160803174522.5571-1-pclouds@gmail.com> if you want to know
>>   # more.
>>   
>> +GIT_TEST_UNTRACKED_CACHE=true
>> +export GIT_TEST_UNTRACKED_CACHE
>> +
>>   sync_mtime () {
>>   	find . -type d -ls >/dev/null
>>   }
>>
>> base-commit: 5be1f00a9a701532232f57958efab4be8c959a29
>> -- 
>> 2.15.0.windows.1
>>
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
  2018-02-06  1:48   ` Ben Peart
@ 2018-02-06 12:27     ` Duy Nguyen
  2018-02-06 12:55       ` Duy Nguyen
                         ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Duy Nguyen @ 2018-02-06 12:27 UTC (permalink / raw)
  To: Ben Peart; +Cc: Brandon Williams, Ben Peart, Git Mailing List, Ævar Arnfjörð Bjarmason

On Tue, Feb 6, 2018 at 8:48 AM, Ben Peart <peartben@gmail.com> wrote:
> With the new behavior, making a change in dir1/, then calling status would
> update the dir1/ untracked cache entry but not write it out. On the next
> status, git would detect the change in dir1/ again and update the untracked

Thing only missing piece here I would add is, this dir1/ detection is
done by watchman. We have to contact watchman and ask the set of
changed paths since $TIME where $TIME is the last time we updated
untracked cache and invalidate those paths in core. Then it should
work correctly. I checked the watchman query in the fsmonitor hook and
I think it's correct so far.

Except one point. What if you activate fsmonitor, write down the
fsmonitor_last_update field there but _not_ create UNTR extension for
the same timestamp? UNTR extension is created only when
read_directory() is executed and we don't do that in every command. I
haven't checked fsmonitor.c carefully, maybe I'm missing something.

One thing I'd like to point out in this patch is untracked cache will
start degrading if you just make the initial UNTR extension once and
don't ever update it again. Dirty paths hit slow path and will start
poking the disk. If somebody accidentally change a minor thing in
every single directory (worst case scenario), the untracked cache
becomes useless. We need a threshold or something to start repairing
UNTR extension if it gets damaged too much.

If you rely on random index updates (e.g. the main entries got updated
and .git/index must be updated) to write UNTR extension down together.
Please don't do that, at least not this way. cache_changed mask should
reflect all dirty parts in .git/index. If UNTR extension is not marked
updated, it's legit to just skip generating/writing it down (e.g. if I
kept the old UNTR extension from the last time I read .git/index
around in memory)

> cache.  All of the other cached entries are still valid and the cache would
> be used for them.  The updated cache entry for dir1/ would not get persisted
> to disk until something that required the index to be written out.
>
> The behavior is correct in both cases.  You just don't get the benefit of
> the updated cache for the dir1/ entry until the index is persisted again.
> What you gain in exchange is that you don't have to write out the index
> which is (typically) a lot more expensive than checking dir1/ for changes.

This is another thing that bugs me. I know you're talking about huge
index files, but at what size should we start this sort of
optimization? Writing down a few MBs on linux is cheap enough that I
won't bother optimizing (and I get my UNTR extension repaired all the
time, so reduced lstat calls and stuff). This "typically" only comes
at certain size, what size?
-- 
Duy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
  2018-02-06 12:27     ` Duy Nguyen
@ 2018-02-06 12:55       ` Duy Nguyen
  2018-02-07 10:59         ` Duy Nguyen
  2018-02-06 14:50       ` Junio C Hamano
  2018-02-07 14:13       ` Ben Peart
  2 siblings, 1 reply; 15+ messages in thread
From: Duy Nguyen @ 2018-02-06 12:55 UTC (permalink / raw)
  To: Ben Peart; +Cc: Brandon Williams, Ben Peart, Git Mailing List, Ævar Arnfjörð Bjarmason

On Tue, Feb 6, 2018 at 7:27 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Tue, Feb 6, 2018 at 8:48 AM, Ben Peart <peartben@gmail.com> wrote:
>> With the new behavior, making a change in dir1/, then calling status would
>> update the dir1/ untracked cache entry but not write it out. On the next
>> status, git would detect the change in dir1/ again and update the untracked
>
> Thing only missing piece here I would add is, this dir1/ detection is
> done by watchman. We have to contact watchman and ask the set of
> changed paths since $TIME where $TIME is the last time we updated
> untracked cache and invalidate those paths in core. Then it should
> work correctly. I checked the watchman query in the fsmonitor hook and
> I think it's correct so far.

No I think I'm wrong. And worse, I think the interaction between FSNM
and UNTR extension is broken. This is partly from me guessing how
fsmonitor works so I may be double-wrong.

UNTR extension is supposed to cover the untracked state at timestamp
$X. Where $X is stored in FSNM extension. Correct?

When fsmonitor is used and read_directory() is executed (at timestamp
$Y), we ask watchman "what's new on worktree since $X?"). We get the
list, we invalidate relevant directories so we can see new untracked
entries (or lack of old untracked entries). We write FSNM with
timestamp $Y down. We may or may not write UNTR down because of this
patch, but let's suppose we do. All is good. UNTR now records the
state at $Y. FSNM says $Y. This is how it works (when there's no bugs)

UNTR extension is only updated when read_directory() is called. It
does not always happen. FSNM is updated all the time (almost at every
git command since most of them needs to read index, many write it
down) with new timestamp. Suppose FSNM now records timestamp $Z, UNTR
still records the state at $Y because in the last index update,
read_directory() is not executed. 4 files have been added between $Y
and $Z. When we ask watchman "what's new since $Z?" we will not see
those files, so we don't invalidate relevant directories and
read_directory() will not see those files.

What am I missing?
-- 
Duy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
  2018-02-06 12:27     ` Duy Nguyen
  2018-02-06 12:55       ` Duy Nguyen
@ 2018-02-06 14:50       ` Junio C Hamano
  2018-02-07 14:13       ` Ben Peart
  2 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2018-02-06 14:50 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Ben Peart, Brandon Williams, Ben Peart, Git Mailing List, Ævar Arnfjörð Bjarmason

Duy Nguyen <pclouds@gmail.com> writes:

> Please don't do that, at least not this way. cache_changed mask should
> reflect all dirty parts in .git/index. If UNTR extension is not marked
> updated, it's legit to just skip generating/writing it down (e.g. if I
> kept the old UNTR extension from the last time I read .git/index
> around in memory)

Excellent point.  Thanks for mentioning this (it crossed my mind
when I responded but I forgot to metion in my message).


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
  2018-02-06 12:55       ` Duy Nguyen
@ 2018-02-07 10:59         ` Duy Nguyen
  2018-02-07 13:46           ` Ben Peart
  0 siblings, 1 reply; 15+ messages in thread
From: Duy Nguyen @ 2018-02-07 10:59 UTC (permalink / raw)
  To: Ben Peart; +Cc: Brandon Williams, Ben Peart, Git Mailing List, Ævar Arnfjörð Bjarmason

On Tue, Feb 6, 2018 at 7:55 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Tue, Feb 6, 2018 at 7:27 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>> On Tue, Feb 6, 2018 at 8:48 AM, Ben Peart <peartben@gmail.com> wrote:
>>> With the new behavior, making a change in dir1/, then calling status would
>>> update the dir1/ untracked cache entry but not write it out. On the next
>>> status, git would detect the change in dir1/ again and update the untracked
>>
>> Thing only missing piece here I would add is, this dir1/ detection is
>> done by watchman. We have to contact watchman and ask the set of
>> changed paths since $TIME where $TIME is the last time we updated
>> untracked cache and invalidate those paths in core. Then it should
>> work correctly. I checked the watchman query in the fsmonitor hook and
>> I think it's correct so far.
>
> No I think I'm wrong. And worse, I think the interaction between FSNM
> and UNTR extension is broken. This is partly from me guessing how
> fsmonitor works so I may be double-wrong.
>
> UNTR extension is supposed to cover the untracked state at timestamp
> $X. Where $X is stored in FSNM extension. Correct?
>
> When fsmonitor is used and read_directory() is executed (at timestamp
> $Y), we ask watchman "what's new on worktree since $X?"). We get the
> list, we invalidate relevant directories so we can see new untracked
> entries (or lack of old untracked entries). We write FSNM with
> timestamp $Y down. We may or may not write UNTR down because of this
> patch, but let's suppose we do. All is good. UNTR now records the
> state at $Y. FSNM says $Y. This is how it works (when there's no bugs)
>
> UNTR extension is only updated when read_directory() is called. It
> does not always happen. FSNM is updated all the time (almost at every

I was indeed doubly wrong. When FSMN is read, it does make calls to
invalidate stuff in untracked cache data structure. If the index is
written down (with updated FSMN extension and timestamp) then the UNTR
extension, which is generated from in-core untracked data structure,
also reflects the damages by the changed paths so next time even if
those changed paths are not reported again (between $Y and $Z below),
it's fine.

All is good in the world again :)

> git command since most of them needs to read index, many write it
> down) with new timestamp. Suppose FSNM now records timestamp $Z, UNTR
> still records the state at $Y because in the last index update,
> read_directory() is not executed. 4 files have been added between $Y
> and $Z. When we ask watchman "what's new since $Z?" we will not see
> those files, so we don't invalidate relevant directories and
> read_directory() will not see those files.
-- 
Duy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
  2018-02-07 10:59         ` Duy Nguyen
@ 2018-02-07 13:46           ` Ben Peart
  0 siblings, 0 replies; 15+ messages in thread
From: Ben Peart @ 2018-02-07 13:46 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Brandon Williams, Ben Peart, Git Mailing List, Ævar Arnfjörð Bjarmason



On 2/7/2018 5:59 AM, Duy Nguyen wrote:
> On Tue, Feb 6, 2018 at 7:55 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>> On Tue, Feb 6, 2018 at 7:27 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>>> On Tue, Feb 6, 2018 at 8:48 AM, Ben Peart <peartben@gmail.com> wrote:
>>>> With the new behavior, making a change in dir1/, then calling status would
>>>> update the dir1/ untracked cache entry but not write it out. On the next
>>>> status, git would detect the change in dir1/ again and update the untracked
>>>
>>> Thing only missing piece here I would add is, this dir1/ detection is
>>> done by watchman. We have to contact watchman and ask the set of
>>> changed paths since $TIME where $TIME is the last time we updated
>>> untracked cache and invalidate those paths in core. Then it should
>>> work correctly. I checked the watchman query in the fsmonitor hook and
>>> I think it's correct so far.
>>
>> No I think I'm wrong. And worse, I think the interaction between FSNM
>> and UNTR extension is broken. This is partly from me guessing how
>> fsmonitor works so I may be double-wrong.
>>
>> UNTR extension is supposed to cover the untracked state at timestamp
>> $X. Where $X is stored in FSNM extension. Correct?
>>
>> When fsmonitor is used and read_directory() is executed (at timestamp
>> $Y), we ask watchman "what's new on worktree since $X?"). We get the
>> list, we invalidate relevant directories so we can see new untracked
>> entries (or lack of old untracked entries). We write FSNM with
>> timestamp $Y down. We may or may not write UNTR down because of this
>> patch, but let's suppose we do. All is good. UNTR now records the
>> state at $Y. FSNM says $Y. This is how it works (when there's no bugs)
>>
>> UNTR extension is only updated when read_directory() is called. It
>> does not always happen. FSNM is updated all the time (almost at every
> 
> I was indeed doubly wrong. When FSMN is read, it does make calls to
> invalidate stuff in untracked cache data structure. If the index is
> written down (with updated FSMN extension and timestamp) then the UNTR
> extension, which is generated from in-core untracked data structure,
> also reflects the damages by the changed paths so next time even if
> those changed paths are not reported again (between $Y and $Z below),
> it's fine.
> 
> All is good in the world again :)

Thank you for looking into this closely.  It's always good to have 
someone else examine the logic to make sure there aren't errors that 
were missed.  Sorry my responses have been slow, my day job has had me 
busy lately.

> 
>> git command since most of them needs to read index, many write it
>> down) with new timestamp. Suppose FSNM now records timestamp $Z, UNTR
>> still records the state at $Y because in the last index update,
>> read_directory() is not executed. 4 files have been added between $Y
>> and $Z. When we ask watchman "what's new since $Z?" we will not see
>> those files, so we don't invalidate relevant directories and
>> read_directory() will not see those files.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
  2018-02-06 12:27     ` Duy Nguyen
  2018-02-06 12:55       ` Duy Nguyen
  2018-02-06 14:50       ` Junio C Hamano
@ 2018-02-07 14:13       ` Ben Peart
  2018-02-12 10:20         ` Duy Nguyen
  2 siblings, 1 reply; 15+ messages in thread
From: Ben Peart @ 2018-02-07 14:13 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Brandon Williams, Ben Peart, Git Mailing List, Ævar Arnfjörð Bjarmason



On 2/6/2018 7:27 AM, Duy Nguyen wrote:
> 
> This is another thing that bugs me. I know you're talking about huge
> index files, but at what size should we start this sort of
> optimization? Writing down a few MBs on linux is cheap enough that I
> won't bother optimizing (and I get my UNTR extension repaired all the
> time, so reduced lstat calls and stuff). This "typically" only comes
> at certain size, what size?
> 

It's important to identify what we're trading off here.  With the 
proposed optimization, we'll end up doing less writes of the index but 
potentially more lstat calls.  Even with a small index, writing the 
index is much more expensive than calling lstat on a file.  Exactly how 
much more expensive depends on a lot of variables but even with a SSD 
its still orders of magnitude difference.

That means we could potentially lstat hundreds or thousands of files and 
still come out ahead.  Given the untracked cache works at the directory 
level, the lstat cost actually scales with the number of directories 
that have had changes (ie changing a 2nd file in the same directory 
doesn't add any additional cost).  In "typical" workflows, developers 
don't often change hundreds of files across different directories so 
we'd "typically" still come out ahead.

We have internal performance tests based on common developer sequences 
of commands (status, edit a file, status, add, status, commit, log, 
push, etc) that show that having the untracked cache turned on actually 
makes these sequences slower.  At the time, we didn't have time to 
investigate why so we just turned off the untracked cache.

When writing the fsmonitor patch series which can utilize the untracked 
cache, I did some investigation into why the untracked cache was slowing 
things down in these tests and discovered the cause was the additional 
index writes.  That is what led to this patch.

I've been sitting on it for months now because I didn't have the time to 
write some performance tests for the git perf framework to demonstrate 
the problem and how this helps solve it.  With the discussion about the 
fsmonitor extension, I thought this might be a good time to send it out 
there.

If you have the cycles, time a typical series of commands with and 
without the untracked cache (ie don't just call status over and over in 
a loop) and you should see the same results.  Given my limited time 
right now, I'm OK putting this on the back burner again until a git perf 
test can be written to ensure it actually speeds things up as advertised.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
  2018-02-05 19:56 [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache Ben Peart
  2018-02-05 20:55 ` Junio C Hamano
  2018-02-05 21:58 ` Brandon Williams
@ 2018-02-08 10:33 ` Jeff King
  2 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2018-02-08 10:33 UTC (permalink / raw)
  To: Ben Peart; +Cc: git, pclouds, bmwill, avarab

On Mon, Feb 05, 2018 at 02:56:19PM -0500, Ben Peart wrote:

> diff --git a/dir.c b/dir.c
> index 7c4b45e30e..da93374f0c 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -2297,7 +2297,8 @@ int read_directory(struct dir_struct *dir, struct index_state *istate,
>  				 dir->untracked->gitignore_invalidated,
>  				 dir->untracked->dir_invalidated,
>  				 dir->untracked->dir_opened);
> -		if (dir->untracked == istate->untracked &&
> +		if (getenv("GIT_TEST_UNTRACKED_CACHE") &&

A minor nit, but please use something like:

  if (git_env_bool("GIT_TEST_UNTRACKED_CACHE", 0) && ...

so that:

  GIT_TEST_UNTRACKED_CACHE=false

does what one might expect, and not the opposite.

Two other thoughts:

  - it may be worth memo-izing it with a static variable to avoid
    repeatedly calling the possibly-slow getenv()

  - I agree with the sentiment elsewhere that something like
    GIT_FORCE_UNTRACKED_CACHE is probably a better name

(The idea itself seems sound to me, but it's not really my area).

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
  2018-02-07 14:13       ` Ben Peart
@ 2018-02-12 10:20         ` Duy Nguyen
  2018-02-12 17:57           ` Ben Peart
  0 siblings, 1 reply; 15+ messages in thread
From: Duy Nguyen @ 2018-02-12 10:20 UTC (permalink / raw)
  To: Ben Peart; +Cc: Brandon Williams, Ben Peart, Git Mailing List, Ævar Arnfjörð Bjarmason

On Wed, Feb 7, 2018 at 9:13 PM, Ben Peart <peartben@gmail.com> wrote:
>
>
> On 2/6/2018 7:27 AM, Duy Nguyen wrote:
>>
>>
>> This is another thing that bugs me. I know you're talking about huge
>> index files, but at what size should we start this sort of
>> optimization? Writing down a few MBs on linux is cheap enough that I
>> won't bother optimizing (and I get my UNTR extension repaired all the
>> time, so reduced lstat calls and stuff). This "typically" only comes
>> at certain size, what size?
>>
>
> It's important to identify what we're trading off here.  With the proposed
> optimization, we'll end up doing less writes of the index but potentially
> more lstat calls.  Even with a small index, writing the index is much more
> expensive than calling lstat on a file.  Exactly how much more expensive
> depends on a lot of variables but even with a SSD its still orders of
> magnitude difference.

Keep in mind it's not just about lstat() calls. Processing ignore
patterns also takes a big chunk of CPU, especially when you have more
than a couple ignore rules.

I'm not convinced that writing index files is that expensive for small
files. I don't know about Windows, with Linux it seems fast to me.
Actually, for small scale repos, performance probably does not differ
much either.

> That means we could potentially lstat hundreds or thousands of files and
> still come out ahead.  Given the untracked cache works at the directory
> level, the lstat cost actually scales with the number of directories that
> have had changes (ie changing a 2nd file in the same directory doesn't add
> any additional cost).  In "typical" workflows, developers don't often change
> hundreds of files across different directories so we'd "typically" still
> come out ahead.

I agree. But we must deal with the bad case when someone
"accidentally" does that. We should not wait until them yell up "it's
too slow now" and tell them to disable/enable untracked cache again.

Another case that could touches a lot of directories over time is
switch trees (e.g. "git checkout master" then "checkout next" or worse
"checkout v1.0").

> We have internal performance tests based on common developer sequences of
> commands (status, edit a file, status, add, status, commit, log, push, etc)
> that show that having the untracked cache turned on actually makes these
> sequences slower.  At the time, we didn't have time to investigate why so we
> just turned off the untracked cache.
>
> When writing the fsmonitor patch series which can utilize the untracked
> cache, I did some investigation into why the untracked cache was slowing
> things down in these tests and discovered the cause was the additional index
> writes.  That is what led to this patch.
>
> I've been sitting on it for months now because I didn't have the time to
> write some performance tests for the git perf framework to demonstrate the
> problem and how this helps solve it.  With the discussion about the
> fsmonitor extension, I thought this might be a good time to send it out
> there.

Hopefully you have time to get some of those numbers :) A patch is
more convincing when it's backed by numbers. And I'm still not
convinced that never ever letting untracked cache be written to the
index again is a right thing to do [1].

> If you have the cycles, time a typical series of commands with and without
> the untracked cache (ie don't just call status over and over in a loop) and
> you should see the same results.  Given my limited time right now, I'm OK
> putting this on the back burner again until a git perf test can be written
> to ensure it actually speeds things up as advertised.

[1] Another case that we must _sometimes_ let untracked cache be
written down is to correct its data. My last invalidation bug, for
example, could leave the untracked cache in a bad state, when you run
with new git then it should be able to correct the data and write
down. But if you don't allow writing down, the new 'git' will keep
seeing the old errors and keep complaining.
-- 
Duy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
  2018-02-12 10:20         ` Duy Nguyen
@ 2018-02-12 17:57           ` Ben Peart
  2018-02-13  9:57             ` Duy Nguyen
  0 siblings, 1 reply; 15+ messages in thread
From: Ben Peart @ 2018-02-12 17:57 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Brandon Williams, Ben Peart, Git Mailing List, Ævar Arnfjörð Bjarmason



On 2/12/2018 5:20 AM, Duy Nguyen wrote:
> On Wed, Feb 7, 2018 at 9:13 PM, Ben Peart <peartben@gmail.com> wrote:
>>
>> On 2/6/2018 7:27 AM, Duy Nguyen wrote:
>>>
>>> This is another thing that bugs me. I know you're talking about huge
>>> index files, but at what size should we start this sort of
>>> optimization? Writing down a few MBs on linux is cheap enough that I
>>> won't bother optimizing (and I get my UNTR extension repaired all the
>>> time, so reduced lstat calls and stuff). This "typically" only comes
>>> at certain size, what size?
>>>
>> It's important to identify what we're trading off here.  With the proposed
>> optimization, we'll end up doing less writes of the index but potentially
>> more lstat calls.  Even with a small index, writing the index is much more
>> expensive than calling lstat on a file.  Exactly how much more expensive
>> depends on a lot of variables but even with a SSD its still orders of
>> magnitude difference.
> Keep in mind it's not just about lstat() calls. Processing ignore
> patterns also takes a big chunk of CPU, especially when you have more
> than a couple ignore rules.

Yes, I'm very familiar with the cost of the exclude list pattern 
matching code.  I've rewritten it to use a hashmap (for those patterns 
where it is possible) that dramatically speeds that aspect up as we tend 
to abuse it pretty heavily (~60K entries on average).

>
> I'm not convinced that writing index files is that expensive for small
> files. I don't know about Windows, with Linux it seems fast to me.
> Actually, for small scale repos, performance probably does not differ
> much either.

I agree.  For small scale repos, none of this is significant enough to 
matter.  Untracked caching should help most as your working directory 
starts to get large.

>> That means we could potentially lstat hundreds or thousands of files and
>> still come out ahead.  Given the untracked cache works at the directory
>> level, the lstat cost actually scales with the number of directories that
>> have had changes (ie changing a 2nd file in the same directory doesn't add
>> any additional cost).  In "typical" workflows, developers don't often change
>> hundreds of files across different directories so we'd "typically" still
>> come out ahead.
> I agree. But we must deal with the bad case when someone
> "accidentally" does that. We should not wait until them yell up "it's
> too slow now" and tell them to disable/enable untracked cache again.
>
> Another case that could touches a lot of directories over time is
> switch trees (e.g. "git checkout master" then "checkout next" or worse
> "checkout v1.0").

You're example above makes me wonder if you understand what my patch is 
doing.  If the index is flagged as dirty for _any_ reason, the entire 
index is written to disk with the latest information - including the 
updated untracked cache and all other extensions.  So in your checkout 
examples above, the index will still get written to disk with the 
updated untracked cache extension.  There would be zero change in 
behavior or performance.  The _only_ time the index is not written to 
disk after my patch is if there were no other changes to the index.  In 
my experience, that is only status calls.

To suffer any degradation in the untracked cache would be if a user 
edited a bunch of files across multiple directories and called status 
repeatedly.  As soon as they did add, checkout, rm, rebase, etc (ie most 
git commands other than status) the index would get flagged as dirty and 
the latest untracked cache extension would get written to disk.

>> We have internal performance tests based on common developer sequences of
>> commands (status, edit a file, status, add, status, commit, log, push, etc)
>> that show that having the untracked cache turned on actually makes these
>> sequences slower.  At the time, we didn't have time to investigate why so we
>> just turned off the untracked cache.
>>
>> When writing the fsmonitor patch series which can utilize the untracked
>> cache, I did some investigation into why the untracked cache was slowing
>> things down in these tests and discovered the cause was the additional index
>> writes.  That is what led to this patch.
>>
>> I've been sitting on it for months now because I didn't have the time to
>> write some performance tests for the git perf framework to demonstrate the
>> problem and how this helps solve it.  With the discussion about the
>> fsmonitor extension, I thought this might be a good time to send it out
>> there.
> Hopefully you have time to get some of those numbers :) A patch is
> more convincing when it's backed by numbers. And I'm still not
> convinced that never ever letting untracked cache be written to the
> index again is a right thing to do [1].

I agree that any performance patch should be accompanied by a 
performance test to demonstrate it actually performs as promised.  I 
looked for but didn't find a performance test for the untracked cache so 
will have to create one from scratch.  In order to make that as 
realistic as possible, it needs to simulate (as much as possible) 
typical developer workflows, ie  create/edit files across multiple 
directories and then execute various common commands (status, add, 
status, add, status, rm, status, commit, log, rebase, etc) and time the 
performance of that sequence of commands.  All doable, that just isn't 
supported well in the performance framework yet so will take some time 
(much more than the actual patch :)).

>> If you have the cycles, time a typical series of commands with and without
>> the untracked cache (ie don't just call status over and over in a loop) and
>> you should see the same results.  Given my limited time right now, I'm OK
>> putting this on the back burner again until a git perf test can be written
>> to ensure it actually speeds things up as advertised.
> [1] Another case that we must _sometimes_ let untracked cache be
> written down is to correct its data. My last invalidation bug, for
> example, could leave the untracked cache in a bad state, when you run
> with new git then it should be able to correct the data and write
> down. But if you don't allow writing down, the new 'git' will keep
> seeing the old errors and keep complaining.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache
  2018-02-12 17:57           ` Ben Peart
@ 2018-02-13  9:57             ` Duy Nguyen
  0 siblings, 0 replies; 15+ messages in thread
From: Duy Nguyen @ 2018-02-13  9:57 UTC (permalink / raw)
  To: Ben Peart; +Cc: Brandon Williams, Ben Peart, Git Mailing List, Ævar Arnfjörð Bjarmason

On Tue, Feb 13, 2018 at 12:57 AM, Ben Peart <peartben@gmail.com> wrote:
>> Another case that could touches a lot of directories over time is
>> switch trees (e.g. "git checkout master" then "checkout next" or worse
>> "checkout v1.0").
>
>
> You're example above makes me wonder if you understand what my patch is
> doing.  If the index is flagged as dirty for _any_ reason, the entire index
> is written to disk with the latest information - including the updated
> untracked cache and all other extensions.  So in your checkout examples
> above, the index will still get written to disk with the updated untracked
> cache extension.  There would be zero change in behavior or performance.
> The _only_ time the index is not written to disk after my patch is if there
> were no other changes to the index.  In my experience, that is only status
> calls.

The untracked cache is updated and does get written down, but it's not
"repaired" unless you have called read_directory() before the index is
written. Though paths that hit untracked_cache_invalidate_path() will
continue on slow path until you call read_directory() and write down.
I don't think "git checkout" calls read_directory. There are some
commands, like "git add", that update the index and call
read_directory() at the same time. So yes I was wrong, the untracked
cache can be repaired sometimes, not never repaired.

We do try to improve performance at "git checkout" and a couple other
"slow" commands though (e.g. repair cache tree), perhaps we can do the
same for untracked cache. Though the cost of updating untracked cache
is potentially much higher than cache tree.
-- 
Duy

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, back to index

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-05 19:56 [PATCH v1] dir.c: don't flag the index as dirty for changes to the untracked cache Ben Peart
2018-02-05 20:55 ` Junio C Hamano
2018-02-06  1:39   ` Ben Peart
2018-02-05 21:58 ` Brandon Williams
2018-02-06  1:48   ` Ben Peart
2018-02-06 12:27     ` Duy Nguyen
2018-02-06 12:55       ` Duy Nguyen
2018-02-07 10:59         ` Duy Nguyen
2018-02-07 13:46           ` Ben Peart
2018-02-06 14:50       ` Junio C Hamano
2018-02-07 14:13       ` Ben Peart
2018-02-12 10:20         ` Duy Nguyen
2018-02-12 17:57           ` Ben Peart
2018-02-13  9:57             ` Duy Nguyen
2018-02-08 10:33 ` Jeff King

git@vger.kernel.org mailing list mirror (one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox