[PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading
@ 2017-10-18 14:27 Ben Peart
  2017-10-19  5:22 ` Junio C Hamano
  2017-10-24 14:45 ` [PATCH v2] " Ben Peart
  0 siblings, 2 replies; 18+ messages in thread
From: Ben Peart @ 2017-10-18 14:27 UTC (permalink / raw)
  To: git; +Cc: jeffhost, gitster, t.gummerer, mhagger, Ben Peart

There is code in post_read_index_from() to catch out of order entries
when reading an index file.  This order verification is ~13% of the cost
of every call to read_index_from().

Update check_ce_order() so that it skips this verification unless the
"verify_ce_order" global variable is set.

Teach fsck to force this verification.

The effect can be seen using t/perf/p0002-read-cache.sh:

Test                                          HEAD              HEAD~1
--------------------------------------------------------------------------------------
0002.1: read_cache/discard_cache 1000 times   0.41(0.04+0.04)   0.50(0.00+0.10) +22.0%

Signed-off-by: Ben Peart <benpeart@microsoft.com>
---

Notes:
    Base Ref:
    Web-Diff: https://github.com/benpeart/git/commit/54fa9cf954
    Checkout: git fetch https://github.com/benpeart/git verify_ce_order-v1 && git checkout 54fa9cf954

 builtin/fsck.c | 1 +
 cache.h        | 1 +
 read-cache.c   | 6 ++++++
 3 files changed, 8 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index d18244ab54..81cc5ad78d 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -763,6 +763,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 
 	if (keep_cache_objects) {
 		verify_index_checksum = 1;
+		verify_ce_order = 1;
 		read_cache();
 		for (i = 0; i < active_nr; i++) {
 			unsigned int mode;
diff --git a/cache.h b/cache.h
index 5e2c9512ff..4a4f879061 100644
--- a/cache.h
+++ b/cache.h
@@ -721,6 +721,7 @@ extern int hold_locked_index(struct lock_file *, int);
 extern void set_alternate_index_output(const char *);
 
 extern int verify_index_checksum;
+extern int verify_ce_order;
 
 /* Environment bits from configuration mechanism */
 extern int trust_executable_bit;
diff --git a/read-cache.c b/read-cache.c
index b64610c400..32743da157 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1513,6 +1513,9 @@ struct ondisk_cache_entry_extended {
 /* Allow fsck to force verification of the index checksum. */
 int verify_index_checksum;
 
+/* Allow fsck to force verification of the cache entry order. */
+int verify_ce_order;
+
 static int verify_hdr(struct cache_header *hdr, unsigned long size)
 {
 	git_SHA_CTX c;
@@ -1698,6 +1701,9 @@ static void check_ce_order(struct index_state *istate)
 {
 	unsigned int i;
 
+	if (!verify_ce_order)
+		return;
+
 	for (i = 1; i < istate->cache_nr; i++) {
 		struct cache_entry *ce = istate->cache[i - 1];
 		struct cache_entry *next_ce = istate->cache[i];

base-commit: 46309c695b080648f8c0ea5e2d1fd7387ee7f044
-- 
2.14.1.windows.1.1034.g0776750557


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-18 14:27 [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading Ben Peart
@ 2017-10-19  5:22 ` Junio C Hamano
  2017-10-19 15:12   ` Ben Peart
  2017-10-24 14:45 ` [PATCH v2] " Ben Peart
  1 sibling, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2017-10-19  5:22 UTC (permalink / raw)
  To: Ben Peart; +Cc: git, jeffhost, t.gummerer, mhagger

Ben Peart <benpeart@microsoft.com> writes:

> There is code in post_read_index_from() to catch out of order entries
> when reading an index file.  This order verification is ~13% of the cost
> of every call to read_index_from().

I find this a bit over-generalized claim---wouldn't the overhead
depend on various conditions, e.g. the size of the index and if
split-index is in effect?

In general, I get very skeptical towards any change that makes the
integrity of the data less certain based only on microbenchmarks,
and prefer to see a solution that can absorb the overhead in some
other way.

When we are using split-index, the current code is not validating
the two input files from the disk. Because merge_base_index()
depends on the base to be properly sorted before the overriding
entries are added into it, if the input from disk is in a wrong
order, we are screwed already, and the order check in post
processing is pointless.  If we want to do this order validation, I
think we should be doing it in do_read_index() where it does
create_from_disk() and the set_index_entry(), instead of having it
as a separate phase that scans a potentially large index array one
more time.  And doing so will not penalize the case where we do not
use split-index, either.

So, I think I like the direction of getting rid of the order
validation in post_read_index_from(), not only during the normal
operation but also in fsck.  I think it makes more sense to do so
incrementally inside do_read_index() all the time and see how fast
we can make it do so.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-19  5:22 ` Junio C Hamano
@ 2017-10-19 15:12   ` Ben Peart
  2017-10-19 16:05     ` Jeff King
  2017-10-19 22:14     ` Stefan Beller
  0 siblings, 2 replies; 18+ messages in thread
From: Ben Peart @ 2017-10-19 15:12 UTC (permalink / raw)
  To: Junio C Hamano, Ben Peart; +Cc: git, jeffhost, t.gummerer, mhagger

On 10/19/2017 1:22 AM, Junio C Hamano wrote:
> Ben Peart <benpeart@microsoft.com> writes:
> 
>> There is code in post_read_index_from() to catch out of order entries
>> when reading an index file.  This order verification is ~13% of the cost
>> of every call to read_index_from().
> 
> I find this a bit over-generalized claim---wouldn't the overhead
> depend on various conditions, e.g. the size of the index and if
> split-index is in effect?
> 

I tested it against 10K, 100K and 1,000K files and the absolute time 
varies with different sized indexes, but the percentage is relatively 
consistent (after doing multiple runs and averaging the results to 
reduce noise).  I didn't measure it with split index so can't say how 
that would impact performance.

> In general, I get very skeptical towards any change that makes the
> integrity of the data less certain based only on microbenchmarks,
> and prefer to see a solution that can absorb the overhead in some
> other way.
> 
> When we are using split-index, the current code is not validating
> the two input files from the disk. Because merge_base_index()
> depends on the base to be properly sorted before the overriding
> entries are added into it, if the input from disk is in a wrong
> order, we are screwed already, and the order check in post
> processing is pointless.  

The original commit message doesn't say *why* this test was added so I 
have to make some educated guesses.  Given no attempt is made to recover 
or continue and instead we just die when an out-of-order entry is 
detected, I'm assuming check_ce_order() is protecting against buggy code 
that incorrectly wrote out an invalid index (the sha check would have 
detected file corruption or torn writes).

If we are guarding against "git" writing out an invalid index, we can 
move this into an assert so that only git developers pay the cost of 
validating they haven't created a new bug.  I think this is better than 
just adding a new test case as a new test case would not achieve the 
same coverage.  This is my preferred solution.

If we are guarding against "some other application" writing out an 
invalid index, then everyone will have to pay the cost as we can't 
insert the test into "some other applications."  Without user reports of 
it happening or any telemetry saying it has happened I really have no 
idea if it every actually happens in the wild anymore and whether the 
cost on every index load is still justified.

> If we want to do this order validation, I
> think we should be doing it in do_read_index() where it does
> create_from_disk() and the set_index_entry(), instead of having it
> as a separate phase that scans a potentially large index array one
> more time.  And doing so will not penalize the case where we do not
> use split-index, either.
> 
> So, I think I like the direction of getting rid of the order
> validation in post_read_index_from(), not only during the normal
> operation but also in fsck.  I think it makes more sense to do so
> incrementally inside do_read_index() all the time and see how fast
> we can make it do so.
> 

Unfortunately, all the cost is in the strcmp() - the other tests are 
negligible so moving it to be done incrementally inside do_read_index() 
won't reduce the cost, it just moves it and makes it harder to identify.

The only idea I could come up with for reducing the cost to our end 
users is to keep it separate and split the test across multiple threads 
with some minimum index size threshold as we have done elsewhere.  This 
adds code and complexity that we'll have to maintain forever so is less 
preferred than making it an assert.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-19 15:12   ` Ben Peart
@ 2017-10-19 16:05     ` Jeff King
  2017-10-20  1:14       ` Junio C Hamano
  2017-10-19 22:14     ` Stefan Beller
  1 sibling, 1 reply; 18+ messages in thread
From: Jeff King @ 2017-10-19 16:05 UTC (permalink / raw)
  To: Ben Peart; +Cc: Junio C Hamano, Ben Peart, git, jeffhost, t.gummerer, mhagger

On Thu, Oct 19, 2017 at 11:12:03AM -0400, Ben Peart wrote:

> > So, I think I like the direction of getting rid of the order
> > validation in post_read_index_from(), not only during the normal
> > operation but also in fsck.  I think it makes more sense to do so
> > incrementally inside do_read_index() all the time and see how fast
> > we can make it do so.
> 
> Unfortunately, all the cost is in the strcmp() - the other tests are
> negligible so moving it to be done incrementally inside do_read_index()
> won't reduce the cost, it just moves it and makes it harder to identify.

It's plausible that doing the strcmp() closer to where we are otherwise
manipulating the data may show an improvement due to memory cache
effects.

It should be easy enough to check that; the patch below implements it.
I couldn't measure any speedup with it running "git ls-files >/dev/null"
on linux.git (60k files). But nor could I get any by dropping the check
entirely.

This is mostly just a curiosity to me. For the record, I have no real
problem with dropping this kind of on-disk data-structure validation
when it's expensive. After all, we do not check the sort on pack .idx
files on each run (nor pack sha1 checksums, etc) because it's too
expensive to do so.

---
diff --git a/read-cache.c b/read-cache.c
index 65f4fe8375..ac8c8d2e93 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1664,25 +1664,19 @@ static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk,
 	return ce;
 }
 
-static void check_ce_order(struct index_state *istate)
-{
-	unsigned int i;
-
-	for (i = 1; i < istate->cache_nr; i++) {
-		struct cache_entry *ce = istate->cache[i - 1];
-		struct cache_entry *next_ce = istate->cache[i];
-		int name_compare = strcmp(ce->name, next_ce->name);
-
-		if (0 < name_compare)
-			die("unordered stage entries in index");
-		if (!name_compare) {
-			if (!ce_stage(ce))
-				die("multiple stage entries for merged file '%s'",
-				    ce->name);
-			if (ce_stage(ce) > ce_stage(next_ce))
-				die("unordered stage entries for '%s'",
-				    ce->name);
-		}
+static void check_ce_order(struct cache_entry *ce, struct cache_entry *next_ce)
+{
+	int name_compare = strcmp(ce->name, next_ce->name);
+
+	if (0 < name_compare)
+		die("unordered stage entries in index");
+	if (!name_compare) {
+		if (!ce_stage(ce))
+			die("multiple stage entries for merged file '%s'",
+			    ce->name);
+		if (ce_stage(ce) > ce_stage(next_ce))
+			die("unordered stage entries for '%s'",
+			    ce->name);
 	}
 }
 
@@ -1720,7 +1714,6 @@ static void tweak_split_index(struct index_state *istate)
 
 static void post_read_index_from(struct index_state *istate)
 {
-	check_ce_order(istate);
 	tweak_untracked_cache(istate);
 	tweak_split_index(istate);
 }
@@ -1784,6 +1777,8 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
 
 		disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
 		ce = create_from_disk(disk_ce, &consumed, previous_name);
+		if (i > 0)
+			check_ce_order(istate->cache[i - 1], ce);
 		set_index_entry(istate, i, ce);
 
 		src_offset += consumed;

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-19 15:12   ` Ben Peart
  2017-10-19 16:05     ` Jeff King
@ 2017-10-19 22:14     ` Stefan Beller
  2017-10-20 12:47       ` Johannes Schindelin
  1 sibling, 1 reply; 18+ messages in thread
From: Stefan Beller @ 2017-10-19 22:14 UTC (permalink / raw)
  To: Ben Peart
  Cc: Junio C Hamano, Ben Peart, git@vger.kernel.org, Jeff Hostetler,
	Thomas Gummerer, Michael Haggerty

On Thu, Oct 19, 2017 at 8:12 AM, Ben Peart <peartben@gmail.com> wrote:

> If we are guarding against "git" writing out an invalid index, we can move
> this into an assert so that only git developers pay the cost of validating
> they haven't created a new bug.  I think this is better than just adding a
> new test case as a new test case would not achieve the same coverage.  This
> is my preferred solution.
>
> If we are guarding against "some other application" writing out an invalid
> index, then everyone will have to pay the cost as we can't insert the test
> into "some other applications."  Without user reports of it happening or any
> telemetry saying it has happened I really have no idea if it every actually
> happens in the wild anymore and whether the cost on every index load is
> still justified.

How well does this play out in the security realm?, c.f.
https://public-inbox.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-19 16:05     ` Jeff King
@ 2017-10-20  1:14       ` Junio C Hamano
  0 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2017-10-20  1:14 UTC (permalink / raw)
  To: Jeff King; +Cc: Ben Peart, Ben Peart, git, jeffhost, t.gummerer, mhagger

Jeff King <peff@peff.net> writes:

> It should be easy enough to check that; the patch below implements it.
> I couldn't measure any speedup with it running "git ls-files >/dev/null"
> on linux.git (60k files). But nor could I get any by dropping the check
> entirely.

I would expect that the speedup (due to possible cache effect)
wouldn't be measurable if the overhead of the existing check itself
is unmeasuably not-expensive.  No suprise here.

> This is mostly just a curiosity to me. For the record, I have no real
> problem with dropping this kind of on-disk data-structure validation
> when it's expensive. After all, we do not check the sort on pack .idx
> files on each run (nor pack sha1 checksums, etc) because it's too
> expensive to do so.

Yes, I agree with that stance, too.  If this were expensive in the
overall picture to be measurable, I think we are OK omitting when
the index is read, especially if we make sure we are not writing out
nonsense trees out of it.  Local damage to the index is very
contained as long as we do not spread breakages to trees and
commits.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-19 22:14     ` Stefan Beller
@ 2017-10-20 12:47       ` Johannes Schindelin
  2017-10-20 18:53         ` Stefan Beller
  0 siblings, 1 reply; 18+ messages in thread
From: Johannes Schindelin @ 2017-10-20 12:47 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Ben Peart, Junio C Hamano, Ben Peart, git@vger.kernel.org,
	Jeff Hostetler, Thomas Gummerer, Michael Haggerty

Hi Stefan,

On Thu, 19 Oct 2017, Stefan Beller wrote:

> On Thu, Oct 19, 2017 at 8:12 AM, Ben Peart <peartben@gmail.com> wrote:
> 
> > If we are guarding against "git" writing out an invalid index, we can move
> > this into an assert so that only git developers pay the cost of validating
> > they haven't created a new bug.  I think this is better than just adding a
> > new test case as a new test case would not achieve the same coverage.  This
> > is my preferred solution.
> >
> > If we are guarding against "some other application" writing out an invalid
> > index, then everyone will have to pay the cost as we can't insert the test
> > into "some other applications."  Without user reports of it happening or any
> > telemetry saying it has happened I really have no idea if it every actually
> > happens in the wild anymore and whether the cost on every index load is
> > still justified.
> 
> How well does this play out in the security realm?, c.f.
> https://public-inbox.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/

That link talks about security implications from administrators accessing
Git repositories with maliciously crafted hooks/pagers.

Ben's original mail talks about integrity checks of the index file, and
how expensive they get when you talk about any decent-sized index (read:
*a lot* larger than Git or even Linux developers will see regularly).

The text you quoted talks about our talking out of our rear ends when we
talk about typical user schenarios because we simply have no telemetry or
otherwise reliable statistics.

Now, I fail to see any relationship between Jonathan's mail and either of
Ben's statements.

Care to enlighten me?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-20 12:47       ` Johannes Schindelin
@ 2017-10-20 18:53         ` Stefan Beller
  2017-10-21  1:55           ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Beller @ 2017-10-20 18:53 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ben Peart, Junio C Hamano, Ben Peart, git@vger.kernel.org,
	Jeff Hostetler, Thomas Gummerer, Michael Haggerty

> Ben's original mail talks about integrity checks of the index file, and
> how expensive they get when you talk about any decent-sized index (read:
> *a lot* larger than Git or even Linux developers will see regularly).

I am quite aware of your situation.

> The text you quoted talks about our talking out of our rear ends when we
> talk about typical user schenarios because we simply have no telemetry or
> otherwise reliable statistics.
>
> Now, I fail to see any relationship between Jonathan's mail and either of
> Ben's statements.
>
> Care to enlighten me?

There was a recent thread (which I assumed was the one I linked), that talked
about security implications as soon as we loose the rather strict "git
is to be used
in a posix world", e.g. sharing your repo over NFS/Dropbox. The
specific question
that Peff asked was how the internal formats can be exploited. (Can a malicious
index file be crafted such that it is not just a segfault, but a
'remote' code execution,
given that you deploy the maliciously crafted file via NFS. Removing checks that
we already have made me a bit suspicious that it *may* be helping an
attacker here,
though I have no hard data to show)

Sorry for the confusion,

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-20 18:53         ` Stefan Beller
@ 2017-10-21  1:55           ` Junio C Hamano
  0 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2017-10-21  1:55 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Johannes Schindelin, Ben Peart, Ben Peart, git@vger.kernel.org,
	Jeff Hostetler, Thomas Gummerer, Michael Haggerty

Stefan Beller <sbeller@google.com> writes:

> There was a recent thread (which I assumed was the one I linked), that talked
> about security implications as soon as we loose the rather strict "git
> is to be used
> in a posix world", e.g. sharing your repo over NFS/Dropbox. The
> specific question
> that Peff asked was how the internal formats can be exploited. (Can a malicious
> index file be crafted such that it is not just a segfault, but a
> 'remote' code execution,
> given that you deploy the maliciously crafted file via NFS. Removing checks that
> we already have made me a bit suspicious that it *may* be helping an
> attacker here,
> though I have no hard data to show)
>
> Sorry for the confusion,

Thanks for an explanation, as I had the same reaction as Dscho
initially.  I'd assumed that the worst would be to create a wrong
state (e.g. the same path registered twice with different contents
in the index, a malformed tree written out of it, etc.), but that's
merely an assumption not the result of an audit.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-18 14:27 [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading Ben Peart
  2017-10-19  5:22 ` Junio C Hamano
@ 2017-10-24 14:45 ` Ben Peart
  2017-10-30 12:48   ` Ben Peart
  1 sibling, 1 reply; 18+ messages in thread
From: Ben Peart @ 2017-10-24 14:45 UTC (permalink / raw)
  To: git
  Cc: chriscool, t.gummerer, l.s.r, jsorianopastor, peff, Ben Peart,
	Johannes Schindelin

There is code in post_read_index_from() to detect out of order cache
entries when reading an index file.  This order verification adds cost
to read_index_from() that grows with the size of the index.

Put this on-disk data-structure validation code behind an #ifdef DEBUG
so only debug builds have to pay the cost.

The effect can be seen using t/perf/p0002-read-cache.sh:

Test w/git repo                       HEAD            HEAD~1
----------------------------------------------------------------------------
read_cache/discard_cache 1000 times   0.42(0.01+0.09) 0.48(0.01+0.09) +14.3%
read_cache/discard_cache 1000 times   0.41(0.03+0.04) 0.49(0.00+0.10) +19.5%
read_cache/discard_cache 1000 times   0.42(0.03+0.06) 0.49(0.06+0.04) +16.7%

Test w/10K files                      HEAD            HEAD~1
---------------------------------------------------------------------------
read_cache/discard_cache 1000 times   1.58(0.04+0.00) 1.71(0.00+0.07) +8.2%
read_cache/discard_cache 1000 times   1.64(0.01+0.07) 1.76(0.01+0.09) +7.3%
read_cache/discard_cache 1000 times   1.62(0.03+0.04) 1.71(0.00+0.04) +5.6%

Test w/100K files                     HEAD             HEAD~1
-----------------------------------------------------------------------------
read_cache/discard_cache 1000 times   25.85(0.00+0.06) 27.35(0.01+0.06) +5.8%
read_cache/discard_cache 1000 times   25.82(0.01+0.07) 27.25(0.01+0.07) +5.5%
read_cache/discard_cache 1000 times   26.00(0.01+0.07) 27.36(0.06+0.03) +5.2%

Test with 1,000K files                HEAD              HEAD~1
-------------------------------------------------------------------------------
read_cache/discard_cache 1000 times   200.61(0.01+0.07) 218.23(0.03+0.06) +8.8%
read_cache/discard_cache 1000 times   201.62(0.03+0.06) 217.86(0.03+0.06) +8.1%
read_cache/discard_cache 1000 times   201.64(0.01+0.09) 217.89(0.03+0.07) +8.1%

Signed-off-by: Ben Peart <benpeart@microsoft.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---

Notes:
    Base Ref: master
    Web-Diff: https://github.com/benpeart/git/commit/95e20f17ff
    Checkout: git fetch https://github.com/benpeart/git no_ce_order-v2 && git checkout 95e20f17ff
    
    ### Interdiff (v1..v2):
    
    ### Patches

 read-cache.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/read-cache.c b/read-cache.c
index 65f4fe8375..fc90ec0fce 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1664,6 +1664,7 @@ static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk,
 	return ce;
 }
 
+#ifdef DEBUG
 static void check_ce_order(struct index_state *istate)
 {
 	unsigned int i;
@@ -1685,6 +1686,7 @@ static void check_ce_order(struct index_state *istate)
 		}
 	}
 }
+#endif
 
 static void tweak_untracked_cache(struct index_state *istate)
 {
@@ -1720,7 +1722,9 @@ static void tweak_split_index(struct index_state *istate)
 
 static void post_read_index_from(struct index_state *istate)
 {
+#ifdef DEBUG
 	check_ce_order(istate);
+#endif
 	tweak_untracked_cache(istate);
 	tweak_split_index(istate);
 }

base-commit: c52ca88430e6ec7c834af38720295070d8a1e330
-- 
2.14.1.windows.1.1034.g0776750557


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-24 14:45 ` [PATCH v2] " Ben Peart
@ 2017-10-30 12:48   ` Ben Peart
  2017-10-30 18:03     ` Jeff King
  2017-10-31  1:49     ` Junio C Hamano
  0 siblings, 2 replies; 18+ messages in thread
From: Ben Peart @ 2017-10-30 12:48 UTC (permalink / raw)
  To: Ben Peart, git, gitster
  Cc: chriscool, t.gummerer, l.s.r, jsorianopastor, peff,
	Johannes Schindelin

Any updates or thoughts on this one?  While the patch has become quite 
trivial, it does results in a savings of 5%-15% in index load time.

I thought the compromise of having this test only run when DEBUG is 
defined should limit it to developer builds (hopefully everyone 
developing on git is running DEBUG builds :)).  Since the test is trying 
to detect buggy code when writing the index, I thought that was the 
right time to test/catch any issues.

I am working on other, more substantial savings for index load times 
(stay tuned) but this seemed like a small simple way to help speed 
things up.

On 10/24/2017 10:45 AM, Ben Peart wrote:
> There is code in post_read_index_from() to detect out of order cache
> entries when reading an index file.  This order verification adds cost
> to read_index_from() that grows with the size of the index.
> 
> Put this on-disk data-structure validation code behind an #ifdef DEBUG
> so only debug builds have to pay the cost.
> 
> The effect can be seen using t/perf/p0002-read-cache.sh:
> 
> Test w/git repo                       HEAD            HEAD~1
> ----------------------------------------------------------------------------
> read_cache/discard_cache 1000 times   0.42(0.01+0.09) 0.48(0.01+0.09) +14.3%
> read_cache/discard_cache 1000 times   0.41(0.03+0.04) 0.49(0.00+0.10) +19.5%
> read_cache/discard_cache 1000 times   0.42(0.03+0.06) 0.49(0.06+0.04) +16.7%
> 
> Test w/10K files                      HEAD            HEAD~1
> ---------------------------------------------------------------------------
> read_cache/discard_cache 1000 times   1.58(0.04+0.00) 1.71(0.00+0.07) +8.2%
> read_cache/discard_cache 1000 times   1.64(0.01+0.07) 1.76(0.01+0.09) +7.3%
> read_cache/discard_cache 1000 times   1.62(0.03+0.04) 1.71(0.00+0.04) +5.6%
> 
> Test w/100K files                     HEAD             HEAD~1
> -----------------------------------------------------------------------------
> read_cache/discard_cache 1000 times   25.85(0.00+0.06) 27.35(0.01+0.06) +5.8%
> read_cache/discard_cache 1000 times   25.82(0.01+0.07) 27.25(0.01+0.07) +5.5%
> read_cache/discard_cache 1000 times   26.00(0.01+0.07) 27.36(0.06+0.03) +5.2%
> 
> Test with 1,000K files                HEAD              HEAD~1
> -------------------------------------------------------------------------------
> read_cache/discard_cache 1000 times   200.61(0.01+0.07) 218.23(0.03+0.06) +8.8%
> read_cache/discard_cache 1000 times   201.62(0.03+0.06) 217.86(0.03+0.06) +8.1%
> read_cache/discard_cache 1000 times   201.64(0.01+0.09) 217.89(0.03+0.07) +8.1%
> 
> Signed-off-by: Ben Peart <benpeart@microsoft.com>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
> 
> Notes:
>      Base Ref: master
>      Web-Diff: https://github.com/benpeart/git/commit/95e20f17ff
>      Checkout: git fetch https://github.com/benpeart/git no_ce_order-v2 && git checkout 95e20f17ff
>      
>      ### Interdiff (v1..v2):
>      
>      ### Patches
> 
>   read-cache.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/read-cache.c b/read-cache.c
> index 65f4fe8375..fc90ec0fce 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -1664,6 +1664,7 @@ static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk,
>   	return ce;
>   }
>   
> +#ifdef DEBUG
>   static void check_ce_order(struct index_state *istate)
>   {
>   	unsigned int i;
> @@ -1685,6 +1686,7 @@ static void check_ce_order(struct index_state *istate)
>   		}
>   	}
>   }
> +#endif
>   
>   static void tweak_untracked_cache(struct index_state *istate)
>   {
> @@ -1720,7 +1722,9 @@ static void tweak_split_index(struct index_state *istate)
>   
>   static void post_read_index_from(struct index_state *istate)
>   {
> +#ifdef DEBUG
>   	check_ce_order(istate);
> +#endif
>   	tweak_untracked_cache(istate);
>   	tweak_split_index(istate);
>   }
> 
> base-commit: c52ca88430e6ec7c834af38720295070d8a1e330
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-30 12:48   ` Ben Peart
@ 2017-10-30 18:03     ` Jeff King
  2017-10-31  0:33       ` Alex Vandiver
  2017-10-31  1:49     ` Junio C Hamano
  1 sibling, 1 reply; 18+ messages in thread
From: Jeff King @ 2017-10-30 18:03 UTC (permalink / raw)
  To: Ben Peart
  Cc: Ben Peart, git, gitster, chriscool, t.gummerer, l.s.r,
	jsorianopastor, Johannes Schindelin

On Mon, Oct 30, 2017 at 08:48:48AM -0400, Ben Peart wrote:

> Any updates or thoughts on this one?  While the patch has become quite
> trivial, it does results in a savings of 5%-15% in index load time.

I like the general direction of avoiding the check during each read.
But...

> I thought the compromise of having this test only run when DEBUG is defined
> should limit it to developer builds (hopefully everyone developing on git is
> running DEBUG builds :)).  Since the test is trying to detect buggy code
> when writing the index, I thought that was the right time to test/catch any
> issues.

I certainly don't build with DEBUG. It traditionally hasn't done
anything useful. But I'm also not convinced that this is a likely way to
find bugs in the first place, so I'm OK missing out on it.

But what we probably _do_ need is to make sure that "git fsck" would
detect such an out-of-order index. So that developers and users alike
can diagnose suspected problems.

-Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-30 18:03     ` Jeff King
@ 2017-10-31  0:33       ` Alex Vandiver
  2017-10-31 13:01         ` Ben Peart
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Vandiver @ 2017-10-31  0:33 UTC (permalink / raw)
  To: Jeff King
  Cc: Ben Peart, Ben Peart, git, gitster, chriscool, t.gummerer, l.s.r,
	jsorianopastor, Johannes Schindelin

On Mon, 30 Oct 2017, Jeff King wrote:
> On Mon, Oct 30, 2017 at 08:48:48AM -0400, Ben Peart wrote:
> 
> > Any updates or thoughts on this one?  While the patch has become quite
> > trivial, it does results in a savings of 5%-15% in index load time.
> 
> I like the general direction of avoiding the check during each read.

Same -- the savings here are well worth it, IMHO.

> > I thought the compromise of having this test only run when DEBUG is defined
> > should limit it to developer builds (hopefully everyone developing on git is
> > running DEBUG builds :)).  Since the test is trying to detect buggy code
> > when writing the index, I thought that was the right time to test/catch any
> > issues.
> 
> I certainly don't build with DEBUG. It traditionally hasn't done
> anything useful. But I'm also not convinced that this is a likely way to
> find bugs in the first place, so I'm OK missing out on it.

I also don't compile with DEBUG -- there's no documentation that
mentions it, and I don't think I'd considered going poking for what
was `#ifdef`d.  I think it'd be reasonable to provide some
configure-time setting that adds `CFLAGS="-ggdb3 -O0 -DDEBUG"` or
similar, but that seems possibly moot for this particular change (see
below).

> But what we probably _do_ need is to make sure that "git fsck" would
> detect such an out-of-order index. So that developers and users alike
> can diagnose suspected problems.

Agree -- that seems like a better home for this logic.

> > I am working on other, more substantial savings for index load times
> > (stay tuned) but this seemed like a small simple way to help speed
> > things up.

I'm interested to hear more about what direction you're looking in here.
 - Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-30 12:48   ` Ben Peart
  2017-10-30 18:03     ` Jeff King
@ 2017-10-31  1:49     ` Junio C Hamano
  2017-10-31 12:51       ` Ben Peart
  1 sibling, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2017-10-31  1:49 UTC (permalink / raw)
  To: Ben Peart
  Cc: Ben Peart, git, chriscool, t.gummerer, l.s.r, jsorianopastor,
	peff, Johannes Schindelin

Ben Peart <peartben@gmail.com> writes:

> Any updates or thoughts on this one?  While the patch has become quite
> trivial, it does results in a savings of 5%-15% in index load time.
>
> I thought the compromise of having this test only run when DEBUG is
> defined should limit it to developer builds (hopefully everyone
> developing on git is running DEBUG builds :)).  Since the test is
> trying to detect buggy code when writing the index, I thought that was
> the right time to test/catch any issues.

This check is more about catching a possible breakage (and a
malicious repository) early before we go too far into the operation.
I do not think this check is about debugging the implementation of
Git.  How would it be useful to turn it on in DEBUG build?

While I do think pursuing any runtime improvements better than a
couple of percents is worth it, I do not think the approach taken by
this iteration makes much sense; the previous one that at least
allowed fsck to catch breakage may have been already too leaky to
catch real issues (i.e. when you are asked to visit and look at an
unknown repository, you wouldn't start your session with "git fsck"
to protect yourself), and this round makes it much worse.

Besides, I see no -DDEBUG from "grep -e '-D[A-Z]*DEBUG' Makefile".
If this check were about allowing us easier time debugging the
binary (which I do not think it is), this probably should be
'#ifndef NDEBUG' instead.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-31  1:49     ` Junio C Hamano
@ 2017-10-31 12:51       ` Ben Peart
  0 siblings, 0 replies; 18+ messages in thread
From: Ben Peart @ 2017-10-31 12:51 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ben Peart, git, chriscool, t.gummerer, l.s.r, jsorianopastor,
	peff, Johannes Schindelin

On 10/30/2017 9:49 PM, Junio C Hamano wrote:
> Ben Peart <peartben@gmail.com> writes:
> 
>> Any updates or thoughts on this one?  While the patch has become quite
>> trivial, it does results in a savings of 5%-15% in index load time.
>>
>> I thought the compromise of having this test only run when DEBUG is
>> defined should limit it to developer builds (hopefully everyone
>> developing on git is running DEBUG builds :)).  Since the test is
>> trying to detect buggy code when writing the index, I thought that was
>> the right time to test/catch any issues.
> 
> This check is more about catching a possible breakage (and a
> malicious repository) early before we go too far into the operation.
> I do not think this check is about debugging the implementation of
> Git.  How would it be useful to turn it on in DEBUG build?
> 
> While I do think pursuing any runtime improvements better than a
> couple of percents is worth it, I do not think the approach taken by
> this iteration makes much sense; the previous one that at least
> allowed fsck to catch breakage may have been already too leaky to
> catch real issues (i.e. when you are asked to visit and look at an
> unknown repository, you wouldn't start your session with "git fsck"
> to protect yourself), and this round makes it much worse.
> 
> Besides, I see no -DDEBUG from "grep -e '-D[A-Z]*DEBUG' Makefile".
> If this check were about allowing us easier time debugging the
> binary (which I do not think it is), this probably should be
> '#ifndef NDEBUG' instead.
> 

I've tried 3 different ways to remove the overhead of this call from 
regular git operations.

The first was version 1 of the patch which had fsck catch breakage but 
removed it from other commands that read the index.  Since that version 
was not accepted, I took the feedback "I think I like the direction of 
getting rid of the order in post_read_index_from(), not only during the 
normal but also in fsck" to come up with a version 2.

I was hesitant to remove the code completely as I did believe it had 
some value in detecting invalid indexes so went looking for a macro I 
could use to have it 1) not happen during regular user commands but 2) 
still happen for developers.

The NDEBUG macro is guaranteed by the C89 standard 
(http://port70.net/~nsz/c/c89/c89-draft.html#4.1.2 ) to guard the code 
that is only necessary when assertions are in effect so seemed like a 
good choice.  When I used it however, I discovered that the git Makefile 
does not define NDEBUG so using this macro did not have any effect thus 
making the patch useless as the code continues to run in all cases.

On a side note, there are 434 instances of assert which up until this 
experience I believed were being removed in released builds.  As far as 
I can tell, that is not the case.  If removing them is the 
desired/expected behavior, we need to fix our Makefile as it only 
currently defines NDEBUG if USE_NED_ALLOCATOR is defined.

I then searched the code and found 47 instances where the macro DEBUG 
was used.  I (incorrectly) assumed that meant it must be used by other 
git developers.  I personally have a build alias that adds "-j12 
CFLAGS=-DDEBUG" to my make command but apparently I'm in the minority. :)

This assumption led me to the patch version 2 (guarding the code with 
#ifdef DEBUG) as it does meet the request to remove it during normal but 
also fsck and does so with regular/release builds as they are currently 
defined.

It seems that the current round of feedback is more in favor of leaving 
the test in fsck but removing it for other commands.  If that is the 
desired behavior, please use version 1 of the patch.

I'm also happy to flip this to "#ifndef NDEBUG" but that only makes 
sense if the released builds actually set NDEBUG which (I believe) will 
require a patch to Makefile.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-31  0:33       ` Alex Vandiver
@ 2017-10-31 13:01         ` Ben Peart
  2017-10-31 17:10           ` Jeff King
  0 siblings, 1 reply; 18+ messages in thread
From: Ben Peart @ 2017-10-31 13:01 UTC (permalink / raw)
  To: Alex Vandiver, Jeff King
  Cc: Ben Peart, git, gitster, chriscool, t.gummerer, l.s.r,
	jsorianopastor, Johannes Schindelin



On 10/30/2017 8:33 PM, Alex Vandiver wrote:
> On Mon, 30 Oct 2017, Jeff King wrote:
>> On Mon, Oct 30, 2017 at 08:48:48AM -0400, Ben Peart wrote:
>>
>>> Any updates or thoughts on this one?  While the patch has become quite
>>> trivial, it does results in a savings of 5%-15% in index load time.
>>
>> I like the general direction of avoiding the check during each read.
> 
> Same -- the savings here are well worth it, IMHO.
> 
>>> I thought the compromise of having this test only run when DEBUG is defined
>>> should limit it to developer builds (hopefully everyone developing on git is
>>> running DEBUG builds :)).  Since the test is trying to detect buggy code
>>> when writing the index, I thought that was the right time to test/catch any
>>> issues.
>>
>> I certainly don't build with DEBUG. It traditionally hasn't done
>> anything useful. But I'm also not convinced that this is a likely way to
>> find bugs in the first place, so I'm OK missing out on it.
> 
> I also don't compile with DEBUG -- there's no documentation that
> mentions it, and I don't think I'd considered going poking for what
> was `#ifdef`d.  I think it'd be reasonable to provide some
> configure-time setting that adds `CFLAGS="-ggdb3 -O0 -DDEBUG"` or
> similar, but that seems possibly moot for this particular change (see
> below).
> 
>> But what we probably _do_ need is to make sure that "git fsck" would
>> detect such an out-of-order index. So that developers and users alike
>> can diagnose suspected problems.
> 
> Agree -- that seems like a better home for this logic.

That is how version 1 of this patch worked but the feedback to that 
patch was to remove it "not only during the normal operation but also in 
fsck."

> 
>>> I am working on other, more substantial savings for index load times
>>> (stay tuned) but this seemed like a small simple way to help speed
>>> things up.
> 
> I'm interested to hear more about what direction you're looking in here.
>   - Alex
> 

I'm working on parallelizing the index load process across multiple 
threads/cpu cores.  Specifically the loop that calls create_from_disk() 
and set_index_entry().  The serial nature of the index formats makes 
that difficult but I believe I've come up with a way to make it work 
across all existing index formats.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-31 13:01         ` Ben Peart
@ 2017-10-31 17:10           ` Jeff King
  2017-11-01  6:09             ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Jeff King @ 2017-10-31 17:10 UTC (permalink / raw)
  To: Ben Peart
  Cc: Alex Vandiver, Ben Peart, git, gitster, chriscool, t.gummerer,
	l.s.r, jsorianopastor, Johannes Schindelin

On Tue, Oct 31, 2017 at 09:01:45AM -0400, Ben Peart wrote:

> > > But what we probably _do_ need is to make sure that "git fsck" would
> > > detect such an out-of-order index. So that developers and users alike
> > > can diagnose suspected problems.
> > 
> > Agree -- that seems like a better home for this logic.
> 
> That is how version 1 of this patch worked but the feedback to that patch
> was to remove it "not only during the normal operation but also in fsck."

Sorry for the mixed messages (I think they are mixed between different
people, and not mixed _just_ from me ;) ).

For what it's worth, I like your v1, but can live with either approach.

-Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading
  2017-10-31 17:10           ` Jeff King
@ 2017-11-01  6:09             ` Junio C Hamano
  0 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2017-11-01  6:09 UTC (permalink / raw)
  To: Jeff King
  Cc: Ben Peart, Alex Vandiver, Ben Peart, git, chriscool, t.gummerer,
	l.s.r, jsorianopastor, Johannes Schindelin

Jeff King <peff@peff.net> writes:

> On Tue, Oct 31, 2017 at 09:01:45AM -0400, Ben Peart wrote:
>
>> > > But what we probably _do_ need is to make sure that "git fsck" would
>> > > detect such an out-of-order index. So that developers and users alike
>> > > can diagnose suspected problems.
>> > 
>> > Agree -- that seems like a better home for this logic.
>> 
>> That is how version 1 of this patch worked but the feedback to that patch
>> was to remove it "not only during the normal operation but also in fsck."
>
> Sorry for the mixed messages (I think they are mixed between different
> people, and not mixed _just_ from me ;) ).
>
> For what it's worth, I like your v1, but can live with either approach.

I agree that v1 is the less bad one between the two.

To be honest, if the original code were done in that way (i.e. the
state with v1 applied), I probably would have had a very hard time
to justify accepting a patch to "make it safer by always checking at
runtime" (i.e. a reverse of v1 patch).

So, let's go with v1.  Thanks, all.


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-11-01  6:09 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-18 14:27 [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading Ben Peart
2017-10-19  5:22 ` Junio C Hamano
2017-10-19 15:12   ` Ben Peart
2017-10-19 16:05     ` Jeff King
2017-10-20  1:14       ` Junio C Hamano
2017-10-19 22:14     ` Stefan Beller
2017-10-20 12:47       ` Johannes Schindelin
2017-10-20 18:53         ` Stefan Beller
2017-10-21  1:55           ` Junio C Hamano
2017-10-24 14:45 ` [PATCH v2] " Ben Peart
2017-10-30 12:48   ` Ben Peart
2017-10-30 18:03     ` Jeff King
2017-10-31  0:33       ` Alex Vandiver
2017-10-31 13:01         ` Ben Peart
2017-10-31 17:10           ` Jeff King
2017-11-01  6:09             ` Junio C Hamano
2017-10-31  1:49     ` Junio C Hamano
2017-10-31 12:51       ` Ben Peart

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).