* [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading @ 2017-10-18 14:27 Ben Peart 2017-10-19 5:22 ` Junio C Hamano 2017-10-24 14:45 ` [PATCH v2] " Ben Peart 0 siblings, 2 replies; 18+ messages in thread From: Ben Peart @ 2017-10-18 14:27 UTC (permalink / raw) To: git; +Cc: jeffhost, gitster, t.gummerer, mhagger, Ben Peart There is code in post_read_index_from() to catch out of order entries when reading an index file. This order verification is ~13% of the cost of every call to read_index_from(). Update check_ce_order() so that it skips this verification unless the "verify_ce_order" global variable is set. Teach fsck to force this verification. The effect can be seen using t/perf/p0002-read-cache.sh: Test HEAD HEAD~1 -------------------------------------------------------------------------------------- 0002.1: read_cache/discard_cache 1000 times 0.41(0.04+0.04) 0.50(0.00+0.10) +22.0% Signed-off-by: Ben Peart <benpeart@microsoft.com> --- Notes: Base Ref: Web-Diff: https://github.com/benpeart/git/commit/54fa9cf954 Checkout: git fetch https://github.com/benpeart/git verify_ce_order-v1 && git checkout 54fa9cf954 builtin/fsck.c | 1 + cache.h | 1 + read-cache.c | 6 ++++++ 3 files changed, 8 insertions(+) diff --git a/builtin/fsck.c b/builtin/fsck.c index d18244ab54..81cc5ad78d 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -763,6 +763,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) if (keep_cache_objects) { verify_index_checksum = 1; + verify_ce_order = 1; read_cache(); for (i = 0; i < active_nr; i++) { unsigned int mode; diff --git a/cache.h b/cache.h index 5e2c9512ff..4a4f879061 100644 --- a/cache.h +++ b/cache.h @@ -721,6 +721,7 @@ extern int hold_locked_index(struct lock_file *, int); extern void set_alternate_index_output(const char *); extern int verify_index_checksum; +extern int verify_ce_order; /* Environment bits from configuration mechanism */ extern int trust_executable_bit; diff --git a/read-cache.c b/read-cache.c index b64610c400..32743da157 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1513,6 +1513,9 @@ struct ondisk_cache_entry_extended { /* Allow fsck to force verification of the index checksum. */ int verify_index_checksum; +/* Allow fsck to force verification of the cache entry order. */ +int verify_ce_order; + static int verify_hdr(struct cache_header *hdr, unsigned long size) { git_SHA_CTX c; @@ -1698,6 +1701,9 @@ static void check_ce_order(struct index_state *istate) { unsigned int i; + if (!verify_ce_order) + return; + for (i = 1; i < istate->cache_nr; i++) { struct cache_entry *ce = istate->cache[i - 1]; struct cache_entry *next_ce = istate->cache[i]; base-commit: 46309c695b080648f8c0ea5e2d1fd7387ee7f044 -- 2.14.1.windows.1.1034.g0776750557 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-18 14:27 [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading Ben Peart @ 2017-10-19 5:22 ` Junio C Hamano 2017-10-19 15:12 ` Ben Peart 2017-10-24 14:45 ` [PATCH v2] " Ben Peart 1 sibling, 1 reply; 18+ messages in thread From: Junio C Hamano @ 2017-10-19 5:22 UTC (permalink / raw) To: Ben Peart; +Cc: git, jeffhost, t.gummerer, mhagger Ben Peart <benpeart@microsoft.com> writes: > There is code in post_read_index_from() to catch out of order entries > when reading an index file. This order verification is ~13% of the cost > of every call to read_index_from(). I find this a bit over-generalized claim---wouldn't the overhead depend on various conditions, e.g. the size of the index and if split-index is in effect? In general, I get very skeptical towards any change that makes the integrity of the data less certain based only on microbenchmarks, and prefer to see a solution that can absorb the overhead in some other way. When we are using split-index, the current code is not validating the two input files from the disk. Because merge_base_index() depends on the base to be properly sorted before the overriding entries are added into it, if the input from disk is in a wrong order, we are screwed already, and the order check in post processing is pointless. If we want to do this order validation, I think we should be doing it in do_read_index() where it does create_from_disk() and the set_index_entry(), instead of having it as a separate phase that scans a potentially large index array one more time. And doing so will not penalize the case where we do not use split-index, either. So, I think I like the direction of getting rid of the order validation in post_read_index_from(), not only during the normal operation but also in fsck. I think it makes more sense to do so incrementally inside do_read_index() all the time and see how fast we can make it do so. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-19 5:22 ` Junio C Hamano @ 2017-10-19 15:12 ` Ben Peart 2017-10-19 16:05 ` Jeff King 2017-10-19 22:14 ` Stefan Beller 0 siblings, 2 replies; 18+ messages in thread From: Ben Peart @ 2017-10-19 15:12 UTC (permalink / raw) To: Junio C Hamano, Ben Peart; +Cc: git, jeffhost, t.gummerer, mhagger On 10/19/2017 1:22 AM, Junio C Hamano wrote: > Ben Peart <benpeart@microsoft.com> writes: > >> There is code in post_read_index_from() to catch out of order entries >> when reading an index file. This order verification is ~13% of the cost >> of every call to read_index_from(). > > I find this a bit over-generalized claim---wouldn't the overhead > depend on various conditions, e.g. the size of the index and if > split-index is in effect? > I tested it against 10K, 100K and 1,000K files and the absolute time varies with different sized indexes, but the percentage is relatively consistent (after doing multiple runs and averaging the results to reduce noise). I didn't measure it with split index so can't say how that would impact performance. > In general, I get very skeptical towards any change that makes the > integrity of the data less certain based only on microbenchmarks, > and prefer to see a solution that can absorb the overhead in some > other way. > > When we are using split-index, the current code is not validating > the two input files from the disk. Because merge_base_index() > depends on the base to be properly sorted before the overriding > entries are added into it, if the input from disk is in a wrong > order, we are screwed already, and the order check in post > processing is pointless. The original commit message doesn't say *why* this test was added so I have to make some educated guesses. Given no attempt is made to recover or continue and instead we just die when an out-of-order entry is detected, I'm assuming check_ce_order() is protecting against buggy code that incorrectly wrote out an invalid index (the sha check would have detected file corruption or torn writes). If we are guarding against "git" writing out an invalid index, we can move this into an assert so that only git developers pay the cost of validating they haven't created a new bug. I think this is better than just adding a new test case as a new test case would not achieve the same coverage. This is my preferred solution. If we are guarding against "some other application" writing out an invalid index, then everyone will have to pay the cost as we can't insert the test into "some other applications." Without user reports of it happening or any telemetry saying it has happened I really have no idea if it every actually happens in the wild anymore and whether the cost on every index load is still justified. > If we want to do this order validation, I > think we should be doing it in do_read_index() where it does > create_from_disk() and the set_index_entry(), instead of having it > as a separate phase that scans a potentially large index array one > more time. And doing so will not penalize the case where we do not > use split-index, either. > > So, I think I like the direction of getting rid of the order > validation in post_read_index_from(), not only during the normal > operation but also in fsck. I think it makes more sense to do so > incrementally inside do_read_index() all the time and see how fast > we can make it do so. > Unfortunately, all the cost is in the strcmp() - the other tests are negligible so moving it to be done incrementally inside do_read_index() won't reduce the cost, it just moves it and makes it harder to identify. The only idea I could come up with for reducing the cost to our end users is to keep it separate and split the test across multiple threads with some minimum index size threshold as we have done elsewhere. This adds code and complexity that we'll have to maintain forever so is less preferred than making it an assert. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-19 15:12 ` Ben Peart @ 2017-10-19 16:05 ` Jeff King 2017-10-20 1:14 ` Junio C Hamano 2017-10-19 22:14 ` Stefan Beller 1 sibling, 1 reply; 18+ messages in thread From: Jeff King @ 2017-10-19 16:05 UTC (permalink / raw) To: Ben Peart; +Cc: Junio C Hamano, Ben Peart, git, jeffhost, t.gummerer, mhagger On Thu, Oct 19, 2017 at 11:12:03AM -0400, Ben Peart wrote: > > So, I think I like the direction of getting rid of the order > > validation in post_read_index_from(), not only during the normal > > operation but also in fsck. I think it makes more sense to do so > > incrementally inside do_read_index() all the time and see how fast > > we can make it do so. > > Unfortunately, all the cost is in the strcmp() - the other tests are > negligible so moving it to be done incrementally inside do_read_index() > won't reduce the cost, it just moves it and makes it harder to identify. It's plausible that doing the strcmp() closer to where we are otherwise manipulating the data may show an improvement due to memory cache effects. It should be easy enough to check that; the patch below implements it. I couldn't measure any speedup with it running "git ls-files >/dev/null" on linux.git (60k files). But nor could I get any by dropping the check entirely. This is mostly just a curiosity to me. For the record, I have no real problem with dropping this kind of on-disk data-structure validation when it's expensive. After all, we do not check the sort on pack .idx files on each run (nor pack sha1 checksums, etc) because it's too expensive to do so. --- diff --git a/read-cache.c b/read-cache.c index 65f4fe8375..ac8c8d2e93 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1664,25 +1664,19 @@ static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk, return ce; } -static void check_ce_order(struct index_state *istate) -{ - unsigned int i; - - for (i = 1; i < istate->cache_nr; i++) { - struct cache_entry *ce = istate->cache[i - 1]; - struct cache_entry *next_ce = istate->cache[i]; - int name_compare = strcmp(ce->name, next_ce->name); - - if (0 < name_compare) - die("unordered stage entries in index"); - if (!name_compare) { - if (!ce_stage(ce)) - die("multiple stage entries for merged file '%s'", - ce->name); - if (ce_stage(ce) > ce_stage(next_ce)) - die("unordered stage entries for '%s'", - ce->name); - } +static void check_ce_order(struct cache_entry *ce, struct cache_entry *next_ce) +{ + int name_compare = strcmp(ce->name, next_ce->name); + + if (0 < name_compare) + die("unordered stage entries in index"); + if (!name_compare) { + if (!ce_stage(ce)) + die("multiple stage entries for merged file '%s'", + ce->name); + if (ce_stage(ce) > ce_stage(next_ce)) + die("unordered stage entries for '%s'", + ce->name); } } @@ -1720,7 +1714,6 @@ static void tweak_split_index(struct index_state *istate) static void post_read_index_from(struct index_state *istate) { - check_ce_order(istate); tweak_untracked_cache(istate); tweak_split_index(istate); } @@ -1784,6 +1777,8 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist) disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset); ce = create_from_disk(disk_ce, &consumed, previous_name); + if (i > 0) + check_ce_order(istate->cache[i - 1], ce); set_index_entry(istate, i, ce); src_offset += consumed; ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-19 16:05 ` Jeff King @ 2017-10-20 1:14 ` Junio C Hamano 0 siblings, 0 replies; 18+ messages in thread From: Junio C Hamano @ 2017-10-20 1:14 UTC (permalink / raw) To: Jeff King; +Cc: Ben Peart, Ben Peart, git, jeffhost, t.gummerer, mhagger Jeff King <peff@peff.net> writes: > It should be easy enough to check that; the patch below implements it. > I couldn't measure any speedup with it running "git ls-files >/dev/null" > on linux.git (60k files). But nor could I get any by dropping the check > entirely. I would expect that the speedup (due to possible cache effect) wouldn't be measurable if the overhead of the existing check itself is unmeasuably not-expensive. No suprise here. > This is mostly just a curiosity to me. For the record, I have no real > problem with dropping this kind of on-disk data-structure validation > when it's expensive. After all, we do not check the sort on pack .idx > files on each run (nor pack sha1 checksums, etc) because it's too > expensive to do so. Yes, I agree with that stance, too. If this were expensive in the overall picture to be measurable, I think we are OK omitting when the index is read, especially if we make sure we are not writing out nonsense trees out of it. Local damage to the index is very contained as long as we do not spread breakages to trees and commits. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-19 15:12 ` Ben Peart 2017-10-19 16:05 ` Jeff King @ 2017-10-19 22:14 ` Stefan Beller 2017-10-20 12:47 ` Johannes Schindelin 1 sibling, 1 reply; 18+ messages in thread From: Stefan Beller @ 2017-10-19 22:14 UTC (permalink / raw) To: Ben Peart Cc: Junio C Hamano, Ben Peart, git@vger.kernel.org, Jeff Hostetler, Thomas Gummerer, Michael Haggerty On Thu, Oct 19, 2017 at 8:12 AM, Ben Peart <peartben@gmail.com> wrote: > If we are guarding against "git" writing out an invalid index, we can move > this into an assert so that only git developers pay the cost of validating > they haven't created a new bug. I think this is better than just adding a > new test case as a new test case would not achieve the same coverage. This > is my preferred solution. > > If we are guarding against "some other application" writing out an invalid > index, then everyone will have to pay the cost as we can't insert the test > into "some other applications." Without user reports of it happening or any > telemetry saying it has happened I really have no idea if it every actually > happens in the wild anymore and whether the cost on every index load is > still justified. How well does this play out in the security realm?, c.f. https://public-inbox.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/ ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-19 22:14 ` Stefan Beller @ 2017-10-20 12:47 ` Johannes Schindelin 2017-10-20 18:53 ` Stefan Beller 0 siblings, 1 reply; 18+ messages in thread From: Johannes Schindelin @ 2017-10-20 12:47 UTC (permalink / raw) To: Stefan Beller Cc: Ben Peart, Junio C Hamano, Ben Peart, git@vger.kernel.org, Jeff Hostetler, Thomas Gummerer, Michael Haggerty Hi Stefan, On Thu, 19 Oct 2017, Stefan Beller wrote: > On Thu, Oct 19, 2017 at 8:12 AM, Ben Peart <peartben@gmail.com> wrote: > > > If we are guarding against "git" writing out an invalid index, we can move > > this into an assert so that only git developers pay the cost of validating > > they haven't created a new bug. I think this is better than just adding a > > new test case as a new test case would not achieve the same coverage. This > > is my preferred solution. > > > > If we are guarding against "some other application" writing out an invalid > > index, then everyone will have to pay the cost as we can't insert the test > > into "some other applications." Without user reports of it happening or any > > telemetry saying it has happened I really have no idea if it every actually > > happens in the wild anymore and whether the cost on every index load is > > still justified. > > How well does this play out in the security realm?, c.f. > https://public-inbox.org/git/20171002234517.GV19555@aiede.mtv.corp.google.com/ That link talks about security implications from administrators accessing Git repositories with maliciously crafted hooks/pagers. Ben's original mail talks about integrity checks of the index file, and how expensive they get when you talk about any decent-sized index (read: *a lot* larger than Git or even Linux developers will see regularly). The text you quoted talks about our talking out of our rear ends when we talk about typical user schenarios because we simply have no telemetry or otherwise reliable statistics. Now, I fail to see any relationship between Jonathan's mail and either of Ben's statements. Care to enlighten me? Ciao, Dscho ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-20 12:47 ` Johannes Schindelin @ 2017-10-20 18:53 ` Stefan Beller 2017-10-21 1:55 ` Junio C Hamano 0 siblings, 1 reply; 18+ messages in thread From: Stefan Beller @ 2017-10-20 18:53 UTC (permalink / raw) To: Johannes Schindelin Cc: Ben Peart, Junio C Hamano, Ben Peart, git@vger.kernel.org, Jeff Hostetler, Thomas Gummerer, Michael Haggerty > Ben's original mail talks about integrity checks of the index file, and > how expensive they get when you talk about any decent-sized index (read: > *a lot* larger than Git or even Linux developers will see regularly). I am quite aware of your situation. > The text you quoted talks about our talking out of our rear ends when we > talk about typical user schenarios because we simply have no telemetry or > otherwise reliable statistics. > > Now, I fail to see any relationship between Jonathan's mail and either of > Ben's statements. > > Care to enlighten me? There was a recent thread (which I assumed was the one I linked), that talked about security implications as soon as we loose the rather strict "git is to be used in a posix world", e.g. sharing your repo over NFS/Dropbox. The specific question that Peff asked was how the internal formats can be exploited. (Can a malicious index file be crafted such that it is not just a segfault, but a 'remote' code execution, given that you deploy the maliciously crafted file via NFS. Removing checks that we already have made me a bit suspicious that it *may* be helping an attacker here, though I have no hard data to show) Sorry for the confusion, Thanks, Stefan ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-20 18:53 ` Stefan Beller @ 2017-10-21 1:55 ` Junio C Hamano 0 siblings, 0 replies; 18+ messages in thread From: Junio C Hamano @ 2017-10-21 1:55 UTC (permalink / raw) To: Stefan Beller Cc: Johannes Schindelin, Ben Peart, Ben Peart, git@vger.kernel.org, Jeff Hostetler, Thomas Gummerer, Michael Haggerty Stefan Beller <sbeller@google.com> writes: > There was a recent thread (which I assumed was the one I linked), that talked > about security implications as soon as we loose the rather strict "git > is to be used > in a posix world", e.g. sharing your repo over NFS/Dropbox. The > specific question > that Peff asked was how the internal formats can be exploited. (Can a malicious > index file be crafted such that it is not just a segfault, but a > 'remote' code execution, > given that you deploy the maliciously crafted file via NFS. Removing checks that > we already have made me a bit suspicious that it *may* be helping an > attacker here, > though I have no hard data to show) > > Sorry for the confusion, Thanks for an explanation, as I had the same reaction as Dscho initially. I'd assumed that the worst would be to create a wrong state (e.g. the same path registered twice with different contents in the index, a malformed tree written out of it, etc.), but that's merely an assumption not the result of an audit. ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-18 14:27 [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading Ben Peart 2017-10-19 5:22 ` Junio C Hamano @ 2017-10-24 14:45 ` Ben Peart 2017-10-30 12:48 ` Ben Peart 1 sibling, 1 reply; 18+ messages in thread From: Ben Peart @ 2017-10-24 14:45 UTC (permalink / raw) To: git Cc: chriscool, t.gummerer, l.s.r, jsorianopastor, peff, Ben Peart, Johannes Schindelin There is code in post_read_index_from() to detect out of order cache entries when reading an index file. This order verification adds cost to read_index_from() that grows with the size of the index. Put this on-disk data-structure validation code behind an #ifdef DEBUG so only debug builds have to pay the cost. The effect can be seen using t/perf/p0002-read-cache.sh: Test w/git repo HEAD HEAD~1 ---------------------------------------------------------------------------- read_cache/discard_cache 1000 times 0.42(0.01+0.09) 0.48(0.01+0.09) +14.3% read_cache/discard_cache 1000 times 0.41(0.03+0.04) 0.49(0.00+0.10) +19.5% read_cache/discard_cache 1000 times 0.42(0.03+0.06) 0.49(0.06+0.04) +16.7% Test w/10K files HEAD HEAD~1 --------------------------------------------------------------------------- read_cache/discard_cache 1000 times 1.58(0.04+0.00) 1.71(0.00+0.07) +8.2% read_cache/discard_cache 1000 times 1.64(0.01+0.07) 1.76(0.01+0.09) +7.3% read_cache/discard_cache 1000 times 1.62(0.03+0.04) 1.71(0.00+0.04) +5.6% Test w/100K files HEAD HEAD~1 ----------------------------------------------------------------------------- read_cache/discard_cache 1000 times 25.85(0.00+0.06) 27.35(0.01+0.06) +5.8% read_cache/discard_cache 1000 times 25.82(0.01+0.07) 27.25(0.01+0.07) +5.5% read_cache/discard_cache 1000 times 26.00(0.01+0.07) 27.36(0.06+0.03) +5.2% Test with 1,000K files HEAD HEAD~1 ------------------------------------------------------------------------------- read_cache/discard_cache 1000 times 200.61(0.01+0.07) 218.23(0.03+0.06) +8.8% read_cache/discard_cache 1000 times 201.62(0.03+0.06) 217.86(0.03+0.06) +8.1% read_cache/discard_cache 1000 times 201.64(0.01+0.09) 217.89(0.03+0.07) +8.1% Signed-off-by: Ben Peart <benpeart@microsoft.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> --- Notes: Base Ref: master Web-Diff: https://github.com/benpeart/git/commit/95e20f17ff Checkout: git fetch https://github.com/benpeart/git no_ce_order-v2 && git checkout 95e20f17ff ### Interdiff (v1..v2): ### Patches read-cache.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/read-cache.c b/read-cache.c index 65f4fe8375..fc90ec0fce 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1664,6 +1664,7 @@ static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk, return ce; } +#ifdef DEBUG static void check_ce_order(struct index_state *istate) { unsigned int i; @@ -1685,6 +1686,7 @@ static void check_ce_order(struct index_state *istate) } } } +#endif static void tweak_untracked_cache(struct index_state *istate) { @@ -1720,7 +1722,9 @@ static void tweak_split_index(struct index_state *istate) static void post_read_index_from(struct index_state *istate) { +#ifdef DEBUG check_ce_order(istate); +#endif tweak_untracked_cache(istate); tweak_split_index(istate); } base-commit: c52ca88430e6ec7c834af38720295070d8a1e330 -- 2.14.1.windows.1.1034.g0776750557 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-24 14:45 ` [PATCH v2] " Ben Peart @ 2017-10-30 12:48 ` Ben Peart 2017-10-30 18:03 ` Jeff King 2017-10-31 1:49 ` Junio C Hamano 0 siblings, 2 replies; 18+ messages in thread From: Ben Peart @ 2017-10-30 12:48 UTC (permalink / raw) To: Ben Peart, git, gitster Cc: chriscool, t.gummerer, l.s.r, jsorianopastor, peff, Johannes Schindelin Any updates or thoughts on this one? While the patch has become quite trivial, it does results in a savings of 5%-15% in index load time. I thought the compromise of having this test only run when DEBUG is defined should limit it to developer builds (hopefully everyone developing on git is running DEBUG builds :)). Since the test is trying to detect buggy code when writing the index, I thought that was the right time to test/catch any issues. I am working on other, more substantial savings for index load times (stay tuned) but this seemed like a small simple way to help speed things up. On 10/24/2017 10:45 AM, Ben Peart wrote: > There is code in post_read_index_from() to detect out of order cache > entries when reading an index file. This order verification adds cost > to read_index_from() that grows with the size of the index. > > Put this on-disk data-structure validation code behind an #ifdef DEBUG > so only debug builds have to pay the cost. > > The effect can be seen using t/perf/p0002-read-cache.sh: > > Test w/git repo HEAD HEAD~1 > ---------------------------------------------------------------------------- > read_cache/discard_cache 1000 times 0.42(0.01+0.09) 0.48(0.01+0.09) +14.3% > read_cache/discard_cache 1000 times 0.41(0.03+0.04) 0.49(0.00+0.10) +19.5% > read_cache/discard_cache 1000 times 0.42(0.03+0.06) 0.49(0.06+0.04) +16.7% > > Test w/10K files HEAD HEAD~1 > --------------------------------------------------------------------------- > read_cache/discard_cache 1000 times 1.58(0.04+0.00) 1.71(0.00+0.07) +8.2% > read_cache/discard_cache 1000 times 1.64(0.01+0.07) 1.76(0.01+0.09) +7.3% > read_cache/discard_cache 1000 times 1.62(0.03+0.04) 1.71(0.00+0.04) +5.6% > > Test w/100K files HEAD HEAD~1 > ----------------------------------------------------------------------------- > read_cache/discard_cache 1000 times 25.85(0.00+0.06) 27.35(0.01+0.06) +5.8% > read_cache/discard_cache 1000 times 25.82(0.01+0.07) 27.25(0.01+0.07) +5.5% > read_cache/discard_cache 1000 times 26.00(0.01+0.07) 27.36(0.06+0.03) +5.2% > > Test with 1,000K files HEAD HEAD~1 > ------------------------------------------------------------------------------- > read_cache/discard_cache 1000 times 200.61(0.01+0.07) 218.23(0.03+0.06) +8.8% > read_cache/discard_cache 1000 times 201.62(0.03+0.06) 217.86(0.03+0.06) +8.1% > read_cache/discard_cache 1000 times 201.64(0.01+0.09) 217.89(0.03+0.07) +8.1% > > Signed-off-by: Ben Peart <benpeart@microsoft.com> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> > --- > > Notes: > Base Ref: master > Web-Diff: https://github.com/benpeart/git/commit/95e20f17ff > Checkout: git fetch https://github.com/benpeart/git no_ce_order-v2 && git checkout 95e20f17ff > > ### Interdiff (v1..v2): > > ### Patches > > read-cache.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/read-cache.c b/read-cache.c > index 65f4fe8375..fc90ec0fce 100644 > --- a/read-cache.c > +++ b/read-cache.c > @@ -1664,6 +1664,7 @@ static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk, > return ce; > } > > +#ifdef DEBUG > static void check_ce_order(struct index_state *istate) > { > unsigned int i; > @@ -1685,6 +1686,7 @@ static void check_ce_order(struct index_state *istate) > } > } > } > +#endif > > static void tweak_untracked_cache(struct index_state *istate) > { > @@ -1720,7 +1722,9 @@ static void tweak_split_index(struct index_state *istate) > > static void post_read_index_from(struct index_state *istate) > { > +#ifdef DEBUG > check_ce_order(istate); > +#endif > tweak_untracked_cache(istate); > tweak_split_index(istate); > } > > base-commit: c52ca88430e6ec7c834af38720295070d8a1e330 > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-30 12:48 ` Ben Peart @ 2017-10-30 18:03 ` Jeff King 2017-10-31 0:33 ` Alex Vandiver 2017-10-31 1:49 ` Junio C Hamano 1 sibling, 1 reply; 18+ messages in thread From: Jeff King @ 2017-10-30 18:03 UTC (permalink / raw) To: Ben Peart Cc: Ben Peart, git, gitster, chriscool, t.gummerer, l.s.r, jsorianopastor, Johannes Schindelin On Mon, Oct 30, 2017 at 08:48:48AM -0400, Ben Peart wrote: > Any updates or thoughts on this one? While the patch has become quite > trivial, it does results in a savings of 5%-15% in index load time. I like the general direction of avoiding the check during each read. But... > I thought the compromise of having this test only run when DEBUG is defined > should limit it to developer builds (hopefully everyone developing on git is > running DEBUG builds :)). Since the test is trying to detect buggy code > when writing the index, I thought that was the right time to test/catch any > issues. I certainly don't build with DEBUG. It traditionally hasn't done anything useful. But I'm also not convinced that this is a likely way to find bugs in the first place, so I'm OK missing out on it. But what we probably _do_ need is to make sure that "git fsck" would detect such an out-of-order index. So that developers and users alike can diagnose suspected problems. -Peff ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-30 18:03 ` Jeff King @ 2017-10-31 0:33 ` Alex Vandiver 2017-10-31 13:01 ` Ben Peart 0 siblings, 1 reply; 18+ messages in thread From: Alex Vandiver @ 2017-10-31 0:33 UTC (permalink / raw) To: Jeff King Cc: Ben Peart, Ben Peart, git, gitster, chriscool, t.gummerer, l.s.r, jsorianopastor, Johannes Schindelin On Mon, 30 Oct 2017, Jeff King wrote: > On Mon, Oct 30, 2017 at 08:48:48AM -0400, Ben Peart wrote: > > > Any updates or thoughts on this one? While the patch has become quite > > trivial, it does results in a savings of 5%-15% in index load time. > > I like the general direction of avoiding the check during each read. Same -- the savings here are well worth it, IMHO. > > I thought the compromise of having this test only run when DEBUG is defined > > should limit it to developer builds (hopefully everyone developing on git is > > running DEBUG builds :)). Since the test is trying to detect buggy code > > when writing the index, I thought that was the right time to test/catch any > > issues. > > I certainly don't build with DEBUG. It traditionally hasn't done > anything useful. But I'm also not convinced that this is a likely way to > find bugs in the first place, so I'm OK missing out on it. I also don't compile with DEBUG -- there's no documentation that mentions it, and I don't think I'd considered going poking for what was `#ifdef`d. I think it'd be reasonable to provide some configure-time setting that adds `CFLAGS="-ggdb3 -O0 -DDEBUG"` or similar, but that seems possibly moot for this particular change (see below). > But what we probably _do_ need is to make sure that "git fsck" would > detect such an out-of-order index. So that developers and users alike > can diagnose suspected problems. Agree -- that seems like a better home for this logic. > > I am working on other, more substantial savings for index load times > > (stay tuned) but this seemed like a small simple way to help speed > > things up. I'm interested to hear more about what direction you're looking in here. - Alex ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-31 0:33 ` Alex Vandiver @ 2017-10-31 13:01 ` Ben Peart 2017-10-31 17:10 ` Jeff King 0 siblings, 1 reply; 18+ messages in thread From: Ben Peart @ 2017-10-31 13:01 UTC (permalink / raw) To: Alex Vandiver, Jeff King Cc: Ben Peart, git, gitster, chriscool, t.gummerer, l.s.r, jsorianopastor, Johannes Schindelin On 10/30/2017 8:33 PM, Alex Vandiver wrote: > On Mon, 30 Oct 2017, Jeff King wrote: >> On Mon, Oct 30, 2017 at 08:48:48AM -0400, Ben Peart wrote: >> >>> Any updates or thoughts on this one? While the patch has become quite >>> trivial, it does results in a savings of 5%-15% in index load time. >> >> I like the general direction of avoiding the check during each read. > > Same -- the savings here are well worth it, IMHO. > >>> I thought the compromise of having this test only run when DEBUG is defined >>> should limit it to developer builds (hopefully everyone developing on git is >>> running DEBUG builds :)). Since the test is trying to detect buggy code >>> when writing the index, I thought that was the right time to test/catch any >>> issues. >> >> I certainly don't build with DEBUG. It traditionally hasn't done >> anything useful. But I'm also not convinced that this is a likely way to >> find bugs in the first place, so I'm OK missing out on it. > > I also don't compile with DEBUG -- there's no documentation that > mentions it, and I don't think I'd considered going poking for what > was `#ifdef`d. I think it'd be reasonable to provide some > configure-time setting that adds `CFLAGS="-ggdb3 -O0 -DDEBUG"` or > similar, but that seems possibly moot for this particular change (see > below). > >> But what we probably _do_ need is to make sure that "git fsck" would >> detect such an out-of-order index. So that developers and users alike >> can diagnose suspected problems. > > Agree -- that seems like a better home for this logic. That is how version 1 of this patch worked but the feedback to that patch was to remove it "not only during the normal operation but also in fsck." > >>> I am working on other, more substantial savings for index load times >>> (stay tuned) but this seemed like a small simple way to help speed >>> things up. > > I'm interested to hear more about what direction you're looking in here. > - Alex > I'm working on parallelizing the index load process across multiple threads/cpu cores. Specifically the loop that calls create_from_disk() and set_index_entry(). The serial nature of the index formats makes that difficult but I believe I've come up with a way to make it work across all existing index formats. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-31 13:01 ` Ben Peart @ 2017-10-31 17:10 ` Jeff King 2017-11-01 6:09 ` Junio C Hamano 0 siblings, 1 reply; 18+ messages in thread From: Jeff King @ 2017-10-31 17:10 UTC (permalink / raw) To: Ben Peart Cc: Alex Vandiver, Ben Peart, git, gitster, chriscool, t.gummerer, l.s.r, jsorianopastor, Johannes Schindelin On Tue, Oct 31, 2017 at 09:01:45AM -0400, Ben Peart wrote: > > > But what we probably _do_ need is to make sure that "git fsck" would > > > detect such an out-of-order index. So that developers and users alike > > > can diagnose suspected problems. > > > > Agree -- that seems like a better home for this logic. > > That is how version 1 of this patch worked but the feedback to that patch > was to remove it "not only during the normal operation but also in fsck." Sorry for the mixed messages (I think they are mixed between different people, and not mixed _just_ from me ;) ). For what it's worth, I like your v1, but can live with either approach. -Peff ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-31 17:10 ` Jeff King @ 2017-11-01 6:09 ` Junio C Hamano 0 siblings, 0 replies; 18+ messages in thread From: Junio C Hamano @ 2017-11-01 6:09 UTC (permalink / raw) To: Jeff King Cc: Ben Peart, Alex Vandiver, Ben Peart, git, chriscool, t.gummerer, l.s.r, jsorianopastor, Johannes Schindelin Jeff King <peff@peff.net> writes: > On Tue, Oct 31, 2017 at 09:01:45AM -0400, Ben Peart wrote: > >> > > But what we probably _do_ need is to make sure that "git fsck" would >> > > detect such an out-of-order index. So that developers and users alike >> > > can diagnose suspected problems. >> > >> > Agree -- that seems like a better home for this logic. >> >> That is how version 1 of this patch worked but the feedback to that patch >> was to remove it "not only during the normal operation but also in fsck." > > Sorry for the mixed messages (I think they are mixed between different > people, and not mixed _just_ from me ;) ). > > For what it's worth, I like your v1, but can live with either approach. I agree that v1 is the less bad one between the two. To be honest, if the original code were done in that way (i.e. the state with v1 applied), I probably would have had a very hard time to justify accepting a patch to "make it safer by always checking at runtime" (i.e. a reverse of v1 patch). So, let's go with v1. Thanks, all. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-30 12:48 ` Ben Peart 2017-10-30 18:03 ` Jeff King @ 2017-10-31 1:49 ` Junio C Hamano 2017-10-31 12:51 ` Ben Peart 1 sibling, 1 reply; 18+ messages in thread From: Junio C Hamano @ 2017-10-31 1:49 UTC (permalink / raw) To: Ben Peart Cc: Ben Peart, git, chriscool, t.gummerer, l.s.r, jsorianopastor, peff, Johannes Schindelin Ben Peart <peartben@gmail.com> writes: > Any updates or thoughts on this one? While the patch has become quite > trivial, it does results in a savings of 5%-15% in index load time. > > I thought the compromise of having this test only run when DEBUG is > defined should limit it to developer builds (hopefully everyone > developing on git is running DEBUG builds :)). Since the test is > trying to detect buggy code when writing the index, I thought that was > the right time to test/catch any issues. This check is more about catching a possible breakage (and a malicious repository) early before we go too far into the operation. I do not think this check is about debugging the implementation of Git. How would it be useful to turn it on in DEBUG build? While I do think pursuing any runtime improvements better than a couple of percents is worth it, I do not think the approach taken by this iteration makes much sense; the previous one that at least allowed fsck to catch breakage may have been already too leaky to catch real issues (i.e. when you are asked to visit and look at an unknown repository, you wouldn't start your session with "git fsck" to protect yourself), and this round makes it much worse. Besides, I see no -DDEBUG from "grep -e '-D[A-Z]*DEBUG' Makefile". If this check were about allowing us easier time debugging the binary (which I do not think it is), this probably should be '#ifndef NDEBUG' instead. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2] read_index_from(): Skip verification of the cache entry order to speed index loading 2017-10-31 1:49 ` Junio C Hamano @ 2017-10-31 12:51 ` Ben Peart 0 siblings, 0 replies; 18+ messages in thread From: Ben Peart @ 2017-10-31 12:51 UTC (permalink / raw) To: Junio C Hamano Cc: Ben Peart, git, chriscool, t.gummerer, l.s.r, jsorianopastor, peff, Johannes Schindelin On 10/30/2017 9:49 PM, Junio C Hamano wrote: > Ben Peart <peartben@gmail.com> writes: > >> Any updates or thoughts on this one? While the patch has become quite >> trivial, it does results in a savings of 5%-15% in index load time. >> >> I thought the compromise of having this test only run when DEBUG is >> defined should limit it to developer builds (hopefully everyone >> developing on git is running DEBUG builds :)). Since the test is >> trying to detect buggy code when writing the index, I thought that was >> the right time to test/catch any issues. > > This check is more about catching a possible breakage (and a > malicious repository) early before we go too far into the operation. > I do not think this check is about debugging the implementation of > Git. How would it be useful to turn it on in DEBUG build? > > While I do think pursuing any runtime improvements better than a > couple of percents is worth it, I do not think the approach taken by > this iteration makes much sense; the previous one that at least > allowed fsck to catch breakage may have been already too leaky to > catch real issues (i.e. when you are asked to visit and look at an > unknown repository, you wouldn't start your session with "git fsck" > to protect yourself), and this round makes it much worse. > > Besides, I see no -DDEBUG from "grep -e '-D[A-Z]*DEBUG' Makefile". > If this check were about allowing us easier time debugging the > binary (which I do not think it is), this probably should be > '#ifndef NDEBUG' instead. > I've tried 3 different ways to remove the overhead of this call from regular git operations. The first was version 1 of the patch which had fsck catch breakage but removed it from other commands that read the index. Since that version was not accepted, I took the feedback "I think I like the direction of getting rid of the order in post_read_index_from(), not only during the normal but also in fsck" to come up with a version 2. I was hesitant to remove the code completely as I did believe it had some value in detecting invalid indexes so went looking for a macro I could use to have it 1) not happen during regular user commands but 2) still happen for developers. The NDEBUG macro is guaranteed by the C89 standard (http://port70.net/~nsz/c/c89/c89-draft.html#4.1.2 ) to guard the code that is only necessary when assertions are in effect so seemed like a good choice. When I used it however, I discovered that the git Makefile does not define NDEBUG so using this macro did not have any effect thus making the patch useless as the code continues to run in all cases. On a side note, there are 434 instances of assert which up until this experience I believed were being removed in released builds. As far as I can tell, that is not the case. If removing them is the desired/expected behavior, we need to fix our Makefile as it only currently defines NDEBUG if USE_NED_ALLOCATOR is defined. I then searched the code and found 47 instances where the macro DEBUG was used. I (incorrectly) assumed that meant it must be used by other git developers. I personally have a build alias that adds "-j12 CFLAGS=-DDEBUG" to my make command but apparently I'm in the minority. :) This assumption led me to the patch version 2 (guarding the code with #ifdef DEBUG) as it does meet the request to remove it during normal but also fsck and does so with regular/release builds as they are currently defined. It seems that the current round of feedback is more in favor of leaving the test in fsck but removing it for other commands. If that is the desired behavior, please use version 1 of the patch. I'm also happy to flip this to "#ifndef NDEBUG" but that only makes sense if the released builds actually set NDEBUG which (I believe) will require a patch to Makefile. ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2017-11-01 6:09 UTC | newest] Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-10-18 14:27 [PATCH v1] read_index_from(): Skip verification of the cache entry order to speed index loading Ben Peart 2017-10-19 5:22 ` Junio C Hamano 2017-10-19 15:12 ` Ben Peart 2017-10-19 16:05 ` Jeff King 2017-10-20 1:14 ` Junio C Hamano 2017-10-19 22:14 ` Stefan Beller 2017-10-20 12:47 ` Johannes Schindelin 2017-10-20 18:53 ` Stefan Beller 2017-10-21 1:55 ` Junio C Hamano 2017-10-24 14:45 ` [PATCH v2] " Ben Peart 2017-10-30 12:48 ` Ben Peart 2017-10-30 18:03 ` Jeff King 2017-10-31 0:33 ` Alex Vandiver 2017-10-31 13:01 ` Ben Peart 2017-10-31 17:10 ` Jeff King 2017-11-01 6:09 ` Junio C Hamano 2017-10-31 1:49 ` Junio C Hamano 2017-10-31 12:51 ` Ben Peart
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).