Re: [PATCH v3 5/6] split-index: don't compare stat data of entries already marked for split index

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: Duy Nguyen <pclouds@gmail.com>
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Thomas Gummerer" <t.gummerer@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Paul-Sebastian Ungureanu" <ungureanupaulsebastian@gmail.com>,
	git@vger.kernel.org
Subject: Re: [PATCH v3 5/6] split-index: don't compare stat data of entries already marked for split index
Date: Sat, 29 Sep 2018 11:14:29 +0200	[thread overview]
Message-ID: <20180929091429.GF23446@localhost> (raw)
In-Reply-To: <20180929053608.GB20349@duynguyen.home>

On Sat, Sep 29, 2018 at 07:36:08AM +0200, Duy Nguyen wrote:
> On Fri, Sep 28, 2018 at 06:24:58PM +0200, SZEDER Gábor wrote:
> > When unpack_trees() constructs a new index, it copies cache entries
> > from the original index [1].  prepare_to_write_split_index() has to
> > deal with this, and it has a dedicated code path for copied entries
> > that are present in the shared index, where it compares the cached
> > data in the corresponding copied and original entries.  If the cached
> > data matches, then they are considered the same; if it differs, then
> > the copied entry will be marked for inclusion as a replacement entry
> > in the just about to be written split index by setting the
> > CE_UPDATE_IN_BASE flag.
> > 
> > However, a cache entry already has its CE_UPDATE_IN_BASE flag set upon
> > reading the split index, if the entry already has a replacement entry
> > there, or upon refreshing the cached stat data, if the corresponding
> > file was modified.  The state of this flag is then preserved when
> > unpack_trees() copies a cache entry from the shared index.
> > 
> > So modify prepare_to_write_split_index() to check the copied cache
> > entries' CE_UPDATE_IN_BASE flag first, and skip the thorough
> > comparison of cached data if the flag is already set.
> 
> OK so this is an optimization, not a bug fix. Right?

Well, a microoptimization at most: with all what's going on in
unpack_trees() I seriously doubt that it's effect is measurable.

> > Note that comparing the cached data in copied and original entries in
> 
> s/cached data/cached stat data/ ? I was confused for a bit.

No, it's indeed cached data, but now that you mention it, the subject
line does need a s/stat //.

The comparison is done with this call:

  ret = memcmp(&ce->ce_stat_data, &base->ce_stat_data,
               offsetof(struct cache_entry, name) -
               offsetof(struct cache_entry, ce_stat_data));

i.e. it starts at the stat data and ends just before the cache entry's
name, and 'struct cache_entry' has several other fields between these
two, including e.g. the cached oid:

  struct cache_entry {
          struct hashmap_entry ent;
          struct stat_data ce_stat_data;
          unsigned int ce_mode;
          unsigned int ce_flags;
          unsigned int mem_pool_allocated;
          unsigned int ce_namelen;
          unsigned int index;     /* for link extension */
          struct object_id oid;
          char name[FLEX_ARRAY]; /* more */
  };

However, to me it's mostly about clarity of the code, and about
documenting that CE_UPDATE_IN_BASE might have already been set at that
point and why, so the next dev diving in to debug the split index
doesn't have to figure this out himself.

> > the shared index might actually be entirely unnecessary.  In theory
> > all code paths refreshing the cached stat data of an entry in the
> > shared index should set the CE_UPDATE_IN_BASE flag in that entry, and
> > unpack_trees() should preserve this flag when copying cache entries.
> > This means that the cached data is only ever changed if the
> > CE_UPDATE_IN_BASE flag is set as well.  Our test suite seems to
> > confirm this: instrumenting the conditions in question and running the
> > test suite repeatedly with 'GIT_TEST_SPLIT_INDEX=yes' showed that the
> > cached data in a copied entry differs from the data in the shared
> > entry only if its CE_UPDATE_IN_BASE flag is indeed set.
> 
> Yes I was probably just being paranoid (or sticking to simpler
> checks). I was told that split index is computation expensive and not
> doing unnecesary/expensive checks may help. But let's leave it for
> later.
> 
> > +			} else {
> > +				/*
> > +				 * Thoroughly compare the cached data to see
> > +				 * whether it should be marked for inclusion
> > +				 * in the split index.
> > +				 *
> > +				 * This comparison might be unnecessary, as
> > +				 * code paths modifying the cached data do
> > +				 * set CE_UPDATE_IN_BASE as well.
> > +				 */
> > +				const unsigned int ondisk_flags =
> > +					CE_STAGEMASK | CE_VALID |
> > +					CE_EXTENDED_FLAGS;
> > +				unsigned int ce_flags, base_flags, ret;
> > +				ce_flags = ce->ce_flags;
> > +				base_flags = base->ce_flags;
> > +				/* only on-disk flags matter */
> > +				ce->ce_flags   &= ondisk_flags;
> > +				base->ce_flags &= ondisk_flags;
> > +				ret = memcmp(&ce->ce_stat_data, &base->ce_stat_data,
> > +					     offsetof(struct cache_entry, name) -
> > +					     offsetof(struct cache_entry, ce_stat_data));
> > +				ce->ce_flags = ce_flags;
> > +				base->ce_flags = base_flags;
> 
> Maybe make this block a separate function (compare_ce_content or
> something). The amount of indentation is getting too high.

Ah, I was secretly hoping for something along the lines of "your
analysis is correct, we can safely drop this comparison" :)

Btw, I thought about extracing this whole loop into a separate
function like mark_updated_entries_for_split_index() or something, but
there are other things going on in this loop as well, e.g. marking
with CE_MATCHED and deduplicating copied entries, not to mention the
conditions that set 'ce->index = 0', which I think should die() or
BUG() or are unnecessary, see my followup email to this patch in v4:

  https://public-inbox.org/git/20180927134324.GI27036@localhost/

> > +				if (ret)
> > +					ce->ce_flags |= CE_UPDATE_IN_BASE;
> > +			}
> >  			discard_cache_entry(base);
> >  			si->base->cache[ce->index - 1] = ce;
> >  		}
> > -- 
> > 2.19.0.361.gafc87ffe72
> >

next prev parent reply	other threads:[~2018-09-29  9:14 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-27 12:44 [PATCH v2 0/5] Fix the racy split index problem SZEDER Gábor
2018-09-27 12:44 ` [PATCH v2 1/5] split-index: add tests to demonstrate " SZEDER Gábor
2018-09-28  0:48   ` SZEDER Gábor
2018-09-28  2:40     ` SZEDER Gábor
2018-09-28 17:30     ` Junio C Hamano
2018-09-27 12:44 ` [PATCH v2 2/5] t1700-split-index: date back files to avoid racy situations SZEDER Gábor
2018-09-27 12:44 ` [PATCH v2 3/5] split-index: count the number of deleted entries SZEDER Gábor
2018-09-27 12:44 ` [PATCH v2 4/5] split-index: don't compare stat data of entries already marked for split index SZEDER Gábor
2018-09-27 13:43   ` SZEDER Gábor
2018-09-27 12:44 ` [PATCH v2 5/5] split-index: smudge and add racily clean cache entries to " SZEDER Gábor
2018-09-27 13:53 ` [PATCH v2 0/5] Fix the racy split index problem Ævar Arnfjörð Bjarmason
2018-09-27 14:23   ` SZEDER Gábor
2018-09-27 15:25     ` Ævar Arnfjörð Bjarmason
2018-09-28  6:57       ` Ævar Arnfjörð Bjarmason
2018-09-28 10:17         ` SZEDER Gábor
2018-10-08 14:54         ` Ævar Arnfjörð Bjarmason
2018-10-08 15:41           ` SZEDER Gábor
2018-09-28 16:24 ` [PATCH v3 0/6] " SZEDER Gábor
2018-09-28 16:24   ` [PATCH v3 1/6] t1700-split-index: document why FSMONITOR is disabled in this test script SZEDER Gábor
2018-09-28 16:24   ` [PATCH v3 2/6] split-index: add tests to demonstrate the racy split index problem SZEDER Gábor
2018-09-28 16:24   ` [PATCH v3 3/6] t1700-split-index: date back files to avoid racy situations SZEDER Gábor
2018-09-28 16:24   ` [PATCH v3 4/6] split-index: count the number of deleted entries SZEDER Gábor
2018-09-28 16:24   ` [PATCH v3 5/6] split-index: don't compare stat data of entries already marked for split index SZEDER Gábor
2018-09-29  5:36     ` Duy Nguyen
2018-09-29  9:14       ` SZEDER Gábor [this message]
2018-09-29 10:07         ` SZEDER Gábor
2018-09-28 16:24   ` [PATCH v3 6/6] split-index: smudge and add racily clean cache entries to " SZEDER Gábor
2018-09-29  5:21     ` Duy Nguyen
2018-09-29  7:57       ` SZEDER Gábor
2018-09-30 14:47   ` [PATCH v3 0/6] Fix the racy split index problem SZEDER Gábor
2018-10-05  6:15     ` Junio C Hamano
2018-10-11  9:43   ` [PATCH v4 " SZEDER Gábor
2018-10-11  9:43     ` [PATCH v4 1/6] t1700-split-index: document why FSMONITOR is disabled in this test script SZEDER Gábor
2018-10-11  9:43     ` [PATCH v4 2/6] split-index: add tests to demonstrate the racy split index problem SZEDER Gábor
2018-10-11  9:43     ` [PATCH v4 3/6] t1700-split-index: date back files to avoid racy situations SZEDER Gábor
2018-10-11  9:43     ` [PATCH v4 4/6] split-index: count the number of deleted entries SZEDER Gábor
2018-10-11  9:43     ` [PATCH v4 5/6] split-index: don't compare cached data of entries already marked for split index SZEDER Gábor
2018-10-11  9:43     ` [PATCH v4 6/6] split-index: smudge and add racily clean cache entries to " SZEDER Gábor
2018-10-11  9:53     ` [PATCH 7/6] split-index: BUG() when cache entry refers to non-existing shared entry SZEDER Gábor
2018-10-11 10:36     ` [PATCH v4 0/6] Fix the racy split index problem Ævar Arnfjörð Bjarmason
2018-10-11 11:38       ` SZEDER Gábor
2018-10-12  3:20       ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180929091429.GF23446@localhost \
    --to=szeder.dev@gmail.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=t.gummerer@gmail.com \
    --cc=ungureanupaulsebastian@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).