Re: [PATCH v2 5/7] name-hash: perf improvement for lazy_init_name_hash

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Jeff Hostetler <git@jeffhostetler.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, peff@peff.net,
	Jeff Hostetler <jeffhost@microsoft.com>
Subject: Re: [PATCH v2 5/7] name-hash: perf improvement for lazy_init_name_hash
Date: Mon, 27 Mar 2017 16:50:40 -0400	[thread overview]
Message-ID: <122907b4-e689-c436-18df-27664186a8ce@jeffhostetler.com> (raw)
In-Reply-To: <xmqqinmu1n8o.fsf@gitster.mtv.corp.google.com>



On 3/27/2017 4:24 PM, Junio C Hamano wrote:
> git@jeffhostetler.com writes:
>
>> +/*
>> + * We use n mutexes to guard n partitions of the "istate->dir_hash"
>> + * hashtable.  Since "find" and "insert" operations will hash to a
>> + * particular bucket and modify/search a single chain, we can say
>> + * that "all chains mod n" are guarded by the same mutex -- rather
>> + * than having a single mutex to guard the entire table.  (This does
>> + * require that we disable "rehashing" on the hashtable.)
>> + *
>> + * So, a larger value here decreases the probability of a collision
>> + * and the time that each thread must wait for the mutex.
>> + */
>> +#define LAZY_MAX_MUTEX   (32)
>
> Good thinking is very well explained in the in-code comment.
>
>> +static int handle_range_dir(
>> +	struct index_state *istate,
>> +	int k_start,
>> +	int k_end,
>> +	struct dir_entry *parent,
>> +	struct strbuf *prefix,
>> +	struct lazy_entry *lazy_entries,
>> +	struct dir_entry **dir_new_out)
>> +{
>> +	int rc, k;
>> +	int input_prefix_len = prefix->len;
>> +	struct dir_entry *dir_new;
>> +
>> +	dir_new = hash_dir_entry_with_parent_and_prefix(istate, parent, prefix);
>> +
>> +	strbuf_addch(prefix, '/');
>> +
>> +	/*
>> +	 * Scan forward in the index array for index entries having the same
>> +	 * path prefix (that are also in this directory).
>> +	 */
>> +	if (strncmp(istate->cache[k_start + 1]->name, prefix->buf, prefix->len) > 0)
>> +		k = k_start + 1;
>> +	else if (strncmp(istate->cache[k_end - 1]->name, prefix->buf, prefix->len) == 0)
>> +		k = k_end;
>> +	else {
>> +		int begin = k_start;
>> +		int end = k_end;
>> +		while (begin < end) {
>> +			int mid = (begin + end) >> 1;
>> +			int cmp = strncmp(istate->cache[mid]->name, prefix->buf, prefix->len);
>> +			if (cmp == 0) /* mid has same prefix; look in second part */
>> +				begin = mid + 1;
>> +			else if (cmp > 0) /* mid is past group; look in first part */
>> +				end = mid;
>> +			else
>> +				die("cache entry out of order");
>
> Dying at this low level in the callchain in a worker thread made me
> feel a little bit nervous, but even if we arrange to return to the
> caller without doing extra computation, we would want to detect and
> abort when the cache entries are not sorted anyway, so this hsould
> be OK.

yeah, i wasn't sure if there was any need to complicate things
returning this error.  the index is sorted before we get here,
so this should never happen.

>
>> +		}
>> +		k = begin;
>> +	}
>> +
>> +	/*
>> +	 * Recurse and process what we can of this subset [k_start, k).
>> +	 */
>> +	rc = handle_range_1(istate, k_start, k, dir_new, prefix, lazy_entries);
>> +
>> +	strbuf_setlen(prefix, input_prefix_len);
>> +
>> +	*dir_new_out = dir_new;
>> +	return rc;
>> +}
>
> The varilable "rc" (return code?) is a lot more than return code; it
> tells how many entries we processed and signals the caller that it
> still needs to sweep the remainder if we didn't reach k_end.  The
> caller of this function calls the variable to receive this
> "processed", so I didn't find it too confusing while reading the
> code top-down, though.

The "rc" variable came from clear_ce_flags_dir().  I stole the
diving logic from it and clear_ce_flags_1(), so I tried to keep
the same name here.

(And that reminds me, the linear search in clead_ce_flags_dir()
could be sped up with a variation of the binary search I put in
here.  But that's for another day.)

>
>> +static int handle_range_1(
>> +	struct index_state *istate,
>> +	int k_start,
>> +	int k_end,
>> +	struct dir_entry *parent,
>> +	struct strbuf *prefix,
>> +	struct lazy_entry *lazy_entries)
>> +{
>> +	int input_prefix_len = prefix->len;
>> +	int k = k_start;
>> +
>> +	while (k < k_end) {
>> +		struct cache_entry *ce_k = istate->cache[k];
>> +		const char *name, *slash;
>> +
>> +		if (prefix->len && strncmp(ce_k->name, prefix->buf, prefix->len))
>> +			break;
>
> At first I was worried by this early return (i.e. we chop the entire
> active_nr entries into lazy_nr_dir_threads and hand each of them a
> range [k_start, k_end)---stopping before a thread reaches the end of
> the range it is responsible for will leave gaps), but then realized
> that this early return "we are done with the entries in the same
> directory" kicks in only for recursive invocation of this function,
> which makes it correct and perfectly fine.
>
> I also briefly wondered if it is worth wiggling the boundary of
> ranges for adjacent workers to align with directory boundaries, as
> the last round of hashes done by a worker and the first round of
> hashes done by another worker adjacent to it will be hashing for the
> same parent directory, but I think that would be counter-productive
> and think your "almost even" distribution would be a lot more
> sensible.  After all, when distribution of paths is skewed, a single
> directory may have disproportionally more (leaf) entries than the
> remainder of the index and in such a case, we would want multiple
> workers to share the load of hashing them, even if that means they
> may have to hash the same leading path independently.

I thought about various options along this, but yeah the work to
find the end of the directory would have to be done before the
threads started or each thread would have to tweak their endpoints
and that might be more expensive than just repeating a few
directory component calculations along the boundaries (and much
more complicated).

>
> Nicely done.  Let's have this in 'next' and then in 'master'
> soonish.
>
> Thanks.
>

Thanks
Jeff

next prev parent reply	other threads:[~2017-03-27 20:51 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-23 13:46 [PATCH v2 0/7] thread lazy_init_name_hash git
2017-03-23 13:46 ` [PATCH v2 1/7] name-hash: specify initial size for istate.dir_hash table git
2017-03-23 13:47 ` [PATCH v2 2/7] hashmap: allow memihash computation to be continued git
2017-03-23 13:47 ` [PATCH v2 3/7] hashmap: Add disallow_rehash setting git
2017-03-23 13:47 ` [PATCH v2 4/7] hashmap: document memihash_cont, hashmap_disallow_rehash api git
2017-03-23 13:47 ` [PATCH v2 5/7] name-hash: perf improvement for lazy_init_name_hash git
2017-03-23 15:25   ` Ramsay Jones
2017-03-23 17:45     ` Stefan Beller
2017-03-24 12:36       ` Jeff Hostetler
2017-03-27 20:24   ` Junio C Hamano
2017-03-27 20:50     ` Jeff Hostetler [this message]
2017-03-23 13:47 ` [PATCH v2 6/7] name-hash: add test-lazy-init-name-hash git
2017-03-23 15:29   ` Ramsay Jones
2017-03-23 13:47 ` [PATCH v2 7/7] name-hash: add perf test for lazy_init_name_hash git
2017-03-23 17:52 ` [PATCH v2 0/7] thread lazy_init_name_hash Junio C Hamano
2017-03-24 12:39   ` Jeff Hostetler
2017-03-24 16:36     ` Stefan Beller
2017-03-24 17:44     ` Junio C Hamano
2017-04-06  2:22 ` Duy Nguyen
2017-04-06 13:53   ` Jeff Hostetler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=122907b4-e689-c436-18df-27664186a8ce@jeffhostetler.com \
    --to=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jeffhost@microsoft.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).