git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff Hostetler <git@jeffhostetler.com>
To: Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Jeff Hostetler <jeffhost@microsoft.com>
Subject: Re: [PATCH v1] diffcore-rename: speed up register_rename_src
Date: Thu, 20 Apr 2017 10:00:04 -0400	[thread overview]
Message-ID: <40228c69-7946-3ef1-35de-4cea9b0312e4@jeffhostetler.com> (raw)
In-Reply-To: <20170419031839.m2zgwywa2soejiqk@sigill.intra.peff.net>



On 4/18/2017 11:18 PM, Jeff King wrote:
> On Tue, Apr 18, 2017 at 10:56:08PM -0400, Jeff King wrote:
>
>>> When adding many things, we often just append and then sort at the
>>> end after we finished adding.  I wonder if recent "check the last
>>> one and append" optimization beats that strategy.
>>
>> The big question is whether we need to detect duplicates while we're
>> appending to the list, which is hard on an unsorted list.  In this
>> function, at least, we do detect when the path already exists and return
>> the existing entry. I'm not sure under what circumstances we would see
>> such a duplicate, though, as each filename should appear only once in
>> the tree diff. I would think.
>>
>> Doing:
>>
>> diff --git a/diffcore-rename.c b/diffcore-rename.c
>> index f7444c86b..56a493d97 100644
>> --- a/diffcore-rename.c
>> +++ b/diffcore-rename.c
>> @@ -86,7 +86,7 @@ static struct diff_rename_src *register_rename_src(struct diff_filepair *p)
>>  		struct diff_rename_src *src = &(rename_src[next]);
>>  		int cmp = strcmp(one->path, src->p->one->path);
>>  		if (!cmp)
>> -			return src;
>> +			die("BUG: duplicate rename src: %s", one->path);
>>  		if (cmp < 0) {
>>  			last = next;
>>  			continue;
>>
>> passes the test suite, at least. :)
>
> Maybe relevant: 4d6be03b9 (diffcore-rename: avoid processing duplicate
> destinations, 2015-02-26). That's on the dst side, but possibly we
> should do something similar on the src side.
>
> BTW, I think the return value from register_rename_src() is
> questionable. It points to a "struct diff_rename_src" that may be
> reallocated by further calls to the function. Fortunately nobody
> actually looks at it, let alone saves it, so there's no bug.
>
> We may want to convert that return value to a void (if not just return
> an int for "hey, there's a duplicate", like we do for add_rename_dst()).
>
> Also, presumably that function could learn the same "check the last one"
> trick that the src side does. Which leads me back to "surely we can
> generalize this". I don't think bsearch() is quite what we want, because
> its interface doesn't tell us where to put the item when it isn't found.
> But I think we could make a general bsearch-like function that has
> similar semantics to index_name_pos(), with its negative-integer return.
>
> And then that would be a general lookup function, and we could easily
> build a general "look up and add" function around that. And the "check
> the last one" optimization would go in the latter.


I do see your point.  And I do get tired of littering the code with
"check the last before binary searching" tricks.  It would be nice
if we could better isolate this.  I've tried to follow the path of
least change/damage -- don't change functionality, but just short-cut
when possible, so the patches I've pushed up this month have mostly
taken that form.

And you're right, the gains on this particular patch are relatively
minor and it is a bit of a contrived case (lots of renames??), but
it did come up while testing on the Windows repo.  It doesn't happen
often, but it did happen.

So I'm OK with rejecting this one.

Perhaps the thing to learn from this (and the other ones) is that
we have lots of places where we are building a sorted list by
iterating over a sorted list.  The insert routines are general
purpose and cannot assume this, so they search first.  Perhaps it
would be clearer to have independent _append and _insert functions
and have the caller explicitly call the appropriate one.  The mainline
iterations on the existing index could just call the _append form
and never have to worry about searching or the negative-integer
return trick.  Whereas, the random iterations (such as on the
command's arg list), would always call the _insert form.

Jeff


  reply	other threads:[~2017-04-20 14:00 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-18 19:44 [PATCH v1] diffcore-rename speedup git
2017-04-18 19:44 ` [PATCH v1] diffcore-rename: speed up register_rename_src git
2017-04-19  1:32   ` Jeff King
2017-04-19  2:45     ` Junio C Hamano
2017-04-19  2:56       ` Jeff King
2017-04-19  3:18         ` Jeff King
2017-04-20 14:00           ` Jeff Hostetler [this message]
2017-04-20 16:13             ` Jeff King
2017-04-20 18:08               ` Jeff Hostetler
2017-04-20 18:34                 ` Jeff King
2017-04-21  1:19                   ` Junio C Hamano
2017-04-20 10:40     ` Johannes Schindelin
2017-04-20 15:50       ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=40228c69-7946-3ef1-35de-4cea9b0312e4@jeffhostetler.com \
    --to=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jeffhost@microsoft.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).