git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Jeff King <peff@peff.net>
Cc: Derrick Stolee <stolee@gmail.com>,
	Constantine <hi-angel@yandex.ru>,
	Christian Couder <christian.couder@gmail.com>,
	Mike Hommey <mh@glandium.org>, git <git@vger.kernel.org>
Subject: Re: [PATCH] revision: quit pruning diff more quickly when possible
Date: Sat, 14 Oct 2017 11:43:30 +0900	[thread overview]
Message-ID: <xmqqy3oetpnx.fsf@gitster.mtv.corp.google.com> (raw)
In-Reply-To: <20171013152745.cgqt3qgvcngyr5ew@sigill.intra.peff.net> (Jeff King's message of "Fri, 13 Oct 2017 11:27:45 -0400")

Jeff King <peff@peff.net> writes:

> Here it is cleaned up and with a commit message. There's another case
> that can be optimized, too: --remove-empty with an all-deletions commit.
> That's probably even more obscure and pathological, but it was easy to
> cover in the same breath.

This one looks good.  It appears that again you guys had all the fun
while I was offline ;-).  And I am happy to see that we didn't veer
in the direction to optimize for a wrong case by keeping track of
what trees we already saw and things like that, of course.

Thanks.

> Subject: revision: quit pruning diff more quickly when possible
>
> When the revision traversal machinery is given a pathspec,
> we must compute the parent-diff for each commit to determine
> which ones are TREESAME. We set the QUICK diff flag to avoid
> looking at more entries than we need; we really just care
> whether there are any changes at all.
>
> But there is one case where we want to know a bit more: if
> --remove-empty is set, we care about finding cases where the
> change consists only of added entries (in which case we may
> prune the parent in try_to_simplify_commit()). To cover that
> case, our file_add_remove() callback does not quit the diff
> upon seeing an added entry; it keeps looking for other types
> of entries.
>
> But this means when --remove-empty is not set (and it is not
> by default), we compute more of the diff than is necessary.
> You can see this in a pathological case where a commit adds
> a very large number of entries, and we limit based on a
> broad pathspec. E.g.:
>
>   perl -e '
>     chomp(my $blob = `git hash-object -w --stdin </dev/null`);
>     for my $a (1..1000) {
>       for my $b (1..1000) {
>         print "100644 $blob\t$a/$b\n";
>       }
>     }
>   ' | git update-index --index-info
>   git commit -qm add
>
>   git rev-list HEAD -- .
>
> This case takes about 100ms now, but after this patch only
> needs 6ms. That's not a huge improvement, but it's easy to
> get and it protects us against even more pathological cases
> (e.g., going from 1 million to 10 million files would take
> ten times as long with the current code, but not increase at
> all after this patch).
>
> This is reported to minorly speed-up pathspec limiting in
> real world repositories (like the 100-million-file Windows
> repository), but probably won't make a noticeable difference
> outside of pathological setups.
>
> This patch actually covers the case without --remove-empty,
> and the case where we see only deletions. See the in-code
> comment for details.
>
> Note that we have to add a new member to the diff_options
> struct so that our callback can see the value of
> revs->remove_empty_trees. This callback parameter could be
> passed to the "add_remove" and "change" callbacks, but
> there's not much point. They already receive the
> diff_options struct, and doing it this way avoids having to
> update the function signature of the other callbacks
> (arguably the format_callback and output_prefix functions
> could benefit from the same simplification).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  diff.h     |  1 +
>  revision.c | 16 +++++++++++++---
>  2 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/diff.h b/diff.h
> index 7dcfcfbef7..4a34d256f1 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -180,6 +180,7 @@ struct diff_options {
>  	pathchange_fn_t pathchange;
>  	change_fn_t change;
>  	add_remove_fn_t add_remove;
> +	void *change_fn_data;
>  	diff_format_fn_t format_callback;
>  	void *format_callback_data;
>  	diff_prefix_fn_t output_prefix;
> diff --git a/revision.c b/revision.c
> index 8fd222f3bf..a3f245e2cc 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -399,8 +399,16 @@ static struct commit *one_relevant_parent(const struct rev_info *revs,
>   * if the whole diff is removal of old data, and otherwise
>   * REV_TREE_DIFFERENT (of course if the trees are the same we
>   * want REV_TREE_SAME).
> - * That means that once we get to REV_TREE_DIFFERENT, we do not
> - * have to look any further.
> + *
> + * The only time we care about the distinction is when
> + * remove_empty_trees is in effect, in which case we care only about
> + * whether the whole change is REV_TREE_NEW, or if there's another type
> + * of change. Which means we can stop the diff early in either of these
> + * cases:
> + *
> + *   1. We're not using remove_empty_trees at all.
> + *
> + *   2. We saw anything except REV_TREE_NEW.
>   */
>  static int tree_difference = REV_TREE_SAME;
>  
> @@ -411,9 +419,10 @@ static void file_add_remove(struct diff_options *options,
>  		    const char *fullpath, unsigned dirty_submodule)
>  {
>  	int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
> +	struct rev_info *revs = options->change_fn_data;
>  
>  	tree_difference |= diff;
> -	if (tree_difference == REV_TREE_DIFFERENT)
> +	if (!revs->remove_empty_trees || tree_difference != REV_TREE_NEW)
>  		DIFF_OPT_SET(options, HAS_CHANGES);
>  }
>  
> @@ -1351,6 +1360,7 @@ void init_revisions(struct rev_info *revs, const char *prefix)
>  	DIFF_OPT_SET(&revs->pruning, QUICK);
>  	revs->pruning.add_remove = file_add_remove;
>  	revs->pruning.change = file_change;
> +	revs->pruning.change_fn_data = revs;
>  	revs->sort_order = REV_SORT_IN_GRAPH_ORDER;
>  	revs->dense = 1;
>  	revs->prefix = prefix;

  parent reply	other threads:[~2017-10-14  2:43 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-13  9:51 git-clone causes out of memory Constantine
2017-10-13 10:06 ` Mike Hommey
2017-10-13 10:26   ` Christian Couder
2017-10-13 10:37     ` Mike Hommey
2017-10-13 10:44       ` Christian Couder
2017-10-13 12:04         ` Junio C Hamano
2017-10-13 12:12           ` Constantine
2017-10-13 12:44             ` Jeff King
2017-10-13 13:15               ` Derrick Stolee
2017-10-13 13:39                 ` Derrick Stolee
2017-10-13 13:50                   ` Jeff King
2017-10-13 13:55                     ` Derrick Stolee
2017-10-13 13:56                       ` Jeff King
2017-10-13 14:10                         ` Jeff King
2017-10-13 14:20                           ` Jeff King
2017-10-13 14:25                             ` Derrick Stolee
2017-10-13 14:26                               ` Jeff King
2017-10-13 14:30                                 ` Derrick Stolee
2017-10-13 15:27                                 ` [PATCH] revision: quit pruning diff more quickly when possible Jeff King
2017-10-13 15:37                                   ` Derrick Stolee
2017-10-13 15:44                                     ` Jeff King
2017-10-14  2:43                                   ` Junio C Hamano [this message]
2017-10-13 12:35   ` git-clone causes out of memory Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqy3oetpnx.fsf@gitster.mtv.corp.google.com \
    --to=gitster@pobox.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=hi-angel@yandex.ru \
    --cc=mh@glandium.org \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).