git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Johan Herland <johan@herland.net>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Johan Herland <johan@herland.net>
Subject: [PATCH 6/6] New --dirstat=lines mode, doing dirstat analysis based on diffstat
Date: Tue, 26 Apr 2011 02:01:42 +0200	[thread overview]
Message-ID: <1303776102-9085-7-git-send-email-johan@herland.net> (raw)
In-Reply-To: <1303776102-9085-1-git-send-email-johan@herland.net>

This patch adds an alternative implementation of show_dirstat(), called
show_dirstat_by_line(), which uses the more expensive diffstat analysis
(as opposed to show_dirstat()'s own (relatively inexpensive) analysis)
to derive the numbers from which the --dirstat output is computed.

The alternative implementation is controlled by the new "lines" argument
to the --dirstat option (or the diff.dirstat config variable).

In linux-2.6.git, running the three different --dirstat modes:

  time git diff v2.6.20..v2.6.30 --dirstat=changes > /dev/null
vs.
  time git diff v2.6.20..v2.6.30 --dirstat=lines > /dev/null
vs.
  time git diff v2.6.20..v2.6.30 --dirstat=files > /dev/null

yields the following average runtimes on my machine:

- "changes" (default): ~6.0 s
- "lines":             ~9.6 s
- "files":             ~0.1 s

So, as expected, there's a considerable performance hit (~60%) by going
through the full diffstat analysis as compared to the default "changes"
analysis (obviously, "files" is much faster than both). As such, the
"lines" mode is probably only useful if you really need the --dirstat
numbers to be consistent with the numbers returned from the other
--*stat options.

The patch also includes documentation and tests for the new dirstat mode.

Signed-off-by: Johan Herland <johan@herland.net>
---
 Documentation/config.txt       |    7 +++
 Documentation/diff-options.txt |    7 +++
 diff.c                         |   60 +++++++++++++++++++++++-
 diff.h                         |    1 +
 t/t4046-diff-dirstat.sh        |  100 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 173 insertions(+), 2 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 10fa89a..47b2423 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -837,6 +837,13 @@ diff.dirstat::
 	the amount of pure code movements within a file.  In other words,
 	rearranging lines in a file is not counted as much as other changes.
 	This is the default `--dirstat` behavior.
+`lines`;;
+	Compute the dirstat numbers by doing the regular line-based diff
+	analysis, and summing the removed/added line counts. This is a more
+	expensive `--dirstat` behavior than the `changes` behavior, but it
+	does count rearranged lines within a file as much as other changes.
+	The resulting output is consistent with what you get from the other
+	`--*stat` options.
 `files`;;
 	Compute the dirstat numbers by counting the number of files changed.
 	Each changed file counts equally in the dirstat analysis. This is
diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index b6b1448..0b7417b 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -80,6 +80,13 @@ endif::git-format-patch[]
 	the amount of pure code movements within a file.  In other words,
 	rearranging lines in a file is not counted as much as other changes.
 	This is the default `--dirstat` behavior.
+`lines`;;
+	Compute the dirstat numbers by doing the regular line-based diff
+	analysis, and summing the removed/added line counts. This is a more
+	expensive `--dirstat` behavior than the `changes` behavior, but it
+	does count rearranged lines within a file as much as other changes.
+	The resulting output is consistent with what you get from the other
+	`--*stat` options.
 `files`;;
 	Compute the dirstat numbers by counting the number of files changed.
 	Each changed file counts equally in the dirstat analysis. This is
diff --git a/diff.c b/diff.c
index 4da3b68..c00984f 100644
--- a/diff.c
+++ b/diff.c
@@ -131,10 +131,17 @@ static void dirstat_opt_args(struct diff_options *options, const char *args)
 	while (*p) {
 		if (!prefixcmp(p, "changes")) {
 			p += 7;
+			DIFF_OPT_CLR(options, DIRSTAT_BY_LINE);
+			DIFF_OPT_CLR(options, DIRSTAT_BY_FILE);
+		}
+		else if (!prefixcmp(p, "lines")) {
+			p += 5;
+			DIFF_OPT_SET(options, DIRSTAT_BY_LINE);
 			DIFF_OPT_CLR(options, DIRSTAT_BY_FILE);
 		}
 		else if (!prefixcmp(p, "files")) {
 			p += 5;
+			DIFF_OPT_CLR(options, DIRSTAT_BY_LINE);
 			DIFF_OPT_SET(options, DIRSTAT_BY_FILE);
 		}
 		else if (!prefixcmp(p, "noncumulative")) {
@@ -1677,6 +1684,48 @@ found_damage:
 	gather_dirstat(options, &dir, changed, "", 0);
 }
 
+static void show_dirstat_by_line(struct diffstat_t *data, struct diff_options *options)
+{
+	int i;
+	unsigned long changed;
+	struct dirstat_dir dir;
+
+	if (data->nr == 0)
+		return;
+
+	dir.files = NULL;
+	dir.alloc = 0;
+	dir.nr = 0;
+	dir.percent = options->dirstat_percent;
+	dir.cumulative = DIFF_OPT_TST(options, DIRSTAT_CUMULATIVE);
+
+	changed = 0;
+	for (i = 0; i < data->nr; i++) {
+		struct diffstat_file *file = data->files[i];
+		unsigned long damage = file->added + file->deleted;
+		if (damage && file->is_binary)
+			/*
+			 * binary files counts bytes, not lines. Must find some
+			 * way to normalize binary bytes vs. textual lines.
+			 * The following heuristic is cheap, but beyond ugly...
+			 */
+			damage = damage < 52 ? 1 : damage / 52;
+		ALLOC_GROW(dir.files, dir.nr + 1, dir.alloc);
+		dir.files[dir.nr].name = file->name;
+		dir.files[dir.nr].changed = damage;
+		changed += damage;
+		dir.nr++;
+	}
+
+	/* This can happen even with many files, if everything was renames */
+	if (!changed)
+		return;
+
+	/* Show all directories with more than x% of the changes */
+	qsort(dir.files, dir.nr, sizeof(dir.files[0]), dirstat_compare);
+	gather_dirstat(options, &dir, changed, "", 0);
+}
+
 static void free_diffstat_info(struct diffstat_t *diffstat)
 {
 	int i;
@@ -4081,6 +4130,7 @@ void diff_flush(struct diff_options *options)
 	struct diff_queue_struct *q = &diff_queued_diff;
 	int i, output_format = options->output_format;
 	int separator = 0;
+	int dirstat_by_line = 0;
 
 	/*
 	 * Order: raw, stat, summary, patch
@@ -4101,7 +4151,11 @@ void diff_flush(struct diff_options *options)
 		separator++;
 	}
 
-	if (output_format & (DIFF_FORMAT_DIFFSTAT|DIFF_FORMAT_SHORTSTAT|DIFF_FORMAT_NUMSTAT)) {
+	if (output_format & DIFF_FORMAT_DIRSTAT && DIFF_OPT_TST(options, DIRSTAT_BY_LINE))
+		dirstat_by_line = 1;
+
+	if (output_format & (DIFF_FORMAT_DIFFSTAT|DIFF_FORMAT_SHORTSTAT|DIFF_FORMAT_NUMSTAT) ||
+	    dirstat_by_line) {
 		struct diffstat_t diffstat;
 
 		memset(&diffstat, 0, sizeof(struct diffstat_t));
@@ -4116,10 +4170,12 @@ void diff_flush(struct diff_options *options)
 			show_stats(&diffstat, options);
 		if (output_format & DIFF_FORMAT_SHORTSTAT)
 			show_shortstats(&diffstat, options);
+		if (output_format & DIFF_FORMAT_DIRSTAT)
+			show_dirstat_by_line(&diffstat, options);
 		free_diffstat_info(&diffstat);
 		separator++;
 	}
-	if (output_format & DIFF_FORMAT_DIRSTAT)
+	if ((output_format & DIFF_FORMAT_DIRSTAT) && !dirstat_by_line)
 		show_dirstat(options);
 
 	if (output_format & DIFF_FORMAT_SUMMARY && !is_summary_empty(q)) {
diff --git a/diff.h b/diff.h
index 781c620..5f12049 100644
--- a/diff.h
+++ b/diff.h
@@ -78,6 +78,7 @@ typedef struct strbuf *(*diff_prefix_fn_t)(struct diff_options *opt, void *data)
 #define DIFF_OPT_IGNORE_UNTRACKED_IN_SUBMODULES (1 << 25)
 #define DIFF_OPT_IGNORE_DIRTY_SUBMODULES (1 << 26)
 #define DIFF_OPT_OVERRIDE_SUBMODULE_CONFIG (1 << 27)
+#define DIFF_OPT_DIRSTAT_BY_LINE     (1 << 28)
 
 #define DIFF_OPT_TST(opts, flag)    ((opts)->flags & DIFF_OPT_##flag)
 #define DIFF_OPT_SET(opts, flag)    ((opts)->flags |= DIFF_OPT_##flag)
diff --git a/t/t4046-diff-dirstat.sh b/t/t4046-diff-dirstat.sh
index da4484c..ef9326a 100755
--- a/t/t4046-diff-dirstat.sh
+++ b/t/t4046-diff-dirstat.sh
@@ -770,4 +770,104 @@ test_expect_success 'diff.dirstat=16.7,cumulative,files' '
 	test_cmp expect_diff_dirstat_CC actual_diff_dirstat_CC
 '
 
+cat <<EOF >expect_diff_dirstat
+  10.6% dst/copy/changed/
+  10.6% dst/copy/rearranged/
+  10.6% dst/copy/unchanged/
+  10.6% dst/move/changed/
+  10.6% dst/move/rearranged/
+  10.6% dst/move/unchanged/
+  10.6% src/move/changed/
+  10.6% src/move/rearranged/
+  10.6% src/move/unchanged/
+EOF
+
+cat <<EOF >expect_diff_dirstat_M
+   5.3% changed/
+  26.3% dst/copy/changed/
+  26.3% dst/copy/rearranged/
+  26.3% dst/copy/unchanged/
+   5.3% dst/move/changed/
+   5.3% dst/move/rearranged/
+   5.3% rearranged/
+EOF
+
+cat <<EOF >expect_diff_dirstat_CC
+  16.7% changed/
+  16.7% dst/copy/changed/
+  16.7% dst/copy/rearranged/
+  16.7% dst/move/changed/
+  16.7% dst/move/rearranged/
+  16.7% rearranged/
+EOF
+
+test_expect_success '--dirstat=lines' '
+	git diff --dirstat=lines HEAD^..HEAD >actual_diff_dirstat &&
+	test_cmp expect_diff_dirstat actual_diff_dirstat &&
+	git diff --dirstat=lines -M HEAD^..HEAD >actual_diff_dirstat_M &&
+	test_cmp expect_diff_dirstat_M actual_diff_dirstat_M &&
+	git diff --dirstat=lines -C -C HEAD^..HEAD >actual_diff_dirstat_CC &&
+	test_cmp expect_diff_dirstat_CC actual_diff_dirstat_CC
+'
+
+test_expect_success 'diff.dirstat=lines' '
+	git -c diff.dirstat=lines diff --dirstat HEAD^..HEAD >actual_diff_dirstat &&
+	test_cmp expect_diff_dirstat actual_diff_dirstat &&
+	git -c diff.dirstat=lines diff --dirstat -M HEAD^..HEAD >actual_diff_dirstat_M &&
+	test_cmp expect_diff_dirstat_M actual_diff_dirstat_M &&
+	git -c diff.dirstat=lines diff --dirstat -C -C HEAD^..HEAD >actual_diff_dirstat_CC &&
+	test_cmp expect_diff_dirstat_CC actual_diff_dirstat_CC
+'
+
+cat <<EOF >expect_diff_dirstat
+   2.1% changed/
+  10.6% dst/copy/changed/
+  10.6% dst/copy/rearranged/
+  10.6% dst/copy/unchanged/
+  10.6% dst/move/changed/
+  10.6% dst/move/rearranged/
+  10.6% dst/move/unchanged/
+   2.1% rearranged/
+  10.6% src/move/changed/
+  10.6% src/move/rearranged/
+  10.6% src/move/unchanged/
+EOF
+
+cat <<EOF >expect_diff_dirstat_M
+   5.3% changed/
+  26.3% dst/copy/changed/
+  26.3% dst/copy/rearranged/
+  26.3% dst/copy/unchanged/
+   5.3% dst/move/changed/
+   5.3% dst/move/rearranged/
+   5.3% rearranged/
+EOF
+
+cat <<EOF >expect_diff_dirstat_CC
+  16.7% changed/
+  16.7% dst/copy/changed/
+  16.7% dst/copy/rearranged/
+  16.7% dst/move/changed/
+  16.7% dst/move/rearranged/
+  16.7% rearranged/
+EOF
+
+test_expect_success '--dirstat=lines,0' '
+	git diff --dirstat=lines,0 HEAD^..HEAD >actual_diff_dirstat &&
+	test_cmp expect_diff_dirstat actual_diff_dirstat &&
+	git diff --dirstat=lines,0 -M HEAD^..HEAD >actual_diff_dirstat_M &&
+	test_cmp expect_diff_dirstat_M actual_diff_dirstat_M &&
+	git diff --dirstat=lines,0 -C -C HEAD^..HEAD >actual_diff_dirstat_CC &&
+	test_cmp expect_diff_dirstat_CC actual_diff_dirstat_CC
+'
+
+test_expect_success 'diff.dirstat=0,lines' '
+	git -c diff.dirstat=0,lines diff --dirstat HEAD^..HEAD >actual_diff_dirstat &&
+	test_cmp expect_diff_dirstat actual_diff_dirstat &&
+	git -c diff.dirstat=0,lines diff --dirstat -M HEAD^..HEAD >actual_diff_dirstat_M &&
+	test_cmp expect_diff_dirstat_M actual_diff_dirstat_M &&
+	git -c diff.dirstat=0,lines diff --dirstat -C -C HEAD^..HEAD >actual_diff_dirstat_CC &&
+	test_cmp expect_diff_dirstat_CC actual_diff_dirstat_CC
+'
+
 test_done
-- 
1.7.5.rc1.3.g4d7b

  parent reply	other threads:[~2011-04-26  0:02 UTC|newest]

Thread overview: 91+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-04-07 13:49 BUG? in --dirstat when rearranging lines in a file Johan Herland
2011-04-07 14:56 ` Linus Torvalds
2011-04-07 22:43   ` Junio C Hamano
2011-04-07 22:59     ` Linus Torvalds
2011-04-08 14:46   ` Johan Herland
2011-04-08 14:48     ` [PATCH 1/3] --dirstat: Document shortcomings compared to --stat or regular diff Johan Herland
2011-04-08 19:50       ` Junio C Hamano
2011-04-08 14:50     ` [PATCH 2/3] --dirstat-by-file: Make it faster and more correct Johan Herland
2011-04-08 14:55     ` [RFC/PATCH 3/3] Teach --dirstat to not completely ignore rearranged lines Johan Herland
2011-04-08 15:04     ` BUG? in --dirstat when rearranging lines in a file Linus Torvalds
2011-04-08 19:56       ` Junio C Hamano
2011-04-10 22:48         ` [PATCHv2 0/3] --dirstat fixes Johan Herland
2011-04-10 22:48           ` [PATCHv2 1/3] --dirstat: Describe non-obvious differences relative to --stat or regular diff Johan Herland
2011-04-10 22:48           ` [PATCHv2 2/3] --dirstat-by-file: Make it faster and more correct Johan Herland
2011-04-11 18:14             ` Junio C Hamano
2011-04-10 22:48           ` [PATCHv2 3/3] Teach --dirstat to not completely ignore rearranged lines within a file Johan Herland
2011-04-11 21:38             ` Junio C Hamano
2011-04-11 21:56               ` Johan Herland
2011-04-11 22:08                 ` Junio C Hamano
2011-04-12  9:22                   ` Johan Herland
2011-04-12  9:24                     ` [PATCH 4/3] --dirstat: In case of renames, use target filename instead of source filename Johan Herland
2011-04-12 14:59                       ` Linus Torvalds
2011-04-12  9:26                     ` [RFC/PATCH 5/3] Alternative --dirstat implementation, based on diffstat analysis Johan Herland
2011-04-12 14:46                       ` Linus Torvalds
2011-04-12 15:08                         ` Linus Torvalds
2011-04-12 22:03                           ` Johan Herland
2011-04-12 22:12                             ` Linus Torvalds
2011-04-12 22:22                             ` Junio C Hamano
2011-04-26  0:01                         ` [PATCH 0/6] --dirstat fixes, part 2 Johan Herland
2011-04-26  0:01                           ` [PATCH 1/6] Add several testcases for --dirstat and friends Johan Herland
2011-04-26  0:01                           ` [PATCH 2/6] Make --dirstat=0 output directories that contribute < 0.1% of changes Johan Herland
2011-04-26  0:01                           ` [PATCH 3/6] Refactor --dirstat parsing; deprecate --cumulative and --dirstat-by-file Johan Herland
2011-04-26 16:36                             ` Junio C Hamano
2011-04-27  2:02                               ` Johan Herland
2011-04-27  4:53                                 ` Junio C Hamano
2011-04-27 20:51                                 ` Junio C Hamano
2011-04-27 21:01                                   ` Junio C Hamano
2011-04-26  0:01                           ` [PATCH 4/6] Add config variable for specifying default --dirstat behavior Johan Herland
2011-04-26 16:43                             ` Junio C Hamano
2011-04-27  2:02                               ` Johan Herland
2011-04-26  0:01                           ` [PATCH 5/6] Use floating point for --dirstat percentages Johan Herland
2011-04-26 16:52                             ` Junio C Hamano
2011-04-27  2:02                               ` Johan Herland
2011-04-27  4:42                                 ` Junio C Hamano
2011-04-27  4:53                                   ` Linus Torvalds
2011-04-27  5:20                                     ` Junio C Hamano
2011-04-26  0:01                           ` Johan Herland [this message]
2011-04-26 16:59                             ` [PATCH 6/6] New --dirstat=lines mode, doing dirstat analysis based on diffstat Junio C Hamano
2011-04-27  2:02                               ` Johan Herland
2011-04-26  0:15                           ` [PATCH 0/6] --dirstat fixes, part 2 Linus Torvalds
2011-04-27  2:12                           ` [PATCHv2 " Johan Herland
2011-04-27  2:12                             ` [PATCHv2 1/6] Add several testcases for --dirstat and friends Johan Herland
2011-04-27  2:12                             ` [PATCHv2 2/6] Make --dirstat=0 output directories that contribute < 0.1% of changes Johan Herland
2011-04-27  2:12                             ` [PATCHv2 3/6] Refactor --dirstat parsing; deprecate --cumulative and --dirstat-by-file Johan Herland
2011-04-27  2:12                             ` [PATCHv2 4/6] Add config variable for specifying default --dirstat behavior Johan Herland
2011-04-27  2:12                             ` [PATCHv2 5/6] Use floating point for --dirstat percentages Johan Herland
2011-04-27  2:45                               ` Linus Torvalds
2011-04-27  2:12                             ` [PATCHv2 6/6] New --dirstat=lines mode, doing dirstat analysis based on diffstat Johan Herland
2011-04-27  8:24                             ` [PATCHv3 0/6] --dirstat fixes, part 2 Johan Herland
2011-04-27  8:24                               ` [PATCHv3 1/6] Add several testcases for --dirstat and friends Johan Herland
2011-04-27  8:24                               ` [PATCHv3 2/6] Make --dirstat=0 output directories that contribute < 0.1% of changes Johan Herland
2011-04-27  8:24                               ` [PATCHv3 3/6] Refactor --dirstat parsing; deprecate --cumulative and --dirstat-by-file Johan Herland
2011-04-27  8:24                               ` [PATCHv3 4/6] Add config variable for specifying default --dirstat behavior Johan Herland
2011-04-27  8:24                               ` [PATCHv3 5/6] Allow specifying --dirstat cut-off percentage as a floating point number Johan Herland
2011-04-27  8:37                                 ` Linus Torvalds
2011-04-27 10:29                                   ` [PATCHv4 " Johan Herland
2011-04-27  8:24                               ` [PATCHv3 6/6] New --dirstat=lines mode, doing dirstat analysis based on diffstat Johan Herland
2011-04-28  1:17                               ` [PATCHv5 0/7] --dirstat fixes, part 2 Johan Herland
2011-04-28  1:17                                 ` [PATCHv5 1/7] Add several testcases for --dirstat and friends Johan Herland
2011-04-28  1:17                                 ` [PATCHv5 2/7] Make --dirstat=0 output directories that contribute < 0.1% of changes Johan Herland
2011-04-28  1:17                                 ` [PATCHv5 3/7] Refactor --dirstat parsing; deprecate --cumulative and --dirstat-by-file Johan Herland
2011-04-28  1:17                                 ` [PATCHv5 4/7] Add config variable for specifying default --dirstat behavior Johan Herland
2011-04-28  1:17                                 ` [PATCHv5 5/7] Allow specifying --dirstat cut-off percentage as a floating point number Johan Herland
2011-04-28  1:17                                 ` [PATCHv5 6/7] New --dirstat=lines mode, doing dirstat analysis based on diffstat Johan Herland
2011-04-28  1:17                                 ` [PATCHv5 7/7] Improve error handling when parsing dirstat parameters Johan Herland
2011-04-28 18:41                                   ` Junio C Hamano
2011-04-28 19:20                                     ` Junio C Hamano
2011-04-28 23:16                                       ` Johan Herland
2011-04-28 23:13                                     ` Johan Herland
2011-04-29  4:06                                       ` Junio C Hamano
2011-04-29  9:36                                         ` [PATCHv6 0/8] --dirstat fixes, part 2 Johan Herland
2011-04-29  9:36                                           ` [PATCHv6 1/8] Add several testcases for --dirstat and friends Johan Herland
2011-04-29  9:36                                           ` [PATCHv6 2/8] Make --dirstat=0 output directories that contribute < 0.1% of changes Johan Herland
2011-04-29  9:36                                           ` [PATCHv6 3/8] Refactor --dirstat parsing; deprecate --cumulative and --dirstat-by-file Johan Herland
2011-04-29  9:36                                           ` [PATCHv6 4/8] Add config variable for specifying default --dirstat behavior Johan Herland
2011-04-29  9:36                                           ` [PATCHv6 5/8] Allow specifying --dirstat cut-off percentage as a floating point number Johan Herland
2011-04-29  9:36                                           ` [PATCHv6 6/8] New --dirstat=lines mode, doing dirstat analysis based on diffstat Johan Herland
2011-04-29  9:36                                           ` [PATCHv6 7/8] Improve error handling when parsing dirstat parameters Johan Herland
2011-04-29  9:36                                           ` [PATCHv6 8/8] Mark dirstat error messages for translation Johan Herland
2011-04-12 18:34                       ` [RFC/PATCH 5/3] Alternative --dirstat implementation, based on diffstat analysis Junio C Hamano
2011-04-10 23:17           ` [PATCHv2 0/3] --dirstat fixes Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1303776102-9085-7-git-send-email-johan@herland.net \
    --to=johan@herland.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).