git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Subject: [RFC/PATCH] add diff-pairs tool
Date: Thu, 1 Dec 2016 15:40:43 -0500	[thread overview]
Message-ID: <20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net> (raw)

This is a small tool I cooked up for splitting the tree-diff out from
the generation of the patch text, and doing them in separate processes.
It's a little more expensive than doing it all in one process (besides
the process/pipe overhead, we may load blob content in the tree diff for
rename detection, and then again to do the actual patch). But it gives a
lot of flexibility when handling large diffs. You can save the tree-diff
and then feed it back in chunks to generate the actual patches, and do
so across many invocations (if you had, say, a website that showed diffs
and wanted to show small bits of them and let the user progressively
click through to see more).

Other things I tried but didn't work:

  - feeding two blobs to git-diff doesn't use plumbing, and loses
    information about filenames, etc. It also requires one process per
    pair.

  - feeding pathspecs back to git-diff-tree, like:

       git diff-tree --name-only $a $b |
       while read filename; do
         git diff-tree $a $b -- $filename
       done

    mostly works, but loses rename information (and is also slightly
    less efficient, as it has to walk the trees again).

So instead, I made a tool that takes the pairs generated by "diff-tree
--raw" and formats them as diff-tree would have if it had been done
in-process.

It's rather elegant, I think. But the main reason this is RFC is that
I'm not sure it's actually _useful_ to other people. It's a pretty
narrow use case.

The other reason it's an RFC is that it has a few rough edges. It works
well for my narrow use case, but there are cases where it probably
doesn't:

  - it expects "diff-tree -z" input, which keeps parsing simple. It
    probably wouldn't be too hard to have it handle the normal output,
    too.

  - it should handle submodules OK, but I don't know what it would do if
    you fed it things like dirty submodules from diff-files output.

  - some option combinations between the initial tree-diff and the
    followup diff do not make sense. For instance, if you do a
    non-recursive tree-diff (or one with "-t"), you'll have tree entries
    in your --raw output. Feeding that to "diff-pairs -p" will silently
    ignore the tree entries, because there's no way to represent them as
    a patch.

    Mostly these fall under "well, don't do that", but possibly the tool
    could be more aggressive about complaining, or coming up with some
    sensible behavior.

So anyway. Consider this a rough cut. What I'm really looking for at
this stage is whether anybody is interested enough in this for me to
bother polishing it.

-- >8 --
This takes the output of `diff-tree -z --raw` and feeds it
back to the later stages of the diff machinery to produce
diffs in other formats. Because the interim format contains
any whole-tree copy/rename information, you can safely feed
segments of the tree diff to get progressive patch-format
diffs. So something like:

  git diff-tree -r -z $a $b |
  git diff-pairs -p

should give you the same output that `git diff-tree -p`
would have.  Likewise, feeding each pair individually works,
too:

  git diff-tree -r -z -M $a $b |
  perl -0ne '
	my $meta = $_;
	my $path = <>;
	# only renames have an extra path
	my $path2 = <> if $meta =~ /[RC]\d+/;

	print STDERR "feeding one diff\n";
	open(my $fh, "|git diff-pairs -p");
	print $fh $meta, $path, $path2;
  '

The renames will still be shown just as if the diff had been
done in one process.

Signed-off-by: Jeff King <peff@peff.net>
---
 .gitignore                       |   1 +
 Documentation/git-diff-pairs.txt |  60 +++++++++++++++
 Makefile                         |   1 +
 builtin.h                        |   1 +
 builtin/diff-pairs.c             | 162 +++++++++++++++++++++++++++++++++++++++
 git.c                            |   1 +
 t/t4063-diff-pairs.sh            |  82 ++++++++++++++++++++
 7 files changed, 308 insertions(+)
 create mode 100644 Documentation/git-diff-pairs.txt
 create mode 100644 builtin/diff-pairs.c
 create mode 100755 t/t4063-diff-pairs.sh

diff --git a/.gitignore b/.gitignore
index 05cb58a3d..805751f2e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -48,6 +48,7 @@
 /git-diff
 /git-diff-files
 /git-diff-index
+/git-diff-pairs
 /git-diff-tree
 /git-difftool
 /git-difftool--helper
diff --git a/Documentation/git-diff-pairs.txt b/Documentation/git-diff-pairs.txt
new file mode 100644
index 000000000..4e3b6c909
--- /dev/null
+++ b/Documentation/git-diff-pairs.txt
@@ -0,0 +1,60 @@
+git-diff-pairs(1)
+=================
+
+NAME
+----
+git-diff-pairs - Compare blob pairs generated by `diff-tree --raw`
+
+SYNOPSIS
+--------
+[verse]
+'git diff-pairs' [diff_format_options]
+
+DESCRIPTION
+-----------
+
+Given the output of `diff-tree -z` on its stdin, `diff-pairs` will
+reformat that output into whatever format is requested on its command
+line.  For example:
+
+-----------------------------
+git diff-tree -z -M $a $b |
+git diff-pairs -p
+-----------------------------
+
+will compute the tree diff in one step (including renames), and then
+`diff-pairs` will compute and format the blob-level diffs for each pair.
+This can be used to modify the raw diff in the middle (without having to
+parse or re-create more complicated formats like `--patch`), or to
+compute diffs progressively over the course of multiple invocations of
+`diff-pairs`.
+
+Each blob pair is fed to the diff machinery individually and the output
+flushed immediately, meaning it is safe to interactively read and write
+from `diff-pairs`.
+
+OPTIONS
+-------
+
+All diff options below are accepted, but note that tree-wide options
+like `-M` are effectively noops, as we consider only one pair at a time.
+
+include::diff-options.txt[]
+
+BUGS
+----
+
+`diff-pairs` should handle any input generated by `diff-tree --raw -z`.
+It may choke or otherwise misbehave on output from `diff-files`, etc.
+
+Here's an incomplete list of things that `diff-pairs` could do, but
+doesn't (mostly in the name of simplicity):
+
+ - Only `-z` input is accepted, not normal `--raw` input.
+
+ - Abbreviated sha1s are rejected in the input from `diff-tree`; if you
+   want to abbreviate the output, you can pass `--abbrev` to
+   `diff-pairs`.
+
+ - Pathspecs are not handled by `diff-pairs`; you can limit the diff via
+   the initial `diff-tree` invocation.
diff --git a/Makefile b/Makefile
index f53fcc90d..c7ee3cd61 100644
--- a/Makefile
+++ b/Makefile
@@ -886,6 +886,7 @@ BUILTIN_OBJS += builtin/credential.o
 BUILTIN_OBJS += builtin/describe.o
 BUILTIN_OBJS += builtin/diff-files.o
 BUILTIN_OBJS += builtin/diff-index.o
+BUILTIN_OBJS += builtin/diff-pairs.o
 BUILTIN_OBJS += builtin/diff-tree.o
 BUILTIN_OBJS += builtin/diff.o
 BUILTIN_OBJS += builtin/fast-export.o
diff --git a/builtin.h b/builtin.h
index b9122bc5f..f62134ec1 100644
--- a/builtin.h
+++ b/builtin.h
@@ -59,6 +59,7 @@ extern int cmd_describe(int argc, const char **argv, const char *prefix);
 extern int cmd_diff_files(int argc, const char **argv, const char *prefix);
 extern int cmd_diff_index(int argc, const char **argv, const char *prefix);
 extern int cmd_diff(int argc, const char **argv, const char *prefix);
+extern int cmd_diff_pairs(int argc, const char **argv, const char *prefix);
 extern int cmd_diff_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_fast_export(int argc, const char **argv, const char *prefix);
 extern int cmd_fetch(int argc, const char **argv, const char *prefix);
diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c
new file mode 100644
index 000000000..f80c49fef
--- /dev/null
+++ b/builtin/diff-pairs.c
@@ -0,0 +1,162 @@
+#include "cache.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "revision.h"
+
+static const char diff_pairs_usage[] =
+"git diff-pairs [diff-options]\n"
+"\n"
+"Reads pairs of blobs from stdin in 'diff-tree -z' syntax:\n"
+"\n"
+"  :<mode_a> <mode_b> <sha1_a> <sha1_b> <type>\\0<path>\0[path2\0]\n"
+"\n"
+"and outputs the diff for each a/b pair to stdout.";
+
+static unsigned parse_mode(const char *mode, char **endp)
+{
+	unsigned long ret;
+
+	errno = 0;
+	ret = strtoul(mode, endp, 8);
+	if (errno || *endp == mode || *(*endp)++ != ' ' || (unsigned)ret != ret)
+		die("unable to parse mode: %s", mode);
+	return ret;
+}
+
+static void parse_oid(char *p, char **endp, struct object_id *oid)
+{
+	*endp = p + GIT_SHA1_HEXSZ;
+	if (get_oid_hex(p, oid) || *(*endp)++ != ' ')
+		die("unable to parse object id: %s", p);
+}
+
+static unsigned short parse_score(char *score)
+{
+	unsigned long ret;
+	char *endp;
+
+	errno = 0;
+	ret = strtoul(score, &endp, 10);
+	ret *= MAX_SCORE / 100;
+	if (errno || endp == score || *endp || (unsigned short)ret != ret)
+		die("unable to parse rename/copy score: %s", score);
+	return ret;
+}
+
+/*
+ * The pair-creation is mostly done by diff_change and diff_addremove,
+ * which queue the filepair without returning it. So we have to resort
+ * to pulling it out of the global diff queue.
+ */
+static void set_pair_status(char status)
+{
+	/*
+	 * If we have no items in the queue, for some reason the pair wasn't
+	 * worth queueing. This generally shouldn't happen (since it means
+	 * dropping some parts of the diff), but the user can trigger it with
+	 * things like --ignore-submodules. If they do, the only sensible thing
+	 * is for us to play along and skip it.
+	 */
+	if (!diff_queued_diff.nr)
+		return;
+
+	diff_queued_diff.queue[0]->status = status;
+}
+
+int cmd_diff_pairs(int argc, const char **argv, const char *prefix)
+{
+	struct rev_info revs;
+	struct strbuf meta = STRBUF_INIT;
+	struct strbuf path = STRBUF_INIT;
+	struct strbuf path_dst = STRBUF_INIT;
+
+	init_revisions(&revs, prefix);
+	git_config(git_diff_basic_config, NULL);
+	revs.disable_stdin = 1;
+	argc = setup_revisions(argc, argv, &revs, NULL);
+
+	/* Don't allow pathspecs at all. */
+	if (argc > 1)
+		usage(diff_pairs_usage);
+
+	if (!revs.diffopt.output_format)
+		revs.diffopt.output_format = DIFF_FORMAT_RAW;
+
+	while (1) {
+		unsigned mode_a, mode_b;
+		struct object_id oid_a, oid_b;
+		char status;
+		char *p;
+
+		if (strbuf_getline_nul(&meta, stdin) == EOF)
+			break;
+
+		p = meta.buf;
+		if (*p == ':')
+			p++;
+
+		mode_a = parse_mode(p, &p);
+		mode_b = parse_mode(p, &p);
+
+		parse_oid(p, &p, &oid_a);
+		parse_oid(p, &p, &oid_b);
+
+		status = *p++;
+
+		if (strbuf_getline_nul(&path, stdin) == EOF)
+			die("got EOF while reading path");
+
+		switch (status) {
+		case DIFF_STATUS_ADDED:
+			diff_addremove(&revs.diffopt, '+',
+				       mode_b, oid_b.hash,
+				       1, path.buf, 0);
+			set_pair_status(status);
+			break;
+
+		case DIFF_STATUS_DELETED:
+			diff_addremove(&revs.diffopt, '-',
+				       mode_a, oid_a.hash,
+				       1, path.buf, 0);
+			set_pair_status(status);
+			break;
+
+		case DIFF_STATUS_TYPE_CHANGED:
+		case DIFF_STATUS_MODIFIED:
+			diff_change(&revs.diffopt,
+				    mode_a, mode_b,
+				    oid_a.hash, oid_b.hash,
+				    1, 1, path.buf, 0, 0);
+			set_pair_status(status);
+			break;
+
+		case DIFF_STATUS_RENAMED:
+		case DIFF_STATUS_COPIED:
+			{
+				struct diff_filespec *a, *b;
+				struct diff_filepair *pair;
+
+				if (strbuf_getline_nul(&path_dst, stdin) == EOF)
+					die("got EOF while reading secondary path");
+
+				a = alloc_filespec(path.buf);
+				b = alloc_filespec(path_dst.buf);
+				fill_filespec(a, oid_a.hash, 1, mode_a);
+				fill_filespec(b, oid_b.hash, 1, mode_b);
+
+				pair = diff_queue(&diff_queued_diff, a, b);
+				pair->status = status;
+				pair->score = parse_score(p);
+				pair->renamed_pair = 1;
+			}
+			break;
+
+		default:
+			die("unknown diff status: %c", status);
+		}
+
+		diff_flush(&revs.diffopt);
+	}
+
+	return 0;
+}
diff --git a/git.c b/git.c
index dce529fcb..4063c5366 100644
--- a/git.c
+++ b/git.c
@@ -423,6 +423,7 @@ static struct cmd_struct commands[] = {
 	{ "diff", cmd_diff },
 	{ "diff-files", cmd_diff_files, RUN_SETUP | NEED_WORK_TREE },
 	{ "diff-index", cmd_diff_index, RUN_SETUP },
+	{ "diff-pairs", cmd_diff_pairs, RUN_SETUP },
 	{ "diff-tree", cmd_diff_tree, RUN_SETUP },
 	{ "fast-export", cmd_fast_export, RUN_SETUP },
 	{ "fetch", cmd_fetch, RUN_SETUP },
diff --git a/t/t4063-diff-pairs.sh b/t/t4063-diff-pairs.sh
new file mode 100755
index 000000000..a99ec6655
--- /dev/null
+++ b/t/t4063-diff-pairs.sh
@@ -0,0 +1,82 @@
+#!/bin/sh
+
+test_description='basic diff-pairs tests'
+. ./test-lib.sh
+
+# This creates a diff with added, modified, deleted, renamed, copied, and
+# typechange entries. That includes one in a subdirectory for non-recursive
+# tests, and both exact and inexact similarity scores.
+test_expect_success 'create commit with various diffs' '
+	echo to-be-gone >deleted &&
+	echo original >modified &&
+	echo now-a-file >symlink &&
+	test_seq 200 >two-hundred &&
+	test_seq 201 500 >five-hundred &&
+	git add . &&
+	test_tick &&
+	git commit -m base &&
+	git tag base &&
+
+	echo now-here >added &&
+	echo new >modified &&
+	rm deleted &&
+	mkdir subdir &&
+	echo content >subdir/file &&
+	mv two-hundred renamed &&
+	test_seq 201 500 | sed s/300/modified/ >copied &&
+	rm symlink &&
+	git add -A . &&
+	test_ln_s_add dest symlink &&
+	test_tick &&
+	git commit -m new &&
+	git tag new
+'
+
+test_expect_success 'diff-pairs recreates --raw' '
+	git diff-tree -r -M -C -C base new >expect &&
+	# note that diff-pairs uses the default abbrev,
+	# so we must tweak that for identical output
+	git diff-tree -r -M -C -C -z base new |
+	git diff-pairs --no-abbrev >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'diff-pairs can create -p output' '
+	git diff-tree -p -M -C -C base new >expect &&
+	git diff-tree -r -M -C -C -z base new |
+	git diff-pairs -p >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'non-recursive --raw retains tree entry' '
+	git diff-tree base new >expect &&
+	git diff-tree -z base new |
+	git diff-pairs --no-abbrev >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'split input across multiple diff-pairs' '
+	write_script split-raw-diff "$PERL_PATH" <<-\EOF &&
+	$/ = "\0";
+	while (<>) {
+	  my $meta = $_;
+	  my $path = <>;
+	  # renames have an extra path
+	  my $path2 = <> if $meta =~ /[RC]\d+/;
+
+	  open(my $fh, ">", sprintf "diff%03d", $.);
+	  print $fh $meta, $path, $path2;
+	}
+	EOF
+
+	git diff-tree -p -M -C -C base new >expect &&
+
+	git diff-tree -r -z -M -C -C base new |
+	./split-raw-diff &&
+	for i in diff*; do
+		git diff-pairs -p <$i
+	done >actual &&
+	test_cmp expect actual
+'
+
+test_done
-- 
2.11.0.319.g1f4e1e0

             reply	other threads:[~2016-12-01 20:40 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-12-01 20:40 Jeff King [this message]
2016-12-01 20:52 ` [RFC/PATCH] add diff-pairs tool Junio C Hamano
2016-12-01 20:55   ` Jeff King
2016-12-01 20:58     ` Junio C Hamano
2016-12-01 21:04       ` Junio C Hamano
2016-12-01 21:30       ` Jeff King
2016-12-01 22:22         ` Junio C Hamano
2016-12-01 22:30           ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).