From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Subject: [RFC/PATCH] add diff-pairs tool
Date: Thu, 1 Dec 2016 15:40:43 -0500 [thread overview]
Message-ID: <20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net> (raw)
This is a small tool I cooked up for splitting the tree-diff out from
the generation of the patch text, and doing them in separate processes.
It's a little more expensive than doing it all in one process (besides
the process/pipe overhead, we may load blob content in the tree diff for
rename detection, and then again to do the actual patch). But it gives a
lot of flexibility when handling large diffs. You can save the tree-diff
and then feed it back in chunks to generate the actual patches, and do
so across many invocations (if you had, say, a website that showed diffs
and wanted to show small bits of them and let the user progressively
click through to see more).
Other things I tried but didn't work:
- feeding two blobs to git-diff doesn't use plumbing, and loses
information about filenames, etc. It also requires one process per
pair.
- feeding pathspecs back to git-diff-tree, like:
git diff-tree --name-only $a $b |
while read filename; do
git diff-tree $a $b -- $filename
done
mostly works, but loses rename information (and is also slightly
less efficient, as it has to walk the trees again).
So instead, I made a tool that takes the pairs generated by "diff-tree
--raw" and formats them as diff-tree would have if it had been done
in-process.
It's rather elegant, I think. But the main reason this is RFC is that
I'm not sure it's actually _useful_ to other people. It's a pretty
narrow use case.
The other reason it's an RFC is that it has a few rough edges. It works
well for my narrow use case, but there are cases where it probably
doesn't:
- it expects "diff-tree -z" input, which keeps parsing simple. It
probably wouldn't be too hard to have it handle the normal output,
too.
- it should handle submodules OK, but I don't know what it would do if
you fed it things like dirty submodules from diff-files output.
- some option combinations between the initial tree-diff and the
followup diff do not make sense. For instance, if you do a
non-recursive tree-diff (or one with "-t"), you'll have tree entries
in your --raw output. Feeding that to "diff-pairs -p" will silently
ignore the tree entries, because there's no way to represent them as
a patch.
Mostly these fall under "well, don't do that", but possibly the tool
could be more aggressive about complaining, or coming up with some
sensible behavior.
So anyway. Consider this a rough cut. What I'm really looking for at
this stage is whether anybody is interested enough in this for me to
bother polishing it.
-- >8 --
This takes the output of `diff-tree -z --raw` and feeds it
back to the later stages of the diff machinery to produce
diffs in other formats. Because the interim format contains
any whole-tree copy/rename information, you can safely feed
segments of the tree diff to get progressive patch-format
diffs. So something like:
git diff-tree -r -z $a $b |
git diff-pairs -p
should give you the same output that `git diff-tree -p`
would have. Likewise, feeding each pair individually works,
too:
git diff-tree -r -z -M $a $b |
perl -0ne '
my $meta = $_;
my $path = <>;
# only renames have an extra path
my $path2 = <> if $meta =~ /[RC]\d+/;
print STDERR "feeding one diff\n";
open(my $fh, "|git diff-pairs -p");
print $fh $meta, $path, $path2;
'
The renames will still be shown just as if the diff had been
done in one process.
Signed-off-by: Jeff King <peff@peff.net>
---
.gitignore | 1 +
Documentation/git-diff-pairs.txt | 60 +++++++++++++++
Makefile | 1 +
builtin.h | 1 +
builtin/diff-pairs.c | 162 +++++++++++++++++++++++++++++++++++++++
git.c | 1 +
t/t4063-diff-pairs.sh | 82 ++++++++++++++++++++
7 files changed, 308 insertions(+)
create mode 100644 Documentation/git-diff-pairs.txt
create mode 100644 builtin/diff-pairs.c
create mode 100755 t/t4063-diff-pairs.sh
diff --git a/.gitignore b/.gitignore
index 05cb58a3d..805751f2e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -48,6 +48,7 @@
/git-diff
/git-diff-files
/git-diff-index
+/git-diff-pairs
/git-diff-tree
/git-difftool
/git-difftool--helper
diff --git a/Documentation/git-diff-pairs.txt b/Documentation/git-diff-pairs.txt
new file mode 100644
index 000000000..4e3b6c909
--- /dev/null
+++ b/Documentation/git-diff-pairs.txt
@@ -0,0 +1,60 @@
+git-diff-pairs(1)
+=================
+
+NAME
+----
+git-diff-pairs - Compare blob pairs generated by `diff-tree --raw`
+
+SYNOPSIS
+--------
+[verse]
+'git diff-pairs' [diff_format_options]
+
+DESCRIPTION
+-----------
+
+Given the output of `diff-tree -z` on its stdin, `diff-pairs` will
+reformat that output into whatever format is requested on its command
+line. For example:
+
+-----------------------------
+git diff-tree -z -M $a $b |
+git diff-pairs -p
+-----------------------------
+
+will compute the tree diff in one step (including renames), and then
+`diff-pairs` will compute and format the blob-level diffs for each pair.
+This can be used to modify the raw diff in the middle (without having to
+parse or re-create more complicated formats like `--patch`), or to
+compute diffs progressively over the course of multiple invocations of
+`diff-pairs`.
+
+Each blob pair is fed to the diff machinery individually and the output
+flushed immediately, meaning it is safe to interactively read and write
+from `diff-pairs`.
+
+OPTIONS
+-------
+
+All diff options below are accepted, but note that tree-wide options
+like `-M` are effectively noops, as we consider only one pair at a time.
+
+include::diff-options.txt[]
+
+BUGS
+----
+
+`diff-pairs` should handle any input generated by `diff-tree --raw -z`.
+It may choke or otherwise misbehave on output from `diff-files`, etc.
+
+Here's an incomplete list of things that `diff-pairs` could do, but
+doesn't (mostly in the name of simplicity):
+
+ - Only `-z` input is accepted, not normal `--raw` input.
+
+ - Abbreviated sha1s are rejected in the input from `diff-tree`; if you
+ want to abbreviate the output, you can pass `--abbrev` to
+ `diff-pairs`.
+
+ - Pathspecs are not handled by `diff-pairs`; you can limit the diff via
+ the initial `diff-tree` invocation.
diff --git a/Makefile b/Makefile
index f53fcc90d..c7ee3cd61 100644
--- a/Makefile
+++ b/Makefile
@@ -886,6 +886,7 @@ BUILTIN_OBJS += builtin/credential.o
BUILTIN_OBJS += builtin/describe.o
BUILTIN_OBJS += builtin/diff-files.o
BUILTIN_OBJS += builtin/diff-index.o
+BUILTIN_OBJS += builtin/diff-pairs.o
BUILTIN_OBJS += builtin/diff-tree.o
BUILTIN_OBJS += builtin/diff.o
BUILTIN_OBJS += builtin/fast-export.o
diff --git a/builtin.h b/builtin.h
index b9122bc5f..f62134ec1 100644
--- a/builtin.h
+++ b/builtin.h
@@ -59,6 +59,7 @@ extern int cmd_describe(int argc, const char **argv, const char *prefix);
extern int cmd_diff_files(int argc, const char **argv, const char *prefix);
extern int cmd_diff_index(int argc, const char **argv, const char *prefix);
extern int cmd_diff(int argc, const char **argv, const char *prefix);
+extern int cmd_diff_pairs(int argc, const char **argv, const char *prefix);
extern int cmd_diff_tree(int argc, const char **argv, const char *prefix);
extern int cmd_fast_export(int argc, const char **argv, const char *prefix);
extern int cmd_fetch(int argc, const char **argv, const char *prefix);
diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c
new file mode 100644
index 000000000..f80c49fef
--- /dev/null
+++ b/builtin/diff-pairs.c
@@ -0,0 +1,162 @@
+#include "cache.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "revision.h"
+
+static const char diff_pairs_usage[] =
+"git diff-pairs [diff-options]\n"
+"\n"
+"Reads pairs of blobs from stdin in 'diff-tree -z' syntax:\n"
+"\n"
+" :<mode_a> <mode_b> <sha1_a> <sha1_b> <type>\\0<path>\0[path2\0]\n"
+"\n"
+"and outputs the diff for each a/b pair to stdout.";
+
+static unsigned parse_mode(const char *mode, char **endp)
+{
+ unsigned long ret;
+
+ errno = 0;
+ ret = strtoul(mode, endp, 8);
+ if (errno || *endp == mode || *(*endp)++ != ' ' || (unsigned)ret != ret)
+ die("unable to parse mode: %s", mode);
+ return ret;
+}
+
+static void parse_oid(char *p, char **endp, struct object_id *oid)
+{
+ *endp = p + GIT_SHA1_HEXSZ;
+ if (get_oid_hex(p, oid) || *(*endp)++ != ' ')
+ die("unable to parse object id: %s", p);
+}
+
+static unsigned short parse_score(char *score)
+{
+ unsigned long ret;
+ char *endp;
+
+ errno = 0;
+ ret = strtoul(score, &endp, 10);
+ ret *= MAX_SCORE / 100;
+ if (errno || endp == score || *endp || (unsigned short)ret != ret)
+ die("unable to parse rename/copy score: %s", score);
+ return ret;
+}
+
+/*
+ * The pair-creation is mostly done by diff_change and diff_addremove,
+ * which queue the filepair without returning it. So we have to resort
+ * to pulling it out of the global diff queue.
+ */
+static void set_pair_status(char status)
+{
+ /*
+ * If we have no items in the queue, for some reason the pair wasn't
+ * worth queueing. This generally shouldn't happen (since it means
+ * dropping some parts of the diff), but the user can trigger it with
+ * things like --ignore-submodules. If they do, the only sensible thing
+ * is for us to play along and skip it.
+ */
+ if (!diff_queued_diff.nr)
+ return;
+
+ diff_queued_diff.queue[0]->status = status;
+}
+
+int cmd_diff_pairs(int argc, const char **argv, const char *prefix)
+{
+ struct rev_info revs;
+ struct strbuf meta = STRBUF_INIT;
+ struct strbuf path = STRBUF_INIT;
+ struct strbuf path_dst = STRBUF_INIT;
+
+ init_revisions(&revs, prefix);
+ git_config(git_diff_basic_config, NULL);
+ revs.disable_stdin = 1;
+ argc = setup_revisions(argc, argv, &revs, NULL);
+
+ /* Don't allow pathspecs at all. */
+ if (argc > 1)
+ usage(diff_pairs_usage);
+
+ if (!revs.diffopt.output_format)
+ revs.diffopt.output_format = DIFF_FORMAT_RAW;
+
+ while (1) {
+ unsigned mode_a, mode_b;
+ struct object_id oid_a, oid_b;
+ char status;
+ char *p;
+
+ if (strbuf_getline_nul(&meta, stdin) == EOF)
+ break;
+
+ p = meta.buf;
+ if (*p == ':')
+ p++;
+
+ mode_a = parse_mode(p, &p);
+ mode_b = parse_mode(p, &p);
+
+ parse_oid(p, &p, &oid_a);
+ parse_oid(p, &p, &oid_b);
+
+ status = *p++;
+
+ if (strbuf_getline_nul(&path, stdin) == EOF)
+ die("got EOF while reading path");
+
+ switch (status) {
+ case DIFF_STATUS_ADDED:
+ diff_addremove(&revs.diffopt, '+',
+ mode_b, oid_b.hash,
+ 1, path.buf, 0);
+ set_pair_status(status);
+ break;
+
+ case DIFF_STATUS_DELETED:
+ diff_addremove(&revs.diffopt, '-',
+ mode_a, oid_a.hash,
+ 1, path.buf, 0);
+ set_pair_status(status);
+ break;
+
+ case DIFF_STATUS_TYPE_CHANGED:
+ case DIFF_STATUS_MODIFIED:
+ diff_change(&revs.diffopt,
+ mode_a, mode_b,
+ oid_a.hash, oid_b.hash,
+ 1, 1, path.buf, 0, 0);
+ set_pair_status(status);
+ break;
+
+ case DIFF_STATUS_RENAMED:
+ case DIFF_STATUS_COPIED:
+ {
+ struct diff_filespec *a, *b;
+ struct diff_filepair *pair;
+
+ if (strbuf_getline_nul(&path_dst, stdin) == EOF)
+ die("got EOF while reading secondary path");
+
+ a = alloc_filespec(path.buf);
+ b = alloc_filespec(path_dst.buf);
+ fill_filespec(a, oid_a.hash, 1, mode_a);
+ fill_filespec(b, oid_b.hash, 1, mode_b);
+
+ pair = diff_queue(&diff_queued_diff, a, b);
+ pair->status = status;
+ pair->score = parse_score(p);
+ pair->renamed_pair = 1;
+ }
+ break;
+
+ default:
+ die("unknown diff status: %c", status);
+ }
+
+ diff_flush(&revs.diffopt);
+ }
+
+ return 0;
+}
diff --git a/git.c b/git.c
index dce529fcb..4063c5366 100644
--- a/git.c
+++ b/git.c
@@ -423,6 +423,7 @@ static struct cmd_struct commands[] = {
{ "diff", cmd_diff },
{ "diff-files", cmd_diff_files, RUN_SETUP | NEED_WORK_TREE },
{ "diff-index", cmd_diff_index, RUN_SETUP },
+ { "diff-pairs", cmd_diff_pairs, RUN_SETUP },
{ "diff-tree", cmd_diff_tree, RUN_SETUP },
{ "fast-export", cmd_fast_export, RUN_SETUP },
{ "fetch", cmd_fetch, RUN_SETUP },
diff --git a/t/t4063-diff-pairs.sh b/t/t4063-diff-pairs.sh
new file mode 100755
index 000000000..a99ec6655
--- /dev/null
+++ b/t/t4063-diff-pairs.sh
@@ -0,0 +1,82 @@
+#!/bin/sh
+
+test_description='basic diff-pairs tests'
+. ./test-lib.sh
+
+# This creates a diff with added, modified, deleted, renamed, copied, and
+# typechange entries. That includes one in a subdirectory for non-recursive
+# tests, and both exact and inexact similarity scores.
+test_expect_success 'create commit with various diffs' '
+ echo to-be-gone >deleted &&
+ echo original >modified &&
+ echo now-a-file >symlink &&
+ test_seq 200 >two-hundred &&
+ test_seq 201 500 >five-hundred &&
+ git add . &&
+ test_tick &&
+ git commit -m base &&
+ git tag base &&
+
+ echo now-here >added &&
+ echo new >modified &&
+ rm deleted &&
+ mkdir subdir &&
+ echo content >subdir/file &&
+ mv two-hundred renamed &&
+ test_seq 201 500 | sed s/300/modified/ >copied &&
+ rm symlink &&
+ git add -A . &&
+ test_ln_s_add dest symlink &&
+ test_tick &&
+ git commit -m new &&
+ git tag new
+'
+
+test_expect_success 'diff-pairs recreates --raw' '
+ git diff-tree -r -M -C -C base new >expect &&
+ # note that diff-pairs uses the default abbrev,
+ # so we must tweak that for identical output
+ git diff-tree -r -M -C -C -z base new |
+ git diff-pairs --no-abbrev >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'diff-pairs can create -p output' '
+ git diff-tree -p -M -C -C base new >expect &&
+ git diff-tree -r -M -C -C -z base new |
+ git diff-pairs -p >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'non-recursive --raw retains tree entry' '
+ git diff-tree base new >expect &&
+ git diff-tree -z base new |
+ git diff-pairs --no-abbrev >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'split input across multiple diff-pairs' '
+ write_script split-raw-diff "$PERL_PATH" <<-\EOF &&
+ $/ = "\0";
+ while (<>) {
+ my $meta = $_;
+ my $path = <>;
+ # renames have an extra path
+ my $path2 = <> if $meta =~ /[RC]\d+/;
+
+ open(my $fh, ">", sprintf "diff%03d", $.);
+ print $fh $meta, $path, $path2;
+ }
+ EOF
+
+ git diff-tree -p -M -C -C base new >expect &&
+
+ git diff-tree -r -z -M -C -C base new |
+ ./split-raw-diff &&
+ for i in diff*; do
+ git diff-pairs -p <$i
+ done >actual &&
+ test_cmp expect actual
+'
+
+test_done
--
2.11.0.319.g1f4e1e0
next reply other threads:[~2016-12-01 20:40 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-12-01 20:40 Jeff King [this message]
2016-12-01 20:52 ` [RFC/PATCH] add diff-pairs tool Junio C Hamano
2016-12-01 20:55 ` Jeff King
2016-12-01 20:58 ` Junio C Hamano
2016-12-01 21:04 ` Junio C Hamano
2016-12-01 21:30 ` Jeff King
2016-12-01 22:22 ` Junio C Hamano
2016-12-01 22:30 ` Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).