[PATCH] blame-tree: add library and tests via "test-tool blame-tree"

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH] blame-tree: add library and tests via "test-tool blame-tree"
@ 2023-02-05 20:47 Ævar Arnfjörð Bjarmason
  2023-02-10 15:42 ` Derrick Stolee
  2023-02-17 20:42 ` Jeff King
  0 siblings, 2 replies; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-05 20:47 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

From: Jeff King <peff@peff.net>

The "blame-tree" library allows for finding the most recent
modification to paths in a tree. It does so by expanding the tree at a
given commit, taking note of the current state of each path, and then
walking backwards through history looking for commits where each path
changed into its final sha1.

The "git blame-tree" command was first noted on the ML 2011[1], and
over the years there have been mentions of it[2][3], and whether it
could be upstreamed. The sources have been available at [4]'s
"jk/blame-tree-wip" and "jk/faster-blame-tree-wip" branches.

This change attempts to start upstreaming this command, but rather
than adding a command whose interface & semantics may be controversial
starts by exposing & testing the core of its library through a new
test helper.

An eventual "git blame-tree" command, or e.g. a new format for "git
ls-tree" to show what a path "blames" to can then be implemented with
this library.

Aside from changing the code found at [4] into a test-tool the most
notable changes here are:

* Removing the "--max-depth" changes to the diff code. We'll need
  those eventually, but it's not required for a blame of a given list
  of paths.

  As has been noted in previous on-list discussions the semantics of
  the "max-depth" changes might be controversial, so it's worthwhile
  to split those out so that they can be reviewed separately.

* Made the "blame-tree" helper take "--" before any revision options,
  for clarity. An eventual built-in command (if any) probably doesn't
  want to enforce this, but it makes it clearer in the test helper
  what's an argument for "blame-tree" itself, and what's an argument for
  the revision machinery.

* Minor updates for using C99 syntax, and "size_t" instead of "int"
  when we're iterating over types whose "nr" is that size.

* Avoid sub-shelling in the tests, use "test-tool -C .." instead.

1. https://lore.kernel.org/git/20110302164031.GA18233@sigill.intra.peff.net/
2. https://lore.kernel.org/git/20160831054201.ldlwptlmcndjmfwu@sigill.intra.peff.net/
3. https://public-inbox.org/git/20130318121243.GC14789@sigill.intra.peff.net/
4. https://github.com/peff/git/
4. https://gitlab.com/gitlab-org/git/-/issues/114

Commit-message-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---

The range-diff here is to peff's jk/blame-tree-wip. As noted above
this is far from the full thing, but hopefully getting the basic bits
of the library (sans the max-depth question) will make the review of
subsequent bits easier.

CI & branch for this at:
https://github.com/avar/git/tree/avar-jk/blame-tree-prep

Range-diff:
1:  65cfa9c91cd < -:  ----------- within_depth: fix return for empty path
2:  fa5f0c6dbec < -:  ----------- teach tree-diff a max-depth parameter
3:  c632d45c474 < -:  ----------- combine-diff: zero memory used for callback filepairs
4:  259e04af942 ! 1:  0ea849d900b add blame-tree command [ci skip]
    @@ Metadata
     Author: Jeff King <peff@peff.net>
     
      ## Commit message ##
    -    add blame-tree command [ci skip]
    +    blame-tree: add library and tests via "test-tool blame-tree"
     
    -    This command shows the most recent modification to paths in
    -    a tree. It does so by expanding the tree at a given commit,
    -    taking note of the current state of each path, and then
    -    walking backwards through history looking for commits where
    -    each path changed into its final sha1.
    +    The "blame-tree" library allows for finding the most recent
    +    modification to paths in a tree. It does so by expanding the tree at a
    +    given commit, taking note of the current state of each path, and then
    +    walking backwards through history looking for commits where each path
    +    changed into its final sha1.
     
    -    Signed-off-by: Jeff King <peff@peff.net>
    +    The "git blame-tree" command was first noted on the ML 2011[1], and
    +    over the years there have been mentions of it[2][3], and whether it
    +    could be upstreamed. The sources have been available at [4]'s
    +    "jk/blame-tree-wip" and "jk/faster-blame-tree-wip" branches.
     
    - ## .gitignore ##
    -@@
    - /git-archive
    - /git-bisect
    - /git-blame
    -+/git-blame-tree
    - /git-branch
    - /git-bugreport
    - /git-bundle
    +    This change attempts to start upstreaming this command, but rather
    +    than adding a command whose interface & semantics may be controversial
    +    starts by exposing & testing the core of its library through a new
    +    test helper.
    +
    +    An eventual "git blame-tree" command, or e.g. a new format for "git
    +    ls-tree" to show what a path "blames" to can then be implemented with
    +    this library.
    +
    +    Aside from changing the code found at [4] into a test-tool the most
    +    notable changes here are:
    +
    +    * Removing the "--max-depth" changes to the diff code. We'll need
    +      those eventually, but it's not required for a blame of a given list
    +      of paths.
    +
    +      As has been noted in previous on-list discussions the semantics of
    +      the "max-depth" changes might be controversial, so it's worthwhile
    +      to split those out so that they can be reviewed separately.
    +
    +    * Made the "blame-tree" helper take "--" before any revision options,
    +      for clarity. An eventual built-in command (if any) probably doesn't
    +      want to enforce this, but it makes it clearer in the test helper
    +      what's an argument for "blame-tree" itself, and what's an argument for
    +      the revision machinery.
    +
    +    * Minor updates for using C99 syntax, and "size_t" instead of "int"
    +      when we're iterating over types whose "nr" is that size.
    +
    +    * Avoid sub-shelling in the tests, use "test-tool -C .." instead.
    +
    +    1. https://lore.kernel.org/git/20110302164031.GA18233@sigill.intra.peff.net/
    +    2. https://lore.kernel.org/git/20160831054201.ldlwptlmcndjmfwu@sigill.intra.peff.net/
    +    3. https://public-inbox.org/git/20130318121243.GC14789@sigill.intra.peff.net/
    +    4. https://github.com/peff/git/
    +    4. https://gitlab.com/gitlab-org/git/-/issues/114
    +
    +    Commit-message-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
    +    Signed-off-by: Jeff King <peff@peff.net>
    +    Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
      ## Makefile ##
    +@@ Makefile: PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
    + 
    + TEST_BUILTINS_OBJS += test-advise.o
    + TEST_BUILTINS_OBJS += test-bitmap.o
    ++TEST_BUILTINS_OBJS += test-blame-tree.o
    + TEST_BUILTINS_OBJS += test-bloom.o
    + TEST_BUILTINS_OBJS += test-bundle-uri.o
    + TEST_BUILTINS_OBJS += test-cache-tree.o
     @@ Makefile: LIB_OBJS += archive.o
      LIB_OBJS += attr.o
      LIB_OBJS += base85.o
    @@ Makefile: LIB_OBJS += archive.o
      LIB_OBJS += blame.o
      LIB_OBJS += blob.o
      LIB_OBJS += bloom.o
    -@@ Makefile: BUILTIN_OBJS += builtin/apply.o
    - BUILTIN_OBJS += builtin/archive.o
    - BUILTIN_OBJS += builtin/bisect.o
    - BUILTIN_OBJS += builtin/blame.o
    -+BUILTIN_OBJS += builtin/blame-tree.o
    - BUILTIN_OBJS += builtin/branch.o
    - BUILTIN_OBJS += builtin/bugreport.o
    - BUILTIN_OBJS += builtin/bundle.o
     
      ## blame-tree.c (new) ##
     @@
     +#include "cache.h"
     +#include "blame-tree.h"
    ++#include "strvec.h"
    ++#include "hash.h"
     +#include "commit.h"
    -+#include "diff.h"
     +#include "diffcore.h"
    ++#include "diff.h"
    ++#include "object.h"
     +#include "revision.h"
     +#include "log-tree.h"
    -+#include "dir.h"
    ++
    ++void blame_tree_opts_release(struct blame_tree_options *bto)
    ++{
    ++	strvec_clear(&bto->args);
    ++}
     +
     +struct blame_tree_entry {
     +	struct object_id oid;
    @@ blame-tree.c (new)
     +};
     +
     +static void add_from_diff(struct diff_queue_struct *q,
    -+			  struct diff_options *opt,
    -+			  void *data)
    ++			  struct diff_options *opt, void *data)
     +{
     +	struct blame_tree *bt = data;
    -+	int i;
     +
    -+	for (i = 0; i < q->nr; i++) {
    ++	for (size_t i = 0; i < q->nr; i++) {
     +		struct diff_filepair *p = q->queue[i];
     +		struct blame_tree_entry *ent = xcalloc(1, sizeof(*ent));
    -+		struct string_list_item *it;
     +
     +		oidcpy(&ent->oid, &p->two->oid);
    -+		it = string_list_append(&bt->paths, p->two->path);
    -+		it->util = ent;
    ++		string_list_append(&bt->paths, p->two->path)->util = ent;
     +	}
     +}
     +
     +static int add_from_revs(struct blame_tree *bt)
     +{
    -+	unsigned int i;
    -+	int count = 0;
    ++	size_t count = 0;
     +	struct diff_options diffopt;
     +
     +	memcpy(&diffopt, &bt->rev.diffopt, sizeof(diffopt));
     +	diffopt.output_format = DIFF_FORMAT_CALLBACK;
     +	diffopt.format_callback = add_from_diff;
     +	diffopt.format_callback_data = bt;
    ++	diffopt.no_free = 1;
     +
    -+	for (i = 0; i < bt->rev.pending.nr; i++) {
    ++	for (size_t i = 0; i < bt->rev.pending.nr; i++) {
     +		struct object_array_entry *obj = bt->rev.pending.objects + i;
     +
     +		if (obj->item->flags & UNINTERESTING)
     +			continue;
     +
     +		if (count++)
    -+			return error("can only blame one tree at a time");
    -+
    -+		diff_tree_sha1(EMPTY_TREE_SHA1_BIN, obj->item->oid.hash, "", &diffopt);
    ++			return error(_("can only blame one tree at a time"));
    ++		diff_tree_oid(the_hash_algo->empty_tree, &obj->item->oid, "",
    ++			      &diffopt);
     +		diff_flush(&diffopt);
     +	}
     +
    @@ blame-tree.c (new)
     +	return 0;
     +}
     +
    -+void blame_tree_init(struct blame_tree *bt, int argc, const char **argv,
    -+		     const char *prefix)
    ++void blame_tree_init(struct blame_tree *bt,
    ++		     const struct blame_tree_options *opts)
     +{
    -+	memset(bt, 0, sizeof(*bt));
    -+	bt->paths.strdup_strings = 1;
    -+
    -+	init_revisions(&bt->rev, prefix);
    -+	bt->rev.def = "HEAD";
    ++	init_revisions(&bt->rev, opts->prefix);
    ++	bt->rev.def = oid_to_hex(&opts->oid);
     +	bt->rev.combine_merges = 1;
     +	bt->rev.show_root_diff = 1;
     +	bt->rev.boundary = 1;
     +	bt->rev.no_commit_id = 1;
     +	bt->rev.diff = 1;
    -+	DIFF_OPT_SET(&bt->rev.diffopt, RECURSIVE);
    -+	setup_revisions(argc, argv, &bt->rev, NULL);
    ++	bt->rev.diffopt.flags.recursive = opts->recursive;
    ++	setup_revisions(opts->args.nr, opts->args.v, &bt->rev, NULL);
     +
     +	if (add_from_revs(bt) < 0)
    -+		die("unable to setup blame-tree");
    ++		die(_("unable to setup blame-tree"));
     +}
     +
     +void blame_tree_release(struct blame_tree *bt)
     +{
     +	string_list_clear(&bt->paths, 1);
    ++	release_revisions(&bt->rev);
     +}
     +
    -+
     +struct blame_tree_callback_data {
     +	struct commit *commit;
     +	struct string_list *paths;
    -+	int num_interesting;
    ++	size_t num_interesting;
     +
    -+	blame_tree_callback callback;
    ++	blame_tree_fn callback;
     +	void *callback_data;
     +};
     +
    @@ blame-tree.c (new)
     +	ent = item->util;
     +	if (ent->commit)
     +		return;
    ++
     +	/*
     +	 * Is it arriving at a version of interest, or is it from a side branch
     +	 * which did not contribute to the final state?
     +	 */
    -+	if (oidcmp(oid, &ent->oid))
    ++	if (!oideq(oid, &ent->oid))
     +		return;
     +
     +	ent->commit = data->commit;
    @@ blame-tree.c (new)
     +}
     +
     +static void blame_diff(struct diff_queue_struct *q,
    -+			 struct diff_options *opt, void *cbdata)
    ++		       struct diff_options *opt, void *cbdata)
     +{
     +	struct blame_tree_callback_data *data = cbdata;
    -+	int i;
     +
    -+	for (i = 0; i < q->nr; i++) {
    ++	for (size_t i = 0; i < q->nr; i++) {
     +		struct diff_filepair *p = q->queue[i];
     +		switch (p->status) {
     +		case DIFF_STATUS_DELETED:
    @@ blame-tree.c (new)
     +	}
     +}
     +
    -+int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
    ++int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *cbdata)
     +{
     +	struct blame_tree_callback_data data;
     +
    @@ blame-tree.c (new)
     +
     +	prepare_revision_walk(&bt->rev);
     +
    -+	while (data.num_interesting > 0) {
    ++	while (data.num_interesting) {
     +		data.commit = get_revision(&bt->rev);
     +		if (!data.commit)
     +			break;
     +
     +		if (data.commit->object.flags & BOUNDARY) {
    -+			diff_tree_sha1(EMPTY_TREE_SHA1_BIN,
    -+				       data.commit->object.oid.hash,
    ++			diff_tree_oid(the_hash_algo->empty_tree,
    ++				       &data.commit->object.oid,
     +				       "", &bt->rev.diffopt);
     +			diff_flush(&bt->rev.diffopt);
    -+		}
    -+		else
    ++		} else {
     +			log_tree_commit(&bt->rev, data.commit);
    ++		}
     +	}
     +
     +	return 0;
    @@ blame-tree.h (new)
     +#ifndef BLAME_TREE_H
     +#define BLAME_TREE_H
     +
    -+#include "commit.h"
    -+#include "diff.h"
    -+#include "revision.h"
    ++#include "hash.h"
    ++#include "strvec.h"
     +#include "string-list.h"
    ++#include "revision.h"
    ++#include "commit.h"
    ++
    ++struct blame_tree_options {
    ++	struct object_id oid;
    ++	const char *prefix;
    ++	unsigned int recursive;
    ++	struct strvec args;
    ++};
    ++#define BLAME_TREE_OPTIONS_INIT(...) { \
    ++	.args = STRVEC_INIT, \
    ++	__VA_ARGS__ \
    ++}
    ++void blame_tree_opts_release(struct blame_tree_options *bto);
     +
     +struct blame_tree {
     +	struct string_list paths;
     +	struct rev_info rev;
     +};
    ++#define BLAME_TREE_INIT { \
    ++	.paths = STRING_LIST_INIT_DUP, \
    ++	.rev = REV_INFO_INIT, \
    ++}
     +
    -+void blame_tree_init(struct blame_tree *,
    -+		     int argc, const char **argv, const char *prefix);
    ++void blame_tree_init(struct blame_tree *bt,
    ++		     const struct blame_tree_options *opts);
     +void blame_tree_release(struct blame_tree *);
     +
    -+typedef void (*blame_tree_callback)(const char *path,
    -+				    const struct commit *commit,
    -+				    void *data);
    -+int blame_tree_run(struct blame_tree *,
    -+		   blame_tree_callback cb,
    -+		   void *data);
    ++typedef void (*blame_tree_fn)(const char *path, const struct commit *commit,
    ++			      void *data);
    ++int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *data);
     +
     +#endif /* BLAME_TREE_H */
     
    - ## builtin.h ##
    -@@ builtin.h: int cmd_apply(int argc, const char **argv, const char *prefix);
    - int cmd_archive(int argc, const char **argv, const char *prefix);
    - int cmd_bisect(int argc, const char **argv, const char *prefix);
    - int cmd_blame(int argc, const char **argv, const char *prefix);
    -+int cmd_blame_tree(int argc, const char **argv, const char *prefix);
    - int cmd_branch(int argc, const char **argv, const char *prefix);
    - int cmd_bugreport(int argc, const char **argv, const char *prefix);
    - int cmd_bundle(int argc, const char **argv, const char *prefix);
    -
    - ## builtin/blame-tree.c (new) ##
    + ## t/helper/test-blame-tree.c (new) ##
     @@
    -+#include "cache.h"
    ++#include "test-tool.h"
     +#include "blame-tree.h"
     +#include "quote.h"
    ++#include "config.h"
     +#include "parse-options.h"
     +
     +static void show_entry(const char *path, const struct commit *commit, void *d)
    @@ builtin/blame-tree.c (new)
     +		putchar('^');
     +	printf("%s\t", oid_to_hex(&commit->object.oid));
     +
    -+	if (bt->rev.diffopt.line_termination)
    -+		write_name_quoted(path, stdout, '\n');
    -+	else
    -+		printf("%s%c", path, '\0');
    ++	write_name_quoted(path, stdout, bt->rev.diffopt.line_termination);
     +
     +	fflush(stdout);
     +}
     +
    -+int cmd_blame_tree(int argc, const char **argv, const char *prefix)
    ++int cmd__blame_tree(int argc, const char **argv)
     +{
    -+	struct blame_tree bt;
    ++	const char *prefix = setup_git_directory();
    ++	struct blame_tree_options opts = BLAME_TREE_OPTIONS_INIT(
    ++		.prefix = prefix,
    ++		.recursive = 1,
    ++	);
    ++	struct blame_tree bt = BLAME_TREE_INIT;
    ++	struct option options[] = {
    ++		OPT_BOOL(0, "recursive", &opts.recursive,
    ++			 "recurse into to subtrees"),
    ++		OPT_END()
    ++	};
    ++	static char const* const usage[] = {
    ++		"test-tool blame-tree [--no-recursive] [-- <rev-opts>]",
    ++		NULL
    ++	};
    ++
    ++	argc = parse_options(argc, argv, prefix, options, usage,
    ++			     PARSE_OPT_KEEP_DASHDASH);
    ++
    ++	if (argc) {
    ++		if (strcmp(argv[0], "--") &&
    ++		    strcmp(argv[0], "--end-of-options"))
    ++			usage_msg_opt("need -- before <rev-opts>", usage, options);
    ++		argc--;
    ++		argv++;
    ++	}
     +
    -+	git_config(git_default_config, NULL);
    ++	if (repo_get_oid(the_repository, "HEAD", &opts.oid))
    ++		 die("unable to get HEAD");
     +
    -+	blame_tree_init(&bt, argc, argv, prefix);
    ++	strvec_push(&opts.args, "blame-tree");
    ++	if (argc)
    ++		strvec_pushv(&opts.args, argv);
    ++	blame_tree_init(&bt, &opts);
     +	if (blame_tree_run(&bt, show_entry, &bt) < 0)
     +		die("error running blame-tree traversal");
     +	blame_tree_release(&bt);
    ++	blame_tree_opts_release(&opts);
     +
     +	return 0;
     +}
     
    - ## git.c ##
    -@@ git.c: static struct cmd_struct commands[] = {
    - 	{ "archive", cmd_archive, RUN_SETUP_GENTLY },
    - 	{ "bisect", cmd_bisect, RUN_SETUP },
    - 	{ "blame", cmd_blame, RUN_SETUP },
    -+	{ "blame-tree", cmd_blame_tree, RUN_SETUP },
    - 	{ "branch", cmd_branch, RUN_SETUP | DELAY_PAGER_CONFIG },
    - 	{ "bugreport", cmd_bugreport, RUN_SETUP_GENTLY },
    - 	{ "bundle", cmd_bundle, RUN_SETUP_GENTLY },
    + ## t/helper/test-tool.c ##
    +@@ t/helper/test-tool.c: static const char * const test_tool_usage[] = {
    + static struct test_cmd cmds[] = {
    + 	{ "advise", cmd__advise_if_enabled },
    + 	{ "bitmap", cmd__bitmap },
    ++	{ "blame-tree", cmd__blame_tree },
    + 	{ "bloom", cmd__bloom },
    + 	{ "bundle-uri", cmd__bundle_uri },
    + 	{ "cache-tree", cmd__cache_tree },
    +
    + ## t/helper/test-tool.h ##
    +@@
    + 
    + int cmd__advise_if_enabled(int argc, const char **argv);
    + int cmd__bitmap(int argc, const char **argv);
    ++int cmd__blame_tree(int argc, const char **argv);
    + int cmd__bloom(int argc, const char **argv);
    + int cmd__bundle_uri(int argc, const char **argv);
    + int cmd__cache_tree(int argc, const char **argv);
     
    - ## t/t8011-blame-tree.sh (new) ##
    + ## t/t8020-blame-tree-lib.sh (new) ##
     @@
     +#!/bin/sh
     +
    -+test_description='basic blame-tree tests'
    ++test_description='basic blame-tree library tests'
    ++
     +. ./test-lib.sh
     +
     +test_expect_success 'setup' '
    @@ t/t8011-blame-tree.sh (new)
     +'
     +
     +test_expect_success 'cannot blame two trees' '
    -+	test_must_fail git blame-tree HEAD HEAD~1
    ++	test_must_fail test-tool blame-tree HEAD HEAD~1
     +'
     +
     +check_blame() {
    ++	local indir= &&
    ++	while test $# != 0
    ++	do
    ++		case "$1" in
    ++		-C)
    ++			indir="$2"
    ++			shift
    ++			;;
    ++		*)
    ++			break
    ++			;;
    ++		esac &&
    ++		shift
    ++	done &&
    ++
     +	cat >expect &&
    -+	git blame-tree "$@" >actual &&
    -+	git name-rev --stdin --name-only --tags <actual >tmp &&
    -+	mv tmp actual &&
    -+	tr '\t' ' ' <actual >tmp &&
    -+	mv tmp actual &&
    -+	sort <actual >tmp &&
    -+	mv tmp actual &&
    ++	test_when_finished "rm -f tmp.*" &&
    ++	test-tool ${indir:+-C "$indir"} \
    ++		blame-tree "$@" >tmp.1 &&
    ++	git name-rev --annotate-stdin --name-only --tags \
    ++		<tmp.1 >tmp.2 &&
    ++	tr '\t' ' ' <tmp.2 >tmp.3 &&
    ++	sort tmp.3 >actual &&
     +	test_cmp expect actual
     +}
     +
    @@ t/t8011-blame-tree.sh (new)
     +'
     +
     +test_expect_success 'blame root' '
    -+	check_blame --max-depth=0 <<-\EOF
    ++	check_blame --no-recursive <<-\EOF
     +	1 file
     +	3 a
     +	EOF
     +'
     +
     +test_expect_success 'blame subdir' '
    -+	check_blame --max-depth=1 a <<-\EOF
    ++	check_blame -- a <<-\EOF
     +	2 a/file
    -+	3 a/b
    ++	3 a/b/file
     +	EOF
     +'
     +
     +test_expect_success 'blame from non-HEAD commit' '
    -+	check_blame --max-depth=0 HEAD^ <<-\EOF
    ++	check_blame --no-recursive -- HEAD^ <<-\EOF
     +	1 file
     +	2 a
     +	EOF
     +'
     +
    -+test_expect_success 'blame from subdir defaults to root' '(
    -+	cd a &&
    -+	check_blame --max-depth=0 <<-\EOF
    ++test_expect_success 'blame from subdir defaults to root' '
    ++	check_blame -C a --no-recursive <<-\EOF
     +	1 file
     +	3 a
     +	EOF
    -+)'
    ++'
     +
    -+test_expect_success 'blame from subdir uses relative pathspecs' '(
    -+	cd a &&
    -+	check_blame --max-depth=1 b <<-\EOF
    ++test_expect_success 'blame from subdir uses relative pathspecs' '
    ++	check_blame -C a -- b <<-\EOF
     +	3 a/b/file
     +	EOF
    -+)'
    ++'
     +
     +test_expect_success 'limit blame traversal by count' '
    -+	check_blame --max-depth=0 -1 <<-\EOF
    ++	check_blame --no-recursive -- -1 <<-\EOF
     +	3 a
     +	^2 file
     +	EOF
     +'
     +
     +test_expect_success 'limit blame traversal by commit' '
    -+	check_blame --max-depth=0 HEAD~2..HEAD <<-\EOF
    ++	check_blame --no-recursive -- HEAD~2..HEAD <<-\EOF
     +	3 a
     +	^1 file
     +	EOF
    @@ t/t8011-blame-tree.sh (new)
     +	test_commit c2 conflict &&
     +	test_must_fail git merge c1 &&
     +	test_commit resolved conflict &&
    -+	check_blame conflict <<-\EOF
    ++	check_blame -- conflict <<-\EOF
     +	resolved conflict
     +	EOF
     +'

 Makefile                   |   2 +
 blame-tree.c               | 195 +++++++++++++++++++++++++++++++++++++
 blame-tree.h               |  39 ++++++++
 t/helper/test-blame-tree.c |  62 ++++++++++++
 t/helper/test-tool.c       |   1 +
 t/helper/test-tool.h       |   1 +
 t/t8020-blame-tree-lib.sh  | 138 ++++++++++++++++++++++++++
 7 files changed, 438 insertions(+)
 create mode 100644 blame-tree.c
 create mode 100644 blame-tree.h
 create mode 100644 t/helper/test-blame-tree.c
 create mode 100755 t/t8020-blame-tree-lib.sh

diff --git a/Makefile b/Makefile
index 45bd6ac9c3e..ed82dfee8e7 100644
--- a/Makefile
+++ b/Makefile
@@ -786,6 +786,7 @@ PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 
 TEST_BUILTINS_OBJS += test-advise.o
 TEST_BUILTINS_OBJS += test-bitmap.o
+TEST_BUILTINS_OBJS += test-blame-tree.o
 TEST_BUILTINS_OBJS += test-bloom.o
 TEST_BUILTINS_OBJS += test-bundle-uri.o
 TEST_BUILTINS_OBJS += test-cache-tree.o
@@ -974,6 +975,7 @@ LIB_OBJS += archive.o
 LIB_OBJS += attr.o
 LIB_OBJS += base85.o
 LIB_OBJS += bisect.o
+LIB_OBJS += blame-tree.o
 LIB_OBJS += blame.o
 LIB_OBJS += blob.o
 LIB_OBJS += bloom.o
diff --git a/blame-tree.c b/blame-tree.c
new file mode 100644
index 00000000000..dc739b23bff
--- /dev/null
+++ b/blame-tree.c
@@ -0,0 +1,195 @@
+#include "cache.h"
+#include "blame-tree.h"
+#include "strvec.h"
+#include "hash.h"
+#include "commit.h"
+#include "diffcore.h"
+#include "diff.h"
+#include "object.h"
+#include "revision.h"
+#include "log-tree.h"
+
+void blame_tree_opts_release(struct blame_tree_options *bto)
+{
+	strvec_clear(&bto->args);
+}
+
+struct blame_tree_entry {
+	struct object_id oid;
+	struct commit *commit;
+};
+
+static void add_from_diff(struct diff_queue_struct *q,
+			  struct diff_options *opt, void *data)
+{
+	struct blame_tree *bt = data;
+
+	for (size_t i = 0; i < q->nr; i++) {
+		struct diff_filepair *p = q->queue[i];
+		struct blame_tree_entry *ent = xcalloc(1, sizeof(*ent));
+
+		oidcpy(&ent->oid, &p->two->oid);
+		string_list_append(&bt->paths, p->two->path)->util = ent;
+	}
+}
+
+static int add_from_revs(struct blame_tree *bt)
+{
+	size_t count = 0;
+	struct diff_options diffopt;
+
+	memcpy(&diffopt, &bt->rev.diffopt, sizeof(diffopt));
+	diffopt.output_format = DIFF_FORMAT_CALLBACK;
+	diffopt.format_callback = add_from_diff;
+	diffopt.format_callback_data = bt;
+	diffopt.no_free = 1;
+
+	for (size_t i = 0; i < bt->rev.pending.nr; i++) {
+		struct object_array_entry *obj = bt->rev.pending.objects + i;
+
+		if (obj->item->flags & UNINTERESTING)
+			continue;
+
+		if (count++)
+			return error(_("can only blame one tree at a time"));
+		diff_tree_oid(the_hash_algo->empty_tree, &obj->item->oid, "",
+			      &diffopt);
+		diff_flush(&diffopt);
+	}
+
+	string_list_sort(&bt->paths);
+	return 0;
+}
+
+void blame_tree_init(struct blame_tree *bt,
+		     const struct blame_tree_options *opts)
+{
+	init_revisions(&bt->rev, opts->prefix);
+	bt->rev.def = oid_to_hex(&opts->oid);
+	bt->rev.combine_merges = 1;
+	bt->rev.show_root_diff = 1;
+	bt->rev.boundary = 1;
+	bt->rev.no_commit_id = 1;
+	bt->rev.diff = 1;
+	bt->rev.diffopt.flags.recursive = opts->recursive;
+	setup_revisions(opts->args.nr, opts->args.v, &bt->rev, NULL);
+
+	if (add_from_revs(bt) < 0)
+		die(_("unable to setup blame-tree"));
+}
+
+void blame_tree_release(struct blame_tree *bt)
+{
+	string_list_clear(&bt->paths, 1);
+	release_revisions(&bt->rev);
+}
+
+struct blame_tree_callback_data {
+	struct commit *commit;
+	struct string_list *paths;
+	size_t num_interesting;
+
+	blame_tree_fn callback;
+	void *callback_data;
+};
+
+static void mark_path(const char *path, const struct object_id *oid,
+		      struct blame_tree_callback_data *data)
+{
+	struct string_list_item *item = string_list_lookup(data->paths, path);
+	struct blame_tree_entry *ent;
+
+	/* Is it even a path that exists in our tree? */
+	if (!item)
+		return;
+
+	/* Have we already blamed a commit? */
+	ent = item->util;
+	if (ent->commit)
+		return;
+
+	/*
+	 * Is it arriving at a version of interest, or is it from a side branch
+	 * which did not contribute to the final state?
+	 */
+	if (!oideq(oid, &ent->oid))
+		return;
+
+	ent->commit = data->commit;
+	data->num_interesting--;
+	if (data->callback)
+		data->callback(path, data->commit, data->callback_data);
+}
+
+static void blame_diff(struct diff_queue_struct *q,
+		       struct diff_options *opt, void *cbdata)
+{
+	struct blame_tree_callback_data *data = cbdata;
+
+	for (size_t i = 0; i < q->nr; i++) {
+		struct diff_filepair *p = q->queue[i];
+		switch (p->status) {
+		case DIFF_STATUS_DELETED:
+			/*
+			 * There's no point in feeding a deletion, as it could
+			 * not have resulted in our current state, which
+			 * actually has the file.
+			 */
+			break;
+
+		default:
+			/*
+			 * Otherwise, we care only that we somehow arrived at
+			 * a final path/sha1 state. Note that this covers some
+			 * potentially controversial areas, including:
+			 *
+			 *  1. A rename or copy will be blamed, as it is the
+			 *     first time the content has arrived at the given
+			 *     path.
+			 *
+			 *  2. Even a non-content modification like a mode or
+			 *     type change will trigger it.
+			 *
+			 * We take the inclusive approach for now, and blame
+			 * anything which impacts the path. Options to tweak
+			 * the behavior (e.g., to "--follow" the content across
+			 * renames) can come later.
+			 */
+			mark_path(p->two->path, &p->two->oid, data);
+			break;
+		}
+	}
+}
+
+int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *cbdata)
+{
+	struct blame_tree_callback_data data;
+
+	data.paths = &bt->paths;
+	data.num_interesting = bt->paths.nr;
+	data.callback = cb;
+	data.callback_data = cbdata;
+
+	bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+	bt->rev.diffopt.format_callback = blame_diff;
+	bt->rev.diffopt.format_callback_data = &data;
+
+	prepare_revision_walk(&bt->rev);
+
+	while (data.num_interesting) {
+		data.commit = get_revision(&bt->rev);
+		if (!data.commit)
+			break;
+
+		if (data.commit->object.flags & BOUNDARY) {
+			diff_tree_oid(the_hash_algo->empty_tree,
+				       &data.commit->object.oid,
+				       "", &bt->rev.diffopt);
+			diff_flush(&bt->rev.diffopt);
+		} else {
+			log_tree_commit(&bt->rev, data.commit);
+		}
+	}
+
+	return 0;
+}
diff --git a/blame-tree.h b/blame-tree.h
new file mode 100644
index 00000000000..fb4ba1004e7
--- /dev/null
+++ b/blame-tree.h
@@ -0,0 +1,39 @@
+#ifndef BLAME_TREE_H
+#define BLAME_TREE_H
+
+#include "hash.h"
+#include "strvec.h"
+#include "string-list.h"
+#include "revision.h"
+#include "commit.h"
+
+struct blame_tree_options {
+	struct object_id oid;
+	const char *prefix;
+	unsigned int recursive;
+	struct strvec args;
+};
+#define BLAME_TREE_OPTIONS_INIT(...) { \
+	.args = STRVEC_INIT, \
+	__VA_ARGS__ \
+}
+void blame_tree_opts_release(struct blame_tree_options *bto);
+
+struct blame_tree {
+	struct string_list paths;
+	struct rev_info rev;
+};
+#define BLAME_TREE_INIT { \
+	.paths = STRING_LIST_INIT_DUP, \
+	.rev = REV_INFO_INIT, \
+}
+
+void blame_tree_init(struct blame_tree *bt,
+		     const struct blame_tree_options *opts);
+void blame_tree_release(struct blame_tree *);
+
+typedef void (*blame_tree_fn)(const char *path, const struct commit *commit,
+			      void *data);
+int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *data);
+
+#endif /* BLAME_TREE_H */
diff --git a/t/helper/test-blame-tree.c b/t/helper/test-blame-tree.c
new file mode 100644
index 00000000000..1ea5ff783af
--- /dev/null
+++ b/t/helper/test-blame-tree.c
@@ -0,0 +1,62 @@
+#include "test-tool.h"
+#include "blame-tree.h"
+#include "quote.h"
+#include "config.h"
+#include "parse-options.h"
+
+static void show_entry(const char *path, const struct commit *commit, void *d)
+{
+	struct blame_tree *bt = d;
+
+	if (commit->object.flags & BOUNDARY)
+		putchar('^');
+	printf("%s\t", oid_to_hex(&commit->object.oid));
+
+	write_name_quoted(path, stdout, bt->rev.diffopt.line_termination);
+
+	fflush(stdout);
+}
+
+int cmd__blame_tree(int argc, const char **argv)
+{
+	const char *prefix = setup_git_directory();
+	struct blame_tree_options opts = BLAME_TREE_OPTIONS_INIT(
+		.prefix = prefix,
+		.recursive = 1,
+	);
+	struct blame_tree bt = BLAME_TREE_INIT;
+	struct option options[] = {
+		OPT_BOOL(0, "recursive", &opts.recursive,
+			 "recurse into to subtrees"),
+		OPT_END()
+	};
+	static char const* const usage[] = {
+		"test-tool blame-tree [--no-recursive] [-- <rev-opts>]",
+		NULL
+	};
+
+	argc = parse_options(argc, argv, prefix, options, usage,
+			     PARSE_OPT_KEEP_DASHDASH);
+
+	if (argc) {
+		if (strcmp(argv[0], "--") &&
+		    strcmp(argv[0], "--end-of-options"))
+			usage_msg_opt("need -- before <rev-opts>", usage, options);
+		argc--;
+		argv++;
+	}
+
+	if (repo_get_oid(the_repository, "HEAD", &opts.oid))
+		 die("unable to get HEAD");
+
+	strvec_push(&opts.args, "blame-tree");
+	if (argc)
+		strvec_pushv(&opts.args, argv);
+	blame_tree_init(&bt, &opts);
+	if (blame_tree_run(&bt, show_entry, &bt) < 0)
+		die("error running blame-tree traversal");
+	blame_tree_release(&bt);
+	blame_tree_opts_release(&opts);
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index abe8a785eb6..8f34ff18d6f 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -12,6 +12,7 @@ static const char * const test_tool_usage[] = {
 static struct test_cmd cmds[] = {
 	{ "advise", cmd__advise_if_enabled },
 	{ "bitmap", cmd__bitmap },
+	{ "blame-tree", cmd__blame_tree },
 	{ "bloom", cmd__bloom },
 	{ "bundle-uri", cmd__bundle_uri },
 	{ "cache-tree", cmd__cache_tree },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index ea2672436c9..745a2097000 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -5,6 +5,7 @@
 
 int cmd__advise_if_enabled(int argc, const char **argv);
 int cmd__bitmap(int argc, const char **argv);
+int cmd__blame_tree(int argc, const char **argv);
 int cmd__bloom(int argc, const char **argv);
 int cmd__bundle_uri(int argc, const char **argv);
 int cmd__cache_tree(int argc, const char **argv);
diff --git a/t/t8020-blame-tree-lib.sh b/t/t8020-blame-tree-lib.sh
new file mode 100755
index 00000000000..4213c25de8b
--- /dev/null
+++ b/t/t8020-blame-tree-lib.sh
@@ -0,0 +1,138 @@
+#!/bin/sh
+
+test_description='basic blame-tree library tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_commit 1 file &&
+	mkdir a &&
+	test_commit 2 a/file &&
+	mkdir a/b &&
+	test_commit 3 a/b/file
+'
+
+test_expect_success 'cannot blame two trees' '
+	test_must_fail test-tool blame-tree HEAD HEAD~1
+'
+
+check_blame() {
+	local indir= &&
+	while test $# != 0
+	do
+		case "$1" in
+		-C)
+			indir="$2"
+			shift
+			;;
+		*)
+			break
+			;;
+		esac &&
+		shift
+	done &&
+
+	cat >expect &&
+	test_when_finished "rm -f tmp.*" &&
+	test-tool ${indir:+-C "$indir"} \
+		blame-tree "$@" >tmp.1 &&
+	git name-rev --annotate-stdin --name-only --tags \
+		<tmp.1 >tmp.2 &&
+	tr '\t' ' ' <tmp.2 >tmp.3 &&
+	sort tmp.3 >actual &&
+	test_cmp expect actual
+}
+
+test_expect_success 'blame recursive' '
+	check_blame <<-\EOF
+	1 file
+	2 a/file
+	3 a/b/file
+	EOF
+'
+
+test_expect_success 'blame root' '
+	check_blame --no-recursive <<-\EOF
+	1 file
+	3 a
+	EOF
+'
+
+test_expect_success 'blame subdir' '
+	check_blame -- a <<-\EOF
+	2 a/file
+	3 a/b/file
+	EOF
+'
+
+test_expect_success 'blame from non-HEAD commit' '
+	check_blame --no-recursive -- HEAD^ <<-\EOF
+	1 file
+	2 a
+	EOF
+'
+
+test_expect_success 'blame from subdir defaults to root' '
+	check_blame -C a --no-recursive <<-\EOF
+	1 file
+	3 a
+	EOF
+'
+
+test_expect_success 'blame from subdir uses relative pathspecs' '
+	check_blame -C a -- b <<-\EOF
+	3 a/b/file
+	EOF
+'
+
+test_expect_success 'limit blame traversal by count' '
+	check_blame --no-recursive -- -1 <<-\EOF
+	3 a
+	^2 file
+	EOF
+'
+
+test_expect_success 'limit blame traversal by commit' '
+	check_blame --no-recursive -- HEAD~2..HEAD <<-\EOF
+	3 a
+	^1 file
+	EOF
+'
+
+test_expect_success 'only blame files in the current tree' '
+	git rm -rf a &&
+	git commit -m "remove a" &&
+	check_blame <<-\EOF
+	1 file
+	EOF
+'
+
+test_expect_success 'cross merge boundaries in blaming' '
+	git checkout HEAD^0 &&
+	git rm -rf . &&
+	test_commit m1 &&
+	git checkout HEAD^ &&
+	git rm -rf . &&
+	test_commit m2 &&
+	git merge m1 &&
+	check_blame <<-\EOF
+	m1 m1.t
+	m2 m2.t
+	EOF
+'
+
+test_expect_success 'blame merge for resolved conflicts' '
+	git checkout HEAD^0 &&
+	git rm -rf . &&
+	test_commit c1 conflict &&
+	git checkout HEAD^ &&
+	git rm -rf . &&
+	test_commit c2 conflict &&
+	test_must_fail git merge c1 &&
+	test_commit resolved conflict &&
+	check_blame -- conflict <<-\EOF
+	resolved conflict
+	EOF
+'
+
+test_done
-- 
2.39.1.1398.g1368d6f56cf


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] blame-tree: add library and tests via "test-tool blame-tree"
  2023-02-05 20:47 [PATCH] blame-tree: add library and tests via "test-tool blame-tree" Ævar Arnfjörð Bjarmason
@ 2023-02-10 15:42 ` Derrick Stolee
  2023-03-07 13:56   ` Ævar Arnfjörð Bjarmason
  2023-02-17 20:42 ` Jeff King
  1 sibling, 1 reply; 6+ messages in thread
From: Derrick Stolee @ 2023-02-10 15:42 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git
  Cc: Junio C Hamano, Jeff King, Taylor Blau

On 2/5/2023 3:47 PM, Ævar Arnfjörð Bjarmason wrote:

I've been meaning to get to this all week, but today was my first chance.

> The "blame-tree" library allows for finding the most recent
> modification to paths in a tree. It does so by expanding the tree at a
> given commit, taking note of the current state of each path, and then
> walking backwards through history looking for commits where each path
> changed into its final sha1.
>
> The "git blame-tree" command was first noted on the ML 2011[1], and
> over the years there have been mentions of it[2][3], and whether it
> could be upstreamed. The sources have been available at [4]'s
> "jk/blame-tree-wip" and "jk/faster-blame-tree-wip" branches.
>
> This change attempts to start upstreaming this command, but rather
> than adding a command whose interface & semantics may be controversial
> starts by exposing & testing the core of its library through a new
> test helper.

One thing that I think is important for this work and the future we build
on top of it is that we need to reconsider every fact about the existing
contribution and make sure the new thing is built for the actual needs.

In that vein, I think your use of a test-tool is a really good one. It can
help us define an API boundary that could then be reflected in a CLI in
independent review from improved algorithms underneath the API.

> +struct blame_tree_options {
> +	struct object_id oid;
> +	const char *prefix;
> +	unsigned int recursive;
> +	struct strvec args;
> +};
> +#define BLAME_TREE_OPTIONS_INIT(...) { \
> +	.args = STRVEC_INIT, \
> +	__VA_ARGS__ \
> +}
> +void blame_tree_opts_release(struct blame_tree_options *bto);
> +
> +struct blame_tree {
> +	struct string_list paths;
> +	struct rev_info rev;
> +};
> +#define BLAME_TREE_INIT { \
> +	.paths = STRING_LIST_INIT_DUP, \
> +	.rev = REV_INFO_INIT, \
> +}
> +
> +void blame_tree_init(struct blame_tree *bt,
> +		     const struct blame_tree_options *opts);
> +void blame_tree_release(struct blame_tree *);
> +
> +typedef void (*blame_tree_fn)(const char *path, const struct commit *commit,
> +			      void *data);
> +int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *data);

However, this API is too leaky. Specifically, it allows passing arbitrary
'args' instead of structured options.

Before I get into what I think needs to change here, let's look at the
initial implementation:

> +int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *cbdata)
> +{
> +	struct blame_tree_callback_data data;
> +
> +	data.paths = &bt->paths;
> +	data.num_interesting = bt->paths.nr;
> +	data.callback = cb;
> +	data.callback_data = cbdata;
> +
> +	bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
> +	bt->rev.diffopt.format_callback = blame_diff;
> +	bt->rev.diffopt.format_callback_data = &data;
> +
> +	prepare_revision_walk(&bt->rev);
> +
> +	while (data.num_interesting) {
> +		data.commit = get_revision(&bt->rev);
> +		if (!data.commit)
> +			break;
> +
> +		if (data.commit->object.flags & BOUNDARY) {
> +			diff_tree_oid(the_hash_algo->empty_tree,
> +				       &data.commit->object.oid,
> +				       "", &bt->rev.diffopt);
> +			diff_flush(&bt->rev.diffopt);
> +		} else {
> +			log_tree_commit(&bt->rev, data.commit);
> +		}
> +	}
> +
> +	return 0;
> +}

When I think of 'blame-tree', I think of the following output:

  For each path (within a pathspec, recursive or not), output the first
  commit that changed that path according to simplified file history. This
  history walk begins at a single commit, but may be terminated by a
  negative commit range, so the output could indicate that we reached the
  boundary.

With that definition, the most-obvious first implementation would be to
first generate the path list, then run
`git rev-list -n 1 <revisions> -- <path>` for each path in the pathspec.
This could be done by N prepare_revision_walk()/get_revision() executions
within a loop instead of actually executing subcommands.

The implementation in this patch is simultaneously smarter than that basic
approach, but also not smart enough for the best performance. In order to
get the optimal performance, the following parts of my output definition
are critical:

 1. We use simplified file history.
 2. The history walk begins at a single commit.

If either of these conditions are broken, then the current-best algorithm
cannot be used (and even more: our proactive caching mechanism cannot be
used).

The API as currently defined does not allow us to enforce that restriction
because the arbitrary arguments could specify `--full-history`, or worse
`--simplify-merge` (and also `--first-parent`) to modify the history mode
or even `--all` to modify the number of starting commits.

The version we use in our custom fork has the historical baggage of this
open-ended argument-parsing, but because we have full control of the
callers to the CLI, we have enforced that those arguments are not being
used. While we are not currently reviewing the CLI of a builtin, the API
layer is essentially providing a pass-through of arbitrary revision options.

I am happy to see that the 'struct blame_tree_options' includes a callback
for the output, as that is valuable for both reporting results to stdout
or to collect results into a list for caching purposes, in the future
where we have that ability.

--- Recommended API ---

Using 'struct blame_tree_options' instead of many parameters creates a
good future-proof method prototype, so we can always _add_ options when
explicitly understood as important to callers of one kind or another.

However, to drop the arbitrary 'args' we need to instead make it very
explicit which commits are being used in the history walk.

struct blame_tree_options {
	struct object_id oid;
	const char *prefix;
	unsigned int recursive;
	struct commit *start;
	struct commit_list *uninteresting;
};

This definition of the struct should be enough to demonstrate the behavior
we are describing. However, for the v1 of the API, we may even want to
completely remove the 'uninteresting' choice, and instead focus only on a
single starting position ('start'). If we decide that 'uninteresting' is
valuable, then it can be added again later, hopefully after the primary
use of this command is established.

Again, thinking about the lifetime of this command in our infrastructure,
the commit range behavior was used at one point as a performance improvement
where a cache was given for a specific starting commit B, then we
dynamically computed the blame-tree for the range B..A when given a new
commit A, and merged the two results together. However, this was not always
correct due to complexities around parallel changes. A different caching
mechanism was built into the low-level algorithm itself which resolves
these complications, but also _that cache cannot be used when given range
queries_.

Further, I recommend building the simplest implementation first, since it
is actually closer to the very fast implementation in that it has two
parts:

 1. Compute the list of paths at the tip.
 2. Perform history walks for those paths.

The fast implementation does a single history walk that essentially walks
the simplified history for all of the paths simultaneously, but it is
critical to have that list of paths at the start of that walk. In that way
it's actually easier to adapt the simpler algorithm to the current-best
algorithm than it is to adapt the smart algorithm from this patch to the
current-best algorithm.

All this is to say, that I'd like to see this API start with the smallest
possible surface area and with the simplest implementation, and then I'd
be happy to contribute those algorithms within the API boundary while the
CLI is handled independently.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] blame-tree: add library and tests via "test-tool blame-tree"
  2023-02-05 20:47 [PATCH] blame-tree: add library and tests via "test-tool blame-tree" Ævar Arnfjörð Bjarmason
  2023-02-10 15:42 ` Derrick Stolee
@ 2023-02-17 20:42 ` Jeff King
  1 sibling, 0 replies; 6+ messages in thread
From: Jeff King @ 2023-02-17 20:42 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Junio C Hamano

On Sun, Feb 05, 2023 at 09:47:03PM +0100, Ævar Arnfjörð Bjarmason wrote:

> From: Jeff King <peff@peff.net>

I appreciate being credited, of course, but at some point I think this
becomes "Based-on-a-patch-by". And we may have crossed the line here.

> The "blame-tree" library allows for finding the most recent
> modification to paths in a tree. It does so by expanding the tree at a
> given commit, taking note of the current state of each path, and then
> walking backwards through history looking for commits where each path
> changed into its final sha1.
> 
> The "git blame-tree" command was first noted on the ML 2011[1], and
> over the years there have been mentions of it[2][3], and whether it
> could be upstreamed. The sources have been available at [4]'s
> "jk/blame-tree-wip" and "jk/faster-blame-tree-wip" branches.

Sort of. jk/blame-tree-wip probably matches what I sent to the list in
2011 (and that probably matches what was deployed at GitHub at the
time). And like all of my branches, I continually rebase it on git.git's
master branch. But like all branches in my repo with "-wip" in them, it
is not part of my daily driver, and I generally do not even build it
(and I would not be surprised if it does not build at all, or one of
those rebases introduced a horrible bug).

The jk/faster-blame-tree-wip was an experiment from 2014 to narrow the
pathspec as we found answers. But it was never deployed anywhere, and
likewise may or may not even build now. I think the general idea there
is sound (make pathspec lookups faster by using a trie, and then narrow
it), but I suspect it's not a full solution. In particular, I don't
think it does anything clever with merges. Since we are narrowing the
pathspec as we traverse, we can't rely on the usual pruning of side
branches that happens via limit_list(), as that is all up-front. So it
probably needs to look at each merge as we traverse and cull parents
based on TREESAME.

I believe GitHub later had patches to do that, but they didn't use the
pathspec machinery (because without the tries, it becomes accidentally
quadratic in the number of paths). I didn't work on those patches,
though (Stolee and Taylor did), and they're not anywhere in my
repository (and I no longer have access to the private github repo).

> This change attempts to start upstreaming this command, but rather
> than adding a command whose interface & semantics may be controversial
> starts by exposing & testing the core of its library through a new
> test helper.
> 
> An eventual "git blame-tree" command, or e.g. a new format for "git
> ls-tree" to show what a path "blames" to can then be implemented with
> this library.

OK. It's a little weird, as I do think the interface and semantics are
the more interesting part, but this certainly isn't hurting anybody to
go this route.

> * Removing the "--max-depth" changes to the diff code. We'll need
>   those eventually, but it's not required for a blame of a given list
>   of paths.
> 
>   As has been noted in previous on-list discussions the semantics of
>   the "max-depth" changes might be controversial, so it's worthwhile
>   to split those out so that they can be reviewed separately.

That's probably reasonable. The only two interesting "depths" are really
"recurse" and "don't recurse" (where "recurse" is probably what you'd
put in a user-facing tool, and "don't recurse" is what a site like
GitHub uses to do the blame for a single level of tree it's showing).
And that narrows the problem space quite a bit.

> * Made the "blame-tree" helper take "--" before any revision options,
>   for clarity. An eventual built-in command (if any) probably doesn't
>   want to enforce this, but it makes it clearer in the test helper
>   what's an argument for "blame-tree" itself, and what's an argument for
>   the revision machinery.

OK. Since this is just a test-helper, we don't care too much either way.

> * Minor updates for using C99 syntax, and "size_t" instead of "int"
>   when we're iterating over types whose "nr" is that size.

Reasonable.

> * Avoid sub-shelling in the tests, use "test-tool -C .." instead.

Yeah, the original code probably predates "-C". ;)

> The range-diff here is to peff's jk/blame-tree-wip. As noted above
> this is far from the full thing, but hopefully getting the basic bits
> of the library (sans the max-depth question) will make the review of
> subsequent bits easier.

I doubt this range-diff is useful to anybody. I'm probably the person
most likely to make sense of it, and it means nothing to me. It probably
makes sense conceptually to just treat this as a new topic that happens
to be based on older work.

But if you did want to base it on something, you probably ought to do so
on what GitHub is currently running in production (and again, talk to
Taylor or Stolee for that). Unless the intent is to use this as a base
for showing their changes. Though even then, I'm not sure the
intermediate state is all that interesting.

> [oodles of patch]

TBH, I didn't really look at this closely. It's been a decade, so even
for me reviewing this would basically be looking at it from scratch. And
as my Git time is a bit limited these days, I can't really promise
timely review of a big topic.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] blame-tree: add library and tests via "test-tool blame-tree"
  2023-02-10 15:42 ` Derrick Stolee
@ 2023-03-07 13:56   ` Ævar Arnfjörð Bjarmason
  2023-03-08 15:30     ` Derrick Stolee
  2023-03-08 15:49     ` Taylor Blau
  0 siblings, 2 replies; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-03-07 13:56 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Junio C Hamano, Jeff King, Taylor Blau


On Fri, Feb 10 2023, Derrick Stolee wrote:

> On 2/5/2023 3:47 PM, Ævar Arnfjörð Bjarmason wrote:
> [...]
>> +struct blame_tree_options {
>> +	struct object_id oid;
>> +	const char *prefix;
>> +	unsigned int recursive;
>> +	struct strvec args;
>> +};
>> +#define BLAME_TREE_OPTIONS_INIT(...) { \
>> +	.args = STRVEC_INIT, \
>> +	__VA_ARGS__ \
>> +}
>> +void blame_tree_opts_release(struct blame_tree_options *bto);
>> +
>> +struct blame_tree {
>> +	struct string_list paths;
>> +	struct rev_info rev;
>> +};
>> +#define BLAME_TREE_INIT { \
>> +	.paths = STRING_LIST_INIT_DUP, \
>> +	.rev = REV_INFO_INIT, \
>> +}
>> +
>> +void blame_tree_init(struct blame_tree *bt,
>> +		     const struct blame_tree_options *opts);
>> +void blame_tree_release(struct blame_tree *);
>> +
>> +typedef void (*blame_tree_fn)(const char *path, const struct commit *commit,
>> +			      void *data);
>> +int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *data);
>
> However, this API is too leaky. Specifically, it allows passing arbitrary
> 'args' instead of structured options.
>
> Before I get into what I think needs to change here, let's look at the
> initial implementation:
>
>> +int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *cbdata)
>> +{
>> +	struct blame_tree_callback_data data;
>> +
>> +	data.paths = &bt->paths;
>> +	data.num_interesting = bt->paths.nr;
>> +	data.callback = cb;
>> +	data.callback_data = cbdata;
>> +
>> +	bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
>> +	bt->rev.diffopt.format_callback = blame_diff;
>> +	bt->rev.diffopt.format_callback_data = &data;
>> +
>> +	prepare_revision_walk(&bt->rev);
>> +
>> +	while (data.num_interesting) {
>> +		data.commit = get_revision(&bt->rev);
>> +		if (!data.commit)
>> +			break;
>> +
>> +		if (data.commit->object.flags & BOUNDARY) {
>> +			diff_tree_oid(the_hash_algo->empty_tree,
>> +				       &data.commit->object.oid,
>> +				       "", &bt->rev.diffopt);
>> +			diff_flush(&bt->rev.diffopt);
>> +		} else {
>> +			log_tree_commit(&bt->rev, data.commit);
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>
> When I think of 'blame-tree', I think of the following output:
>
>   For each path (within a pathspec, recursive or not), output the first
>   commit that changed that path according to simplified file history. This
>   history walk begins at a single commit, but may be terminated by a
>   negative commit range, so the output could indicate that we reached the
>   boundary.
>
> With that definition, the most-obvious first implementation would be to
> first generate the path list, then run
> `git rev-list -n 1 <revisions> -- <path>` for each path in the pathspec.
> This could be done by N prepare_revision_walk()/get_revision() executions
> within a loop instead of actually executing subcommands.
>
> The implementation in this patch is simultaneously smarter than that basic
> approach, but also not smart enough for the best performance. In order to
> get the optimal performance, the following parts of my output definition
> are critical:
>
>  1. We use simplified file history.
>  2. The history walk begins at a single commit.
>
> If either of these conditions are broken, then the current-best algorithm
> cannot be used (and even more: our proactive caching mechanism cannot be
> used).
>
> The API as currently defined does not allow us to enforce that restriction
> because the arbitrary arguments could specify `--full-history`, or worse
> `--simplify-merge` (and also `--first-parent`) to modify the history mode
> or even `--all` to modify the number of starting commits.

There's already a limitation on --all in this patch, or anything else
that would yield a many<->many rev<->path relationship, e.g. "--all" or
"HEAD HEAD~", that's covered by the "can only blame one tree at a time"
error condition.

For --full-history v.s. --simplify-merge the output on git.git at least
(maybe not on others?) would be the same:

	diff -u <(t/helper/test-tool -C "$PWD/t" blame-tree -- --full-history) <(t/helper/test-tool -C "$PWD/t" blame-tree -- --simplify-merge)

But of course for --first-parent it's not, but more on that below.

> The version we use in our custom fork has the historical baggage of this
> open-ended argument-parsing, but because we have full control of the
> callers to the CLI, we have enforced that those arguments are not being
> used. While we are not currently reviewing the CLI of a builtin, the API
> layer is essentially providing a pass-through of arbitrary revision options.
>
> I am happy to see that the 'struct blame_tree_options' includes a callback
> for the output, as that is valuable for both reporting results to stdout
> or to collect results into a list for caching purposes, in the future
> where we have that ability.
>
> --- Recommended API ---
>
> Using 'struct blame_tree_options' instead of many parameters creates a
> good future-proof method prototype, so we can always _add_ options when
> explicitly understood as important to callers of one kind or another.
>
> However, to drop the arbitrary 'args' we need to instead make it very
> explicit which commits are being used in the history walk.
>
> struct blame_tree_options {
> 	struct object_id oid;
> 	const char *prefix;
> 	unsigned int recursive;
> 	struct commit *start;
> 	struct commit_list *uninteresting;
> };
>
> This definition of the struct should be enough to demonstrate the behavior
> we are describing. However, for the v1 of the API, we may even want to
> completely remove the 'uninteresting' choice, and instead focus only on a
> single starting position ('start'). If we decide that 'uninteresting' is
> valuable, then it can be added again later, hopefully after the primary
> use of this command is established.
>
> Again, thinking about the lifetime of this command in our infrastructure,
> the commit range behavior was used at one point as a performance improvement
> where a cache was given for a specific starting commit B, then we
> dynamically computed the blame-tree for the range B..A when given a new
> commit A, and merged the two results together. However, this was not always
> correct due to complexities around parallel changes. A different caching
> mechanism was built into the low-level algorithm itself which resolves
> these complications, but also _that cache cannot be used when given range
> queries_.
>
> Further, I recommend building the simplest implementation first, since it
> is actually closer to the very fast implementation in that it has two
> parts:
>
>  1. Compute the list of paths at the tip.
>  2. Perform history walks for those paths.
>
> The fast implementation does a single history walk that essentially walks
> the simplified history for all of the paths simultaneously, but it is
> critical to have that list of paths at the start of that walk. In that way
> it's actually easier to adapt the simpler algorithm to the current-best
> algorithm than it is to adapt the smart algorithm from this patch to the
> current-best algorithm.
>
> All this is to say, that I'd like to see this API start with the smallest
> possible surface area and with the simplest implementation, and then I'd
> be happy to contribute those algorithms within the API boundary while the
> CLI is handled independently.

I hear your concern about leaving this open for optimization, and in
general I'd vehemently agree with it, except for needing to eventually
feed a command-line to setup_revisions().

Ideally the revision API would make what you're describing easy, but the
way it's currently implemented (and changing it would be a much larger
project) someone who'd like to pass structured options in the way you'd
describe will end up having to re-implement bug-for-bug compatible
versions of some subset of the option parsing in revision.c.

Isn't a way to get the best of both worlds to have a small snippet of
code that inspects the "struct rev_info" before & after
setup_revisions(), and which would only implement certain optimizations
if certain known options are provided, but not if any unknown ones are?

That way those who'd like the faster happy path could use that subset of
options, while the general API would allow any revision options. We'd
then error() or BUG() out only if we fail to map our expected paths to
OIDs.

Specifically, I'm thinking of something similar to what
match_stat_with_submodule() in diff-lib.c now does with "orig_flags".

I also think as a practical matter it's acceptable for us to leave the
underlying low-level API open-ended for potential (ab)use, while any
porcelain or plumbing tool for this would document that a given narrow
subset of revision options is what we expect.

That's what we do for format-patch, and e.g. as of my df569c3f31f
(range-diff doc: add a section about output stability, 2018-11-09)
range-diff explicitly documents that feeding it options it won't expect
might result in nonsense output.

I think those are all good ways forward here, and I'd much prefer those
to having to re-implement or pull out subsets of the current option
parsing logic in revision.c. What do you think?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] blame-tree: add library and tests via "test-tool blame-tree"
  2023-03-07 13:56   ` Ævar Arnfjörð Bjarmason
@ 2023-03-08 15:30     ` Derrick Stolee
  2023-03-08 15:49     ` Taylor Blau
  1 sibling, 0 replies; 6+ messages in thread
From: Derrick Stolee @ 2023-03-08 15:30 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Jeff King, Taylor Blau

On 3/7/2023 8:56 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Feb 10 2023, Derrick Stolee wrote:

>> All this is to say, that I'd like to see this API start with the smallest
>> possible surface area and with the simplest implementation, and then I'd
>> be happy to contribute those algorithms within the API boundary while the
>> CLI is handled independently.
> 
> I hear your concern about leaving this open for optimization, and in
> general I'd vehemently agree with it, except for needing to eventually
> feed a command-line to setup_revisions().

The most-correct way to build this, with full optimizations, does not
involve revisions.c at all, so this "eventually" is incorrect. It's
only something to do for the "first" implementation, as a reference.

In order to do the single-walk approach for every path simultaneously,
we _must_ have full control of the commit walk. There was a time where
we had done a single-walk approach by letting the revision machinery
walk all commits that changed the base tree, then looked for changes
to the contained paths. However, this results in _incorrect_ results
because commits that would normally be ignored by the simplified
history walk for "<dir>/<entry>" were not ignored by the simplified
history walk for "<dir>/" and thus that algorithm presented _incorrect
results_.

For that reason, doing a single walk that outputs the blame-tree
results for each path must have full control over which commits are
walked and which paths could emit a change for those commits. This
means we must not use revision.c as a base for full control.

> Ideally the revision API would make what you're describing easy, but the
> way it's currently implemented (and changing it would be a much larger
> project) someone who'd like to pass structured options in the way you'd
> describe will end up having to re-implement bug-for-bug compatible
> versions of some subset of the option parsing in revision.c.

The subset of option parsing is "a starting revision" and "a base tree"
and _perhaps_ "is the diff recursive or not?" (and this last one isn't
even in revision.c yet). That does not seem like using revision.c's
parsing is actually helpful at all.

> Isn't a way to get the best of both worlds to have a small snippet of
> code that inspects the "struct rev_info" before & after
> setup_revisions(), and which would only implement certain optimizations
> if certain known options are provided, but not if any unknown ones are?
> 
> That way those who'd like the faster happy path could use that subset of
> options, while the general API would allow any revision options. We'd
> then error() or BUG() out only if we fail to map our expected paths to
> OIDs.

This option requires examining the long and ever-growing list of options
to struct rev_info which will take much more work than parsing a starting
ref and a path from the command-line.

> I think those are all good ways forward here, and I'd much prefer those
> to having to re-implement or pull out subsets of the current option
> parsing logic in revision.c. What do you think?

I think you are skirting over the difficult part about upstreaming the
blame-tree command, which is the biggest reason we have not done it in
the past. The way it is implemented in our fork started with this "just
parse args using revision.c" because that's the easiest way to implement
the naive implementation, but we were able to make optimizations on top
only because we had full control over the callers not using any other
options. We would not have been able to make the assumptions that allowed
those performance enhancements without that control. Actually building the
interface in a way that guarantees the behavior will be stable and
understood is not easy, but is worth doing well.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] blame-tree: add library and tests via "test-tool blame-tree"
  2023-03-07 13:56   ` Ævar Arnfjörð Bjarmason
  2023-03-08 15:30     ` Derrick Stolee
@ 2023-03-08 15:49     ` Taylor Blau
  1 sibling, 0 replies; 6+ messages in thread
From: Taylor Blau @ 2023-03-08 15:49 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee, git, Junio C Hamano, Jeff King

On Tue, Mar 07, 2023 at 02:56:29PM +0100, Ævar Arnfjörð Bjarmason wrote:
> I hear your concern about leaving this open for optimization, and in
> general I'd vehemently agree with it, except for needing to eventually
> feed a command-line to setup_revisions().
>
> Ideally the revision API would make what you're describing easy, but the
> way it's currently implemented (and changing it would be a much larger
> project) someone who'd like to pass structured options in the way you'd
> describe will end up having to re-implement bug-for-bug compatible
> versions of some subset of the option parsing in revision.c.

I get what you are both saying here, but I think I find myself tending
to agree with Ævar a little bit more here.

In an ideal world, sure, having the blame-tree API take a single struct
called 'blame_tree_options' would be very clean. But the crux is that we
have to pass some arguments to setup_revisions(), and that our problems
here stem from the leakiness of *that* API, not this one.

I ran into a similar problem when looking at rewriting the bitmap
traversal code a year or so ago (which is sadly still on my to-do list).

Without getting into the details, part of that work involved calling
limit_list() as a function of setup_revisions() to discover the
traversal boundary. And if the caller happened to put --topo-order in
their command-line arguments, we wouldn't end up calling limit_list() at
all, since (as Stolee well knows ;-)) those two code paths are quite
different.

I can't recall if I either detected if '--topo-order' was passed (by
looking to see if `revs.topo_order` was set), or grafted an extra
`--no-topo-order` argument onto the end of the list. Either way, I think
those two problems are more or less equivalent in this context, and that
it seemed like a much more expedient solution to work around the
fundamental leakiness of the setup_revision() API rather than refactor
it.

> Isn't a way to get the best of both worlds to have a small snippet of
> code that inspects the "struct rev_info" before & after
> setup_revisions(), and which would only implement certain optimizations
> if certain known options are provided, but not if any unknown ones are?

Yeah, I think this is basically the same as my "let's check if the caller
passed `--topo-order` by checking the `revs.topo_order` bit" above.

> I think those are all good ways forward here, and I'd much prefer those
> to having to re-implement or pull out subsets of the current option
> parsing logic in revision.c. What do you think?

Same :-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-03-08 15:50 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-05 20:47 [PATCH] blame-tree: add library and tests via "test-tool blame-tree" Ævar Arnfjörð Bjarmason
2023-02-10 15:42 ` Derrick Stolee
2023-03-07 13:56   ` Ævar Arnfjörð Bjarmason
2023-03-08 15:30     ` Derrick Stolee
2023-03-08 15:49     ` Taylor Blau
2023-02-17 20:42 ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).