git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, peff@peff.net, avarab@gmail.com,
	jrnieder@gmail.com, Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH v4 4/6] revision: implement sparse algorithm
Date: Fri, 11 Jan 2019 15:20:06 -0800
Message-ID: <xmqqh8eeye6x.fsf@gitster-ct.c.googlers.com> (raw)
In-Reply-To: <c44172c35ece7aafec02c7f3c8438ccca4f69023.1544822533.git.gitgitgadget@gmail.com> (Derrick Stolee via GitGitGadget's message of "Fri, 14 Dec 2018 13:22:21 -0800 (PST)")

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> When enumerating objects to place in a pack-file during 'git
> pack-objects --revs', we discover the "frontier" of commits
> that we care about and the boundary with commit we find
> uninteresting. From that point, we walk trees to discover which
> trees and blobs are uninteresting. Finally, we walk trees from the
> interesting commits to find the interesting objects that are
> placed in the pack.
>
> This commit introduces a new, "sparse" way to discover the
> uninteresting trees. We use the perspective of a single user trying
> to push their topic to a large repository. That user likely changed
> a very small fraction of the paths in their working directory, but
> we spend a lot of time walking all reachable trees.
>
> The way to switch the logic to work in this sparse way is to start
> caring about which paths introduce new trees. While it is not
> possible to generate a diff between the frontier boundary and all
> of the interesting commits, we can simulate that behavior by
> inspecting all of the root trees as a whole, then recursing down
> to the set of trees at each path.
>
> We already had taken the first step by passing an oidset to
> mark_trees_uninteresting_sparse(). We now create a dictionary
> whose keys are paths and values are oidsets. We consider the set
> of trees that appear at each path. While we inspect a tree, we
> add its subtrees to the oidsets corresponding to the tree entry's
> path. We also mark trees as UNINTERESTING if the tree we are
> parsing is UNINTERESTING.
>
> To actually improve the peformance, we need to terminate our

performance?

> recursion. If the oidset contains only UNINTERESTING trees, then

So, at this point, it is not quite clear what "the oidset" refers to
to me.  When you are looking at the root tree of one commit, you'd
add the object ID of the tree to the dictionary keyed by the path
(i.e. root of the tree).  You traverse to the parent and keep adding
these root trees, and after a while you'd hit a negative commit and
add its tree with UNINTERESTING bit.  Before that point, the oidset
for the root of the tree has only interesting trees, so you won't
recurse into its contents, but now at the final point you'd walk to
propagate UNINTERESTING bit down?

> we do not continue the recursion. This avoids walking trees that
> are likely to not be reachable from interesting trees. If the
> oidset contains only interesting trees, then we will walk these
> trees in the final stage that collects the intersting objects to
> place in the pack. Thus, we only recurse if the oidset contains
> both interesting and UNINITERESTING trees.
>
> There are a few ways that this is not a universally better option.
>
> First, we can pack extra objects. If someone copies a subtree
> from one tree to another, the first tree will appear UNINTERESTING
> and we will not recurse to see that the subtree should also be
> UNINTERESTING. We will walk the new tree and see the subtree as
> a "new" object and add it to the pack. We add a test case that
> demonstrates this as a way to prove that the --sparse option is
> actually working.

That's an interesting use of the word "working" ;-)

> Second, we can have extra memory pressure. If instead of being a
> single user pushing a small topic we are a server sending new
> objects from across the entire working directory, then we will
> gain very little (the recursion will rarely terminate early) but
> will spend extra time maintaining the path-oidset dictionaries.
> ...
> I used the number of walked trees the main metric above because
> it is consistent across multiple runs. When I ran my tests, the
> performance of the pack-objects command with the same options
> could change the end-to-end time by 10x depending on the file
> system being warm. However, by repeating the same test on repeat
> I could get more consistent timing results. The git.git and
> Linux tests were too fast overall (less than 0.5s) to measure
> an end-to-end difference. The Azure DevOps case was slow enough
> to see the time improve from 15s to 1s in the warm case. The
> cold case was 90s to 9s in my testing.

Understandable.  Projects that tend to be deep (e.g. Java?) would
likely to benefit if they are structured and partitioned very well.

> +struct path_and_oids_entry {
> +	struct hashmap_entry ent;
> +	char *path;
> +	struct oidset set;

Again "set"?  If I understand the description of the logic explained
in the proposed commit log message correctly, then this is a set of
"tree objects at this path", right?  At least calling this "trees"
may give us slightly better indication.

> +static int path_and_oids_cmp(const void *hashmap_cmp_fn_data,
> +			     const struct path_and_oids_entry *e1,
> +			     const struct path_and_oids_entry *e2,
> +			     const void *keydata)
> +{
> +	return strcmp(e1->path, e2->path);
> +}
> +
> +int map_flags = 0;

That's too bland a name for a system-wide global variable.  If it is
a file-scope "static", that might be more acceptable.

In any case, let BSS take care of the initialization and do not
initialize an int explicitly to 0.

Is it even used?  At least at this step the comparison callback
function does not seem to make any use of the fn-data, so a better
organization may be to delay the introduction of this variable to a
later step where it actually gets used, at which time, the code may
have a better context to give the variable a more suitable name.

> +static void paths_and_oids_init(struct hashmap *map)
> +{
> +	hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, &map_flags, 0);
> +}
> +
> +static void paths_and_oids_clear(struct hashmap *map)
> +{
> +	struct hashmap_iter iter;
> +	struct path_and_oids_entry *entry;
> +	hashmap_iter_init(map, &iter);
> +
> +	while ((entry = (struct path_and_oids_entry *)hashmap_iter_next(&iter))) {
> +		oidset_clear(&entry->set);
> +		free(entry->path);
> +	}
> +
> +	hashmap_free(map, 1);
> +}
> +
> +static void paths_and_oids_insert(struct hashmap *map,
> +				  const char *path,
> +				  const struct object_id *oid)
> +{
> +	int hash = strhash(path);
> +	struct path_and_oids_entry key;
> +	struct path_and_oids_entry *entry;
> +
> +	hashmap_entry_init(&key, hash);
> +	key.path = xstrdup(path);
> +	oidset_init(&key.set, 0);
> +
> +	if (!(entry = (struct path_and_oids_entry *)hashmap_get(map, &key, NULL))) {
> +		entry = xcalloc(1, sizeof(struct path_and_oids_entry));
> +		hashmap_entry_init(entry, hash);
> +		entry->path = key.path;
> +		oidset_init(&entry->set, 16);
> +		hashmap_put(map, entry);
> +	} else {
> +		free(key.path);
> +	}

Hmph, having to always allocate only to free sounds wasteful.  

Can't we instead do the look-up with "path" that are given read-only
to us, and allocate only when we are creating the entry in the
dictionary for the first time?  After all, we'd make many look-ups
of the same path to add trees for the same path and only one of them
(i.e. the first one) is used to create an entry in the dictionary.

Or does it not matter?

> +
> +	oidset_insert(&entry->set, oid);
> +}
> +
> +static void add_children_by_path(struct repository *r,
> +				 struct tree *tree,
> +				 struct hashmap *map)
> +{
> +	struct tree_desc desc;
> +	struct name_entry entry;
> +
> +	if (!tree)
> +		return;
> +
> +	if (parse_tree_gently(tree, 1) < 0)
> +		return;
> +
> +	init_tree_desc(&desc, tree->buffer, tree->size);
> +	while (tree_entry(&desc, &entry)) {
> +		switch (object_type(entry.mode)) {
> +		case OBJ_TREE:
> +			paths_and_oids_insert(map, entry.path, entry.oid);
> +
> +			if (tree->object.flags & UNINTERESTING) {
> +				struct tree *child = lookup_tree(r, entry.oid);
> +				if (child)
> +					child->object.flags |= UNINTERESTING;
> +			}
> +			break;
> +		case OBJ_BLOB:
> +			if (tree->object.flags & UNINTERESTING) {
> +				struct blob *child = lookup_blob(r, entry.oid);
> +				if (child)
> +					child->object.flags |= UNINTERESTING;
> +			}
> +			break;
> +		default:
> +			/* Subproject commit - not in this repository */
> +			break;
> +		}
> +	}
> +
> +	free_tree_buffer(tree);
> +}

OK.

>  void mark_trees_uninteresting_sparse(struct repository *r,
>  				     struct oidset *set)
>  {
> +	unsigned has_interesting = 0, has_uninteresting = 0;
> +	struct hashmap map;
> +	struct hashmap_iter map_iter;
> +	struct path_and_oids_entry *entry;
>  	struct object_id *oid;
>  	struct oidset_iter iter;
>  
>  	oidset_iter_init(set, &iter);
> -	while ((oid = oidset_iter_next(&iter))) {
> +	while ((!has_interesting || !has_uninteresting) &&
> +	       (oid = oidset_iter_next(&iter))) {
>  		struct tree *tree = lookup_tree(r, oid);
>  
>  		if (!tree)
>  			continue;
>  
> +		if (tree->object.flags & UNINTERESTING)
> +			has_uninteresting = 1;
> +		else
> +			has_interesting = 1;
> +	}
> +
> +	/* Do not walk unless we have both types of trees. */
> +	if (!has_uninteresting || !has_interesting)
> +		return;

OK.

> +	paths_and_oids_init(&map);
> +
> +	oidset_iter_init(set, &iter);
> +	while ((oid = oidset_iter_next(&iter))) {
> +		struct tree *tree = lookup_tree(r, oid);
> +		add_children_by_path(r, tree, &map);
>  	}
> +
> +	hashmap_iter_init(&map, &map_iter);
> +	while ((entry = hashmap_iter_next(&map_iter)))
> +		mark_trees_uninteresting_sparse(r, &entry->set);
> +
> +	paths_and_oids_clear(&map);
>  }
>  
>  struct commit_stack {
> diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
> index 81f6805bc3..45dba6e014 100755
> --- a/t/t5322-pack-objects-sparse.sh
> +++ b/t/t5322-pack-objects-sparse.sh
> @@ -83,22 +83,25 @@ test_expect_success 'sparse pack-objects' '
>  	test_cmp expect_objects.txt sparse_objects.txt
>  '
>  
> +# Demonstrate that the algorithms differ when we copy a tree wholesale
> +# from one folder to another.
> +
>  test_expect_success 'duplicate a folder from f1 into f3' '
>  	mkdir f3/f4 &&
>  	cp -r f1/f1/* f3/f4 &&
>  	git add f3/f4 &&
>  	git commit -m "Copied f1/f1 to f3/f4" &&
> -	cat >packinput.txt <<-EOF &&
> +	cat >packinput.txt <<-EOF
>  	topic1
>  	^topic1~1
>  	EOF
> -	git rev-parse		\
> -		topic1		\
> -		topic1^{tree}	\
> -		topic1:f3 | sort >expect_objects.txt
>  '
>  
>  test_expect_success 'non-sparse pack-objects' '
> +	git rev-parse			\
> +		topic1			\
> +		topic1^{tree}		\
> +		topic1:f3 | sort >expect_objects.txt &&

OK, the change above sort of makes sense ;-)

>  	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
>  	git index-pack -o nonsparse.idx nonsparse.pack &&
>  	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
> @@ -106,10 +109,16 @@ test_expect_success 'non-sparse pack-objects' '
>  '
>  
>  test_expect_success 'sparse pack-objects' '
> +	git rev-parse			\
> +		topic1			\
> +		topic1^{tree}		\
> +		topic1:f3		\
> +		topic1:f3/f4		\
> +		topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt &&

And the reason why f4 and its contents are wastefully sent is
because the whole tree is duplicated, which is understandable.

>  	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
>  	git index-pack -o sparse.idx sparse.pack &&
>  	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
> -	test_cmp expect_objects.txt sparse_objects.txt
> +	test_cmp expect_sparse_objects.txt sparse_objects.txt
>  '
>  
>  test_done

  parent reply index

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-28 21:52 [PATCH 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
2018-11-28 21:52 ` [PATCH 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
2018-11-28 21:52 ` [PATCH 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
2018-11-28 21:52 ` [PATCH 3/5] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
2018-11-28 22:11   ` Stefan Beller
2018-11-29 14:20     ` Derrick Stolee
2018-11-30  2:39       ` Junio C Hamano
2018-11-30 15:53         ` Derrick Stolee
2018-11-28 21:52 ` [PATCH 4/5] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
2018-11-28 21:52 ` [PATCH 5/5] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
2018-11-28 22:18 ` [PATCH 0/5] Add a new "sparse" tree walk algorithm Ævar Arnfjörð Bjarmason
2018-11-29  4:05   ` Derrick Stolee
2018-11-29 14:24 ` [PATCH v2 0/6] " Derrick Stolee via GitGitGadget
2018-11-29 14:24   ` [PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
2018-11-29 14:24   ` [PATCH v2 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
2018-11-29 14:24   ` [PATCH v2 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
2018-11-29 14:24   ` [PATCH v2 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
2018-11-29 14:24   ` [PATCH v2 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
2018-11-29 14:24   ` [PATCH v2 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
2018-12-10 16:42   ` [PATCH v3 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
2018-12-10 16:42     ` [PATCH v3 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
2018-12-10 16:42     ` [PATCH v3 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
2018-12-10 16:42     ` [PATCH v3 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
2018-12-10 16:42     ` [PATCH v3 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
2018-12-10 16:42     ` [PATCH v3 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
2018-12-10 16:42     ` [PATCH v3 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
2018-12-14 21:22     ` [PATCH v4 0/6] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
2018-12-14 21:22       ` [PATCH v4 1/6] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
2019-01-11 19:43         ` Junio C Hamano
2019-01-11 20:25           ` Junio C Hamano
2019-01-11 22:05             ` Derrick Stolee
2018-12-14 21:22       ` [PATCH v4 2/6] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
2019-01-11 23:20         ` Junio C Hamano
2018-12-14 21:22       ` [PATCH v4 3/6] pack-objects: add --sparse option Derrick Stolee via GitGitGadget
2019-01-11 22:30         ` Junio C Hamano
2019-01-15 15:06           ` Derrick Stolee
2019-01-15 18:23             ` Junio C Hamano
2018-12-14 21:22       ` [PATCH v4 4/6] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
2018-12-14 23:32         ` Ævar Arnfjörð Bjarmason
2018-12-17 14:20           ` Derrick Stolee
2018-12-17 14:26             ` Ævar Arnfjörð Bjarmason
2018-12-17 14:50               ` Derrick Stolee
2019-01-11 23:20         ` Junio C Hamano [this message]
2018-12-14 21:22       ` [PATCH v4 5/6] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
2018-12-14 21:22       ` [PATCH v4 6/6] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget
2019-01-16 18:25       ` [PATCH v5 0/5] Add a new "sparse" tree walk algorithm Derrick Stolee via GitGitGadget
2019-01-16 18:25         ` [PATCH v5 2/5] list-objects: consume sparse tree walk Derrick Stolee via GitGitGadget
2019-01-16 18:25         ` [PATCH v5 1/5] revision: add mark_tree_uninteresting_sparse Derrick Stolee via GitGitGadget
2019-01-16 18:25         ` [PATCH v5 3/5] revision: implement sparse algorithm Derrick Stolee via GitGitGadget
2019-01-16 18:26         ` [PATCH v5 4/5] pack-objects: create pack.useSparse setting Derrick Stolee via GitGitGadget
2019-01-16 18:26         ` [PATCH v5 5/5] pack-objects: create GIT_TEST_PACK_SPARSE Derrick Stolee via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqh8eeye6x.fsf@gitster-ct.c.googlers.com \
    --to=gitster@pobox.com \
    --cc=avarab@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=jrnieder@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

git@vger.kernel.org list mirror (unofficial, one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git