git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCHv2] push: change submodule default to check
@ 2016-08-24 17:30 Stefan Beller
  2016-08-24 18:38 ` Junio C Hamano
       [not found] ` <20160824183112.ceekegpzavnbybxp@sigill.intra.peff.net>
  0 siblings, 2 replies; 25+ messages in thread
From: Stefan Beller @ 2016-08-24 17:30 UTC (permalink / raw)
  Cc: git, hvoigt, Jens.Lehmann, iveqy, leandro.lucarella, gitster,
	Stefan Beller

When working with submodules, it is easy to forget to push the submodules.
The setting 'check', which checks if any existing submodule is present on
at least one remote of the submodule remotes, is designed to prevent this
mistake.

Flipping the default to check for submodules is safer than the current
default of ignoring submodules while pushing.

Signed-off-by: Stefan Beller <sbeller@google.com>
---

Slightly reworded commit message than in v1,
(https://public-inbox.org/git/20160817204848.8983-1-sbeller@google.com/)
The patch itself is however the same.

I just push it out now with a new commit message, such that we can easier
pick it up later for Git 3.0, when changes that change default make more sense.

As said in an earlier message, you could however also argue that this is
fixing a bug in your workflow, so it might be worth fixing before 3.0
as well. I dunno.

Thanks,
Stefan

 builtin/push.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/push.c b/builtin/push.c
index 3bb9d6b..479150a 100644
--- a/builtin/push.c
+++ b/builtin/push.c
@@ -22,7 +22,7 @@ static int deleterefs;
 static const char *receivepack;
 static int verbosity;
 static int progress = -1;
-static int recurse_submodules = RECURSE_SUBMODULES_DEFAULT;
+static int recurse_submodules = RECURSE_SUBMODULES_CHECK;
 static enum transport_family family;
 
 static struct push_cas_option cas;
-- 
2.10.0.rc1.1.g1ceb01a


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCHv2] push: change submodule default to check
  2016-08-24 17:30 [PATCHv2] push: change submodule default to check Stefan Beller
@ 2016-08-24 18:38 ` Junio C Hamano
       [not found] ` <20160824183112.ceekegpzavnbybxp@sigill.intra.peff.net>
  1 sibling, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2016-08-24 18:38 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git, hvoigt, Jens.Lehmann, iveqy, leandro.lucarella

Stefan Beller <sbeller@google.com> writes:

> When working with submodules, it is easy to forget to push the submodules.
> The setting 'check', which checks if any existing submodule is present on
> at least one remote of the submodule remotes, is designed to prevent this
> mistake.
>
> Flipping the default to check for submodules is safer than the current
> default of ignoring submodules while pushing.
>
> Signed-off-by: Stefan Beller <sbeller@google.com>
> ---
>
> Slightly reworded commit message than in v1,
> (https://public-inbox.org/git/20160817204848.8983-1-sbeller@google.com/)
> The patch itself is however the same.
>
> I just push it out now with a new commit message, such that we can easier
> pick it up later for Git 3.0, when changes that change default make more sense.
>
> As said in an earlier message, you could however also argue that this is
> fixing a bug in your workflow, so it might be worth fixing before 3.0
> as well. I dunno.

With this change suddenly landing on the version of Git a user just
updated, the only possible negative effect would be that somebody
who did not configure push.recurseSubmodules suddenly starts getting
an error.  What would the error message say?

What I want you to think about and explain in the justification is
if it gives the user enough information to decide the best course of
action to adjust to the new world order, when the user's expectation
has been that the superproject gets pushed independent from its
submodules.  If the existing error message gives enough information,
explains why Git suddenly started refusing the push is generally a
good thing for the user, tells some minority users (in whose
workflow it is perfectly safe to push out the toplevel independently
from the submodule) how to turn it back to the old behaviour clearly
enough, then this single-liner may be good enough.  Otherwise we may
need a new logic to allow us to see if recurse_submodules that is
set to RECURSE_SUBMODULES_CHECK is because the user explicitly
configured, or because the user did not have any configuration, and
issue a different message in the latter case.



>  builtin/push.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/builtin/push.c b/builtin/push.c
> index 3bb9d6b..479150a 100644
> --- a/builtin/push.c
> +++ b/builtin/push.c
> @@ -22,7 +22,7 @@ static int deleterefs;
>  static const char *receivepack;
>  static int verbosity;
>  static int progress = -1;
> -static int recurse_submodules = RECURSE_SUBMODULES_DEFAULT;
> +static int recurse_submodules = RECURSE_SUBMODULES_CHECK;
>  static enum transport_family family;
>  
>  static struct push_cas_option cas;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCHv2] push: change submodule default to check
       [not found] ` <20160824183112.ceekegpzavnbybxp@sigill.intra.peff.net>
@ 2016-08-24 19:37   ` Junio C Hamano
  2016-08-24 21:26     ` Junio C Hamano
  2016-08-24 22:37     ` Stefan Beller
  0 siblings, 2 replies; 25+ messages in thread
From: Junio C Hamano @ 2016-08-24 19:37 UTC (permalink / raw)
  To: Jeff King
  Cc: Stefan Beller, git, hvoigt, Jens.Lehmann, iveqy,
	leandro.lucarella

Jeff King <peff@peff.net> writes:

This seems to be dropped from the list, probably due to no "To:"
header in the original, which led to "no", "To-header" "on" and
"input <" on YOUR recipient list, so I am quoting it in full without
trimming.

> On Wed, Aug 24, 2016 at 10:30:17AM -0700, Stefan Beller wrote:
>
>> When working with submodules, it is easy to forget to push the submodules.
>> The setting 'check', which checks if any existing submodule is present on
>> at least one remote of the submodule remotes, is designed to prevent this
>> mistake.
>> 
>> Flipping the default to check for submodules is safer than the current
>> default of ignoring submodules while pushing.
>
> It is safer, and that's good. But it's also slower, because it requires
> an extra traversal of all of the pushed commits. And now people will
> have to pay the price even if they are not using submodules at all.
>
> For instance, try this from a checkout of linux.git:
>
>   for i in no check; do
> 	rm -rf dst.git
> 	git init --bare dst.git
> 	echo "==> Pushing with submodules=$i"
> 	time git push --recurse-submodules=$i dst.git HEAD
>   done
>
> The second case takes 30-40 seconds longer. This is a full push of
> history, so it's an extreme case[1], but it's still rather unfortunate.
>
> Can we tie this default to some sign that submodules are actually in
> use? I don't think the presence of .gitmodules is perfect (because you
> might be in a bare repo, for example, and have just fetched some other
> history you are relaying), but it may be a good compromise.  I'm
> envisioning something like "--recurse-submodules=auto-check" which
> auto-detects common situations (e.g., presence of .gitmodules or
> .git/modules checkouts) and enables "check", and then setting the
> default to that in the long run.
>
> -Peff
>
> [1] Actually, there is another much worse case lurking there. Try:
>
>       git push --recurse-submodules=check --mirror dst.git
>
>     from the kernel. I didn't let it finish, but I'd estimate it would
>     take on the order of 5 hours. The problem is that push feeds each
>     updated ref tip to find_unpushed_submodules(), so we end up walking
>     over the same history over and over.
>
>     I think it should feed all of the "before" and "after" ref tips it
>     proposes to update to a _single_ revision traversal.

That sounds massively ... broken.  So before even thinking about
flipping it to default, this needs to be fixed first.

>     I also notice that it uses "--remote=...", which is weird, because
>     push knows exactly what it proposes to update, which may be ahead of
>     where our refs/remotes/* cache is. Not to mention that we may be
>     pushing to a remote for which we do not keep tracking refs at all!
>
>     So I'd actually suspect that with your patch, a bare URL like:
>
>       git push https://github.com/peff/linux.git
>
>     would do the full 40-second walk, even if I was only pushing up one
>     or two objects.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCHv2] push: change submodule default to check
  2016-08-24 19:37   ` Junio C Hamano
@ 2016-08-24 21:26     ` Junio C Hamano
  2016-08-24 22:37     ` Stefan Beller
  1 sibling, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2016-08-24 21:26 UTC (permalink / raw)
  To: Jeff King
  Cc: Stefan Beller, git, hvoigt, Jens.Lehmann, iveqy,
	leandro.lucarella

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> For instance, try this from a checkout of linux.git:
>>
>>   for i in no check; do
>> 	rm -rf dst.git
>> 	git init --bare dst.git
>> 	echo "==> Pushing with submodules=$i"
>> 	time git push --recurse-submodules=$i dst.git HEAD
>>   done
>>
>> The second case takes 30-40 seconds longer. This is a full push of
>> history, so it's an extreme case[1], but it's still rather unfortunate.

This actually bit me just now.  github.com/gitster/git.git is
mirror pushed, and it would seem to take forever, so I killed it,
and flipped the configuration variable off.  Which means the feature
won't ever get any testing from me in real life.

People, git.git is not a large project in any sense of the word.

Why wasn't it discovered that "push.recursesubmodules = check" is
UNUSABLE since it was introduced is simply beyond me.

I am mad (which is unusual for me, isn't it? -- I've seen somebody
else saying "I am mad", but I do not think I ever said it here).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCHv2] push: change submodule default to check
  2016-08-24 19:37   ` Junio C Hamano
  2016-08-24 21:26     ` Junio C Hamano
@ 2016-08-24 22:37     ` Stefan Beller
  2016-08-24 23:01       ` Jeff King
  1 sibling, 1 reply; 25+ messages in thread
From: Stefan Beller @ 2016-08-24 22:37 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, git@vger.kernel.org, Heiko Voigt, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

On Wed, Aug 24, 2016 at 12:37 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
>
> This seems to be dropped from the list, probably due to no "To:"
> header in the original, which led to "no", "To-header" "on" and
> "input <" on YOUR recipient list, so I am quoting it in full without
> trimming.
>
>> On Wed, Aug 24, 2016 at 10:30:17AM -0700, Stefan Beller wrote:
>>
>>> When working with submodules, it is easy to forget to push the submodules.
>>> The setting 'check', which checks if any existing submodule is present on
>>> at least one remote of the submodule remotes, is designed to prevent this
>>> mistake.
>>>
>>> Flipping the default to check for submodules is safer than the current
>>> default of ignoring submodules while pushing.
>>
>> It is safer, and that's good. But it's also slower, because it requires
>> an extra traversal of all of the pushed commits. And now people will
>> have to pay the price even if they are not using submodules at all.
>>
>> For instance, try this from a checkout of linux.git:
>>
>>   for i in no check; do
>>       rm -rf dst.git
>>       git init --bare dst.git
>>       echo "==> Pushing with submodules=$i"
>>       time git push --recurse-submodules=$i dst.git HEAD
>>   done
>>
>> The second case takes 30-40 seconds longer. This is a full push of
>> history, so it's an extreme case[1], but it's still rather unfortunate.
>>
>> Can we tie this default to some sign that submodules are actually in
>> use? I don't think the presence of .gitmodules is perfect (because you
>> might be in a bare repo, for example, and have just fetched some other
>> history you are relaying), but it may be a good compromise.  I'm
>> envisioning something like "--recurse-submodules=auto-check" which
>> auto-detects common situations (e.g., presence of .gitmodules or
>> .git/modules checkouts) and enables "check", and then setting the
>> default to that in the long run.
>>
>> -Peff
>>
>> [1] Actually, there is another much worse case lurking there. Try:
>>
>>       git push --recurse-submodules=check --mirror dst.git
>>
>>     from the kernel. I didn't let it finish, but I'd estimate it would
>>     take on the order of 5 hours. The problem is that push feeds each
>>     updated ref tip to find_unpushed_submodules(), so we end up walking
>>     over the same history over and over.
>>
>>     I think it should feed all of the "before" and "after" ref tips it
>>     proposes to update to a _single_ revision traversal.
>
> That sounds massively ... broken.  So before even thinking about
> flipping it to default, this needs to be fixed first.
>

I agree. That sounds bad.

However having the --auto-check feels like papering over the
actual problem which to me sounds like a design problem.
However this may be a viable short term solution.

About the design issue:
We need to answer the question: Which submodule commits
are referenced by a given set of superproject commits.

This question is advancing a very similar question that we'd
have to ask in git-gc. In gc we would end up without having to
worry about a specific set, but rather the all reachable commits
of the superproject are in the given set.

So we could solve two issues at the same time if we had a quick
way to answer this question quickly.

And for that I considered introducing a ref in the submodule
(e.g.) refs/superproject/gc_protector, which records both the
superprojects commit as well as the submodule commit
in case the superproject changed the submodule pointer.

I have some rough patches for this, but I did not consider sending
that to the list as the patches are still rough and not quite
fully thought out on the design level.

--
Actually screw this. After an office discussion with Jonathan (cc'd):

1)  We need to have an "alternate refs namespace", which contains
    secondary data (as this can be derived from the primary data with
    lots of compute). So maybe we need a file similar to the packed refs
    file, "superproject-refs" that behaves as if it is storing refs,
but that file
    is safe to delete as we know all its contents can be recreated.

2) In this new alternate refs space, we can have
    refs/superproject/<sha1> in the submodule with the sha1 of the
    superproject that points to a commit which is a dirty merge of all
    submodule commits, that have no other parents.

e.g.
In the superproject we might have a history of:

  A <- B <- C

with A being origin/master and C our local master.

In the submodule we have:

  D <- E <- F
    \
      G

F is the master branch in the submodule, G is a dangling commit.

Now we could have the following git links:

    A -> D
    B -> G
    C -> F

Then the refs/superproject/<C>
would be a merge of F and G.

When pushing in the superproject (A..C) we would need to make sure
the submodule is updated as well, i.e. we'd look at
refs/superproject/<C> to know we need to advance the remote
submodule history to contain at least G and F.

Then we can do a standard rev walk starting at G and F downwards until we
hit a commit that is contained in remote/refs/*.

> Why wasn't it discovered that "push.recursesubmodules = check" is
> UNUSABLE since it was introduced is simply beyond me.

Maybe it is not compatible with your workflow as you push only once a day?
In a centralized server model you rarely exceed a few commits on push time,
but
    $ git rev-list a829490...c08db5a |wc -l
    23
    $ git rev-list f5df0f2...f3c0fa5 |wc -l
    61
    $ git rev-list 8f73021...de36215 |wc -l
    95

and some new branches, so you pushed around 200 commits.
I think this misbehaves strongly for typical maintainer work flows.

> I am mad (which is unusual for me, isn't it? -- I've seen somebody
> else saying "I am mad", but I do not think I ever said it here).

Please direct your anger at the design of submodules.
The assumption of the submodule being sort of 'independant' doesn't allow
for efficient data structures I would think. So maybe we need to teach the
submodules about the existence and state of the superproject?

/me ducks

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCHv2] push: change submodule default to check
  2016-08-24 22:37     ` Stefan Beller
@ 2016-08-24 23:01       ` Jeff King
  2016-09-14 17:31         ` [PATCH 1/2] serialize collection of changed submodules Heiko Voigt
  2016-09-14 17:51         ` [PATCH 2/2] serialize collection of refs that contain submodule changes Heiko Voigt
  0 siblings, 2 replies; 25+ messages in thread
From: Jeff King @ 2016-08-24 23:01 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Junio C Hamano, git@vger.kernel.org, Heiko Voigt, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

On Wed, Aug 24, 2016 at 03:37:29PM -0700, Stefan Beller wrote:

> > That sounds massively ... broken.  So before even thinking about
> > flipping it to default, this needs to be fixed first.
> 
> I agree. That sounds bad.
> 
> However having the --auto-check feels like papering over the
> actual problem which to me sounds like a design problem.
> However this may be a viable short term solution.

Sort of...

I may not have been clear, but there are really a few things going on.

One is that the design of find_unpushed_submodules() is just brain-dead.
It does one traversal per updated ref, which means a from-scratch mirror
is O(nr_of_refs * nr_of_commits). That's just silly, and can easily be
fixed behind the scenes to be O(nr_of_commits).

And I _suspect_ it is what made Junio's earlier push so awful; he
probably pushed up the same commits as part of many different branches,
so he did the same diffs over and over.

So clearing that up seems like an obvious first step, and dulls the pain
to "if submodule recursion is on, the worst case is that you walk all
the new objects you are sending". That's still _a_ traversal, but it's
one we have to do anyway in pack-objects, so it's the same order of
magnitude as the rest of the push[1].

Then you've got two cases: the repo is using submodules at all, or they
are not. The former is an easy case, if we can identify it; we can avoid
the traversal at all, and people who do not use submodules are not
regressed at all.

That leaves people who _are_ using submodules with paying the extra
traversal cost. Not great, but only really a pain when you have a really
big chunk of history to push. It may be lost in the noise for such a
push in more normal circumstances (where bandwidth to push up the
objects dominates, though it is unfortunate that we do not even start
utilizing the bandwidth, the critical resource, until we are done with
the submodule check).

[1] Of course with reachability bitmaps the pack-objects traversal goes
    away, but the same cannot be accomplished here (because they do not
    store the gitlink sha1s at all, because they do not imply
    reachability).

> We need to answer the question: Which submodule commits
> are referenced by a given set of superproject commits.
> 
> This question is advancing a very similar question that we'd
> have to ask in git-gc. In gc we would end up without having to
> worry about a specific set, but rather the all reachable commits
> of the superproject are in the given set.
> 
> So we could solve two issues at the same time if we had a quick
> way to answer this question quickly.
> [...]

I snipped here because your solutions sound complicated (which isn't to
say they're wrong, but that I am not willing to give them a lot of
thought at this time in the evening ;) ).

One opposite approach which appeals to me is not to remove the need for
the traversal, but to make it much faster. E.g., by storing commits in a
form that can be traversed more quickly, and possibly keeping a bitmap
cache of which paths are touched in each commit (I have posted patches
to the list for the former, but have only been considering experimenting
with the latter).

That's _also_ complicated, but it applies to way more things. Including
normal log traversals, path-limiting for diffs, the "counting" traversal
done by pack-object, etc.

And while it is complicated in some ways, it's conceptually simple at
the git data model layer. It's returning the same old answers about
commits and trees, just faster.

Anyway, food for thought.

-Peff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/2] serialize collection of changed submodules
  2016-08-24 23:01       ` Jeff King
@ 2016-09-14 17:31         ` Heiko Voigt
  2016-09-14 22:30           ` Junio C Hamano
  2016-09-16 17:27           ` [PATCH 1/2] serialize collection of changed submodules Junio C Hamano
  2016-09-14 17:51         ` [PATCH 2/2] serialize collection of refs that contain submodule changes Heiko Voigt
  1 sibling, 2 replies; 25+ messages in thread
From: Heiko Voigt @ 2016-09-14 17:31 UTC (permalink / raw)
  To: Jeff King
  Cc: Stefan Beller, Junio C Hamano, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

To check whether a submodule needs to be pushed we need to collect all
changed submodules. Lets collect them first and then execute the
possibly expensive test whether certain revisions are already pushed
only once per submodule.

There is further potential for optimization since we can assemble one
command and only issued that instead of one call for each remote ref in
the submodule.

Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
---
Sorry about the late reply. I was not able to process emails until now.
Here are two patches that should help to improve the situation and batch
up some processing. This one is for repositories with submodules, so
that they do not iterate over the same submodule twice with the same
hash.

The second one will be the one people without submodules are interested
in.

Cheers Heiko

 submodule.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 62 insertions(+), 5 deletions(-)

diff --git a/submodule.c b/submodule.c
index 0ef2ff4..b04c066 100644
--- a/submodule.c
+++ b/submodule.c
@@ -554,19 +554,38 @@ static int submodule_needs_pushing(const char *path, const unsigned char sha1[20
 	return 0;
 }
 
+static struct sha1_array *get_sha1s_from_list(struct string_list *submodules,
+		const char *path)
+{
+	struct string_list_item *item;
+	struct sha1_array *hashes;
+
+	item = string_list_insert(submodules, path);
+	if (item->util)
+		return (struct sha1_array *) item->util;
+
+	hashes = (struct sha1_array *) xmalloc(sizeof(struct sha1_array));
+	/* NEEDSWORK: should we add an initializer function for
+	 * sha1_array ? */
+	memset(hashes, 0, sizeof(struct sha1_array));
+	item->util = hashes;
+	return hashes;
+}
+
 static void collect_submodules_from_diff(struct diff_queue_struct *q,
 					 struct diff_options *options,
 					 void *data)
 {
 	int i;
-	struct string_list *needs_pushing = data;
+	struct string_list *submodules = data;
 
 	for (i = 0; i < q->nr; i++) {
 		struct diff_filepair *p = q->queue[i];
+		struct sha1_array *hashes;
 		if (!S_ISGITLINK(p->two->mode))
 			continue;
-		if (submodule_needs_pushing(p->two->path, p->two->oid.hash))
-			string_list_insert(needs_pushing, p->two->path);
+		hashes = get_sha1s_from_list(submodules, p->two->path);
+		sha1_array_append(hashes, p->two->oid.hash);
 	}
 }
 
@@ -582,14 +601,41 @@ static void find_unpushed_submodule_commits(struct commit *commit,
 	diff_tree_combined_merge(commit, 1, &rev);
 }
 
+struct collect_submodule_from_sha1s_data {
+	char *submodule_path;
+	struct string_list *needs_pushing;
+};
+
+static void collect_submodules_from_sha1s(const unsigned char sha1[20],
+		void *data)
+{
+	struct collect_submodule_from_sha1s_data *me =
+		(struct collect_submodule_from_sha1s_data *) data;
+
+	if (submodule_needs_pushing(me->submodule_path, sha1))
+		string_list_insert(me->needs_pushing, me->submodule_path);
+}
+
+static void free_submodules_sha1s(struct string_list *submodules)
+{
+	int i;
+	for (i = 0; i < submodules->nr; i++) {
+		struct string_list_item *item = &submodules->items[i];
+		struct sha1_array *hashes = (struct sha1_array *) item->util;
+		sha1_array_clear(hashes);
+	}
+	string_list_clear(submodules, 1);
+}
+
 int find_unpushed_submodules(unsigned char new_sha1[20],
 		const char *remotes_name, struct string_list *needs_pushing)
 {
 	struct rev_info rev;
 	struct commit *commit;
 	const char *argv[] = {NULL, NULL, "--not", "NULL", NULL};
-	int argc = ARRAY_SIZE(argv) - 1;
+	int argc = ARRAY_SIZE(argv) - 1, i;
 	char *sha1_copy;
+	struct string_list submodules = STRING_LIST_INIT_DUP;
 
 	struct strbuf remotes_arg = STRBUF_INIT;
 
@@ -603,12 +649,23 @@ int find_unpushed_submodules(unsigned char new_sha1[20],
 		die("revision walk setup failed");
 
 	while ((commit = get_revision(&rev)) != NULL)
-		find_unpushed_submodule_commits(commit, needs_pushing);
+		find_unpushed_submodule_commits(commit, &submodules);
 
 	reset_revision_walk();
 	free(sha1_copy);
 	strbuf_release(&remotes_arg);
 
+	for (i = 0; i < submodules.nr; i++) {
+		struct string_list_item *item = &submodules.items[i];
+		struct collect_submodule_from_sha1s_data data;
+		data.submodule_path = item->string;
+		data.needs_pushing = needs_pushing;
+		sha1_array_for_each_unique((struct sha1_array *) item->util,
+				collect_submodules_from_sha1s,
+				&data);
+	}
+	free_submodules_sha1s(&submodules);
+
 	return needs_pushing->nr;
 }
 
-- 
2.0.2.832.g083c931


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/2] serialize collection of refs that contain submodule changes
  2016-08-24 23:01       ` Jeff King
  2016-09-14 17:31         ` [PATCH 1/2] serialize collection of changed submodules Heiko Voigt
@ 2016-09-14 17:51         ` Heiko Voigt
  2016-09-14 19:46           ` Heiko Voigt
  2016-09-16 17:47           ` Junio C Hamano
  1 sibling, 2 replies; 25+ messages in thread
From: Heiko Voigt @ 2016-09-14 17:51 UTC (permalink / raw)
  To: Jeff King
  Cc: Stefan Beller, Junio C Hamano, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

We are iterating over each pushed ref and want to check whether it
contains changes to submodules. Instead of immediately checking each ref
lets first collect them and then do the check for all of them in one
revision walk.

Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
---

Sorry this was not catched earlier. This was implemented as part of
summer of code and it seems we never tested with --mirror.

This is the one which does only one revision walk instead of one for
each ref. Here are some numbers (using the my development clone of git
itself) from my local machine:

rm -rf <test-git> && mkdir <test-git> &&
(cd <test-git> && git init) &&
time git push --mirror <test-git>

   real	0m16.056s
   user	0m24.424s
   sys	0m1.380s

   real	0m15.885s
   user	0m24.204s
   sys	0m1.296s

   real	0m16.731s
   user	0m24.176s
   sys	0m1.244s

rm -rf <test-git> && mkdir <test-git> &&
(cd <test-git> && git init) &&
time git push --mirror --recurse-submodules=check <test-git>

   real	0m21.441s
   user	0m29.560s
   sys	0m1.480s

   real	0m21.319s
   user	0m29.484s
   sys	0m1.464s

   real	0m21.261s
   user	0m29.252s
   sys	0m1.592s

Without my patches and --recurse-submodules=check the numbers are
basically the same. I stopped the test with --recurse-submodules=check
after ~ 9 minutes.

Cheers Heiko

 submodule.c | 36 +++++++++++++++++++++---------------
 submodule.h |  5 +++--
 transport.c | 22 ++++++++++++++--------
 3 files changed, 38 insertions(+), 25 deletions(-)

diff --git a/submodule.c b/submodule.c
index b04c066..a15e346 100644
--- a/submodule.c
+++ b/submodule.c
@@ -627,24 +627,31 @@ static void free_submodules_sha1s(struct string_list *submodules)
 	string_list_clear(submodules, 1);
 }
 
-int find_unpushed_submodules(unsigned char new_sha1[20],
+static void append_hash_to_argv(const unsigned char sha1[20],
+		void *data)
+{
+	struct argv_array *argv = (struct argv_array *) data;
+	argv_array_push(argv, sha1_to_hex(sha1));
+}
+
+int find_unpushed_submodules(struct sha1_array *hashes,
 		const char *remotes_name, struct string_list *needs_pushing)
 {
 	struct rev_info rev;
 	struct commit *commit;
-	const char *argv[] = {NULL, NULL, "--not", "NULL", NULL};
-	int argc = ARRAY_SIZE(argv) - 1, i;
-	char *sha1_copy;
+	int i;
 	struct string_list submodules = STRING_LIST_INIT_DUP;
+	struct argv_array argv = ARGV_ARRAY_INIT;
 
-	struct strbuf remotes_arg = STRBUF_INIT;
-
-	strbuf_addf(&remotes_arg, "--remotes=%s", remotes_name);
 	init_revisions(&rev, NULL);
-	sha1_copy = xstrdup(sha1_to_hex(new_sha1));
-	argv[1] = sha1_copy;
-	argv[3] = remotes_arg.buf;
-	setup_revisions(argc, argv, &rev, NULL);
+
+	/* argv.argv[0] will be ignored by setup_revisions */
+	argv_array_push(&argv, "find_unpushed_submodules");
+	sha1_array_for_each_unique(hashes, append_hash_to_argv, &argv);
+	argv_array_push(&argv, "--not");
+	argv_array_pushf(&argv, "--remotes=%s", remotes_name);
+
+	setup_revisions(argv.argc, argv.argv, &rev, NULL);
 	if (prepare_revision_walk(&rev))
 		die("revision walk setup failed");
 
@@ -652,8 +659,7 @@ int find_unpushed_submodules(unsigned char new_sha1[20],
 		find_unpushed_submodule_commits(commit, &submodules);
 
 	reset_revision_walk();
-	free(sha1_copy);
-	strbuf_release(&remotes_arg);
+	argv_array_clear(&argv);
 
 	for (i = 0; i < submodules.nr; i++) {
 		struct string_list_item *item = &submodules.items[i];
@@ -691,12 +697,12 @@ static int push_submodule(const char *path)
 	return 1;
 }
 
-int push_unpushed_submodules(unsigned char new_sha1[20], const char *remotes_name)
+int push_unpushed_submodules(struct sha1_array *hashes, const char *remotes_name)
 {
 	int i, ret = 1;
 	struct string_list needs_pushing = STRING_LIST_INIT_DUP;
 
-	if (!find_unpushed_submodules(new_sha1, remotes_name, &needs_pushing))
+	if (!find_unpushed_submodules(hashes, remotes_name, &needs_pushing))
 		return 1;
 
 	for (i = 0; i < needs_pushing.nr; i++) {
diff --git a/submodule.h b/submodule.h
index d9e197a..065b2f0 100644
--- a/submodule.h
+++ b/submodule.h
@@ -3,6 +3,7 @@
 
 struct diff_options;
 struct argv_array;
+struct sha1_array;
 
 enum {
 	RECURSE_SUBMODULES_CHECK = -4,
@@ -62,9 +63,9 @@ int submodule_uses_gitfile(const char *path);
 int ok_to_remove_submodule(const char *path);
 int merge_submodule(unsigned char result[20], const char *path, const unsigned char base[20],
 		    const unsigned char a[20], const unsigned char b[20], int search);
-int find_unpushed_submodules(unsigned char new_sha1[20], const char *remotes_name,
+int find_unpushed_submodules(struct sha1_array *hashes, const char *remotes_name,
 		struct string_list *needs_pushing);
-int push_unpushed_submodules(unsigned char new_sha1[20], const char *remotes_name);
+int push_unpushed_submodules(struct sha1_array *hashes, const char *remotes_name);
 void connect_work_tree_and_git_dir(const char *work_tree, const char *git_dir);
 int parallel_submodules(void);
 
diff --git a/transport.c b/transport.c
index 94d6dc3..76e1daf 100644
--- a/transport.c
+++ b/transport.c
@@ -903,23 +903,29 @@ int transport_push(struct transport *transport,
 
 		if ((flags & TRANSPORT_RECURSE_SUBMODULES_ON_DEMAND) && !is_bare_repository()) {
 			struct ref *ref = remote_refs;
+			struct sha1_array hashes = SHA1_ARRAY_INIT;
+
 			for (; ref; ref = ref->next)
-				if (!is_null_oid(&ref->new_oid) &&
-				    !push_unpushed_submodules(ref->new_oid.hash,
-					    transport->remote->name))
-				    die ("Failed to push all needed submodules!");
+				if (!is_null_oid(&ref->new_oid))
+					sha1_array_append(&hashes, ref->new_oid.hash);
+
+			if (!push_unpushed_submodules(&hashes, transport->remote->name))
+				die ("Failed to push all needed submodules!");
 		}
 
 		if ((flags & (TRANSPORT_RECURSE_SUBMODULES_ON_DEMAND |
 			      TRANSPORT_RECURSE_SUBMODULES_CHECK)) && !is_bare_repository()) {
 			struct ref *ref = remote_refs;
 			struct string_list needs_pushing = STRING_LIST_INIT_DUP;
+			struct sha1_array hashes = SHA1_ARRAY_INIT;
 
 			for (; ref; ref = ref->next)
-				if (!is_null_oid(&ref->new_oid) &&
-				    find_unpushed_submodules(ref->new_oid.hash,
-					    transport->remote->name, &needs_pushing))
-					die_with_unpushed_submodules(&needs_pushing);
+				if (!is_null_oid(&ref->new_oid))
+					sha1_array_append(&hashes, ref->new_oid.hash);
+
+			if (find_unpushed_submodules(&hashes, transport->remote->name,
+						&needs_pushing))
+				die_with_unpushed_submodules(&needs_pushing);
 		}
 
 		push_ret = transport->push_refs(transport, remote_refs, flags);
-- 
2.0.2.832.g083c931


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] serialize collection of refs that contain submodule changes
  2016-09-14 17:51         ` [PATCH 2/2] serialize collection of refs that contain submodule changes Heiko Voigt
@ 2016-09-14 19:46           ` Heiko Voigt
  2016-09-14 20:04             ` Stefan Beller
  2016-09-16 17:47           ` Junio C Hamano
  1 sibling, 1 reply; 25+ messages in thread
From: Heiko Voigt @ 2016-09-14 19:46 UTC (permalink / raw)
  To: Jeff King
  Cc: Stefan Beller, Junio C Hamano, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

On Wed, Sep 14, 2016 at 07:51:30PM +0200, Heiko Voigt wrote:
> Here are some numbers (using the my development clone of git
> itself) from my local machine:
> 
> rm -rf <test-git> && mkdir <test-git> &&
> (cd <test-git> && git init) &&
> time git push --mirror <test-git>
> 
>    real	0m16.056s
>    user	0m24.424s
>    sys	0m1.380s
> 
>    real	0m15.885s
>    user	0m24.204s
>    sys	0m1.296s
> 
>    real	0m16.731s
>    user	0m24.176s
>    sys	0m1.244s
> 
> rm -rf <test-git> && mkdir <test-git> &&
> (cd <test-git> && git init) &&
> time git push --mirror --recurse-submodules=check <test-git>
> 
>    real	0m21.441s
>    user	0m29.560s
>    sys	0m1.480s
> 
>    real	0m21.319s
>    user	0m29.484s
>    sys	0m1.464s
> 
>    real	0m21.261s
>    user	0m29.252s
>    sys	0m1.592s
> 
> Without my patches and --recurse-submodules=check the numbers are
> basically the same. I stopped the test with --recurse-submodules=check
> after ~ 9 minutes.

Fun fact, I let the push without my patch and with
--recurse-submodules=check finish:

real	27m7.962s
user	27m15.568s
sys	0m2.016s

Thats quite some time...

Cheers Heiko

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] serialize collection of refs that contain submodule changes
  2016-09-14 19:46           ` Heiko Voigt
@ 2016-09-14 20:04             ` Stefan Beller
  0 siblings, 0 replies; 25+ messages in thread
From: Stefan Beller @ 2016-09-14 20:04 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jeff King, Junio C Hamano, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

On Wed, Sep 14, 2016 at 12:46 PM, Heiko Voigt <hvoigt@hvoigt.net> wrote:
> On Wed, Sep 14, 2016 at 07:51:30PM +0200, Heiko Voigt wrote:
>> Here are some numbers (using the my development clone of git
>> itself) from my local machine:
>>
>> rm -rf <test-git> && mkdir <test-git> &&
>> (cd <test-git> && git init) &&
>> time git push --mirror <test-git>
>>
>>    real       0m16.056s
>>    user       0m24.424s
>>    sys        0m1.380s
>>
>>    real       0m15.885s
>>    user       0m24.204s
>>    sys        0m1.296s
>>
>>    real       0m16.731s
>>    user       0m24.176s
>>    sys        0m1.244s
>>
>> rm -rf <test-git> && mkdir <test-git> &&
>> (cd <test-git> && git init) &&
>> time git push --mirror --recurse-submodules=check <test-git>
>>
>>    real       0m21.441s
>>    user       0m29.560s
>>    sys        0m1.480s
>>
>>    real       0m21.319s
>>    user       0m29.484s
>>    sys        0m1.464s
>>
>>    real       0m21.261s
>>    user       0m29.252s
>>    sys        0m1.592s
>>
>> Without my patches and --recurse-submodules=check the numbers are
>> basically the same. I stopped the test with --recurse-submodules=check
>> after ~ 9 minutes.
>
> Fun fact, I let the push without my patch and with
> --recurse-submodules=check finish:

Thanks for the numbers, one of the major push backs for
origin/sb/push-make-submodule-check-the-default
was that it introduced slowness; this patch might help a bit there.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] serialize collection of changed submodules
  2016-09-14 17:31         ` [PATCH 1/2] serialize collection of changed submodules Heiko Voigt
@ 2016-09-14 22:30           ` Junio C Hamano
  2016-09-15 12:10             ` [PATCH 3/2] batch check whether submodule needs pushing into one call Heiko Voigt
  2016-09-15 12:18             ` [PATCH 4/2] use actual start hashes for submodule push check instead of local refs Heiko Voigt
  2016-09-16 17:27           ` [PATCH 1/2] serialize collection of changed submodules Junio C Hamano
  1 sibling, 2 replies; 25+ messages in thread
From: Junio C Hamano @ 2016-09-14 22:30 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

Heiko Voigt <hvoigt@hvoigt.net> writes:

> Sorry about the late reply. I was not able to process emails until now.
> Here are two patches that should help to improve the situation and batch
> up some processing. This one is for repositories with submodules, so
> that they do not iterate over the same submodule twice with the same
> hash.
>
> The second one will be the one people without submodules are interested
> in.

Thanks.  Will take a look at later as I'm already deep in today's
integration cycle.  Very much appreciated.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 3/2] batch check whether submodule needs pushing into one call
  2016-09-14 22:30           ` Junio C Hamano
@ 2016-09-15 12:10             ` Heiko Voigt
  2016-09-15 21:08               ` Junio C Hamano
  2016-09-16 17:59               ` Junio C Hamano
  2016-09-15 12:18             ` [PATCH 4/2] use actual start hashes for submodule push check instead of local refs Heiko Voigt
  1 sibling, 2 replies; 25+ messages in thread
From: Heiko Voigt @ 2016-09-15 12:10 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

We run a command for each sha1 change in a submodule. This is
unnecessary since we can simply batch all sha1's we want to check into
one command. Lets do it so we can speedup the check when many submodule
changes are in need of checking.

Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
---
On Wed, Sep 14, 2016 at 03:30:53PM -0700, Junio C Hamano wrote:
> Heiko Voigt <hvoigt@hvoigt.net> writes:
> 
> > Sorry about the late reply. I was not able to process emails until now.
> > Here are two patches that should help to improve the situation and batch
> > up some processing. This one is for repositories with submodules, so
> > that they do not iterate over the same submodule twice with the same
> > hash.
> >
> > The second one will be the one people without submodules are interested
> > in.
> 
> Thanks.  Will take a look at later as I'm already deep in today's
> integration cycle.  Very much appreciated.

No problem. While I am at it: Here are actually another two patches that
should make life of submodule users easier (push times of big pushes).

In Numbers with the qt5[1] superproject and all submodules initialized.
The same --mirror test as before with the git repository:

# Without patch:

book:qt5 hvoigt (5.6)$
rm -rf ~/Downloads/git-test && mkdir ~/Downloads/git-test &&
   (cd ~/Downloads/git-test && git init) &&
   time git push --mirror --recurse-submodules=check ~/Downloads/git-test

real	4m0.881s
user	3m30.139s
sys	0m22.329s

Without --recurse-submodules=check

real	0m0.251s
user	0m0.218s
sys	0m0.082s


# With patch:

real	0m1.167s
user	0m0.846s
sys	0m0.262s

real	0m1.110s
user	0m0.815s
sys	0m0.247s

real	0m1.111s
user	0m0.818s
sys	0m0.251s

Without --recurse-submodules=check

real	0m0.294s
user	0m0.221s
sys	0m0.104s

real	0m0.248s
user	0m0.216s
sys	0m0.080s

real	0m0.247s
user	0m0.212s
sys	0m0.082s

[1] git://code.qt.io/qt/qt5.git

 submodule.c | 75 ++++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 39 insertions(+), 36 deletions(-)

diff --git a/submodule.c b/submodule.c
index a15e346..28bb74e 100644
--- a/submodule.c
+++ b/submodule.c
@@ -522,27 +522,54 @@ static int has_remote(const char *refname, const struct object_id *oid,
 	return 1;
 }
 
-static int submodule_needs_pushing(const char *path, const unsigned char sha1[20])
+static void append_hash_to_argv(const unsigned char sha1[20], void *data)
 {
-	if (add_submodule_odb(path) || !lookup_commit_reference(sha1))
+	struct argv_array *argv = (struct argv_array *) data;
+	argv_array_push(argv, sha1_to_hex(sha1));
+}
+
+static void check_has_hash(const unsigned char sha1[20], void *data)
+{
+	int *has_hash = (int *) data;
+
+	if (!lookup_commit_reference(sha1))
+		*has_hash = 0;
+}
+
+static int submodule_has_hashes(const char *path, struct sha1_array *hashes)
+{
+	int has_hash = 1;
+
+	if (add_submodule_odb(path))
+		return 0;
+
+	sha1_array_for_each_unique(hashes, check_has_hash, &has_hash);
+	return has_hash;
+}
+
+static int submodule_needs_pushing(const char *path, struct sha1_array *hashes)
+{
+	if (!submodule_has_hashes(path, hashes))
 		return 0;
 
 	if (for_each_remote_ref_submodule(path, has_remote, NULL) > 0) {
 		struct child_process cp = CHILD_PROCESS_INIT;
-		const char *argv[] = {"rev-list", NULL, "--not", "--remotes", "-n", "1" , NULL};
+
+		argv_array_push(&cp.args, "rev-list");
+		sha1_array_for_each_unique(hashes, append_hash_to_argv, &cp.args);
+		argv_array_pushl(&cp.args, "--not", "--remotes", "-n", "1" , NULL);
+
 		struct strbuf buf = STRBUF_INIT;
 		int needs_pushing = 0;
 
-		argv[1] = sha1_to_hex(sha1);
-		cp.argv = argv;
 		prepare_submodule_repo_env(&cp.env_array);
 		cp.git_cmd = 1;
 		cp.no_stdin = 1;
 		cp.out = -1;
 		cp.dir = path;
 		if (start_command(&cp))
-			die("Could not run 'git rev-list %s --not --remotes -n 1' command in submodule %s",
-				sha1_to_hex(sha1), path);
+			die("Could not run 'git rev-list <hashes> --not --remotes -n 1' command in submodule %s",
+					path);
 		if (strbuf_read(&buf, cp.out, 41))
 			needs_pushing = 1;
 		finish_command(&cp);
@@ -601,21 +628,6 @@ static void find_unpushed_submodule_commits(struct commit *commit,
 	diff_tree_combined_merge(commit, 1, &rev);
 }
 
-struct collect_submodule_from_sha1s_data {
-	char *submodule_path;
-	struct string_list *needs_pushing;
-};
-
-static void collect_submodules_from_sha1s(const unsigned char sha1[20],
-		void *data)
-{
-	struct collect_submodule_from_sha1s_data *me =
-		(struct collect_submodule_from_sha1s_data *) data;
-
-	if (submodule_needs_pushing(me->submodule_path, sha1))
-		string_list_insert(me->needs_pushing, me->submodule_path);
-}
-
 static void free_submodules_sha1s(struct string_list *submodules)
 {
 	int i;
@@ -627,13 +639,6 @@ static void free_submodules_sha1s(struct string_list *submodules)
 	string_list_clear(submodules, 1);
 }
 
-static void append_hash_to_argv(const unsigned char sha1[20],
-		void *data)
-{
-	struct argv_array *argv = (struct argv_array *) data;
-	argv_array_push(argv, sha1_to_hex(sha1));
-}
-
 int find_unpushed_submodules(struct sha1_array *hashes,
 		const char *remotes_name, struct string_list *needs_pushing)
 {
@@ -662,13 +667,11 @@ int find_unpushed_submodules(struct sha1_array *hashes,
 	argv_array_clear(&argv);
 
 	for (i = 0; i < submodules.nr; i++) {
-		struct string_list_item *item = &submodules.items[i];
-		struct collect_submodule_from_sha1s_data data;
-		data.submodule_path = item->string;
-		data.needs_pushing = needs_pushing;
-		sha1_array_for_each_unique((struct sha1_array *) item->util,
-				collect_submodules_from_sha1s,
-				&data);
+		struct string_list_item *submodule = &submodules.items[i];
+		struct sha1_array *hashes = (struct sha1_array *) submodule->util;
+
+		if (submodule_needs_pushing(submodule->string, hashes))
+			string_list_insert(needs_pushing, submodule->string);
 	}
 	free_submodules_sha1s(&submodules);
 
-- 
2.10.0.133.g13017a3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/2] use actual start hashes for submodule push check instead of local refs
  2016-09-14 22:30           ` Junio C Hamano
  2016-09-15 12:10             ` [PATCH 3/2] batch check whether submodule needs pushing into one call Heiko Voigt
@ 2016-09-15 12:18             ` Heiko Voigt
  1 sibling, 0 replies; 25+ messages in thread
From: Heiko Voigt @ 2016-09-15 12:18 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

Push knows the actual revision range it is actually pushing to a remote.
Let's use the start revisions to reduce the amount of checked revisions
instead of the locally cached remote refs which might be out of date.

This actually changes behavior as it now can also properly handle pushes
with URLs. That is also the reason why we change some tests. When
passing the remote as URL the push is made unconditionally in on-demand.
This is wrong since on-demand should only push the submodule if its SHA1
reference got changed in the superproject history that is getting pushed.
The reason behind that is that the exclusion list for revisions was
reduced by using the parameters "--not --remotes=<remote name>" to
reduce the amount of revisions for the revision walk. In case of an URL
this does not match anything and thus we would always do a full revision
walk until the root commit.

Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
---
And here is another one which makes the check more correct. For push
--recurse-submodules=check|on-demand with URLs this is also a
performance improvement since at the moment we iterate over all
revisions unconditionally as explained above.

This patch changes the behavior (now its how its supposed to be) so
there may be fallout from users complaining that their submodules do not
get pushed with on-demand anymore. This is only for users using URL for
pushing though.

Cheers Heiko

BTW: Peff, thanks for the good diagnosis of all these problems.

 submodule.c                    | 12 ++++++------
 submodule.h                    |  6 +++---
 t/t5531-deep-submodule-push.sh | 24 ++++++++++++++++++------
 transport.c                    | 40 ++++++++++++++++++++++++++++++----------
 4 files changed, 57 insertions(+), 25 deletions(-)

diff --git a/submodule.c b/submodule.c
index 28bb74e..1f82974 100644
--- a/submodule.c
+++ b/submodule.c
@@ -639,8 +639,8 @@ static void free_submodules_sha1s(struct string_list *submodules)
 	string_list_clear(submodules, 1);
 }
 
-int find_unpushed_submodules(struct sha1_array *hashes,
-		const char *remotes_name, struct string_list *needs_pushing)
+int find_unpushed_submodules(struct sha1_array *old_hashes,
+		struct sha1_array *new_hashes, struct string_list *needs_pushing)
 {
 	struct rev_info rev;
 	struct commit *commit;
@@ -652,9 +652,9 @@ int find_unpushed_submodules(struct sha1_array *hashes,
 
 	/* argv.argv[0] will be ignored by setup_revisions */
 	argv_array_push(&argv, "find_unpushed_submodules");
-	sha1_array_for_each_unique(hashes, append_hash_to_argv, &argv);
+	sha1_array_for_each_unique(new_hashes, append_hash_to_argv, &argv);
 	argv_array_push(&argv, "--not");
-	argv_array_pushf(&argv, "--remotes=%s", remotes_name);
+	sha1_array_for_each_unique(old_hashes, append_hash_to_argv, &argv);
 
 	setup_revisions(argv.argc, argv.argv, &rev, NULL);
 	if (prepare_revision_walk(&rev))
@@ -700,12 +700,12 @@ static int push_submodule(const char *path)
 	return 1;
 }
 
-int push_unpushed_submodules(struct sha1_array *hashes, const char *remotes_name)
+int push_unpushed_submodules(struct sha1_array *old_hashes, struct sha1_array *new_hashes)
 {
 	int i, ret = 1;
 	struct string_list needs_pushing = STRING_LIST_INIT_DUP;
 
-	if (!find_unpushed_submodules(hashes, remotes_name, &needs_pushing))
+	if (!find_unpushed_submodules(old_hashes, new_hashes, &needs_pushing))
 		return 1;
 
 	for (i = 0; i < needs_pushing.nr; i++) {
diff --git a/submodule.h b/submodule.h
index 065b2f0..55c6c91 100644
--- a/submodule.h
+++ b/submodule.h
@@ -63,9 +63,9 @@ int submodule_uses_gitfile(const char *path);
 int ok_to_remove_submodule(const char *path);
 int merge_submodule(unsigned char result[20], const char *path, const unsigned char base[20],
 		    const unsigned char a[20], const unsigned char b[20], int search);
-int find_unpushed_submodules(struct sha1_array *hashes, const char *remotes_name,
-		struct string_list *needs_pushing);
-int push_unpushed_submodules(struct sha1_array *hashes, const char *remotes_name);
+int find_unpushed_submodules(struct sha1_array *old_hashes,
+		struct sha1_array *new_hashes, struct string_list *needs_pushing);
+int push_unpushed_submodules(struct sha1_array *old_hashes, struct sha1_array *new_hashes);
 void connect_work_tree_and_git_dir(const char *work_tree, const char *git_dir);
 int parallel_submodules(void);
 
diff --git a/t/t5531-deep-submodule-push.sh b/t/t5531-deep-submodule-push.sh
index 198ce84..a5b8ef6 100755
--- a/t/t5531-deep-submodule-push.sh
+++ b/t/t5531-deep-submodule-push.sh
@@ -154,6 +154,8 @@ test_expect_success 'push recurse-submodules on command line overrides config' '
 		git fetch ../pub.git &&
 		git diff --quiet FETCH_HEAD master &&
 		(cd gar/bage && git diff --quiet origin/master master^) &&
+		# Since this push was executed reset to previous state
+		(git push -f ../pub.git master^:master) &&
 
 		# Ensure that we can override check in the config to
 		# disable submodule recursion entirely (alternative form)
@@ -161,6 +163,8 @@ test_expect_success 'push recurse-submodules on command line overrides config' '
 		git fetch ../pub.git &&
 		git diff --quiet FETCH_HEAD master &&
 		(cd gar/bage && git diff --quiet origin/master master^) &&
+		# Since this push was executed reset to previous state
+		(git push -f ../pub.git master^:master) &&
 
 		# Ensure that we can override check in the config to
 		# push the submodule too
@@ -198,11 +202,15 @@ test_expect_success 'push recurse-submodules last one wins on command line' '
 		git diff --quiet FETCH_HEAD master &&
 		# Check that the submodule commit did not get there
 		(cd gar/bage && git diff --quiet origin/master master^) &&
+		# Since this push was executed reset to previous state
+		(git push -f ../pub.git master^:master) &&
 
 		# should result in "no"
 		git push --recurse-submodules=on-demand --no-recurse-submodules ../pub.git master &&
 		# Check that the submodule commit did not get there
 		(cd gar/bage && git diff --quiet origin/master master^) &&
+		# Since this push was executed reset to previous state
+		(git push -f ../pub.git master^:master) &&
 
 		# But the options in the other order should push the submodule
 		git push --recurse-submodules=check --recurse-submodules=on-demand ../pub.git master &&
@@ -249,9 +257,11 @@ test_expect_success 'push succeeds if submodule commit disabling recursion from
 		git fetch ../pub.git &&
 		git diff --quiet FETCH_HEAD master &&
 		# But that the submodule commit did not
-		( cd gar/bage && git diff --quiet origin/master master^ ) &&
-		# Now push it to avoid confusing future tests
-		git push --recurse-submodules=on-demand ../pub.git master
+		(cd gar/bage &&
+			git diff --quiet origin/master master^ &&
+			# Now push it to avoid confusing future tests
+			git push
+		)
 	)
 '
 
@@ -271,9 +281,11 @@ test_expect_success 'push succeeds if submodule commit disabling recursion from
 		git fetch ../pub.git &&
 		git diff --quiet FETCH_HEAD master &&
 		# But that the submodule commit did not
-		( cd gar/bage && git diff --quiet origin/master master^ ) &&
-		# Now push it to avoid confusing future tests
-		git push --recurse-submodules=on-demand ../pub.git master
+		(cd gar/bage &&
+			git diff --quiet origin/master master^ &&
+			# Now push it to avoid confusing future tests
+			git push
+		)
 	)
 '
 
diff --git a/transport.c b/transport.c
index 76e1daf..475faaf 100644
--- a/transport.c
+++ b/transport.c
@@ -845,6 +845,28 @@ static int run_pre_push_hook(struct transport *transport,
 	return ret;
 }
 
+static void collect_new_old_hashes(struct ref *ref, struct sha1_array *new_hashes,
+		struct sha1_array *old_hashes)
+{
+	for (; ref; ref = ref->next) {
+		if (!is_null_oid(&ref->new_oid)) {
+			/* Just need to handle non-null oid's. If there
+			 * is no new we are handling a deletion and do
+			 * not need to check
+			 */
+			sha1_array_append(new_hashes, ref->new_oid.hash);
+
+			/* if the old id is null we are handling an new
+			 * ref and can simply skip appending the oid
+			 * since they are used to reduce the amount of
+			 * checked revisions.
+			 */
+			if (!is_null_oid(&ref->old_oid))
+				sha1_array_append(old_hashes, ref->old_oid.hash);
+		}
+	}
+}
+
 int transport_push(struct transport *transport,
 		   int refspec_nr, const char **refspec, int flags,
 		   unsigned int *reject_reasons)
@@ -903,13 +925,12 @@ int transport_push(struct transport *transport,
 
 		if ((flags & TRANSPORT_RECURSE_SUBMODULES_ON_DEMAND) && !is_bare_repository()) {
 			struct ref *ref = remote_refs;
-			struct sha1_array hashes = SHA1_ARRAY_INIT;
+			struct sha1_array new_hashes = SHA1_ARRAY_INIT;
+			struct sha1_array old_hashes = SHA1_ARRAY_INIT;
 
-			for (; ref; ref = ref->next)
-				if (!is_null_oid(&ref->new_oid))
-					sha1_array_append(&hashes, ref->new_oid.hash);
+			collect_new_old_hashes(ref, &new_hashes, &old_hashes);
 
-			if (!push_unpushed_submodules(&hashes, transport->remote->name))
+			if (!push_unpushed_submodules(&old_hashes, &new_hashes))
 				die ("Failed to push all needed submodules!");
 		}
 
@@ -917,13 +938,12 @@ int transport_push(struct transport *transport,
 			      TRANSPORT_RECURSE_SUBMODULES_CHECK)) && !is_bare_repository()) {
 			struct ref *ref = remote_refs;
 			struct string_list needs_pushing = STRING_LIST_INIT_DUP;
-			struct sha1_array hashes = SHA1_ARRAY_INIT;
+			struct sha1_array new_hashes = SHA1_ARRAY_INIT;
+			struct sha1_array old_hashes = SHA1_ARRAY_INIT;
 
-			for (; ref; ref = ref->next)
-				if (!is_null_oid(&ref->new_oid))
-					sha1_array_append(&hashes, ref->new_oid.hash);
+			collect_new_old_hashes(ref, &new_hashes, &old_hashes);
 
-			if (find_unpushed_submodules(&hashes, transport->remote->name,
+			if (find_unpushed_submodules(&old_hashes, &new_hashes,
 						&needs_pushing))
 				die_with_unpushed_submodules(&needs_pushing);
 		}
-- 
2.10.0.133.g13017a3


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/2] batch check whether submodule needs pushing into one call
  2016-09-15 12:10             ` [PATCH 3/2] batch check whether submodule needs pushing into one call Heiko Voigt
@ 2016-09-15 21:08               ` Junio C Hamano
  2016-09-16  9:40                 ` Heiko Voigt
  2016-09-16 17:59               ` Junio C Hamano
  1 sibling, 1 reply; 25+ messages in thread
From: Junio C Hamano @ 2016-09-15 21:08 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

Heiko Voigt <hvoigt@hvoigt.net> writes:

>  	if (for_each_remote_ref_submodule(path, has_remote, NULL) > 0) {
>  		struct child_process cp = CHILD_PROCESS_INIT;
> -		const char *argv[] = {"rev-list", NULL, "--not", "--remotes", "-n", "1" , NULL};
> +
> +		argv_array_push(&cp.args, "rev-list");
> +		sha1_array_for_each_unique(hashes, append_hash_to_argv, &cp.args);
> +		argv_array_pushl(&cp.args, "--not", "--remotes", "-n", "1" , NULL);
> +
>  		struct strbuf buf = STRBUF_INIT;
>  		int needs_pushing = 0;

These two become decl-after-stmt; move your new lines a bit lower,
perhaps?

> -		argv[1] = sha1_to_hex(sha1);
> -		cp.argv = argv;
>  		prepare_submodule_repo_env(&cp.env_array);

By the way, with the two new patches, 'pu' seems to start failing
some tests, e.g. 5533 5404 5405.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/2] batch check whether submodule needs pushing into one call
  2016-09-15 21:08               ` Junio C Hamano
@ 2016-09-16  9:40                 ` Heiko Voigt
  2016-09-16 12:31                   ` Heiko Voigt
  0 siblings, 1 reply; 25+ messages in thread
From: Heiko Voigt @ 2016-09-16  9:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

On Thu, Sep 15, 2016 at 02:08:58PM -0700, Junio C Hamano wrote:
> Heiko Voigt <hvoigt@hvoigt.net> writes:
> 
> >  	if (for_each_remote_ref_submodule(path, has_remote, NULL) > 0) {
> >  		struct child_process cp = CHILD_PROCESS_INIT;
> > -		const char *argv[] = {"rev-list", NULL, "--not", "--remotes", "-n", "1" , NULL};
> > +
> > +		argv_array_push(&cp.args, "rev-list");
> > +		sha1_array_for_each_unique(hashes, append_hash_to_argv, &cp.args);
> > +		argv_array_pushl(&cp.args, "--not", "--remotes", "-n", "1" , NULL);
> > +
> >  		struct strbuf buf = STRBUF_INIT;
> >  		int needs_pushing = 0;
> 
> These two become decl-after-stmt; move your new lines a bit lower,
> perhaps?

Thanks, missed those. Will do.

> > -		argv[1] = sha1_to_hex(sha1);
> > -		cp.argv = argv;
> >  		prepare_submodule_repo_env(&cp.env_array);
> 
> By the way, with the two new patches, 'pu' seems to start failing
> some tests, e.g. 5533 5404 5405.

Ah ok I did only test on master, will look into those.

Cheers Heiko

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/2] batch check whether submodule needs pushing into one call
  2016-09-16  9:40                 ` Heiko Voigt
@ 2016-09-16 12:31                   ` Heiko Voigt
  2016-09-16 18:13                     ` Junio C Hamano
  0 siblings, 1 reply; 25+ messages in thread
From: Heiko Voigt @ 2016-09-16 12:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

On Fri, Sep 16, 2016 at 11:40:19AM +0200, Heiko Voigt wrote:
> > By the way, with the two new patches, 'pu' seems to start failing
> > some tests, e.g. 5533 5404 5405.
> 
> Ah ok I did only test on master, will look into those.

Ok I had a look into these and the reason t5533 fails is because on pu
--recurse-submodules is enabled by default and I missed the case when
overwriting a ref. In that case we get the sha1 from the remote side as
old. So we could catch that and fall back to all revisions there, but...

... tl;dr: The solution to use the old revisions from the remote side
was too simple and does not make matters better but actually worse for
some typical usecases. Its only in the last patch.

... that lead me to further thinking about the previous solution (using
the locally cached remote refs) which might actually be a good default
for the non-fastforward cases like creating new ref or overwriting a
ref.

My current patch would handle the --mirror case nicer, since it gets a
lot of old revs to reduce the revisions to check. For the typical one
branch push it would most likely be worse. Except when the user is
updating (fast-forwarding) the remote ref we would scan all revs of a
ref until the root because we do not get enough valid revs that already
exist on the remote.

The most exact solution would be to use all actual remote refs available
(not sure if we have them at this point in the process?) another
solution would be to still append the --remotes=<remotename> option as a
fallback to reduce the revisions.

What do others think? Will leave this for a little bit further thinking.
Its just the last patch ("use actual start hashes for submodule push
check instead of local refs") that needs to go back to the drawing
board.

Cheers Heiko

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] serialize collection of changed submodules
  2016-09-14 17:31         ` [PATCH 1/2] serialize collection of changed submodules Heiko Voigt
  2016-09-14 22:30           ` Junio C Hamano
@ 2016-09-16 17:27           ` Junio C Hamano
  2016-09-19 19:44             ` Heiko Voigt
  1 sibling, 1 reply; 25+ messages in thread
From: Junio C Hamano @ 2016-09-16 17:27 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

Heiko Voigt <hvoigt@hvoigt.net> writes:

> +static struct sha1_array *get_sha1s_from_list(struct string_list *submodules,
> +		const char *path)
> +{
> +	struct string_list_item *item;
> +	struct sha1_array *hashes;
> +
> +	item = string_list_insert(submodules, path);
> +	if (item->util)
> +		return (struct sha1_array *) item->util;
> +
> +	hashes = (struct sha1_array *) xmalloc(sizeof(struct sha1_array));
> +	/* NEEDSWORK: should we add an initializer function for
> +	 * sha1_array ? */
> +	memset(hashes, 0, sizeof(struct sha1_array));
> +	item->util = hashes;


	/* NEEDSWORK: should we have SHA1_ARRAY_INIT etc.? */
	item->util = xcalloc(1, sizeof(struct sha1_array));

>  static void collect_submodules_from_diff(struct diff_queue_struct *q,
>  					 struct diff_options *options,
>  					 void *data)
>  {
>  	int i;
> -	struct string_list *needs_pushing = data;
> +	struct string_list *submodules = data;
>  
>  	for (i = 0; i < q->nr; i++) {
>  		struct diff_filepair *p = q->queue[i];
> +		struct sha1_array *hashes;
>  		if (!S_ISGITLINK(p->two->mode))
>  			continue;
> -		if (submodule_needs_pushing(p->two->path, p->two->oid.hash))
> -			string_list_insert(needs_pushing, p->two->path);
> +		hashes = get_sha1s_from_list(submodules, p->two->path);
> +		sha1_array_append(hashes, p->two->oid.hash);
>  	}
>  }

So the idea at this step is still let each commit in the top-level
history inspected for any submodule change, but the result is
collected in a mapping (submodule -> [ list of submodule commits ]).
As we do not expect too many "oops, the old commit was better, so
let's revert and rebind the old one from the submodule" in the
history of the top-level, appending and then running for-each-unique
is an efficient way, instead of first checking if we already have
it and then inserting new ones to maintain the uniqueness.

Makes sense.

> @@ -582,14 +601,41 @@ static void find_unpushed_submodule_commits(struct commit *commit,
>  	diff_tree_combined_merge(commit, 1, &rev);
>  }
>  
> +struct collect_submodule_from_sha1s_data {
> +	char *submodule_path;
> +	struct string_list *needs_pushing;
> +};
> +
> +static void collect_submodules_from_sha1s(const unsigned char sha1[20],
> +		void *data)
> +{
> +	struct collect_submodule_from_sha1s_data *me =
> +		(struct collect_submodule_from_sha1s_data *) data;
> +
> +	if (submodule_needs_pushing(me->submodule_path, sha1))
> +		string_list_insert(me->needs_pushing, me->submodule_path);
> +}

This is called from sha1_array_for_each_unique() that iterates over
the submodule commit object names for one submodule and then ends up
calling submodule_needs_pushing() number of times, which smells less
efficient than it could be.  You can ask

    rev-list <all the submodule commits to be pushed> --not --remotes

just once in the submodule repository.  I imagine that is what you'll
do in the next patch.

An obvious but much less efficient way to optimize this part would
be to see if me->needs_pushing already has me->submodule_path and
skip the check for submodule_needs_pushing(), but if you drop the
call by find_unpushed_submodule to sha1_array_for_each_unique() to
walk new submodule commits one by one, that would become irrelevant.

> +static void free_submodules_sha1s(struct string_list *submodules)
> +{
> +	int i;
> +	for (i = 0; i < submodules->nr; i++) {
> +		struct string_list_item *item = &submodules->items[i];
> +		struct sha1_array *hashes = (struct sha1_array *) item->util;
> +		sha1_array_clear(hashes);
> +	}
> +	string_list_clear(submodules, 1);
> +}
> +
>  int find_unpushed_submodules(unsigned char new_sha1[20],
>  		const char *remotes_name, struct string_list *needs_pushing)
>  {
>  	struct rev_info rev;
>  	struct commit *commit;
>  	const char *argv[] = {NULL, NULL, "--not", "NULL", NULL};
> -	int argc = ARRAY_SIZE(argv) - 1;
> +	int argc = ARRAY_SIZE(argv) - 1, i;
>  	char *sha1_copy;
> +	struct string_list submodules = STRING_LIST_INIT_DUP;
>  
>  	struct strbuf remotes_arg = STRBUF_INIT;
>  
> @@ -603,12 +649,23 @@ int find_unpushed_submodules(unsigned char new_sha1[20],
>  		die("revision walk setup failed");
>  
>  	while ((commit = get_revision(&rev)) != NULL)
> -		find_unpushed_submodule_commits(commit, needs_pushing);
> +		find_unpushed_submodule_commits(commit, &submodules);
>  
>  	reset_revision_walk();
>  	free(sha1_copy);
>  	strbuf_release(&remotes_arg);
>  
> +	for (i = 0; i < submodules.nr; i++) {
> +		struct string_list_item *item = &submodules.items[i];
> +		struct collect_submodule_from_sha1s_data data;
> +		data.submodule_path = item->string;
> +		data.needs_pushing = needs_pushing;
> +		sha1_array_for_each_unique((struct sha1_array *) item->util,
> +				collect_submodules_from_sha1s,
> +				&data);
> +	}
> +	free_submodules_sha1s(&submodules);
> +
>  	return needs_pushing->nr;
>  }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] serialize collection of refs that contain submodule changes
  2016-09-14 17:51         ` [PATCH 2/2] serialize collection of refs that contain submodule changes Heiko Voigt
  2016-09-14 19:46           ` Heiko Voigt
@ 2016-09-16 17:47           ` Junio C Hamano
  2016-09-19 19:51             ` Heiko Voigt
  1 sibling, 1 reply; 25+ messages in thread
From: Junio C Hamano @ 2016-09-16 17:47 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

Heiko Voigt <hvoigt@hvoigt.net> writes:

> diff --git a/submodule.c b/submodule.c
> index b04c066..a15e346 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -627,24 +627,31 @@ static void free_submodules_sha1s(struct string_list *submodules)
>  	string_list_clear(submodules, 1);
>  }
>  
> -int find_unpushed_submodules(unsigned char new_sha1[20],
> +static void append_hash_to_argv(const unsigned char sha1[20],
> +		void *data)
> +{
> +	struct argv_array *argv = (struct argv_array *) data;
> +	argv_array_push(argv, sha1_to_hex(sha1));
> +}
> +
> +int find_unpushed_submodules(struct sha1_array *hashes,
>  		const char *remotes_name, struct string_list *needs_pushing)
>  {
>  	struct rev_info rev;
>  	struct commit *commit;
> -	const char *argv[] = {NULL, NULL, "--not", "NULL", NULL};
> -	int argc = ARRAY_SIZE(argv) - 1, i;
> -	char *sha1_copy;
> +	int i;
>  	struct string_list submodules = STRING_LIST_INIT_DUP;
> +	struct argv_array argv = ARGV_ARRAY_INIT;
>  
> -	struct strbuf remotes_arg = STRBUF_INIT;
> -
> -	strbuf_addf(&remotes_arg, "--remotes=%s", remotes_name);
>  	init_revisions(&rev, NULL);
> -	sha1_copy = xstrdup(sha1_to_hex(new_sha1));
> -	argv[1] = sha1_copy;
> -	argv[3] = remotes_arg.buf;
> -	setup_revisions(argc, argv, &rev, NULL);
> +
> +	/* argv.argv[0] will be ignored by setup_revisions */
> +	argv_array_push(&argv, "find_unpushed_submodules");
> +	sha1_array_for_each_unique(hashes, append_hash_to_argv, &argv);
> +	argv_array_push(&argv, "--not");
> +	argv_array_pushf(&argv, "--remotes=%s", remotes_name);
> +
> +	setup_revisions(argv.argc, argv.argv, &rev, NULL);

Yes, its about time to for us to lose that fixed-size argv[] and
replace it with an argv-array ;-).

>  	if (prepare_revision_walk(&rev))
>  		die("revision walk setup failed");

So this one used to get a single commit at the tip of what we pushed
in the superproject and was asked "Look at the history we just
pushed leading to the tip commit, and tell me if any of the ones new
to the remote requires submodule commits the remote does not yet
have".  Now the caller collects all the tip commits and asks us
once: "Here are the new tips we just pushed; in the history leading
to them, is there a commit that the remote did not have that requires
submodule history the remote does not yet have?".

Makes sort-of sense.

I speculated that you would be doing the same kind of optimization
to feed all positive commits to rev-list at once in each submodule
repository in the review of the previous one, but you didn't do it
here.  You did the same for superproject in this step.  Perhaps 3 or
4 would do so in the submodule repository.

One thing that makes me worried is how the ref cache layer interacts
with this.  I see you first call push_unpushed_submodules() when
ON_DEMAND is set, which would result in pushes in submodule
repositories, updating their remote tracking branches.  At that
point, before you make another call to find_unpushed_submodules(),
is our cached ref layer knows that the remote tracking branches
are now up to date (otherwise, we would incorrectly judge that these
submodules need pushing based on stale information)?

> diff --git a/transport.c b/transport.c
> index 94d6dc3..76e1daf 100644
> --- a/transport.c
> +++ b/transport.c
> @@ -903,23 +903,29 @@ int transport_push(struct transport *transport,
>  
>  		if ((flags & TRANSPORT_RECURSE_SUBMODULES_ON_DEMAND) && !is_bare_repository()) {
>  			struct ref *ref = remote_refs;
> +			struct sha1_array hashes = SHA1_ARRAY_INIT;
> +
>  			for (; ref; ref = ref->next)
> -				if (!is_null_oid(&ref->new_oid) &&
> -				    !push_unpushed_submodules(ref->new_oid.hash,
> -					    transport->remote->name))
> -				    die ("Failed to push all needed submodules!");
> +				if (!is_null_oid(&ref->new_oid))
> +					sha1_array_append(&hashes, ref->new_oid.hash);
> +
> +			if (!push_unpushed_submodules(&hashes, transport->remote->name))
> +				die ("Failed to push all needed submodules!");

Do we leak the contents of hashes here?

>  		}
>  
>  		if ((flags & (TRANSPORT_RECURSE_SUBMODULES_ON_DEMAND |
>  			      TRANSPORT_RECURSE_SUBMODULES_CHECK)) && !is_bare_repository()) {
>  			struct ref *ref = remote_refs;
>  			struct string_list needs_pushing = STRING_LIST_INIT_DUP;
> +			struct sha1_array hashes = SHA1_ARRAY_INIT;
>  
>  			for (; ref; ref = ref->next)
> -				if (!is_null_oid(&ref->new_oid) &&
> -				    find_unpushed_submodules(ref->new_oid.hash,
> -					    transport->remote->name, &needs_pushing))
> -					die_with_unpushed_submodules(&needs_pushing);
> +				if (!is_null_oid(&ref->new_oid))
> +					sha1_array_append(&hashes, ref->new_oid.hash);
> +
> +			if (find_unpushed_submodules(&hashes, transport->remote->name,
> +						&needs_pushing))
> +				die_with_unpushed_submodules(&needs_pushing);

Do we leak the contents of hashes here?  I do not think we need to
worry about needs_pushing leaking, as we will always die if it is
not empty, but it might be a better code hygiene to clear it as
well.

>  		}
>  
>  		push_ret = transport->push_refs(transport, remote_refs, flags);

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/2] batch check whether submodule needs pushing into one call
  2016-09-15 12:10             ` [PATCH 3/2] batch check whether submodule needs pushing into one call Heiko Voigt
  2016-09-15 21:08               ` Junio C Hamano
@ 2016-09-16 17:59               ` Junio C Hamano
  2016-09-19 19:58                 ` Heiko Voigt
  1 sibling, 1 reply; 25+ messages in thread
From: Junio C Hamano @ 2016-09-16 17:59 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

Heiko Voigt <hvoigt@hvoigt.net> writes:

> +static void append_hash_to_argv(const unsigned char sha1[20], void *data)
>  {
> -	if (add_submodule_odb(path) || !lookup_commit_reference(sha1))
> +	struct argv_array *argv = (struct argv_array *) data;
> +	argv_array_push(argv, sha1_to_hex(sha1));
> +}

Hmph, why do I think I've seen this before in the previous patch?

    ... scans through this patch and finds that a similar one is
    removed ;-)

OK.  This makes sense.

> +static void check_has_hash(const unsigned char sha1[20], void *data)
> +{
> +	int *has_hash = (int *) data;
> +
> +	if (!lookup_commit_reference(sha1))
> +		*has_hash = 0;
> +}
> +
> +static int submodule_has_hashes(const char *path, struct sha1_array *hashes)
> +{
> +	int has_hash = 1;
> +
> +	if (add_submodule_odb(path))
> +		return 0;
> +
> +	sha1_array_for_each_unique(hashes, check_has_hash, &has_hash);
> +	return has_hash;
> +}
> +
> +static int submodule_needs_pushing(const char *path, struct sha1_array *hashes)
> +{
> +	if (!submodule_has_hashes(path, hashes))
>  		return 0;

I think you meant well, but this optimization is wrong.  A mere
presence of an object does not mean that the current tip can reach
that object.  Imagine you pushed commit A earlier to them at the
tip, then pushed commit A~ to them at the tip, which is the current
state of the remote of the submodule, and since them they may have
GC'ed.  They no longer have the commit A.

For that matter, because you are doing this check by pretending as
if all the submodule objects are in the object store of the current
superproject you are working in, and saying "it exists there in the
submodule repository" when the only thing you know is it exists in
an object store of either the submodule repository, the superproject
repository, or any of the other submodule repositories, you really
cannot tell much from a mere presence of an object.  Not just the
remote of the submodule repository you are interested in, but the
submodule repository you are interested in itself, may not have that
object.

Drop the previous two helper functions and this short-cut.

>  	if (for_each_remote_ref_submodule(path, has_remote, NULL) > 0) {
>  		struct child_process cp = CHILD_PROCESS_INIT;
> -		const char *argv[] = {"rev-list", NULL, "--not", "--remotes", "-n", "1" , NULL};
> +
> +		argv_array_push(&cp.args, "rev-list");
> +		sha1_array_for_each_unique(hashes, append_hash_to_argv, &cp.args);
> +		argv_array_pushl(&cp.args, "--not", "--remotes", "-n", "1" , NULL);
> +
>  		struct strbuf buf = STRBUF_INIT;
>  		int needs_pushing = 0;
>  
> -		argv[1] = sha1_to_hex(sha1);
> -		cp.argv = argv;
>  		prepare_submodule_repo_env(&cp.env_array);
>  		cp.git_cmd = 1;
>  		cp.no_stdin = 1;
>  		cp.out = -1;
>  		cp.dir = path;
>  		if (start_command(&cp))
> -			die("Could not run 'git rev-list %s --not --remotes -n 1' command in submodule %s",
> -				sha1_to_hex(sha1), path);
> +			die("Could not run 'git rev-list <hashes> --not --remotes -n 1' command in submodule %s",
> +					path);
>  		if (strbuf_read(&buf, cp.out, 41))
>  			needs_pushing = 1;
>  		finish_command(&cp);
> @@ -601,21 +628,6 @@ static void find_unpushed_submodule_commits(struct commit *commit,
>  	diff_tree_combined_merge(commit, 1, &rev);
>  }

Good.  This is the optimization I alluded to in the review of the
first one in the series.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/2] batch check whether submodule needs pushing into one call
  2016-09-16 12:31                   ` Heiko Voigt
@ 2016-09-16 18:13                     ` Junio C Hamano
  2016-09-19 20:08                       ` Heiko Voigt
  0 siblings, 1 reply; 25+ messages in thread
From: Junio C Hamano @ 2016-09-16 18:13 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

Heiko Voigt <hvoigt@hvoigt.net> writes:

> On Fri, Sep 16, 2016 at 11:40:19AM +0200, Heiko Voigt wrote:
>> > By the way, with the two new patches, 'pu' seems to start failing
>> > some tests, e.g. 5533 5404 5405.
>> 
>> Ah ok I did only test on master, will look into those.
>
> Ok I had a look into these and the reason t5533 fails is because on pu
> --recurse-submodules is enabled by default and I missed the case when
> overwriting a ref. In that case we get the sha1 from the remote side as
> old. So we could catch that and fall back to all revisions there, but...
>
> ... tl;dr: The solution to use the old revisions from the remote side
> was too simple and does not make matters better but actually worse for
> some typical usecases. Its only in the last patch.

You may not even have the old one in your copy of the remote
repository if you haven't fetched from them and you are forcing your
push.  "rev-list <new ones> --not <old ones>" may fail in such a case,
not producing the list of new commits.  You'd need to exclude old ones
you learned over the wire that you do not yet have locally.

> The most exact solution would be to use all actual remote refs available
> (not sure if we have them at this point in the process?) another
> solution would be to still append the --remotes=<remotename> option as a
> fallback to reduce the revisions.

I'd say --remotes=<remotename> is the least problematic thing to do.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] serialize collection of changed submodules
  2016-09-16 17:27           ` [PATCH 1/2] serialize collection of changed submodules Junio C Hamano
@ 2016-09-19 19:44             ` Heiko Voigt
  0 siblings, 0 replies; 25+ messages in thread
From: Heiko Voigt @ 2016-09-19 19:44 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

On Fri, Sep 16, 2016 at 10:27:04AM -0700, Junio C Hamano wrote:
> Heiko Voigt <hvoigt@hvoigt.net> writes:
> 
> > +static struct sha1_array *get_sha1s_from_list(struct string_list *submodules,
> > +		const char *path)
> > +{
> > +	struct string_list_item *item;
> > +	struct sha1_array *hashes;
> > +
> > +	item = string_list_insert(submodules, path);
> > +	if (item->util)
> > +		return (struct sha1_array *) item->util;
> > +
> > +	hashes = (struct sha1_array *) xmalloc(sizeof(struct sha1_array));
> > +	/* NEEDSWORK: should we add an initializer function for
> > +	 * sha1_array ? */
> > +	memset(hashes, 0, sizeof(struct sha1_array));
> > +	item->util = hashes;
> 
> 
> 	/* NEEDSWORK: should we have SHA1_ARRAY_INIT etc.? */
> 	item->util = xcalloc(1, sizeof(struct sha1_array));

Ok will do.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] serialize collection of refs that contain submodule changes
  2016-09-16 17:47           ` Junio C Hamano
@ 2016-09-19 19:51             ` Heiko Voigt
  2016-09-19 20:09               ` Junio C Hamano
  0 siblings, 1 reply; 25+ messages in thread
From: Heiko Voigt @ 2016-09-19 19:51 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

Hi,

On Fri, Sep 16, 2016 at 10:47:46AM -0700, Junio C Hamano wrote:
> One thing that makes me worried is how the ref cache layer interacts
> with this.  I see you first call push_unpushed_submodules() when
> ON_DEMAND is set, which would result in pushes in submodule
> repositories, updating their remote tracking branches.  At that
> point, before you make another call to find_unpushed_submodules(),
> is our cached ref layer knows that the remote tracking branches
> are now up to date (otherwise, we would incorrectly judge that these
> submodules need pushing based on stale information)?

I am not sure if I understand you correctly here. With the "ref cache layer"
you are referring to add_submodule_odb() which is called indirectly from
submodule_needs_pushing()? Those revs are only used to check whether the hash
we need on the remote side exists in the local submodule. That should not
change due to a push. The actual check whether the commit(s) exist on the
remote side is done using a 'rev-list' in a subprocess later.


> > diff --git a/transport.c b/transport.c
> > index 94d6dc3..76e1daf 100644
> > --- a/transport.c
> > +++ b/transport.c
> > @@ -903,23 +903,29 @@ int transport_push(struct transport *transport,
> >  
> >  		if ((flags & TRANSPORT_RECURSE_SUBMODULES_ON_DEMAND) && !is_bare_repository()) {
> >  			struct ref *ref = remote_refs;
> > +			struct sha1_array hashes = SHA1_ARRAY_INIT;
> > +
> >  			for (; ref; ref = ref->next)
> > -				if (!is_null_oid(&ref->new_oid) &&
> > -				    !push_unpushed_submodules(ref->new_oid.hash,
> > -					    transport->remote->name))
> > -				    die ("Failed to push all needed submodules!");
> > +				if (!is_null_oid(&ref->new_oid))
> > +					sha1_array_append(&hashes, ref->new_oid.hash);
> > +
> > +			if (!push_unpushed_submodules(&hashes, transport->remote->name))
> > +				die ("Failed to push all needed submodules!");
> 
> Do we leak the contents of hashes here?

Good catch, sorry about that. Will clear the hashes array.


> >  		}
> >  
> >  		if ((flags & (TRANSPORT_RECURSE_SUBMODULES_ON_DEMAND |
> >  			      TRANSPORT_RECURSE_SUBMODULES_CHECK)) && !is_bare_repository()) {
> >  			struct ref *ref = remote_refs;
> >  			struct string_list needs_pushing = STRING_LIST_INIT_DUP;
> > +			struct sha1_array hashes = SHA1_ARRAY_INIT;
> >  
> >  			for (; ref; ref = ref->next)
> > -				if (!is_null_oid(&ref->new_oid) &&
> > -				    find_unpushed_submodules(ref->new_oid.hash,
> > -					    transport->remote->name, &needs_pushing))
> > -					die_with_unpushed_submodules(&needs_pushing);
> > +				if (!is_null_oid(&ref->new_oid))
> > +					sha1_array_append(&hashes, ref->new_oid.hash);
> > +
> > +			if (find_unpushed_submodules(&hashes, transport->remote->name,
> > +						&needs_pushing))
> > +				die_with_unpushed_submodules(&needs_pushing);
> 
> Do we leak the contents of hashes here?  I do not think we need to
> worry about needs_pushing leaking, as we will always die if it is
> not empty, but it might be a better code hygiene to clear it as
> well.

As above, will fix.

Cheers Heiko

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/2] batch check whether submodule needs pushing into one call
  2016-09-16 17:59               ` Junio C Hamano
@ 2016-09-19 19:58                 ` Heiko Voigt
  0 siblings, 0 replies; 25+ messages in thread
From: Heiko Voigt @ 2016-09-19 19:58 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

On Fri, Sep 16, 2016 at 10:59:37AM -0700, Junio C Hamano wrote:
> Heiko Voigt <hvoigt@hvoigt.net> writes:
> > +static void check_has_hash(const unsigned char sha1[20], void *data)
> > +{
> > +	int *has_hash = (int *) data;
> > +
> > +	if (!lookup_commit_reference(sha1))
> > +		*has_hash = 0;
> > +}
> > +
> > +static int submodule_has_hashes(const char *path, struct sha1_array *hashes)
> > +{
> > +	int has_hash = 1;
> > +
> > +	if (add_submodule_odb(path))
> > +		return 0;
> > +
> > +	sha1_array_for_each_unique(hashes, check_has_hash, &has_hash);
> > +	return has_hash;
> > +}
> > +
> > +static int submodule_needs_pushing(const char *path, struct sha1_array *hashes)
> > +{
> > +	if (!submodule_has_hashes(path, hashes))
> >  		return 0;
> 
> I think you meant well, but this optimization is wrong.  A mere
> presence of an object does not mean that the current tip can reach
> that object.  Imagine you pushed commit A earlier to them at the
> tip, then pushed commit A~ to them at the tip, which is the current
> state of the remote of the submodule, and since them they may have
> GC'ed.  They no longer have the commit A.
> 
> For that matter, because you are doing this check by pretending as
> if all the submodule objects are in the object store of the current
> superproject you are working in, and saying "it exists there in the
> submodule repository" when the only thing you know is it exists in
> an object store of either the submodule repository, the superproject
> repository, or any of the other submodule repositories, you really
> cannot tell much from a mere presence of an object.  Not just the
> remote of the submodule repository you are interested in, but the
> submodule repository you are interested in itself, may not have that
> object.
> 
> Drop the previous two helper functions and this short-cut.

This is nothing I added. This came from the original code which I simply
extended to check for all sha1's. The diff is a little bit misleading. This
would be the local diff:

-       if (add_submodule_odb(path) || !lookup_commit_reference(sha1))
+       if (!submodule_has_hashes(path, hashes))
                return 0;

I think that this is more a shortcut for: If we can not find the sha1, we do
not need to bother spawning a process for the real check. We default to the
answer no in that case.

It looks like the reason for the sha1 check is to avoid error output from the
spawned 'rev-list' process. The error output might be confusing to the user in
case the sha1 does not exist in the submodule. We should probably die here
since we were not able to check a submodule that has a diff in the revisions we
would push.

After thinking about this further: I think it is actually bad to proceed here,
as the current revisions (just in the superproject) would still be pushed and
the user possibly gets an inconsistent state on the remote side which he tried
to avoid with check/on-demand-push enabled.

So in short this deserves some further love. Will have a look into adding
another patch for this if we agree on my suggestion to die above.

Cheers Heiko

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/2] batch check whether submodule needs pushing into one call
  2016-09-16 18:13                     ` Junio C Hamano
@ 2016-09-19 20:08                       ` Heiko Voigt
  0 siblings, 0 replies; 25+ messages in thread
From: Heiko Voigt @ 2016-09-19 20:08 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

On Fri, Sep 16, 2016 at 11:13:09AM -0700, Junio C Hamano wrote:
> Heiko Voigt <hvoigt@hvoigt.net> writes:
> > The most exact solution would be to use all actual remote refs available
> > (not sure if we have them at this point in the process?) another
> > solution would be to still append the --remotes=<remotename> option as a
> > fallback to reduce the revisions.
> 
> I'd say --remotes=<remotename> is the least problematic thing to do.

Ok then lets drop my last patch and keep it the way it was. Because if
the remote sha1 differs we probably do not have it locally anyway. The
only case this does not catch is when the user specifies a remote URL.
But that just means we will iterate over all revisions instead of a
reduced set, which makes the check slower but still correct. As one can
see from my measurements that should not be that bad anymore.

Cheers Heiko

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] serialize collection of refs that contain submodule changes
  2016-09-19 19:51             ` Heiko Voigt
@ 2016-09-19 20:09               ` Junio C Hamano
  0 siblings, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2016-09-19 20:09 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jeff King, Stefan Beller, git@vger.kernel.org, Jens Lehmann,
	Fredrik Gustafsson, Leandro Lucarella

Heiko Voigt <hvoigt@hvoigt.net> writes:

> I am not sure if I understand you correctly here. With the "ref cache layer"
> you are referring to add_submodule_odb() which is called indirectly from
> submodule_needs_pushing()? Those revs are only used to check whether the hash
> we need on the remote side exists in the local submodule. That should not
> change due to a push. The actual check whether the commit(s) exist on the
> remote side is done using a 'rev-list' in a subprocess later.

I was wondering what would happen in this scenario:

 * You have ON_DEMAND set, which causes "git -C sub push origin" to
   push out what are new, updating the remote tracking branches in
   the submodule, sub/.git/refs/remotes/origin/*.

 * Then you check again.  If you used for-each-ref-in-submodule, the
   updated refs/remotes/origin/* may not have been re-read.

But you check by spawning "rev-list ... --not --remotes" as a
separate process in submodule_needs_pushing(), and that will force
the new process to read the updated state, so it turns out that I
was overly worried without good reason ;-)

It may matter once somebody tries to internalize the external
rev-list call done via start_command() interface to an internal
call, though.  But we are not there yet.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2016-09-19 20:11 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-24 17:30 [PATCHv2] push: change submodule default to check Stefan Beller
2016-08-24 18:38 ` Junio C Hamano
     [not found] ` <20160824183112.ceekegpzavnbybxp@sigill.intra.peff.net>
2016-08-24 19:37   ` Junio C Hamano
2016-08-24 21:26     ` Junio C Hamano
2016-08-24 22:37     ` Stefan Beller
2016-08-24 23:01       ` Jeff King
2016-09-14 17:31         ` [PATCH 1/2] serialize collection of changed submodules Heiko Voigt
2016-09-14 22:30           ` Junio C Hamano
2016-09-15 12:10             ` [PATCH 3/2] batch check whether submodule needs pushing into one call Heiko Voigt
2016-09-15 21:08               ` Junio C Hamano
2016-09-16  9:40                 ` Heiko Voigt
2016-09-16 12:31                   ` Heiko Voigt
2016-09-16 18:13                     ` Junio C Hamano
2016-09-19 20:08                       ` Heiko Voigt
2016-09-16 17:59               ` Junio C Hamano
2016-09-19 19:58                 ` Heiko Voigt
2016-09-15 12:18             ` [PATCH 4/2] use actual start hashes for submodule push check instead of local refs Heiko Voigt
2016-09-16 17:27           ` [PATCH 1/2] serialize collection of changed submodules Junio C Hamano
2016-09-19 19:44             ` Heiko Voigt
2016-09-14 17:51         ` [PATCH 2/2] serialize collection of refs that contain submodule changes Heiko Voigt
2016-09-14 19:46           ` Heiko Voigt
2016-09-14 20:04             ` Stefan Beller
2016-09-16 17:47           ` Junio C Hamano
2016-09-19 19:51             ` Heiko Voigt
2016-09-19 20:09               ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).