git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Stefan Beller <sbeller@google.com>
Cc: Junio C Hamano <gitster@pobox.com>,
	"git@vger.kernel.org" <git@vger.kernel.org>,
	Heiko Voigt <hvoigt@hvoigt.net>,
	Jens Lehmann <Jens.Lehmann@web.de>,
	Fredrik Gustafsson <iveqy@iveqy.com>,
	Leandro Lucarella <leandro.lucarella@sociomantic.com>
Subject: Re: [PATCHv2] push: change submodule default to check
Date: Wed, 24 Aug 2016 19:01:16 -0400	[thread overview]
Message-ID: <20160824230115.jhmcr4r7wobj5ejb@sigill.intra.peff.net> (raw)
In-Reply-To: <CAGZ79kYOBqQ0FF4J-+KbefSD8HRrUeMqpO27m_jprhm93aB+LA@mail.gmail.com>

On Wed, Aug 24, 2016 at 03:37:29PM -0700, Stefan Beller wrote:

> > That sounds massively ... broken.  So before even thinking about
> > flipping it to default, this needs to be fixed first.
> 
> I agree. That sounds bad.
> 
> However having the --auto-check feels like papering over the
> actual problem which to me sounds like a design problem.
> However this may be a viable short term solution.

Sort of...

I may not have been clear, but there are really a few things going on.

One is that the design of find_unpushed_submodules() is just brain-dead.
It does one traversal per updated ref, which means a from-scratch mirror
is O(nr_of_refs * nr_of_commits). That's just silly, and can easily be
fixed behind the scenes to be O(nr_of_commits).

And I _suspect_ it is what made Junio's earlier push so awful; he
probably pushed up the same commits as part of many different branches,
so he did the same diffs over and over.

So clearing that up seems like an obvious first step, and dulls the pain
to "if submodule recursion is on, the worst case is that you walk all
the new objects you are sending". That's still _a_ traversal, but it's
one we have to do anyway in pack-objects, so it's the same order of
magnitude as the rest of the push[1].

Then you've got two cases: the repo is using submodules at all, or they
are not. The former is an easy case, if we can identify it; we can avoid
the traversal at all, and people who do not use submodules are not
regressed at all.

That leaves people who _are_ using submodules with paying the extra
traversal cost. Not great, but only really a pain when you have a really
big chunk of history to push. It may be lost in the noise for such a
push in more normal circumstances (where bandwidth to push up the
objects dominates, though it is unfortunate that we do not even start
utilizing the bandwidth, the critical resource, until we are done with
the submodule check).

[1] Of course with reachability bitmaps the pack-objects traversal goes
    away, but the same cannot be accomplished here (because they do not
    store the gitlink sha1s at all, because they do not imply
    reachability).

> We need to answer the question: Which submodule commits
> are referenced by a given set of superproject commits.
> 
> This question is advancing a very similar question that we'd
> have to ask in git-gc. In gc we would end up without having to
> worry about a specific set, but rather the all reachable commits
> of the superproject are in the given set.
> 
> So we could solve two issues at the same time if we had a quick
> way to answer this question quickly.
> [...]

I snipped here because your solutions sound complicated (which isn't to
say they're wrong, but that I am not willing to give them a lot of
thought at this time in the evening ;) ).

One opposite approach which appeals to me is not to remove the need for
the traversal, but to make it much faster. E.g., by storing commits in a
form that can be traversed more quickly, and possibly keeping a bitmap
cache of which paths are touched in each commit (I have posted patches
to the list for the former, but have only been considering experimenting
with the latter).

That's _also_ complicated, but it applies to way more things. Including
normal log traversals, path-limiting for diffs, the "counting" traversal
done by pack-object, etc.

And while it is complicated in some ways, it's conceptually simple at
the git data model layer. It's returning the same old answers about
commits and trees, just faster.

Anyway, food for thought.

-Peff

  reply	other threads:[~2016-08-24 23:02 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-24 17:30 [PATCHv2] push: change submodule default to check Stefan Beller
2016-08-24 18:38 ` Junio C Hamano
     [not found] ` <20160824183112.ceekegpzavnbybxp@sigill.intra.peff.net>
2016-08-24 19:37   ` Junio C Hamano
2016-08-24 21:26     ` Junio C Hamano
2016-08-24 22:37     ` Stefan Beller
2016-08-24 23:01       ` Jeff King [this message]
2016-09-14 17:31         ` [PATCH 1/2] serialize collection of changed submodules Heiko Voigt
2016-09-14 22:30           ` Junio C Hamano
2016-09-15 12:10             ` [PATCH 3/2] batch check whether submodule needs pushing into one call Heiko Voigt
2016-09-15 21:08               ` Junio C Hamano
2016-09-16  9:40                 ` Heiko Voigt
2016-09-16 12:31                   ` Heiko Voigt
2016-09-16 18:13                     ` Junio C Hamano
2016-09-19 20:08                       ` Heiko Voigt
2016-09-16 17:59               ` Junio C Hamano
2016-09-19 19:58                 ` Heiko Voigt
2016-09-15 12:18             ` [PATCH 4/2] use actual start hashes for submodule push check instead of local refs Heiko Voigt
2016-09-16 17:27           ` [PATCH 1/2] serialize collection of changed submodules Junio C Hamano
2016-09-19 19:44             ` Heiko Voigt
2016-09-14 17:51         ` [PATCH 2/2] serialize collection of refs that contain submodule changes Heiko Voigt
2016-09-14 19:46           ` Heiko Voigt
2016-09-14 20:04             ` Stefan Beller
2016-09-16 17:47           ` Junio C Hamano
2016-09-19 19:51             ` Heiko Voigt
2016-09-19 20:09               ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160824230115.jhmcr4r7wobj5ejb@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=Jens.Lehmann@web.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=hvoigt@hvoigt.net \
    --cc=iveqy@iveqy.com \
    --cc=leandro.lucarella@sociomantic.com \
    --cc=sbeller@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).