git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Martin Fick <mfick@codeaurora.org>
To: Junio C Hamano <gitster@pobox.com>
Cc: Jeff King <peff@peff.net>,
	Michael Haggerty <mhagger@alum.mit.edu>,
	git discussion list <git@vger.kernel.org>
Subject: Re: remove_duplicates() in builtin/fetch-pack.c is O(N^2)
Date: Tue, 22 May 2012 18:46:37 -0600	[thread overview]
Message-ID: <201205221846.38230.mfick@codeaurora.org> (raw)
In-Reply-To: <7v8vgj3imy.fsf@alter.siamese.dyndns.org>

On Tuesday, May 22, 2012 05:23:17 pm Junio C Hamano wrote:
> Martin Fick <mfick@codeaurora.org> writes:
>
> I think we can ignore that 400 part from the above for
> now.  As they are independent "projects", it is
> unreasonable to expect that any solution would add costs
> less than linearly when you add more projects.  

As far as git is concerned, sure they are independent, but 
the projects aren't really independent.  Separate repos were 
likely selected for scalibility reasons in the first place, 
but this ends up being a "scale fail" pattern.  I don't 
expect git to fix this * 400 issue, but I want to highlight, 
how difficult it is to make things scale in these cases and 
that this problems is a natural multiplier when you split 
something into separate git repos because your tags tend to 
now be multiplied.  So while 10K tags sound like a lot but 
manageable for 1 repo, it ends up actually being 4M tags if 
you split your repos.


...
> I think the assumption in your build-bot scenario needs
> to be a bit more clarified to give context.  I am
> assuming, when you say "see if projects are up to date",
> you mean your build-bot is interested in a single branch
> in each repository, possibly together with new tags that
> have become reachable from the updated tip of the branch
> (if the branch tip moved). 

Just the branch generally.


> I also assume that the
> build-bot polls very often and that is why 200MB (or
> divided by 400 it would be 0.5MB, which still is a lot
> when you talk about a single repository) hurts a lot.

To clarify for others (since you likely know), we serialize 
our submits, so the "poll" is before every build to ensure 
that we are up to date with what just got merged.  In our 
case we use Gerrit with slaves (read only Gerrit copies), so 
we offload the "update" to the slave, which incurs the same 
hit, all though it is not quite a No Op but close (there 
will have been a batch of commits added to a few projects).  
But then since the slave may be slightly behind the master 
(replication from the master to the slave takes time, 
partially because of the same ref advertisement problem), so 
we want to verify the slave update by polling the master, 
which if replication was fast enough would be a No Op.  So, 
our attempts to scale by using slaves does not help very 
much since the real update and the No Op are almost 
identical in their impacts... There are workarounds, but 
mostly hacks.


...
> Unlike the capability exchange that happens after the
> initial connection, there is no gap in the protocol for
> the side that initiates the connection to tell the
> serving side to skip/delay the initial ref
> advertisement.

Crazy stupid suggestion: could the URI somehow be exploited 
to signal the option to disable advertisements?  What if the 
path part of the URI started with something like several 
slashes?  Older servers just strip the extra slashes right?  
Newer servers could disable advertisements if they see the 
slashes.  Of course, this gets ugly if scripts add slashes 
and expect them to be stripped (old clients talking to new 
servers would probably then fail),

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum

      reply	other threads:[~2012-05-23  0:46 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-05-21  8:13 remove_duplicates() in builtin/fetch-pack.c is O(N^2) Michael Haggerty
2012-05-21  9:09 ` Junio C Hamano
2012-05-21  9:42 ` demerphq
2012-05-21 17:45 ` Jeff King
2012-05-21 22:14   ` Jeff King
2012-05-21 22:17     ` [PATCH 1/5] fetch-pack: sort incoming heads Jeff King
2012-05-22 20:08       ` Junio C Hamano
2012-05-22 20:23         ` Jeff King
2012-05-24  6:04           ` Jeff King
2012-05-21 22:17     ` [PATCH 2/5] fetch-pack: avoid quadratic behavior in remove_duplicates Jeff King
2012-05-21 22:19     ` [PATCH 3/5] add sorting infrastructure for list refs Jeff King
2012-05-21 22:19     ` [PATCH 4/5] fetch-pack: sort the list of incoming refs Jeff King
2012-05-21 22:23     ` [PATCH 5/5] fetch-pack: avoid quadratic loop in filter_refs Jeff King
2012-05-22 20:16       ` Junio C Hamano
2012-05-21 23:52     ` remove_duplicates() in builtin/fetch-pack.c is O(N^2) Jeff King
2012-05-22  0:07       ` Jeff King
2012-05-22  3:59       ` Michael Haggerty
2012-05-22  4:11         ` Jeff King
2012-05-22  7:15           ` Michael Haggerty
2012-05-22  7:37             ` Jeff King
2012-05-22 13:28               ` Michael Haggerty
2012-05-22 17:33                 ` Jeff King
2012-05-24 12:05                   ` Michael Haggerty
2012-05-25  0:17     ` Martin Fick
2012-05-25  0:39       ` Jeff King
2012-05-25  0:54         ` Martin Fick
2012-05-25  1:04           ` Jeff King
2012-05-25  1:32             ` Martin Fick
2012-05-25  6:50               ` Jeff King
2012-05-22 12:18   ` Nguyen Thai Ngoc Duy
2012-05-22 13:35     ` Michael Haggerty
2012-05-22 17:01     ` Jeff King
2012-05-22 17:35       ` Junio C Hamano
2012-05-22 17:46         ` Jeff King
2012-05-24  4:54         ` Michael Haggerty
2012-05-23  1:20       ` Nguyen Thai Ngoc Duy
2012-05-22 16:16   ` Junio C Hamano
2012-05-21 18:15 ` Martin Fick
2012-05-21 19:41   ` Jeff King
2012-05-21 22:08     ` Junio C Hamano
2012-05-21 22:24       ` Jeff King
2012-05-22  5:51     ` Martin Fick
2012-05-22 18:21       ` Jeff King
2012-05-22 22:19         ` Martin Fick
2012-05-22 23:23           ` Junio C Hamano
2012-05-23  0:46             ` Martin Fick [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201205221846.38230.mfick@codeaurora.org \
    --to=mfick@codeaurora.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=mhagger@alum.mit.edu \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).