git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Cc: "Martin Fick" <mfick@codeaurora.org>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Derrick Stolee" <stolee@gmail.com>,
	"Lars Schneider" <larsxschneider@gmail.com>,
	git <git@vger.kernel.org>, "Duy Nguyen" <pclouds@gmail.com>
Subject: Re: worktrees vs. alternates
Date: Wed, 16 May 2018 13:17:28 -0700	[thread overview]
Message-ID: <20180516201727.GD4036@sigill.intra.peff.net> (raw)
In-Reply-To: <5156717b-6fc9-b792-dfa4-1ba48ac50333@linuxfoundation.org>

On Wed, May 16, 2018 at 04:02:53PM -0400, Konstantin Ryabitsev wrote:

> On 05/16/18 15:37, Jeff King wrote:
> > Yes, that's pretty close to what we do at GitHub. Before doing any
> > repacking in the mother repo, we actually do the equivalent of:
> > 
> >   git fetch --prune ../$id.git +refs/*:refs/remotes/$id/*
> >   git repack -Adl
> > 
> > from each child to pick up any new objects to de-duplicate (our "mother"
> > repos are not real repos at all, but just big shared-object stores).
> 
> Yes, I keep thinking of doing the same, too -- instead of using
> torvalds/linux.git for alternates, have an internal repo where objects
> from all forks are stored. This conversation may finally give me the
> shove I've been needing to poke at this. :)
> 
> Is your delta-islands patch heading into upstream, or is that something
> that's going to remain external?

I have vague plans to submit it upstream, but I'm still not convinced
it's quite optimal. The resulting packs tend to be a fair bit larger
than they could be when packed by themselves, because we miss many delta
opportunities (and it's important to "repack -f --window=250" once in a
while, since we're throwing away so many delta candidates).

There's an alternative way of doing it, too, which I think git.or.cz
uses: it "layers" forks in a hierarchy. So if I fork torvalds/linux.git,
then I get my own repo that uses torvalds/linux as an alternate. And if
somebody forks my repo, then I'm their alternate, and they recursively
depend on torvalds/linux. So each fork basically layers a slice of its
own pack on top of the parent.

This is all from recollections of past discussions (which were sadly not
on the list -- I don't know if they've written up their scheme anywhere
public), so I may have some details wrong. But I think that their
repacking is done hierarchically, too: any objects which the root fork
might drop get migrated up to the children instead, and so forth, until
the leaf nodes can actually throw away objects.

The big problem with this is that Git tends to behave better when
objects are in the same pack:

  1. We don't bother looking for new deltas within the same pack,
     whereas a clone of a fork may actually try to find new deltas
     between the layers.

  2. Reachability bitmaps can't cross pack boundaries (due to the way
     they're implemented, but also the current on-disk format). So you
     can only bitmap the root repo, not any of the other layers.

> I feel like a whitepaper on "how we deal with bajillions of forks at
> GitHub" would be nice. :) I was previously told that it's unlikely such
> paper could be written due to so many custom-built things at GH, but I
> would be very happy if that turned out not to be the case.

We have a few engineering blog posts on the subject, like:

  https://githubengineering.com/counting-objects/
  https://githubengineering.com/introducing-dgit/
  https://githubengineering.com/building-resilience-in-spokes/

but we haven't done a very good job of keeping that up. I think a
summary whitepaper would interesting. Maybe one day...:)

-Peff

  reply	other threads:[~2018-05-16 20:17 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-16  8:13 worktrees vs. alternates Lars Schneider
2018-05-16  9:29 ` Ævar Arnfjörð Bjarmason
2018-05-16  9:42   ` Robert P. J. Day
2018-05-16 11:07     ` Ævar Arnfjörð Bjarmason
2018-05-16  9:51   ` Lars Schneider
2018-05-16 10:33     ` Ævar Arnfjörð Bjarmason
2018-05-16 13:02       ` Derrick Stolee
2018-05-16 14:58         ` Konstantin Ryabitsev
2018-05-16 15:34           ` Ævar Arnfjörð Bjarmason
2018-05-16 15:49             ` Konstantin Ryabitsev
2018-05-16 17:54               ` Ævar Arnfjörð Bjarmason
2018-05-16 17:14           ` Martin Fick
2018-05-16 17:41             ` Konstantin Ryabitsev
2018-05-16 18:02               ` Ævar Arnfjörð Bjarmason
2018-05-16 18:12                 ` Konstantin Ryabitsev
2018-05-16 18:26                   ` Martin Fick
2018-05-16 19:01                     ` Konstantin Ryabitsev
2018-05-16 19:03                       ` Martin Fick
2018-05-16 19:11                         ` Konstantin Ryabitsev
2018-05-16 19:18                           ` Martin Fick
2018-05-16 19:23                       ` Jeff King
2018-05-16 19:29                         ` Konstantin Ryabitsev
2018-05-16 19:37                           ` Jeff King
2018-05-16 19:40                             ` Martin Fick
2018-05-16 20:06                               ` Jeff King
2018-05-16 20:43                                 ` Martin Fick
2018-05-16 20:02                             ` Konstantin Ryabitsev
2018-05-16 20:17                               ` Jeff King [this message]
2018-05-17  0:43                               ` Sitaram Chamarty
2018-05-17  3:31                                 ` Jeff King
2018-05-19  5:45                                   ` Duy Nguyen
2018-05-16 19:14           ` Jeff King
2018-05-16 21:18             ` Stefan Beller
2018-05-16 23:45               ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180516201727.GD4036@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=konstantin@linuxfoundation.org \
    --cc=larsxschneider@gmail.com \
    --cc=mfick@codeaurora.org \
    --cc=pclouds@gmail.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).