git@vger.kernel.org mailing list mirror (one of many)
* worktrees vs. alternates
@ 2018-05-16  8:13 Lars Schneider
  2018-05-16  9:29 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 34+ messages in thread
From: Lars Schneider @ 2018-05-16  8:13 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Duy Nguyen

Hi,

I am looking into different options to cache Git repositories on build
machines. The two most promising ways seem to be git-worktree [1] and
git-alternates [2].

I wonder if you see an advantage of one over the other? 

My impression is that git-worktree supersedes git-alternates. Would
that be a fair statement? If yes, would it make sense to deprecate
alternates for simplification?

Thanks,
Lars


[1] https://git-scm.com/docs/git-worktree
[2] https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates


* Re: worktrees vs. alternates
  2018-05-16  8:13 worktrees vs. alternates Lars Schneider
@ 2018-05-16  9:29 ` Ævar Arnfjörð Bjarmason
  2018-05-16  9:42   ` Robert P. J. Day
  2018-05-16  9:51   ` Lars Schneider
  0 siblings, 2 replies; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-05-16  9:29 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git, Jeff King, Duy Nguyen


On Wed, May 16 2018, Lars Schneider wrote:

> I am looking into different options to cache Git repositories on build
> machines. The two most promising ways seem to be git-worktree [1] and
> git-alternates [2].
>
> I wonder if you see an advantage of one over the other?
>
> My impression is that git-worktree supersedes git-alternates. Would
> that be a fair statement? If yes, would it make sense to deprecate
> alternates for simplification?
>
> [1] https://git-scm.com/docs/git-worktree
> [2] https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates

It's not correct that worktrees supersede alternates, or the other way
around; they're orthogonal features.

git-worktree allows you to create a new working directory connected to
the same local object store.

Alternates allow you to declare, in any given local object store, that
your set of objects isn't complete and that the rest can be found at some
other location. Those object stores may or may not have more than one
worktree connected to them.
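
For anyone who wants to see the difference on disk, here is a minimal
sketch. It assumes a throwaway directory and made-up repository names
(main, main-topic, borrower); none of them come from a real setup. The
worktree resolves to the same object directory as its parent repository,
while the --shared clone has its own object directory that merely points
at another one:

```shell
set -e
# Identity for the demo commit only (assumed values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"

# A small repository with one commit.
git init -q main
git -C main commit -q --allow-empty -m "initial commit"

# Worktree: a second working directory attached to main's repository.
git -C main worktree add -b topic ../main-topic >/dev/null 2>&1

# Both working directories resolve to the same object directory.
main_objects=$(cd main && cd "$(git rev-parse --git-path objects)" && pwd -P)
topic_objects=$(cd main-topic && cd "$(git rev-parse --git-path objects)" && pwd -P)
echo "main:  $main_objects"
echo "topic: $topic_objects"

# Alternates: a separate repository that borrows objects from main.
git clone -q --shared main borrower
cat borrower/.git/objects/info/alternates
```

The two echoed paths should be identical, and the alternates file should
contain the path of main's object directory.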


* Re: worktrees vs. alternates
  2018-05-16  9:29 ` Ævar Arnfjörð Bjarmason
@ 2018-05-16  9:42   ` Robert P. J. Day
  2018-05-16 11:07     ` Ævar Arnfjörð Bjarmason
  2018-05-16  9:51   ` Lars Schneider
  1 sibling, 1 reply; 34+ messages in thread
From: Robert P. J. Day @ 2018-05-16  9:42 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Lars Schneider, git, Jeff King, Duy Nguyen

On Wed, 16 May 2018, Ævar Arnfjörð Bjarmason wrote:

>
> On Wed, May 16 2018, Lars Schneider wrote:
>
> > I am looking into different options to cache Git repositories on build
> > machines. The two most promising ways seem to be git-worktree [1] and
> > git-alternates [2].
> >
> > I wonder if you see an advantage of one over the other?
> >
> > My impression is that git-worktree supersedes git-alternates. Would
> > that be a fair statement? If yes, would it make sense to deprecate
> > alternates for simplification?
> >
> > [1] https://git-scm.com/docs/git-worktree
> > [2] https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates
>
> It's not correct that worktrees supersede alternates, or the other
> way around, they're orthogonal features.
>
> git-worktree allows you to create a new working directory connected
> to the same local object store.
>
> Alternates allow you to declare in any given local object store,
> that your set of objects isn't complete, and you can find the rest
> at some other location, those object stores may or may not have more
> than one worktree connected to them.

  just to be clear here, there should be nothing about how alternates
are set up for a repository that should affect the normal behaviour of
working trees for that repository, correct? i never thought there was,
i just thought i'd make absolutely sure.

rday


* Re: worktrees vs. alternates
  2018-05-16  9:29 ` Ævar Arnfjörð Bjarmason
  2018-05-16  9:42   ` Robert P. J. Day
@ 2018-05-16  9:51   ` Lars Schneider
  2018-05-16 10:33     ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 34+ messages in thread
From: Lars Schneider @ 2018-05-16  9:51 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Jeff King, Duy Nguyen


> On 16 May 2018, at 11:29, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> 
> 
> On Wed, May 16 2018, Lars Schneider wrote:
> 
>> I am looking into different options to cache Git repositories on build
>> machines. The two most promising ways seem to be git-worktree [1] and
>> git-alternates [2].
>> 
>> I wonder if you see an advantage of one over the other?
>> 
>> My impression is that git-worktree supersedes git-alternates. Would
>> that be a fair statement? If yes, would it make sense to deprecate
>> alternates for simplification?
>> 
>> [1] https://git-scm.com/docs/git-worktree
>> [2] https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates
> 
> It's not correct that worktrees supersede alternates, or the other way
> around, they're orthogonal features.
> 
> git-worktree allows you to create a new working directory connected to
> the same local object store.
> 
> Alternates allow you to declare in any given local object store, that
> your set of objects isn't complete, and you can find the rest at some
> other location, those object stores may or may not have more than one
> worktree connected to them.

OK. I just wonder in what situation I would work with an incomplete
object store. The only use case I could imagine is that two repos share
a common set of objects (most likely blobs). However, in that situation
I would keep the two independent lines of development in a single repo
with two root commits.

Would it be fair to say that "git alternates" are a good mechanism to
cache objects across different repos? However, I would consider a cache
hit between different repos unlikely. In that line of thinking
"git worktree" would be a good (maybe better?) mechanism to cache objects
for a single repo?

Thanks,
Lars


* Re: worktrees vs. alternates
  2018-05-16  9:51   ` Lars Schneider
@ 2018-05-16 10:33     ` Ævar Arnfjörð Bjarmason
  2018-05-16 13:02       ` Derrick Stolee
  0 siblings, 1 reply; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-05-16 10:33 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git, Jeff King, Duy Nguyen


On Wed, May 16 2018, Lars Schneider wrote:

>> On 16 May 2018, at 11:29, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>>
>>
>> On Wed, May 16 2018, Lars Schneider wrote:
>>
>>> I am looking into different options to cache Git repositories on build
>>> machines. The two most promising ways seem to be git-worktree [1] and
>>> git-alternates [2].
>>>
>>> I wonder if you see an advantage of one over the other?
>>>
>>> My impression is that git-worktree supersedes git-alternates. Would
>>> that be a fair statement? If yes, would it make sense to deprecate
>>> alternates for simplification?
>>>
>>> [1] https://git-scm.com/docs/git-worktree
>>> [2] https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates
>>
>> It's not correct that worktrees supersede alternates, or the other way
>> around, they're orthogonal features.
>>
>> git-worktree allows you to create a new working directory connected to
>> the same local object store.
>>
>> Alternates allow you to declare in any given local object store, that
>> your set of objects isn't complete, and you can find the rest at some
>> other location, those object stores may or may not have more than one
>> worktree connected to them.
>
> OK. I just wonder in what situation I would work with an incomplete
> object store. The only use case I could imagine is that two repos share
> a common set of objects (most likely blobs). However, in that situation
> I would keep the two independent lines of development in a single repo
> with two root commits.
>
> Would it be fair to say that "git alternates" are a good mechanism to
> cache objects across different repos? However, I would consider a cache
> hit  between different repos unlikely. In that line of thinking
> "git worktree" would be a good (maybe better?) mechanism to cache objects
> for a single repo?

The use case is cloning with e.g. --shared or --reference.

Consider the following scenario:

 * You have 100 developers with *nix accounts on a single machine.

 * These 100 all need access to the same repo, but .git/objects is 1G

 * This would then naïvely require 100G of space + working tree. If the
   machine has 92G of RAM you'll be swapping the fscache in & out and
   performance will be horrible.

Instead, you have a single repository maintained on the system designed
to have all the alternates point to it, cloned as:

    git clone --reference /usr/share/git_tree/bigrepo ssh://....bigrepo.git ~/bigrepo

Now you're using just a bit over 1GB of space in total, and any new
objects the devs create will be written to their local .git dir. Since
you're spending 1GB for those 100 repos instead of 100GB, the data is
always in the FS cache.

And here's where this isn't at all like "worktree": each of those 100
will have their own "master" branch, and they can all create 100
different branches called "topic" that can be different.

With worktree the references are shared across all worktrees of the same
repository, so it's designed for one dev working on different topic
branches in different checkouts.
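
The difference in ref handling can be tried out with a small sketch. The
repository names (upstream, dev1, dev2, dev1-extra) are placeholders I
made up for illustration. Two separate clones each get their own "topic"
branch, whereas a second worktree of dev1 sees dev1's "topic" and is not
allowed to check it out while dev1 has it checked out:

```shell
set -e
# Identity for the demo commit only (assumed values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"

git init -q upstream
git -C upstream commit -q --allow-empty -m "initial commit"

# Two clones: each may create and check out its own "topic" branch.
git clone -q upstream dev1
git clone -q upstream dev2
git -C dev1 checkout -q -b topic
git -C dev2 checkout -q -b topic

# A second worktree of dev1 shares dev1's refs, so "topic" can only be
# checked out in one of the two working directories at a time.
git -C dev1 worktree add --detach ../dev1-extra >/dev/null 2>&1
if git -C dev1-extra checkout topic >/dev/null 2>&1; then
    second_checkout=allowed
else
    second_checkout=refused
fi
echo "checking out topic in a second worktree: $second_checkout"
```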

The --reference feature is also commonly used in CI-like
environments. Imagine the above example, except that instead of 100 devs
you have CI jobs on the same machine being spun up all the time. Here
there is some overlap, though: if you're OK with the main branch name
being different, you can also do this with worktrees instead of
alternates.
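
The --reference clone described earlier in this message can be tried
without a server. This is only a sketch: upstream, shared.git and dev1
are stand-in names, and a file:// URL plays the role of the ssh:// remote
so that the normal transport is used and the sharing comes from
--reference alone:

```shell
set -e
# Identity for the demo commit only (assumed values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"

# Stand-in for the remote repository.
git init -q upstream
echo "hello" > upstream/file.txt
git -C upstream add file.txt
git -C upstream commit -q -m "initial commit"

# The shared copy that every developer clone borrows from.
git clone -q --bare upstream shared.git

# A developer clone: talks to upstream, borrows objects from shared.git.
git clone -q --reference "$demo/shared.git" "file://$demo/upstream" dev1

# The clone records where borrowed objects live ...
cat dev1/.git/objects/info/alternates
# ... and count-objects reports the alternate plus whatever is stored
# locally, which should be little or nothing at this point.
git -C dev1 count-objects -v
```

Objects created later by commits in dev1 are written to dev1's own
.git/objects, not to shared.git.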


* Re: worktrees vs. alternates
  2018-05-16  9:42   ` Robert P. J. Day
@ 2018-05-16 11:07     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-05-16 11:07 UTC (permalink / raw)
  To: Robert P. J. Day; +Cc: Lars Schneider, git, Jeff King, Duy Nguyen


On Wed, May 16 2018, Robert P. J. Day wrote:

> On Wed, 16 May 2018, Ævar Arnfjörð Bjarmason wrote:
>
>>
>> On Wed, May 16 2018, Lars Schneider wrote:
>>
>> > I am looking into different options to cache Git repositories on build
>> > machines. The two most promising ways seem to be git-worktree [1] and
>> > git-alternates [2].
>> >
>> > I wonder if you see an advantage of one over the other?
>> >
>> > My impression is that git-worktree supersedes git-alternates. Would
>> > that be a fair statement? If yes, would it make sense to deprecate
>> > alternates for simplification?
>> >
>> > [1] https://git-scm.com/docs/git-worktree
>> > [2] https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates
>>
>> It's not correct that worktrees supersede alternates, or the other
>> way around, they're orthogonal features.
>>
>> git-worktree allows you to create a new working directory connected
>> to the same local object store.
>>
>> Alternates allow you to declare in any given local object store,
>> that your set of objects isn't complete, and you can find the rest
>> at some other location, those object stores may or may not have more
>> than one worktree connected to them.
>
>   just to be clear here, there should be nothing about how alternates
> are set up for a repository that should affect the normal behaviour of
> working trees for that repository, correct? i never thought there was,
> i just thought i'd make absolutely sure.

That's correct. The worktree(s) are logically composed of the
index/cache, checked-out files, and the local reference store (and some
auxiliary things, like per-worktree refs like HEAD, and config...).

Whether you have one worktree or many, eventually git needs to look up
objects somewhere. The alternates mechanism is just one more way to
specify where to look, along with some special logic in pack-objects and
the like where we need to be aware of them for the purposes of
maintaining objects in the repository.


* Re: worktrees vs. alternates
  2018-05-16 10:33     ` Ævar Arnfjörð Bjarmason
@ 2018-05-16 13:02       ` Derrick Stolee
  2018-05-16 14:58         ` Konstantin Ryabitsev
  0 siblings, 1 reply; 34+ messages in thread
From: Derrick Stolee @ 2018-05-16 13:02 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Lars Schneider
  Cc: git, Jeff King, Duy Nguyen

On 5/16/2018 6:33 AM, Ævar Arnfjörð Bjarmason wrote:
[big snip]
>
> And here's where this isn't at all like "worktree", each of those 100
> will have their own "master" branch, and they can all create 100
> different branches called "topic" that can be different.

This is the biggest difference. You cannot have the same ref checked out
in multiple worktrees, as they both may edit that ref. The alternates
allow you to share data in a "read only" fashion. If you have one repo
that is the "base" repo that manages that objects dir, then that is
probably a good way to reduce the duplication. I'm not familiar with
what happens when a "child" repo does 'git gc' or 'git repack': will it
delete the local objects that it sees exist in the alternate?

GVFS uses alternates in this same way: we create a drive-wide "shared 
object cache" that GVFS manages. We put our prefetch packs filled with 
commits and trees in there, and any loose objects that are downloaded 
via the object virtualization are placed as loose objects in the 
alternate. We also store the multi-pack-index and commit-graph in that 
alternate. This means that the only objects in each src dir are those 
created by the developer doing their normal work.

Thanks,
-Stolee



* Re: worktrees vs. alternates
  2018-05-16 13:02       ` Derrick Stolee
@ 2018-05-16 14:58         ` Konstantin Ryabitsev
  2018-05-16 15:34           ` Ævar Arnfjörð Bjarmason
                             ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Konstantin Ryabitsev @ 2018-05-16 14:58 UTC (permalink / raw)
  To: Derrick Stolee, Ævar Arnfjörð Bjarmason, Lars Schneider
  Cc: git, Jeff King, Duy Nguyen

On 05/16/18 09:02, Derrick Stolee wrote:
> This is the biggest difference. You cannot have the same ref checked out
> in multiple worktrees, as they both may edit that ref. The alternates
> allow you to share data in a "read only" fashion. If you have one repo
> that is the "base" repo that manages that objects dir, then that is
> probably a good way to reduce the duplication. I'm not familiar with
> what happens when a "child" repo does 'git gc' or 'git repack', will it
> delete the local objects that it sees exist in the alternate?

The parent repo is not keeping track of any other repositories that may
be using it for alternates, which is why you basically:

1. never run auto-gc in the parent repo
2. repack it manually using -Ad to keep loose objects that other repos
may be borrowing (but we don't know if they are)
3. never prune the parent repo, because this may delete objects other
repos are borrowing

Very infrequently you may consider this extra set of maintenance steps:

1. Find every repo mentioning the parent repository in their alternates
2. Repack them without the -l switch (which copies all the borrowed
objects into those repos)
3. Once all child repos have been repacked this way, prune the parent
repo (it's safe now)
4. Repack child repos again, this time with the -l flag, to get your
savings back.
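
As a rough sketch of steps 2 to 4 for a single parent and a single
child (step 1, finding every borrowing repo, is left out because it
depends on the hosting layout). The repository names parent and child
are made up, and this ignores the question of what happens if anything
writes to the repos while it runs:

```shell
set -e
# Identity for the demo commit only (assumed values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"

# Parent repository with one commit, and a child that borrows from it.
git init -q parent
echo "shared content" > parent/file.txt
git -C parent add file.txt
git -C parent commit -q -m "initial commit"
git clone -q --shared parent child

# Step 2: repack the child without -l, copying borrowed objects into it.
git -C child repack -a -d -q
copied=$(git -C child count-objects -v | sed -n 's/^in-pack: //p')
echo "objects now stored in the child: $copied"

# Step 3: only now is it safe to prune the parent.
git -C parent prune

# Step 4: repack with -l so the child stops storing what the parent has.
git -C child repack -a -d -l -q
git -C child count-objects -v
```

With many children, step 2 is the expensive part, since every child
temporarily holds a full copy of what it borrows.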

I would heartily love a way to teach git-repack to recognize when an
object it's borrowing from the parent repo is in danger of being pruned.
The cheapest way of doing this would probably be to hardlink loose
objects into its own objects directory and only consider "safe" objects
those that are part of the parent repository's pack. This should make
alternates a lot safer, just in case git-prune happens to run by accident.

> GVFS uses alternates in this same way: we create a drive-wide "shared
> object cache" that GVFS manages. We put our prefetch packs filled with
> commits and trees in there, and any loose objects that are downloaded
> via the object virtualization are placed as loose objects in the
> alternate. We also store the multi-pack-index and commit-graph in that
> alternate. This means that the only objects in each src dir are those
> created by the developer doing their normal work.

I'm very interested in GVFS, because it would certainly make my life
easier maintaining source.codeaurora.org, which is many thousands of
repos that are mostly forks of the same stuff. However, GVFS appears to
only exist for Windows (hint-hint, nudge-nudge). :)

Best,
-- 
Konstantin Ryabitsev
Director, IT Infrastructure Security
The Linux Foundation



* Re: worktrees vs. alternates
  2018-05-16 14:58         ` Konstantin Ryabitsev
@ 2018-05-16 15:34           ` Ævar Arnfjörð Bjarmason
  2018-05-16 15:49             ` Konstantin Ryabitsev
  2018-05-16 17:14           ` Martin Fick
  2018-05-16 19:14           ` Jeff King
  2 siblings, 1 reply; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-05-16 15:34 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen


On Wed, May 16 2018, Konstantin Ryabitsev wrote:

> On 05/16/18 09:02, Derrick Stolee wrote:
>> This is the biggest difference. You cannot have the same ref checked out
>> in multiple worktrees, as they both may edit that ref. The alternates
>> allow you to share data in a "read only" fashion. If you have one repo
>> that is the "base" repo that manages that objects dir, then that is
>> probably a good way to reduce the duplication. I'm not familiar with
>> what happens when a "child" repo does 'git gc' or 'git repack', will it
>> delete the local objects that it sees exist in the alternate?
>
> The parent repo is not keeping track of any other repositories that may
> be using it for alternates, which is why you basically:
>
> 1. never run auto-gc in the parent repo
> 2. repack it manually using -Ad to keep loose objects that other repos
> may be borrowing (but we don't know if they are)
> 3. never prune the parent repo, because this may delete objects other
> repos are borrowing
>
> Very infrequently you may consider this extra set of maintenance steps:
>
> 1. Find every repo mentioning the parent repository in their alternates
> 2. Repack them without the -l switch (which copies all the borrowed
> objects into those repos)
> 3. Once all child repos have been repacked this way, prune the parent
> repo (it's safe now)
> 4. Repack child repos again, this time with the -l flag, to get your
> savings back.
>
> I would heartily love a way to teach git-repack to recognize when an
> object it's borrowing from the parent repo is in danger of being pruned.
> The cheapest way of doing this would probably be to hardlink loose
> objects into its own objects directory and only consider "safe" objects
> those that are part of the parent repository's pack. This should make
> alternates a lot safer, just in case git-prune happens to run by accident.

I may have missed some edge case, but I believe this entire workaround
isn't needed if you guarantee that the parent repo doesn't contain any
objects that will get un-referenced.

You'd do that in the common case by cloning with --single-branch, and
depending on your setup --no-tags (if you delete tags). This is assuming
that your HEAD branch points to something like a "master" that doesn't
get rewound.
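
A small sketch of what such a parent looks like, using a made-up local
upstream with a stable default branch, a "pu" branch and one tag. The
names upstream and parent.git are placeholders, and --no-tags on clone
needs a reasonably recent git (2.14 or later, as far as I know):

```shell
set -e
# Identity for the demo commits only (assumed values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"

# Upstream with a stable default branch, a tag, and a rewindable "pu".
git init -q upstream
git -C upstream commit -q --allow-empty -m "stable history"
git -C upstream tag v1.0
git -C upstream checkout -q -b pu
git -C upstream commit -q --allow-empty -m "experimental work"
pu_commit=$(git -C upstream rev-parse HEAD)
git -C upstream checkout -q -

# Parent for alternates: only the branch HEAD points at, and no tags.
git clone -q --bare --single-branch --no-tags "file://$demo/upstream" parent.git

# Only one ref should be listed, and the "pu" commit is not present.
git -C parent.git for-each-ref --format='%(refname)'
```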

The problem you're describing happens if say you clone git.git and have
the "pu" branch in there in the parent, and as a result you get child
repos referencing those objects, but when the parent GCs after "pu" is
rewound the child repos break. Thus your elaborate work-around.

But that situation isn't possible in the first place if you only ever
import the "master" branch, or other references guaranteed not to
change.

Of course that has the trade-off that every child repo needs to get its
own objects for the "next" branch, "pu", etc. But those are
comparatively tiny.

I wasn't aware of -l (--local), or had forgotten about it. I thought
that we didn't have that and the "child" repos would just keep growing
over time, i.e. not get rid of the objects we're fetching into the
parent (which the parent might get later due to the child, say if it's
fetched in a daily cronjob). Good to know that's not the case.

With that --local flag the trade-off of not fetching "next" and "pu"
etc. should become irrelevant over time, as they migrate to "master"
they'll get de-duplicated, or alternatively GC'd by the child repos if
they don't make it.

>> GVFS uses alternates in this same way: we create a drive-wide "shared
>> object cache" that GVFS manages. We put our prefetch packs filled with
>> commits and trees in there, and any loose objects that are downloaded
>> via the object virtualization are placed as loose objects in the
>> alternate. We also store the multi-pack-index and commit-graph in that
>> alternate. This means that the only objects in each src dir are those
>> created by the developer doing their normal work.
>
> I'm very interested in GVFS, because it would certainly make my life
> easier maintaining source.codeaurora.org, which is many thousands of
> repos that are mostly forks of the same stuff. However, GVFS appears to
> only exist for Windows (hint-hint, nudge-nudge). :)

This should make you happy:

https://arstechnica.com/gadgets/2017/11/microsoft-and-github-team-up-to-take-git-virtual-file-system-to-macos-linux/

But I don't know what the current status is or where it can be followed.


* Re: worktrees vs. alternates
  2018-05-16 15:34           ` Ævar Arnfjörð Bjarmason
@ 2018-05-16 15:49             ` Konstantin Ryabitsev
  2018-05-16 17:54               ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 34+ messages in thread
From: Konstantin Ryabitsev @ 2018-05-16 15:49 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen

On Wed, May 16, 2018 at 05:34:34PM +0200, Ævar Arnfjörð Bjarmason wrote:
>I may have missed some edge case, but I believe this entire workaround
>isn't needed if you guarantee that the parent repo doesn't contain any
>objects that will get un-referenced.

You can't guarantee that, because the parent repo can have its history
rewritten either via a forced push, or via a rebase. Obviously, this
won't happen in something like torvalds/linux.git, which is why it's
pretty safe to alternate off of that repo for us, but codeaurora.org
repos aren't always strictly-ff (e.g. because they may rebase themselves
based on what is in upstream AOSP repos) -- so objects in them may
become unreferenced and pruned away, corrupting any repos using them for
alternates.

>> I'm very interested in GVFS, because it would certainly make my life
>> easier maintaining source.codeaurora.org, which is many thousands of
>> repos that are mostly forks of the same stuff. However, GVFS appears to
>> only exist for Windows (hint-hint, nudge-nudge). :)
>
>This should make you happy:
>
>https://arstechnica.com/gadgets/2017/11/microsoft-and-github-team-up-to-take-git-virtual-file-system-to-macos-linux/
>
>But I don't know what the current status is or where it can be followed.

Very good to know, thanks!

-K


* Re: worktrees vs. alternates
  2018-05-16 14:58         ` Konstantin Ryabitsev
  2018-05-16 15:34           ` Ævar Arnfjörð Bjarmason
@ 2018-05-16 17:14           ` Martin Fick
  2018-05-16 17:41             ` Konstantin Ryabitsev
  2018-05-16 19:14           ` Jeff King
  2 siblings, 1 reply; 34+ messages in thread
From: Martin Fick @ 2018-05-16 17:14 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Lars Schneider, git, Jeff King, Duy Nguyen

On Wednesday, May 16, 2018 10:58:19 AM Konstantin Ryabitsev 
wrote:
> 
> 1. Find every repo mentioning the parent repository in
> their alternates
> 2. Repack them without the -l switch (which copies all the
> borrowed objects into those repos)
> 3. Once all child repos have been repacked this way, prune
> the parent repo (it's safe now)

This is probably only true if the repos are in read-only 
mode?  I suspect this is still racy on a busy server with no 
downtime.

> 4. Repack child repos again, this time with the -l flag,
> to get your savings back.
 
> I would heartily love a way to teach git-repack to
> recognize when an object it's borrowing from the parent
> repo is in danger of being pruned. The cheapest way of
> doing this would probably be to hardlink loose objects
> into its own objects directory and only consider "safe"
> objects those that are part of the parent repository's
> pack. This should make alternates a lot safer, just in
> case git-prune happens to run by accident.

I think that hard linking is generally a good approach to 
solving many of the "pruning" races left in git.

I have uploaded a "hard linking" proposal to jgit that could 
potentially solve a similar situation that is not alternate 
specific, and only for packfiles, with the intent of 
eventually also doing something similar for loose 
objects.  You can see this here: 

https://git.eclipse.org/r/c/122288/2

I think it would be good to fill in more of these pruning 
gaps!

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


* Re: worktrees vs. alternates
  2018-05-16 17:14           ` Martin Fick
@ 2018-05-16 17:41             ` Konstantin Ryabitsev
  2018-05-16 18:02               ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 34+ messages in thread
From: Konstantin Ryabitsev @ 2018-05-16 17:41 UTC (permalink / raw)
  To: Martin Fick
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Lars Schneider, git, Jeff King, Duy Nguyen

On 05/16/18 13:14, Martin Fick wrote:
> On Wednesday, May 16, 2018 10:58:19 AM Konstantin Ryabitsev 
> wrote:
>>
>> 1. Find every repo mentioning the parent repository in
>> their alternates
>> 2. Repack them without the -l switch (which copies all the
>> borrowed objects into those repos)
>> 3. Once all child repos have been repacked this way, prune
>> the parent repo (it's safe now)
> 
> This is probably only true if the repos are in read-only 
> mode?  I suspect this is still racy on a busy server with no 
> downtime.

We don't actually do this anywhere. :) It's a feature I keep hoping to
add one day to grokmirror, but keep putting off because of various
considerations. As you can imagine, if we have 300 forks of linux.git
all using torvalds/linux.git as their alternates, then repacking them
all without -l would balloon our disk usage 300-fold. At this time it's
just cheaper to keep a bunch of loose objects around forever at the cost
of decreased performance.

Maybe git-repack can be told to only borrow parent objects if they are
in packs. Anything not in packs should be hardlinked into the child
repo. That's my wishful think for the day. :)

Best,
-- 
Konstantin Ryabitsev
Director, IT Infrastructure Security
The Linux Foundation



* Re: worktrees vs. alternates
  2018-05-16 15:49             ` Konstantin Ryabitsev
@ 2018-05-16 17:54               ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-05-16 17:54 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen


On Wed, May 16 2018, Konstantin Ryabitsev wrote:

> On Wed, May 16, 2018 at 05:34:34PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>I may have missed some edge case, but I believe this entire workaround
>>isn't needed if you guarantee that the parent repo doesn't contain any
>>objects that will get un-referenced.
>
> You can't guarantee that, because the parent repo can have its history
> rewritten either via a forced push, or via a rebase. Obviously, this
> won't happen in something like torvalds/linux.git, which is why it's
> pretty safe to alternate off of that repo for us, but codeaurora.org
> repos aren't always strictly-ff (e.g. because they may rebase themselves
> based on what is in upstream AOSP repos) -- so objects in them may
> become unreferenced and pruned away, corrupting any repos using them for
> alternates.

Right, it wouldn't work in the general case. I was thinking of the
use-case for doing this (say with known big monorepos) where you know a
given branch won't be unwound.

Still, there's a tiny variation on this that should work with arbitrary
repos whose master may be rewound: you just set up a refspec to fetch
their upstream HEAD into master-1 without having "+" in the
refspec. Then if they never rewind you keep fetching to master-1
forever.

If they do rewind you fetch that to master-2 and so forth, so you can
follow an upstream rewinding branch while still guaranteeing that no
objects ever disappear from your parent repo. This is still a lot
simpler than the juggling approach you noted, since it's just a tiny
shellscript around the "fetch".

This assumes that:

  1. Whenever this happens the history is still similar enough that the
     parent won't balloon in size, or at least it won't be worse than
     not using alternates at all.

  2. You're getting most of the gains of the object sharing by just
     grabbing the upstream HEAD branch, i.e. you don't have some repo
     with N huge, unrelated histories.
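
The "tiny shellscript around the fetch" could look roughly like this.
It is a sketch under assumptions: the function name fetch_keep_history
and the repository names are invented here, the parent is bare, and any
failure of the unforced fetch is treated as a rewind (a real script
would want to tell a rejected non-fast-forward apart from, say, a
network error):

```shell
set -e
# Identity for the demo commits only (assumed values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# fetch_keep_history REPO URL   (hypothetical helper)
# Fetch URL's HEAD into refs/heads/master-N in REPO without "+". If the
# fetch is refused, move on to master-(N+1) so old history stays referenced.
fetch_keep_history() {
    repo=$1 url=$2 n=1
    # Find the highest master-N that already exists.
    while git -C "$repo" show-ref --verify -q "refs/heads/master-$((n + 1))"; do
        n=$((n + 1))
    done
    if ! git -C "$repo" fetch -q "$url" "HEAD:refs/heads/master-$n" 2>/dev/null; then
        n=$((n + 1))
        git -C "$repo" fetch -q "$url" "HEAD:refs/heads/master-$n"
    fi
    echo "master-$n"
}

demo=$(mktemp -d)
cd "$demo"
git init -q upstream
git -C upstream commit -q --allow-empty -m "one"
git -C upstream commit -q --allow-empty -m "two"
git init -q --bare parent.git

first=$(fetch_keep_history parent.git "$demo/upstream")

# Simulate an upstream rewind: drop "two" and replace it.
git -C upstream reset -q --hard HEAD~1
git -C upstream commit -q --allow-empty -m "rewritten two"

second=$(fetch_keep_history parent.git "$demo/upstream")
echo "first fetch went to $first, after the rewind to $second"
```

Both master-1 and master-2 stay in the parent afterwards, so nothing a
child borrowed from the pre-rewind history becomes unreferenced.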


* Re: worktrees vs. alternates
  2018-05-16 17:41             ` Konstantin Ryabitsev
@ 2018-05-16 18:02               ` Ævar Arnfjörð Bjarmason
  2018-05-16 18:12                 ` Konstantin Ryabitsev
  0 siblings, 1 reply; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-05-16 18:02 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Martin Fick, Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen


On Wed, May 16 2018, Konstantin Ryabitsev wrote:

> Maybe git-repack can be told to only borrow parent objects if they are
> in packs. Anything not in packs should be hardlinked into the child
> repo. That's my wishful think for the day. :)

Can you elaborate on how this would help?

We're just going to create loose objects on interactive "git commit",
presumably you're not adding someone's working copy as the alternate.

Otherwise, if it's just being pushed to, all those pushes are going to be
in packs, and the packs may contain e.g. pushes for the "pu" branch or
whatever, which are objects that'll go away.


* Re: worktrees vs. alternates
  2018-05-16 18:02               ` Ævar Arnfjörð Bjarmason
@ 2018-05-16 18:12                 ` Konstantin Ryabitsev
  2018-05-16 18:26                   ` Martin Fick
  0 siblings, 1 reply; 34+ messages in thread
From: Konstantin Ryabitsev @ 2018-05-16 18:12 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Martin Fick, Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen

[-- Attachment #1.1: Type: text/plain, Size: 1535 bytes --]

On 05/16/18 14:02, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, May 16 2018, Konstantin Ryabitsev wrote:
> 
>> Maybe git-repack can be told to only borrow parent objects if they are
>> in packs. Anything not in packs should be hardlinked into the child
>> repo. That's my wishful think for the day. :)
> 
> Can you elaborate on how this would help?
> 
> We're just going to create loose objects on interactive "git commit",
> presumably you're not adding someone's working copy as the alternate.

The loose objects I'm thinking of are those that are generated when we
do "git repack -Ad" -- this takes all unreachable objects and loosens
them (see man git-repack for more info). Normally, these would be pruned
after a certain period, but we're deliberately keeping them around
forever just in case another repo relies on them via alternates. I want
those repos to "claim" these loose objects via hardlinks, such that we
can run git-prune on the mother repo instead of dragging all the
unreachable objects on forever just in case.

> Otherwise if it's just being pushed to all those pushes are going to be
> in packs, and the packs may contain e.g. pushes for the "pu" branch or
> whatever, which are objects that'll go away.

There are lots of cases where unreachable objects in one repo would
never become unreachable in another -- for example, if the author had
stopped updating it.

Hope this helps.

Best,
-- 
Konstantin Ryabitsev
Director, IT Infrastructure Security
The Linux Foundation


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: worktrees vs. alternates
  2018-05-16 18:12                 ` Konstantin Ryabitsev
@ 2018-05-16 18:26                   ` Martin Fick
  2018-05-16 19:01                     ` Konstantin Ryabitsev
  0 siblings, 1 reply; 34+ messages in thread
From: Martin Fick @ 2018-05-16 18:26 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Lars Schneider, git, Jeff King, Duy Nguyen

On Wednesday, May 16, 2018 02:12:24 PM Konstantin Ryabitsev 
wrote:
> The loose objects I'm thinking of are those that are
> generated when we do "git repack -Ad" -- this takes all
> unreachable objects and loosens them (see man git-repack
> for more info). Normally, these would be pruned after a
> certain period, but we're deliberately keeping them
> around forever just in case another repo relies on them
> via alternates. I want those repos to "claim" these loose
> objects via hardlinks, such that we can run git-prune on
> the mother repo instead of dragging all the unreachable
> objects on forever just in case.

If you are going to keep the unreferenced objects around 
forever, it might be better to keep them around in packed 
form?  We currently do that because we don't think there is 
a safe way to prune objects yet on a running server (which 
is why I am teaching jgit to be able to recover from a racy 
pruning error),

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: worktrees vs. alternates
  2018-05-16 18:26                   ` Martin Fick
@ 2018-05-16 19:01                     ` Konstantin Ryabitsev
  2018-05-16 19:03                       ` Martin Fick
  2018-05-16 19:23                       ` Jeff King
  0 siblings, 2 replies; 34+ messages in thread
From: Konstantin Ryabitsev @ 2018-05-16 19:01 UTC (permalink / raw)
  To: Martin Fick
  Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Lars Schneider, git, Jeff King, Duy Nguyen

[-- Attachment #1.1: Type: text/plain, Size: 644 bytes --]

On 05/16/18 14:26, Martin Fick wrote:
> If you are going to keep the unreferenced objects around 
> forever, it might be better to keep them around in packed 
> form?

I'm undecided about that. On the one hand this does create lots of small
files and inevitably causes (some) performance degradation. On the other
hand, I don't want to keep useless objects in the pack, because that
would also cause performance degradation for people cloning the "mother
repo." If my assumptions on any of that are incorrect, I'm happy to
learn more.

Best,
-- 
Konstantin Ryabitsev
Director, IT Infrastructure Security
The Linux Foundation


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: worktrees vs. alternates
  2018-05-16 19:01                     ` Konstantin Ryabitsev
@ 2018-05-16 19:03                       ` Martin Fick
  2018-05-16 19:11                         ` Konstantin Ryabitsev
  2018-05-16 19:23                       ` Jeff King
  1 sibling, 1 reply; 34+ messages in thread
From: Martin Fick @ 2018-05-16 19:03 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Lars Schneider, git, Jeff King, Duy Nguyen

On Wednesday, May 16, 2018 03:01:13 PM Konstantin Ryabitsev 
wrote:
> On 05/16/18 14:26, Martin Fick wrote:
> > If you are going to keep the unreferenced objects around
> > forever, it might be better to keep them around in
> > packed
> > form?
> 
> I'm undecided about that. On the one hand this does create
> lots of small files and inevitably causes (some)
> performance degradation. On the other hand, I don't want
> to keep useless objects in the pack, because that would
> also cause performance degradation for people cloning the
> "mother repo." If my assumptions on any of that are
> incorrect, I'm happy to learn more.

My suggestion is to use science, not logic or hearsay. :) 
i.e. test it!

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: worktrees vs. alternates
  2018-05-16 19:03                       ` Martin Fick
@ 2018-05-16 19:11                         ` Konstantin Ryabitsev
  2018-05-16 19:18                           ` Martin Fick
  0 siblings, 1 reply; 34+ messages in thread
From: Konstantin Ryabitsev @ 2018-05-16 19:11 UTC (permalink / raw)
  To: Martin Fick
  Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Lars Schneider, git, Jeff King, Duy Nguyen

[-- Attachment #1.1: Type: text/plain, Size: 1465 bytes --]

On 05/16/18 15:03, Martin Fick wrote:
>> I'm undecided about that. On the one hand this does create
>> lots of small files and inevitably causes (some)
>> performance degradation. On the other hand, I don't want
>> to keep useless objects in the pack, because that would
>> also cause performance degradation for people cloning the
>> "mother repo." If my assumptions on any of that are
>> incorrect, I'm happy to learn more.
> My suggestion is to use science, not logic or hearsay. :) 
> i.e. test it!

I think the answer will be "it depends." In many of our cases the repos
that need those loose objects are rarely accessed -- usually because
they are forks with older data (hence why they need objects that are no
longer used by the mother repo). Therefore, performance impacts of
occasionally touching a handful of loose objects will be fairly
negligible. This is especially true on non-spinning media where seek
times are low anyway. Having slimmer packs for the mother repo would be
more beneficial in this case.

On the other hand, if the "child repo" is frequently used, then the
impact of needing a bunch of loose objects would be greater. For the
sake of simplicity, I think I'll leave things as they are -- it's
cheaper to fix this via reducing seek times than by applying complicated
logic trying to optimize on a per-repo basis.

Best,
-- 
Konstantin Ryabitsev
Director, IT Infrastructure Security
The Linux Foundation


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: worktrees vs. alternates
  2018-05-16 14:58         ` Konstantin Ryabitsev
  2018-05-16 15:34           ` Ævar Arnfjörð Bjarmason
  2018-05-16 17:14           ` Martin Fick
@ 2018-05-16 19:14           ` Jeff King
  2018-05-16 21:18             ` Stefan Beller
  2 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2018-05-16 19:14 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Lars Schneider, git, Duy Nguyen

On Wed, May 16, 2018 at 10:58:19AM -0400, Konstantin Ryabitsev wrote:

> The parent repo is not keeping track of any other repositories that may
> be using it for alternates, which is why you basically:
> 
> 1. never run auto-gc in the parent repo
> 2. repack it manually using -Ad to keep loose objects that other repos
> may be borrowing (but we don't know if they are)
> 3. never prune the parent repo, because this may delete objects other
> repos are borrowing
> 
> Very infrequently you may consider this extra set of maintenance steps:
> 
> 1. Find every repo mentioning the parent repository in their alternates
> 2. Repack them without the -l switch (which copies all the borrowed
> objects into those repos)
> 3. Once all child repos have been repacked this way, prune the parent
> repo (it's safe now)
> 4. Repack child repos again, this time with the -l flag, to get your
> savings back.

You can also do periodic maintenance like:

  1. Copy each ref in the forked repositories into the parent repository
     (e.g., giving each child that borrows from the parent its own
     hierarchy in refs/remotes/<child>/*).

  2. Repack the parent as normal. It will retain any objects referenced
     by the children (because they are now referenced by it).

But note that:

  1. It's not atomic with respect to updates in the child repos (but
     then, neither is the single-repo case!).

  2. It doesn't know about reflogs or the index in the child
     repositories.

This is more or less how we use alternates at GitHub.
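
A rough sketch of the two maintenance steps above, not GitHub's actual
tooling. The function name sync_children_into_parent is hypothetical,
and the per-child namespace is derived from the directory name, which
is an assumption. All paths must be absolute because "git -C" changes
directory before the fetch runs.

```shell
# sync_children_into_parent <parent.git> <child>...
sync_children_into_parent() {
    parent=$1
    shift

    for child in "$@"; do
        # Use the child's directory name (minus ".git") as its namespace.
        id=$(basename "$child" .git)

        # Step 1: copy every ref of the child into
        # refs/remotes/<child>/* in the parent. --prune drops copies of
        # refs the child has since deleted.
        git -C "$parent" fetch -q --prune "$child" "+refs/*:refs/remotes/$id/*"
    done

    # Step 2: repack the parent as normal. Objects the children use are
    # now reachable from the parent's own refs, so they are retained.
    git -C "$parent" repack -q -a -d
}
```

The caveats listed above still apply: the copy is not atomic, and
reflogs and the index of each child are not taken into account.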

> I would heartily love a way to teach git-repack to recognize when an
> object it's borrowing from the parent repo is in danger of being pruned.
> The cheapest way of doing this would probably be to hardlink loose
> objects into its own objects directory and only consider "safe" objects
> those that are part of the parent repository's pack. This should make
> alternates a lot safer, just in case git-prune happens to run by accident.

If you set:

  git config core.repositoryformatversion 1
  git config extensions.preciousObjects true

in the parent, git-prune (and "repack -d") will refuse to run. That doesn't
solve the problem of how to repack, but it can help prevent accidental
misuse.
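
As a small sketch, the two settings can be wrapped in a helper that is
applied to every repository used as an alternate. The function name
mark_precious is made up for this example.

```shell
# mark_precious <repo>
# Mark a repository's objects as precious so that an accidental
# "git prune" in it fails instead of deleting borrowed objects.
mark_precious() {
    repo=$1
    # Extensions are only guaranteed to be honoured with format
    # version 1, so set the version first.
    git -C "$repo" config core.repositoryformatversion 1
    git -C "$repo" config extensions.preciousObjects true
}
```

After "mark_precious /path/to/parent.git", running "git prune" inside
the parent exits with an error rather than removing anything.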

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: worktrees vs. alternates
  2018-05-16 19:11                         ` Konstantin Ryabitsev
@ 2018-05-16 19:18                           ` Martin Fick
  0 siblings, 0 replies; 34+ messages in thread
From: Martin Fick @ 2018-05-16 19:18 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Lars Schneider, git, Jeff King, Duy Nguyen

On Wednesday, May 16, 2018 03:11:47 PM Konstantin Ryabitsev 
wrote:
> On 05/16/18 15:03, Martin Fick wrote:
> >> I'm undecided about that. On the one hand this does
> >> create lots of small files and inevitably causes
> >> (some) performance degradation. On the other hand, I
> >> don't want to keep useless objects in the pack,
> >> because that would also cause performance degradation
> >> for people cloning the "mother repo." If my
> >> assumptions on any of that are incorrect, I'm happy to
> >> learn more.
> > 
> > My suggestion is to use science, not logic or hearsay.
> > :)
> > i.e. test it!
> 
> I think the answer will be "it depends." In many of our
> cases the repos that need those loose objects are rarely
> accessed -- usually because they are forks with older
> data (hence why they need objects that are no longer used
> by the mother repo). Therefore, performance impacts of
> occasionally touching a handful of loose objects will be
> fairly negligible. This is especially true on
> non-spinning media where seek times are low anyway.
> Having slimmer packs for the mother repo would be more
> beneficial in this case.
> 
> On the other hand, if the "child repo" is frequently used,
> then the impact of needing a bunch of loose objects would
> be greater. For the sake of simplicity, I think I'll
> leave things as they are -- it's cheaper to fix this via
> reducing seek times than by applying complicated logic
> trying to optimize on a per-repo basis.

I think a major performance issue with loose objects is not 
just the seek time, but also the fact that they are not 
delta compressed.  This means that sending them over the
wire will likely incur a significant cost before they are sent.
Unlike the seek time, this cost is not mitigated across 
concurrent fetches by the FS (or jgit if you were to use it) 
caching,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: worktrees vs. alternates
  2018-05-16 19:01                     ` Konstantin Ryabitsev
  2018-05-16 19:03                       ` Martin Fick
@ 2018-05-16 19:23                       ` Jeff King
  2018-05-16 19:29                         ` Konstantin Ryabitsev
  1 sibling, 1 reply; 34+ messages in thread
From: Jeff King @ 2018-05-16 19:23 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Martin Fick, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Lars Schneider, git, Duy Nguyen

On Wed, May 16, 2018 at 03:01:13PM -0400, Konstantin Ryabitsev wrote:

> On 05/16/18 14:26, Martin Fick wrote:
> > If you are going to keep the unreferenced objects around 
> > forever, it might be better to keep them around in packed 
> > form?
> 
> I'm undecided about that. On the one hand this does create lots of small
> files and inevitably causes (some) performance degradation. On the other
> hand, I don't want to keep useless objects in the pack, because that
> would also cause performance degradation for people cloning the "mother
> repo." If my assumptions on any of that are incorrect, I'm happy to
> learn more.

I implemented "repack -k", which keeps all objects and just rolls them
into the new pack (along with any currently-loose unreachable objects).
Aside from corner cases (e.g., where somebody accidentally added a 20GB
file to an otherwise 100MB-repo and then rolled it back), it usually
doesn't significantly affect the repository size.

And it generally should not cause performance problems for people
cloning, since Git will create a custom pack for each client with only
the reachable objects.
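
For illustration, a minimal wrapper around the keep-everything repack
described above. The function name repack_keep_unreachable is an
assumption; the flags are the ones documented in git-repack(1), where
"-k" only has an effect together with "-a -d".

```shell
# repack_keep_unreachable <repo>
# Roll all objects, reachable or not, into a single new pack.
# -a  pack everything into one pack
# -d  remove the now-redundant old packs and loose copies
# -k  keep unreachable objects by appending them to the new pack
repack_keep_unreachable() {
    git -C "$1" repack -q -a -d -k
}
```

An object that no ref points at is still readable afterwards, e.g.
"git cat-file -e <sha1>" keeps succeeding for it.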

There _is_ an interesting corner case where a reachable object might be
a delta against an unreachable one, which can cause a clone to have to
break that relationship and find a new delta. At GitHub we have some
custom code that tries to avoid these kinds of delta dependencies (not
just to unreachable objects, but to other forks that share object
storage). You can see the patch at:

  https://github.com/peff/git jk/delta-islands

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: worktrees vs. alternates
  2018-05-16 19:23                       ` Jeff King
@ 2018-05-16 19:29                         ` Konstantin Ryabitsev
  2018-05-16 19:37                           ` Jeff King
  0 siblings, 1 reply; 34+ messages in thread
From: Konstantin Ryabitsev @ 2018-05-16 19:29 UTC (permalink / raw)
  To: Jeff King
  Cc: Martin Fick, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Lars Schneider, git, Duy Nguyen

[-- Attachment #1.1: Type: text/plain, Size: 872 bytes --]

On 05/16/18 15:23, Jeff King wrote:
> I implemented "repack -k", which keeps all objects and just rolls them
> into the new pack (along with any currently-loose unreachable objects).
> Aside from corner cases (e.g., where somebody accidentally added a 20GB
> file to an otherwise 100MB-repo and then rolled it back), it usually
> doesn't significantly affect the repository size.

Hmm... I should read manpages more often! :)

So, do you suggest that this is a better approach:

- mother repos: "git repack -adk"
- child repos: "git repack -Adl" (followed by prune)

Currently, we do "-Adl" regardless, but we already track whether a repo
is being used for alternates anywhere (so we don't prune it) and can do
different flags if that improves performance.

Best,
-- 
Konstantin Ryabitsev
Director, IT Infrastructure Security
The Linux Foundation


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: worktrees vs. alternates
  2018-05-16 19:29                         ` Konstantin Ryabitsev
@ 2018-05-16 19:37                           ` Jeff King
  2018-05-16 19:40                             ` Martin Fick
  2018-05-16 20:02                             ` Konstantin Ryabitsev
  0 siblings, 2 replies; 34+ messages in thread
From: Jeff King @ 2018-05-16 19:37 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Martin Fick, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Lars Schneider, git, Duy Nguyen

On Wed, May 16, 2018 at 03:29:42PM -0400, Konstantin Ryabitsev wrote:

> On 05/16/18 15:23, Jeff King wrote:
> > I implemented "repack -k", which keeps all objects and just rolls them
> > into the new pack (along with any currently-loose unreachable objects).
> > Aside from corner cases (e.g., where somebody accidentally added a 20GB
> > file to an otherwise 100MB-repo and then rolled it back), it usually
> > doesn't significantly affect the repository size.
> 
> Hmm... I should read manpages more often! :)
> 
> So, do you suggest that this is a better approach:
> 
> - mother repos: "git repack -adk"
> - child repos: "git repack -Adl" (followed by prune)

Yes, that's pretty close to what we do at GitHub. Before doing any
repacking in the mother repo, we actually do the equivalent of:

  git fetch --prune ../$id.git +refs/*:refs/remotes/$id/*
  git repack -Adl

from each child to pick up any new objects to de-duplicate (our "mother"
repos are not real repos at all, but just big shared-object stores).

I say "equivalent" because those commands can actually be a bit slow. So
we do some hacky tricks like directly moving objects in the filesystem.

In theory the fetch means that it's safe to actually prune in the mother
repo, but in practice there are still races. They don't come up often,
but if you have enough repositories, they do eventually. :)

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: worktrees vs. alternates
  2018-05-16 19:37                           ` Jeff King
@ 2018-05-16 19:40                             ` Martin Fick
  2018-05-16 20:06                               ` Jeff King
  2018-05-16 20:02                             ` Konstantin Ryabitsev
  1 sibling, 1 reply; 34+ messages in thread
From: Martin Fick @ 2018-05-16 19:40 UTC (permalink / raw)
  To: Jeff King
  Cc: Konstantin Ryabitsev, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Lars Schneider, git, Duy Nguyen

On Wednesday, May 16, 2018 12:37:45 PM Jeff King wrote:
> On Wed, May 16, 2018 at 03:29:42PM -0400, Konstantin Ryabitsev wrote:
> Yes, that's pretty close to what we do at GitHub. Before
> doing any repacking in the mother repo, we actually do
> the equivalent of:
> 
>   git fetch --prune ../$id.git +refs/*:refs/remotes/$id/*
>   git repack -Adl
> 
> from each child to pick up any new objects to de-duplicate
> (our "mother" repos are not real repos at all, but just
> big shared-object stores).
... 
> In theory the fetch means that it's safe to actually prune
> in the mother repo, but in practice there are still
> races. They don't come up often, but if you have enough
> repositories, they do eventually. :)

Peff,

I would be very curious to hear what you think of this 
approach to mitigating the effect of those races?

https://git.eclipse.org/r/c/122288/2

-Martin
-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: worktrees vs. alternates
  2018-05-16 19:37                           ` Jeff King
  2018-05-16 19:40                             ` Martin Fick
@ 2018-05-16 20:02                             ` Konstantin Ryabitsev
  2018-05-16 20:17                               ` Jeff King
  2018-05-17  0:43                               ` Sitaram Chamarty
  1 sibling, 2 replies; 34+ messages in thread
From: Konstantin Ryabitsev @ 2018-05-16 20:02 UTC (permalink / raw)
  To: Jeff King
  Cc: Martin Fick, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Lars Schneider, git, Duy Nguyen

[-- Attachment #1.1: Type: text/plain, Size: 1520 bytes --]

On 05/16/18 15:37, Jeff King wrote:
> Yes, that's pretty close to what we do at GitHub. Before doing any
> repacking in the mother repo, we actually do the equivalent of:
> 
>   git fetch --prune ../$id.git +refs/*:refs/remotes/$id/*
>   git repack -Adl
> 
> from each child to pick up any new objects to de-duplicate (our "mother"
> repos are not real repos at all, but just big shared-object stores).

Yes, I keep thinking of doing the same, too -- instead of using
torvalds/linux.git for alternates, have an internal repo where objects
from all forks are stored. This conversation may finally give me the
shove I've been needing to poke at this. :)

Is your delta-islands patch heading into upstream, or is that something
that's going to remain external?

> I say "equivalent" because those commands can actually be a bit slow. So
> we do some hacky tricks like directly moving objects in the filesystem.
> 
> In theory the fetch means that it's safe to actually prune in the mother
> repo, but in practice there are still races. They don't come up often,
> but if you have enough repositories, they do eventually. :)

I feel like a whitepaper on "how we deal with bajillions of forks at
GitHub" would be nice. :) I was previously told that it's unlikely such
a paper could be written due to so many custom-built things at GH, but I
would be very happy if that turned out not to be the case.

Best,
-- 
Konstantin Ryabitsev
Director, IT Infrastructure Security
The Linux Foundation


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: worktrees vs. alternates
  2018-05-16 19:40                             ` Martin Fick
@ 2018-05-16 20:06                               ` Jeff King
  2018-05-16 20:43                                 ` Martin Fick
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2018-05-16 20:06 UTC (permalink / raw)
  To: Martin Fick
  Cc: Konstantin Ryabitsev, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Lars Schneider, git, Duy Nguyen

On Wed, May 16, 2018 at 01:40:56PM -0600, Martin Fick wrote:

> > In theory the fetch means that it's safe to actually prune
> > in the mother repo, but in practice there are still
> > races. They don't come up often, but if you have enough
> > repositories, they do eventually. :)
> 
> Peff,
> 
> I would be very curious to hear what you think of this 
> approach to mitigating the effect of those races?
> 
> https://git.eclipse.org/r/c/122288/2

The crux of the problem is that we have no way to atomically mark an
object as "I am using this -- do not delete" with respect to the actual
deletion. 

So if I'm reading your approach correctly, you put objects into a
purgatory rather than delete them, and let some operations rescue them
from purgatory if we had a race.  That's certainly a direction we've
considered, but I think there are some open questions, like:

  1. When do you rescue from purgatory? Any time the object is
     referenced? Do you then pull in all of its reachable objects too?

  2. How do you decide when to drop an object from purgatory? And
     specifically, how do you avoid racing with somebody using the
     object as you're pruning purgatory?

  3. How do you know that an operation has been run that will actually
     rescue the object, as opposed to silently having a corrupted state
     on disk?

     E.g., imagine this sequence:

       a. git-prune computes reachability and finds that commit X is
          ready to be pruned

       b. another process sees that commit X exists and builds a commit
          that references it as a parent

       c. git-prune drops the object into purgatory

     Now we have a corrupt state created by the process in (b), since we
     have a reachable object in purgatory. But what if nobody goes back
     and tries to read those commits in the meantime?

I think this might be solvable by using the purgatory as a kind of
"lock", where prune does something like:

  1. compute reachability

  2. move candidate objects into purgatory; nobody can look into
     purgatory except us

  3. compute reachability _again_, making sure that no purgatory objects
     are used (if so, rollback the deletion and try again)

But even that's not quite there, because you need to have some
consistent atomic view of what's "used". Just checking refs isn't
enough, because some other process may be planning to reference a
purgatory object but not yet have updated the ref. So you need some
atomic way of saying "I am interested in using this object".
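
To make the three steps concrete, here is a toy model in shell. It does
not touch a real Git object store: "objects" are plain files, and a
"ref" is a file listing one object name per line. The function name,
the directory layout and the hook argument are all assumptions made for
the illustration. Unlike step 3 above, it only rescues the objects that
turn out to be used instead of rolling back and retrying.

```shell
# prune_with_purgatory <store> [hook]
# <store> contains objects/ and refs/; [hook] is a command run between
# steps 2 and 3 to simulate a writer racing with the prune.
prune_with_purgatory() {
    store=$1
    hook=${2:-:}
    mkdir -p "$store/purgatory"

    # Steps 1 and 2: every object that no ref file mentions is moved
    # into purgatory instead of being deleted.
    for obj in "$store"/objects/*; do
        [ -e "$obj" ] || continue
        name=$(basename "$obj")
        if ! grep -qxF "$name" "$store"/refs/* 2>/dev/null; then
            mv "$obj" "$store/purgatory/$name"
        fi
    done

    # A concurrent writer may create new references at this point.
    eval "$hook"

    # Step 3: check reachability again. Objects that are referenced by
    # now are moved back; the rest are really deleted.
    for obj in "$store"/purgatory/*; do
        [ -e "$obj" ] || continue
        name=$(basename "$obj")
        if grep -qxF "$name" "$store"/refs/* 2>/dev/null; then
            mv "$obj" "$store/objects/$name"
        else
            rm -f "$obj"
        fi
    done
}
```

The model has exactly the hole described above: a writer that has
looked at an object but not yet written its ref when step 3 runs is
invisible to the second check.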

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: worktrees vs. alternates
  2018-05-16 20:02                             ` Konstantin Ryabitsev
@ 2018-05-16 20:17                               ` Jeff King
  2018-05-17  0:43                               ` Sitaram Chamarty
  1 sibling, 0 replies; 34+ messages in thread
From: Jeff King @ 2018-05-16 20:17 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Martin Fick, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Lars Schneider, git, Duy Nguyen

On Wed, May 16, 2018 at 04:02:53PM -0400, Konstantin Ryabitsev wrote:

> On 05/16/18 15:37, Jeff King wrote:
> > Yes, that's pretty close to what we do at GitHub. Before doing any
> > repacking in the mother repo, we actually do the equivalent of:
> > 
> >   git fetch --prune ../$id.git +refs/*:refs/remotes/$id/*
> >   git repack -Adl
> > 
> > from each child to pick up any new objects to de-duplicate (our "mother"
> > repos are not real repos at all, but just big shared-object stores).
> 
> Yes, I keep thinking of doing the same, too -- instead of using
> torvalds/linux.git for alternates, have an internal repo where objects
> from all forks are stored. This conversation may finally give me the
> shove I've been needing to poke at this. :)
> 
> Is your delta-islands patch heading into upstream, or is that something
> that's going to remain external?

I have vague plans to submit it upstream, but I'm still not convinced
it's quite optimal. The resulting packs tend to be a fair bit larger
than they could be when packed by themselves, because we miss many delta
opportunities (and it's important to "repack -f --window=250" once in a
while, since we're throwing away so many delta candidates).

There's an alternative way of doing it, too, which I think repo.or.cz
uses: it "layers" forks in a hierarchy. So if I fork torvalds/linux.git,
then I get my own repo that uses torvalds/linux as an alternate. And if
somebody forks my repo, then I'm their alternate, and they recursively
depend on torvalds/linux. So each fork basically layers a slice of its
own pack on top of the parent.

This is all from recollections of past discussions (which were sadly not
on the list -- I don't know if they've written up their scheme anywhere
public), so I may have some details wrong. But I think that their
repacking is done hierarchically, too: any objects which the root fork
might drop get migrated up to the children instead, and so forth, until
the leaf nodes can actually throw away objects.

The big problem with this is that Git tends to behave better when
objects are in the same pack:

  1. We don't bother looking for new deltas within the same pack,
     whereas a clone of a fork may actually try to find new deltas
     between the layers.

  2. Reachability bitmaps can't cross pack boundaries (due to the way
     they're implemented, but also the current on-disk format). So you
     can only bitmap the root repo, not any of the other layers.

> I feel like a whitepaper on "how we deal with bajillions of forks at
> GitHub" would be nice. :) I was previously told that it's unlikely such
> paper could be written due to so many custom-built things at GH, but I
> would be very happy if that turned out not to be the case.

We have a few engineering blog posts on the subject, like:

  https://githubengineering.com/counting-objects/
  https://githubengineering.com/introducing-dgit/
  https://githubengineering.com/building-resilience-in-spokes/

but we haven't done a very good job of keeping that up. I think a
summary whitepaper would be interesting. Maybe one day...:)

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: worktrees vs. alternates
  2018-05-16 20:06                               ` Jeff King
@ 2018-05-16 20:43                                 ` Martin Fick
  0 siblings, 0 replies; 34+ messages in thread
From: Martin Fick @ 2018-05-16 20:43 UTC (permalink / raw)
  To: Jeff King
  Cc: Konstantin Ryabitsev, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Lars Schneider, git, Duy Nguyen

On Wednesday, May 16, 2018 01:06:59 PM Jeff King wrote:
> On Wed, May 16, 2018 at 01:40:56PM -0600, Martin Fick wrote:
> > > In theory the fetch means that it's safe to actually
> > > prune in the mother repo, but in practice there are
> > > still races. They don't come up often, but if you
> > > have enough repositories, they do eventually. :)
> > 
> > Peff,
> > 
> > I would be very curious to hear what you think of this
> > approach to mitigating the effect of those races?
> > 
> > https://git.eclipse.org/r/c/122288/2
> 
> The crux of the problem is that we have no way to
> atomically mark an object as "I am using this -- do not
> delete" with respect to the actual deletion.
> 
> So if I'm reading your approach correctly, you put objects
> into a purgatory rather than delete them, and let some
> operations rescue them from purgatory if we had a race. 

Yes.  This has the cost of extra disk space for a while, but
I realized that we are already incurring that cost, because
for our repos we already put things into purgatory to avoid
getting stale NFS file handle errors during unrecoverable
paths (while streaming an object).  So effectively this has
no extra space cost beyond what is needed to run safely on
NFS.

>   1. When do you rescue from purgatory? Any time the
>      object is referenced? Do you then pull in all of its
>      reachable objects too?

For my approach, I decided a) Yes b) No

Because:

a) Rescue on reference is cheap and allows any other policy 
to be built upon it, just ensure that policy references it 
at some point before it is pruned from the purgatory.

b)  The other referenced objects will likely get pulled in 
on reference anyway or by virtue of being in the same pack.

>   2. How do you decide when to drop an object from
>      purgatory? And specifically, how do you avoid racing with
>      somebody using the object as you're pruning purgatory?

If you clean the purgatory during repacking after creating 
all the new packs and before deleting the old ones, you will 
have a significant grace window to handle most longer running 
operations.  In this way, repacking will have re-referenced 
any missing objects from the purgatory before it gets pruned 
causing them to be recovered if necessary.  Those missing 
objects, believed to be in the exact packs in the purgatory 
at that time, should only ever have been referenced by write 
operations that started before those packs were moved to the 
purgatory, which was before the previous repacking round 
ended.  This leaves write operations a full repacking cycle 
in which to complete without losing objects.
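
The ordering described above (build new packs, then empty the purgatory, then retire old objects into it) can be sketched as a toy model. This assumes a flat repository in which refs name objects directly, with no commit graph; the `Repo` class and its methods are hypothetical names for illustration only.

```python
# Toy model of the repack ordering; "refs" name objects directly (no graph).
class Repo:
    def __init__(self, objects):
        self.live = set(objects)   # objects in the current packs
        self.purgatory = set()     # objects retired by the previous repack
        self.refs = set()          # objects that are still referenced

    def read(self, oid):
        # Every read rescues a parked object back into the live set.
        if oid in self.purgatory:
            self.purgatory.remove(oid)
            self.live.add(oid)
        return oid in self.live

    def repack(self):
        # 1. "Create the new packs": read everything that is referenced,
        #    which rescues anything referenced but parked.
        for oid in self.refs:
            self.read(oid)
        # 2. Only now delete what is still parked from the previous cycle.
        dropped = set(self.purgatory)
        self.purgatory.clear()
        # 3. Retire unreferenced objects into purgatory instead of deleting.
        unreachable = self.live - self.refs
        self.live -= unreachable
        self.purgatory |= unreachable
        return dropped


repo = Repo({"A", "X"})
repo.refs = {"A"}
print(repo.repack())    # set(): X is parked, nothing is deleted yet
print(repo.purgatory)   # {'X'}
print(repo.repack())    # {'X'}: deleted one full cycle later
```

The grace window is the gap between the two `repack()` calls: an object is only deleted if nothing read it during a whole cycle.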

>   3. How do you know that an operation has been run that
> will actually rescue the object, as opposed to silently
> having a corrupted state on disk?
> 
>      E.g., imagine this sequence:
> 
>        a. git-prune computes reachability and finds that
> commit X is ready to be pruned
> 
>        b. another process sees that commit X exists and
> builds a commit that references it as a parent
> 
>        c. git-prune drops the object into purgatory
> 
>      Now we have a corrupt state created by the process in
> (b), since we have a reachable object in purgatory. But
> what if nobody goes back and tries to read those commits
> in the meantime?

See answer to #2, repacking itself should rescue any objects 
that need to be rescued before pruning the purgatory.
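
As a rough illustration of that claim, the quoted sequence (a)-(c) can be replayed in a toy model in which the next repack walks the reachable history and rescues what it reads. All names here are hypothetical, and the model assumes that repack reads every reachable object through the same rescuing read path.

```python
# Toy replay of steps (a)-(c); the set and dict names are hypothetical.
parents = {"X": [], "Y": ["X"]}          # Y is the commit built in step (b)
live, purgatory, refs = {"X"}, set(), set()

# (a) prune computes reachability: X has no ref, so it is a prune candidate.
candidates = {oid for oid in live if oid not in refs}

# (b) a concurrent writer sees that X exists and builds Y on top of it.
live.add("Y")
refs.add("Y")

# (c) prune acts on its now-stale candidate list.
live.difference_update(candidates)
purgatory.update(candidates)
corrupt_before = "X" not in live         # Y is reachable but its parent is parked


def read(oid):
    """Reading a parked object rescues it; returns the object's parents."""
    if oid in purgatory:
        purgatory.discard(oid)
        live.add(oid)
    return parents[oid]


def repack():
    """Walk everything reachable (rescuing as we go), then empty purgatory."""
    todo = list(refs)
    while todo:
        todo.extend(read(todo.pop()))
    dropped = set(purgatory)
    purgatory.clear()
    return dropped


dropped = repack()
print(corrupt_before, sorted(live), dropped)  # True ['X', 'Y'] set()
```

Even though nobody read the new commit between (c) and the repack, the walk from the ref reaches X and pulls it back before the purgatory is emptied.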

> I think this might be solvable by using the purgatory as a
> kind of "lock", where prune does something like:
> 
>   1. compute reachability
> 
>   2. move candidate objects into purgatory; nobody can
> look into purgatory except us

I don't think this is needed.

It should be OK to let others see the objects in the 
purgatory after 1 and before 3 as long as "seeing" them 
causes them to be recovered!

>   3. compute reachability _again_, making sure that no
> purgatory objects are used (if so, rollback the deletion
> and try again)

Yes, you laid out the formula, but nothing says this 
recompute can't wait until the next repack (again see my 
answer to #2)!  i.e. there is no rush to cause a recovery as 
long as it gets recovered before it gets pruned from the 
purgatory.


> But even that's not quite there, because you need to have
> some consistent atomic view of what's "used". Just
> checking refs isn't enough, because some other process
> may be planning to reference a purgatory object but not
> yet have updated the ref. So you need some atomic way of
> saying "I am interested in using this object".

As long as all write paths also read the object first (I 
assume they do, or we would be in big trouble already), then 
this should not be an issue.  The idea is to force all reads 
(and thus all writes also) to recover the object.

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation



* Re: worktrees vs. alternates
  2018-05-16 19:14           ` Jeff King
@ 2018-05-16 21:18             ` Stefan Beller
  2018-05-16 23:45               ` Jeff King
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Beller @ 2018-05-16 21:18 UTC (permalink / raw)
  To: Jeff King
  Cc: Konstantin Ryabitsev, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Lars Schneider, git,
	Duy Nguyen

>
> You can also do periodic maintenance like:
>
>   1. Copy each ref in the forked repositories into the parent repository
>      (e.g., giving each child that borrows from the parent its own
>      hierarchy in refs/remotes/<child>/*).

Can you just copy? I assume the mother repo doesn't know about
all objects, hence by copying the ref, we have a "spotty" history.

And to improve copying could permanent symlinking be used instead?


* Re: worktrees vs. alternates
  2018-05-16 21:18             ` Stefan Beller
@ 2018-05-16 23:45               ` Jeff King
  0 siblings, 0 replies; 34+ messages in thread
From: Jeff King @ 2018-05-16 23:45 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Konstantin Ryabitsev, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Lars Schneider, git,
	Duy Nguyen

On Wed, May 16, 2018 at 02:18:20PM -0700, Stefan Beller wrote:

> >
> > You can also do periodic maintenance like:
> >
> >   1. Copy each ref in the forked repositories into the parent repository
> >      (e.g., giving each child that borrows from the parent its own
> >      hierarchy in refs/remotes/<child>/*).
> 
> Can you just copy? I assume the mother repo doesn't know about
> all objects, hence by copying the ref, we have a "spotty" history.
> 
> And to improve copying could permanent symlinking be used instead?

Sorry, by copying, I meant "fetching". I.e., migrating objects and refs.
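
To make the distinction concrete, here is a toy sketch of why fetching avoids the "spotty" history that a bare ref copy would leave: the missing objects are migrated before the ref is written. The dictionaries and the `fetch` helper are hypothetical; the ref mapping mirrors the refs/remotes/<child>/* layout mentioned above, and the model assumes the parent's existing objects already have complete history.

```python
# Toy model: objects map id -> list of parent ids; refs map name -> tip id.
def fetch(child, parent, child_name):
    """Migrate missing objects first, then record the child's refs."""
    for ref, tip in child["refs"].items():
        todo = [tip]
        while todo:
            oid = todo.pop()
            if oid in parent["objects"]:
                continue  # parent already has it (and, we assume, its history)
            parent["objects"][oid] = child["objects"][oid]
            todo.extend(child["objects"][oid])
        # refs/heads/x in the child becomes refs/remotes/<child>/heads/x
        new_ref = f"refs/remotes/{child_name}/{ref[len('refs/'):]}"
        parent["refs"][new_ref] = tip


parent = {"objects": {"c1": []}, "refs": {}}
child = {
    "objects": {"c2": ["c1"], "c3": ["c2"]},   # only what the fork added
    "refs": {"refs/heads/topic": "c3"},
}

fetch(child, parent, "fork1")
print(sorted(parent["objects"]))  # ['c1', 'c2', 'c3']
print(parent["refs"])             # {'refs/remotes/fork1/heads/topic': 'c3'}
```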

-Peff


* Re: worktrees vs. alternates
  2018-05-16 20:02                             ` Konstantin Ryabitsev
  2018-05-16 20:17                               ` Jeff King
@ 2018-05-17  0:43                               ` Sitaram Chamarty
  2018-05-17  3:31                                 ` Jeff King
  1 sibling, 1 reply; 34+ messages in thread
From: Sitaram Chamarty @ 2018-05-17  0:43 UTC (permalink / raw)
  To: Konstantin Ryabitsev
  Cc: Jeff King, Martin Fick, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Lars Schneider, git, Duy Nguyen


On Wed, May 16, 2018 at 04:02:53PM -0400, Konstantin Ryabitsev wrote:
> On 05/16/18 15:37, Jeff King wrote:
> > Yes, that's pretty close to what we do at GitHub. Before doing any
> > repacking in the mother repo, we actually do the equivalent of:
> > 
> >   git fetch --prune ../$id.git +refs/*:refs/remotes/$id/*
> >   git repack -Adl
> > 
> > from each child to pick up any new objects to de-duplicate (our "mother"
> > repos are not real repos at all, but just big shared-object stores).
> 
> Yes, I keep thinking of doing the same, too -- instead of using
> torvalds/linux.git for alternates, have an internal repo where objects
> from all forks are stored. This conversation may finally give me the
> shove I've been needing to poke at this. :)

I may have missed a few of the earlier messages, but in the last
20 or so in this thread, I did not see namespaces mentioned by
anyone. (I.e., apologies if it was addressed and discarded
earlier!)

I was under the impression that, as long as "read" access need
not be controlled (Konstantin's situation, at least, and maybe
Peff's too, for public repos), namespaces are a good way to
create and manage that "mother repo".

Is that not true anymore?  Mind, I have not actually used them
in anger anywhere, so I could be missing some really big point
here.

sitaram



* Re: worktrees vs. alternates
  2018-05-17  0:43                               ` Sitaram Chamarty
@ 2018-05-17  3:31                                 ` Jeff King
  2018-05-19  5:45                                   ` Duy Nguyen
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2018-05-17  3:31 UTC (permalink / raw)
  To: Sitaram Chamarty
  Cc: Konstantin Ryabitsev, Martin Fick,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Lars Schneider, git, Duy Nguyen

On Thu, May 17, 2018 at 06:13:55AM +0530, Sitaram Chamarty wrote:

> I may have missed a few of the earlier messages, but in the last
> 20 or so in this thread, I did not see namespaces mentioned by
> anyone. (I.e., apologies if it was addressed and discarded
> earlier!)
> 
> I was under the impression that, as long as "read" access need
> not be controlled (Konstantin's situation, at least, and maybe
> Peff's too, for public repos), namespaces are a good way to
> create and manage that "mother repo".
> 
> Is that not true anymore?  Mind, I have not actually used them
> in anger anywhere, so I could be missing some really big point
> here.

The biggest problem with namespaces as they are currently implemented is
that they do not apply universally to all commands. If you only access
the repo via push/fetch, they may be fine. But as soon as you start
doing other operations (e.g., showing the history of a branch in a web
interface), you don't get to use the namespaced names anymore.

I think a different implementation of namespaces could do this better.
E.g., by controlling the view of the refs at the refs.c layer (or
perhaps as a filtering backend).
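
A toy sketch of the mismatch, assuming the documented layout in which a namespace's refs live under refs/namespaces/<name>/ in the shared repository. The two view functions are hypothetical stand-ins: one for what push/fetch expose when a namespace is selected, the other for what a command that ignores namespaces would see.

```python
# Toy model of namespaced ref names; not Git's refs.c implementation.
def transport_view(all_refs, ns):
    """Refs as push/fetch would present them: namespace prefix stripped."""
    prefix = f"refs/namespaces/{ns}/"
    return {
        name[len(prefix):]: oid
        for name, oid in all_refs.items()
        if name.startswith(prefix)
    }


def raw_view(all_refs):
    """Refs as a namespace-unaware command sees them: full stored names."""
    return dict(all_refs)


all_refs = {
    "refs/namespaces/alice/refs/heads/main": "1111",
    "refs/namespaces/bob/refs/heads/main": "2222",
}

print(transport_view(all_refs, "alice"))  # {'refs/heads/main': '1111'}
print(sorted(raw_view(all_refs)))
```

A filtering layer of the kind suggested above would amount to making every command go through something like `transport_view` rather than `raw_view`.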

-Peff


* Re: worktrees vs. alternates
  2018-05-17  3:31                                 ` Jeff King
@ 2018-05-19  5:45                                   ` Duy Nguyen
  0 siblings, 0 replies; 34+ messages in thread
From: Duy Nguyen @ 2018-05-19  5:45 UTC (permalink / raw)
  To: Jeff King
  Cc: Sitaram Chamarty, Konstantin Ryabitsev, Martin Fick,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Lars Schneider, git

On Thu, May 17, 2018 at 5:31 AM, Jeff King <peff@peff.net> wrote:
> On Thu, May 17, 2018 at 06:13:55AM +0530, Sitaram Chamarty wrote:
>
>> I may have missed a few of the earlier messages, but in the last
>> 20 or so in this thread, I did not see namespaces mentioned by
>> anyone. (I.e., apologies if it was addressed and discarded
>> earlier!)
>>
>> I was under the impression that, as long as "read" access need
>> not be controlled (Konstantin's situation, at least, and maybe
>> Peff's too, for public repos), namespaces are a good way to
>> create and manage that "mother repo".
>>
>> Is that not true anymore?  Mind, I have not actually used them
>> in anger anywhere, so I could be missing some really big point
>> here.
>
> The biggest problem with namespaces as they are currently implemented is
> that they do not apply universally to all commands. If you only access
> the repo via push/fetch, they may be fine. But as soon as you start
> doing other operations (e.g., showing the history of a branch in a web
> interface), you don't get to use the namespaced names anymore.
>
> I think a different implementation of namespaces could do this better.
> E.g., by controlling the view of the refs at the refs.c layer (or
> perhaps as a filtering backend).

Yeah. Namespaces (that work for all commands) + worktree was my plan
for centralizing repos (for one user). But I never got that far to
look into making ref namespaces work for everything.
-- 
Duy


Thread overview: 34+ messages
2018-05-16  8:13 worktrees vs. alternates Lars Schneider
2018-05-16  9:29 ` Ævar Arnfjörð Bjarmason
2018-05-16  9:42   ` Robert P. J. Day
2018-05-16 11:07     ` Ævar Arnfjörð Bjarmason
2018-05-16  9:51   ` Lars Schneider
2018-05-16 10:33     ` Ævar Arnfjörð Bjarmason
2018-05-16 13:02       ` Derrick Stolee
2018-05-16 14:58         ` Konstantin Ryabitsev
2018-05-16 15:34           ` Ævar Arnfjörð Bjarmason
2018-05-16 15:49             ` Konstantin Ryabitsev
2018-05-16 17:54               ` Ævar Arnfjörð Bjarmason
2018-05-16 17:14           ` Martin Fick
2018-05-16 17:41             ` Konstantin Ryabitsev
2018-05-16 18:02               ` Ævar Arnfjörð Bjarmason
2018-05-16 18:12                 ` Konstantin Ryabitsev
2018-05-16 18:26                   ` Martin Fick
2018-05-16 19:01                     ` Konstantin Ryabitsev
2018-05-16 19:03                       ` Martin Fick
2018-05-16 19:11                         ` Konstantin Ryabitsev
2018-05-16 19:18                           ` Martin Fick
2018-05-16 19:23                       ` Jeff King
2018-05-16 19:29                         ` Konstantin Ryabitsev
2018-05-16 19:37                           ` Jeff King
2018-05-16 19:40                             ` Martin Fick
2018-05-16 20:06                               ` Jeff King
2018-05-16 20:43                                 ` Martin Fick
2018-05-16 20:02                             ` Konstantin Ryabitsev
2018-05-16 20:17                               ` Jeff King
2018-05-17  0:43                               ` Sitaram Chamarty
2018-05-17  3:31                                 ` Jeff King
2018-05-19  5:45                                   ` Duy Nguyen
2018-05-16 19:14           ` Jeff King
2018-05-16 21:18             ` Stefan Beller
2018-05-16 23:45               ` Jeff King