* worktrees vs. alternates
@ 2018-05-16 8:13 Lars Schneider
  2018-05-16 9:29 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 34+ messages in thread
From: Lars Schneider @ 2018-05-16 8:13 UTC (permalink / raw)
To: git; +Cc: Jeff King, Duy Nguyen

Hi,

I am looking into different options to cache Git repositories on build
machines. The two most promising ways seem to be git-worktree [1] and
git-alternates [2].

I wonder if you see an advantage of one over the other?

My impression is that git-worktree supersedes git-alternates. Would
that be a fair statement? If yes, would it make sense to deprecate
alternates for simplification?

Thanks,
Lars

[1] https://git-scm.com/docs/git-worktree
[2] https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates
From: Ævar Arnfjörð Bjarmason @ 2018-05-16 9:29 UTC
To: Lars Schneider; +Cc: git, Jeff King, Duy Nguyen

On Wed, May 16 2018, Lars Schneider wrote:

> I am looking into different options to cache Git repositories on build
> machines. The two most promising ways seem to be git-worktree [1] and
> git-alternates [2].
>
> I wonder if you see an advantage of one over the other?
>
> My impression is that git-worktree supersedes git-alternates. Would
> that be a fair statement? If yes, would it make sense to deprecate
> alternates for simplification?
>
> [1] https://git-scm.com/docs/git-worktree
> [2] https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates

It's not correct that worktrees supersede alternates, or the other way
around; they're orthogonal features.

git-worktree allows you to create a new working directory connected to
the same local object store.

Alternates allow you to declare, in any given local object store, that
your set of objects isn't complete and that the rest can be found at
some other location. Those object stores may or may not have more than
one worktree connected to them.
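The distinction drawn in the message above can be tried out in a scratch directory. The following is a minimal sketch, not anything posted in the thread: the directory names (`main`, `wt`, `borrower`) and the `common_dir` helper are made up for illustration. It shows that a worktree resolves to the same repository as its parent, while a `--shared` clone is a separate repository whose `objects/info/alternates` file points at another object store.

```shell
#!/bin/sh
# Sketch only: every path and branch name below is made up for illustration.
set -eu
tmp=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q "$tmp/main"
git -C "$tmp/main" commit -q --allow-empty -m "initial"

# Worktree: a second working directory attached to the SAME repository.
git -C "$tmp/main" worktree add -b topic "$tmp/wt" >/dev/null 2>&1

# Hypothetical helper: resolve a checkout's common git dir to a physical path.
common_dir() (cd "$1" && cd "$(git rev-parse --git-common-dir)" && pwd -P)
main_common=$(common_dir "$tmp/main")
wt_common=$(common_dir "$tmp/wt")
echo "main uses:     $main_common"
echo "worktree uses: $wt_common"

# Alternates: a SEPARATE repository that borrows objects from another store.
git clone -q --shared "$tmp/main" "$tmp/borrower"
echo "borrower alternates: $(cat "$tmp/borrower/.git/objects/info/alternates")"
```

The two `echo` lines for `main` and the worktree should print the same directory, whereas the borrower has its own `.git` and merely lists the other object directory as a place to look.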
* Re: worktrees vs. alternates 2018-05-16 9:29 ` Ævar Arnfjörð Bjarmason @ 2018-05-16 9:42 ` Robert P. J. Day 2018-05-16 11:07 ` Ævar Arnfjörð Bjarmason 2018-05-16 9:51 ` Lars Schneider 1 sibling, 1 reply; 34+ messages in thread From: Robert P. J. Day @ 2018-05-16 9:42 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Lars Schneider, git, Jeff King, Duy Nguyen [-- Attachment #1: Type: text/plain, Size: 1396 bytes --] On Wed, 16 May 2018, Ævar Arnfjörð Bjarmason wrote: > > On Wed, May 16 2018, Lars Schneider wrote: > > > I am looking into different options to cache Git repositories on build > > machines. The two most promising ways seem to be git-worktree [1] and > > git-alternates [2]. > > > > I wonder if you see an advantage of one over the other? > > > > My impression is that git-worktree supersedes git-alternates. Would > > that be a fair statement? If yes, would it makes sense to deprecate > > alternates for simplification? > > > > [1] https://git-scm.com/docs/git-worktree > > [2] https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates > > It's not correct that worktrees supersede alternates, or the other > way around, they're orthagonal features. > > git-worktree allows you to create a new working directory connected > to the same local object store. > > Alternates allow you to declare in any given local object store, > that your set of objects isn't complete, and you can find the rest > at some other location, those object stores may or may not have more > than one worktree connected to them. just to be clear here, there should be nothing about how alternates are set up for a repository that should affect the normal behaviour of working trees for that repository, correct? i never thought there was, i just thought i'd make absolutely sure. rday ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 9:42 ` Robert P. J. Day @ 2018-05-16 11:07 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 34+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-05-16 11:07 UTC (permalink / raw) To: Robert P. J. Day; +Cc: Lars Schneider, git, Jeff King, Duy Nguyen On Wed, May 16 2018, Robert P. J. Day wrote: > On Wed, 16 May 2018, Ævar Arnfjörð Bjarmason wrote: > >> >> On Wed, May 16 2018, Lars Schneider wrote: >> >> > I am looking into different options to cache Git repositories on build >> > machines. The two most promising ways seem to be git-worktree [1] and >> > git-alternates [2]. >> > >> > I wonder if you see an advantage of one over the other? >> > >> > My impression is that git-worktree supersedes git-alternates. Would >> > that be a fair statement? If yes, would it makes sense to deprecate >> > alternates for simplification? >> > >> > [1] https://git-scm.com/docs/git-worktree >> > [2] https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates >> >> It's not correct that worktrees supersede alternates, or the other >> way around, they're orthagonal features. >> >> git-worktree allows you to create a new working directory connected >> to the same local object store. >> >> Alternates allow you to declare in any given local object store, >> that your set of objects isn't complete, and you can find the rest >> at some other location, those object stores may or may not have more >> than one worktree connected to them. > > just to be clear here, there should be nothing about how alternates > are set up for a repository that should affect the normal behaviour of > working trees for that repository, correct? i never thought there was, > i just thought i'd make absolutely sure. That's correct. The worktree(s) are logically composed of the index/cache, checked-out files, and the local reference store (and some auxiliary things, like per-worktree refs like HEAD, and config...). 
Whether you have one worktree or many, eventually git needs to look up objects somewhere. The alternates mechanism is just one more way to specify where to look, along with some special logic in pack-objects and the like where we need to be aware of them for the purposes of maintaining objects in the repository. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 9:29 ` Ævar Arnfjörð Bjarmason 2018-05-16 9:42 ` Robert P. J. Day @ 2018-05-16 9:51 ` Lars Schneider 2018-05-16 10:33 ` Ævar Arnfjörð Bjarmason 1 sibling, 1 reply; 34+ messages in thread From: Lars Schneider @ 2018-05-16 9:51 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: git, Jeff King, Duy Nguyen > On 16 May 2018, at 11:29, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > > On Wed, May 16 2018, Lars Schneider wrote: > >> I am looking into different options to cache Git repositories on build >> machines. The two most promising ways seem to be git-worktree [1] and >> git-alternates [2]. >> >> I wonder if you see an advantage of one over the other? >> >> My impression is that git-worktree supersedes git-alternates. Would >> that be a fair statement? If yes, would it makes sense to deprecate >> alternates for simplification? >> >> [1] https://git-scm.com/docs/git-worktree >> [2] https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates > > It's not correct that worktrees supersede alternates, or the other way > around, they're orthagonal features. > > git-worktree allows you to create a new working directory connected to > the same local object store. > > Alternates allow you to declare in any given local object store, that > your set of objects isn't complete, and you can find the rest at some > other location, those object stores may or may not have more than one > worktree connected to them. OK. I just wonder in what situation I would work with an incomplete object store. The only use case I could imagine is that two repos share a common set of objects (most likely blobs). However, in that situation I would keep the two independent lines of development in a single repo with two root commits. Would it be fair to say that "git alternates" are a good mechanism to cache objects across different repos? However, I would consider a cache hit between different repos unlikely. 
In that line of thinking "git worktree" would be a good (maybe better?) mechanism to cache objects for a single repo? Thanks, Lars ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 9:51 ` Lars Schneider @ 2018-05-16 10:33 ` Ævar Arnfjörð Bjarmason 2018-05-16 13:02 ` Derrick Stolee 0 siblings, 1 reply; 34+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-05-16 10:33 UTC (permalink / raw) To: Lars Schneider; +Cc: git, Jeff King, Duy Nguyen On Wed, May 16 2018, Lars Schneider wrote: >> On 16 May 2018, at 11:29, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: >> >> >> On Wed, May 16 2018, Lars Schneider wrote: >> >>> I am looking into different options to cache Git repositories on build >>> machines. The two most promising ways seem to be git-worktree [1] and >>> git-alternates [2]. >>> >>> I wonder if you see an advantage of one over the other? >>> >>> My impression is that git-worktree supersedes git-alternates. Would >>> that be a fair statement? If yes, would it makes sense to deprecate >>> alternates for simplification? >>> >>> [1] https://git-scm.com/docs/git-worktree >>> [2] https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates >> >> It's not correct that worktrees supersede alternates, or the other way >> around, they're orthagonal features. >> >> git-worktree allows you to create a new working directory connected to >> the same local object store. >> >> Alternates allow you to declare in any given local object store, that >> your set of objects isn't complete, and you can find the rest at some >> other location, those object stores may or may not have more than one >> worktree connected to them. > > OK. I just wonder in what situation I would work with an incomplete > object store. The only use case I could imagine is that two repos share > a common set of objects (most likely blobs). However, in that situation > I would keep the two independent lines of development in a single repo > with two root commits. > > Would it be fair to say that "git alternates" are a good mechanism to > cache objects across different repos? 
However, I would consider a cache
> hit between different repos unlikely. In that line of thinking
> "git worktree" would be a good (maybe better?) mechanism to cache objects
> for a single repo?

The use case is cloning with e.g. --shared or --reference. Consider the
following scenario:

 * You have 100 developers with *nix accounts on a single machine.

 * These 100 all need access to the same repo, but .git/objects is 1G.

 * This would then naïvely require 100G of space + working tree. If the
   machine has 92G of RAM you'll be swapping the fscache in & out and
   performance will be horrible.

Instead, you have a single repository maintained on the system designed
to have all the alternates point to it, cloned as:

    git clone --reference /usr/share/git_tree/bigrepo ssh://....bigrepo.git ~/bigrepo

Now you're using just a bit over 1GB of space in total, but any new
objects the devs create will be written to their local .git dir. Since
you're spending 1GB for those 100 repos instead of 100GB, the data is
always in the FS cache.

And here's where this isn't at all like "worktree": each of those 100
will have their own "master" branch, and they can all create 100
different branches called "topic" that can be different.

With worktree the references are shared across all the worktrees, so
it's designed for one dev working on different topic branches in
different checkouts.

The --reference feature is also commonly used in CI-like
environments. Imagine the above example, but instead of 100 devs you
have CI jobs on the same machine being spun up all the time. Here you
do get some overlap, though: if you're OK with the main branch name
being different, you can also do this with worktrees instead of
alternates.
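A scaled-down version of the scenario above can be reproduced locally. This is a sketch under assumptions: `upstream` stands in for the remote bigrepo, `cache.git` stands in for /usr/share/git_tree/bigrepo, and the developer names are invented. It shows that each `--reference` clone records the cache in its alternates file, yet keeps its own independent `topic` branch.

```shell
#!/bin/sh
# Sketch only: "upstream", "cache.git", "alice" and "bob" are hypothetical names.
set -eu
tmp=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the remote repository every developer clones from.
git init -q "$tmp/upstream"
echo hello >"$tmp/upstream/README"
git -C "$tmp/upstream" add README
git -C "$tmp/upstream" commit -q -m "initial"

# The single shared copy that all alternates point to.
git clone -q --bare "$tmp/upstream" "$tmp/cache.git"

# Each developer clones from upstream but borrows objects from the cache.
for dev in alice bob; do
  git clone -q --reference "$tmp/cache.git" "file://$tmp/upstream" "$tmp/$dev"
  git -C "$tmp/$dev" checkout -q -b topic
  git -C "$tmp/$dev" commit -q --allow-empty -m "topic work by $dev"
done

# Same branch name, different commits: refs are per repository, not shared.
alice_topic=$(git -C "$tmp/alice" rev-parse topic)
bob_topic=$(git -C "$tmp/bob" rev-parse topic)
echo "alice topic: $alice_topic"
echo "bob topic:   $bob_topic"
echo "alice borrows from: $(cat "$tmp/alice/.git/objects/info/alternates")"
```

The `file://` URL is used so the clone goes through the normal transport rather than the local-path shortcut, which is closer to the ssh case in the original command.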
* Re: worktrees vs. alternates
From: Derrick Stolee @ 2018-05-16 13:02 UTC
To: Ævar Arnfjörð Bjarmason, Lars Schneider
Cc: git, Jeff King, Duy Nguyen

On 5/16/2018 6:33 AM, Ævar Arnfjörð Bjarmason wrote:

[big snip]

> And here's where this isn't at all like "worktree", each of those 100
> will have their own "master" branch, and they can all create 100
> different branches called "topic" that can be different.

This is the biggest difference. You cannot have the same ref checked out
in multiple worktrees, as they both may edit that ref.

The alternates allow you to share data in a "read only" fashion. If you
have one repo that is the "base" repo that manages that objects dir,
then that is probably a good way to reduce the duplication. I'm not
familiar with what happens when a "child" repo does 'git gc' or 'git
repack': will it delete the local objects that it sees exist in the
alternate?

GVFS uses alternates in this same way: we create a drive-wide "shared
object cache" that GVFS manages. We put our prefetch packs filled with
commits and trees in there, and any loose objects that are downloaded
via the object virtualization are placed as loose objects in the
alternate. We also store the multi-pack-index and commit-graph in that
alternate. This means that the only objects in each src dir are those
created by the developer doing their normal work.

Thanks,
-Stolee
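The restriction mentioned above (one branch cannot be checked out in two worktrees) is easy to observe. A minimal sketch, with hypothetical paths `repo`, `wt-same` and `wt-topic`:

```shell
#!/bin/sh
# Sketch only: repository and worktree paths are made up for illustration.
set -eu
tmp=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q "$tmp/repo"
git -C "$tmp/repo" commit -q --allow-empty -m "initial"
branch=$(git -C "$tmp/repo" symbolic-ref --short HEAD)

# A second checkout of the branch that is already checked out is refused.
if git -C "$tmp/repo" worktree add "$tmp/wt-same" "$branch" >/dev/null 2>&1; then
  same_branch=allowed
else
  same_branch=refused
fi
echo "second checkout of $branch: $same_branch"

# A worktree on a new branch is fine.
git -C "$tmp/repo" worktree add -b topic "$tmp/wt-topic" >/dev/null 2>&1
echo "topic worktree HEAD: $(git -C "$tmp/wt-topic" symbolic-ref --short HEAD)"
```

With alternates there is no such restriction, because each borrowing repository has its own refs.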
* Re: worktrees vs. alternates
From: Konstantin Ryabitsev @ 2018-05-16 14:58 UTC
To: Derrick Stolee, Ævar Arnfjörð Bjarmason, Lars Schneider
Cc: git, Jeff King, Duy Nguyen

On 05/16/18 09:02, Derrick Stolee wrote:
> This is the biggest difference. You cannot have the same ref checked out
> in multiple worktrees, as they both may edit that ref. The alternates
> allow you to share data in a "read only" fashion. If you have one repo
> that is the "base" repo that manages that objects dir, then that is
> probably a good way to reduce the duplication. I'm not familiar with
> what happens when a "child" repo does 'git gc' or 'git repack', will it
> delete the local objects that it sees exist in the alternate?

The parent repo is not keeping track of any other repositories that may
be using it for alternates, which is why you basically:

1. never run auto-gc in the parent repo
2. repack it manually using -Ad to keep loose objects that other repos
   may be borrowing (but we don't know if they are)
3. never prune the parent repo, because this may delete objects other
   repos are borrowing

Very infrequently you may consider this extra set of maintenance steps:

1. Find every repo mentioning the parent repository in their alternates
2. Repack them without the -l switch (which copies all the borrowed
   objects into those repos)
3. Once all child repos have been repacked this way, prune the parent
   repo (it's safe now)
4. Repack child repos again, this time with the -l flag, to get your
   savings back.

I would heartily love a way to teach git-repack to recognize when an
object it's borrowing from the parent repo is in danger of being pruned.
The cheapest way of doing this would probably be to hardlink loose
objects into its own objects directory and only consider "safe" objects
those that are part of the parent repository's pack. This should make
alternates a lot safer, just in case git-prune happens to run by accident.

> GVFS uses alternates in this same way: we create a drive-wide "shared
> object cache" that GVFS manages. We put our prefetch packs filled with
> commits and trees in there, and any loose objects that are downloaded
> via the object virtualization are placed as loose objects in the
> alternate. We also store the multi-pack-index and commit-graph in that
> alternate. This means that the only objects in each src dir are those
> created by the developer doing their normal work.

I'm very interested in GVFS, because it would certainly make my life
easier maintaining source.codeaurora.org, which is many thousands of
repos that are mostly forks of the same stuff. However, GVFS appears to
only exist for Windows (hint-hint, nudge-nudge). :)

Best,
--
Konstantin Ryabitsev
Director, IT Infrastructure Security
The Linux Foundation
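The four infrequent maintenance steps described in the message above can be walked through on a toy pair of repositories. This is a sketch under stated assumptions, not the author's tooling: `parent` and `child` are invented names, step 1 is reduced to checking a single known child, and the `reflog expire` call exists only to make the old commit unreachable immediately for the demonstration. As noted later in the thread, doing this on a busy server without downtime may still be racy.

```shell
#!/bin/sh
# Sketch only: repository names and layout are hypothetical, and the reflog
# expiry is there just to make an object unreachable quickly for the demo.
set -eu
tmp=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
parent=$tmp/parent
child=$tmp/child

git init -q "$parent"
echo one >"$parent/file"
git -C "$parent" add file
git -C "$parent" commit -q -m "original"
old=$(git -C "$parent" rev-parse HEAD)
git clone -q --shared "$parent" "$child"   # child borrows every object

# Parent history is rewritten; $old becomes unreachable in the parent.
git -C "$parent" commit -q --amend -m "rewritten"
git -C "$parent" reflog expire --expire=now --all

# Routine parent maintenance: -A keeps unreachable objects around as loose.
git -C "$parent" repack -A -d -q

# 1. Confirm the child lists the parent in its alternates.
grep -qF "/parent/.git/objects" "$child/.git/objects/info/alternates"
# 2. Repack the child without -l, copying borrowed objects into it.
git -C "$child" repack -a -d -q
# 3. Prune the parent now that the child no longer depends on it.
git -C "$parent" prune
# 4. Repack the child with -l to drop what the parent still provides.
git -C "$child" repack -a -d -l -q

if git -C "$parent" cat-file -e "$old" 2>/dev/null; then parent_has_old=yes; else parent_has_old=no; fi
if git -C "$child" cat-file -e "$old" 2>/dev/null; then child_has_old=yes; else child_has_old=no; fi
echo "parent still has old commit: $parent_has_old"
echo "child still has old commit:  $child_has_old"
```

After the last step the child should hold only the objects the parent no longer has, which is the "savings back" effect described in step 4.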
* Re: worktrees vs. alternates 2018-05-16 14:58 ` Konstantin Ryabitsev @ 2018-05-16 15:34 ` Ævar Arnfjörð Bjarmason 2018-05-16 15:49 ` Konstantin Ryabitsev 2018-05-16 17:14 ` Martin Fick 2018-05-16 19:14 ` Jeff King 2 siblings, 1 reply; 34+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-05-16 15:34 UTC (permalink / raw) To: Konstantin Ryabitsev Cc: Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen On Wed, May 16 2018, Konstantin Ryabitsev wrote: > On 05/16/18 09:02, Derrick Stolee wrote: >> This is the biggest difference. You cannot have the same ref checked out >> in multiple worktrees, as they both may edit that ref. The alternates >> allow you to share data in a "read only" fashion. If you have one repo >> that is the "base" repo that manages that objects dir, then that is >> probably a good way to reduce the duplication. I'm not familiar with >> what happens when a "child" repo does 'git gc' or 'git repack', will it >> delete the local objects that is sees exist in the alternate? > > The parent repo is not keeping track of any other repositories that may > be using it for alternates, which is why you basically: > > 1. never run auto-gc in the parent repo > 2. repack it manually using -Ad to keep loose objects that other repos > may be borrowing (but we don't know if they are) > 3. never prune the parent repo, because this may delete objects other > repos are borrowing > > Very infrequently you may consider this extra set of maintenance steps: > > 1. Find every repo mentioning the parent repository in their alternates > 2. Repack them without the -l switch (which copies all the borrowed > objects into those repos) > 3. Once all child repos have been repacked this way, prune the parent > repo (it's safe now) > 4. Repack child repos again, this time with the -l flag, to get your > savings back. > > I would heartily love a way to teach git-repack to recognize when an > object it's borrowing from the parent repo is in danger of being pruned. 
> The cheapest way of doing this would probably be to hardlink loose > objects into its own objects directory and only consider "safe" objects > those that are part of the parent repository's pack. This should make > alternates a lot safer, just in case git-prune happens to run by accident. I may have missed some edge case, but I believe this entire workaround isn't needed if you guarantee that the parent repo doesn't contain any objects that will get un-referenced. You'd do that in the common case by cloning with --single-branch, and depending on your setup --no-tags (if you delete tags). This is assuming that your HEAD branch points to something like a "master" that doesn't get rewound. The problem you're describing happens if say you clone git.git and have the "pu" branch in there in the parent, and as a result you get child repos referencing those objects, but when the parent GCs after "pu" is rewound the child repos break. Thus your elaborate work-around. But that situation isn't possible in the first place if you only ever import the "master" branch, or other references guaranteed not to change. Of course that has the trade-off that every child repo needs to get its own objects for the "next" branch, "pu", etc. But those are comparatively tiny. I wasn't aware of -l (--local), or had forgotten about it. I thought that we didn't have that and the "child" repos would just keep growing over time, i.e. not get rid of the objects we're fetching into the parent (which the parent might get later due to the child, say if it's fetched in a daily cronjob). Good to know that's not the case. With that --local flag the trade-off of not fetching "next" and "pu" etc. should become irrelevant over time, as they migrate to "master" they'll get de-duplicated, or alternatively GC'd by the child repos if they don't make it. >> GVFS uses alternates in this same way: we create a drive-wide "shared >> object cache" that GVFS manages. 
We put our prefetch packs filled with >> commits and trees in there, and any loose objects that are downloaded >> via the object virtualization are placed as loose objects in the >> alternate. We also store the multi-pack-index and commit-graph in that >> alternate. This means that the only objects in each src dir are those >> created by the developer doing their normal work. > > I'm very interested in GVFS, because it would certainly make my life > easier maintaining source.codeaurora.org, which is many thousands of > repos that are mostly forks of the same stuff. However, GVFS appears to > only exist for Windows (hint-hint, nudge-nudge). :) This should make you happy: https://arstechnica.com/gadgets/2017/11/microsoft-and-github-team-up-to-take-git-virtual-file-system-to-macos-linux/ But I don't know what the current status is or where it can be followed. ^ permalink raw reply [flat|nested] 34+ messages in thread
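The suggestion in the message above, to build the parent repository from a single stable branch and no tags, might look like the following. It is a sketch with invented names (`upstream`, `parent.git`, a `pu` branch and a `v1` tag created locally for the demo), and it assumes a Git new enough to support `git clone --no-tags`.

```shell
#!/bin/sh
# Sketch only: "upstream", "parent.git", the "pu" branch and tag "v1" are
# hypothetical stand-ins created here for the demo.
set -eu
tmp=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
up=$tmp/upstream

git init -q "$up"
echo base >"$up/file"
git -C "$up" add file
git -C "$up" commit -q -m "base"
git -C "$up" tag v1

# A branch that may be rewound later, like "pu" in git.git.
git -C "$up" checkout -q -b pu
echo wip >>"$up/file"
git -C "$up" commit -q -a -m "wip that may be rewound"
pu_commit=$(git -C "$up" rev-parse pu)
git -C "$up" checkout -q -                 # back to the default branch

# Parent repo for alternates: only the branch HEAD points to, and no tags.
git clone -q --bare --single-branch --no-tags "file://$up" "$tmp/parent.git"

ref_count=$(git -C "$tmp/parent.git" for-each-ref | wc -l | tr -d ' ')
if git -C "$tmp/parent.git" cat-file -e "$pu_commit" 2>/dev/null; then has_pu=yes; else has_pu=no; fi
echo "refs in parent: $ref_count"
echo "parent contains the pu commit: $has_pu"
```

Because the `pu` objects never enter the parent, there is nothing there for a later rewind of `pu` to orphan; child repositories that want `pu` fetch those objects themselves.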
* Re: worktrees vs. alternates 2018-05-16 15:34 ` Ævar Arnfjörð Bjarmason @ 2018-05-16 15:49 ` Konstantin Ryabitsev 2018-05-16 17:54 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 34+ messages in thread From: Konstantin Ryabitsev @ 2018-05-16 15:49 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen On Wed, May 16, 2018 at 05:34:34PM +0200, Ævar Arnfjörð Bjarmason wrote: >I may have missed some edge case, but I believe this entire workaround >isn't needed if you guarantee that the parent repo doesn't contain any >objects that will get un-referenced. You can't guarantee that, because the parent repo can have its history rewritten either via a forced push, or via a rebase. Obviously, this won't happen in something like torvalds/linux.git, which is why it's pretty safe to alternate off of that repo for us, but codeaurora.org repos aren't always strictly-ff (e.g. because they may rebase themselves based on what is in upstream AOSP repos) -- so objects in them may become unreferenced and pruned away, corrupting any repos using them for alternates. >> I'm very interested in GVFS, because it would certainly make my life >> easier maintaining source.codeaurora.org, which is many thousands of >> repos that are mostly forks of the same stuff. However, GVFS appears to >> only exist for Windows (hint-hint, nudge-nudge). :) > >This should make you happy: > >https://arstechnica.com/gadgets/2017/11/microsoft-and-github-team-up-to-take-git-virtual-file-system-to-macos-linux/ > >But I don't know what the current status is or where it can be followed. Very good to know, thanks! -K ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 15:49 ` Konstantin Ryabitsev @ 2018-05-16 17:54 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 34+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-05-16 17:54 UTC (permalink / raw) To: Konstantin Ryabitsev Cc: Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen On Wed, May 16 2018, Konstantin Ryabitsev wrote: > On Wed, May 16, 2018 at 05:34:34PM +0200, Ævar Arnfjörð Bjarmason wrote: >>I may have missed some edge case, but I believe this entire workaround >>isn't needed if you guarantee that the parent repo doesn't contain any >>objects that will get un-referenced. > > You can't guarantee that, because the parent repo can have its history > rewritten either via a forced push, or via a rebase. Obviously, this > won't happen in something like torvalds/linux.git, which is why it's > pretty safe to alternate off of that repo for us, but codeaurora.org > repos aren't always strictly-ff (e.g. because they may rebase themselves > based on what is in upstream AOSP repos) -- so objects in them may > become unreferenced and pruned away, corrupting any repos using them for > alternates. Right, it wouldn't work in the general case. I was thinking of the use-case for doing this (say with known big monorepos) where you know a given branch won't be unwound. Still, there's a tiny variation on this that should work with arbitrary repos whose master may be rewound, you just setup a refspec to fetch their upstream HEAD into master-1 without having "+" in the refspec. Then if they never rewind you keep fetching to master-1 forever. If they do rewind you fetch that to master-2 and so forth, so you can follow an upstream rewinding branch while still guaranteeing that no objects ever disappear from your parent repo. This is still a lot simpler than the juggling approach you noted, since it's just a tiny shellscript around the "fetch". This assumes that: 1. 
Whenever this happens the history is still similar enough that the parent won't balloon in size like this, or at least it won't be worse than not using alternates at all. 2. You're getting most of the gains of the object sharing by just grabbing the upstream HEAD branch, i.e. you don't have some repo with huge and N unrelated histories. ^ permalink raw reply [flat|nested] 34+ messages in thread
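The "tiny shellscript around the fetch" mentioned above was not posted in the thread, so the following is only one guess at its shape. The `fetch_generation` and `change` helpers and all repository names are hypothetical. The key point is the refspec without a leading "+": a non-fast-forward update is rejected, and the script then starts a new `master-N` ref instead of overwriting the old one.

```shell
#!/bin/sh
# Sketch only: helper names and paths are made up; this is one possible
# reading of the approach, not a script from the thread.
set -eu
tmp=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
upstream=$tmp/upstream
parent=$tmp/parent.git

git init -q "$upstream"
git init -q --bare "$parent"

# Hypothetical helper: stage a new version of the demo file upstream.
change() { echo "$1" >"$upstream/file"; git -C "$upstream" add file; }

# Fetch upstream HEAD into the newest master-N. The refspec has no leading
# "+", so a rewound upstream is rejected and master-(N+1) is started instead.
# Note: any other fetch failure would also bump N in this simple version.
fetch_generation() {
  n=1
  while git -C "$parent" show-ref --verify --quiet "refs/heads/master-$((n + 1))"; do
    n=$((n + 1))
  done
  if ! git -C "$parent" fetch -q "$upstream" "HEAD:refs/heads/master-$n" 2>/dev/null; then
    n=$((n + 1))
    git -C "$parent" fetch -q "$upstream" "HEAD:refs/heads/master-$n"
  fi
}

change A; git -C "$upstream" commit -q -m "A"
fetch_generation                           # creates master-1
change B; git -C "$upstream" commit -q -m "B"
fetch_generation                           # fast-forwards master-1
before=$(git -C "$parent" rev-parse master-1)

change "B rewritten"
git -C "$upstream" commit -q --amend -m "B rewritten"
fetch_generation                           # rejected, so master-2 is created

echo "generations so far: $n"
echo "master-1 kept at:   $(git -C "$parent" rev-parse master-1)"
echo "master-2 now at:    $(git -C "$parent" rev-parse master-2)"
```

Since no ref in the parent is ever moved backwards or deleted, every object that a child may have borrowed stays reachable there.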
* Re: worktrees vs. alternates 2018-05-16 14:58 ` Konstantin Ryabitsev 2018-05-16 15:34 ` Ævar Arnfjörð Bjarmason @ 2018-05-16 17:14 ` Martin Fick 2018-05-16 17:41 ` Konstantin Ryabitsev 2018-05-16 19:14 ` Jeff King 2 siblings, 1 reply; 34+ messages in thread From: Martin Fick @ 2018-05-16 17:14 UTC (permalink / raw) To: Konstantin Ryabitsev Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason, Lars Schneider, git, Jeff King, Duy Nguyen On Wednesday, May 16, 2018 10:58:19 AM Konstantin Ryabitsev wrote: > > 1. Find every repo mentioning the parent repository in > their alternates 2. Repack them without the -l switch > (which copies all the borrowed objects into those repos) > 3. Once all child repos have been repacked this way, prune > the parent repo (it's safe now) This is probably only true if the repos are in read-only mode? I suspect this is still racy on a busy server with no downtime. > 4. Repack child repos again, this time with the -l flag, > to get your savings back. > I would heartily love a way to teach git-repack to > recognize when an object it's borrowing from the parent > repo is in danger of being pruned. The cheapest way of > doing this would probably be to hardlink loose objects > into its own objects directory and only consider "safe" > objects those that are part of the parent repository's > pack. This should make alternates a lot safer, just in > case git-prune happens to run by accident. I think that hard linking is generally a good approach to solving many of the "pruning" races left in git. I have uploaded a "hard linking" proposal to jgit that could potentially solve a similar situation that is not alternate specific, and only for packfiles, with the intent of eventually also doing something similar for loose objects. You can see this here: https://git.eclipse.org/r/c/122288/2 I think it would be good to fill in more of these pruning gaps! -Martin -- The Qualcomm Innovation Center, Inc. 
is a member of Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates
From: Konstantin Ryabitsev @ 2018-05-16 17:41 UTC
To: Martin Fick
Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason, Lars Schneider, git, Jeff King, Duy Nguyen

On 05/16/18 13:14, Martin Fick wrote:
> On Wednesday, May 16, 2018 10:58:19 AM Konstantin Ryabitsev
> wrote:
>>
>> 1. Find every repo mentioning the parent repository in
>> their alternates
>> 2. Repack them without the -l switch
>> (which copies all the borrowed objects into those repos)
>> 3. Once all child repos have been repacked this way, prune
>> the parent repo (it's safe now)
>
> This is probably only true if the repos are in read-only
> mode? I suspect this is still racy on a busy server with no
> downtime.

We don't actually do this anywhere. :) It's a feature I keep hoping to
add one day to grokmirror, but keep putting off because of various
considerations. As you can imagine, if we have 300 forks of linux.git
all using torvalds/linux.git as their alternates, then repacking them
all without -l would balloon our disk usage 300-fold. At this time it's
just cheaper to keep a bunch of loose objects around forever at the cost
of decreased performance.

Maybe git-repack can be told to only borrow parent objects if they are
in packs. Anything not in packs should be hardlinked into the child
repo. That's my wishful thinking for the day. :)

Best,
--
Konstantin Ryabitsev
Director, IT Infrastructure Security
The Linux Foundation
* Re: worktrees vs. alternates
From: Ævar Arnfjörð Bjarmason @ 2018-05-16 18:02 UTC
To: Konstantin Ryabitsev
Cc: Martin Fick, Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen

On Wed, May 16 2018, Konstantin Ryabitsev wrote:

> Maybe git-repack can be told to only borrow parent objects if they are
> in packs. Anything not in packs should be hardlinked into the child
> repo. That's my wishful think for the day. :)

Can you elaborate on how this would help?

We're just going to create loose objects on interactive "git commit";
presumably you're not adding someone's working copy as the alternate.

Otherwise, if it's just being pushed to, all those pushes are going to
be in packs, and the packs may contain e.g. pushes for the "pu" branch
or whatever, which are objects that'll go away.
* Re: worktrees vs. alternates 2018-05-16 18:02 ` Ævar Arnfjörð Bjarmason @ 2018-05-16 18:12 ` Konstantin Ryabitsev 2018-05-16 18:26 ` Martin Fick 0 siblings, 1 reply; 34+ messages in thread From: Konstantin Ryabitsev @ 2018-05-16 18:12 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Martin Fick, Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen [-- Attachment #1.1: Type: text/plain, Size: 1535 bytes --] On 05/16/18 14:02, Ævar Arnfjörð Bjarmason wrote: > > On Wed, May 16 2018, Konstantin Ryabitsev wrote: > >> Maybe git-repack can be told to only borrow parent objects if they are >> in packs. Anything not in packs should be hardlinked into the child >> repo. That's my wishful think for the day. :) > > Can you elaborate on how this would help? > > We're just going to create loose objects on interactive "git commit", > presumably you're not adding someone's working copy as the alternate. The loose objects I'm thinking of are those that are generated when we do "git repack -Ad" -- this takes all unreachable objects and loosens them (see man git-repack for more info). Normally, these would be pruned after a certain period, but we're deliberately keeping them around forever just in case another repo relies on them via alternates. I want those repos to "claim" these loose objects via hardlinks, such that we can run git-prune on the mother repo instead of dragging all the unreachable objects on forever just in case. > Otherwise if it's just being pushed to all those pushes are going to be > in packs, and the packs may contain e.g. pushes for the "pu" branch or > whatever, which are objects that'll go away. There are lots of cases where unreachable objects in one repo would never become unreachable in another -- for example, if the author had stopped updating it. Hope this helps. 
Best, -- Konstantin Ryabitsev Director, IT Infrastructure Security The Linux Foundation ^ permalink raw reply [flat|nested] 34+ messages in thread
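The "claim via hardlinks" idea above is a wish, not something git-repack does. Purely as an illustration of the filesystem side, a minimal sketch might look like the following. The function name `claim_loose_objects` and both directory arguments are hypothetical, and the sketch claims every loose object in the parent rather than only those the child actually needs.

```shell
#!/bin/sh
# Hypothetical helper, not an existing git feature: hardlink every loose
# object found in a parent object directory into a child object directory,
# so the child keeps its own link if the parent is pruned later.
# Assumes both directories are on the same filesystem (ln requires it).
claim_loose_objects() {
    parent_objects=$1
    child_objects=$2
    # Loose objects live in two-hex-digit fan-out directories.
    for src in "$parent_objects"/[0-9a-f][0-9a-f]/*; do
        [ -f "$src" ] || continue            # the glob matched nothing
        fanout=$(basename "$(dirname "$src")")
        dst="$child_objects/$fanout/$(basename "$src")"
        [ -e "$dst" ] && continue            # the child already has it
        mkdir -p "$child_objects/$fanout"
        ln "$src" "$dst"
    done
}
```

Called as `claim_loose_objects /path/to/parent.git/objects /path/to/child.git/objects`, this leaves the parent untouched; whether pruning the parent afterwards is safe still depends on the races discussed later in the thread.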
* Re: worktrees vs. alternates 2018-05-16 18:12 ` Konstantin Ryabitsev @ 2018-05-16 18:26 ` Martin Fick 2018-05-16 19:01 ` Konstantin Ryabitsev 0 siblings, 1 reply; 34+ messages in thread From: Martin Fick @ 2018-05-16 18:26 UTC (permalink / raw) To: Konstantin Ryabitsev Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen On Wednesday, May 16, 2018 02:12:24 PM Konstantin Ryabitsev wrote: > The loose objects I'm thinking of are those that are > generated when we do "git repack -Ad" -- this takes all > unreachable objects and loosens them (see man git-repack > for more info). Normally, these would be pruned after a > certain period, but we're deliberately keeping them > around forever just in case another repo relies on them > via alternates. I want those repos to "claim" these loose > objects via hardlinks, such that we can run git-prune on > the mother repo instead of dragging all the unreachable > objects on forever just in case. If you are going to keep the unreferenced objects around forever, it might be better to keep them around in packed form? We currently do that because we don't think there is a safe way to prune objects yet on a running server (which is why I am teaching jgit to be able to recover from a racy pruning error), -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 18:26 ` Martin Fick @ 2018-05-16 19:01 ` Konstantin Ryabitsev 2018-05-16 19:03 ` Martin Fick 2018-05-16 19:23 ` Jeff King 0 siblings, 2 replies; 34+ messages in thread From: Konstantin Ryabitsev @ 2018-05-16 19:01 UTC (permalink / raw) To: Martin Fick Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen On 05/16/18 14:26, Martin Fick wrote: > If you are going to keep the unreferenced objects around > forever, it might be better to keep them around in packed > form? I'm undecided about that. On the one hand this does create lots of small files and inevitably causes (some) performance degradation. On the other hand, I don't want to keep useless objects in the pack, because that would also cause performance degradation for people cloning the "mother repo." If my assumptions on any of that are incorrect, I'm happy to learn more. Best, -- Konstantin Ryabitsev Director, IT Infrastructure Security The Linux Foundation ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 19:01 ` Konstantin Ryabitsev @ 2018-05-16 19:03 ` Martin Fick 2018-05-16 19:11 ` Konstantin Ryabitsev 2018-05-16 19:23 ` Jeff King 1 sibling, 1 reply; 34+ messages in thread From: Martin Fick @ 2018-05-16 19:03 UTC (permalink / raw) To: Konstantin Ryabitsev Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen On Wednesday, May 16, 2018 03:01:13 PM Konstantin Ryabitsev wrote: > On 05/16/18 14:26, Martin Fick wrote: > > If you are going to keep the unreferenced objects around > > forever, it might be better to keep them around in > > packed > > form? > > I'm undecided about that. On the one hand this does create > lots of small files and inevitably causes (some) > performance degradation. On the other hand, I don't want > to keep useless objects in the pack, because that would > also cause performance degradation for people cloning the > "mother repo." If my assumptions on any of that are > incorrect, I'm happy to learn more. My suggestion is to use science, not logic or hearsay. :) i.e. test it! -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 19:03 ` Martin Fick @ 2018-05-16 19:11 ` Konstantin Ryabitsev 2018-05-16 19:18 ` Martin Fick 0 siblings, 1 reply; 34+ messages in thread From: Konstantin Ryabitsev @ 2018-05-16 19:11 UTC (permalink / raw) To: Martin Fick Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen On 05/16/18 15:03, Martin Fick wrote: >> I'm undecided about that. On the one hand this does create >> lots of small files and inevitably causes (some) >> performance degradation. On the other hand, I don't want >> to keep useless objects in the pack, because that would >> also cause performance degradation for people cloning the >> "mother repo." If my assumptions on any of that are >> incorrect, I'm happy to learn more. > My suggestion is to use science, not logic or hearsay. :) > i.e. test it! I think the answer will be "it depends." In many of our cases the repos that need those loose objects are rarely accessed -- usually because they are forks with older data (hence why they need objects that are no longer used by the mother repo). Therefore, performance impacts of occasionally touching a handful of loose objects will be fairly negligible. This is especially true on non-spinning media where seek times are low anyway. Having slimmer packs for the mother repo would be more beneficial in this case. On the other hand, if the "child repo" is frequently used, then the impact of needing a bunch of loose objects would be greater. For the sake of simplicity, I think I'll leave things as they are -- it's cheaper to fix this via reducing seek times than by applying complicated logic trying to optimize on a per-repo basis.
Best, -- Konstantin Ryabitsev Director, IT Infrastructure Security The Linux Foundation ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 19:11 ` Konstantin Ryabitsev @ 2018-05-16 19:18 ` Martin Fick 0 siblings, 0 replies; 34+ messages in thread From: Martin Fick @ 2018-05-16 19:18 UTC (permalink / raw) To: Konstantin Ryabitsev Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Jeff King, Duy Nguyen On Wednesday, May 16, 2018 03:11:47 PM Konstantin Ryabitsev wrote: > On 05/16/18 15:03, Martin Fick wrote: > >> I'm undecided about that. On the one hand this does > >> create lots of small files and inevitably causes > >> (some) performance degradation. On the other hand, I > >> don't want to keep useless objects in the pack, > >> because that would also cause performance degradation > >> for people cloning the "mother repo." If my > >> assumptions on any of that are incorrect, I'm happy to > >> learn more. > > > > My suggestion is to use science, not logic or hearsay. > > :) > > i.e. test it! > > I think the answer will be "it depends." In many of our > cases the repos that need those loose objects are rarely > accessed -- usually because they are forks with older > data (hence why they need objects that are no longer used > by the mother repo). Therefore, performance impacts of > occasionally touching a handful of loose objects will be > fairly negligible. This is especially true on > non-spinning media where seek times are low anyway. > Having slimmer packs for the mother repo would be more > beneficial in this case. > > On the other hand, if the "child repo" is frequently used, > then the impact of needing a bunch of loose objects would > be greater. For the sake of simplicity, I think I'll > leave things as they are -- it's cheaper to fix this via > reducing seek times than by applying complicated logic > trying to optimize on a per-repo basis. I think a major performance issue with loose objects is not just the seek time, but also the fact that they are not delta compressed. 
This means that sending them over the wire will likely incur a significant cost before they can be sent. Unlike the seek time, this cost is not mitigated across concurrent fetches by FS caching (or jgit caching, if you were to use it), -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 19:01 ` Konstantin Ryabitsev 2018-05-16 19:03 ` Martin Fick @ 2018-05-16 19:23 ` Jeff King 2018-05-16 19:29 ` Konstantin Ryabitsev 1 sibling, 1 reply; 34+ messages in thread From: Jeff King @ 2018-05-16 19:23 UTC (permalink / raw) To: Konstantin Ryabitsev Cc: Martin Fick, Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Duy Nguyen On Wed, May 16, 2018 at 03:01:13PM -0400, Konstantin Ryabitsev wrote: > On 05/16/18 14:26, Martin Fick wrote: > > If you are going to keep the unreferenced objects around > > forever, it might be better to keep them around in packed > > form? > > I'm undecided about that. On the one hand this does create lots of small > files and inevitably causes (some) performance degradation. On the other > hand, I don't want to keep useless objects in the pack, because that > would also cause performance degradation for people cloning the "mother > repo." If my assumptions on any of that are incorrect, I'm happy to > learn more. I implemented "repack -k", which keeps all objects and just rolls them into the new pack (along with any currently-loose unreachable objects). Aside from corner cases (e.g., where somebody accidentally added a 20GB file to an otherwise 100MB repo and then rolled it back), it usually doesn't significantly affect the repository size. And it generally should not cause performance problems for people cloning, since Git will create a custom pack for each client with only the reachable objects. There _is_ an interesting corner case where a reachable object might be a delta against an unreachable one, which can cause a clone to have to break that relationship and find a new delta. At GitHub we have some custom code that tries to avoid these kinds of delta dependencies (not just to unreachable objects, but to other forks that share object storage). You can see the patch at: https://github.com/peff/git (branch jk/delta-islands) -Peff ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 19:23 ` Jeff King @ 2018-05-16 19:29 ` Konstantin Ryabitsev 2018-05-16 19:37 ` Jeff King 0 siblings, 1 reply; 34+ messages in thread From: Konstantin Ryabitsev @ 2018-05-16 19:29 UTC (permalink / raw) To: Jeff King Cc: Martin Fick, Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Duy Nguyen On 05/16/18 15:23, Jeff King wrote: > I implemented "repack -k", which keeps all objects and just rolls them > into the new pack (along with any currently-loose unreachable objects). > Aside from corner cases (e.g., where somebody accidentally added a 20GB > file to an otherwise 100MB-repo and then rolled it back), it usually > doesn't significantly affect the repository size. Hmm... I should read manpages more often! :) So, do you suggest that this is a better approach: - mother repos: "git repack -adk" - child repos: "git repack -Adl" (followed by prune) Currently, we do "-Adl" regardless, but we already track whether a repo is being used for alternates anywhere (so we don't prune it) and can do different flags if that improves performance. Best, -- Konstantin Ryabitsev Director, IT Infrastructure Security The Linux Foundation ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 19:29 ` Konstantin Ryabitsev @ 2018-05-16 19:37 ` Jeff King 2018-05-16 19:40 ` Martin Fick 2018-05-16 20:02 ` Konstantin Ryabitsev 0 siblings, 2 replies; 34+ messages in thread From: Jeff King @ 2018-05-16 19:37 UTC (permalink / raw) To: Konstantin Ryabitsev Cc: Martin Fick, Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Duy Nguyen On Wed, May 16, 2018 at 03:29:42PM -0400, Konstantin Ryabitsev wrote: > On 05/16/18 15:23, Jeff King wrote: > > I implemented "repack -k", which keeps all objects and just rolls them > > into the new pack (along with any currently-loose unreachable objects). > > Aside from corner cases (e.g., where somebody accidentally added a 20GB > > file to an otherwise 100MB-repo and then rolled it back), it usually > > doesn't significantly affect the repository size. > > Hmm... I should read manpages more often! :) > > So, do you suggest that this is a better approach: > > - mother repos: "git repack -adk" > - child repos: "git repack -Adl" (followed by prune) Yes, that's pretty close to what we do at GitHub. Before doing any repacking in the mother repo, we actually do the equivalent of: git fetch --prune ../$id.git +refs/*:refs/remotes/$id/* git repack -Adl from each child to pick up any new objects to de-duplicate (our "mother" repos are not real repos at all, but just big shared-object stores). I say "equivalent" because those commands can actually be a bit slow. So we do some hacky tricks like directly moving objects in the filesystem. In theory the fetch means that it's safe to actually prune in the mother repo, but in practice there are still races. They don't come up often, but if you have enough repositories, they do eventually. :) -Peff ^ permalink raw reply [flat|nested] 34+ messages in thread
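The two commands quoted above could be wrapped in a loop along these lines. This is a rough sketch, not GitHub's tooling: the layout (forks stored as `<id>.git` under one directory), the function name `sync_forks_into_store`, and the `GIT` override are assumptions made for illustration. It also assumes one particular reading of the message, namely that the fetch runs in the shared store and `repack -Adl` runs in each fork.

```shell
#!/bin/sh
# Sketch only: the directory layout, the function name, and the GIT override
# are assumptions. The git invocations are the two quoted in the message.
GIT=${GIT:-git}

sync_forks_into_store() {
    store=$1        # shared object store (the "mother" repo)
    forks_dir=$2    # directory holding <id>.git fork repositories
    for fork in "$forks_dir"/*.git; do
        [ -d "$fork" ] || continue
        id=$(basename "$fork" .git)
        # Copy every ref of the fork into its own hierarchy in the store,
        # which also brings over the objects those refs reach.
        "$GIT" -C "$store" fetch --prune "$fork" "+refs/*:refs/remotes/$id/*"
        # Assumed reading: repack the fork so that it drops objects it can
        # now borrow from the store through its alternates file (-l).
        "$GIT" -C "$fork" repack -Adl
    done
}
```

Setting `GIT` to a command that only prints its arguments gives a dry run, which is a cheap way to check the plan before touching real repositories.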
* Re: worktrees vs. alternates 2018-05-16 19:37 ` Jeff King @ 2018-05-16 19:40 ` Martin Fick 2018-05-16 20:06 ` Jeff King 2018-05-16 20:02 ` Konstantin Ryabitsev 1 sibling, 1 reply; 34+ messages in thread From: Martin Fick @ 2018-05-16 19:40 UTC (permalink / raw) To: Jeff King Cc: Konstantin Ryabitsev, Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Duy Nguyen On Wednesday, May 16, 2018 12:37:45 PM Jeff King wrote: > On Wed, May 16, 2018 at 03:29:42PM -0400, Konstantin Ryabitsev wrote: > Yes, that's pretty close to what we do at GitHub. Before > doing any repacking in the mother repo, we actually do > the equivalent of: > > git fetch --prune ../$id.git +refs/*:refs/remotes/$id/* > git repack -Adl > > from each child to pick up any new objects to de-duplicate > (our "mother" repos are not real repos at all, but just > big shared-object stores). ... > In theory the fetch means that it's safe to actually prune > in the mother repo, but in practice there are still > races. They don't come up often, but if you have enough > repositories, they do eventually. :) Peff, I would be very curious to hear what you think of this approach to mitigating the effect of those races? https://git.eclipse.org/r/c/122288/2 -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 19:40 ` Martin Fick @ 2018-05-16 20:06 ` Jeff King 2018-05-16 20:43 ` Martin Fick 0 siblings, 1 reply; 34+ messages in thread From: Jeff King @ 2018-05-16 20:06 UTC (permalink / raw) To: Martin Fick Cc: Konstantin Ryabitsev, Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Duy Nguyen On Wed, May 16, 2018 at 01:40:56PM -0600, Martin Fick wrote: > > In theory the fetch means that it's safe to actually prune > > in the mother repo, but in practice there are still > > races. They don't come up often, but if you have enough > > repositories, they do eventually. :) > > Peff, > > I would be very curious to hear what you think of this > approach to mitigating the effect of those races? > > https://git.eclipse.org/r/c/122288/2 The crux of the problem is that we have no way to atomically mark an object as "I am using this -- do not delete" with respect to the actual deletion. So if I'm reading your approach correctly, you put objects into a purgatory rather than delete them, and let some operations rescue them from purgatory if we had a race. That's certainly a direction we've considered, but I think there are some open questions, like: 1. When do you rescue from purgatory? Any time the object is referenced? Do you then pull in all of its reachable objects too? 2. How do you decide when to drop an object from purgatory? And specifically, how do you avoid racing with somebody using the object as you're pruning purgatory? 3. How do you know that an operation has been run that will actually rescue the object, as opposed to silently having a corrupted state on disk? E.g., imagine this sequence: a. git-prune computes reachability and finds that commit X is ready to be pruned b. another process sees that commit X exists and builds a commit that references it as a parent c. git-prune drops the object into purgatory Now we have a corrupt state created by the process in (b), since we have a reachable object in purgatory. 
But what if nobody goes back and tries to read those commits in the meantime? I think this might be solvable by using the purgatory as a kind of "lock", where prune does something like: 1. compute reachability 2. move candidate objects into purgatory; nobody can look into purgatory except us 3. compute reachability _again_, making sure that no purgatory objects are used (if so, rollback the deletion and try again) But even that's not quite there, because you need to have some consistent atomic view of what's "used". Just checking refs isn't enough, because some other process may be planning to reference a purgatory object but not yet have updated the ref. So you need some atomic way of saying "I am interested in using this object". -Peff ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 20:06 ` Jeff King @ 2018-05-16 20:43 ` Martin Fick 0 siblings, 0 replies; 34+ messages in thread From: Martin Fick @ 2018-05-16 20:43 UTC (permalink / raw) To: Jeff King Cc: Konstantin Ryabitsev, Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Duy Nguyen On Wednesday, May 16, 2018 01:06:59 PM Jeff King wrote: > On Wed, May 16, 2018 at 01:40:56PM -0600, Martin Fick wrote: > > > In theory the fetch means that it's safe to actually > > > prune in the mother repo, but in practice there are > > > still races. They don't come up often, but if you > > > have enough repositories, they do eventually. :) > > > > Peff, > > > > I would be very curious to hear what you think of this > > approach to mitigating the effect of those races? > > > > https://git.eclipse.org/r/c/122288/2 > > The crux of the problem is that we have no way to > atomically mark an object as "I am using this -- do not > delete" with respect to the actual deletion. > > So if I'm reading your approach correctly, you put objects > into a purgatory rather than delete them, and let some > operations rescue them from purgatory if we had a race. Yes. This has the cost of extra disk space for a while, but I realized that we are already incurring that cost, because for our repos we already put things into purgatory to avoid getting stale NFS file handle errors during unrecoverable paths (while streaming an object). So effectively this has no extra space cost beyond what is needed to run safely on NFS. > 1. When do you rescue from purgatory? Any time the > object is referenced? Do you then pull in all of its > reachable objects too? For my approach, I decided: a) Yes, b) No. Because: a) Rescue on reference is cheap and allows any other policy to be built upon it; just ensure that the policy references the object at some point before it is pruned from the purgatory.
b) The other referenced objects will likely get pulled in on reference anyway or by virtue of being in the same pack. > 2. How do you decide when to drop an object from > purgatory? And specifically, how do you avoid racing with > somebody using the object as you're pruning purgatory? If you clean the purgatory during repacking after creating all the new packs and before deleting the old ones, you will have a significant grace window to handle most longer-running operations. In this way, repacking will have re-referenced any missing objects from the purgatory before it gets pruned, causing them to be recovered if necessary. Those missing objects, believed to be in the exact packs in the purgatory at that time, should only ever have been referenced by write operations that started before those packs were moved to the purgatory, which was before the previous repacking round ended. This leaves write operations a full repacking cycle in which to complete to avoid losing objects. > 3. How do you know that an operation has been run that > will actually rescue the object, as opposed to silently > having a corrupted state on disk? > > E.g., imagine this sequence: > > a. git-prune computes reachability and finds that > commit X is ready to be pruned > > b. another process sees that commit X exists and > builds a commit that references it as a parent > > c. git-prune drops the object into purgatory > > Now we have a corrupt state created by the process in > (b), since we have a reachable object in purgatory. But > what if nobody goes back and tries to read those commits > in the meantime? See the answer to #2: repacking itself should rescue any objects that need to be rescued before pruning the purgatory. > I think this might be solvable by using the purgatory as a > kind of "lock", where prune does something like: > > 1. compute reachability > > 2. move candidate objects into purgatory; nobody can > look into purgatory except us I don't think this is needed.
It should be OK to let others see the objects in the purgatory after 1 and before 3 as long as "seeing" them causes them to be recovered! > 3. compute reachability _again_, making sure that no > purgatory objects are used (if so, rollback the deletion > and try again) Yes, you laid out the formula, but nothing says this recompute can't wait until the next repack (again see my answer to #2)! I.e., there is no rush to cause a recovery as long as the object gets recovered before it gets pruned from the purgatory. > But even that's not quite there, because you need to have > some consistent atomic view of what's "used". Just > checking refs isn't enough, because some other process > may be planning to reference a purgatory object but not > yet have updated the ref. So you need some atomic way of > saying "I am interested in using this object". As long as all write paths also read the object first (I assume they do, or we would be in big trouble already), then this should not be an issue. The idea is to force all reads (and thus all writes also) to recover the object, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 34+ messages in thread
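The "rescue on reference" policy described in this message can be modelled with plain files, as a toy sketch. Nothing below reflects how git or jgit actually store objects; the directory names and the functions `send_to_purgatory`, `read_object`, and `empty_purgatory` are all hypothetical.

```shell
#!/bin/sh
# Toy model of the purgatory idea: objects are ordinary files under
# $store/objects, and pruning moves them to $store/purgatory instead of
# deleting them. All names here are made up for illustration.

# Move a prune candidate aside instead of deleting it.
send_to_purgatory() {
    store=$1; name=$2
    mkdir -p "$store/purgatory"
    mv "$store/objects/$name" "$store/purgatory/$name"
}

# Every read goes through here: a hit in purgatory rescues the object first.
read_object() {
    store=$1; name=$2
    if [ ! -e "$store/objects/$name" ] && [ -e "$store/purgatory/$name" ]; then
        mv "$store/purgatory/$name" "$store/objects/$name"
    fi
    cat "$store/objects/$name"
}

# Run at the end of a later repack cycle: whatever nobody rescued is dropped.
empty_purgatory() {
    store=$1
    rm -rf "$store/purgatory"
}
```

The check-then-move in `read_object` is not atomic, so this toy has the same kind of window between "decide" and "act" that the preceding message worries about; it only shows the bookkeeping, not a solution to the race.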
* Re: worktrees vs. alternates 2018-05-16 19:37 ` Jeff King 2018-05-16 19:40 ` Martin Fick @ 2018-05-16 20:02 ` Konstantin Ryabitsev 2018-05-16 20:17 ` Jeff King 2018-05-17 0:43 ` Sitaram Chamarty 1 sibling, 2 replies; 34+ messages in thread From: Konstantin Ryabitsev @ 2018-05-16 20:02 UTC (permalink / raw) To: Jeff King Cc: Martin Fick, Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Duy Nguyen On 05/16/18 15:37, Jeff King wrote: > Yes, that's pretty close to what we do at GitHub. Before doing any > repacking in the mother repo, we actually do the equivalent of: > > git fetch --prune ../$id.git +refs/*:refs/remotes/$id/* > git repack -Adl > > from each child to pick up any new objects to de-duplicate (our "mother" > repos are not real repos at all, but just big shared-object stores). Yes, I keep thinking of doing the same, too -- instead of using torvalds/linux.git for alternates, have an internal repo where objects from all forks are stored. This conversation may finally give me the shove I've been needing to poke at this. :) Is your delta-islands patch heading into upstream, or is that something that's going to remain external? > I say "equivalent" because those commands can actually be a bit slow. So > we do some hacky tricks like directly moving objects in the filesystem. > > In theory the fetch means that it's safe to actually prune in the mother > repo, but in practice there are still races. They don't come up often, > but if you have enough repositories, they do eventually. :) I feel like a whitepaper on "how we deal with bajillions of forks at GitHub" would be nice. :) I was previously told that it's unlikely such a paper could be written due to so many custom-built things at GH, but I would be very happy if that turned out not to be the case.
Best, -- Konstantin Ryabitsev Director, IT Infrastructure Security The Linux Foundation ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 20:02 ` Konstantin Ryabitsev @ 2018-05-16 20:17 ` Jeff King 2018-05-17 0:43 ` Sitaram Chamarty 1 sibling, 0 replies; 34+ messages in thread From: Jeff King @ 2018-05-16 20:17 UTC (permalink / raw) To: Konstantin Ryabitsev Cc: Martin Fick, Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Duy Nguyen On Wed, May 16, 2018 at 04:02:53PM -0400, Konstantin Ryabitsev wrote: > On 05/16/18 15:37, Jeff King wrote: > > Yes, that's pretty close to what we do at GitHub. Before doing any > > repacking in the mother repo, we actually do the equivalent of: > > > > git fetch --prune ../$id.git +refs/*:refs/remotes/$id/* > > git repack -Adl > > > > from each child to pick up any new objects to de-duplicate (our "mother" > > repos are not real repos at all, but just big shared-object stores). > > Yes, I keep thinking of doing the same, too -- instead of using > torvalds/linux.git for alternates, have an internal repo where objects > from all forks are stored. This conversation may finally give me the > shove I've been needing to poke at this. :) > > Is your delta-islands patch heading into upstream, or is that something > that's going to remain external? I have vague plans to submit it upstream, but I'm still not convinced it's quite optimal. The resulting packs tend to be a fair bit larger than they could be when packed by themselves, because we miss many delta opportunities (and it's important to "repack -f --window=250" once in a while, since we're throwing away so many delta candidates). There's an alternative way of doing it, too, which I think git.or.cz uses: it "layers" forks in a hierarchy. So if I fork torvalds/linux.git, then I get my own repo that uses torvalds/linux as an alternate. And if somebody forks my repo, then I'm their alternate, and they recursively depend on torvalds/linux. So each fork basically layers a slice of its own pack on top of the parent. 
This is all from recollections of past discussions (which were sadly not on the list -- I don't know if they've written up their scheme anywhere public), so I may have some details wrong. But I think that their repacking is done hierarchically, too: any objects which the root fork might drop get migrated up to the children instead, and so forth, until the leaf nodes can actually throw away objects. The big problem with this is that Git tends to behave better when objects are in the same pack: 1. We don't bother looking for new deltas within the same pack, whereas a clone of a fork may actually try to find new deltas between the layers. 2. Reachability bitmaps can't cross pack boundaries (due to the way they're implemented, but also the current on-disk format). So you can only bitmap the root repo, not any of the other layers. > I feel like a whitepaper on "how we deal with bajillions of forks at > GitHub" would be nice. :) I was previously told that it's unlikely such > paper could be written due to so many custom-built things at GH, but I > would be very happy if that turned out not to be the case. We have a few engineering blog posts on the subject, like: https://githubengineering.com/counting-objects/ https://githubengineering.com/introducing-dgit/ https://githubengineering.com/building-resilience-in-spokes/ but we haven't done a very good job of keeping that up. I think a summary whitepaper would be interesting. Maybe one day...:) -Peff ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-16 20:02 ` Konstantin Ryabitsev 2018-05-16 20:17 ` Jeff King @ 2018-05-17 0:43 ` Sitaram Chamarty 2018-05-17 3:31 ` Jeff King 1 sibling, 1 reply; 34+ messages in thread From: Sitaram Chamarty @ 2018-05-17 0:43 UTC (permalink / raw) To: Konstantin Ryabitsev Cc: Jeff King, Martin Fick, Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Duy Nguyen On Wed, May 16, 2018 at 04:02:53PM -0400, Konstantin Ryabitsev wrote: > On 05/16/18 15:37, Jeff King wrote: > > Yes, that's pretty close to what we do at GitHub. Before doing any > > repacking in the mother repo, we actually do the equivalent of: > > > > git fetch --prune ../$id.git +refs/*:refs/remotes/$id/* > > git repack -Adl > > > > from each child to pick up any new objects to de-duplicate (our "mother" > > repos are not real repos at all, but just big shared-object stores). > > Yes, I keep thinking of doing the same, too -- instead of using > torvalds/linux.git for alternates, have an internal repo where objects > from all forks are stored. This conversation may finally give me the > shove I've been needing to poke at this. :) I may have missed a few of the earlier messages, but in the last 20 or so in this thread, I did not see namespaces mentioned by anyone. (I.e., apologies if it was addressed and discarded earlier!) I was under the impression that, as long as "read" access need not be controlled (Konstantin's situation, at least, and maybe Peff's too, for public repos), namespaces are a good way to create and manage that "mother repo". Is that not true anymore? Mind, I have not actually used them in anger anywhere, so I could be missing some really big point here. sitaram ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates 2018-05-17 0:43 ` Sitaram Chamarty @ 2018-05-17 3:31 ` Jeff King 2018-05-19 5:45 ` Duy Nguyen 0 siblings, 1 reply; 34+ messages in thread From: Jeff King @ 2018-05-17 3:31 UTC (permalink / raw) To: Sitaram Chamarty Cc: Konstantin Ryabitsev, Martin Fick, Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git, Duy Nguyen On Thu, May 17, 2018 at 06:13:55AM +0530, Sitaram Chamarty wrote: > I may have missed a few of the earlier messages, but in the last > 20 or so in this thread, I did not see namespaces mentioned by > anyone. (I.e., apologies if it was addressed and discarded > earlier!) > > I was under the impression that, as long as "read" access need > not be controlled (Konstantin's situation, at least, and maybe > Peff's too, for public repos), namespaces are a good way to > create and manage that "mother repo". > > Is that not true anymore? Mind, I have not actually used them > in anger anywhere, so I could be missing some really big point > here. The biggest problem with namespaces as they are currently implemented is that they do not apply universally to all commands. If you only access the repo via push/fetch, they may be fine. But as soon as you start doing other operations (e.g., showing the history of a branch in a web interface), you don't get to use the namespaced names anymore. I think a different implementation of namespaces could do this better. E.g., by controlling the view of the refs at the refs.c layer (or perhaps as a filtering backend). -Peff ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: worktrees vs. alternates
  2018-05-17 3:31 ` Jeff King
@ 2018-05-19 5:45 ` Duy Nguyen
  0 siblings, 0 replies; 34+ messages in thread
From: Duy Nguyen @ 2018-05-19 5:45 UTC
To: Jeff King
Cc: Sitaram Chamarty, Konstantin Ryabitsev, Martin Fick, Ævar Arnfjörð Bjarmason, Derrick Stolee, Lars Schneider, git

On Thu, May 17, 2018 at 5:31 AM, Jeff King <peff@peff.net> wrote:
> On Thu, May 17, 2018 at 06:13:55AM +0530, Sitaram Chamarty wrote:
>
>> I may have missed a few of the earlier messages, but in the last
>> 20 or so in this thread, I did not see namespaces mentioned by
>> anyone. (I.e., apologies if it was addressed and discarded
>> earlier!)
>>
>> I was under the impression that, as long as "read" access need
>> not be controlled (Konstantin's situation, at least, and maybe
>> Peff's too, for public repos), namespaces are a good way to
>> create and manage that "mother repo".
>>
>> Is that not true anymore? Mind, I have not actually used them
>> in anger anywhere, so I could be missing some really big point
>> here.
>
> The biggest problem with namespaces as they are currently implemented is
> that they do not apply universally to all commands. If you only access
> the repo via push/fetch, they may be fine. But as soon as you start
> doing other operations (e.g., showing the history of a branch in a web
> interface), you don't get to use the namespaced names anymore.
>
> I think a different implementation of namespaces could do this better.
> E.g., by controlling the view of the refs at the refs.c layer (or
> perhaps as a filtering backend).

Yeah. Namespaces (that work for all commands) + worktree was my plan
for centralizing repos (for one user). But I never got that far to
look into making ref namespaces work for everything.
-- 
Duy
* Re: worktrees vs. alternates
  2018-05-16 14:58 ` Konstantin Ryabitsev
  2018-05-16 15:34 ` Ævar Arnfjörð Bjarmason
  2018-05-16 17:14 ` Martin Fick
@ 2018-05-16 19:14 ` Jeff King
  2018-05-16 21:18 ` Stefan Beller
  2 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2018-05-16 19:14 UTC
To: Konstantin Ryabitsev
Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason, Lars Schneider, git, Duy Nguyen

On Wed, May 16, 2018 at 10:58:19AM -0400, Konstantin Ryabitsev wrote:

> The parent repo is not keeping track of any other repositories that may
> be using it for alternates, which is why you basically:
>
> 1. never run auto-gc in the parent repo
> 2. repack it manually using -Ad to keep loose objects that other repos
>    may be borrowing (but we don't know if they are)
> 3. never prune the parent repo, because this may delete objects other
>    repos are borrowing
>
> Very infrequently you may consider this extra set of maintenance steps:
>
> 1. Find every repo mentioning the parent repository in their alternates
> 2. Repack them without the -l switch (which copies all the borrowed
>    objects into those repos)
> 3. Once all child repos have been repacked this way, prune the parent
>    repo (it's safe now)
> 4. Repack child repos again, this time with the -l flag, to get your
>    savings back.

You can also do periodic maintenance like:

  1. Copy each ref in the forked repositories into the parent repository
     (e.g., giving each child that borrows from the parent its own
     hierarchy in refs/remotes/<child>/*).

  2. Repack the parent as normal. It will retain any objects referenced
     by the children (because they are now referenced by it).

But note that:

  1. It's not atomic with respect to updates in the child repos (but
     then, neither is the single-repo case!).

  2. It doesn't know about reflogs or the index in the child
     repositories.

This is more or less how we use alternates at GitHub.

> I would heartily love a way to teach git-repack to recognize when an
> object it's borrowing from the parent repo is in danger of being pruned.
> The cheapest way of doing this would probably be to hardlink loose
> objects into its own objects directory and only consider "safe" objects
> those that are part of the parent repository's pack. This should make
> alternates a lot safer, just in case git-prune happens to run by accident.

If you set:

  git config core.repositoryformatversion 1
  git config extensions.preciousObjects true

in the parent, git-prune (repack -d) will refuse to run. That doesn't
solve the problem of how to repack, but it can help prevent accidental
misuse.

-Peff
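As a rough illustration of the two settings above, the following shell sketch marks a throwaway bare repository as precious and then checks whether git prune agrees to run. The temporary directory and the name parent.git are hypothetical; only the two git config settings come from the message, and the sketch does not exercise repack.

```shell
#!/bin/sh
# Sketch only: a scratch "parent" object store in a temporary directory.
# The path and repository name are made up for this illustration.
workdir=$(mktemp -d)
parent="$workdir/parent.git"
git init --quiet --bare "$parent"

# The two settings from the message, applied to the parent repository.
git -C "$parent" config core.repositoryformatversion 1
git -C "$parent" config extensions.preciousObjects true

# With preciousObjects set, git prune is expected to exit non-zero
# instead of deleting anything.
if git -C "$parent" prune 2>/dev/null; then
    prune_result=allowed
else
    prune_result=refused
fi
echo "git prune in parent: $prune_result"
```

This only guards against accidental pruning; as the message says, it does not answer how the parent should be repacked.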
* Re: worktrees vs. alternates
  2018-05-16 19:14 ` Jeff King
@ 2018-05-16 21:18 ` Stefan Beller
  2018-05-16 23:45 ` Jeff King
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Beller @ 2018-05-16 21:18 UTC
To: Jeff King
Cc: Konstantin Ryabitsev, Derrick Stolee, Ævar Arnfjörð Bjarmason, Lars Schneider, git, Duy Nguyen

> You can also do periodic maintenance like:
>
>  1. Copy each ref in the forked repositories into the parent repository
>     (e.g., giving each child that borrows from the parent its own
>     hierarchy in refs/remotes/<child>/*).

Can you just copy? I assume the mother repo doesn't know about
all objects, hence by copying the ref, we have a "spotty" history.

And to improve copying could permanent symlinking be used instead?
* Re: worktrees vs. alternates
  2018-05-16 21:18 ` Stefan Beller
@ 2018-05-16 23:45 ` Jeff King
  0 siblings, 0 replies; 34+ messages in thread
From: Jeff King @ 2018-05-16 23:45 UTC
To: Stefan Beller
Cc: Konstantin Ryabitsev, Derrick Stolee, Ævar Arnfjörð Bjarmason, Lars Schneider, git, Duy Nguyen

On Wed, May 16, 2018 at 02:18:20PM -0700, Stefan Beller wrote:

> > You can also do periodic maintenance like:
> >
> >  1. Copy each ref in the forked repositories into the parent repository
> >     (e.g., giving each child that borrows from the parent its own
> >     hierarchy in refs/remotes/<child>/*).
>
> Can you just copy? I assume the mother repo doesn't know about
> all objects, hence by copying the ref, we have a "spotty" history.
>
> And to improve copying could permanent symlinking be used instead?

Sorry, by copying, I meant "fetching". I.e., migrating objects and refs.

-Peff
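The "copying means fetching" step can be sketched as below, reusing the fetch refspec and repack flags quoted earlier in the thread. The scratch layout, the child id child1, and the commit identity are hypothetical, and this is a toy setup under those assumptions rather than a description of any production maintenance job.

```shell
#!/bin/sh
# Sketch only: one parent object store and one child that borrows from it.
# All paths, the child id, and the commit identity are made up.
workdir=$(mktemp -d)
parent="$workdir/parent.git"
child="$workdir/child"
id=child1

git init --quiet --bare "$parent"
git init --quiet "$child"

# Let the child borrow objects from the parent via alternates.
echo "$parent/objects" > "$child/.git/objects/info/alternates"

# Create a commit that initially exists only in the child.
echo hello > "$child/file.txt"
git -C "$child" add file.txt
git -C "$child" -c user.name=Example -c user.email=example@example.invalid \
    commit --quiet -m "child-only commit"
child_tip=$(git -C "$child" rev-parse HEAD)
branch=$(git -C "$child" symbolic-ref --short HEAD)

# Fetch the child's refs and objects into the parent, under the child's
# own hierarchy in refs/remotes/<child>/*.
git -C "$parent" fetch --quiet --prune "$child" "+refs/*:refs/remotes/$id/*"

# Repack the parent; the fetched refs keep the child's objects reachable.
git -C "$parent" repack -A -d -l --quiet

parent_tip=$(git -C "$parent" rev-parse "refs/remotes/$id/heads/$branch")
echo "child tip:  $child_tip"
echo "parent ref: $parent_tip"
```

Because the fetch transfers objects along with refs, the parent ends up with a complete history for each copied ref, which addresses the "spotty" history concern. The caveats from the earlier message still apply: reflogs and the index in the child are not covered.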