* Suggestion: "verify/repair" option for 'git gc' @ 2021-10-13 15:47 Alexandr Miloslavskiy 2021-10-14 1:19 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 5+ messages in thread From: Alexandr Miloslavskiy @ 2021-10-13 15:47 UTC (permalink / raw) To: git Suggestion ---------- 1) It would be nice if 'git gc' had an option to also verify (like 'git fsck') the repo and report corruption. I think that it's a good idea to have it in 'gc' for performance reasons, because 'git gc' already reads things. 2) It would be nice if git could automatically download blobs from remote if local blob is corrupted. Maybe it was already implemented, see story 3 below. Motivation ---------- -- Story 1 -- Just a few days ago I encountered another secretly broken repo which caused some small bugs in the git UI I'm using. The repo worked mostly fine, that's why I had no idea that it's corrupted. My git UI invokes 'git gc' sometimes and if that detected the corruption, I wouldn't have to spend time hunting the bug in UI. Specifically, it reports these errors on `git fsck` error: object 0189425cc210555c36383293c468df5da73acc48 is a commit, not a blob error in tree 1d571d7354f99b726bbcc0cb232b3f47846c71a1: broken links error: object 0189425cc210555c36383293c468df5da73acc48 is a commit, not a blob error in tree 2808b286c2a933e88735d97416e29b9514fc6af2: broken links error: object 0189425cc210555c36383293c468df5da73acc48 is a commit, not a blob error in tree 604f6f6c4fbf8da7a593708e863e68f8c5a27d07: broken links error: object 0189425cc210555c36383293c468df5da73acc48 is a commit, not a blob error in tree 6a2c4a5ef0b0ee7aa85d88c3147b7558a6a7c29f: broken links The repo is not confidential and I could share it if needed. I "solved" the problem by cloning a new copy. -- Story 2 -- A few years ago, I had another repo that wasn't used for a couple years and had corrupted blobs. The repo looked fine until I tried to clone from it. Unfortunately it was the only copy and I had to write some code to "guess" the blob's contents to repair the repo. If 'git gc' detected corruption, I would have known about the problem earlier, when I still had other copies around. -- Story 3 -- Also a few years ago, I had a repo with a single corrupted blob. I don't remember why, but simply re-cloning it was a headache. I managed to fix repo by issuing a command to re-download a blob from remote. Git could totally do that itself, I think. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Suggestion: "verify/repair" option for 'git gc' 2021-10-13 15:47 Suggestion: "verify/repair" option for 'git gc' Alexandr Miloslavskiy @ 2021-10-14 1:19 ` Ævar Arnfjörð Bjarmason 2021-10-14 12:47 ` Alexandr Miloslavskiy 0 siblings, 1 reply; 5+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-10-14 1:19 UTC (permalink / raw) To: Alexandr Miloslavskiy; +Cc: git On Wed, Oct 13 2021, Alexandr Miloslavskiy wrote: > Suggestion > ---------- > 1) It would be nice if 'git gc' had an option to also verify > (like 'git fsck') the repo and report corruption. I think that it's > a good idea to have it in 'gc' for performance reasons, because > 'git gc' already reads things. > > 2) It would be nice if git could automatically download blobs from > remote if local blob is corrupted. Maybe it was already implemented, > see story 3 below. > > Motivation > ---------- > > -- Story 1 -- > Just a few days ago I encountered another secretly broken repo which > caused some small bugs in the git UI I'm using. The repo worked mostly > fine, that's why I had no idea that it's corrupted. > > My git UI invokes 'git gc' sometimes and if that detected the > corruption, I wouldn't have to spend time hunting the bug in UI. > > Specifically, it reports these errors on `git fsck` > error: object 0189425cc210555c36383293c468df5da73acc48 is a commit, > not a blob > error in tree 1d571d7354f99b726bbcc0cb232b3f47846c71a1: broken links > error: object 0189425cc210555c36383293c468df5da73acc48 is a commit, > not a blob > error in tree 2808b286c2a933e88735d97416e29b9514fc6af2: broken links > error: object 0189425cc210555c36383293c468df5da73acc48 is a commit, > not a blob > error in tree 604f6f6c4fbf8da7a593708e863e68f8c5a27d07: broken links > error: object 0189425cc210555c36383293c468df5da73acc48 is a commit, > not a blob > error in tree 6a2c4a5ef0b0ee7aa85d88c3147b7558a6a7c29f: broken links > > The repo is not confidential and I could share it if needed. > I "solved" the problem by cloning a new copy. I'd be interested in a copy of it, I've been slowly trying to improve these sorts of corruption cases. > -- Story 2 -- > A few years ago, I had another repo that wasn't used for a couple years > and had corrupted blobs. The repo looked fine until I tried to clone > from it. Unfortunately it was the only copy and I had to write some > code to "guess" the blob's contents to repair the repo. > > If 'git gc' detected corruption, I would have known about the problem > earlier, > when I still had other copies around. I wonder if this and other issues you encountered wouldn't need a full "fsck", but merely gc triggering a complete repack. Which is not to say that some regular background "fsck" wouldn't be a good idea... > -- Story 3 -- > Also a few years ago, I had a repo with a single corrupted blob. I don't > remember why, but simply re-cloning it was a headache. I managed to fix repo > by issuing a command to re-download a blob from remote. Git could totally do > that itself, I think. Yes, we still definitely have cases where dealing with this sort of thing can be very painful. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Suggestion: "verify/repair" option for 'git gc' 2021-10-14 1:19 ` Ævar Arnfjörð Bjarmason @ 2021-10-14 12:47 ` Alexandr Miloslavskiy 2021-10-14 15:17 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 5+ messages in thread From: Alexandr Miloslavskiy @ 2021-10-14 12:47 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: git On 14.10.2021 4:19, Ævar Arnfjörð Bjarmason wrote: > I'd be interested in a copy of it, I've been slowly trying to improve > these sorts of corruption cases. Sent. > I wonder if this and other issues you encountered wouldn't need a full > "fsck", but merely gc triggering a complete repack. That sounds slow :( For example, it's going to be a lot of disk write bandwidth. Just doing the verification along with regular gc sounds faster. > Yes, we still definitely have cases where dealing with this sort of > thing can be very painful. With the new remote promisor code, I think that auto-fixing corrupted blobs is easy enough (provided they can be found on any remote) ? ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Suggestion: "verify/repair" option for 'git gc' 2021-10-14 12:47 ` Alexandr Miloslavskiy @ 2021-10-14 15:17 ` Ævar Arnfjörð Bjarmason 2021-10-14 20:23 ` Alexandr Miloslavskiy 0 siblings, 1 reply; 5+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-10-14 15:17 UTC (permalink / raw) To: Alexandr Miloslavskiy; +Cc: git On Thu, Oct 14 2021, Alexandr Miloslavskiy wrote: > On 14.10.2021 4:19, Ævar Arnfjörð Bjarmason wrote: >> I'd be interested in a copy of it, I've been slowly trying to improve >> these sorts of corruption cases. > > Sent. Thanks, I can't promise I'll take a look at it in detail time soon, but I was going to loop back to checking out these corruption cases at some point. >> I wonder if this and other issues you encountered wouldn't need a full >> "fsck", but merely gc triggering a complete repack. > > That sounds slow :( For example, it's going to be a lot of disk write > bandwidth. Just doing the verification along with regular gc sounds > faster. Having looked at your repo the immedite issue is that you've got a tree in it that has a (manually crafted?) entry that points to a commit object, without the relevant mode being correct: $ git cat-file -p 1d571d7354f99b726bbcc0cb232b3f47846c71a1 100644 blob 0189425cc210555c36383293c468df5da73acc48 common.mak 040000 tree 6a2c4a5ef0b0ee7aa85d88c3147b7558a6a7c29f include Was this created with git itself, or some tool that's manually crafting trees? I.e. that the object on-disk has the exact expected content but is just bad in this particular way points to corruption in git or another tool writing the data, not e.g. FS corruption or a bit-flip. Anyway, getting back on track the "gc" command actually does do exactly what you're suggesting: $ git gc; echo $? error: object 0189425cc210555c36383293c468df5da73acc48 is a commit, not a blob fatal: entry 'common.mak' in tree 1d571d7354f99b726bbcc0cb232b3f47846c71a1 has blob mode, but is not a blob fatal: failed to run repack 128 The problem is that as a user you won't have seen that because we won't get that far without running into the gc.auto limit, then it would have run into that, and you'd have had the contents of gc.log spewed at you by other commands. So maybe we should be more aggressive there, e.g. as a function of repo size or whatever (this repo is 18MB). You really don't need "git fsck" to verify FS corruption or basic object graph issues like these, and I think it's rather unfortunate that we expose it like that. What it does over and beyond a full repack of the repo is to exhaustively verify object contents, which is most useful e.g. if you're running a git service and want to prevent users from pushing crafted corrupt objects, either intentionally or unintentionally. I've also been meaning to look at that aspect of it for a while, i.e. it should be able to have some --fast, it has --connectivity-only, but that one goes a bit too far, although in this case it would have helped you. >> Yes, we still definitely have cases where dealing with this sort of >> thing can be very painful. > > With the new remote promisor code, I think that auto-fixing corrupted > blobs is easy enough (provided they can be found on any remote) ? Hypothetically, but these blobs aren't corrupted, and no amount of fetching something from other places is going to fix a bad DAG. If that thing didn't really point at the wrong object type the hash would be different. The problem is that it was wrong when it was written. I say "hypothetically" because even in the case of a bitflip or whatever coercing git into some sort of auto-repair mode is pretty far off. I've been able to do it manually in some cases, but e.g. promisors not having a blob *and knowing that* is very different from the more general cases of object(s) XYZ being corrupt. We have well-intentioned features like the collision detection that actively gets in the way of some repairs like that (I had a patch[1] to disable it, which as a side-effect made recovering from some forms of corruption easier). But even without that you'll find that e.g. if a recent object is bad, and we'd like to fetch it from upstream, that we're just going to die pretty early, as none of the code involved in say incremental fetching is prepared to run across a bad/corrupt object. Those aren't inherent problems, and it would be very nice to have more such auto-repair in git, just limitations of the current implementation. 1. https://lore.kernel.org/git/20181028225023.26427-5-avarab@gmail.com/ ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Suggestion: "verify/repair" option for 'git gc' 2021-10-14 15:17 ` Ævar Arnfjörð Bjarmason @ 2021-10-14 20:23 ` Alexandr Miloslavskiy 0 siblings, 0 replies; 5+ messages in thread From: Alexandr Miloslavskiy @ 2021-10-14 20:23 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: git Thanks for such a detailed reply! On 14.10.2021 18:17, Ævar Arnfjörð Bjarmason wrote: > Was this created with git itself, or some tool that's manually > crafting trees? I have spent... a while... to figure what exactly happened with the repo. It was in fact my fault. Git could have done better to filter wrong user input, but it didn't and allowed my mistake to go into objects. So, here's what I did: 1) I had two repos with completely disjoint history, but matching (at some point) sources. With different paths, though. 2) I wanted to copy a set of patches from repo SRC to repo DST 3) I didn't expect that patch can apply on its own, so I decided to edit the generated patch before applying it. Yeah, I did it wrong in the end, but frankly moving patches between such different repos is something I never done before. 4) I fixed file paths in patch (didn't know about 'git apply -p') 5) I identified commit in repo DST (0189425c, this might ring a bell) which had the same sources as the base of my patches in repo SRC. 6) I patched all 'index sha..sha' to 'index 0189425c..sha' What I hoped to achieve is to convince git that patch was created on repo DST commit 0189425c, instead of where it was really created. From 'sha..sha' notation I thought that it's 'commit..commit', but it was in fact 'blob..blob', I didn't expect that. If you're interested, the patch can still be found in file '.git\rebase-apply\patch' of the repo I sent you, and I tried to apply it on top of 4e74e0897. 7) Patch applied partially, because the sources weren't as identical as I thought. I understand that it created trees in the process, which contained the hacked 0189425c as if it was a blob, while it was in fact a commit. Now I already know that I didn't need to change any sha's, git applies the patch just fine if the previous file content matches. I also know that I could just use '-p' to fix paths, so in the end no patch edits were needed. Conclusion: my fault, but git could error on the wrong 'index sha..sha' instead of creating wrong blobs. > the "gc" command actually does do exactly what you're suggesting Right, I didn't notice that. My UI runs 'git gc' sometimes and I assumed that corruption occurred long ago. Back then, I just deleted and re-cloned a repo and continued my work, that's why I didn't really tried to investigate back then. Still, the other two stories, the blob was genuinely corrupt (in one case I made a program to brute force correct contents, in other downloading blob by sha from remote fixed it). I have now tried corrupting a single bit in loose objects and packs, and both `git gc` and `git fsck` correctly notice the problem. They don't offer any solutions such as re-downloading from remote, though. So it seems that the first part of my suggestion (to verify on gc) is already there. Yeah, it's foolish that I posted before testing things carefully, I was convinced by a combination of stories I encountered :( > Hypothetically, but these blobs aren't corrupted, and no amount of > fetching something from other places is going to fix a bad DAG. Right, the corruption in specific repo I sent is a different case. Now that we know what happened, let's forget about this case in light of auto-repair idea. > But even without that you'll find that e.g. if a recent object is bad, > and we'd like to fetch it from upstream, that we're just going to die > pretty early, as none of the code involved in say incremental fetching > is prepared to run across a bad/corrupt object. I understand that fixing a genuinely corrupt object involves two cases: 1) Loose object - just delete it, then promisor will treat it as missing 2) Packed object - I understand that in current implementation, replacing the object will involve repacking anyway. So, I'm thinking, the corrupt pack can be extracted into loose objects, then read (1) about fixing loose object. In both cases, make sure that remote has the non-corrupt object before going further. ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2021-10-15 0:22 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-10-13 15:47 Suggestion: "verify/repair" option for 'git gc' Alexandr Miloslavskiy 2021-10-14 1:19 ` Ævar Arnfjörð Bjarmason 2021-10-14 12:47 ` Alexandr Miloslavskiy 2021-10-14 15:17 ` Ævar Arnfjörð Bjarmason 2021-10-14 20:23 ` Alexandr Miloslavskiy
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).