git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Suggestion: "verify/repair" option for 'git gc'
@ 2021-10-13 15:47 Alexandr Miloslavskiy
  2021-10-14  1:19 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 5+ messages in thread
From: Alexandr Miloslavskiy @ 2021-10-13 15:47 UTC (permalink / raw)
  To: git

Suggestion
----------
1) It would be nice if 'git gc' had an option to also verify
    (like 'git fsck') the repo and report corruption. I think that it's
    a good idea to have it in 'gc' for performance reasons, because
    'git gc' already reads things.

2) It would be nice if git could automatically download blobs from
    remote if local blob is corrupted. Maybe it was already implemented,
    see story 3 below.

Motivation
----------

-- Story 1 --
Just a few days ago I encountered another secretly broken repo which
caused some small bugs in the git UI I'm using. The repo worked mostly
fine, that's why I had no idea that it's corrupted.

My git UI invokes 'git gc' sometimes and if that detected the
corruption, I wouldn't have to spend time hunting the bug in UI.

Specifically, it reports these errors on `git fsck`
   error: object 0189425cc210555c36383293c468df5da73acc48 is a commit, 
not a blob
   error in tree 1d571d7354f99b726bbcc0cb232b3f47846c71a1: broken links
   error: object 0189425cc210555c36383293c468df5da73acc48 is a commit, 
not a blob
   error in tree 2808b286c2a933e88735d97416e29b9514fc6af2: broken links
   error: object 0189425cc210555c36383293c468df5da73acc48 is a commit, 
not a blob
   error in tree 604f6f6c4fbf8da7a593708e863e68f8c5a27d07: broken links
   error: object 0189425cc210555c36383293c468df5da73acc48 is a commit, 
not a blob
   error in tree 6a2c4a5ef0b0ee7aa85d88c3147b7558a6a7c29f: broken links

The repo is not confidential and I could share it if needed.
I "solved" the problem by cloning a new copy.

-- Story 2 --
A few years ago, I had another repo that wasn't used for a couple years
and had corrupted blobs. The repo looked fine until I tried to clone
from it. Unfortunately it was the only copy and I had to write some
code to "guess" the blob's contents to repair the repo.

If 'git gc' detected corruption, I would have known about the problem 
earlier,
when I still had other copies around.

-- Story 3 --
Also a few years ago, I had a repo with a single corrupted blob. I don't
remember why, but simply re-cloning it was a headache. I managed to fix repo
by issuing a command to re-download a blob from remote. Git could totally do
that itself, I think.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Suggestion: "verify/repair" option for 'git gc'
  2021-10-13 15:47 Suggestion: "verify/repair" option for 'git gc' Alexandr Miloslavskiy
@ 2021-10-14  1:19 ` Ævar Arnfjörð Bjarmason
  2021-10-14 12:47   ` Alexandr Miloslavskiy
  0 siblings, 1 reply; 5+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-14  1:19 UTC (permalink / raw)
  To: Alexandr Miloslavskiy; +Cc: git


On Wed, Oct 13 2021, Alexandr Miloslavskiy wrote:

> Suggestion
> ----------
> 1) It would be nice if 'git gc' had an option to also verify
>    (like 'git fsck') the repo and report corruption. I think that it's
>    a good idea to have it in 'gc' for performance reasons, because
>    'git gc' already reads things.
>
> 2) It would be nice if git could automatically download blobs from
>    remote if local blob is corrupted. Maybe it was already implemented,
>    see story 3 below.
>
> Motivation
> ----------
>
> -- Story 1 --
> Just a few days ago I encountered another secretly broken repo which
> caused some small bugs in the git UI I'm using. The repo worked mostly
> fine, that's why I had no idea that it's corrupted.
>
> My git UI invokes 'git gc' sometimes and if that detected the
> corruption, I wouldn't have to spend time hunting the bug in UI.
>
> Specifically, it reports these errors on `git fsck`
>   error: object 0189425cc210555c36383293c468df5da73acc48 is a commit,
>   not a blob
>   error in tree 1d571d7354f99b726bbcc0cb232b3f47846c71a1: broken links
>   error: object 0189425cc210555c36383293c468df5da73acc48 is a commit,
>   not a blob
>   error in tree 2808b286c2a933e88735d97416e29b9514fc6af2: broken links
>   error: object 0189425cc210555c36383293c468df5da73acc48 is a commit,
>   not a blob
>   error in tree 604f6f6c4fbf8da7a593708e863e68f8c5a27d07: broken links
>   error: object 0189425cc210555c36383293c468df5da73acc48 is a commit,
>   not a blob
>   error in tree 6a2c4a5ef0b0ee7aa85d88c3147b7558a6a7c29f: broken links
>
> The repo is not confidential and I could share it if needed.
> I "solved" the problem by cloning a new copy.

I'd be interested in a copy of it, I've been slowly trying to improve
these sorts of corruption cases.

> -- Story 2 --
> A few years ago, I had another repo that wasn't used for a couple years
> and had corrupted blobs. The repo looked fine until I tried to clone
> from it. Unfortunately it was the only copy and I had to write some
> code to "guess" the blob's contents to repair the repo.
>
> If 'git gc' detected corruption, I would have known about the problem
> earlier,
> when I still had other copies around.

I wonder if this and other issues you encountered wouldn't need a full
"fsck", but merely gc triggering a complete repack. Which is not to say
that some regular background "fsck" wouldn't be a good idea...

> -- Story 3 --
> Also a few years ago, I had a repo with a single corrupted blob. I don't
> remember why, but simply re-cloning it was a headache. I managed to fix repo
> by issuing a command to re-download a blob from remote. Git could totally do
> that itself, I think.

Yes, we still definitely have cases where dealing with this sort of
thing can be very painful.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Suggestion: "verify/repair" option for 'git gc'
  2021-10-14  1:19 ` Ævar Arnfjörð Bjarmason
@ 2021-10-14 12:47   ` Alexandr Miloslavskiy
  2021-10-14 15:17     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 5+ messages in thread
From: Alexandr Miloslavskiy @ 2021-10-14 12:47 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git

On 14.10.2021 4:19, Ævar Arnfjörð Bjarmason wrote:
> I'd be interested in a copy of it, I've been slowly trying to improve
> these sorts of corruption cases.

Sent.

> I wonder if this and other issues you encountered wouldn't need a full
> "fsck", but merely gc triggering a complete repack.

That sounds slow :( For example, it's going to be a lot of disk write 
bandwidth. Just doing the verification along with regular gc sounds faster.

> Yes, we still definitely have cases where dealing with this sort of
> thing can be very painful.

With the new remote promisor code, I think that auto-fixing corrupted 
blobs is easy enough (provided they can be found on any remote) ?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Suggestion: "verify/repair" option for 'git gc'
  2021-10-14 12:47   ` Alexandr Miloslavskiy
@ 2021-10-14 15:17     ` Ævar Arnfjörð Bjarmason
  2021-10-14 20:23       ` Alexandr Miloslavskiy
  0 siblings, 1 reply; 5+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-14 15:17 UTC (permalink / raw)
  To: Alexandr Miloslavskiy; +Cc: git


On Thu, Oct 14 2021, Alexandr Miloslavskiy wrote:

> On 14.10.2021 4:19, Ævar Arnfjörð Bjarmason wrote:
>> I'd be interested in a copy of it, I've been slowly trying to improve
>> these sorts of corruption cases.
>
> Sent.

Thanks, I can't promise I'll take a look at it in detail time soon, but
I was going to loop back to checking out these corruption cases at some
point.

>> I wonder if this and other issues you encountered wouldn't need a full
>> "fsck", but merely gc triggering a complete repack.
>
> That sounds slow :( For example, it's going to be a lot of disk write
> bandwidth. Just doing the verification along with regular gc sounds
> faster.

Having looked at your repo the immedite issue is that you've got a tree
in it that has a (manually crafted?) entry that points to a commit
object, without the relevant mode being correct:

    $ git cat-file -p 1d571d7354f99b726bbcc0cb232b3f47846c71a1
    100644 blob 0189425cc210555c36383293c468df5da73acc48    common.mak
    040000 tree 6a2c4a5ef0b0ee7aa85d88c3147b7558a6a7c29f    include

Was this created with git itself, or some tool that's manually crafting
trees? I.e. that the object on-disk has the exact expected content but
is just bad in this particular way points to corruption in git or
another tool writing the data, not e.g. FS corruption or a bit-flip.

Anyway, getting back on track the "gc" command actually does do exactly
what you're suggesting:

    $ git gc; echo $?
    error: object 0189425cc210555c36383293c468df5da73acc48 is a commit, not a blob
    fatal: entry 'common.mak' in tree 1d571d7354f99b726bbcc0cb232b3f47846c71a1 has blob mode, but is not a blob
    fatal: failed to run repack
    128

The problem is that as a user you won't have seen that because we won't
get that far without running into the gc.auto limit, then it would have
run into that, and you'd have had the contents of gc.log spewed at you
by other commands.

So maybe we should be more aggressive there, e.g. as a function of repo
size or whatever (this repo is 18MB).

You really don't need "git fsck" to verify FS corruption or basic object
graph issues like these, and I think it's rather unfortunate that we
expose it like that.

What it does over and beyond a full repack of the repo is to
exhaustively verify object contents, which is most useful e.g. if you're
running a git service and want to prevent users from pushing crafted
corrupt objects, either intentionally or unintentionally.

I've also been meaning to look at that aspect of it for a while, i.e. it
should be able to have some --fast, it has --connectivity-only, but that
one goes a bit too far, although in this case it would have helped you.

>> Yes, we still definitely have cases where dealing with this sort of
>> thing can be very painful.
>
> With the new remote promisor code, I think that auto-fixing corrupted
> blobs is easy enough (provided they can be found on any remote) ?

Hypothetically, but these blobs aren't corrupted, and no amount of
fetching something from other places is going to fix a bad DAG. If that
thing didn't really point at the wrong object type the hash would be
different. The problem is that it was wrong when it was written.

I say "hypothetically" because even in the case of a bitflip or whatever
coercing git into some sort of auto-repair mode is pretty far off. I've
been able to do it manually in some cases, but e.g. promisors not having
a blob *and knowing that* is very different from the more general cases
of object(s) XYZ being corrupt.

We have well-intentioned features like the collision detection that
actively gets in the way of some repairs like that (I had a patch[1] to
disable it, which as a side-effect made recovering from some forms of
corruption easier).

But even without that you'll find that e.g. if a recent object is bad,
and we'd like to fetch it from upstream, that we're just going to die
pretty early, as none of the code involved in say incremental fetching
is prepared to run across a bad/corrupt object.

Those aren't inherent problems, and it would be very nice to have more
such auto-repair in git, just limitations of the current implementation.

1. https://lore.kernel.org/git/20181028225023.26427-5-avarab@gmail.com/ 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Suggestion: "verify/repair" option for 'git gc'
  2021-10-14 15:17     ` Ævar Arnfjörð Bjarmason
@ 2021-10-14 20:23       ` Alexandr Miloslavskiy
  0 siblings, 0 replies; 5+ messages in thread
From: Alexandr Miloslavskiy @ 2021-10-14 20:23 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git

Thanks for such a detailed reply!

On 14.10.2021 18:17, Ævar Arnfjörð Bjarmason wrote:
 > Was this created with git itself, or some tool that's manually
 > crafting trees?

I have spent... a while... to figure what exactly happened with the
repo. It was in fact my fault. Git could have done better to filter
wrong user input, but it didn't and allowed my mistake to go into
objects.

So, here's what I did:
1) I had two repos with completely disjoint history, but matching
    (at some point) sources. With different paths, though.
2) I wanted to copy a set of patches from repo SRC to repo DST
3) I didn't expect that patch can apply on its own, so I decided to
    edit the generated patch before applying it. Yeah, I did it wrong
    in the end, but frankly moving patches between such different repos
    is something I never done before.
4) I fixed file paths in patch (didn't know about 'git apply -p')
5) I identified commit in repo DST (0189425c, this might ring a bell)
    which had the same sources as the base of my patches in repo SRC.
6) I patched all 'index sha..sha' to 'index 0189425c..sha'
    What I hoped to achieve is to convince git that patch was created
    on repo DST commit 0189425c, instead of where it was really created.
    From 'sha..sha' notation I thought that it's 'commit..commit', but
    it was in fact 'blob..blob', I didn't expect that.
    If you're interested, the patch can still be found in file
    '.git\rebase-apply\patch' of the repo I sent you, and I tried to
    apply it on top of 4e74e0897.
7) Patch applied partially, because the sources weren't as identical
    as I thought. I understand that it created trees in the process,
    which contained the hacked 0189425c as if it was a blob, while it
    was in fact a commit.

Now I already know that I didn't need to change any sha's, git applies
the patch just fine if the previous file content matches. I also know
that I could just use '-p' to fix paths, so in the end no patch edits
were needed.

Conclusion: my fault, but git could error on the wrong 'index sha..sha'
instead of creating wrong blobs.

 > the "gc" command actually does do exactly what you're suggesting

Right, I didn't notice that. My UI runs 'git gc' sometimes and I
assumed that corruption occurred long ago. Back then, I just deleted
and re-cloned a repo and continued my work, that's why I didn't really
tried to investigate back then.

Still, the other two stories, the blob was genuinely corrupt (in one
case I made a program to brute force correct contents, in other
downloading blob by sha from remote fixed it).

I have now tried corrupting a single bit in loose objects and packs,
and both `git gc` and `git fsck` correctly notice the problem. They
don't offer any solutions such as re-downloading from remote, though.

So it seems that the first part of my suggestion (to verify on gc) is
already there. Yeah, it's foolish that I posted before testing things
carefully, I was convinced by a combination of stories I encountered :(

 > Hypothetically, but these blobs aren't corrupted, and no amount of
 > fetching something from other places is going to fix a bad DAG.

Right, the corruption in specific repo I sent is a different case. Now
that we know what happened, let's forget about this case in light of
auto-repair idea.

 > But even without that you'll find that e.g. if a recent object is bad,
 > and we'd like to fetch it from upstream, that we're just going to die
 > pretty early, as none of the code involved in say incremental fetching
 > is prepared to run across a bad/corrupt object.

I understand that fixing a genuinely corrupt object involves two cases:
1) Loose object - just delete it, then promisor will treat it as missing
2) Packed object - I understand that in current implementation,
    replacing the object will involve repacking anyway. So, I'm thinking,
    the corrupt pack can be extracted into loose objects, then read (1)
    about fixing loose object.

In both cases, make sure that remote has the non-corrupt object before
going further.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-10-15  0:22 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-13 15:47 Suggestion: "verify/repair" option for 'git gc' Alexandr Miloslavskiy
2021-10-14  1:19 ` Ævar Arnfjörð Bjarmason
2021-10-14 12:47   ` Alexandr Miloslavskiy
2021-10-14 15:17     ` Ævar Arnfjörð Bjarmason
2021-10-14 20:23       ` Alexandr Miloslavskiy

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).