git@vger.kernel.org mailing list mirror (one of many)
From: Alexandr Miloslavskiy <alexandr.miloslavskiy@syntevo.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Suggestion: "verify/repair" option for 'git gc'
Date: Thu, 14 Oct 2021 23:23:29 +0300
Message-ID: <437b170b-d366-ce3b-b34f-e56429a21b5a@syntevo.com>
In-Reply-To: <87czo7haha.fsf@evledraar.gmail.com>

Thanks for such a detailed reply!

On 14.10.2021 18:17, Ævar Arnfjörð Bjarmason wrote:
 > Was this created with git itself, or some tool that's manually
 > crafting trees?

I spent a while figuring out what exactly happened with the
repo. It was in fact my fault. Git could have done better at rejecting
bad user input, but it didn't, and it allowed my mistake to end up in
objects.

So, here's what I did:
1) I had two repos with completely disjoint history, but matching
    (at some point) sources. With different paths, though.
2) I wanted to copy a set of patches from repo SRC to repo DST
3) I didn't expect the patch to apply on its own, so I decided to
    edit the generated patch before applying it. Yeah, I did it wrong
    in the end, but frankly, moving patches between such different repos
    is something I had never done before.
4) I fixed the file paths in the patch (didn't know about 'git apply -p')
5) I identified a commit in repo DST (0189425c, this might ring a bell)
    which had the same sources as the base of my patches in repo SRC.
6) I changed every 'index sha..sha' line to 'index 0189425c..sha'.
    What I hoped to achieve was to convince git that the patch was created
    on repo DST commit 0189425c, instead of where it was really created.
    From the 'sha..sha' notation I assumed it meant 'commit..commit', but
    it was in fact 'blob..blob', which I didn't expect.
    If you're interested, the patch can still be found in file
    '.git\rebase-apply\patch' of the repo I sent you, and I tried to
    apply it on top of 4e74e0897.
7) The patch applied partially, because the sources weren't as identical
    as I thought. As I understand it, git created trees in the process
    that referenced the hacked 0189425c as if it were a blob, while it
    was in fact a commit.
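
To illustrate why the edit in step 6 went wrong: the two IDs on a
patch's 'index' line are abbreviated blob IDs of the file before and
after the change, and a blob ID is the SHA-1 of a 'blob <size>\0'
header followed by the file contents. A minimal Python sketch, assuming
a SHA-1 repository; the helper names and the sample index line are made
up for illustration and are not taken from the patch discussed here:

```python
import hashlib
import re


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical helper: split an 'index' line into old ID, new ID and mode.
INDEX_RE = re.compile(r"^index ([0-9a-f]+)\.\.([0-9a-f]+)(?: (\d+))?$")


def parse_index_line(line: str):
    match = INDEX_RE.match(line)
    if match is None:
        raise ValueError("not an index line: %r" % line)
    return match.group(1), match.group(2), match.group(3)


# Made-up sample line: an empty file gaining some content.
old_blob, new_blob, mode = parse_index_line("index e69de29..3b18e51 100644")

# The left-hand ID is the blob ID of the preimage (here, the empty file),
# not the ID of any commit.
print(git_blob_id(b"").startswith(old_blob))  # prints True
```

Putting a commit ID on the left-hand side therefore names an object of
the wrong type rather than a different base revision.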

I now know that I didn't need to change any SHAs: git applies
the patch just fine if the previous file content matches. I also know
that I could have just used '-p' to fix the paths, so in the end no
patch edits were needed.
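
For reference, 'git apply -p<n>' removes <n> leading path components
from the paths in the patch (the default is 1, which drops the 'a/' and
'b/' prefixes), and '--directory=<root>' prepends a root after that
stripping. The path handling can be sketched in Python as follows. This
only models the documented behaviour and is not git's code; the
function name and the sample paths are hypothetical:

```python
def rewrite_patch_path(path: str, strip: int = 1, directory: str = "") -> str:
    """Model 'git apply -p<strip> --directory=<directory>' on a single path."""
    parts = path.split("/")
    if strip >= len(parts):
        raise ValueError("cannot strip %d components from %r" % (strip, path))
    stripped = "/".join(parts[strip:])
    if directory:
        # --directory is applied after -p has removed the leading components.
        return directory.rstrip("/") + "/" + stripped
    return stripped


print(rewrite_patch_path("a/lib/util.c"))                                   # lib/util.c
print(rewrite_patch_path("a/lib/util.c", strip=2, directory="src/common"))  # src/common/util.c
```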

Conclusion: my fault, but git could have errored out on the wrong
'index sha..sha' line instead of creating trees that reference a
commit as if it were a blob.

 > the "gc" command actually does do exactly what you're suggesting

Right, I didn't notice that. My UI runs 'git gc' sometimes and I
assumed that the corruption had occurred long ago. Back then, I just
deleted and re-cloned the repo and continued my work, which is why I
didn't really try to investigate.

Still, in the other two stories the blob was genuinely corrupt (in one
case I wrote a program to brute-force the correct contents; in the
other, downloading the blob by SHA from the remote fixed it).

I have now tried corrupting a single bit in loose objects and packs,
and both `git gc` and `git fsck` correctly notice the problem. They
don't offer any solutions such as re-downloading from remote, though.
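
The reason a single flipped bit is detectable can be sketched in
Python: a loose object is a zlib-compressed file whose name is the hash
of its uncompressed contents, so damage either breaks decompression or
changes the hash. This is an illustration under the assumption of a
SHA-1 repository, not a description of how 'git fsck' is implemented;
all function names are hypothetical:

```python
import hashlib
import zlib


def make_loose_object(content: bytes):
    """Return (object_id, file_bytes) for a blob stored as a loose object."""
    store = b"blob %d\x00" % len(content) + content
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def is_intact(object_id: str, file_bytes: bytes) -> bool:
    """Check that the file decompresses and hashes back to its own ID."""
    try:
        store = zlib.decompress(file_bytes)
    except zlib.error:
        return False
    return hashlib.sha1(store).hexdigest() == object_id


def flip_bit(data: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of data with one bit inverted."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


oid, raw = make_loose_object(b"hello world\n")
print(is_intact(oid, raw))                              # prints True
# Damage the last byte, which is part of zlib's Adler-32 checksum.
print(is_intact(oid, flip_bit(raw, len(raw) - 1)))      # prints False
```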

So it seems that the first part of my suggestion (to verify on gc) is
already there. Yeah, it was foolish of me to post before testing things
carefully; I was convinced by the combination of stories I had
encountered :(

 > Hypothetically, but these blobs aren't corrupted, and no amount of
 > fetching something from other places is going to fix a bad DAG.

Right, the corruption in the specific repo I sent is a different case.
Now that we know what happened, let's set this case aside when
discussing the auto-repair idea.

 > But even without that you'll find that e.g. if a recent object is bad,
 > and we'd like to fetch it from upstream, that we're just going to die
 > pretty early, as none of the code involved in say incremental fetching
 > is prepared to run across a bad/corrupt object.

As I understand it, fixing a genuinely corrupt object involves two cases:
1) Loose object - just delete it; the promisor logic will then treat it
    as missing
2) Packed object - as I understand the current implementation,
    replacing the object would involve repacking anyway. So I'm thinking
    the corrupt pack could be extracted into loose objects, and then (1)
    applies for fixing the loose object.

In both cases, make sure that the remote has the non-corrupt object
before going further.
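
The loose-object case could be sketched roughly as below, in Python.
This is only an outline of the proposed flow under several assumptions:
a SHA-1 repository with the usual 'objects/<2 hex>/<38 hex>' layout, a
hypothetical 'remote_has_object' callback standing in for whatever
check confirms that the remote can supply the object, and a made-up
'corrupt-objects' directory used to move the file aside instead of
deleting it outright. Nothing here exists in git itself:

```python
import os
import tempfile


def loose_object_path(git_dir: str, object_id: str) -> str:
    """Path of a loose object inside the given .git directory."""
    return os.path.join(git_dir, "objects", object_id[:2], object_id[2:])


def quarantine_corrupt_object(git_dir, object_id, remote_has_object) -> bool:
    """Move a corrupt loose object aside, but only if the remote has it.

    remote_has_object is a hypothetical callback: object_id -> bool.
    """
    if not remote_has_object(object_id):
        return False  # keep the damaged file; it may be the only copy
    quarantine_dir = os.path.join(git_dir, "corrupt-objects")  # made-up name
    os.makedirs(quarantine_dir, exist_ok=True)
    os.replace(loose_object_path(git_dir, object_id),
               os.path.join(quarantine_dir, object_id))
    return True


# Demo on a throwaway directory standing in for a .git directory.
with tempfile.TemporaryDirectory() as git_dir:
    oid = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"
    path = loose_object_path(git_dir, oid)
    os.makedirs(os.path.dirname(path))
    with open(path, "wb") as f:
        f.write(b"damaged bytes")
    moved = quarantine_corrupt_object(git_dir, oid, lambda _oid: True)
    print(moved, os.path.exists(path))  # prints: True False
```

Once the file is out of the way the object is simply missing; whether
git would then fetch it again on its own depends on the promisor
behaviour assumed in (1).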

Thread overview: 5+ messages
2021-10-13 15:47 Suggestion: "verify/repair" option for 'git gc' Alexandr Miloslavskiy
2021-10-14  1:19 ` Ævar Arnfjörð Bjarmason
2021-10-14 12:47   ` Alexandr Miloslavskiy
2021-10-14 15:17     ` Ævar Arnfjörð Bjarmason
2021-10-14 20:23       ` Alexandr Miloslavskiy [this message]
