git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Martin Fick <mfick@codeaurora.org>
Cc: git@vger.kernel.org
Subject: Re: pack corruption post-mortem
Date: Wed, 16 Oct 2013 20:35:47 -0400	[thread overview]
Message-ID: <20131017003546.GA12439@sigill.intra.peff.net> (raw)
In-Reply-To: <201310160941.16904.mfick@codeaurora.org>

On Wed, Oct 16, 2013 at 09:41:16AM -0600, Martin Fick wrote:

> I have nightmares about this sort of thing every now and 
> then, and we even experience some corruption here and there 
> that needs to be fixed (mainly missing objects when we toy 
> with different git repack arguments).  I cannot help but 
> wonder, how we can improve git further to either help 
> diagnose or even fix some of these problems?  More inline 
> below...

In general, I don't think we know enough about patterns of recovery
corruption to say which commands would definitely be worth implementing.
Part of the reason I wrote this up is to document this one case. But
this is the first time in 7 years of git usage that I've had to do this.
So I'd feel a little bit better about sinking time into it after seeing
a few more cases and realizing where the patterns are.

One of the major hassles is that the assumptions you can and can't make
depend on what data you have that _isn't_ corrupted. Do you have a pack
index, or a bare pack? Do you have zlib data that fails the crc, or zlib
data that cannot be parsed?

In this case there was no other copy of the repository. But if you know
the broken object (which we did here), and you can copy it from
elsewhere, then git already will try to find other sources of the
object (loose, or in another pack).

> >   dd if=$pack of=object bs=1 skip=51653873 count=10863
> 
> Is there a current plumbing command that should be enhanced 
> to be able to do the 2 steps above directly for people 
> debugging (maybe with some new switch)?  If not, should we 
> create one, git show --zlib, or git cat-file --zlib?

Most of the git plumbing commands deal with data at the object layer.
This is really about going a step below and saying "Give me the on-disk
representation of the object". We recently introduced an
"%(objectsize:disk)" formatter for cat-file. The logical extension would
be to ask for "%(contents:disk)" or something. Though what you get would
depend on how the object is stored, so you would need to figure that out
to do anything useful with it.

Note that this implies you actually have a packfile index that says
"object XXX is at offset YYY". In some corruption cases, you might have
only a packfile. That is generally enough to generate the index, but if
there is corruption, you cannot actually parse the pack to find out the
sha1 of the objects.

So in the worst case, what you really want is something like "dump the
object in packfile X at offset Y". But even then, you don't know the
length of the object. The packfile is a stream, and the length we
calculated is from the index, which depends on the zlib data parsing in
some sane way.

> > and then running "git index-pack tmp.pack" in the
> > debugger (stop at unpack_raw_entry). Doing this, I found
> > that there were 3 bytes of header (and the header itself
> > had a sane type and size). So I stripped those off with:
> > 
> >   dd if=object of=zlib bs=1 skip=3
> 
> This too feels like something we should be able to do with a 
> plumbing command eventually?
> 
> git zlib-extract

Perhaps. I think if you had some "extract object at offset X from the
packfile" command, it would be optional to give you the whole thing, or
just the zlib data.

> > So I took a different approach. Working under the guess
> > that the corruption was limited to a single byte, I
> > wrote a program to munge each byte individually, and try
> > inflating the result. Since the object was only 10K
> > compressed, that worked out to about 2.5M attempts,
> > which took a few minutes.
> 
> Awesome!  Would this make a good new plumbing command, git 
> zlib-fix?

I'd like to see it actually work more than once first. This relies on
there being single-byte corruption. Even double-byte corruption starts
to get expensive to brute-force like this. SHA1, by its nature, requires
brute-forcing. But it's possible that the crc, not being
cryptographically secure, could be reverse-engineered to find likely
spots of corruption. I don't know enough about it to say.

> > I fixed the packfile itself with:
> > 
> >   chmod +w $pack
> >   printf '\xc7' | dd of=$pack bs=1 seek=51659518
> > conv=notrunc chmod -w $pack
> > 
> > The '\xc7' comes from the replacement byte our "munge"
> > program found. The offset 51659518 is derived by taking
> > the original object offset (51653873), adding the
> > replacement offset found by "munge" (5642), and then
> > adding back in the 3 bytes of git header we stripped.
> 
> Another plumbing command needed?  git pack-put --zlib?

I think in this case that dd does a nice job of solving the problem.
Some of the stuff I did was very git-specific and required knowledge of
the formats. But this one is really just "replace byte X at offset Y",
and I don't see any need to avoid a general-purpose tool (except that
dd is itself reasonably arcane :) ).

-Peff

  reply	other threads:[~2013-10-17  0:35 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-10-16  8:34 pack corruption post-mortem Jeff King
2013-10-16  8:59 ` Duy Nguyen
2013-10-16 15:41 ` Martin Fick
2013-10-17  0:35   ` Jeff King [this message]
2013-10-17 15:47     ` Junio C Hamano
2013-10-25  7:55       ` Jeff King
2013-10-17  1:06   ` Duy Nguyen
2013-10-19 10:32 ` Duy Nguyen
2013-10-19 14:41   ` Nicolas Pitre
2013-10-19 19:17     ` Shawn Pearce
2013-10-20 20:56       ` Nicolas Pitre
2013-10-20  4:44     ` Duy Nguyen
2013-10-20 21:08       ` Nicolas Pitre
2015-04-01 21:08 ` [PATCH] howto: document more tools for recovery corruption Jeff King
2015-04-01 22:21   ` Junio C Hamano
2015-04-02  0:49     ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131017003546.GA12439@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=mfick@codeaurora.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).