Re: should git download missing objects?

From: Junio C Hamano <junkio@cox.net>
To: "Anand Kumria" <wildfire@progsoc.org>
Cc: git@vger.kernel.org
Subject: Re: should git download missing objects?
Date: Sun, 12 Nov 2006 11:41:23 -0800
Message-ID: <7vwt60bggs.fsf@assigned-by-dhcp.cox.net>
In-Reply-To: ej7fgp$8ca$1@sea.gmane.org

"Anand Kumria" <wildfire@progsoc.org> writes:

> I did an initial clone of Linus' linux-2.6.git tree, via the git protocol,
> and then managed to accidentally delete one of the .pack and
> corresponding .idx files.
>
> I thought that 'cg-fetch' would do the job of bringing down the missing pack
> again, and all would be well. Alas this isn't the case.
>
> <http://pastebin.ca/246678>
>
> Pasky, on IRC, indicated that this might be because git-fetch-pack isn't
> downloading missing objects when the git:// protocol is being used.

There are two invariants between refs and objects:

 - objects that refs (files under the .git/refs/ hierarchy that
   record 40-byte hexadecimal object names) point at are never
   missing, or the repository is corrupt.

 - objects that are reachable via pointers in another object
   that is not missing (a tag points at another object, a commit
   points at its tree and its parent commits, and a tree points
   at its subtrees and blobs) are never missing, or the repository
   is corrupt.
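
The two invariants can be written down as a check over a toy model of
a repository.  This is only a minimal sketch: the `refs` and `objects`
dictionaries and the function name are hypothetical stand-ins, not how
git stores or checks anything.

```python
def find_violations(refs, objects):
    """Return sorted (source, missing_object) pairs that break the invariants.

    refs:    hypothetical dict, ref name -> object name
    objects: hypothetical dict, object name -> list of object names it
             points at; an object is "present" only if it is a key here.
    """
    violations = set()
    # Invariant 1: every object a ref points at must be present.
    for ref, target in refs.items():
        if target not in objects:
            violations.add((ref, target))
    # Invariant 2: every pointer held by a present object must lead to
    # another present object.
    for name, pointers in objects.items():
        for target in pointers:
            if target not in objects:
                violations.add((name, target))
    return sorted(violations)


healthy = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": ["blob1"],
    "tree1": ["blob1"],
    "blob1": [],
}
refs = {"refs/heads/master": "commit2"}
print(find_violations(refs, healthy))   # []

# Simulate an accidental deletion of one object.
damaged = {k: v for k, v in healthy.items() if k != "tree1"}
print(find_violations(refs, damaged))   # [('commit1', 'tree1')]
```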

Git tools first fetch missing objects and then update your refs
only when fetch succeeds completely, in order to maintain the
above invariants (a partial fetch does not update your refs).
And these invariants are why:

 - fsck-objects starts its reachability check from the refs;

 - commit walkers can stop at your existing refs;

 - git native protocols only need to tell the other end what
   refs you have, in order for the other end to exclude what you
   already have from the set of objects it sends you.

What's missing needs to be determined in a reasonably efficient
manner, and the above invariants spare us from having to do the
equivalent of fsck-objects every time.  Being able to trust refs
is fairly fundamental in the fetch operation of git.
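
To see why trusting refs leaves a deleted object undelivered, here is
a rough sketch of the sender's side on a toy object graph.  The names
`reachable` and `objects_to_send` and the dictionary layout are
assumptions made for illustration; the real protocol negotiates
incrementally rather than computing whole sets like this.

```python
def reachable(objects, starts):
    """Return every object name reachable from the given starting points."""
    seen = set()
    stack = list(starts)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(objects[name])
    return seen


def objects_to_send(objects, wanted, receiver_refs):
    """Objects reachable from `wanted`, minus what the receiver claims to have.

    The sender takes the receiver's refs at face value: anything reachable
    from them is assumed to exist on the receiving end.
    """
    known = [name for name in receiver_refs if name in objects]
    return reachable(objects, wanted) - reachable(objects, known)


sender = {
    "commit3": ["tree3", "commit2"],
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree3": ["blob1", "blob2"],
    "tree2": ["blob1"],
    "tree1": ["blob1"],
    "blob1": [],
    "blob2": [],
}
# The receiver's ref says it has commit2, so only the new objects are sent.
print(sorted(objects_to_send(sender, ["commit3"], ["commit2"])))
# ['blob2', 'commit3', 'tree3']
```

Note that 'tree1' is not in the result.  If the receiver has deleted it
locally while its ref still points at 'commit2', an ordinary fetch has
no reason to send it again.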

I am not opposed to the idea of a new tool to fix a corrupted
repository that has broken the above invariants, perhaps caused
by accidental removal of objects and packs by end users.  What
it needs to do would be:

 - run fsck-objects to find out what is missing, by noting
   "broken link from foo to bar" output messages.  Object 'bar'
   is what you _ought_ to have according to your refs but you
   don't (because you removed the objects that should be there),
   and everything that is reachable from it needs to be
   retrieved from the other side.  Because you do not have
   'bar', your end cannot determine which of the objects you
   happen to have in your object store are reachable from it,
   so some of the download will be redundant.

 - run a fetch-pack equivalent to get everything reachable
   starting at the above missing objects, pretending you do not
   have any object, because your refs are not trustworthy.

 - run fsck-objects again to make sure that your refs can now be
   trusted again.
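
The first step above could be sketched roughly as follows.  The line
shapes matched here (a "to <type> <name>" continuation line after
"broken link from", and "missing <type> <name>") are an assumption
about how fsck reports problems and may not match your git version,
and the sample text is made up for illustration, not captured from a
real run.

```python
import re

# Assumed report shapes; verify against the fsck output you actually get.
TO_LINE = re.compile(r"^\s*to\s+(\w+)\s+([0-9a-f]{40})\s*$")
MISSING_LINE = re.compile(r"^missing\s+(\w+)\s+([0-9a-f]{40})\s*$")


def missing_objects(fsck_output):
    """Collect the object names that fsck reports as link targets or missing.

    Returns the names in first-seen order without duplicates.
    """
    found = []
    for line in fsck_output.splitlines():
        match = TO_LINE.match(line) or MISSING_LINE.match(line)
        if match and match.group(2) not in found:
            found.append(match.group(2))
    return found


# Fabricated sample: 'a' * 40 plays the role of 'foo', 'b' * 40 of 'bar'.
foo, bar = "a" * 40, "b" * 40
sample = (
    f"broken link from    tree {foo}\n"
    f"              to    blob {bar}\n"
    f"missing blob {bar}\n"
)
print(missing_objects(sample) == [bar])   # True
```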

To implement the second step above, you need a modified
fetch-pack that does not trust any of your refs.  It also needs
to ignore what is offered by the other end and instead ask for
the objects you know are missing ('bar' in the above example).
This program needs to talk to a modified upload-pack running at
the other end (let's call it upload-pack-recover), because the
usual upload-pack does not serve starting from a random object
that happens to be in its repository, but only starting from
objects that are pointed at by its own set of refs, to ensure
integrity.

The upload-pack-recover program would need to start traversal
from object 'bar' in the above example, and when it does so, it
should not just run 'rev-list --objects' starting at 'bar'.  It
first needs to prove that its object store has everything that
is reachable from 'bar' (the recipient would still end up with
an incomplete repository if it didn't).

What this means is that it needs to prove some of its refs can
reach 'bar' (again, on the upstream end, only refs are trusted;
mere existence of an object is not enough) before sending
objects back.  The usual upload-pack does not have to do this
because it refuses to serve starting from anything but what its
refs point at (and by the invariants, the objects pointed at by
refs are guaranteed to be complete [an object is "complete" if
no object reachable from it is missing]).

This is needed because the repository might have discarded a
branch that used to reach 'bar'.  If the object 'bar' was in a
pack while some of its ancestors or component trees and/or
blobs were loose, a subsequent git-prune would have removed the
latter without removing 'bar'.  Mere existence of the object
'bar' does not mean 'bar' is complete.
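
The check described above could look roughly like this on a toy
model.  Here `may_serve`, `reachable_from_refs` and the dictionary
layout are hypothetical names for illustration, and
upload-pack-recover itself is only a proposed program, so this is a
sketch of the idea rather than of any existing code.

```python
def reachable_from_refs(objects, refs):
    """Return every object reachable from the upstream's own refs.

    By the invariants, these are the only objects known to be complete.
    """
    seen = set()
    stack = list(refs.values())
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(objects.get(name, []))
    return seen


def may_serve(objects, refs, requested):
    """Serve `requested` only if some ref reaches it; existence is not enough."""
    return requested in reachable_from_refs(objects, refs)


upstream_objects = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": ["blob1"],
    "tree1": ["blob1"],
    "blob1": [],
    # 'bar' survived in a pack after its branch was discarded, but the
    # loose objects it points at were pruned and are absent from this dict.
    "bar": ["bar-tree", "bar-parent"],
}
upstream_refs = {"refs/heads/master": "commit2"}

print(may_serve(upstream_objects, upstream_refs, "tree1"))  # True
print(may_serve(upstream_objects, upstream_refs, "bar"))    # False
```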

So coming up with such a pair of programs is not rocket
science, but it is fairly delicate.  I would rather have them as
specialized commands, not a part of everyday commands, even if
you were to implement them.

Since this is not an everyday occurrence anyway, a far easier way
would be to clone-pack from the upstream into a new repository, take
the pack you downloaded from that new repository and mv it into your
corrupt repository.  You can run fsck-objects to see if you got
back everything you lost earlier.
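
The file-moving part of this workaround can be sketched as below.  It
assumes a fresh clone already exists, that both repositories are
non-bare with packs under `.git/objects/pack`, and it copies rather
than moves; `copy_packs` is a hypothetical helper name.  The
demonstration uses throwaway directories with empty stand-in files,
not real packs.

```python
import shutil
import tempfile
from pathlib import Path


def copy_packs(fresh_repo, damaged_repo):
    """Copy pack-*.pack and pack-*.idx files from one repository to another.

    Files that already exist in the destination are left untouched.
    Returns the sorted names of the files that were copied.
    """
    src = Path(fresh_repo) / ".git" / "objects" / "pack"
    dst = Path(damaged_repo) / ".git" / "objects" / "pack"
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob("pack-*")):
        if path.suffix in (".pack", ".idx") and not (dst / path.name).exists():
            shutil.copy2(path, dst / path.name)
            copied.append(path.name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    fresh = Path(tmp) / "fresh"
    damaged = Path(tmp) / "damaged"
    pack_dir = fresh / ".git" / "objects" / "pack"
    pack_dir.mkdir(parents=True)
    for name in ("pack-1234.pack", "pack-1234.idx"):
        (pack_dir / name).touch()
    copied = copy_packs(fresh, damaged)
    print(copied)   # ['pack-1234.idx', 'pack-1234.pack']
```

After copying real packs this way, the fsck-objects run mentioned
above is still what tells you whether the repository is whole again.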
