* [Design RFC] Being more defensive about fetching commits in partial clone
From: Jonathan Tan @ 2022-11-24  0:42 UTC
  To: git; +Cc: Jonathan Tan

At $DAYJOB, we recently ran into a situation in which a bug (not
necessarily in Git) [1] corrupted the object store of a partial clone. In
this particular case, the problem surfaced when "git gc" tried to expire
reflogs: reflog expiry calls repo_parse_commit(), which triggered lazy
fetches of the missing commits.

We don't want to go to great lengths to improve the user experience in a
relatively rare case caused by a bug in another program at the expense of
the regular user experience, so this constrains the solution space. But I
think there is a solution that works: if we have reason to believe that we
are parsing a commit, we shouldn't lazy-fetch it if it is missing. (I'm
not proposing a hard guarantee that commits are never lazy-fetched; this
merely increases resilience to object store corruption rather than
providing an absolute defense.) I think we can treat a missing commit as a
sign of object store corruption here because Git currently does not
support excluding commits in partial clones.

There are other possible solutions, including passing an argument from
"git gc" to "git reflog" to inhibit all lazy fetches, but I think that fix
is at the wrong level: fixing "git reflog" means that this particular
command works fine, or so we think (it will still fail if it somehow needs
to read a legitimately missing blob, say, a .gitmodules file), whereas
fixing repo_parse_commit() fixes a whole class of bugs.

A question remains whether we would need to undo all this work if we
decide to support commit filters in partial clones. Firstly, there are
good arguments against (and, of course, for) commit filters in partial
clones, so commit filters may not work out in the end anyway. Secondly,
even if we do get commit filters, we at $DAYJOB think that we would still
need to differentiate, in some way, between a fetch that we have accounted
for in our design and one that we haven't; commit chains are much longer
than tree chains, and users wouldn't want to wait for Git to fetch commit
by commit (or segment by segment, if we end up batch-fetching commits, as
we probably will). So we would be building on the defensiveness of
fetching commits in this case, not tearing it down.

My next step will be to send a patch modifying repo_parse_commit() to not
lazy-fetch, and I think that future work will lie in identifying when we know
that we are reading a commit and inhibiting lazy-fetches in those cases. If
anyone has an opinion on this, feel free to let us know (hence the "RFC" in
the subject).
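
To make this concrete, here is a rough sketch of the direction I have in
mind for repo_parse_commit_internal() (an illustration only, not the
actual patch; it assumes we can reuse the existing
OBJECT_INFO_SKIP_FETCH_OBJECT flag that other callers already use to
suppress lazy fetches, and it elides the commit-buffer caching that the
real function does):

	int repo_parse_commit_internal(struct repository *r,
				       struct commit *item,
				       int quiet_on_missing,
				       int use_commit_graph)
	{
		struct object_info oi = OBJECT_INFO_INIT;
		enum object_type type;
		unsigned long size;
		void *buffer;
		int ret;

		if (!item)
			return -1;
		if (item->object.parsed)
			return 0;
		if (use_commit_graph && parse_commit_in_graph(r, item))
			return 0;

		oi.typep = &type;
		oi.sizep = &size;
		oi.contentp = &buffer;
		/*
		 * Do not lazy-fetch: partial clones cannot exclude
		 * commits, so a missing commit here is a sign of
		 * object store corruption, not of a filtered clone.
		 */
		if (oid_object_info_extended(r, &item->object.oid, &oi,
					     OBJECT_INFO_SKIP_FETCH_OBJECT) < 0)
			return quiet_on_missing ? -1 :
				error(_("could not read %s"),
				      oid_to_hex(&item->object.oid));
		if (type != OBJ_COMMIT)
			return error(_("object %s is not a commit"),
				     oid_to_hex(&item->object.oid));

		ret = parse_commit_buffer(r, item, buffer, size, 0);
		free(buffer);
		return ret;
	}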

[1] For the curious, we ran a script that ran "git gc" on a repo that had
a symlink to itself configured as its alternate, which resulted in many
objects being deleted.


* Re: [Design RFC] Being more defensive about fetching commits in partial clone
From: Jeff King @ 2022-11-24  1:04 UTC
  To: Jonathan Tan; +Cc: git

On Wed, Nov 23, 2022 at 04:42:05PM -0800, Jonathan Tan wrote:

> A question remains whether we would need to undo all this work if we
> decide to support commit filters in partial clones. Firstly, there are
> good arguments against (and, of course, for) commit filters in partial
> clones, so commit filters may not work out in the end anyway. Secondly,
> even if we do get commit filters, we at $DAYJOB think that we would still
> need to differentiate, in some way, between a fetch that we have accounted
> for in our design and one that we haven't; commit chains are much longer
> than tree chains, and users wouldn't want to wait for Git to fetch commit
> by commit (or segment by segment, if we end up batch-fetching commits, as
> we probably will). So we would be building on the defensiveness of
> fetching commits in this case, not tearing it down.
> 
> My next step will be to send a patch modifying repo_parse_commit() to not
> lazy-fetch, and I think that future work will lie in identifying when we know
> that we are reading a commit and inhibiting lazy-fetches in those cases. If
> anyone has an opinion on this, feel free to let us know (hence the "RFC" in
> the subject).

In general, I think partial clones tend to know which filters were used
to create them, because we save that filter in the config. Would it be
reasonable before lazy-fetching to say "I am looking for an object of
type X; would my configured filters have skipped such an object?".

Then not only would you get the behavior you want for commits (which are
never skipped), but a blob:none clone would not try to lazy-fetch trees
(but a tree:depth one would lazy-fetch both trees and blobs).
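
To hand-wave a little code at it: something like the hypothetical helper
below. get_partial_clone_filter() is made up for illustration (the real
thing would have to parse core.partialclonefilter or the remote's
partialclonefilter config into a list_objects_filter_options), but the
LOFC_* filter choices are the existing ones from
list-objects-filter-options.h:

	static int filter_would_skip_type(struct repository *r,
					  enum object_type type)
	{
		/* hypothetical lookup of the configured filter */
		struct list_objects_filter_options *filter =
			get_partial_clone_filter(r);

		if (!filter)
			return 0; /* no filter configured; nothing skipped */

		switch (filter->choice) {
		case LOFC_BLOB_NONE:
		case LOFC_BLOB_LIMIT:
			return type == OBJ_BLOB;
		case LOFC_TREE_DEPTH:
			/* tree:<depth> can omit both trees and blobs */
			return type == OBJ_TREE || type == OBJ_BLOB;
		default:
			/* be conservative with filters we don't know */
			return type != OBJ_COMMIT;
		}
	}

The lazy-fetch machinery would then fetch only when
filter_would_skip_type() says the miss is expected, and error out
otherwise.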

The gotcha I'd worry about is that the config doesn't necessarily match
how the repository was originally created. There is nothing right now
saying you cannot partial-clone with one filter, then change the config
going forward.

-Peff


* Re: [Design RFC] Being more defensive about fetching commits in partial clone
From: Jonathan Tan @ 2022-11-28 18:53 UTC
  To: Jeff King; +Cc: Jonathan Tan, git

Jeff King <peff@peff.net> writes:
> On Wed, Nov 23, 2022 at 04:42:05PM -0800, Jonathan Tan wrote:
> 
> > A question remains whether we would need to undo all this work if we
> > decide to support commit filters in partial clones. Firstly, there are
> > good arguments against (and, of course, for) commit filters in partial
> > clones, so commit filters may not work out in the end anyway. Secondly,
> > even if we do get commit filters, we at $DAYJOB think that we would
> > still need to differentiate, in some way, between a fetch that we have
> > accounted for in our design and one that we haven't; commit chains are
> > much longer than tree chains, and users wouldn't want to wait for Git to
> > fetch commit by commit (or segment by segment, if we end up
> > batch-fetching commits, as we probably will). So we would be building on
> > the defensiveness of fetching commits in this case, not tearing it down.
> > 
> > My next step will be to send a patch modifying repo_parse_commit() to not
> > lazy-fetch, and I think that future work will lie in identifying when we know
> > that we are reading a commit and inhibiting lazy-fetches in those cases. If
> > anyone has an opinion on this, feel free to let us know (hence the "RFC" in
> > the subject).
> 
> In general, I think partial clones tend to know which filters were used
> to create them, because we save that filter in the config. Would it be
> reasonable before lazy-fetching to say "I am looking for an object of
> type X; would my configured filters have skipped such an object?".
> 
> Then not only would you get the behavior you want for commits (which are
> never skipped), but a blob:none clone would not try to lazy-fetch trees
> (but a tree:depth one would lazy-fetch both trees and blobs).
> 
> The gotcha I'd worry about is that the config doesn't necessarily match
> how the repository was originally created. There is nothing right now
> saying you cannot partial-clone with one filter, then change the config
> going forward.
> 
> -Peff

Thanks for weighing in. In the general case, we indeed do know what kind
of object we're fetching, so it does make sense to generalize my idea to
trees and blobs as well. On the other hand, besides the issue that the
user may subsequently change the config, I think the benefit of
distinguishing between blobs and trees is not that great: we would fail
fast when, say, a tree is missing due to object store corruption while
checking something out in a blob:none clone, but the same corruption would
probably mean that commits are missing too, so handling only commits would
already make us fail just as fast. Because of this (and because the user
can change the config), it makes sense to me to handle only the commit
case, at least for now.

Having said that, this does not preclude restricting lazy fetches for
trees and blobs later, so if we see fit to do that, we still can.


* Re: [Design RFC] Being more defensive about fetching commits in partial clone
From: Jeff King @ 2022-11-29  1:31 UTC
  To: Jonathan Tan; +Cc: git

On Mon, Nov 28, 2022 at 10:53:19AM -0800, Jonathan Tan wrote:

> > In general, I think partial clones tend to know which filters were used
> > to create them, because we save that filter in the config. Would it be
> > reasonable before lazy-fetching to say "I am looking for an object of
> > type X; would my configured filters have skipped such an object?".
> > 
> > Then not only would you get the behavior you want for commits (which are
> > never skipped), but a blob:none clone would not try to lazy-fetch trees
> > (but a tree:depth one would lazy-fetch both trees and blobs).
> > 
> > The gotcha I'd worry about is that the config doesn't necessarily match
> > how the repository was originally created. There is nothing right now
> > saying you cannot partial-clone with one filter, then change the config
> > going forward.
> 
> Thanks for weighing in. In the general case, we indeed do know what kind
> of object we're fetching, so it does make sense to generalize my idea to
> trees and blobs as well. On the other hand, besides the issue that the
> user may subsequently change the config, I think the benefit of
> distinguishing between blobs and trees is not that great: we would fail
> fast when, say, a tree is missing due to object store corruption while
> checking something out in a blob:none clone, but the same corruption
> would probably mean that commits are missing too, so handling only
> commits would already make us fail just as fast. Because of this (and
> because the user can change the config), it makes sense to me to handle
> only the commit case, at least for now.
> 
> Having said that, this does not preclude restricting lazy fetches for
> trees and blobs later, so if we see fit to do that, we still can.

Yeah, I think your reasoning is sound. Certainly it seems reasonable to
start with commit objects, since we know they can never be filtered, and
see how that fares in practice. And then think about handling other
object types later (or never; this is really just a "we might catch
corruption sooner" nice-to-have, and not a correctness problem in normal
use).

-Peff
