From: Jeff King <peff@peff.net>
To: Tao Klerks <tao@klerks.biz>
Cc: git <git@vger.kernel.org>
Subject: Re: Determining whether you have a commit locally, in a partial clone?
Date: Tue, 27 Jun 2023 04:09:54 -0400 [thread overview]
Message-ID: <20230627080954.GF1226768@coredump.intra.peff.net> (raw)
In-Reply-To: <CAPMMpoha6rBA-T-7cn3DQT_nbNfknigLTky55x0TEmt4Ay2GRA@mail.gmail.com>
On Wed, Jun 21, 2023 at 12:10:33PM +0200, Tao Klerks wrote:
> > This is not very efficient, but:
> >
> > git cat-file --batch-check='%(objectname)' --batch-all-objects --unordered |
> > grep $some_sha1
> >
> > will tell you whether we have the object locally.
> >
>
> Thanks so much for your help!
>
> in Windows (msys or git bash) this is still very slow in my repo with
> 6,500,000 local objects - around 60s - but in linux on the same repo
> it's quite a lot faster, at 5s. A large proportion of my users are on
> Windows though, so I don't think this will be "good enough" for my
> purposes, when I often need to check for the existence of dozens or
> even hundreds of commits.
Yeah, it's just a lot of object names to print, most of which you don't
care about. :)
The more efficient thing would be to open the actual pack .idx files and
look for the names via binary search. I don't think you can convince git
to do that, though I suspect you could write a trivial libgit2 program
that does.
> > I don't work with partial clones often, but it feels like being able to
> > say:
> >
> > git --no-partial-fetch cat-file ...
> >
> > would be a useful primitive to have.
>
> It feels that way to me, yes!
>
> On the other hand, I find very little demand for it when I search "the
> internet" - or I don't know how to search for it.
I think partial clones are still new enough that not many people are
using them heavily. And when they do, not managing the partial state at
a very advanced level; I think tools for pruning locally cached objects
(which you could refetch) is only just being worked on now.
> > It does seem like you might be able to bend it to
> > your will here, though. I think without any patches that:
> >
> > git rev-list --objects --exclude-promisor-objects $oid
> >
> > will tell you whether we have the object or not (since it turns off
> > fetch_if_missing, and thus will either succeed, printing nothing, or
> > bail if the object can't be found).
>
> This behaves in a way that I don't understand:
>
> In the repo that I'm working in, this command runs successfully
> *without fetching*, but it takes a *very* long time - 300+ seconds -
> much longer than even the "inefficient" 'cat-file'-based printing of
> all (6.5M) local object ids that you proposed above. I haven't
> attempted to understand what's going on in there (besides running with
> GIT_TRACE2_PERF, which showed nothing interesting), but the idea that
> git would have to work super-hard to find an object by its ID seems
> counter to everything I know about it. Would there be value in my
> trying to understand & reproduce this in a shareable repo, or is there
> already an explanation as to why this command could/should ever do
> non-trivial work, even in the largest partial repos?
I think it's actually doing the gigantic traversal (and just limiting it
when it sees objects that are not available). You probably want
"--no-walk" at least, but really you don't even want to walk the trees
of any commits you specify (so you'd want to omit "--objects" if you are
asking about a commit, and otherwise include it, which is slightly
awkward).
> > It feels like --missing=error should
> > function similarly, but it seems to still lazy-fetch (I guess since it's
> > the default, the point is to just find truly unavailable objects). Using
> > --missing=print disables the lazy-fetch, but it seems to bail
> > immediately if you ask it about a missing object (I didn't dig, but my
> > guess is that --missing is mostly about objects we traverse, not the
> > initial tips).
>
> Woah, "--missing=print" seems to work!!!
>
> The following gives me the commit hash if I have it locally, and an
> error otherwise - consistently across linux and windows, git versions
> 2.41, 2.39, 2.38, and 2.36 - without fetching, and without crazy
> CPU-churning:
>
> git rev-list --missing=print -1 $oid
>
> Thank you thank you thank you!
Hmph, I thought I tried that before and it didn't work, but it seems to
work for me now. I guess I was hoping to have it print the missing
object rather than exiting with an error, but if you do one object at a
time then the error is sufficient signal. :)
You might want "--objects" if you're going to ask about non-commits.
Though it might not be necessary. I suspect Git would bail trying to
look up the object in the first place if we don't have it, and if we do
have it then it just becomes a silent noop.
> I feel like I should try to work something into the doc about this,
> but I'm not sure how to express this: "--missing=error is the default,
> but it doesn't actually error out when you're explicitly asking about
> a missing commit, it fetches it instead - but --missing=print actually
> *does* error out if you explicitly ask about a missing commit" seems
> like a strange thing to be saying.
I think we are relying on the side effect that everything except
--missing=error will turn off auto-fetching. I don't know if that's
something we'd want to document. It seems reasonable to me that we might
later change the implementation so that we kick in the --missing
behavior only after parsing the initial list of traversal tips (I mean,
I don't know why we would do that in particular, but it seems like the
kind of thing we'd want to reserve as an implementation detail subject
to change).
I do think in the long run that a big "--do-not-lazy-fetch" flag would
be the right solution to let the user tell us what they want.
> Thanks again for finding me an efficient working strategy here!
I'm glad it worked. I was mostly just thinking out loud. ;)
-Peff
prev parent reply other threads:[~2023-06-27 8:10 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-20 11:26 Determining whether you have a commit locally, in a partial clone? Tao Klerks
2023-06-20 12:04 ` Tao Klerks
2023-06-20 18:31 ` Junio C Hamano
2023-06-20 19:41 ` Tao Klerks
2023-06-21 6:54 ` Jeff King
2023-06-20 19:12 ` Tao Klerks
2023-06-21 6:44 ` Jeff King
2023-06-21 10:10 ` Tao Klerks
2023-06-27 8:09 ` Jeff King [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230627080954.GF1226768@coredump.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
--cc=tao@klerks.biz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).