git@vger.kernel.org mailing list mirror (one of many)
From: Tao Klerks <tao@klerks.biz>
To: Jeff King <peff@peff.net>
Cc: git <git@vger.kernel.org>
Subject: Re: Determining whether you have a commit locally, in a partial clone?
Date: Wed, 21 Jun 2023 12:10:33 +0200
Message-ID: <CAPMMpoha6rBA-T-7cn3DQT_nbNfknigLTky55x0TEmt4Ay2GRA@mail.gmail.com>
In-Reply-To: <20230621064459.GA607974@coredump.intra.peff.net>

On Wed, Jun 21, 2023 at 8:45 AM Jeff King <peff@peff.net> wrote:
>
> On Tue, Jun 20, 2023 at 09:12:24PM +0200, Tao Klerks wrote:
>
> > I'm back to begging for any hints here: Any idea how I can determine
> > whether a given commit object exists locally, *without causing it to
> > be fetched by the act of checking for it?*
>
> This is not very efficient, but:
>
>   git cat-file --batch-check='%(objectname)' --batch-all-objects --unordered |
>   grep $some_sha1
>
> will tell you whether we have the object locally.
>

Thanks so much for your help!

On Windows (MSYS or Git Bash) this is still very slow in my repo with
6,500,000 local objects, at around 60s, but on Linux with the same repo
it's much faster, at 5s. A large proportion of my users are on
Windows though, so I don't think this will be "good enough" for my
purposes, since I often need to check for the existence of dozens or
even hundreds of commits.
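
When many IDs need checking at once, the cost of the full listing can at
least be paid only once per batch. A minimal sketch, assuming the IDs are
full hex object names stored one per line in a file; the function name
and the file name wanted.txt are hypothetical, not part of Git:

```shell
# list_local_oids FILE
# Print the object IDs from FILE (one full hex ID per line) that exist in
# the local object store, using a single pass over all local objects.
list_local_oids() {
    git cat-file --batch-check='%(objectname)' --batch-all-objects --unordered |
    grep -F -x -f "$1" |   # keep only exact, whole-line matches from FILE
    sort -u                # guard against an ID being listed twice
}
```

Calling "list_local_oids wanted.txt" prints the subset of IDs that are
present. Note that this matches objects of any type, not only commits,
and it still walks every local object, so the per-batch cost stays the
same as above.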

> I don't work with partial clones often, but it feels like being able to
> say:
>
>   git --no-partial-fetch cat-file ...
>
> would be a useful primitive to have.

It feels that way to me, yes!

On the other hand, I find very little demand for it when I search "the
internet" - or I don't know how to search for it.


> The implementation might start
> something like this:
>
> diff --git a/object-file.c b/object-file.c
> index 7c1af5c8db..494cdd7706 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1555,6 +1555,14 @@ void disable_obj_read_lock(void)
>
>  int fetch_if_missing = 1;
>
> +static int allow_lazy_fetch(void)
> +{
> +       static int ret = -1;
> +       if (ret < 0)
> +               ret = git_env_bool("GIT_PARTIAL_FETCH", 1);
> +       return ret;
> +}
> +
>  static int do_oid_object_info_extended(struct repository *r,
>                                        const struct object_id *oid,
>                                        struct object_info *oi, unsigned flags)
> @@ -1622,6 +1630,7 @@ static int do_oid_object_info_extended(struct repository *r,
>
>                 /* Check if it is a missing object */
>                 if (fetch_if_missing && repo_has_promisor_remote(r) &&
> +                   allow_lazy_fetch() &&
>                     !already_retried &&
>                     !(flags & OBJECT_INFO_SKIP_FETCH_OBJECT)) {
>                         promisor_remote_get_direct(r, real, 1);
>
> and then have git.c populate the environment variable, similar to how we
> handle --literal-pathspecs, etc.
>
> That fetch_if_missing kind of does the same thing, but it's mostly
> controlled by programs themselves which try to handle missing remote
> objects specially.

Thanks, I will play with this if I get the chance. That said, I don't
control my users' distributions of Git, so on a purely practical basis
I'm looking for something that will work from Git 2.39 up to whatever
future version introduces such a capability. (Before 2.39, the
"set remote to False" hack works.)

> It does seem like you might be able to bend it to
> your will here, though. I think without any patches that:
>
>   git rev-list --objects --exclude-promisor-objects $oid
>
> will tell you whether we have the object or not (since it turns off
> fetch_if_missing, and thus will either succeed, printing nothing, or
> bail if the object can't be found).

This behaves in a way that I don't understand:

In the repo that I'm working in, this command runs successfully
*without fetching*, but it takes a *very* long time - 300+ seconds -
much longer than even the "inefficient" 'cat-file'-based printing of
all (6.5M) local object ids that you proposed above. I haven't
attempted to understand what's going on in there (besides running with
GIT_TRACE2_PERF, which showed nothing interesting), but the idea that
git would have to work super-hard to find an object by its ID seems
counter to everything I know about it. Would there be value in my
trying to understand & reproduce this in a shareable repo, or is there
already an explanation as to why this command could/should ever do
non-trivial work, even in the largest partial repos?

> It feels like --missing=error should
> function similarly, but it seems to still lazy-fetch (I guess since it's
> the default, the point is to just find truly unavailable objects). Using
> --missing=print disables the lazy-fetch, but it seems to bail
> immediately if you ask it about a missing object (I didn't dig, but my
> guess is that --missing is mostly about objects we traverse, not the
> initial tips).

Woah, "--missing=print" seems to work!!!

The following gives me the commit hash if I have it locally, and an
error otherwise - consistently across Linux and Windows, with Git
versions 2.41, 2.39, 2.38, and 2.36 - without fetching, and without
crazy CPU-churning:

git rev-list --missing=print -1 $oid
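
For checking many commits from a script, this could be wrapped roughly as
follows. It is only a sketch: the function name is made up, it assumes $1
is a full hex commit ID, and it compares the output against that ID
instead of trusting the exit status alone, in case some Git version
reports a missing tip differently from what is described here.

```shell
# have_commit_locally OID
# Succeed only if OID (a full hex commit ID) is present locally.
# --missing=print is what keeps rev-list from lazy-fetching here.
have_commit_locally() {
    out=$(git rev-list --missing=print -1 "$1" 2>/dev/null) || return 1
    # For a present commit, rev-list prints exactly that commit's ID.
    [ "$out" = "$1" ]
}
```

A caller would loop over its list of IDs and branch on the return
status, for example: have_commit_locally "$oid" || echo "missing $oid"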

Thank you thank you thank you!

I feel like I should try to work something into the doc about this,
but I'm not sure how to express this: "--missing=error is the default,
but it doesn't actually error out when you're explicitly asking about
a missing commit, it fetches it instead - but --missing=print actually
*does* error out if you explicitly ask about a missing commit" seems
like a strange thing to be saying.

Thanks again for finding me an efficient working strategy here!

Thread overview: 9+ messages
2023-06-20 11:26 Determining whether you have a commit locally, in a partial clone? Tao Klerks
2023-06-20 12:04 ` Tao Klerks
2023-06-20 18:31   ` Junio C Hamano
2023-06-20 19:41     ` Tao Klerks
2023-06-21  6:54     ` Jeff King
2023-06-20 19:12   ` Tao Klerks
2023-06-21  6:44     ` Jeff King
2023-06-21 10:10       ` Tao Klerks [this message]
2023-06-27  8:09         ` Jeff King
