git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Patrick Steinhardt <ps@pks.im>
Cc: git@vger.kernel.org, Christian Couder <christian.couder@gmail.com>
Subject: Re: [PATCH 0/7] rev-parse: implement object type filter
Date: Thu, 11 Mar 2021 12:54:38 -0500	[thread overview]
Message-ID: <YEpZXjPxT3npyhhv@coredump.intra.peff.net> (raw)
In-Reply-To: <YEorU+rkX4sgMtKi@ncase>

On Thu, Mar 11, 2021 at 03:38:11PM +0100, Patrick Steinhardt wrote:

> > Those produce very different answers. I guess because in the first one,
> > you still have a bunch of tree objects, too. You'd do much better to get
> > the actual types from cat-file, and filter on that. That also lets you
> > use bitmaps for the traversal portion. E.g.:
> 
> They do provide different answers, and you're right that `--batch-check`
> would have helped to filter by type. Your idea doesn't really work in my
> usecase though to identify LFS pointers, at least not without additional
> tooling on top of what you've provided. There'd at least need to be two
> git-cat-file(1) processes: one to do the `--batch-check` thing to
> actually filter by object type, and one to then read the actual LFS
> pointer candidates from disk in order to see whether they are LFS
> pointers or not.
> 
> Actually, we currently are doing something similar to that at GitLab: we
> list all potential candidates via git-rev-list(1), write the output into
> `git-cat-file --batch-check`, and anything that is a blob then gets
> forwarded into `git-cat-file --batch`.

You'd need that final cat-file with your patch, too, though. So I think
it makes sense to think about "generate the list of blobs" as the
primary action.

You can of course do the type and content dump as a single cat-file, but
in my experience that is much slower (because we waste time dumping
object content that the caller ultimately won't care about).

Thinking in the opposite direction, if we are filtering by type via
cat-file, we could do the size filter there, too. So:

  git rev-list --use-bitmap-index --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname)' |
  perl -lne 'print $2 if /^blob (\d+) (.*)/ && $1 < 200'

which produces the same answer as my earlier:

> >   $ time git rev-list --use-bitmap-index --objects --filter=blob:limit=200 --all |
> >          git cat-file --buffer --batch-check='%(objecttype) %(objectname)' |
> > 	 perl -lne 'print $1 if /^blob (.*)/' | wc -l

but takes about twice as long. Which is really just a roundabout way of
saying that yes, shoving things into "rev-list" can provide substantial
speedups. :)

> > which is faster than what you showed above (this is on linux.git, but my
> > result is different; maybe you have more refs than me?). But we should
> > be able to do better purely internally, so I suspect my computer is just
> > faster (or maybe your extra refs just aren't well-covered by bitmaps).
> > Running with your patches I get:
> 
> I've got quite a beefy machine with a Ryzen 3 5800X, and I did do a `git
> repack -Adfb` right before doig benchmarks. I do have the stable kernel
> repository added though, which accounts for quite a lot of additional
> references (3938) and objects (9.3M).

Yeah, I wondered if it was something like that. Mine is just
torvalds/linux.git. Fetching stable/linux.git from kernel.org, running
"git repack -adb" on the result, and then repeating my timings gets me
numbers close to yours.

> > which is indeed faster. It's quite curious that the answer is not the
> > same, though! I think yours has some bugs. If I sort and diff the
> > results, I see some commits mentioned in the output. Perhaps this is
> > --filter-provided not working, as they all seem to be ref tips.
> 
> I noticed it, too, and couldn't yet find an answer why that is.
> Honestly, I found the NOT_USER_GIVEN flag quite confusing and I'm not at
> all sure whether I've got all cases covered correctly. The previous was
> how this was handled (`USER_GIVEN` instead of `NOT_USER_GIVEN`) would've
> been easier to figure out for this specific usecase. But I guess it was
> converted due to specific reasons.
> 
> I'll invest some more time to figure out what's happening here.

Thanks. I also scratched my head at NOT_USER_GIVEN. I haven't looked at
this part of the filter code very much, but it seems like that is a
recipe for accidentally marking a commit as NOT_USER_GIVEN if we
traverse to it (even if it was originally _also_ given by the user).

-Peff

> > That's probably reasonable, especially because it lets us use bitmaps. I
> > do have a dream that we'll eventually be able to support more extensive
> > formatting via log/rev-list, which would allow:
> > 
> >   git rev-list --use-bitmap-index --objects --all \
> >                --format=%(objecttype) %(objectname) |
> >   perl -ne 'print $1 if /^blob (.*)/'
> > 
> > That should be faster than the separate cat-file (which has to re-lookup
> > each object, in addition to the extra pipe overhead), but I expect the
> > --filter solution should always be faster still, as it can very quickly
> > eliminate the majority of the objects at the bitmap level.
> 
> That'd be nice, even though it wouldn't help in my particular usecase: I
> need to read each candidate blob to see whether it's an LFS pointer or
> not anyway.

I think it works out roughly the same as the --filter solution, in the
sense that both generate a list of candidate blobs that you'd read with
"cat-file --batch" (but of course it's still slower).

-Peff

  reply	other threads:[~2021-03-11 17:55 UTC|newest]

Thread overview: 97+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-01 12:20 [PATCH 0/7] rev-parse: implement object type filter Patrick Steinhardt
2021-03-01 12:20 ` [PATCH 1/7] revision: mark commit parents as NOT_USER_GIVEN Patrick Steinhardt
2021-03-01 12:20 ` [PATCH 2/7] list-objects: move tag processing into its own function Patrick Steinhardt
2021-03-01 12:20 ` [PATCH 3/7] list-objects: support filtering by tag and commit Patrick Steinhardt
2021-03-01 12:20 ` [PATCH 4/7] list-objects: implement object type filter Patrick Steinhardt
2021-03-01 12:20 ` [PATCH 5/7] pack-bitmap: " Patrick Steinhardt
2021-03-01 12:20 ` [PATCH 6/7] pack-bitmap: implement combined filter Patrick Steinhardt
2021-03-01 12:21 ` [PATCH 7/7] rev-list: allow filtering of provided items Patrick Steinhardt
2021-03-10 21:39 ` [PATCH 0/7] rev-parse: implement object type filter Jeff King
2021-03-11 14:38   ` Patrick Steinhardt
2021-03-11 17:54     ` Jeff King [this message]
2021-03-15 11:25   ` Patrick Steinhardt
2021-03-10 21:58 ` Taylor Blau
2021-03-10 22:19   ` Jeff King
2021-03-11 14:43     ` Patrick Steinhardt
2021-03-11 17:56       ` Jeff King
2021-03-15 13:14 ` [PATCH v2 0/8] " Patrick Steinhardt
2021-03-15 13:14   ` [PATCH v2 1/8] uploadpack.txt: document implication of `uploadpackfilter.allow` Patrick Steinhardt
2021-04-06 17:17     ` Jeff King
2021-03-15 13:14   ` [PATCH v2 2/8] revision: mark commit parents as NOT_USER_GIVEN Patrick Steinhardt
2021-04-06 17:30     ` Jeff King
2021-04-09 10:19       ` Patrick Steinhardt
2021-03-15 13:14   ` [PATCH v2 3/8] list-objects: move tag processing into its own function Patrick Steinhardt
2021-04-06 17:39     ` Jeff King
2021-03-15 13:14   ` [PATCH v2 4/8] list-objects: support filtering by tag and commit Patrick Steinhardt
2021-03-15 13:14   ` [PATCH v2 5/8] list-objects: implement object type filter Patrick Steinhardt
2021-04-06 17:42     ` Jeff King
2021-03-15 13:14   ` [PATCH v2 6/8] pack-bitmap: " Patrick Steinhardt
2021-04-06 17:48     ` Jeff King
2021-03-15 13:14   ` [PATCH v2 7/8] pack-bitmap: implement combined filter Patrick Steinhardt
2021-04-06 17:54     ` Jeff King
2021-04-09 10:31       ` Patrick Steinhardt
2021-04-09 15:53         ` Jeff King
2021-04-09 11:17       ` Patrick Steinhardt
2021-04-09 15:55         ` Jeff King
2021-03-15 13:15   ` [PATCH v2 8/8] rev-list: allow filtering of provided items Patrick Steinhardt
2021-04-06 18:04     ` Jeff King
2021-04-09 10:59       ` Patrick Steinhardt
2021-04-09 15:58         ` Jeff King
2021-03-20 21:10   ` [PATCH v2 0/8] rev-parse: implement object type filter Junio C Hamano
2021-04-06 18:08     ` Jeff King
2021-04-09 11:14       ` Patrick Steinhardt
2021-04-09 16:05         ` Jeff King
2021-04-09 11:27   ` [PATCH v3 " Patrick Steinhardt
2021-04-09 11:27     ` [PATCH v3 1/8] uploadpack.txt: document implication of `uploadpackfilter.allow` Patrick Steinhardt
2021-04-09 11:27     ` [PATCH v3 2/8] revision: mark commit parents as NOT_USER_GIVEN Patrick Steinhardt
2021-04-09 11:28     ` [PATCH v3 3/8] list-objects: move tag processing into its own function Patrick Steinhardt
2021-04-09 11:28     ` [PATCH v3 4/8] list-objects: support filtering by tag and commit Patrick Steinhardt
2021-04-11  6:49       ` Junio C Hamano
2021-04-09 11:28     ` [PATCH v3 5/8] list-objects: implement object type filter Patrick Steinhardt
2021-04-09 11:28     ` [PATCH v3 6/8] pack-bitmap: " Patrick Steinhardt
2021-04-09 11:28     ` [PATCH v3 7/8] pack-bitmap: implement combined filter Patrick Steinhardt
2021-04-09 11:28     ` [PATCH v3 8/8] rev-list: allow filtering of provided items Patrick Steinhardt
2021-04-09 11:32       ` [RESEND PATCH " Patrick Steinhardt
2021-04-09 15:00       ` [PATCH " Philip Oakley
2021-04-12 13:15         ` Patrick Steinhardt
2021-04-11  6:02     ` [PATCH v3 0/8] rev-parse: implement object type filter Junio C Hamano
2021-04-12 13:12       ` Patrick Steinhardt
2021-04-12 13:37     ` [PATCH v4 0/8] rev-list: " Patrick Steinhardt
2021-04-12 13:37       ` [PATCH v4 1/8] uploadpack.txt: document implication of `uploadpackfilter.allow` Patrick Steinhardt
2021-04-12 13:37       ` [PATCH v4 2/8] revision: mark commit parents as NOT_USER_GIVEN Patrick Steinhardt
2021-04-12 13:37       ` [PATCH v4 3/8] list-objects: move tag processing into its own function Patrick Steinhardt
2021-04-12 13:37       ` [PATCH v4 4/8] list-objects: support filtering by tag and commit Patrick Steinhardt
2021-04-12 13:37       ` [PATCH v4 5/8] list-objects: implement object type filter Patrick Steinhardt
2021-04-13  9:57         ` Ævar Arnfjörð Bjarmason
2021-04-13 10:43           ` Andreas Schwab
2021-04-14 11:32           ` Patrick Steinhardt
2021-04-12 13:37       ` [PATCH v4 6/8] pack-bitmap: " Patrick Steinhardt
2021-04-12 13:37       ` [PATCH v4 7/8] pack-bitmap: implement combined filter Patrick Steinhardt
2021-04-12 13:37       ` [PATCH v4 8/8] rev-list: allow filtering of provided items Patrick Steinhardt
2021-04-13  7:45       ` [PATCH v4 0/8] rev-list: implement object type filter Jeff King
2021-04-13  8:06         ` Patrick Steinhardt
2021-04-15  9:42           ` Jeff King
2021-04-16 22:06             ` Junio C Hamano
2021-04-16 23:15               ` Junio C Hamano
2021-04-17  1:17                 ` Ramsay Jones
2021-04-17  9:01                   ` Jeff King
2021-04-17 21:45                     ` Junio C Hamano
2021-04-13 21:03         ` Junio C Hamano
2021-04-14 11:59           ` Patrick Steinhardt
2021-04-14 21:07             ` Junio C Hamano
2021-04-15  9:57               ` Jeff King
2021-04-15 17:53                 ` Junio C Hamano
2021-04-15 17:57                   ` Junio C Hamano
2021-04-17  8:58                     ` Jeff King
2021-04-19 11:46       ` [PATCH v5 " Patrick Steinhardt
2021-04-19 11:46         ` [PATCH v5 1/8] uploadpack.txt: document implication of `uploadpackfilter.allow` Patrick Steinhardt
2021-04-19 11:46         ` [PATCH v5 2/8] revision: mark commit parents as NOT_USER_GIVEN Patrick Steinhardt
2021-04-19 11:46         ` [PATCH v5 3/8] list-objects: move tag processing into its own function Patrick Steinhardt
2021-04-19 11:46         ` [PATCH v5 4/8] list-objects: support filtering by tag and commit Patrick Steinhardt
2021-04-19 11:46         ` [PATCH v5 5/8] list-objects: implement object type filter Patrick Steinhardt
2021-04-19 11:46         ` [PATCH v5 6/8] pack-bitmap: " Patrick Steinhardt
2021-04-19 11:47         ` [PATCH v5 7/8] pack-bitmap: implement combined filter Patrick Steinhardt
2021-04-19 11:47         ` [PATCH v5 8/8] rev-list: allow filtering of provided items Patrick Steinhardt
2021-04-19 23:16         ` [PATCH v5 0/8] rev-list: implement object type filter Junio C Hamano
2021-04-23  9:13           ` Jeff King
2021-04-28  2:18             ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YEpZXjPxT3npyhhv@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).