From: Jeff King <peff@peff.net>
To: "Baumann, Moritz" <moritz.baumann@sap.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: Feature Request: Option to make "git rev-list --objects" output duplicate objects
Date: Tue, 28 Mar 2023 14:32:53 -0400
Message-ID: <20230328183253.GD18558@coredump.intra.peff.net>
In-Reply-To: <AS1PR02MB8185DF947EBC583318481E1994889@AS1PR02MB8185.eurprd02.prod.outlook.com>

On Tue, Mar 28, 2023 at 08:08:02AM +0000, Baumann, Moritz wrote:

> > But if you want
> > to know all of the names touched in a set of commits, I have used
> > something like this before:
> >
> >   git rev-list $new --not --all |
> >   git diff-tree --stdin --format= -r -c --name-only
> 
> Thanks, that looks promising and solves at least one of my use cases. The only
> minor problem is that there seems to be no way to pipe the diff-tree output to
> cat-file without massaging it with awk first.
> 
> I have three use cases in my pre-receive hooks:
> 
> 1. Filters solely based on the file name
>    -> your suggestion works perfectly here
> 2. Filters based only on file contents
>    -> git rev-list --objects + git cat-file provide everything I need
> 3. One filter based on file size and name (forbid large files, with exceptions)
>    -> I'm guessing "git rev-list | git diff-tree --stdin | awk |
>      git cat-file --batch-check" is the best solution to extract the necessary
>      information from git in this case?

Yes, that's how I would do all of those. Having to massage the output
between diff-tree and cat-file is a little annoying, but at least can
still be done in O(1) processes. And you really need some language
capable of parsing cat-file output anyway, so it's not too big of a
lift.
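
A rough sketch of that pipeline for the size-and-name case might look
like the following. Only the rev-list and diff-tree invocation comes from
the quoted text; the helper names, the 10 MiB limit, and the "assets/"
exception are made up for illustration. It also assumes paths without
unusual characters, since diff-tree quotes those unless -z is used.

```shell
#!/bin/sh
# Sketch only: limit and exception pattern are hypothetical examples.
limit=10485760

# Massage `git diff-tree --raw` lines into "<result-oid> <path>" lines.
# Works for both normal (":...") and combined (-c, "::...") raw lines:
# the status is the last space-separated field before the tab, and the
# result oid is the one before it.
raw_to_oid_path() {
    awk -F'\t' '
        /^:/ {
            n = split($1, meta, " ")
            mode = meta[(n - 1) / 2]   # result mode
            oid = meta[n - 1]          # result oid
            # skip deletions (all-zero oid) and submodule entries
            if (oid !~ /^0+$/ && mode != "160000") print oid " " $NF
        }'
}

# Read "<size> <path>" lines and print the paths that are over the
# limit and not covered by the (hypothetical) exception.
report_large() {
    awk -v limit="$1" '
        {
            path = substr($0, length($1) + 2)
            if ($1 + 0 > limit + 0 && path !~ /^assets\//) print path
        }'
}

# $1 is the new tip of a pushed ref, as read from the hook's stdin.
check_push() {
    git rev-list "$1" --not --all |
    git diff-tree --stdin --format= -r -c --raw |
    raw_to_oid_path |
    git cat-file --batch-check='%(objectsize) %(rest)' |
    report_large "$limit"
}
```

The %(rest) placeholder makes cat-file echo back whatever followed the
object name on the input line, which is what lets the path survive the
trip through --batch-check.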

One thing we did at GitHub is teach index-pack to collect a list of
too-big paths (since naturally it knows the size of every blob as it
indexes it), and write them out to a special path. Then our pre-receive
hook could quickly check the list and act on it (warning above a certain
size, rejecting above another), without having to traverse again.

I haven't sent those patches upstream, though, for a few reasons:

  - we only hooked index-pack, not unpack-objects (we only use the
    former, not the latter, but in stock Git you might see either)

  - getting just the object names is kind of awkward. You have to then
    invoke rev-list to find the names of hits (though at least you only
    do so when there is a problem object; the happy case remains fast).

  - it's actually not that helpful to avoid the traversal if you have
    other stuff you want to check anyway (like file contents, or names).
    It was one of those things we optimized a long time ago, and I kind
    of doubt it is doing much good these days.
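
To illustrate the second point, mapping a list of problem object names
back to paths could be sketched roughly as below. The list file and its
one-oid-per-line format are assumptions for the example, not the format
of the unpublished index-pack patches, and the function name is made up.

```shell
#!/bin/sh
# Sketch only: $1 is a hypothetical file with one object name per line.
# Stdin is `git rev-list --objects` output ("<oid> <path>" per line;
# commits have no path and are skipped by the NF > 1 check).
paths_for_oids() {
    awk 'NR == FNR { want[$1] = 1; next } ($1 in want) && NF > 1' "$1" -
}
```

It would be used as something like

  git rev-list --objects $new --not --all | paths_for_oids too-big.txt

where too-big.txt is again a made-up name. Since rev-list lists each
object only once, an object reachable under several paths would show up
with just one of them.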

-Peff
