From: Jeff King <peff@peff.net>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: Han Xin <chiyutianyi@gmail.com>,
Junio C Hamano <gitster@pobox.com>,
Git List <git@vger.kernel.org>,
Han Xin <hanxin.hx@alibaba-inc.com>
Subject: Re: [PATCH v2] receive-pack: not receive pack file with large object
Date: Fri, 1 Oct 2021 02:55:37 -0400 [thread overview]
Message-ID: <YVaw6agcPNclhws8@coredump.intra.peff.net> (raw)
In-Reply-To: <87pmsqtb2p.fsf@evledraar.gmail.com>
On Thu, Sep 30, 2021 at 03:42:29PM +0200, Ævar Arnfjörð Bjarmason wrote:
> > From: Han Xin <hanxin.hx@alibaba-inc.com>
> >
> > In addition to using 'receive.maxInputSize' to limit the overall size
> > of the received packfile, a new config variable
> > 'receive.maxInputObjectSize' is added to limit the push of a single
> > object larger than this threshold.
>
> Maybe an unfair knee-jerk reaction: I think we should really be pushing
> this sort of thing into pre-receive hooks and/or the proc-receive hook,
> i.e. see 15d3af5e22e (receive-pack: add new proc-receive hook,
> 2020-08-27).
Yes, this could be done as a hook, but it's actually kind of tricky.
We have a similar max-object-size limit in our fork at GitHub, and it's
implemented inside index-pack (we don't have a match in unpack-objects,
but we always run index-pack on incoming packs).
Unlike the patches here, ours is limiting the logical size of an object
(so a 100MB blob limit is on the uncompressed size). It happens in
sha1_object(), where we have that size.
The main reason we put it there is for performance. index-pack is
already computing the size, so it's free to check it there. But it does
make some things awkward. Rather than just dying with "woah, some object
is too big", it's nice to tell the user "hey, the object 1234abcd at
foo.bin is too big". To do that, we actually collect the list of too-big
objects and index them anyway (remember that we're in the quarantine
directory, so nothing is permanent until later). We write the list to a
".large-objects" file in the quarantine directory. And then hooks are
free to produce a much nicer message (e.g., by running something like
"log --find-objects" to get the path name at which we see the blob).
It would be nice if the hook could just find the objects itself, but
there are some gotchas:
- finding the list of pushed objects is awkward and/or expensive. You
can use rev-list, but that's very costly, as it has to inflate all
of the new trees. You really just want the list of what's in the
quarantine directory, but there's no single command to get that. You
have to run show-index on the incoming pack, plus poke through the
loose object directories.
It would be much easier if "cat-file --batch-all-objects" could be
asked to only look at local objects. That would also allow some
other optimizations to reduce the cost. For instance, even when
iterating packs, cat-file then feeds each sha1 back to
oid_object_info(), even though we already know the exact pack/offset
where it can be found. Likewise, it could be taught some basic
filtering (like "show only objects with size > X") to avoid dumping
millions of oid/size pairs to a separate script to do the filtering.
But all of that would just be approaching the speed of having
index-pack do the check, since it has the size already there.
- it wouldn't help index-pack at all with memory attacks by malicious
pushers. Here somebody accidentally pushed up a big blob, and it
caused unpack-objects to OOM. But I also remember a problem ages ago
where some of the fsck_tree() code had an integer overflow
vulnerability, which could only be triggered by a gigantic tree.
In the patch we use at GitHub, we allow large blobs to make it to
the hook level, but too-large trees, commits, and tags are
unceremoniously aborted with an immediate die(). Real users
accidentally try to push 2GB objects. Trees of that size are no
accident. :)
Having index-pack enforce size limits helps a bit there. And
streaming large blobs helps protect against accidents. But there are
still OOM problems with maliciously constructed inputs, because the
delta code only works on full buffers (so it really wants to
construct the whole object before we get to the sha1_object()
limits). It would be nice if this could stream the output, but I
suspect it would require pretty major surgery to the delta code.
Instead, we've put in a crude limit by having receive-pack set
GIT_ALLOC_LIMIT in the environment, which then ensures that its
child index-pack won't allocate too much (the limit we use is much
higher than the more user-facing object limit, because the point is
just to avoid DoS attacks, and not enforce any kind of policy).
It might be nice to have a config option for setting this limit (and
perhaps even putting it at a reasonable defensive default).
So I do think there are some interesting paths forward for doing more of
this in hooks, but there's work to be done to get there.
> The latter pre-dates c08db5a2d0d (receive-pack: allow a maximum input
> size to be specified, 2016-08-24), which may or may not be relevant
> here.
Yeah, I introduced that limit. It wasn't really about object size or
memory, but just about the fact that without it, a malicious pusher can
just send you bytes forever which the server will write to disk. It also
helps with the most naive kind of large-object attacks, but it wouldn't
help with a cleverly constructed delta.
It is, unfortunately, a pretty unceremonious die(), and has caused more
than one support ticket (especially because GitHub has used 2GB for the
limit for ages, and the practical limit for repositories has been
growing).
So there. That's probably more than anybody wanted to know about push
limits. ;) I'm happy to share any of the code from GitHub's fork in
these areas. The main reason I haven't is just that some of it is just
kind of ugly and may need polishing (e.g., the whole "write to
.large-objects" is really an awful interface).
-Peff
next prev parent reply other threads:[~2021-10-01 6:55 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-09-30 12:10 [PATCH] receive-pack: allow a maximum input object size specified Han Xin
2021-09-30 13:20 ` [PATCH v2] receive-pack: not receive pack file with large object Han Xin
2021-09-30 13:42 ` Ævar Arnfjörð Bjarmason
2021-10-01 2:30 ` Jiang Xin
2021-10-01 6:17 ` Jeff King
2021-10-01 6:55 ` Jeff King [this message]
2021-10-01 18:43 ` Junio C Hamano
2021-09-30 16:49 ` Junio C Hamano
2021-10-01 2:52 ` Jiang Xin
2021-10-01 6:24 ` Jeff King
-- strict thread matches above, loose matches on Subject: below --
2021-10-01 9:16 [PATCH v10 17/17] fsck: report invalid object type-path combinations Ævar Arnfjörð Bjarmason
2021-11-11 3:03 ` [PATCH v2] receive-pack: not receive pack file with large object Han Xin
2021-11-11 18:35 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YVaw6agcPNclhws8@coredump.intra.peff.net \
--to=peff@peff.net \
--cc=avarab@gmail.com \
--cc=chiyutianyi@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=hanxin.hx@alibaba-inc.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).