git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Junio C Hamano <gitster@pobox.com>, "Jansen\,
	Geert" <gerardu@amazon.com>,
	"git\@vger.kernel.org" <git@vger.kernel.org>,
	Christian Couder <christian.couder@gmail.com>,
	Nicolas Pitre <nico@fluxnic.net>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [RFC PATCH] index-pack: improve performance on NFS
Date: Sat, 27 Oct 2018 13:22:16 +0200	[thread overview]
Message-ID: <87lg6jljmf.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <20181027093300.GA23974@sigill.intra.peff.net>


On Sat, Oct 27 2018, Jeff King wrote:

> On Sat, Oct 27, 2018 at 04:26:50PM +0900, Junio C Hamano wrote:
>
>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>>
>> > But as Junio notes the devil's in the details, another one I thought of
>> > is:
>> >
>> >     GIT_OBJECT_DIRECTORY=/some/other/repository git clone ...
>> >
>> > It seems to me that ...
>>
>> Actually I take all of that back ;-)
>>
>> For the purpose of this patch, use of existing .cloning field in the
>> transport is fine, as the sole existing user of the field wants the
>> field to mean "Are we starting with an empty object store?", and not
>> "Are we running the command whose name is 'git clone'?".
>
> Taking one step back, the root problem in this thread is that stat() on
> non-existing files is slow (which makes has_sha1_file slow).
>
> One solution there is to cache the results of looking in .git/objects
> (or any alternate object store) for loose files. And indeed, this whole
> scheme is just a specialized form of that: it's a flag to say "hey, we
> do not have any objects yet, so do not bother looking".
>
> Could we implement that in a more direct and central way? And could we
> implement it in a way that catches more cases? E.g., if I have _one_
> object, that defeats this specialized optimization, but it is probably
> still beneficial to cache that knowledge (and the reasonable cutoff is
> probably not 1, but some value of N loose objects).
>
> Of course any cache raises questions of cache invalidation, but I think
> we've already dealt with that for this case. When we use
> OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
> accuracy/speed tradeoff (which does a similar caching thing with
> packfiles).
>
> So putting that all together, could we have something like:
>
> diff --git a/object-store.h b/object-store.h
> index 63b7605a3e..28cde568a0 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -135,6 +135,18 @@ struct raw_object_store {
>  	 */
>  	struct packed_git *all_packs;
>
> +	/*
> +	 * A set of all loose objects we have. This probably ought to be split
> +	 * into a set of 256 caches so that we can fault in one directory at a
> +	 * time.
> +	 */
> +	struct oid_array loose_cache;
> +	enum {
> +		LOOSE_CACHE_UNFILLED = 0,
> +		LOOSE_CACHE_INVALID,
> +		LOOSE_CACHE_VALID
> +	} loose_cache_status;
> +
>  	/*
>  	 * A fast, rough count of the number of objects in the repository.
>  	 * These two fields are not meant for direct access. Use
> diff --git a/packfile.c b/packfile.c
> index 86074a76e9..68ca4fff0e 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -990,6 +990,8 @@ void reprepare_packed_git(struct repository *r)
>  	r->objects->approximate_object_count_valid = 0;
>  	r->objects->packed_git_initialized = 0;
>  	prepare_packed_git(r);
> +	oid_array_clear(&r->objects->loose_cache);
> +	r->objects->loose_cache_status = LOOSE_CACHE_UNFILLED;
>  }
>
>  struct packed_git *get_packed_git(struct repository *r)
> diff --git a/sha1-file.c b/sha1-file.c
> index dd0b6aa873..edbe037eaa 100644
> --- a/sha1-file.c
> +++ b/sha1-file.c
> @@ -1172,6 +1172,40 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
>  	return parse_sha1_header_extended(hdr, &oi, 0);
>  }
>
> +/* probably should be configurable? */
> +#define LOOSE_OBJECT_CACHE_MAX 65536
> +
> +static int fill_loose_cache(const struct object_id *oid,
> +			    const char *path,
> +			    void *data)
> +{
> +	struct oid_array *cache = data;
> +
> +	if (cache->nr == LOOSE_OBJECT_CACHE_MAX)
> +		return -1;
> +
> +	oid_array_append(data, oid);
> +	return 0;
> +}
> +
> +static int quick_has_loose(struct raw_object_store *r,
> +			   struct object_id *oid)

The assumption with making it exactly 0 objects and not any value of >0
is that we can safely assume that a "clone" or initial "fetch"[1] is
special in ways that a clone isn't. I.e. we're starting out with nothing
and doing the initial population, that's probably not as true in an
existing repo that's getting concurrent fetches, commits, rebases etc.

But in the spirit of taking a step back, maybe we should take two steps
back and consider why we're doing this at all.

Three of our tests fail if we compile git like this, and cloning is much
faster (especially on NFS):

    diff --git a/builtin/index-pack.c b/builtin/index-pack.c
    index 2004e25da2..0c2d008ee0 100644
    --- a/builtin/index-pack.c
    +++ b/builtin/index-pack.c
    @@ -796,3 +796,3 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,

    -       if (startup_info->have_repository) {
    +       if (0) {
                    read_lock();

Even on a local disk I'm doing 262759 lstat() calls cloning git.git and
spending 5% of my time on that.

But why do we have this in the first place? It's because of 8685da4256
("don't ever allow SHA1 collisions to exist by fetching a pack",
2007-03-20) and your 51054177b3 ("index-pack: detect local corruption in
collision check", 2017-04-01).

I.e. we are worried about (and those tests check for):

 a) A malicious user trying to give us repository where they have
    created an object with the same SHA-1 that's different, as in the
    SHAttered attack.

    I remember (but have not dug up) an old E-Mail from Linus saying
    that this was an important security aspect of git, i.e. even if
    SHA-1 was broken you couldn't easily propagate bad objects.

 b) Cases where we've ended up with different content for a SHA-1 due to
    e.g. a local FS corruption. Which is the subject of your commit in
    2017.

 c) Are there cases where fetch.fsckObjects is off and we just flip a
    bit on the wire and don't notice? I think not because we always
    check the pack checksum (don't we), but I'm not 100% sure.

I'm inclined to think that we should at the very least make this
configurable. Running into a) is the least of my worries when operating
some git server on NFS.

The b) case is also a concern, but in that case we'd actually be
improving things by writing out the duplicate object, if that was
followed-up with something like the "null" negotiator[2] and gc/repack
being able to look at the two objects, check their SHA-1/content and
throw away the bad one we'd have the ability to heal a corrupt
repository where we now just produce a hard error.

Even if someone wants to make the argument that this is behavior that we
absolutely *MUST* keep and not make configurable, there's still much
smarter ways to do it.

We could e.g. just unconditionally write out the packfile into a
quarantine environment (see 720dae5a19 ("config doc: elaborate on
fetch.fsckObjects security", 2018-07-27)), *then* loop over the loose
objects and packs we have and see if any of those exist in the new pack,
if they do, do the current assertion, and if not (and fetch.fsckObjects
passes) move it out of the quarantine.

I'm most inclined to say we should just have a config option to disable
this in lieu of fancier solutions. I think a) is entirely implausible
(and I'm not worrying about state-level actors attacking my git repos),
and b) would be no worse than it is today.

1. Although less so for initial fetch, think a) setup bunch of remotes b)
   parallel 'git fetch {}' ::: $(git remote)

2. https://public-inbox.org/git/87o9ciisg6.fsf@evledraar.gmail.com/

  reply	other threads:[~2018-10-27 11:22 UTC|newest]

Thread overview: 99+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-25 18:38 [RFC PATCH] index-pack: improve performance on NFS Jansen, Geert
2018-10-26  0:21 ` Junio C Hamano
2018-10-26 20:38   ` Ævar Arnfjörð Bjarmason
2018-10-27  7:26     ` Junio C Hamano
2018-10-27  9:33       ` Jeff King
2018-10-27 11:22         ` Ævar Arnfjörð Bjarmason [this message]
2018-10-28 22:50           ` [PATCH 0/4] index-pack: optionally turn off SHA-1 collision checking Ævar Arnfjörð Bjarmason
2018-10-30  2:49             ` Geert Jansen
2018-10-30  9:04               ` Junio C Hamano
2018-10-30 18:43             ` [PATCH v2 0/3] index-pack: test updates Ævar Arnfjörð Bjarmason
2018-11-13 20:19               ` [PATCH v3] index-pack: add ability to disable SHA-1 collision check Ævar Arnfjörð Bjarmason
2018-11-14  7:09                 ` Junio C Hamano
2018-11-14 12:40                   ` Ævar Arnfjörð Bjarmason
2018-10-30 18:43             ` [PATCH v2 1/3] pack-objects test: modernize style Ævar Arnfjörð Bjarmason
2018-10-30 18:43             ` [PATCH v2 2/3] pack-objects tests: don't leave test .git corrupt at end Ævar Arnfjörð Bjarmason
2018-10-30 18:43             ` [PATCH v2 3/3] index-pack tests: don't leave test repo dirty " Ævar Arnfjörð Bjarmason
2018-10-28 22:50           ` [PATCH 1/4] pack-objects test: modernize style Ævar Arnfjörð Bjarmason
2018-10-28 22:50           ` [PATCH 2/4] pack-objects tests: don't leave test .git corrupt at end Ævar Arnfjörð Bjarmason
2018-10-28 22:50           ` [PATCH 3/4] index-pack tests: don't leave test repo dirty " Ævar Arnfjörð Bjarmason
2018-10-28 22:50           ` [PATCH 4/4] index-pack: add ability to disable SHA-1 collision check Ævar Arnfjörð Bjarmason
2018-10-29 15:04           ` [RFC PATCH] index-pack: improve performance on NFS Jeff King
2018-10-29 15:09             ` Jeff King
2018-10-29 19:36             ` Ævar Arnfjörð Bjarmason
2018-10-29 23:27               ` Jeff King
2018-11-07 22:55                 ` Geert Jansen
2018-11-08 12:02                   ` Jeff King
2018-11-08 20:58                     ` Geert Jansen
2018-11-08 21:18                       ` Jeff King
2018-11-08 21:55                         ` Geert Jansen
2018-11-08 22:20                     ` Ævar Arnfjörð Bjarmason
2018-11-09 10:11                       ` Ævar Arnfjörð Bjarmason
2018-11-12 14:31                       ` Jeff King
2018-11-12 14:46                     ` [PATCH 0/9] caching loose objects Jeff King
2018-11-12 14:46                       ` [PATCH 1/9] fsck: do not reuse child_process structs Jeff King
2018-11-12 15:26                         ` Derrick Stolee
2018-11-12 14:47                       ` [PATCH 2/9] submodule--helper: prefer strip_suffix() to ends_with() Jeff King
2018-11-12 18:23                         ` Stefan Beller
2018-11-12 14:48                       ` [PATCH 3/9] rename "alternate_object_database" to "object_directory" Jeff King
2018-11-12 15:30                         ` Derrick Stolee
2018-11-12 15:36                           ` Jeff King
2018-11-12 19:41                             ` Ramsay Jones
2018-11-12 14:48                       ` [PATCH 4/9] sha1_file_name(): overwrite buffer instead of appending Jeff King
2018-11-12 15:32                         ` Derrick Stolee
2018-11-12 14:49                       ` [PATCH 5/9] handle alternates paths the same as the main object dir Jeff King
2018-11-12 15:38                         ` Derrick Stolee
2018-11-12 15:46                           ` Jeff King
2018-11-12 15:50                             ` Derrick Stolee
2018-11-12 14:50                       ` [PATCH 6/9] sha1-file: use an object_directory for " Jeff King
2018-11-12 15:48                         ` Derrick Stolee
2018-11-12 16:09                           ` Jeff King
2018-11-12 19:04                             ` Stefan Beller
2018-11-22 17:42                               ` Jeff King
2018-11-12 18:48                           ` Stefan Beller
2018-11-12 14:50                       ` [PATCH 7/9] object-store: provide helpers for loose_objects_cache Jeff King
2018-11-12 19:24                         ` René Scharfe
2018-11-12 20:16                           ` Jeff King
2018-11-12 14:54                       ` [PATCH 8/9] sha1-file: use loose object cache for quick existence check Jeff King
2018-11-12 16:00                         ` Derrick Stolee
2018-11-12 16:01                         ` Ævar Arnfjörð Bjarmason
2018-11-12 16:21                           ` Jeff King
2018-11-12 22:18                             ` Ævar Arnfjörð Bjarmason
2018-11-12 22:30                               ` Ævar Arnfjörð Bjarmason
2018-11-13 10:02                                 ` Ævar Arnfjörð Bjarmason
2018-11-14 18:21                                   ` René Scharfe
2018-12-02 10:52                                   ` René Scharfe
2018-12-03 22:04                                     ` Jeff King
2018-12-04 21:45                                       ` René Scharfe
2018-12-05  4:46                                         ` Jeff King
2018-12-05  6:02                                           ` René Scharfe
2018-12-05  6:51                                             ` Jeff King
2018-12-05  8:15                                               ` Jeff King
2018-12-05 18:41                                                 ` René Scharfe
2018-12-05 20:17                                                   ` Jeff King
2018-11-12 22:44                             ` Geert Jansen
2018-11-27 20:48                         ` René Scharfe
2018-12-01 19:49                           ` Jeff King
2018-11-12 14:55                       ` [PATCH 9/9] fetch-pack: drop custom loose object cache Jeff King
2018-11-12 19:25                         ` René Scharfe
2018-11-12 19:32                           ` Ævar Arnfjörð Bjarmason
2018-11-12 20:07                             ` Jeff King
2018-11-12 20:13                             ` René Scharfe
2018-11-12 16:02                       ` [PATCH 0/9] caching loose objects Derrick Stolee
2018-11-12 19:10                         ` Stefan Beller
2018-11-09 13:43                   ` [RFC PATCH] index-pack: improve performance on NFS Ævar Arnfjörð Bjarmason
2018-11-09 16:08                     ` Duy Nguyen
2018-11-10 14:04                       ` Ævar Arnfjörð Bjarmason
2018-11-12 14:34                         ` Jeff King
2018-11-12 22:58                     ` Geert Jansen
2018-10-27 14:04         ` Duy Nguyen
2018-10-29 15:18           ` Jeff King
2018-10-29  0:48         ` Junio C Hamano
2018-10-29 15:20           ` Jeff King
2018-10-29 18:43             ` Ævar Arnfjörð Bjarmason
2018-10-29 21:34           ` Geert Jansen
2018-10-29 21:50             ` Jeff King
2018-10-29 22:21               ` Geert Jansen
2018-10-29 22:27             ` Jeff King
2018-10-29 22:35               ` Stefan Beller
2018-10-29 23:29                 ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87lg6jljmf.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=gerardu@amazon.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=nico@fluxnic.net \
    --cc=peff@peff.net \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).