git@vger.kernel.org mailing list mirror (one of many)
* [RFC PATCH] index-pack: improve performance on NFS
@ 2018-10-25 18:38 Jansen, Geert
  2018-10-26  0:21 ` Junio C Hamano
  0 siblings, 1 reply; 87+ messages in thread
From: Jansen, Geert @ 2018-10-25 18:38 UTC
  To: git

The index-pack command determines if a sha1 collision test is needed by
checking the existence of a loose object with the given hash. In my tests, I
can improve performance of “git clone” on Amazon EFS by 8x when used with a
non-default mount option (lookupcache=pos) that's required for a Gitlab HA
setup. My assumption is that this check is unnecessary when cloning into a new
repository because the repository will be empty.

By default, the Linux NFS client will cache directory entries as well as the
non-existence of directory entries. The latter means that when client c1 does
stat() on a file that does not exist, the non-existence will be cached and any
subsequent stat() operation on the file will return -ENOENT until the cache
expires or is invalidated, even if the file was created on client c2 in the
mean time. This leads to errors in a Gitlab HA setup when it distributes jobs
over multiple worker nodes assuming each worker node has the same view of the
shared file system.
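
To make the stale -ENOENT behaviour concrete, here is a toy model of a
client with and without a negative lookup cache. It is only a sketch: it is
not NFS client code, it has no cache expiry or invalidation, and every name
in it (struct server, struct client, client_stat, ...) is made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_NAMES 16

/* Toy "server": the authoritative set of file names that exist. */
struct server {
	const char *names[MAX_NAMES];
	int nr;
};

/* Toy "client": remembers names it was told do not exist. */
struct client {
	const char *negative[MAX_NAMES];
	int nr;
	bool cache_negative; /* true ~ lookupcache=all, false ~ lookupcache=pos */
	int lookups_sent;    /* round trips to the server */
};

static bool name_in(const char *const *set, int nr, const char *name)
{
	for (int i = 0; i < nr; i++)
		if (!strcmp(set[i], name))
			return true;
	return false;
}

/* Another client creating a file is modelled as a direct server update. */
static void server_create(struct server *s, const char *name)
{
	assert(s->nr < MAX_NAMES);
	s->names[s->nr++] = name;
}

/* Returns true if this client believes the file exists. */
static bool client_stat(struct client *c, const struct server *s,
			const char *name)
{
	if (c->cache_negative && name_in(c->negative, c->nr, name))
		return false; /* answered from the negative cache, maybe stale */

	c->lookups_sent++;
	if (name_in(s->names, s->nr, name))
		return true;

	if (c->cache_negative && c->nr < MAX_NAMES)
		c->negative[c->nr++] = name;
	return false;
}
```

In this model a client with cache_negative set keeps answering "missing"
after server_create() has run, while a client without it sees the new file
at the cost of one lookup per call.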

The recommended workaround by Gitlab is to use the “lookupcache=pos” NFS mount
option which disables the negative lookup cache. This option has a high
performance impact. Cloning the gitlab-ce repository
(https://gitlab.com/gitlab-org/gitlab-ce.git) into an NFS-mounted directory
gives the following results:

  lookupcache=all (default): 624 seconds
  lookupcache=pos: 4957 seconds

The reason for the poor performance is that index-pack will issue a stat()
call for every object in the repo when checking if a collision test is needed.
These stat() calls result in the following NFS operations:

  LOOKUP dirfh=".git/objects", name="01" -> NFS4ERR_ENOENT

With lookupcache=all, the non-existence of the .git/objects/XX directories is
cached, so there will be at most 256 LOOKUP calls. With lookupcache=pos,
there will be one LOOKUP operation for every object in the repository, which
for the gitlab-ce repo is about 1.3 million.
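
The "at most 256" versus "one per object" difference can be sketched as a
small counting model. This assumes a fresh repository in which none of the
.git/objects/XX directories exist yet; fanout_bucket() and count_lookups()
are hypothetical names, not git functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Map the first two hex digits of an object ID to 0..255, or -1 if malformed. */
static int fanout_bucket(const char *hex_oid)
{
	static const char digits[] = "0123456789abcdef";
	const char *hi, *lo;

	assert(hex_oid);
	if (strlen(hex_oid) < 2)
		return -1;
	hi = strchr(digits, hex_oid[0]);
	lo = strchr(digits, hex_oid[1]);
	if (!hi || !lo)
		return -1;
	return (int)((hi - digits) * 16 + (lo - digits));
}

/*
 * Count the LOOKUPs sent for the XX directories when every object is
 * probed once and no XX directory exists. With a negative cache each
 * missing directory costs one LOOKUP; without it every probe does.
 */
static size_t count_lookups(const char *const *hex_oids, size_t nr,
			    bool negative_cache)
{
	bool seen[256] = { false };
	size_t lookups = 0;

	for (size_t i = 0; i < nr; i++) {
		int bucket = fanout_bucket(hex_oids[i]);

		if (bucket < 0)
			continue; /* skip malformed IDs */
		if (negative_cache && seen[bucket])
			continue; /* served from the negative cache */
		seen[bucket] = true;
		lookups++;
	}
	return lookups;
}
```

With negative caching the result is bounded by the 256 possible two-digit
prefixes; without it the result equals the number of objects probed.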

The attached patch removes the collision check when cloning into a new
repository. The performance of git clone with this patch is:

  lookupcache=pos (with patch): 577 seconds

I'd welcome feedback on the attached patch and whether my assumption that the
sha1 collision check can be safely omitted when cloning into a new repository
is correct.

Signed-off-by: Geert Jansen <gerardu@amazon.com>
---
 builtin/index-pack.c | 5 ++++-
 fetch-pack.c         | 2 ++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 2004e25da..22b3d40fb 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -84,6 +84,7 @@ static int verbose;
 static int show_resolving_progress;
 static int show_stat;
 static int check_self_contained_and_connected;
+static int cloning;
 
 static struct progress *progress;
 
@@ -794,7 +795,7 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
 
 	assert(data || obj_entry);
 
-	if (startup_info->have_repository) {
+	if (startup_info->have_repository && !cloning) {
 		read_lock();
 		collision_test_needed =
 			has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);
@@ -1705,6 +1706,8 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 				check_self_contained_and_connected = 1;
 			} else if (!strcmp(arg, "--fsck-objects")) {
 				do_fsck_object = 1;
+			} else if (!strcmp(arg, "--cloning")) {
+				cloning = 1;
 			} else if (!strcmp(arg, "--verify")) {
 				verify = 1;
 			} else if (!strcmp(arg, "--verify-stat")) {
diff --git a/fetch-pack.c b/fetch-pack.c
index b3ed7121b..c75bfb8aa 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -843,6 +843,8 @@ static int get_pack(struct fetch_pack_args *args,
 			argv_array_push(&cmd.args, "--check-self-contained-and-connected");
 		if (args->from_promisor)
 			argv_array_push(&cmd.args, "--promisor");
+		if (args->cloning)
+			argv_array_push(&cmd.args, "--cloning");
 	}
 	else {
 		cmd_name = "unpack-objects";
-- 
2.19.1.328.g5a0cc8aca.dirty




* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-25 18:38 [RFC PATCH] index-pack: improve performance on NFS Jansen, Geert
@ 2018-10-26  0:21 ` Junio C Hamano
  2018-10-26 20:38   ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 87+ messages in thread
From: Junio C Hamano @ 2018-10-26  0:21 UTC
  To: "Jansen, Geert"; +Cc: git

"Jansen, Geert" <gerardu@amazon.com> writes:

> The index-pack command determines if a sha1 collision test is needed by
> checking the existence of a loose object with the given hash. In my tests, I
> can improve performance of “git clone” on Amazon EFS by 8x when used with a
> non-default mount option (lookupcache=pos) that's required for a Gitlab HA
> setup. My assumption is that this check is unnecessary when cloning into a new
> repository because the repository will be empty.

My knee-jerk reaction is that your insight that we can skip the "dup
check" when starting from emptiness is probably correct, but your
use of .cloning flag as an approximation of "are we starting from
emptiness?" is probably wrong, at least for two reasons.

 - "git clone --reference=..." does not strictly start from
   emptiness, and would still have to make sure that incoming pack
   does not try to inject an object with different contents but with
   the same name.

 - "git init && git fetch ..." starts from emptiness and would want
   to benefit from the same optimization as you are implementing
   here.

As to the implementation, I think the patch adjusts the right "if()"
condition to skip the collision test here.

> -	if (startup_info->have_repository) {
> +	if (startup_info->have_repository && !cloning) {
>  		read_lock();
>  		collision_test_needed =
>  			has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);

I just do not think !cloning is quite correct.

Thanks.


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-26  0:21 ` Junio C Hamano
@ 2018-10-26 20:38   ` Ævar Arnfjörð Bjarmason
  2018-10-27  7:26     ` Junio C Hamano
  0 siblings, 1 reply; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-26 20:38 UTC
  To: Junio C Hamano; +Cc: "Jansen, Geert", git, Christian Couder


On Fri, Oct 26 2018, Junio C Hamano wrote:

> "Jansen, Geert" <gerardu@amazon.com> writes:
>
>> The index-pack command determines if a sha1 collision test is needed by
>> checking the existence of a loose object with the given hash. In my tests, I
>> can improve performance of “git clone” on Amazon EFS by 8x when used with a
>> non-default mount option (lookupcache=pos) that's required for a Gitlab HA
>> setup. My assumption is that this check is unnecessary when cloning into a new
>> repository because the repository will be empty.
>
> My knee-jerk reaction is that your insight that we can skip the "dup
> check" when starting from emptiness is probably correct, but your
> use of .cloning flag as an approximation of "are we starting from
> emptiness?" is probably wrong, at least for two reasons.
>
>  - "git clone --reference=..." does not strictly start from
>    emptiness, and would still have to make sure that incoming pack
>    does not try to inject an object with different contents but with
>    the same name.
>
>  - "git init && git fetch ..." starts from emptiness and would want
>    to benefit from the same optimization as you are implementing
>    here.
>
> As to the implementation, I think the patch adjusts the right "if()"
> condition to skip the collision test here.
>
>> -	if (startup_info->have_repository) {
>> +	if (startup_info->have_repository && !cloning) {
>>  		read_lock();
>>  		collision_test_needed =
>>  			has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);
>
> I just do not think !cloning is quite correct.

Geert: Thanks for working on this. A GitLab instance I'm involved in
managing at work has suffered from this issue, e.g. with "fork" being a
"clone" under the hood on GitLab, and taking ages on the NetApp NFS
filer due to this issue, so I'm very interested in this moving forward.

But as Junio notes the devil's in the details, another one I thought of
is:

    GIT_OBJECT_DIRECTORY=/some/other/repository git clone ...

It seems to me that it's better to stick this into
setup_git_directory_gently() in setup.c and check various edge cases
there. I.e. make this a new "have_objects_already" member of the
startup_info struct.

That would be set depending on whether we find objects/packs in the
objects dir, and would know about GIT_OBJECT_DIRECTORY (either just
punting, or looking at those too). It would also need to know about
read_info_alternates(), depending on when that's checked it would handle
git clone --reference too since it just sets it up via
add_to_alternates_file().

Then you'd change sha1_object() to just check
startup_info->have_objects_already instead of
startup_info->have_repository, and since we'd checked "did we have
objects already?" it would work for the init && fetch case Junio
mentioned.
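
As a rough sketch of what such a probe might look like, assuming we only
look at a single objects directory: the helper below reports whether any
two-hex-digit fan-out directory or the pack/ directory is non-empty. The
function names are hypothetical, and it deliberately ignores
GIT_OBJECT_DIRECTORY and alternates, which a real version would have to
handle.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* 1 if `dir` has an entry other than "." and "..", 0 if empty or unreadable. */
static int dir_has_entries(const char *dir)
{
	DIR *d = opendir(dir);
	struct dirent *e;
	int found = 0;

	if (!d)
		return 0;
	while (!found && (e = readdir(d)))
		if (strcmp(e->d_name, ".") && strcmp(e->d_name, ".."))
			found = 1;
	closedir(d);
	return found;
}

/* 1 if `name` is exactly two hex digits, like the objects/XX directories. */
static int is_fanout_name(const char *name)
{
	return isxdigit((unsigned char)name[0]) &&
	       isxdigit((unsigned char)name[1]) && name[2] == '\0';
}

/*
 * Hypothetical helper: 1 if `objdir` appears to hold loose objects or
 * anything under pack/, 0 otherwise. Over-long paths are skipped.
 */
static int objects_dir_has_objects(const char *objdir)
{
	char path[4096];
	DIR *d;
	struct dirent *e;
	int found = 0;

	assert(objdir);
	d = opendir(objdir);
	if (!d)
		return 0;
	while (!found && (e = readdir(d))) {
		int n;

		if (!is_fanout_name(e->d_name) && strcmp(e->d_name, "pack"))
			continue;
		n = snprintf(path, sizeof(path), "%s/%s", objdir, e->d_name);
		if (n < 0 || (size_t)n >= sizeof(path))
			continue;
		found = dir_has_entries(path);
	}
	closedir(d);
	return found;
}
```

Treating any file under pack/ as "has objects" is the conservative choice
here: a false positive only costs the existing collision check.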

It would also be very nice to have a test for this, even if it's
something OS-specific that only works on Linux after we probe for
strace(1).

Testing (without your patch, because git-am barfs on it; it seems to
dislike the base64 encoding):

    rm -rf /tmp/df
    strace -f git clone --bare git@github.com:git/git.git /tmp/df 2>&1 |
        grep '".*ENOENT' 2>&1 |
        perl -pe 's/^.*?"([^"]+)".*/$1/' |
        sort | uniq -c | sort -nr | less

I see we also check if packed-refs exists ~2800 times, and check for
each ref we find on the remote. Those are obviously less of a
performance issue when cloning in the common case, but perhaps another
place where we can insert a "don't check, we don't have anything
already" condition.


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-26 20:38   ` Ævar Arnfjörð Bjarmason
@ 2018-10-27  7:26     ` Junio C Hamano
  2018-10-27  9:33       ` Jeff King
  0 siblings, 1 reply; 87+ messages in thread
From: Junio C Hamano @ 2018-10-27  7:26 UTC
  To: Ævar Arnfjörð Bjarmason
  Cc: "Jansen, Geert", git, Christian Couder

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> But as Junio notes the devil's in the details, another one I thought of
> is:
>
>     GIT_OBJECT_DIRECTORY=/some/other/repository git clone ...
>
> It seems to me that ...

Actually I take all of that back ;-)

For the purpose of this patch, use of existing .cloning field in the
transport is fine, as the sole existing user of the field wants the
field to mean "Are we starting with an empty object store?", and not
"Are we running the command whose name is 'git clone'?".

Now, the logic used to set the field to true may have room to be
improved.  We should do that as a separate and orthogonal effort so
that the two cases I mentioned plus the new one(s) you brought up
would also be taken into account, so that we can set the .cloning
field more accurately to suit the two callers' needs---one caller
is the age-old one added by beea4152 ("clone: support remote shallow
repository", 2013-12-05), and the other one is what Geert is adding
in this thread.

We _may_ want to rename that transport.cloning field to better
reflect what it truly means (it is not "are we cloning?" but "are
there any objects in the repo to worry about?") as a preparatory
step before Geert's patch, but I do not think we should make it a
requirement to take the "let's skip collision check" patch to
improve the logic used to set that .cloning field.



* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-27  7:26     ` Junio C Hamano
@ 2018-10-27  9:33       ` Jeff King
  2018-10-27 11:22         ` Ævar Arnfjörð Bjarmason
                           ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Jeff King @ 2018-10-27  9:33 UTC
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, Jansen, Geert, git,
	Christian Couder

On Sat, Oct 27, 2018 at 04:26:50PM +0900, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> 
> > But as Junio notes the devil's in the details, another one I thought of
> > is:
> >
> >     GIT_OBJECT_DIRECTORY=/some/other/repository git clone ...
> >
> > It seems to me that ...
> 
> Actually I take all of that back ;-)
> 
> For the purpose of this patch, use of existing .cloning field in the
> transport is fine, as the sole existing user of the field wants the
> field to mean "Are we starting with an empty object store?", and not
> "Are we running the command whose name is 'git clone'?".

Taking one step back, the root problem in this thread is that stat() on
non-existing files is slow (which makes has_sha1_file slow).

One solution there is to cache the results of looking in .git/objects
(or any alternate object store) for loose files. And indeed, this whole
scheme is just a specialized form of that: it's a flag to say "hey, we
do not have any objects yet, so do not bother looking".

Could we implement that in a more direct and central way? And could we
implement it in a way that catches more cases? E.g., if I have _one_
object, that defeats this specialized optimization, but it is probably
still beneficial to cache that knowledge (and the reasonable cutoff is
probably not 1, but some value of N loose objects).

Of course any cache raises questions of cache invalidation, but I think
we've already dealt with that for this case. When we use
OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
accuracy/speed tradeoff (which does a similar caching thing with
packfiles).

So putting that all together, could we have something like:

diff --git a/object-store.h b/object-store.h
index 63b7605a3e..28cde568a0 100644
--- a/object-store.h
+++ b/object-store.h
@@ -135,6 +135,18 @@ struct raw_object_store {
 	 */
 	struct packed_git *all_packs;
 
+	/*
+	 * A set of all loose objects we have. This probably ought to be split
+	 * into a set of 256 caches so that we can fault in one directory at a
+	 * time.
+	 */
+	struct oid_array loose_cache;
+	enum {
+		LOOSE_CACHE_UNFILLED = 0,
+		LOOSE_CACHE_INVALID,
+		LOOSE_CACHE_VALID
+	} loose_cache_status;
+
 	/*
 	 * A fast, rough count of the number of objects in the repository.
 	 * These two fields are not meant for direct access. Use
diff --git a/packfile.c b/packfile.c
index 86074a76e9..68ca4fff0e 100644
--- a/packfile.c
+++ b/packfile.c
@@ -990,6 +990,8 @@ void reprepare_packed_git(struct repository *r)
 	r->objects->approximate_object_count_valid = 0;
 	r->objects->packed_git_initialized = 0;
 	prepare_packed_git(r);
+	oid_array_clear(&r->objects->loose_cache);
+	r->objects->loose_cache_status = LOOSE_CACHE_UNFILLED;
 }
 
 struct packed_git *get_packed_git(struct repository *r)
diff --git a/sha1-file.c b/sha1-file.c
index dd0b6aa873..edbe037eaa 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -1172,6 +1172,40 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
 	return parse_sha1_header_extended(hdr, &oi, 0);
 }
 
+/* probably should be configurable? */
+#define LOOSE_OBJECT_CACHE_MAX 65536
+
+static int fill_loose_cache(const struct object_id *oid,
+			    const char *path,
+			    void *data)
+{
+	struct oid_array *cache = data;
+
+	if (cache->nr == LOOSE_OBJECT_CACHE_MAX)
+		return -1;
+
+	oid_array_append(data, oid);
+	return 0;
+}
+
+static int quick_has_loose(struct raw_object_store *r,
+			   struct object_id *oid)
+{
+	struct oid_array *cache = &r->loose_cache;
+
+	if (r->loose_cache_status == LOOSE_CACHE_UNFILLED) {
+		if (for_each_loose_object(fill_loose_cache, cache, 0) < 0)
+			r->loose_cache_status = LOOSE_CACHE_INVALID;
+		else
+			r->loose_cache_status = LOOSE_CACHE_VALID;
+	}
+
+	if (r->loose_cache_status == LOOSE_CACHE_INVALID)
+		return -1;
+
+	return oid_array_lookup(cache, oid) >= 0;
+}
+
 static int sha1_loose_object_info(struct repository *r,
 				  const unsigned char *sha1,
 				  struct object_info *oi, int flags)
@@ -1198,6 +1232,19 @@ static int sha1_loose_object_info(struct repository *r,
 	if (!oi->typep && !oi->type_name && !oi->sizep && !oi->contentp) {
 		const char *path;
 		struct stat st;
+		if (!oi->disk_sizep && (flags & OBJECT_INFO_QUICK)) {
+			struct object_id oid;
+			hashcpy(oid.hash, sha1);
+			switch (quick_has_loose(r->objects, &oid)) {
+			case 0:
+				return -1; /* missing: error */
+			case 1:
+				return 0; /* have: 0 == success */
+			default:
+				/* unknown; fall back to stat */
+				break;
+			}
+		}
 		if (stat_sha1_file(r, sha1, &st, &path) < 0)
 			return -1;
 		if (oi->disk_sizep)

That's mostly untested, but it might be enough to run some timing tests
with. I think if we want to pursue this, we'd want to address the bits I
mentioned in the comments, and look at unifying this with the loose
cache from cc817ca3ef (which if I had remembered we added, probably
would have saved some time writing the above ;) ).
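
For anyone who wants to poke at the idea outside of git, the three-state
logic can be sketched standalone as below. This is a toy under stated
assumptions: the "directory listing" is passed in as an array instead of
coming from for_each_loose_object(), the cap is tiny so the overflow path is
easy to hit, and all toy_* names are invented here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define OID_LEN 20       /* raw SHA-1 size */
#define TOY_CACHE_MAX 4  /* tiny cap for illustration; the patch uses 65536 */

enum cache_status { CACHE_UNFILLED = 0, CACHE_INVALID, CACHE_VALID };

struct toy_oid { unsigned char hash[OID_LEN]; };

struct toy_cache {
	struct toy_oid oids[TOY_CACHE_MAX];
	size_t nr;
	enum cache_status status;
};

static int oid_cmp(const void *a, const void *b)
{
	return memcmp(((const struct toy_oid *)a)->hash,
		      ((const struct toy_oid *)b)->hash, OID_LEN);
}

/* Fill from a listing; give up (INVALID) if there are too many objects. */
static void toy_cache_fill(struct toy_cache *c,
			   const struct toy_oid *loose, size_t nr_loose)
{
	if (nr_loose > TOY_CACHE_MAX) {
		c->nr = 0;
		c->status = CACHE_INVALID;
		return;
	}
	if (nr_loose)
		memcpy(c->oids, loose, nr_loose * sizeof(*loose));
	c->nr = nr_loose;
	qsort(c->oids, c->nr, sizeof(c->oids[0]), oid_cmp);
	c->status = CACHE_VALID;
}

/*
 * 1: have it, 0: missing as of fill time, -1: unknown (caller falls
 * back to stat()). The listing is read only on the first call.
 */
static int toy_quick_has_loose(struct toy_cache *c,
			       const struct toy_oid *loose, size_t nr_loose,
			       const struct toy_oid *oid)
{
	assert(c && oid);
	if (c->status == CACHE_UNFILLED)
		toy_cache_fill(c, loose, nr_loose);
	if (c->status == CACHE_INVALID)
		return -1;
	return bsearch(oid, c->oids, c->nr, sizeof(c->oids[0]),
		       oid_cmp) != NULL;
}
```

The point of the -1 return is that a too-large object store degrades to the
current stat()-per-object behaviour rather than to a wrong answer.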

-Peff


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-27  9:33       ` Jeff King
@ 2018-10-27 11:22         ` Ævar Arnfjörð Bjarmason
  2018-10-28 22:50           ` [PATCH 0/4] index-pack: optionally turn off SHA-1 collision checking Ævar Arnfjörð Bjarmason
                             ` (5 more replies)
  2018-10-27 14:04         ` Duy Nguyen
  2018-10-29  0:48         ` Junio C Hamano
  2 siblings, 6 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-27 11:22 UTC
  To: Jeff King
  Cc: Junio C Hamano, "Jansen, Geert", git,
	Christian Couder, Nicolas Pitre, Linus Torvalds


On Sat, Oct 27 2018, Jeff King wrote:

> On Sat, Oct 27, 2018 at 04:26:50PM +0900, Junio C Hamano wrote:
>
>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>>
>> > But as Junio notes the devil's in the details, another one I thought of
>> > is:
>> >
>> >     GIT_OBJECT_DIRECTORY=/some/other/repository git clone ...
>> >
>> > It seems to me that ...
>>
>> Actually I take all of that back ;-)
>>
>> For the purpose of this patch, use of existing .cloning field in the
>> transport is fine, as the sole existing user of the field wants the
>> field to mean "Are we starting with an empty object store?", and not
>> "Are we running the command whose name is 'git clone'?".
>
> Taking one step back, the root problem in this thread is that stat() on
> non-existing files is slow (which makes has_sha1_file slow).
>
> One solution there is to cache the results of looking in .git/objects
> (or any alternate object store) for loose files. And indeed, this whole
> scheme is just a specialized form of that: it's a flag to say "hey, we
> do not have any objects yet, so do not bother looking".
>
> Could we implement that in a more direct and central way? And could we
> implement it in a way that catches more cases? E.g., if I have _one_
> object, that defeats this specialized optimization, but it is probably
> still beneficial to cache that knowledge (and the reasonable cutoff is
> probably not 1, but some value of N loose objects).
>
> Of course any cache raises questions of cache invalidation, but I think
> we've already dealt with that for this case. When we use
> OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
> accuracy/speed tradeoff (which does a similar caching thing with
> packfiles).
>
> So putting that all together, could we have something like:
>
> diff --git a/object-store.h b/object-store.h
> index 63b7605a3e..28cde568a0 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -135,6 +135,18 @@ struct raw_object_store {
>  	 */
>  	struct packed_git *all_packs;
>
> +	/*
> +	 * A set of all loose objects we have. This probably ought to be split
> +	 * into a set of 256 caches so that we can fault in one directory at a
> +	 * time.
> +	 */
> +	struct oid_array loose_cache;
> +	enum {
> +		LOOSE_CACHE_UNFILLED = 0,
> +		LOOSE_CACHE_INVALID,
> +		LOOSE_CACHE_VALID
> +	} loose_cache_status;
> +
>  	/*
>  	 * A fast, rough count of the number of objects in the repository.
>  	 * These two fields are not meant for direct access. Use
> diff --git a/packfile.c b/packfile.c
> index 86074a76e9..68ca4fff0e 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -990,6 +990,8 @@ void reprepare_packed_git(struct repository *r)
>  	r->objects->approximate_object_count_valid = 0;
>  	r->objects->packed_git_initialized = 0;
>  	prepare_packed_git(r);
> +	oid_array_clear(&r->objects->loose_cache);
> +	r->objects->loose_cache_status = LOOSE_CACHE_UNFILLED;
>  }
>
>  struct packed_git *get_packed_git(struct repository *r)
> diff --git a/sha1-file.c b/sha1-file.c
> index dd0b6aa873..edbe037eaa 100644
> --- a/sha1-file.c
> +++ b/sha1-file.c
> @@ -1172,6 +1172,40 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
>  	return parse_sha1_header_extended(hdr, &oi, 0);
>  }
>
> +/* probably should be configurable? */
> +#define LOOSE_OBJECT_CACHE_MAX 65536
> +
> +static int fill_loose_cache(const struct object_id *oid,
> +			    const char *path,
> +			    void *data)
> +{
> +	struct oid_array *cache = data;
> +
> +	if (cache->nr == LOOSE_OBJECT_CACHE_MAX)
> +		return -1;
> +
> +	oid_array_append(data, oid);
> +	return 0;
> +}
> +
> +static int quick_has_loose(struct raw_object_store *r,
> +			   struct object_id *oid)

The assumption with making it exactly 0 objects and not any value of >0
is that we can safely assume that a "clone" or initial "fetch"[1] is
special in ways that a later fetch isn't. I.e. we're starting out with
nothing and doing the initial population, that's probably not as true in
an existing repo that's getting concurrent fetches, commits, rebases etc.

But in the spirit of taking a step back, maybe we should take two steps
back and consider why we're doing this at all.

Three of our tests fail if we compile git like this, and cloning is much
faster (especially on NFS):

    diff --git a/builtin/index-pack.c b/builtin/index-pack.c
    index 2004e25da2..0c2d008ee0 100644
    --- a/builtin/index-pack.c
    +++ b/builtin/index-pack.c
    @@ -796,3 +796,3 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,

    -       if (startup_info->have_repository) {
    +       if (0) {
                    read_lock();

Even on a local disk I'm doing 262759 lstat() calls cloning git.git and
spending 5% of my time on that.

But why do we have this in the first place? It's because of 8685da4256
("don't ever allow SHA1 collisions to exist by fetching a pack",
2007-03-20) and your 51054177b3 ("index-pack: detect local corruption in
collision check", 2017-04-01).

I.e. we are worried about (and those tests check for):

 a) A malicious user trying to give us a repository where they have
    created an object with the same SHA-1 that's different, as in the
    SHAttered attack.

    I remember (but have not dug up) an old E-Mail from Linus saying
    that this was an important security aspect of git, i.e. even if
    SHA-1 was broken you couldn't easily propagate bad objects.

 b) Cases where we've ended up with different content for a SHA-1 due to
    e.g. a local FS corruption. Which is the subject of your commit in
    2017.

 c) Are there cases where fetch.fsckObjects is off and we just flip a
    bit on the wire and don't notice? I think not because we always
    check the pack checksum (don't we?), but I'm not 100% sure.
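
In simplified form, what a) and b) guard against boils down to the
comparison sketched here: when an object with the incoming ID already
exists, the bytes must match. This is a sketch only; the real check in
index-pack also involves the object type and reads the existing object from
the object store, and the names below are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum collision_result { NO_EXISTING_OBJECT, SAME_CONTENT, COLLISION };

/*
 * Compare an incoming object against the existing object with the
 * same ID, if any. `existing` is NULL when no such object is present.
 */
static enum collision_result check_incoming(const void *existing,
					    size_t existing_len,
					    const void *incoming,
					    size_t incoming_len)
{
	assert(incoming);
	if (!existing)
		return NO_EXISTING_OBJECT; /* nothing to compare against */
	if (existing_len != incoming_len ||
	    memcmp(existing, incoming, incoming_len))
		return COLLISION; /* same name, different bytes */
	return SAME_CONTENT;
}
```

Skipping the existence probe is equivalent to always taking the
NO_EXISTING_OBJECT branch, which is only safe if the object store really
has nothing under that name.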

I'm inclined to think that we should at the very least make this
configurable. Running into a) is the least of my worries when operating
some git server on NFS.

The b) case is also a concern, but there we'd actually be improving
things by writing out the duplicate object. If that were followed up
with something like the "null" negotiator[2], and gc/repack were able to
look at the two objects, check their SHA-1/content and throw away the
bad one, we'd have the ability to heal a corrupt repository where we now
just produce a hard error.

Even if someone wants to make the argument that this is behavior that we
absolutely *MUST* keep and not make configurable, there are still much
smarter ways to do it.

We could e.g. just unconditionally write out the packfile into a
quarantine environment (see 720dae5a19 ("config doc: elaborate on
fetch.fsckObjects security", 2018-07-27)), *then* loop over the loose
objects and packs we have and see if any of those exist in the new pack,
if they do, do the current assertion, and if not (and fetch.fsckObjects
passes) move it out of the quarantine.

I'm most inclined to say we should just have a config option to disable
this in lieu of fancier solutions. I think a) is entirely implausible
(and I'm not worrying about state-level actors attacking my git repos),
and b) would be no worse than it is today.

1. Although less so for initial fetch, think a) set up a bunch of remotes
   b) parallel 'git fetch {}' ::: $(git remote)

2. https://public-inbox.org/git/87o9ciisg6.fsf@evledraar.gmail.com/


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-27  9:33       ` Jeff King
  2018-10-27 11:22         ` Ævar Arnfjörð Bjarmason
@ 2018-10-27 14:04         ` Duy Nguyen
  2018-10-29 15:18           ` Jeff King
  2018-10-29  0:48         ` Junio C Hamano
  2 siblings, 1 reply; 87+ messages in thread
From: Duy Nguyen @ 2018-10-27 14:04 UTC
  To: Jeff King
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason, gerardu,
	Git Mailing List, Christian Couder

On Sat, Oct 27, 2018 at 11:34 AM Jeff King <peff@peff.net> wrote:
> Taking one step back, the root problem in this thread is that stat() on
> non-existing files is slow (which makes has_sha1_file slow).
>
> One solution there is to cache the results of looking in .git/objects
> (or any alternate object store) for loose files. And indeed, this whole
> scheme is just a specialized form of that: it's a flag to say "hey, we
> do not have any objects yet, so do not bother looking".
>
> Could we implement that in a more direct and central way? And could we
> implement it in a way that catches more cases? E.g., if I have _one_
> object, that defeats this specialized optimization, but it is probably
> still beneficial to cache that knowledge (and the reasonable cutoff is
> probably not 1, but some value of N loose objects).

And we do hit this on normal git-fetch. The larger the received pack
is (e.g. if you don't fetch often), the more stat() we do, so your
approach is definitely better.

> Of course any cache raises questions of cache invalidation, but I think
> we've already dealt with that for this case. When we use
> OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
> accuracy/speed tradeoff (which does a similar caching thing with
> packfiles).

We don't care about a separate process adding more loose objects while
index-pack is running, do we? I'm guessing we don't but just to double
check...

> So putting that all together, could we have something like:
>
> diff --git a/object-store.h b/object-store.h
> index 63b7605a3e..28cde568a0 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -135,6 +135,18 @@ struct raw_object_store {
>          */
>         struct packed_git *all_packs;
>
> +       /*
> +        * A set of all loose objects we have. This probably ought to be split
> +        * into a set of 256 caches so that we can fault in one directory at a
> +        * time.
> +        */
> +       struct oid_array loose_cache;
> +       enum {
> +               LOOSE_CACHE_UNFILLED = 0,
> +               LOOSE_CACHE_INVALID,
> +               LOOSE_CACHE_VALID
> +       } loose_cache_status;
> +
>         /*
>          * A fast, rough count of the number of objects in the repository.
>          * These two fields are not meant for direct access. Use
> diff --git a/packfile.c b/packfile.c
> index 86074a76e9..68ca4fff0e 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -990,6 +990,8 @@ void reprepare_packed_git(struct repository *r)
>         r->objects->approximate_object_count_valid = 0;
>         r->objects->packed_git_initialized = 0;
>         prepare_packed_git(r);
> +       oid_array_clear(&r->objects->loose_cache);
> +       r->objects->loose_cache_status = LOOSE_CACHE_UNFILLED;
>  }
>
>  struct packed_git *get_packed_git(struct repository *r)
> diff --git a/sha1-file.c b/sha1-file.c
> index dd0b6aa873..edbe037eaa 100644
> --- a/sha1-file.c
> +++ b/sha1-file.c
> @@ -1172,6 +1172,40 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
>         return parse_sha1_header_extended(hdr, &oi, 0);
>  }
>
> +/* probably should be configurable? */

Yes, perhaps with gc.auto config value (multiplied by 256) as the cut
point. If it's too big maybe just go with a bloom filter. For this
particular case we expect like 99% of calls to miss.
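
A minimal sketch of the bloom filter variant, assuming we reuse slices of
the object ID as the hash values (IDs are already uniformly distributed).
The sizes are illustrative and not tuned, and the names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OID_LEN 20
#define BLOOM_BITS (1u << 16) /* illustrative size, not tuned */
#define BLOOM_HASHES 4        /* uses the first 16 bytes of the ID */

struct bloom { uint8_t bits[BLOOM_BITS / 8]; };

/* Use the k-th 4-byte slice of the object ID as the k-th hash value. */
static uint32_t bloom_index(const unsigned char *oid, int k)
{
	uint32_t v;

	assert(k >= 0 && (k + 1) * 4 <= OID_LEN);
	memcpy(&v, oid + k * 4, sizeof(v));
	return v % BLOOM_BITS;
}

static void bloom_add(struct bloom *b, const unsigned char *oid)
{
	for (int k = 0; k < BLOOM_HASHES; k++) {
		uint32_t i = bloom_index(oid, k);

		b->bits[i / 8] |= (uint8_t)(1u << (i % 8));
	}
}

/* 0: definitely absent; 1: possibly present, caller should stat(). */
static int bloom_maybe_has(const struct bloom *b, const unsigned char *oid)
{
	for (int k = 0; k < BLOOM_HASHES; k++) {
		uint32_t i = bloom_index(oid, k);

		if (!(b->bits[i / 8] & (1u << (i % 8))))
			return 0;
	}
	return 1;
}
```

Since a "0" answer is exact and nearly all lookups are expected to miss,
almost every stat() would be avoided; only the rare "1" answers need to
fall back to the filesystem.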

> +#define LOOSE_OBJECT_CACHE_MAX 65536
> +
> +static int fill_loose_cache(const struct object_id *oid,
> +                           const char *path,
> +                           void *data)
> +{
> +       struct oid_array *cache = data;
> +
> +       if (cache->nr == LOOSE_OBJECT_CACHE_MAX)
> +               return -1;
> +
> +       oid_array_append(data, oid);
> +       return 0;
> +}
> +
> +static int quick_has_loose(struct raw_object_store *r,
> +                          struct object_id *oid)
> +{
> +       struct oid_array *cache = &r->loose_cache;
> +
> +       if (r->loose_cache_status == LOOSE_CACHE_UNFILLED) {
> +               if (for_each_loose_object(fill_loose_cache, cache, 0) < 0)
> +                       r->loose_cache_status = LOOSE_CACHE_INVALID;
> +               else
> +                       r->loose_cache_status = LOOSE_CACHE_VALID;
> +       }
> +
> +       if (r->loose_cache_status == LOOSE_CACHE_INVALID)
> +               return -1;
> +
> +       return oid_array_lookup(cache, oid) >= 0;
> +}
> +
>  static int sha1_loose_object_info(struct repository *r,
>                                   const unsigned char *sha1,
>                                   struct object_info *oi, int flags)
> @@ -1198,6 +1232,19 @@ static int sha1_loose_object_info(struct repository *r,
>         if (!oi->typep && !oi->type_name && !oi->sizep && !oi->contentp) {
>                 const char *path;
>                 struct stat st;
> +               if (!oi->disk_sizep && (flags & OBJECT_INFO_QUICK)) {
> +                       struct object_id oid;
> +                       hashcpy(oid.hash, sha1);
> +                       switch (quick_has_loose(r->objects, &oid)) {
> +                       case 0:
> +                               return -1; /* missing: error */
> +                       case 1:
> +                               return 0; /* have: 0 == success */
> +                       default:
> +                               /* unknown; fall back to stat */
> +                               break;
> +                       }
> +               }
>                 if (stat_sha1_file(r, sha1, &st, &path) < 0)
>                         return -1;
>                 if (oi->disk_sizep)
>
> That's mostly untested, but it might be enough to run some timing tests
> with. I think if we want to pursue this, we'd want to address the bits I
> mentioned in the comments, and look at unifying this with the loose
> cache from cc817ca3ef (which if I had remembered we added, probably
> would have saved some time writing the above ;) ).
>
> -Peff
-- 
Duy


* [PATCH 0/4] index-pack: optionally turn off SHA-1 collision checking
  2018-10-27 11:22         ` Ævar Arnfjörð Bjarmason
@ 2018-10-28 22:50           ` Ævar Arnfjörð Bjarmason
  2018-10-30  2:49             ` Geert Jansen
                               ` (4 more replies)
  2018-10-28 22:50           ` [PATCH 1/4] pack-objects test: modernize style Ævar Arnfjörð Bjarmason
                             ` (4 subsequent siblings)
  5 siblings, 5 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-28 22:50 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Geert Jansen, Christian Couder,
	Nicolas Pitre, Linus Torvalds, Petr Baudis,
	Ævar Arnfjörð Bjarmason

This patch series implements what I suggested in
https://public-inbox.org/git/87lg6jljmf.fsf@evledraar.gmail.com/

It's not a replacement for what Geert Jansen's RFC in
https://public-inbox.org/git/ED25E182-C296-4D08-8170-340567D8964A@amazon.com/
does, which was to turn this off entirely on "clone".

I left the door open for that in the new config option 4/4 implements,
but I suspect for Geert's purposes this is something he'd prefer to
turn off in git on clone entirely, i.e. because it may be running on
some random Amazon customer's EFS instance, and they won't know
about this new core.checkCollisions option.

But maybe I'm wrong about that and Geert is happy to just turn on
core.checkCollisions=false and use this series instead.

Ævar Arnfjörð Bjarmason (4):
  pack-objects test: modernize style
  pack-objects tests: don't leave test .git corrupt at end
  index-pack tests: don't leave test repo dirty at end
  index-pack: add ability to disable SHA-1 collision check

 Documentation/config.txt     | 68 ++++++++++++++++++++++++++++++++++++
 builtin/index-pack.c         |  7 ++--
 cache.h                      |  1 +
 config.c                     | 20 +++++++++++
 config.h                     |  1 +
 environment.c                |  1 +
 t/README                     |  3 ++
 t/t1060-object-corruption.sh | 37 +++++++++++++++++++-
 t/t5300-pack-object.sh       | 51 +++++++++++++++------------
 9 files changed, 163 insertions(+), 26 deletions(-)

-- 
2.19.1.759.g500967bb5e


* [PATCH 1/4] pack-objects test: modernize style
  2018-10-27 11:22         ` Ævar Arnfjörð Bjarmason
  2018-10-28 22:50           ` [PATCH 0/4] index-pack: optionally turn off SHA-1 collision checking Ævar Arnfjörð Bjarmason
@ 2018-10-28 22:50           ` Ævar Arnfjörð Bjarmason
  2018-10-28 22:50           ` [PATCH 2/4] pack-objects tests: don't leave test .git corrupt at end Ævar Arnfjörð Bjarmason
                             ` (3 subsequent siblings)
  5 siblings, 0 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-28 22:50 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Geert Jansen, Christian Couder,
	Nicolas Pitre, Linus Torvalds, Petr Baudis,
	Ævar Arnfjörð Bjarmason

Modernize the quoting and indentation style of two tests added in
8685da4256 ("don't ever allow SHA1 collisions to exist by fetching a
pack", 2007-03-20), and of a subsequent one added in
4614043c8f ("index-pack: use streaming interface for collision test on
large blobs", 2012-05-24) which had copied the style of the first two.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/t5300-pack-object.sh | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 6c620cd540..a0309e4bab 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -475,22 +475,22 @@ test_expect_success 'pack-objects in too-many-packs mode' '
 # two tests at the end of this file.
 #
 
-test_expect_success \
-    'fake a SHA1 hash collision' \
-    'long_a=$(git hash-object a | sed -e "s!^..!&/!") &&
-     long_b=$(git hash-object b | sed -e "s!^..!&/!") &&
-     test -f	.git/objects/$long_b &&
-     cp -f	.git/objects/$long_a \
-		.git/objects/$long_b'
+test_expect_success 'fake a SHA1 hash collision' '
+	long_a=$(git hash-object a | sed -e "s!^..!&/!") &&
+	long_b=$(git hash-object b | sed -e "s!^..!&/!") &&
+	test -f	.git/objects/$long_b &&
+	cp -f	.git/objects/$long_a \
+		.git/objects/$long_b
+'
 
-test_expect_success \
-    'make sure index-pack detects the SHA1 collision' \
-    'test_must_fail git index-pack -o bad.idx test-3.pack 2>msg &&
-     test_i18ngrep "SHA1 COLLISION FOUND" msg'
+test_expect_success 'make sure index-pack detects the SHA1 collision' '
+	test_must_fail git index-pack -o bad.idx test-3.pack 2>msg &&
+	test_i18ngrep "SHA1 COLLISION FOUND" msg
+'
 
-test_expect_success \
-    'make sure index-pack detects the SHA1 collision (large blobs)' \
-    'test_must_fail git -c core.bigfilethreshold=1 index-pack -o bad.idx test-3.pack 2>msg &&
-     test_i18ngrep "SHA1 COLLISION FOUND" msg'
+test_expect_success 'make sure index-pack detects the SHA1 collision (large blobs)' '
+	test_must_fail git -c core.bigfilethreshold=1 index-pack -o bad.idx test-3.pack 2>msg &&
+	test_i18ngrep "SHA1 COLLISION FOUND" msg
+'
 
 test_done
-- 
2.19.1.759.g500967bb5e


* [PATCH 2/4] pack-objects tests: don't leave test .git corrupt at end
  2018-10-27 11:22         ` Ævar Arnfjörð Bjarmason
  2018-10-28 22:50           ` [PATCH 0/4] index-pack: optionally turn off SHA-1 collision checking Ævar Arnfjörð Bjarmason
  2018-10-28 22:50           ` [PATCH 1/4] pack-objects test: modernize style Ævar Arnfjörð Bjarmason
@ 2018-10-28 22:50           ` Ævar Arnfjörð Bjarmason
  2018-10-28 22:50           ` [PATCH 3/4] index-pack tests: don't leave test repo dirty " Ævar Arnfjörð Bjarmason
                             ` (2 subsequent siblings)
  5 siblings, 0 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-28 22:50 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Geert Jansen, Christian Couder,
	Nicolas Pitre, Linus Torvalds, Petr Baudis,
	Ævar Arnfjörð Bjarmason

Change the pack-objects tests to not leave their .git directory
corrupt at the end.

In 2fca19fbb5 ("fix multiple issues with t5300", 2010-02-03) a comment
was added warning against adding any subsequent tests, but since
4614043c8f ("index-pack: use streaming interface for collision test on
large blobs", 2012-05-24) the comment has drifted away from the code,
mentioning two tests, when we actually have three.

Instead of having this warning let's just create a new .git directory
specifically for these tests.

As an aside, it would be interesting to instrument the test suite to
run a "git fsck" at the very end (in "test_done"). That would have
errored before this change, and may find other issues #leftoverbits.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/t5300-pack-object.sh | 37 ++++++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index a0309e4bab..410a09b0dd 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -468,29 +468,32 @@ test_expect_success 'pack-objects in too-many-packs mode' '
 	git fsck
 '
 
-#
-# WARNING!
-#
-# The following test is destructive.  Please keep the next
-# two tests at the end of this file.
-#
-
-test_expect_success 'fake a SHA1 hash collision' '
-	long_a=$(git hash-object a | sed -e "s!^..!&/!") &&
-	long_b=$(git hash-object b | sed -e "s!^..!&/!") &&
-	test -f	.git/objects/$long_b &&
-	cp -f	.git/objects/$long_a \
-		.git/objects/$long_b
+test_expect_success 'setup: fake a SHA1 hash collision' '
+	git init corrupt &&
+	(
+		cd corrupt &&
+		long_a=$(git hash-object -w ../a | sed -e "s!^..!&/!") &&
+		long_b=$(git hash-object -w ../b | sed -e "s!^..!&/!") &&
+		test -f	.git/objects/$long_b &&
+		cp -f	.git/objects/$long_a \
+			.git/objects/$long_b
+	)
 '
 
 test_expect_success 'make sure index-pack detects the SHA1 collision' '
-	test_must_fail git index-pack -o bad.idx test-3.pack 2>msg &&
-	test_i18ngrep "SHA1 COLLISION FOUND" msg
+	(
+		cd corrupt &&
+		test_must_fail git index-pack -o ../bad.idx ../test-3.pack 2>msg &&
+		test_i18ngrep "SHA1 COLLISION FOUND" msg
+	)
 '
 
 test_expect_success 'make sure index-pack detects the SHA1 collision (large blobs)' '
-	test_must_fail git -c core.bigfilethreshold=1 index-pack -o bad.idx test-3.pack 2>msg &&
-	test_i18ngrep "SHA1 COLLISION FOUND" msg
+	(
+		cd corrupt &&
+		test_must_fail git -c core.bigfilethreshold=1 index-pack -o ../bad.idx ../test-3.pack 2>msg &&
+		test_i18ngrep "SHA1 COLLISION FOUND" msg
+	)
 '
 
 test_done
-- 
2.19.1.759.g500967bb5e


* [PATCH 3/4] index-pack tests: don't leave test repo dirty at end
  2018-10-27 11:22         ` Ævar Arnfjörð Bjarmason
                             ` (2 preceding siblings ...)
  2018-10-28 22:50           ` [PATCH 2/4] pack-objects tests: don't leave test .git corrupt at end Ævar Arnfjörð Bjarmason
@ 2018-10-28 22:50           ` " Ævar Arnfjörð Bjarmason
  2018-10-28 22:50           ` [PATCH 4/4] index-pack: add ability to disable SHA-1 collision check Ævar Arnfjörð Bjarmason
  2018-10-29 15:04           ` [RFC PATCH] index-pack: improve performance on NFS Jeff King
  5 siblings, 0 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-28 22:50 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Geert Jansen, Christian Couder,
	Nicolas Pitre, Linus Torvalds, Petr Baudis,
	Ævar Arnfjörð Bjarmason

Change a test added in 51054177b3 ("index-pack: detect local
corruption in collision check", 2017-04-01) so that the repository
isn't left dirty at the end.

Due to the caveats explained in 720dae5a19 ("config doc: elaborate on
fetch.fsckObjects security", 2018-07-27) even a "fetch" that fails
will write to the local object store, so let's copy the bit-error test
directory before running this test.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/t1060-object-corruption.sh | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/t/t1060-object-corruption.sh b/t/t1060-object-corruption.sh
index ac1f189fd2..4feb65157d 100755
--- a/t/t1060-object-corruption.sh
+++ b/t/t1060-object-corruption.sh
@@ -117,8 +117,10 @@ test_expect_failure 'clone --local detects misnamed objects' '
 '
 
 test_expect_success 'fetch into corrupted repo with index-pack' '
+	cp -R bit-error bit-error-cp &&
+	test_when_finished "rm -rf bit-error-cp" &&
 	(
-		cd bit-error &&
+		cd bit-error-cp &&
 		test_must_fail git -c transfer.unpackLimit=1 \
 			fetch ../no-bit-error 2>stderr &&
 		test_i18ngrep ! -i collision stderr
-- 
2.19.1.759.g500967bb5e


* [PATCH 4/4] index-pack: add ability to disable SHA-1 collision check
  2018-10-27 11:22         ` Ævar Arnfjörð Bjarmason
                             ` (3 preceding siblings ...)
  2018-10-28 22:50           ` [PATCH 3/4] index-pack tests: don't leave test repo dirty " Ævar Arnfjörð Bjarmason
@ 2018-10-28 22:50           ` Ævar Arnfjörð Bjarmason
  2018-10-29 15:04           ` [RFC PATCH] index-pack: improve performance on NFS Jeff King
  5 siblings, 0 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-28 22:50 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Geert Jansen, Christian Couder,
	Nicolas Pitre, Linus Torvalds, Petr Baudis,
	Ævar Arnfjörð Bjarmason

Add a new core.checkCollisions setting. On by default, it can be set
to 'false' to disable the check for existing objects in sha1_object().

As noted in the documentation being added here this is done out of
paranoia about future SHA-1 collisions and as a canary (redundant to
"git fsck") for local object corruption.

For the history of SHA-1 collision checking see:

 - 5c2a7fbc36 ("[PATCH] SHA1 naive collision checking", 2005-04-13)

 - f864ba7448 ("Fix read-cache.c collission check logic.", 2005-04-13)

 - aac1794132 ("Improve sha1 object file writing.", 2005-05-03)

 - 8685da4256 ("don't ever allow SHA1 collisions to exist by fetching
   a pack", 2007-03-20)

 - 1421c5f274 ("write_loose_object: don't bother trying to read an old
   object", 2008-06-16)

 - 51054177b3 ("index-pack: detect local corruption in collision
   check", 2017-04-01)

As seen when going through that history there used to be a way to turn
this off at compile-time by using the -DCOLLISION_CHECK=0 option (see
f864ba7448), but this check later went away in favor of general "don't
write if exists" logic for loose objects, and was then brought back
for remotely fetched packs in 8685da4256.

I plan to turn this off by default in my own settings since I'll
appreciate the performance improvement, and because I think worrying
about SHA-1 collisions is insane paranoia. But others might disagree,
so the check is still on by default.

Also add a "GIT_TEST_CHECK_COLLISIONS" setting so the entire test
suite can be exercised with the collision check turned off.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config.txt     | 68 ++++++++++++++++++++++++++++++++++++
 builtin/index-pack.c         |  7 ++--
 cache.h                      |  1 +
 config.c                     | 20 +++++++++++
 config.h                     |  1 +
 environment.c                |  1 +
 t/README                     |  3 ++
 t/t1060-object-corruption.sh | 33 +++++++++++++++++
 t/t5300-pack-object.sh       | 10 ++++--
 9 files changed, 138 insertions(+), 6 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 552827935a..0192fc84a9 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -461,6 +461,74 @@ core.untrackedCache::
 	properly on your system.
 	See linkgit:git-update-index[1]. `keep` by default.
 
+core.checkCollisions::
+	When missing or set to `default` Git will assert when writing
+	a given object that it doesn't exist already anywhere else in
+	the object store (also accounting for
+	`GIT_ALTERNATE_OBJECT_DIRECTORIES` et al, see
+	linkgit:git[1]).
++
+The reasons for why this is on by default are:
++
+--
+. If there's ever a new SHA-1 collision attack similar to the
+  SHAttered attack (see https://shattered.io) Git can't be fooled into
+  replacing an existing known-good object with a new one with the same
+  SHA-1.
++
+Note that Git by default is built with a hardened version of SHA-1
+function with collision detection for attacks like the SHAttered
+attack (see link:technical/hash-function-transition.html[the hash
+function transition documentation]), but new future attacks might not
+be detected by the hardened SHA-1 code.
+
+. It serves as a canary for detecting some instances of repository
+  corruption. The type and size of the existing and new objects are
+  compared, if they differ Git will panic and abort. This can happen
+  e.g. if a loose object's content has been truncated or otherwise
+  mangled by filesystem corruption.
+--
++
+The reasons to disable this are, respectively:
++
+--
+. Doing the "does this object exist already?" check can be expensive,
+  and it's always cheaper to do nothing.
++
+Even on a very fast local disk (e.g. SSD) cloning a repository like
+git.git spends around 5% of its time just in `lstat()`. This
+percentage can get much higher (up to even hundreds of percents!) on
+network filesystems like NFS where metadata operations can be much
+slower.
++
+This is because with the collision check every object in an incoming
+packfile must be checked against any existing packfiles, as well as
+the loose object store (most of the `lstat()` time is spent on the
+latter). Git doesn't guarantee that some concurrent process isn't
+writing to the same repository during a `clone`. The same sort of
+slowdowns can be seen when doing a big fetch (lots of objects to write
+out).
+
+. If you have a corrupt local repository this check can prevent
+  repairing it by fetching a known-good version of the same object
+  from a remote repository. See the "repair a corrupted repo with
+  index-pack" test in the `t1060-object-corruption.sh` test in the git
+  source code.
+--
++
+Consider turning this off if you're more concerned about performance
+than you are about hypothetical future SHA-1 collisions or object
+corruption (linkgit:git-fsck[1] will also catch object
+corruption). This setting can also be disabled during specific
+phases/commands that can be bottlenecks, e.g. with `git -c
+core.checkCollisions=false clone [...]` for an initial clone on NFS.
++
+Setting this to `false` will disable object collision
+checking. I.e. the value can either be "default" or a boolean. Other
+values might be added in the future (e.g. for selectively disabling
+this just for "clone"), but now any non-boolean non-"default" values
+error out.
+
 core.checkStat::
 	When missing or is set to `default`, many fields in the stat
 	structure are checked to detect if a file has been modified
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 2004e25da2..4a3508aa9f 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -791,23 +791,24 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
 {
 	void *new_data = NULL;
 	int collision_test_needed = 0;
+	int do_coll_check = git_config_get_collision_check();
 
 	assert(data || obj_entry);
 
-	if (startup_info->have_repository) {
+	if (do_coll_check && startup_info->have_repository) {
 		read_lock();
 		collision_test_needed =
 			has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);
 		read_unlock();
 	}
 
-	if (collision_test_needed && !data) {
+	if (do_coll_check && collision_test_needed && !data) {
 		read_lock();
 		if (!check_collison(obj_entry))
 			collision_test_needed = 0;
 		read_unlock();
 	}
-	if (collision_test_needed) {
+	if (do_coll_check && collision_test_needed) {
 		void *has_data;
 		enum object_type has_type;
 		unsigned long has_size;
diff --git a/cache.h b/cache.h
index f7fabdde8f..cf0b69133e 100644
--- a/cache.h
+++ b/cache.h
@@ -858,6 +858,7 @@ extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
 extern unsigned long big_file_threshold;
 extern unsigned long pack_size_limit_cfg;
+extern int check_collisions;
 
 /*
  * Accessors for the core.sharedrepository config which lazy-load the value
diff --git a/config.c b/config.c
index 4051e38823..a93e74f399 100644
--- a/config.c
+++ b/config.c
@@ -1362,6 +1362,14 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.checkcollisions")) {
+		if (!strcasecmp(value, "default"))
+			check_collisions = 1;
+		else
+			check_collisions = git_config_bool(var, value);
+		return 0;
+	}
+
 	/* Add other config variables here and to Documentation/config.txt. */
 	return 0;
 }
@@ -2307,6 +2315,18 @@ int git_config_get_index_threads(void)
 	return 0; /* auto */
 }
 
+int git_config_get_collision_check(void)
+{
+	static int checked_env = 0;
+	if (!checked_env) {
+		checked_env = 1;
+		int v = git_env_bool("GIT_TEST_CHECK_COLLISIONS", -1);
+		if (v != -1)
+			check_collisions = v;
+	}
+	return check_collisions;
+}
+
 NORETURN
 void git_die_config_linenr(const char *key, const char *filename, int linenr)
 {
diff --git a/config.h b/config.h
index a06027e69b..4c6f6d9ae4 100644
--- a/config.h
+++ b/config.h
@@ -251,6 +251,7 @@ extern int git_config_get_split_index(void);
 extern int git_config_get_max_percent_split_change(void);
 extern int git_config_get_fsmonitor(void);
 extern int git_config_get_index_threads(void);
+extern int git_config_get_collision_check(void);
 
 /* This dies if the configured or default date is in the future */
 extern int git_config_get_expiry(const char *key, const char **output);
diff --git a/environment.c b/environment.c
index 3f3c8746c2..0a1512bee6 100644
--- a/environment.c
+++ b/environment.c
@@ -21,6 +21,7 @@
 int trust_executable_bit = 1;
 int trust_ctime = 1;
 int check_stat = 1;
+int check_collisions = 1;
 int has_symlinks = 1;
 int minimum_abbrev = 4, default_abbrev = -1;
 int ignore_case;
diff --git a/t/README b/t/README
index 8847489640..050abe85ad 100644
--- a/t/README
+++ b/t/README
@@ -343,6 +343,9 @@ of the index for the whole test suite by bypassing the default number of
 cache entries and thread minimums. Setting this to 1 will make the
 index loading single threaded.
 
+GIT_TEST_CHECK_COLLISIONS=<boolean> exercises the
+core.checkCollisions=false codepath.
+
 Naming Tests
 ------------
 
diff --git a/t/t1060-object-corruption.sh b/t/t1060-object-corruption.sh
index 4feb65157d..87e395d2ba 100755
--- a/t/t1060-object-corruption.sh
+++ b/t/t1060-object-corruption.sh
@@ -117,6 +117,7 @@ test_expect_failure 'clone --local detects misnamed objects' '
 '
 
 test_expect_success 'fetch into corrupted repo with index-pack' '
+	sane_unset GIT_TEST_CHECK_COLLISIONS &&
 	cp -R bit-error bit-error-cp &&
 	test_when_finished "rm -rf bit-error-cp" &&
 	(
@@ -127,4 +128,36 @@ test_expect_success 'fetch into corrupted repo with index-pack' '
 	)
 '
 
+test_expect_success 'repair a corrupted repo with index-pack' '
+	sane_unset GIT_TEST_CHECK_COLLISIONS &&
+	cp -R bit-error bit-error-cp &&
+	test_when_finished "rm -rf bit-error-cp" &&
+	(
+		cd bit-error-cp &&
+
+		# Have the corrupt object still and fsck complains
+		test_must_fail git cat-file blob HEAD:content.t &&
+		test_must_fail git fsck 2>stderr &&
+		test_i18ngrep "corrupt or missing" stderr &&
+
+		# Fetch the new object (as a pack). The transfer.unpackLimit=1
+		# setting here is important, we must end up with a pack, not a
+		# loose object. The latter would fail due to "exists? Do not
+		# bother" semantics unrelated to the collision check.
+		git -c transfer.unpackLimit=1 \
+			-c core.checkCollisions=false \
+			fetch ../no-bit-error 2>stderr &&
+
+		# fsck still complains, but we have the non-corrupt object
+		# (we lookup in packs first)
+		test_must_fail git fsck 2>stderr &&
+		test_i18ngrep "corrupt or missing" stderr &&
+		git cat-file blob HEAD:content.t &&
+
+		# A "gc" will remove the now-redundant and corrupt object
+		git gc &&
+		git fsck
+	)
+'
+
 test_done
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 410a09b0dd..ca109fff84 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -481,18 +481,22 @@ test_expect_success 'setup: fake a SHA1 hash collision' '
 '
 
 test_expect_success 'make sure index-pack detects the SHA1 collision' '
+	sane_unset GIT_TEST_CHECK_COLLISIONS &&
 	(
 		cd corrupt &&
-		test_must_fail git index-pack -o ../bad.idx ../test-3.pack 2>msg &&
-		test_i18ngrep "SHA1 COLLISION FOUND" msg
+		test_must_fail git index-pack -o good.idx ../test-3.pack 2>msg &&
+		test_i18ngrep "SHA1 COLLISION FOUND" msg &&
+		git -c core.checkCollisions=false index-pack -o good.idx ../test-3.pack
 	)
 '
 
 test_expect_success 'make sure index-pack detects the SHA1 collision (large blobs)' '
+	sane_unset GIT_TEST_CHECK_COLLISIONS &&
 	(
 		cd corrupt &&
 		test_must_fail git -c core.bigfilethreshold=1 index-pack -o ../bad.idx ../test-3.pack 2>msg &&
-		test_i18ngrep "SHA1 COLLISION FOUND" msg
+		test_i18ngrep "SHA1 COLLISION FOUND" msg &&
+		git -c core.checkCollisions=false -c core.bigfilethreshold=1 index-pack -o good.idx ../test-3.pack
 	)
 '
 
-- 
2.19.1.759.g500967bb5e


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-27  9:33       ` Jeff King
  2018-10-27 11:22         ` Ævar Arnfjörð Bjarmason
  2018-10-27 14:04         ` Duy Nguyen
@ 2018-10-29  0:48         ` Junio C Hamano
  2018-10-29 15:20           ` Jeff King
  2018-10-29 21:34           ` Geert Jansen
  2 siblings, 2 replies; 87+ messages in thread
From: Junio C Hamano @ 2018-10-29  0:48 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Jansen, Geert, git,
	Christian Couder

Jeff King <peff@peff.net> writes:

> Of course any cache raises questions of cache invalidation, but I think
> we've already dealt with that for this case. When we use
> OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
> accuracy/speed tradeoff (which does a similar caching thing with
> packfiles).
>
> So putting that all together, could we have something like:

I think this conceptually is a vast improvement relative to
".cloning" optimization.  Obviously this does not have the huge
downside of the other approach that turns the collision detection
completely off.

A real question is how much performance gain, relative to ".cloning"
thing, this approach gives us.  If it gives us 80% or more of the
gain compared to doing no checking, I'd say we have a clear winner.

> That's mostly untested, but it might be enough to run some timing tests
> with. I think if we want to pursue this, we'd want to address the bits I
> mentioned in the comments, and look at unifying this with the loose
> cache from cc817ca3ef (which if I had remembered we added, probably
> would have saved some time writing the above ;) ).

Yup.

* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-27 11:22         ` Ævar Arnfjörð Bjarmason
                             ` (4 preceding siblings ...)
  2018-10-28 22:50           ` [PATCH 4/4] index-pack: add ability to disable SHA-1 collision check Ævar Arnfjörð Bjarmason
@ 2018-10-29 15:04           ` Jeff King
  2018-10-29 15:09             ` Jeff King
  2018-10-29 19:36             ` Ævar Arnfjörð Bjarmason
  5 siblings, 2 replies; 87+ messages in thread
From: Jeff King @ 2018-10-29 15:04 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, Jansen, Geert, git, Christian Couder,
	Nicolas Pitre, Linus Torvalds

On Sat, Oct 27, 2018 at 01:22:16PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > Taking one step back, the root problem in this thread is that stat() on
> > non-existing files is slow (which makes has_sha1_file slow).
> >
> > One solution there is to cache the results of looking in .git/objects
> > (or any alternate object store) for loose files. And indeed, this whole
> > scheme is just a specialized form of that: it's a flag to say "hey, we
> > do not have any objects yet, so do not bother looking".
> >
> > Could we implement that in a more direct and central way? And could we
> > implement it in a way that catches more cases? E.g., if I have _one_
> > object, that defeats this specialized optimization, but it is probably
> > still beneficial to cache that knowledge (and the reasonable cutoff is
> > probably not 1, but some value of N loose objects).
> [...]
> 
> The assumption with making it exactly 0 objects and not any value of >0
> is that we can safely assume that a "clone" or initial "fetch"[1] is
> special in ways that a clone isn't. I.e. we're starting out with nothing
> and doing the initial population, that's probably not as true in an
> existing repo that's getting concurrent fetches, commits, rebases etc.

I assume you mean s/that a clone isn't/that a fetch isn't/.

I agree there are cases where you might be able to go further if you
assume a full "0". But my point is that "clone" is an ambiguous concept,
and it doesn't map completely to what's actually slow here. So if you
only look at "are we cloning", then:

  - you have a bunch of cases which are "clones", but aren't actually
    starting from scratch

  - you get zero benefit in the non-clone cases, when we could be
    scaling the benefit smoothly
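
The loose-object caching idea quoted above can be sketched roughly as
follows. This is a minimal illustration, not the patch from this thread:
the names (loose_cache, loose_cache_fill, quick_has_loose_sketch) and the
max_entries cutoff are hypothetical, and the directory listing is passed
in by the caller rather than read with readdir(). The lookup is
tri-state, so a caller can still fall back to stat() when the cache is
not usable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define OID_LEN 20

/* Hypothetical names throughout; this is not git's actual API. */
enum cache_status { CACHE_UNKNOWN, CACHE_VALID, CACHE_INVALID };

struct loose_cache {
	enum cache_status status;
	unsigned char (*oids)[OID_LEN];	/* sorted once the cache is filled */
	size_t nr;
};

static int oid_cmp(const void *a, const void *b)
{
	return memcmp(a, b, OID_LEN);
}

/*
 * Fill the cache from a listing of loose object IDs. If there are more
 * than max_entries of them, mark the cache invalid so that callers keep
 * using stat() instead of paying for a huge in-memory list.
 */
static void loose_cache_fill(struct loose_cache *c,
			     unsigned char (*listing)[OID_LEN], size_t nr,
			     size_t max_entries)
{
	assert(c);
	if (nr > max_entries) {
		c->status = CACHE_INVALID;
		return;
	}
	c->oids = listing;
	c->nr = nr;
	if (nr)
		qsort(c->oids, c->nr, OID_LEN, oid_cmp);
	c->status = CACHE_VALID;
}

/* Returns 1 if present, 0 if absent, -1 if the caller must stat(). */
static int quick_has_loose_sketch(const struct loose_cache *c,
				  const unsigned char *oid)
{
	assert(c && oid);
	if (c->status != CACHE_VALID)
		return -1;
	if (!c->nr)
		return 0;
	return bsearch(oid, c->oids, c->nr, OID_LEN, oid_cmp) != NULL;
}
```

With a scheme like this, one object (or N objects) in the repository
does not defeat the optimization; only exceeding the cutoff does.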

> But in the spirit of taking a step back, maybe we should take two steps
> back and consider why we're doing this at all.

OK, I think it's worth discussing, and I'll do that below. But first I
want to say...

> Three of our tests fail if we compile git like this, and cloning is much
> faster (especially on NFS):
> 
>     diff --git a/builtin/index-pack.c b/builtin/index-pack.c
>     index 2004e25da2..0c2d008ee0 100644
>     --- a/builtin/index-pack.c
>     +++ b/builtin/index-pack.c
>     @@ -796,3 +796,3 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
> 
>     -       if (startup_info->have_repository) {
>     +       if (0) {
>                     read_lock();
> 
> Even on a local disk I'm doing 262759 lstat() calls cloning git.git and
> spending 5% of my time on that.

With the caching patch I posted earlier, I see roughly the same speedup
on an index-pack of git.git as I do with disabling the collision check
entirely (I did see about a 1% difference in favor of what you wrote
above, which was within the noise, but may well be valid due to slightly
reduced lock contention).

TBH I'm not sure if any of this is actually worth caring about on a
normal Linux system, though. There stat() is fast. It might be much more
interesting on macOS or Windows, or on a Linux system on NFS.

> But why do we have this in the first place? It's because of 8685da4256
> ("don't ever allow SHA1 collisions to exist by fetching a pack",
> 2007-03-20) and your 51054177b3 ("index-pack: detect local corruption in
> collision check", 2017-04-01).
> 
> I.e. we are worried about (and those tests check for):
> 
>  a) A malicious user trying to give us a repository where they have
>     created an object with the same SHA-1 that's different, as in the
>     SHAttered attack.
> 
>     I remember (but have not dug up) an old E-Mail from Linus saying
>     that this was an important security aspect of git, i.e. even if
>     SHA-1 was broken you couldn't easily propagate bad objects.

Yeah, especially given recent advances in SHA-1 attacks, I'm not super
comfortable with the idea of disabling the duplicate-object check at
this point.

>  b) Cases where we've ended up with different content for a SHA-1 due to
>     e.g. a local FS corruption. Which is the subject of your commit in
>     2017.

Sort of. We actually detected it before my patch, but we just gave a
really crappy error message. ;)

>  c) Are there cases where fetch.fsckObjects is off and we just flip a
>     bit on the wire and don't notice? I think not because we always
>     check the pack checksum (don't we), but I'm not 100% sure.

We'd detect bit-flips on the wire due to the pack checksum. But what's
more interesting are bit-flips on the disk of the sender, which would
then put the bogus data into the pack checksum they generate on the fly.

However, we do detect such a bit-flip, even without fsckObjects, because
the sender does not tell us the expected sha-1 of each object. It gives
us a stream of objects, and the receiver computes the sha-1's
themselves. So a bit flip manifests in the connectivity-check when we
say "hey, the other side should have sent us object X but did not" (we
do not say "gee, what is this object Y they sent?" because after not
seeing X, we do not know which objects would have been reachable, so we
have a whole bunch of such Y's).

fetch.fsckObjects is purely about doing semantic object-quality checks.
They're not even that expensive to do. The main reason they're disabled
is that there are many historical objects floating around that fail
them (I think it would be a useful exercise to sort the existing checks
by priority, downgrading many of them to warnings, and then setting the
default for fetch.fsckObjects to "reject anything above warning").

> Even if someone wants to make the argument that this is behavior that we
> absolutely *MUST* keep and not make configurable, there's still much
> smarter ways to do it.

I don't have any real objection to a configuration like this, if people
want to experiment with it. But in contrast, the patch I showed earlier:

  - is safe enough to just turn on all the time, without the user having
    to configure anything nor make a safety tradeoff

  - speeds up all the other spots that use OBJECT_INFO_QUICK (like
    fetch's tag-following, or what appears to be the exact same
    optimization done manually inside mark_complete_and_common_ref()).

> We could e.g. just unconditionally write out the packfile into a
> quarantine environment (see 720dae5a19 ("config doc: elaborate on
> fetch.fsckObjects security", 2018-07-27)), *then* loop over the loose
> objects and packs we have and see if any of those exist in the new pack,
> if they do, do the current assertion, and if not (and fetch.fsckObjects
> passes) move it out of the quarantine.

Yes, I agree that would work, though it's a bigger architecture change.

-Peff


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-29 15:04           ` [RFC PATCH] index-pack: improve performance on NFS Jeff King
@ 2018-10-29 15:09             ` Jeff King
  2018-10-29 19:36             ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 87+ messages in thread
From: Jeff King @ 2018-10-29 15:09 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, Jansen, Geert, git, Christian Couder,
	Nicolas Pitre, Linus Torvalds

On Mon, Oct 29, 2018 at 11:04:53AM -0400, Jeff King wrote:

> > Even if someone wants to make the argument that this is behavior that we
> > absolutely *MUST* keep and not make configurable, there's still much
> > smarter ways to do it.
> 
> I don't have any real objection to a configuration like this, if people
> want to experiment with it. But in contrast, the patch I showed earlier:
> 
>   - is safe enough to just turn on all the time, without the user having
>     to configure anything nor make a safety tradeoff
> 
>   - speeds up all the other spots that use OBJECT_INFO_QUICK (like
>     fetch's tag-following, or what appears to be the exact same
>     optimization done manually inside mark_complete_and_common_ref()).

One thing I forgot to add. We're focusing here on the case where the
objects _aren't_ present, and we're primarily trying to get rid of the
stat call.

But when we actually do see a duplicate, we open up the object and
actually compare the bytes. Eliminating the collision check entirely
would save that work, which is obviously something that can't be
improved by just caching the existence of loose objects.
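
A minimal sketch of the duplicate check described above, assuming both
copies of the object are already in memory. The enum and the function
name are hypothetical and are not taken from index-pack.c; the point is
only that the byte comparison runs when an existing copy was found, so
a cache of which objects exist cannot remove that work.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome codes for this sketch; not Git's own names. */
enum dup_result {
	DUP_ABSENT,    /* no existing copy: nothing to compare */
	DUP_IDENTICAL, /* existing copy matches byte for byte */
	DUP_COLLISION  /* same object name, different content */
};

/*
 * Compare an incoming object against an existing copy that shares its
 * object name. Pass existing == NULL when no copy was found.
 */
static enum dup_result check_duplicate(const void *existing, size_t existing_len,
				       const void *incoming, size_t incoming_len)
{
	if (!existing)
		return DUP_ABSENT;
	if (existing_len != incoming_len ||
	    memcmp(existing, incoming, incoming_len) != 0)
		return DUP_COLLISION;
	return DUP_IDENTICAL;
}
```

Only the DUP_ABSENT path is helped by caching the existence of loose
objects; the other two paths still have to read and compare the data.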

I'm not sure how often that case happens in a normal repository. We see
it a fair bit on GitHub servers because of the way we use alternates (i.e.,
we often already have the object you pushed up because it's present in
another fork and available via objects/info/alternates).

-Peff


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-27 14:04         ` Duy Nguyen
@ 2018-10-29 15:18           ` Jeff King
  0 siblings, 0 replies; 87+ messages in thread
From: Jeff King @ 2018-10-29 15:18 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason, gerardu,
	Git Mailing List, Christian Couder

On Sat, Oct 27, 2018 at 04:04:32PM +0200, Duy Nguyen wrote:

> > Of course any cache raises questions of cache invalidation, but I think
> > we've already dealt with that for this case. When we use
> > OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
> > accuracy/speed tradeoff (which does a similar caching thing with
> > packfiles).
> 
> We don't care about a separate process adding more loose objects while
> index-pack is running, do we? I'm guessing we don't but just to double
> check...

Right. That's basically what QUICK means: don't bother re-examining the
repository to handle simultaneous writes, even if it means saying an
object is not there when it has recently appeared.

So far it has only applied to packs, but this is really just the same
concept (just as we would not notice a new pack arriving, we will not
notice a new loose object arriving).

> > +/* probably should be configurable? */
> > +#define LOOSE_OBJECT_CACHE_MAX 65536
> 
> Yes, perhaps with gc.auto config value (multiplied by 256) as the cut
> point. If it's too big maybe just go with a bloom filter. For this
> particular case we expect like 99% of calls to miss.

I wonder, though, if we should have a maximum at all. The existing
examples I've found of this technique are:

  - mark_complete_and_common_ref(), which is trying to cover this exact
    case. It looks like it avoids adding more objects than there are
    refs, so I guess it actually has a pretty small cutoff.

  - find_short_object_filename(), which does the same thing with no
    limits. And there if we _do_ have a lot of objects, we'd still
    prefer to keep the cache.

And really, this list is pretty much equivalent to looking at a pack
.idx. The only difference is that one is mmap'd, but here we'd use the
heap. So it's not shared between processes, but otherwise the working
set size is similar.
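
A rough sketch of such a heap-allocated list, assuming 20-byte SHA-1
names kept in a sorted array and looked up by binary search, much like
scanning a pack .idx. The struct and function names are hypothetical
and are not the ones used in the patch. The cache is a snapshot: it is
filled once and never re-read, so an object that appears later is
reported as missing, which is the QUICK tradeoff discussed above.

```c
#include <stdlib.h>
#include <string.h>

#define OID_RAWSZ 20 /* raw SHA-1 length in bytes */

/* Hypothetical snapshot cache; field and function names are illustrative. */
struct loose_cache {
	unsigned char (*oids)[OID_RAWSZ];
	size_t nr, alloc;
	int sorted;
};

static int oid_cmp(const void *a, const void *b)
{
	return memcmp(a, b, OID_RAWSZ);
}

/* Append one object name; returns 0 on success, -1 on allocation failure. */
static int loose_cache_add(struct loose_cache *c, const unsigned char *oid)
{
	if (c->nr == c->alloc) {
		size_t n = c->alloc ? c->alloc * 2 : 64;
		void *p = realloc(c->oids, n * OID_RAWSZ);
		if (!p)
			return -1;
		c->oids = p;
		c->alloc = n;
	}
	memcpy(c->oids[c->nr++], oid, OID_RAWSZ);
	c->sorted = 0;
	return 0;
}

/* Sort lazily, then binary-search; nonzero if the name is in the snapshot. */
static int loose_cache_has(struct loose_cache *c, const unsigned char *oid)
{
	if (!c->nr)
		return 0;
	if (!c->sorted) {
		qsort(c->oids, c->nr, OID_RAWSZ, oid_cmp);
		c->sorted = 1;
	}
	return bsearch(oid, c->oids, c->nr, OID_RAWSZ, oid_cmp) != NULL;
}
```

Nothing here imposes a maximum; the array simply grows with the number
of loose objects, at 20 bytes per entry.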

-Peff


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-29  0:48         ` Junio C Hamano
@ 2018-10-29 15:20           ` Jeff King
  2018-10-29 18:43             ` Ævar Arnfjörð Bjarmason
  2018-10-29 21:34           ` Geert Jansen
  1 sibling, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-10-29 15:20 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, Jansen, Geert, git,
	Christian Couder

On Mon, Oct 29, 2018 at 09:48:02AM +0900, Junio C Hamano wrote:

> > Of course any cache raises questions of cache invalidation, but I think
> > we've already dealt with that for this case. When we use
> > OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
> > accuracy/speed tradeoff (which does a similar caching thing with
> > packfiles).
> >
> > So putting that all together, could we have something like:
> 
> I think this conceptually is a vast improvement relative to
> ".cloning" optimization.  Obviously this does not have the huge
> downside of the other approach that turns the collision detection
> completely off.
> 
> A real question is how much performance gain, relative to ".cloning"
> thing, this approach gives us.  If it gives us 80% or more of the
> gain compared to doing no checking, I'd say we have a clear winner.

My test runs showed it improving index-pack by about 3%, versus 4% for
no collision checking at all. But there was easily 1% of noise. And much
more importantly, that was on a Linux system on ext4, where stat is
fast. I'd be much more curious to hear timing results from people on
macOS or Windows, or from Geert's original NFS case.

-Peff


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-29 15:20           ` Jeff King
@ 2018-10-29 18:43             ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-29 18:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Jansen, Geert, git, Christian Couder


On Mon, Oct 29 2018, Jeff King wrote:

> On Mon, Oct 29, 2018 at 09:48:02AM +0900, Junio C Hamano wrote:
>
>> > Of course any cache raises questions of cache invalidation, but I think
>> > we've already dealt with that for this case. When we use
>> > OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
>> > accuracy/speed tradeoff (which does a similar caching thing with
>> > packfiles).
>> >
>> > So putting that all together, could we have something like:
>>
>> I think this conceptually is a vast improvement relative to
>> ".cloning" optimization.  Obviously this does not have the huge
>> downside of the other approach that turns the collision detection
>> completely off.
>>
>> A real question is how much performance gain, relative to ".cloning"
>> thing, this approach gives us.  If it gives us 80% or more of the
>> gain compared to doing no checking, I'd say we have a clear winner.
>
> My test runs showed it improving index-pack by about 3%, versus 4% for
> no collision checking at all. But there was easily 1% of noise. And much
> more importantly, that was on a Linux system on ext4, where stat is
> fast. I'd be much more curious to hear timing results from people on
> macOS or Windows, or from Geert's original NFS case.

At work we make copious use of NetApp over NFS for filers. I'd say this
is probably typical for enterprise environments. Raw I/O performance
over the wire (writing a large file) is really good, but metadata
(e.g. stat) performance tends to be atrocious.

We host both the in-house Git server (GitLab) and many types of clients
on such a filer (for HA etc.).

As noted by Geert upthread you need to mount the git directories with
lookupcache=positive (see e.g. [1]).

Cloning git.git as --bare onto such a partition with my patch:

    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     60.98    1.802091          19     93896     19813 futex
     14.64    0.432782           7     61415        16 read
      9.40    0.277804           1    199576           pread64
      4.88    0.144172           3     49355        11 write
      3.10    0.091498          31      2919      2880 stat
      2.53    0.074812          31      2431       737 lstat
      1.96    0.057934           3     17257      1276 recvfrom
      0.91    0.026815           3      8543           select
      0.62    0.018425           2      8543           poll
    [...]
    real    0m32.053s
    user    0m21.451s
    sys     0m7.806s

Without:

    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     71.01   31.653787          50    628265     21608 futex
     24.14   10.761950          41    260658    258964 lstat
      2.22    0.988001           5    199576           pread64
      1.32    0.587844          10     59662         3 read
      0.79    0.350625           7     50376        11 write
      0.22    0.096019          33      2919      2880 stat
      0.13    0.057950           4     15821        12 recvfrom
      0.05    0.022385           3      7949           select
      0.04    0.015988           2      7949           poll
      0.03    0.013622        3406         4           wait4
    [...]
    real    4m38.670s
    user    0m29.015s
    sys     0m33.894s

So a reduction in clone time of roughly 88% (4m38.670s down to 0m32.053s).

Performance would be basically the same with your patch. But let's
discuss that elsewhere in this thread. Just wanted to post the
performance numbers here.

1. https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/109#note_12528896


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-29 15:04           ` [RFC PATCH] index-pack: improve performance on NFS Jeff King
  2018-10-29 15:09             ` Jeff King
@ 2018-10-29 19:36             ` Ævar Arnfjörð Bjarmason
  2018-10-29 23:27               ` Jeff King
  1 sibling, 1 reply; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-29 19:36 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Jansen, Geert, git,
	Christian Couder, Nicolas Pitre, Linus Torvalds


On Mon, Oct 29 2018, Jeff King wrote:

> On Sat, Oct 27, 2018 at 01:22:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > Taking one step back, the root problem in this thread is that stat() on
>> > non-existing files is slow (which makes has_sha1_file slow).
>> >
>> > One solution there is to cache the results of looking in .git/objects
>> > (or any alternate object store) for loose files. And indeed, this whole
>> > scheme is just a specialized form of that: it's a flag to say "hey, we
>> > do not have any objects yet, so do not bother looking".
>> >
>> > Could we implement that in a more direct and central way? And could we
>> > implement it in a way that catches more cases? E.g., if I have _one_
>> > object, that defeats this specialized optimization, but it is probably
>> > still beneficial to cache that knowledge (and the reasonable cutoff is
>> > probably not 1, but some value of N loose objects).
>> [...]
>>
>> The assumption with making it exactly 0 objects and not any value of >0
>> is that we can safely assume that a "clone" or initial "fetch"[1] is
>> special in ways that a clone isn't. I.e. we're starting out with nothing
>> and doing the initial population, that's probably not as true in an
>> existing repo that's getting concurrent fetches, commits, rebases etc.
>
> I assume you mean s/that a clone isn't/that a fetch isn't/.

Yes, sorry.

> I agree there are cases where you might be able to go further if you
> assume a full "0". But my point is that "clone" is an ambiguous concept,
> and it doesn't map completely to what's actually slow here. So if you
> only look at "are we cloning", then:
>
>   - you have a bunch of cases which are "clones", but aren't actually
>     starting from scratch
>
>   - you get zero benefit in the non-clone cases, when we could be
>     scaling the benefit smoothly

Indeed. It's not special in principle, but I think in practice the
biggest wins are in the clone case, and unlike "fetch" we can safely
assume we're free of race conditions. More on that below.

>> But in the spirit of taking a step back, maybe we should take two steps
>> back and consider why we're doing this at all.
>
> OK, I think it's worth discussing, and I'll do that below. But first I
> want to say...
>
>> Three of our tests fail if we compile git like this, and cloning is much
>> faster (especially on NFS):
>>
>>     diff --git a/builtin/index-pack.c b/builtin/index-pack.c
>>     index 2004e25da2..0c2d008ee0 100644
>>     --- a/builtin/index-pack.c
>>     +++ b/builtin/index-pack.c
>>     @@ -796,3 +796,3 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
>>
>>     -       if (startup_info->have_repository) {
>>     +       if (0) {
>>                     read_lock();
>>
>> Even on a local disk I'm doing 262759 lstat() calls cloning git.git and
>> spending 5% of my time on that.
>
> With the caching patch I posted earlier, I see roughly the same speedup
> on an index-pack of git.git as I do with disabling the collision check
> entirely (I did see about a 1% difference in favor of what you wrote
> above, which was within the noise, but may well be valid due to slightly
> reduced lock contention).
>
> TBH I'm not sure if any of this is actually worth caring about on a
> normal Linux system, though. There stat() is fast. It might be much more
> interesting on macOS or Windows, or on a Linux system on NFS.

It matters a *lot* on NFS as my performance numbers in
https://public-inbox.org/git/87d0rslhkl.fsf@evledraar.gmail.com/ show.

>> But why do we have this in the first place? It's because of 8685da4256
>> ("don't ever allow SHA1 collisions to exist by fetching a pack",
>> 2007-03-20) and your 51054177b3 ("index-pack: detect local corruption in
>> collision check", 2017-04-01).
>>
>> I.e. we are worried about (and those tests check for):
>>
>>  a) A malicious user trying to give us repository where they have
>>     created an object with the same SHA-1 that's different, as in the
>>     SHAttered attack.
>>
>>     I remember (but have not dug up) an old E-Mail from Linus saying
>>     that this was an important security aspect of git, i.e. even if
>>     SHA-1 was broken you couldn't easily propagate bad objects.
>
> Yeah, especially given recent advances in SHA-1 attacks, I'm not super
> comfortable with the idea of disabling the duplicate-object check at
> this point.

I'd be comfortable with it in my setup since it's been limited to
collision attacks that are computationally prohibitive, and there being
no sign of preimage attacks, which is the case we really need to worry
about.

>>  b) Cases where we've ended up with different content for a SHA-1 due to
>>     e.g. a local FS corruption. Which is the subject of your commit in
>>     2017.
>
> Sort of. We actually detected it before my patch, but we just gave a
> really crappy error message. ;)
>
>>  c) Are there cases where fetch.fsckObjects is off and we just flip a
>>     bit on the wire and don't notice? I think not because we always
>>     check the pack checksum (don't we), but I'm not 100% sure.
>
> We'd detect bit-flips on the wire due to the pack checksum. But what's
> more interesting are bit-flips on the disk of the sender, which would
> then put the bogus data into the pack checksum they generate on the fly.
>
> However, we do detect such a bit-flip, even without fsckObjects, because
> the sender does not tell us the expected sha-1 of each object. It gives
> us a stream of objects, and the receiver computes the sha-1's
> itself. So a bit flip manifests in the connectivity-check when we
> say "hey, the other side should have sent us object X but did not" (we
> do not say "gee, what is this object Y they sent?" because after not
> seeing X, we do not know which objects would have been reachable, so we
> have a whole bunch of such Y's).
>
> fetch.fsckObjects is purely about doing semantic object-quality checks.
> They're not even that expensive to do. The main reason they're disabled
> is that there are many historical objects floating around that fail
> them (I think it would be a useful exercise to sort the existing checks
> by priority, downgrading many of them to warnings, and then setting the
> default for fetch.fsckObjects to "reject anything above warning").

Thanks, that's really informative & useful.

>> Even if someone wants to make the argument that this is behavior that we
>> absolutely *MUST* keep and not make configurable, there's still much
>> smarter ways to do it.
>
> I don't have any real objection to a configuration like this, if people
> want to experiment with it. But in contrast, the patch I showed earlier:
>
>   - is safe enough to just turn on all the time, without the user having
>     to configure anything nor make a safety tradeoff

I think it's a useful patch to carry forward, and agree that it should
be turned on by default.

It does introduce a race condition where you can introduce a colliding
object to the repository by doing two concurrent pushes, but as you note
in
https://public-inbox.org/git/20181029151842.GJ17668@sigill.intra.peff.net/
this already applies to packs, so you can trigger that with the right
sized push (depending on transfer.unpackLimit), and we also have this in
existing forms for other stuff.

I do think it's amazingly paranoid to be worried about SHA-1 collisions
in the first place, and a bit odd to leave the door open on these race
conditions. I.e. it's hard to imagine a state-level[1] actor with
sufficient motivation to exploit this who wouldn't find some way to make
the race condition work as an escape hatch.

I admit just leaving that race condition does close a lot of doors
entirely. I.e. you could sometimes trigger a collision but wouldn't have
the right conditions to exploit the race condition.

>   - speeds up all the other spots that use OBJECT_INFO_QUICK (like
>     fetch's tag-following, or what appears to be the exact same
>     optimization done manually inside mark_complete_and_common_ref()).

We also pay a constant cost of doing an opendir() / readdir() on however
many loose objects we have for every push on the server-side. While it's
not as bad as stat() in a loop that's also quite slow on NFS.

In a busy repo that gets a lot of branches / branch deletions (so not
quite as extreme as [2], but close) and the default expiry policy you
can easily have 20-100K loose objects (something near the lower bound of
that is the current live state of one server I'm looking at).

A recursive opendir()/readdir() on that on local disk is really fast if
it's in cache, but can easily be 1-5 seconds on NFS. So for a push we'd
now pay up to 5s just populating a cache we'll barely use to accept some
tiny push with just a few objects.

I also found when writing "index-pack: add ability to disable SHA-1
collision check" that it's really handy to recover from some forms of
repo corruption, so I've documented that. So aside from the performance
case it's a useful knob to have.

So what I'll do is:

 * Re-roll my 4 patch series to include the patch you have in
   <20181027093300.GA23974@sigill.intra.peff.net>

 * Turn that behavior on by default, but have some knob to toggle it
   off, because as noted above, in some performance-sensitive NFS cases
   I'd really like to not have the cache *AND* not have the collision
   check, since performance will suffer with the cache.

>> We could e.g. just unconditionally write out the packfile into a
>> quarantine environment (see 720dae5a19 ("config doc: elaborate on
>> fetch.fsckObjects security", 2018-07-27)), *then* loop over the loose
>> objects and packs we have and see if any of those exist in the new pack,
>> if they do, do the current assertion, and if not (and fetch.fsckObjects
>> passes) move it out of the quarantine.
>
> Yes, I agree that would work, though it's a bigger architecture change.

1. "state-level" because even though Google's collision cost ~$100k
   we're talking about a *much* harder problem in practice of doing
   something useful. E.g. replacing git.c with an actual exploit, not
   just two cute specially crafted PDFs.

2. https://public-inbox.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-29  0:48         ` Junio C Hamano
  2018-10-29 15:20           ` Jeff King
@ 2018-10-29 21:34           ` Geert Jansen
  2018-10-29 21:50             ` Jeff King
  2018-10-29 22:27             ` Jeff King
  1 sibling, 2 replies; 87+ messages in thread
From: Geert Jansen @ 2018-10-29 21:34 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Ævar Arnfjörð Bjarmason, git, Christian Couder

On Mon, Oct 29, 2018 at 09:48:02AM +0900, Junio C Hamano wrote:

> A real question is how much performance gain, relative to ".cloning"
> thing, this approach gives us.  If it gives us 80% or more of the
> gain compared to doing no checking, I'd say we have a clear winner.

I've tested Jeff's loose-object-cache patch and the performance is within error
bounds of my .cloning patch. A git clone of the same repo as in my initial
tests:

  .cloning -> 10m04
  loose-object-cache -> 9m59

Jeff's patch does a little more work (256 readdir() calls, which in case of an
empty repo translate into 256 LOOKUP calls that return NFS4ERR_NOENT) but that
appears to be insignificant.

I agree that the loose-object-cache approach is preferred as it applies to more
git commands and also benefits performance when there are loose objects
already in the repository.

As pointed out in the thread, maybe the default cache size should be some
integer times gc.auto.

I believe the loose-object-cache approach would have a performance regression
when you're receiving a small pack file and there's many loose objects in the
repo. Basically you're trading off

    MIN(256, num_objects_in_pack / dentries_per_readdir) * readdir_latency
    
against

    num_loose_objects * stat_latency

On Amazon EFS (and I expect on other NFS server implementations too) it is more
efficient to do readdir() on a large directory than to stat() each of the
individual files in the same directory. I don't have exact numbers but based on
a very rough calculation the difference is roughly 10x for large directories
under normal circumstances.

As an example, this means that when you're receiving a pack file with 1K
objects in a repository with 10K loose objects, the loose-object-cache
patch has roughly the same performance as the current git. I'm not sure if this
is something to worry about as I'm not sure people run repos with this many
loose files. If it is a concern, there could be a flag to turn the loose object
cache on/off.
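
To make the example concrete, here is a rough cost model in C. It
reads the trade-off the way the 1K/10K example does (loose objects on
the readdir side, pack objects on the stat side) and folds the rough
10x figure into a single factor. The struct, the function names, and
the 1ms stat latency used in the note below are assumptions for
illustration, not measurements.

```c
/* Hypothetical cost model; all latencies are in microseconds. */
struct nfs_costs {
	unsigned long stat_us;        /* one stat() round trip */
	unsigned long readdir_factor; /* entries readdir() lists for the cost of one stat() */
};

/* Cost of filling the cache: every loose object is listed once. */
static unsigned long cache_fill_cost(const struct nfs_costs *c,
				     unsigned long num_loose_objects)
{
	return num_loose_objects * c->stat_us / c->readdir_factor;
}

/* Cost without the cache: one stat() per incoming pack object. */
static unsigned long per_object_stat_cost(const struct nfs_costs *c,
					  unsigned long num_objects_in_pack)
{
	return num_objects_in_pack * c->stat_us;
}
```

With stat_us = 1000 and readdir_factor = 10, cache_fill_cost() for
10,000 loose objects and per_object_stat_cost() for 1,000 pack objects
both come to 1,000,000us, which is the break-even point in the example.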


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-29 21:34           ` Geert Jansen
@ 2018-10-29 21:50             ` Jeff King
  2018-10-29 22:21               ` Geert Jansen
  2018-10-29 22:27             ` Jeff King
  1 sibling, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-10-29 21:50 UTC (permalink / raw)
  To: Geert Jansen
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason, git,
	Christian Couder

On Mon, Oct 29, 2018 at 09:34:53PM +0000, Geert Jansen wrote:

> On Mon, Oct 29, 2018 at 09:48:02AM +0900, Junio C Hamano wrote:
> 
> > A real question is how much performance gain, relative to ".cloning"
> > thing, this approach gives us.  If it gives us 80% or more of the
> > gain compared to doing no checking, I'd say we have a clear winner.
> 
> I've tested Jeff's loose-object-cache patch and the performance is within error
> bounds of my .cloning patch. A git clone of the same repo as in my initial
> tests:
> 
>   .cloning -> 10m04
>   loose-object-cache -> 9m59
> 
> Jeff's patch does a little more work (256 readdir() calls, which in case of an
> empty repo translate into 256 LOOKUP calls that return NFS4ERR_NOENT) but that
> appears to be insignificant.

Yep, that makes sense. Thanks for timing it.

> I believe the loose-object-cache approach would have a performance regression
> when you're receiving a small pack file and there's many loose objects in the
> repo. Basically you're trading off
> 
>     MIN(256, num_objects_in_pack / dentries_per_readdir) * readdir_latency
>     
> against
> 
>     num_loose_objects * stat_latency

Should num_loose_objects and num_objects_in_pack be swapped here? Just
making sure I understand what you're saying.

The patch I showed just blindly reads each of the 256 object
subdirectories. I think if we pursue this (and it seems like everybody
is on board), we should cache each of those individually. So a single
object would incur at most one opendir/readdir (and subsequent objects
may, too, or they may hit that cache if they share the first byte).
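
A sketch of that per-directory caching, assuming a bitmap with one bit
per objects/XX directory. The struct, the callback type, and the
function name are hypothetical; the callback stands in for the actual
opendir/readdir pass and may be NULL in this sketch.

```c
#include <stddef.h>

/* Hypothetical callback: read objects/XX for the given first byte. */
typedef void (*scan_subdir_fn)(unsigned char first_byte, void *data);

struct lazy_loose_cache {
	unsigned char loaded[256 / 8]; /* one bit per objects/XX directory */
	scan_subdir_fn scan;           /* may be NULL in this sketch */
	void *data;
	unsigned scans;                /* directories read so far */
};

/* Make sure the subdirectory holding this object name has been read once. */
static void lazy_cache_prepare(struct lazy_loose_cache *c,
			       const unsigned char *oid)
{
	unsigned char first = oid[0];
	unsigned char mask = (unsigned char)(1u << (first % 8));

	if (c->loaded[first / 8] & mask)
		return; /* already cached: no opendir/readdir needed */
	if (c->scan)
		c->scan(first, c->data);
	c->loaded[first / 8] |= mask;
	c->scans++;
}
```

Two objects that share a first byte cost one directory read between
them, and a push with a handful of objects touches at most that many
directories rather than all 256.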

So the 256 in your MIN() is potentially much smaller. We still have to
deal with the fact that if you have a large number of loose objects,
they may be split across multiple readdir (or getdents) calls. The "cache
maximum" we discussed does bound that, but in some ways that's worse:
you waste time doing the bounded amount of readdir and then don't even
get the benefit of the cache. ;)

> On Amazon EFS (and I expect on other NFS server implementations too) it is more
> efficient to do readdir() on a large directory than to stat() each of the
> individual files in the same directory. I don't have exact numbers but based on
> a very rough calculation the difference is roughly 10x for large directories
> under normal circumstances.

I'd expect readdir() to be much faster than stat() in general (e.g., "ls
-f" versus "ls -l" is faster even on a warm cache; there's more
formatting going on in the latter, but I think a lot of it is the effort
to stat).

-Peff


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-29 21:50             ` Jeff King
@ 2018-10-29 22:21               ` Geert Jansen
  0 siblings, 0 replies; 87+ messages in thread
From: Geert Jansen @ 2018-10-29 22:21 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason, git,
	Christian Couder

On Mon, Oct 29, 2018 at 05:50:50PM -0400, Jeff King wrote:

> > I believe the loose-object-cache approach would have a performance regression
> > when you're receiving a small pack file and there's many loose objects in the
> > repo. Basically you're trading off
> > 
> >     MIN(256, num_objects_in_pack / dentries_per_readdir) * readdir_latency
> >     
> > against
> > 
> >     num_loose_objects * stat_latency
> 
> Should num_loose_objects and num_objects_in_pack be swapped here? Just
> making sure I understand what you're saying.

Whoops, yes, thanks for spotting that!

> So the 256 in your MIN() is potentially much smaller. We still have to
> deal with the fact that if you have a large number of loose objects,
> they may be split across multiple readdir (or getdents) calls. The "cache
> maximum" we discussed does bound that, but in some ways that's worse:
> you waste time doing the bounded amount of readdir and then don't even
> get the benefit of the cache. ;)

Yup. To get the performance benefit you'd like the cache to hold all loose
objects except in clearly degenerate cases with far too many loose objects.

> > On Amazon EFS (and I expect on other NFS server implementations too) it is more
> > efficient to do readdir() on a large directory than to stat() each of the
> > individual files in the same directory. I don't have exact numbers but based on
> > a very rough calculation the difference is roughly 10x for large directories
> > under normal circumstances.
> 
> I'd expect readdir() to be much faster than stat() in general (e.g., "ls
> -f" versus "ls -l" is faster even on a warm cache; there's more
> formatting going on in the latter, but I think a lot of it is the effort
> to stat).

In the case of NFS, the client usually requests that the READDIR response also
contains some of the stat flags (like st_mode). But even in this case it's
still more efficient to return multiple entries in one batch through READDIR
rather than as individual responses to GETATTR (which is what stat() maps to).


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-29 21:34           ` Geert Jansen
  2018-10-29 21:50             ` Jeff King
@ 2018-10-29 22:27             ` Jeff King
  2018-10-29 22:35               ` Stefan Beller
  1 sibling, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-10-29 22:27 UTC (permalink / raw)
  To: Geert Jansen
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason, git,
	Christian Couder

On Mon, Oct 29, 2018 at 09:34:53PM +0000, Geert Jansen wrote:

> As an example, this means that when you're receiving a pack file with 1K
> objects in a repository with 10K loose objects, the loose-object-cache
> patch has roughly the same performance as the current git. I'm not sure if this
> is something to worry about as I'm not sure people run repos with this many
> loose files. If it is a concern, there could be a flag to turn the loose object
> cache on/off.

So yeah, that's the other thing I'm thinking about regarding having a
maximum loose cache size.

10k objects is only 200KB in memory. That's basically nothing. At some
point you run into pathological cases, like having a million objects
(but that's still only 20MB, much less than we devote to other caches,
though of course they do add up).
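
The arithmetic behind those figures, as a small sketch that assumes
the cache stores only the raw 20-byte SHA-1 of each object and ignores
any allocator overhead. The function name is hypothetical.

```c
#include <stddef.h>

#define SHA1_RAWSZ 20 /* raw SHA-1 length in bytes */

/* Bytes needed to hold the raw names of num_objects loose objects. */
static size_t loose_cache_bytes(size_t num_objects)
{
	return num_objects * SHA1_RAWSZ;
}
```

loose_cache_bytes(10000) is 200,000 bytes (the 200KB above) and
loose_cache_bytes(1000000) is 20,000,000 bytes (the 20MB above).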

If you have a million loose objects, I strongly suspect you're going to
run into other problems (like space, since you're not getting any
deltas).

The one thing that gives me pause is that if you have a bunch of unused
and unreachable loose objects on disk, most operations won't actually
look at them at all. The majority of operations are only looking for
objects we expect to be present (e.g., resolving a ref, walking a tree)
and are fulfilled by checking the pack indices first.

So it's possible that Git is _tolerable_ for most operations with a
million loose objects, and we could make it slightly worse by loading
the cache. But I find it hard to get too worked up about spending an
extra 20MB (and the time to readdir() it in) in that case. It seems like
about 400ms on my machine, and the correct next step is almost always
going to be "pack" or "prune" anyway.

-Peff


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-29 22:27             ` Jeff King
@ 2018-10-29 22:35               ` Stefan Beller
  2018-10-29 23:29                 ` Jeff King
  0 siblings, 1 reply; 87+ messages in thread
From: Stefan Beller @ 2018-10-29 22:35 UTC (permalink / raw)
  To: Jeff King
  Cc: gerardu, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	git, Christian Couder

On Mon, Oct 29, 2018 at 3:27 PM Jeff King <peff@peff.net> wrote:

> So yeah, that's the other thing I'm thinking about regarding having a
> maximum loose cache size.

tangent:
This preloading/caching could be used for a more precise approach
to decide when to gc instead of using some statistical sampling
of objects/17, eventually.


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-29 19:36             ` Ævar Arnfjörð Bjarmason
@ 2018-10-29 23:27               ` Jeff King
  2018-11-07 22:55                 ` Geert Jansen
  0 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-10-29 23:27 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, Jansen, Geert, git, Christian Couder,
	Nicolas Pitre, Linus Torvalds

On Mon, Oct 29, 2018 at 08:36:07PM +0100, Ævar Arnfjörð Bjarmason wrote:

> > Yeah, especially given recent advances in SHA-1 attacks, I'm not super
> > comfortable with the idea of disabling the duplicate-object check at
> > this point.
> 
> I'd be comfortable with it in my setup since it's been limited to
> collision attacks that are computationally prohibitive, and there being
> no sign of preimage attacks, which is the case we really need to worry
> about.

I agree, and I'm not actually that worried about the current state. But
what makes me more nervous is the life-cycle around Git. In 5 years,
people are still going to be running what we ship today, and will
grumble about upgrading to deal with SHA-1.

I suppose it's not the end of the world as long as they can un-flip a
config switch to get back the more-paranoid behavior (which is all that
you're really proposing).

> It does introduce a race condition where you can introduce a colliding
> object to the repository by doing two concurrent pushes, but as you note
> in
> https://public-inbox.org/git/20181029151842.GJ17668@sigill.intra.peff.net/
> this already applies to packs, so you can trigger that with the right
> sized push (depending on transfer.unpackLimit), and we also have this in
> existing forms for other stuff.

Right. It can also trigger currently if somebody runs "git repack"
simultaneously (the loose becomes packed, but we don't re-scan the pack
directory).

> I do think it's amazingly paranoid to be worried about SHA-1 collisions
> in the first place, and a bit odd to leave the door open on these race
> conditions. I.e. it's hard to imagine a state-level[1] actor with
> sufficient motivation to exploit this who wouldn't find some way to make
> the race condition work as an escape hatch.

Yeah, I agree there's an element of that. I think the "push twice
quickly to race" thing is actually not all that interesting, though. In
that case, you're providing both the objects already, so why not just
push the one you want?

What's more interesting is racing with the victim of your collision (I
feed Junio the good half of the collision, and then try to race his
push and get my evil half in at the same time). Or racing a repack. But
timing the race there seems a lot trickier.

I suspect you could open up the window substantially by feeding your
pack really slowly. So I start to push at 1pm, but trickle in a byte at
a time of my 1GB pack, taking several hours. Meanwhile Junio pushes, and
then as soon as I see that, I send the rest of my pack. My index-pack
doesn't see Junio's push because it started before.

And ditto with repack, if the server runs it predictably in response to
load.  So maybe not so tricky after all.

I think the other thing that helps here is that _everybody_ runs the
collision check. So yeah, you can race pushing your evil stuff to my
server. But it only takes one person fetching into their quiescent
laptop repository to notice the collision and sound the alarm.

I'll admit that there's a whole lot of hand-waving there, for a security
claim. I'll be glad to simply move off of SHA-1.

> In a busy repo that gets a lot of branches / branch deletions (so not
> quite as extreme as [2], but close) and the default expiry policy you
> can easily have 20-100K loose objects (something near the lower bound of
> that is the current live state of one server I'm looking at).
> 
> A recursive opendir()/readdir() on that on local disk is really fast if
> it's in cache, but can easily be 1-5 seconds on NFS. So for a push we'd
> now pay up to 5s just populating a cache we'll barely use to accept some
> tiny push with just a few objects.

That 1-5 seconds is a little scary. Locally for a million objects I was
looking at 400ms. But obviously NFS is going to be much worse.

I do agree with your sentiment below that even if this should be on by
default, it should have a config knob. After all, "please flip this
switch and see if things improve" is a good escape hatch to have.

>  * Re-roll my 4 patch series to include the patch you have in
>    <20181027093300.GA23974@sigill.intra.peff.net>

I don't think it's quite ready for inclusion as-is. I hope to brush it
up a bit, but I have quite a backlog of stuff to review, as well.

-Peff


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-29 22:35               ` Stefan Beller
@ 2018-10-29 23:29                 ` Jeff King
  0 siblings, 0 replies; 87+ messages in thread
From: Jeff King @ 2018-10-29 23:29 UTC (permalink / raw)
  To: Stefan Beller
  Cc: gerardu, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	git, Christian Couder

On Mon, Oct 29, 2018 at 03:35:58PM -0700, Stefan Beller wrote:

> On Mon, Oct 29, 2018 at 3:27 PM Jeff King <peff@peff.net> wrote:
> 
> > So yeah, that's the other thing I'm thinking about regarding having a
> > maximum loose cache size.
> 
> tangent:
> This preloading/caching could be used for a more precise approach
> to decide when to gc instead of using some statistical sampling
> of objects/17, eventually.

Isn't it exactly the same thing? Ideally we'd break down the cache to
the directory level, so we could fill the cache list for "17" and ask
"how full are you". Or we could just readdir objects/17 ourselves. But
either way, it's the same amount of work.
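
A minimal sketch of the "readdir objects/17 ourselves" variant, assuming
hashes are spread evenly over the 256 fan-out directories so that one
directory can stand in for the whole store. The function names are
invented for this example and are not git's own:

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>

/* A loose object file is named by the 38 hex digits after the 2-digit directory. */
int looks_like_loose_object(const char *name)
{
	assert(name != NULL);
	return strspn(name, "0123456789abcdef") == 38 && name[38] == '\0';
}

/* Count object-like entries in one fan-out directory; 0 if it cannot be opened. */
size_t count_loose_in_dir(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *ent;
	size_t nr = 0;

	if (!dir)
		return 0;
	while ((ent = readdir(dir)) != NULL)
		if (looks_like_loose_object(ent->d_name))
			nr++;
	closedir(dir);
	return nr;
}

/* Extrapolate from one of the 256 directories to the whole object store. */
size_t estimate_total_loose(size_t nr_in_one_dir)
{
	return nr_in_one_dir * 256;
}
```

A caller would pass something like ".git/objects/17" to
count_loose_in_dir() and feed the result to estimate_total_loose().
A per-directory cache would do the same readdir(), only keeping the
names around afterwards.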

-Peff


* Re: [PATCH 0/4] index-pack: optionally turn off SHA-1 collision checking
  2018-10-28 22:50           ` [PATCH 0/4] index-pack: optionally turn off SHA-1 collision checking Ævar Arnfjörð Bjarmason
@ 2018-10-30  2:49             ` Geert Jansen
  2018-10-30  9:04               ` Junio C Hamano
  2018-10-30 18:43             ` [PATCH v2 0/3] index-pack: test updates Ævar Arnfjörð Bjarmason
                               ` (3 subsequent siblings)
  4 siblings, 1 reply; 87+ messages in thread
From: Geert Jansen @ 2018-10-30  2:49 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Jeff King, Christian Couder, Nicolas Pitre,
	Linus Torvalds, Petr Baudis

On Sun, Oct 28, 2018 at 10:50:19PM +0000, Ævar Arnfjörð Bjarmason wrote:

> I left the door open for that in the new config option 4/4 implements,
> but I suspect for Geert's purposes this is something he'd prefer to
> turn off in git on clone entirely, i.e. because it may be running on
> some random Amazon's customer's EFS instance, and they won't know
> about this new core.checkCollisions option.
> 
> But maybe I'm wrong about that and Geert is happy to just turn on
> core.checkCollisions=false and use this series instead.

I think that the best user experience would probably be if git were fast by
default without having to give up on (future) security by removing the sha1
collision check.  Maybe core.checkCollisions could default to "off" only when
there are no loose objects in the repository? That would give a fast experience
for many common cases (git clone, git init && git fetch) while still doing the
collision check when relevant.

My patch used the --cloning flag as an approximation of "no loose objects".
Maybe a better option would be to check for the non-existence of the [00-ff]
directories under .git/objects.


* Re: [PATCH 0/4] index-pack: optionally turn off SHA-1 collision checking
  2018-10-30  2:49             ` Geert Jansen
@ 2018-10-30  9:04               ` Junio C Hamano
  0 siblings, 0 replies; 87+ messages in thread
From: Junio C Hamano @ 2018-10-30  9:04 UTC (permalink / raw)
  To: Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, git, Jeff King,
	Christian Couder, Nicolas Pitre, Linus Torvalds, Petr Baudis

Geert Jansen <gerardu@amazon.com> writes:

> Maybe a better option would be to check for the non-existence of the [00-ff]
> directories under .git/objects.

Please do not do this; I expect many people do this before they
leave work, just like I do:

	$ git repack -a -d -f --window=$largs --depth=$small
	$ git prune

which would typically leave only info/ and pack/ subdirectories
under .git/objects/ directory.


* [PATCH v2 0/3] index-pack: test updates
  2018-10-28 22:50           ` [PATCH 0/4] index-pack: optionally turn off SHA-1 collision checking Ævar Arnfjörð Bjarmason
  2018-10-30  2:49             ` Geert Jansen
@ 2018-10-30 18:43             ` Ævar Arnfjörð Bjarmason
  2018-11-13 20:19               ` [PATCH v3] index-pack: add ability to disable SHA-1 collision check Ævar Arnfjörð Bjarmason
  2018-10-30 18:43             ` [PATCH v2 1/3] pack-objects test: modernize style Ævar Arnfjörð Bjarmason
                               ` (2 subsequent siblings)
  4 siblings, 1 reply; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-30 18:43 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Geert Jansen, Christian Couder,
	Nicolas Pitre, Linus Torvalds, Petr Baudis,
	Ævar Arnfjörð Bjarmason

I'd still probably like to have core.checkCollisions as a config knob
to be able to turn it off, but let's see what Jeff comes up with once
he finishes his WIP cache patch.

In the meantime 1-3/4 of my series is obviously correct test fixes
which I'd like queued up first.

Ævar Arnfjörð Bjarmason (3):
  pack-objects test: modernize style
  pack-objects tests: don't leave test .git corrupt at end
  index-pack tests: don't leave test repo dirty at end

 t/t1060-object-corruption.sh |  4 ++-
 t/t5300-pack-object.sh       | 47 +++++++++++++++++++-----------------
 2 files changed, 28 insertions(+), 23 deletions(-)

-- 
2.19.1.899.g0250525e69



* [PATCH v2 1/3] pack-objects test: modernize style
  2018-10-28 22:50           ` [PATCH 0/4] index-pack: optionally turn off SHA-1 collision checking Ævar Arnfjörð Bjarmason
  2018-10-30  2:49             ` Geert Jansen
  2018-10-30 18:43             ` [PATCH v2 0/3] index-pack: test updates Ævar Arnfjörð Bjarmason
@ 2018-10-30 18:43             ` Ævar Arnfjörð Bjarmason
  2018-10-30 18:43             ` [PATCH v2 2/3] pack-objects tests: don't leave test .git corrupt at end Ævar Arnfjörð Bjarmason
  2018-10-30 18:43             ` [PATCH v2 3/3] index-pack tests: don't leave test repo dirty " Ævar Arnfjörð Bjarmason
  4 siblings, 0 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-30 18:43 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Geert Jansen, Christian Couder,
	Nicolas Pitre, Linus Torvalds, Petr Baudis,
	Ævar Arnfjörð Bjarmason

Modernize the quoting and indentation style of two tests added in
8685da4256 ("don't ever allow SHA1 collisions to exist by fetching a
pack", 2007-03-20), and of a subsequent one added in
4614043c8f ("index-pack: use streaming interface for collision test on
large blobs", 2012-05-24) which had copied the style of the first two.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/t5300-pack-object.sh | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 6c620cd540..a0309e4bab 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -475,22 +475,22 @@ test_expect_success 'pack-objects in too-many-packs mode' '
 # two tests at the end of this file.
 #
 
-test_expect_success \
-    'fake a SHA1 hash collision' \
-    'long_a=$(git hash-object a | sed -e "s!^..!&/!") &&
-     long_b=$(git hash-object b | sed -e "s!^..!&/!") &&
-     test -f	.git/objects/$long_b &&
-     cp -f	.git/objects/$long_a \
-		.git/objects/$long_b'
+test_expect_success 'fake a SHA1 hash collision' '
+	long_a=$(git hash-object a | sed -e "s!^..!&/!") &&
+	long_b=$(git hash-object b | sed -e "s!^..!&/!") &&
+	test -f	.git/objects/$long_b &&
+	cp -f	.git/objects/$long_a \
+		.git/objects/$long_b
+'
 
-test_expect_success \
-    'make sure index-pack detects the SHA1 collision' \
-    'test_must_fail git index-pack -o bad.idx test-3.pack 2>msg &&
-     test_i18ngrep "SHA1 COLLISION FOUND" msg'
+test_expect_success 'make sure index-pack detects the SHA1 collision' '
+	test_must_fail git index-pack -o bad.idx test-3.pack 2>msg &&
+	test_i18ngrep "SHA1 COLLISION FOUND" msg
+'
 
-test_expect_success \
-    'make sure index-pack detects the SHA1 collision (large blobs)' \
-    'test_must_fail git -c core.bigfilethreshold=1 index-pack -o bad.idx test-3.pack 2>msg &&
-     test_i18ngrep "SHA1 COLLISION FOUND" msg'
+test_expect_success 'make sure index-pack detects the SHA1 collision (large blobs)' '
+	test_must_fail git -c core.bigfilethreshold=1 index-pack -o bad.idx test-3.pack 2>msg &&
+	test_i18ngrep "SHA1 COLLISION FOUND" msg
+'
 
 test_done
-- 
2.19.1.899.g0250525e69



* [PATCH v2 2/3] pack-objects tests: don't leave test .git corrupt at end
  2018-10-28 22:50           ` [PATCH 0/4] index-pack: optionally turn off SHA-1 collision checking Ævar Arnfjörð Bjarmason
                               ` (2 preceding siblings ...)
  2018-10-30 18:43             ` [PATCH v2 1/3] pack-objects test: modernize style Ævar Arnfjörð Bjarmason
@ 2018-10-30 18:43             ` Ævar Arnfjörð Bjarmason
  2018-10-30 18:43             ` [PATCH v2 3/3] index-pack tests: don't leave test repo dirty " Ævar Arnfjörð Bjarmason
  4 siblings, 0 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-30 18:43 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Geert Jansen, Christian Couder,
	Nicolas Pitre, Linus Torvalds, Petr Baudis,
	Ævar Arnfjörð Bjarmason

Change the pack-objects tests to not leave their .git directory
corrupt at the end.

In 2fca19fbb5 ("fix multiple issues with t5300", 2010-02-03) a comment
was added warning against adding any subsequent tests, but since
4614043c8f ("index-pack: use streaming interface for collision test on
large blobs", 2012-05-24) the comment has drifted away from the code,
mentioning two tests, when we actually have three.

Instead of having this warning let's just create a new .git directory
specifically for these tests.

As an aside, it would be interesting to instrument the test suite to
run a "git fsck" at the very end (in "test_done"). That would have
errored before this change, and may find other issues #leftoverbits.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/t5300-pack-object.sh | 37 ++++++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index a0309e4bab..410a09b0dd 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -468,29 +468,32 @@ test_expect_success 'pack-objects in too-many-packs mode' '
 	git fsck
 '
 
-#
-# WARNING!
-#
-# The following test is destructive.  Please keep the next
-# two tests at the end of this file.
-#
-
-test_expect_success 'fake a SHA1 hash collision' '
-	long_a=$(git hash-object a | sed -e "s!^..!&/!") &&
-	long_b=$(git hash-object b | sed -e "s!^..!&/!") &&
-	test -f	.git/objects/$long_b &&
-	cp -f	.git/objects/$long_a \
-		.git/objects/$long_b
+test_expect_success 'setup: fake a SHA1 hash collision' '
+	git init corrupt &&
+	(
+		cd corrupt &&
+		long_a=$(git hash-object -w ../a | sed -e "s!^..!&/!") &&
+		long_b=$(git hash-object -w ../b | sed -e "s!^..!&/!") &&
+		test -f	.git/objects/$long_b &&
+		cp -f	.git/objects/$long_a \
+			.git/objects/$long_b
+	)
 '
 
 test_expect_success 'make sure index-pack detects the SHA1 collision' '
-	test_must_fail git index-pack -o bad.idx test-3.pack 2>msg &&
-	test_i18ngrep "SHA1 COLLISION FOUND" msg
+	(
+		cd corrupt &&
+		test_must_fail git index-pack -o ../bad.idx ../test-3.pack 2>msg &&
+		test_i18ngrep "SHA1 COLLISION FOUND" msg
+	)
 '
 
 test_expect_success 'make sure index-pack detects the SHA1 collision (large blobs)' '
-	test_must_fail git -c core.bigfilethreshold=1 index-pack -o bad.idx test-3.pack 2>msg &&
-	test_i18ngrep "SHA1 COLLISION FOUND" msg
+	(
+		cd corrupt &&
+		test_must_fail git -c core.bigfilethreshold=1 index-pack -o ../bad.idx ../test-3.pack 2>msg &&
+		test_i18ngrep "SHA1 COLLISION FOUND" msg
+	)
 '
 
 test_done
-- 
2.19.1.899.g0250525e69



* [PATCH v2 3/3] index-pack tests: don't leave test repo dirty at end
  2018-10-28 22:50           ` [PATCH 0/4] index-pack: optionally turn off SHA-1 collision checking Ævar Arnfjörð Bjarmason
                               ` (3 preceding siblings ...)
  2018-10-30 18:43             ` [PATCH v2 2/3] pack-objects tests: don't leave test .git corrupt at end Ævar Arnfjörð Bjarmason
@ 2018-10-30 18:43             ` " Ævar Arnfjörð Bjarmason
  4 siblings, 0 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-30 18:43 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Geert Jansen, Christian Couder,
	Nicolas Pitre, Linus Torvalds, Petr Baudis,
	Ævar Arnfjörð Bjarmason

Change a test added in 51054177b3 ("index-pack: detect local
corruption in collision check", 2017-04-01) so that the repository
isn't left dirty at the end.

Due to the caveats explained in 720dae5a19 ("config doc: elaborate on
fetch.fsckObjects security", 2018-07-27) even a "fetch" that fails
will write to the local object store, so let's copy the bit-error test
directory before running this test.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/t1060-object-corruption.sh | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/t/t1060-object-corruption.sh b/t/t1060-object-corruption.sh
index ac1f189fd2..4feb65157d 100755
--- a/t/t1060-object-corruption.sh
+++ b/t/t1060-object-corruption.sh
@@ -117,8 +117,10 @@ test_expect_failure 'clone --local detects misnamed objects' '
 '
 
 test_expect_success 'fetch into corrupted repo with index-pack' '
+	cp -R bit-error bit-error-cp &&
+	test_when_finished "rm -rf bit-error-cp" &&
 	(
-		cd bit-error &&
+		cd bit-error-cp &&
 		test_must_fail git -c transfer.unpackLimit=1 \
 			fetch ../no-bit-error 2>stderr &&
 		test_i18ngrep ! -i collision stderr
-- 
2.19.1.899.g0250525e69



* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-10-29 23:27               ` Jeff King
@ 2018-11-07 22:55                 ` Geert Jansen
  2018-11-08 12:02                   ` Jeff King
  2018-11-09 13:43                   ` [RFC PATCH] index-pack: improve performance on NFS Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 87+ messages in thread
From: Geert Jansen @ 2018-11-07 22:55 UTC (permalink / raw)
  To: Jeff King; +Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git

On Mon, Oct 29, 2018 at 07:27:39PM -0400, Jeff King wrote:

> On Mon, Oct 29, 2018 at 08:36:07PM +0100, Ævar Arnfjörð Bjarmason wrote:
> >  * Re-roll my 4 patch series to include the patch you have in
> >    <20181027093300.GA23974@sigill.intra.peff.net>
> 
> I don't think it's quite ready for inclusion as-is. I hope to brush it
> up a bit, but I have quite a backlog of stuff to review, as well.

We're still quite keen to get this patch included. Is there anything I can do
to help?

Also I just re-read your comments on maximum cache size. I think you were
arguing both sides of the equation and I wasn't sure where you'd ended up. :)
A larger cache size potentially takes more time to fill up, especially on NFS,
while a smaller cache size would obviously be less effective. That said, a small
cache is still effective for the "clone" case where the repo is empty.

It also occurred to me that as a performance optimization your patch could read
the loose object directories in parallel using a thread pool. At least on
Amazon EFS this should result in an almost linear performance increase. I'm not
sure how much this would help for local file systems. In any case this may be
best done as a follow-up patch (that I'd be happy to volunteer for).


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-11-07 22:55                 ` Geert Jansen
@ 2018-11-08 12:02                   ` Jeff King
  2018-11-08 20:58                     ` Geert Jansen
                                       ` (2 more replies)
  2018-11-09 13:43                   ` [RFC PATCH] index-pack: improve performance on NFS Ævar Arnfjörð Bjarmason
  1 sibling, 3 replies; 87+ messages in thread
From: Jeff King @ 2018-11-08 12:02 UTC (permalink / raw)
  To: Geert Jansen; +Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git

On Wed, Nov 07, 2018 at 10:55:24PM +0000, Geert Jansen wrote:

> On Mon, Oct 29, 2018 at 07:27:39PM -0400, Jeff King wrote:
> 
> > On Mon, Oct 29, 2018 at 08:36:07PM +0100, Ævar Arnfjörð Bjarmason wrote:
> > >  * Re-roll my 4 patch series to include the patch you have in
> > >    <20181027093300.GA23974@sigill.intra.peff.net>
> > 
> > I don't think it's quite ready for inclusion as-is. I hope to brush it
> > up a bit, but I have quite a backlog of stuff to review, as well.
> 
> We're still quite keen to get this patch included. Is there anything I can do
> to help?

Yes, testing and review. :)

I won't send the series out just yet, as I suspect it could use another
read-through on my part. But if you want to peek at it or try some
timings, it's available at:

  https://github.com/peff/git jk/loose-cache

It's quite a bit bigger than the original patch, as some refactoring was
necessary to reuse the existing cache in alternate_object_directories.
I'm rather pleased with how it turned out; it unifies the handling of
alternates and the main object directory, which is a cleanup I've been
wanting to do for some time.

> Also I just re-read your comments on maximum cache size. I think you were
> arguing both sides of the equation and I wasn't sure where you'd ended up. :)
> A larger cache size potentially takes more time to fill up, especially on NFS,
> while a smaller cache size would obviously be less effective. That said, a small
> cache is still effective for the "clone" case where the repo is empty.

I ended up thinking that a large cache is going to be fine. So I didn't
even bother implementing a limit in my series, which makes things a bit
simpler (it's one less state to deal with).

Since it reuses the existing cache code, it's better in a few ways than
my earlier patch:

  1. If a program uses OBJECT_INFO_QUICK and prints abbreviated sha1s,
     we only have to load the cache once (I think fetch does this, but I
     didn't test it).

  2. The cache is filled one directory at a time, which avoids
     unnecessary work when there are only a few lookups.

  3. The cache is per-object-directory. So if a request can be filled
     without looking at an alternate, we avoid looking at the alternate.
     I doubt this matters much in practice (the case we care about is
     when we _don't_ have the object, and there you have to look
     everywhere).
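
Point 2 can be illustrated with a small sketch. This is not the code
from the series: the struct, the function names, the linear scan and
the stand-in fill function are all assumptions made for the example.
It only shows the idea that a fan-out directory is read the first time
a lookup lands in it, and never again:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define OID_RAWSZ 20
#define NR_SUBDIRS 256

struct oid_list {
	unsigned char (*oid)[OID_RAWSZ];
	size_t nr, alloc;
};

/* Hypothetical callback: append every object in fan-out directory `subdir`. */
typedef void (*fill_fn)(unsigned char subdir, struct oid_list *out);

struct loose_cache {
	struct oid_list subdir[NR_SUBDIRS];
	unsigned char loaded[NR_SUBDIRS]; /* 1 once that directory was read */
	fill_fn fill;
	unsigned nr_fills;                /* how many directories were read */
};

void oid_list_append(struct oid_list *list, const unsigned char *oid)
{
	if (list->nr == list->alloc) {
		list->alloc = list->alloc ? 2 * list->alloc : 16;
		list->oid = realloc(list->oid, list->alloc * sizeof(*list->oid));
		assert(list->oid != NULL);
	}
	memcpy(list->oid[list->nr++], oid, OID_RAWSZ);
}

void loose_cache_init(struct loose_cache *c, fill_fn fill)
{
	memset(c, 0, sizeof(*c));
	c->fill = fill;
}

void loose_cache_clear(struct loose_cache *c)
{
	int i;

	for (i = 0; i < NR_SUBDIRS; i++)
		free(c->subdir[i].oid);
	memset(c, 0, sizeof(*c));
}

/* Look up `oid`, reading only the one directory it would live in. */
int loose_cache_has(struct loose_cache *c, const unsigned char *oid)
{
	unsigned char d = oid[0];
	size_t i;

	if (!c->loaded[d]) {
		c->fill(d, &c->subdir[d]);
		c->loaded[d] = 1;
		c->nr_fills++;
	}
	/* Linear scan keeps the sketch short; a sorted array would be faster. */
	for (i = 0; i < c->subdir[d].nr; i++)
		if (!memcmp(c->subdir[d].oid[i], oid, OID_RAWSZ))
			return 1;
	return 0;
}

/* Stand-in for readdir(): each directory "holds" one object made of its own byte. */
void demo_fill(unsigned char subdir, struct oid_list *out)
{
	unsigned char oid[OID_RAWSZ];

	memset(oid, subdir, OID_RAWSZ);
	oid_list_append(out, oid);
}
```

Two lookups that both fall into directory 17 trigger a single fill, so
a handful of lookups never pays for reading all 256 directories.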

The one thing I didn't implement is a config option to disable this.
That would be pretty easy to add. I don't think it's necessary, but it
would make testing before/after behavior easier if somebody thinks it's
slowing down their particular case.

> It also occurred to me that as a performance optimization your patch could read
> the loose object directories in parallel using a thread pool. At least on
> Amazon EFS this should result in an almost linear performance increase. I'm not
> sure how much this would help for local file systems. In any case this may be
> best done as a follow-up patch (that I'd be happy to volunteer for).

Yeah, I suspect it could make things faster in some cases. But it also
implies filling all of the cache directories at once up front. The code
I have now tries to avoid unnecessary cache fills. But it would be
pretty easy to kick off a full fill.

I agree it would make more sense as a follow-up patch (and probably
controlled by a config option, since it likely only makes sense when you
have a really high-latency readdir).

-Peff


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-11-08 12:02                   ` Jeff King
@ 2018-11-08 20:58                     ` Geert Jansen
  2018-11-08 21:18                       ` Jeff King
  2018-11-08 22:20                     ` Ævar Arnfjörð Bjarmason
  2018-11-12 14:46                     ` [PATCH 0/9] caching loose objects Jeff King
  2 siblings, 1 reply; 87+ messages in thread
From: Geert Jansen @ 2018-11-08 20:58 UTC (permalink / raw)
  To: Jeff King; +Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git

On Thu, Nov 08, 2018 at 07:02:57AM -0500, Jeff King wrote:

> Yes, testing and review. :)
> 
> I won't send the series out just yet, as I suspect it could use another
> read-through on my part. But if you want to peek at it or try some
> timings, it's available at:
> 
>   https://github.com/peff/git jk/loose-cache

I gave this branch a go. There's a performance regression as I'm getting a
clone speed of about 100 KiB/s while with the previous patch I got around 20
MiB/s. The culprit appears to be a very large number of stat() calls on
".git/objects/info/alternates". The call stack is:

 -> quick_has_loose()
 -> prepare_alt_odb()
 -> read_info_alternates()


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-11-08 20:58                     ` Geert Jansen
@ 2018-11-08 21:18                       ` Jeff King
  2018-11-08 21:55                         ` Geert Jansen
  0 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-11-08 21:18 UTC (permalink / raw)
  To: Geert Jansen; +Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git

On Thu, Nov 08, 2018 at 08:58:19PM +0000, Geert Jansen wrote:

> On Thu, Nov 08, 2018 at 07:02:57AM -0500, Jeff King wrote:
> 
> > Yes, testing and review. :)
> > 
> > I won't send the series out just yet, as I suspect it could use another
> > read-through on my part. But if you want to peek at it or try some
> > timings, it's available at:
> > 
> >   https://github.com/peff/git jk/loose-cache
> 
> I gave this branch a go. There's a performance regression as I'm getting a
> clone speed of about 100 KiB/s while with the previous patch I got around 20
> MiB/s. The culprit appears to be a very large number of stat() calls on
> ".git/objects/info/alternates". The call stack is:
> 
>  -> quick_has_loose()
>  -> prepare_alt_odb()
>  -> read_info_alternates()

Heh, indeed. Try this on top:

diff --git a/sha1-file.c b/sha1-file.c
index bc35b28e17..9ff27f92ed 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -692,6 +692,7 @@ void prepare_alt_odb(struct repository *r)
 	link_alt_odb_entries(r, r->objects->alternate_db, PATH_SEP, NULL, 0);
 
 	read_info_alternates(r, r->objects->odb->path, 0);
+	r->objects->loaded_alternates = 1;
 }
 
 /* Returns 1 if we have successfully freshened the file, 0 otherwise. */

-Peff


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-11-08 21:18                       ` Jeff King
@ 2018-11-08 21:55                         ` Geert Jansen
  0 siblings, 0 replies; 87+ messages in thread
From: Geert Jansen @ 2018-11-08 21:55 UTC (permalink / raw)
  To: Jeff King; +Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git

On Thu, Nov 08, 2018 at 04:18:24PM -0500, Jeff King wrote:

> Heh, indeed. Try this on top:
> 
> diff --git a/sha1-file.c b/sha1-file.c
> index bc35b28e17..9ff27f92ed 100644
> --- a/sha1-file.c
> +++ b/sha1-file.c
> @@ -692,6 +692,7 @@ void prepare_alt_odb(struct repository *r)
>  	link_alt_odb_entries(r, r->objects->alternate_db, PATH_SEP, NULL, 0);
>  
>  	read_info_alternates(r, r->objects->odb->path, 0);
> +	r->objects->loaded_alternates = 1;
>  }
>  
>  /* Returns 1 if we have successfully freshened the file, 0 otherwise. */

Thanks, this did it. Performance is now back at the level of the previous patch.


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-11-08 12:02                   ` Jeff King
  2018-11-08 20:58                     ` Geert Jansen
@ 2018-11-08 22:20                     ` Ævar Arnfjörð Bjarmason
  2018-11-09 10:11                       ` Ævar Arnfjörð Bjarmason
  2018-11-12 14:31                       ` Jeff King
  2018-11-12 14:46                     ` [PATCH 0/9] caching loose objects Jeff King
  2 siblings, 2 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-08 22:20 UTC (permalink / raw)
  To: Jeff King; +Cc: Geert Jansen, Junio C Hamano, git


On Thu, Nov 08 2018, Jeff King wrote:

> On Wed, Nov 07, 2018 at 10:55:24PM +0000, Geert Jansen wrote:
>
>> On Mon, Oct 29, 2018 at 07:27:39PM -0400, Jeff King wrote:
>>
>> > On Mon, Oct 29, 2018 at 08:36:07PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> > >  * Re-roll my 4 patch series to include the patch you have in
>> > >    <20181027093300.GA23974@sigill.intra.peff.net>
>> >
>> > I don't think it's quite ready for inclusion as-is. I hope to brush it
>> > up a bit, but I have quite a backlog of stuff to review, as well.
>>
>> We're still quite keen to get this patch included. Is there anything I can do
>> to help?
>
> Yes, testing and review. :)
>
> I won't send the series out just yet, as I suspect it could use another
> read-through on my part. But if you want to peek at it or try some
> timings, it's available at:
>
>   https://github.com/peff/git jk/loose-cache

Just a comment on this from the series:

    Note that it is possible for this to actually be _slower_. We'll do a
    full readdir() to fill the cache, so if you have a very large number of
    loose objects and a very small number of lookups, that readdir() may end
    up more expensive.

    In practice, though, having a large number of loose objects is already a
    performance problem, which should be fixed by repacking or pruning via
    git-gc. So on balance, this should be a good tradeoff.

Our biggest repo has a very large number of loose objects at any given
time, but the vast majority of these are because gc *is* happening very
frequently and the default expiry policy of 2wks is in effect.

Having a large number of loose objects is not per-se a performance
problem.

It's a problem if you end up "faulting" from packs to the loose
object directory a lot because those objects are still reachable, but if
they're not reachable that number can grow very large if your ref churn
is large (so lots of expired loose object production).

Anyway, the series per-se looks good to me. It's particularly nice to
have some of the ODB cleanup + cleanup in fetch-pack.c

Just wanted to note that in our default (reasonable) config we do
produce scenarios where this change can still be somewhat pathological,
so I'm still interested in disabling it entirely given the
implausibility of what it's guarding against.


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-11-08 22:20                     ` Ævar Arnfjörð Bjarmason
@ 2018-11-09 10:11                       ` Ævar Arnfjörð Bjarmason
  2018-11-12 14:31                       ` Jeff King
  1 sibling, 0 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-09 10:11 UTC (permalink / raw)
  To: Jeff King; +Cc: Geert Jansen, Junio C Hamano, git


On Thu, Nov 08 2018, Ævar Arnfjörð Bjarmason wrote:

> On Thu, Nov 08 2018, Jeff King wrote:
>
>> On Wed, Nov 07, 2018 at 10:55:24PM +0000, Geert Jansen wrote:
>>
>>> On Mon, Oct 29, 2018 at 07:27:39PM -0400, Jeff King wrote:
>>>
>>> > On Mon, Oct 29, 2018 at 08:36:07PM +0100, Ævar Arnfjörð Bjarmason wrote:
>>> > >  * Re-roll my 4 patch series to include the patch you have in
>>> > >    <20181027093300.GA23974@sigill.intra.peff.net>
>>> >
>>> > I don't think it's quite ready for inclusion as-is. I hope to brush it
>>> > up a bit, but I have quite a backlog of stuff to review, as well.
>>>
>>> We're still quite keen to get this patch included. Is there anything I can do
>>> to help?
>>
>> Yes, testing and review. :)
>>
>> I won't send the series out just yet, as I suspect it could use another
>> read-through on my part. But if you want to peek at it or try some
>> timings, it's available at:
>>
>>   https://github.com/peff/git jk/loose-cache
>
> Just a comment on this from the series:
>
>     Note that it is possible for this to actually be _slower_. We'll do a
>     full readdir() to fill the cache, so if you have a very large number of
>     loose objects and a very small number of lookups, that readdir() may end
>     up more expensive.
>
>     In practice, though, having a large number of loose objects is already a
>     performance problem, which should be fixed by repacking or pruning via
>     git-gc. So on balance, this should be a good tradeoff.
>
> Our biggest repo has a very large number of loose objects at any given
> time, but the vast majority of these are because gc *is* happening very
> frequently and the default expiry policy of 2wks is in effect.
>
> Having a large number of loose objects is not per-se a performance
> problem.
>
> It's a problem if you end up "faulting" to from packs to the loose
> object directory a lot because those objects are still reachable, but if
> they're not reachable that number can grow very large if your ref churn
> is large (so lots of expired loose object production).
>
> Anyway, the series per-se looks good to me. It's particularly nice to
> have some of the ODB cleanup + cleanup in fetch-pack.c
>
> Just wanted to note that in our default (reasonable) config we do
> produce scenarios where this change can still be somewhat pathological,
> so I'm still interested in disabling it entirely given the
> implausibility of what it's guarding against.

Some actual numbers for this for a fairly small repo on NFS, "cold"
cache (hadn't run it in a bit):

    $ time (find objects/?? -type f|wc -l)
    862
    real    0m1.927s

Warm cache:

    $ time (find objects/?? -type f|wc -l)
    872
    real    0m0.151s

Cold cache on a bigger monorepo:

    $ time (find objects/?? -type f|wc -l)
    real    0m4.336s

Warm cache on a bigger monorepo (more ref churn):

    $ time (find objects/?? -type f|wc -l)
    49746
    real    0m1.082s

This is on a server where bulk sustained writes of large files are really
fast (up to 1GB/s). It's just these metadata ops that are slow.

I also get cold cache times of up to 6 seconds on:

    time (find $(ls -d objects/??|sort -R) -type f | wc -l)

As opposed to a max of ~4s without -R, so I suspect there may be some
client/server optimization where things are iterated over in recursive
glob order (pre-fetched?), whereas the cache will try to fill buckets as
it encounters loose objects, so it will iterate over objects/{00..ff}
randomly.

I'm not really leading up to any point here I haven't made already. I
was just curious to try to find some upper bound of overhead if say a
pack with 512 objects is pushed. In that case it's very likely that we
need to fill at least 200/256 buckets.


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-11-07 22:55                 ` Geert Jansen
  2018-11-08 12:02                   ` Jeff King
@ 2018-11-09 13:43                   ` Ævar Arnfjörð Bjarmason
  2018-11-09 16:08                     ` Duy Nguyen
  2018-11-12 22:58                     ` Geert Jansen
  1 sibling, 2 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-09 13:43 UTC (permalink / raw)
  To: Geert Jansen; +Cc: Jeff King, Junio C Hamano, git


On Wed, Nov 07 2018, Geert Jansen wrote:

> On Mon, Oct 29, 2018 at 07:27:39PM -0400, Jeff King wrote:
>
>> On Mon, Oct 29, 2018 at 08:36:07PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> >  * Re-roll my 4 patch series to include the patch you have in
>> >    <20181027093300.GA23974@sigill.intra.peff.net>
>>
>> I don't think it's quite ready for inclusion as-is. I hope to brush it
>> up a bit, but I have quite a backlog of stuff to review, as well.
>
> We're still quite keen to get this patch included. Is there anything I can do
> to help?
>
> Also I just re-read your comments on maximum cache size. I think you were
> arguing both sides of the equation and I wasn't sure where you'd ended up. :)
> A larger cache size potentially takes more time to fill up especially on NFS
> while a smaller cache size obviously would less effective. That said a small
> cache is still effective for the "clone" case where the repo is empty.
>
> It also occurred to me that as a performance optimization your patch could read
> the the loose object directories in parallel using a thread pool. At least on
> Amazon EFS this should result in al almost linear performance increase. I'm not
> sure how much this would help for local file systems. In any case this may be
> best done as a follow-up patch (that I'd be happy to volunteer for).

I'm planning to re-submit mine with some minor changes after the great
Documentation/config* move lands.

As noted in
https://public-inbox.org/git/87bm7clf4o.fsf@evledraar.gmail.com/ and
https://public-inbox.org/git/87h8gq5zmc.fsf@evledraar.gmail.com/ I think
it's worthwhile regardless of how good Jeff's optimization is. O(nothing)
is always faster than O(something), particularly (as explained in that
E-Mail) on NFS.

You didn't answer my question in
https://public-inbox.org/git/20181030024925.GC8325@amazon.com/ about
whether for your purposes you're interested in this for something where
it needs to work out of the box on some random Amazon customer's
"git", or if it's something in-house and you just don't want to turn off
collision checking. That would be useful to know.

I've turned on core.checkCollisions=false in production at our
site. Cloning for some large repositories went from ~200 minutes to ~5
minutes, and some pushes from ~5 minutes to ~10 seconds. Those numbers
will be very similar (but slightly higher, maybe 1-5 seconds higher in
the latter case) with Jeff's series (depending on the push).


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-11-09 13:43                   ` [RFC PATCH] index-pack: improve performance on NFS Ævar Arnfjörð Bjarmason
@ 2018-11-09 16:08                     ` Duy Nguyen
  2018-11-10 14:04                       ` Ævar Arnfjörð Bjarmason
  2018-11-12 22:58                     ` Geert Jansen
  1 sibling, 1 reply; 87+ messages in thread
From: Duy Nguyen @ 2018-11-09 16:08 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: gerardu, Jeff King, Junio C Hamano, Git Mailing List

On Fri, Nov 9, 2018 at 2:46 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> I'm planning to re-submit mine with some minor changes after the great
> Documentation/config* move lands.
>
> As noted in
> https://public-inbox.org/git/87bm7clf4o.fsf@evledraar.gmail.com/ and
> https://public-inbox.org/git/87h8gq5zmc.fsf@evledraar.gmail.com/ I think
> it's regardless of Jeff's optimization is. O(nothing) is always faster
> than O(something), particularly (as explained in that E-Mail) on NFS.

Is it really worth adding more code to maintain just to shave a couple
seconds (or a few percent clone time)?
-- 
Duy


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-11-09 16:08                     ` Duy Nguyen
@ 2018-11-10 14:04                       ` Ævar Arnfjörð Bjarmason
  2018-11-12 14:34                         ` Jeff King
  0 siblings, 1 reply; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-10 14:04 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: gerardu, Jeff King, Junio C Hamano, Git Mailing List


On Fri, Nov 09 2018, Duy Nguyen wrote:

> On Fri, Nov 9, 2018 at 2:46 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> I'm planning to re-submit mine with some minor changes after the great
>> Documentation/config* move lands.
>>
>> As noted in
>> https://public-inbox.org/git/87bm7clf4o.fsf@evledraar.gmail.com/ and
>> https://public-inbox.org/git/87h8gq5zmc.fsf@evledraar.gmail.com/ I think
>> it's regardless of Jeff's optimization is. O(nothing) is always faster
>> than O(something), particularly (as explained in that E-Mail) on NFS.
>
> Is it really worth adding more code to maintain just to shave a couple
> seconds (or a few percent clone time)?

Yeah I think so, because (in rough priority order):

a) The maintenance burden of carrying core.checkCollisions is trivial,
   and it's hard to imagine a scenario where it'll be difficult to
   selectively turn off some does_this_collide() function.

b) I think I need to worry more about a meteorite colliding with the
   datacenter than the threat this check is trying to guard against.

c) I think we should just turn it off by default on SHA-1, but don't
   expect that argument will carry the day. But I expect even those who
   think we still need it will have a hard time making that argument in
   the case of SHA-256. So having the codepath to disable it is helpful.

d) As shown in the linked E-Mails of mine, with Jeff's version of this
   optimization (and not with mine) you sometimes pay a 2-3 second
   *fixed* cost even for a very small (think ~100-200 objects) push/fetch
   that would otherwise take milliseconds. This can be a slowdown of
   hundreds or thousands of percent.

   Is that a big deal in itself in terms of absolute time spent? No. But
   I'm also thinking about this from the perspective of getting noise
   out of performance metrics. Some of this slowdown is also "user
   waiting for the terminal to be usable again" not just some machine
   somewhere wasting its own time.

e) As shown in the patch I have, this direction has the very beneficial
   side effect of making it much easier to repair corrupt
   repositories. That's something I'm hoping to pursue even further. I've
   had cases where core.checkCollisions=false + stuff on top would have
   made repairing a broken repo much easier.

Anyway, I'm in no rush to send my patch. I'm happily using it in
production, but will wait for Jeff's to be ready and to land before
picking it up again. Just wanted to do a braindump of the benefits.


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-11-08 22:20                     ` Ævar Arnfjörð Bjarmason
  2018-11-09 10:11                       ` Ævar Arnfjörð Bjarmason
@ 2018-11-12 14:31                       ` Jeff King
  1 sibling, 0 replies; 87+ messages in thread
From: Jeff King @ 2018-11-12 14:31 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Geert Jansen, Junio C Hamano, git

On Thu, Nov 08, 2018 at 11:20:47PM +0100, Ævar Arnfjörð Bjarmason wrote:

> Just a comment on this from the series:
> 
>     Note that it is possible for this to actually be _slower_. We'll do a
>     full readdir() to fill the cache, so if you have a very large number of
>     loose objects and a very small number of lookups, that readdir() may end
>     up more expensive.
> 
>     In practice, though, having a large number of loose objects is already a
>     performance problem, which should be fixed by repacking or pruning via
>     git-gc. So on balance, this should be a good tradeoff.
> 
> Our biggest repo has a very large number of loose objects at any given
> time, but the vast majority of these are because gc *is* happening very
> frequently and the default expiry policy of 2wks is in effect.
> 
> Having a large number of loose objects is not per-se a performance
> problem.

Yes, you're right. I was trying not to get into the rabbit hole of
discussing theoretical tradeoffs, but it is worth addressing. I've
updated that commit message in the patches I'll send out momentarily.

-Peff


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-11-10 14:04                       ` Ævar Arnfjörð Bjarmason
@ 2018-11-12 14:34                         ` Jeff King
  0 siblings, 0 replies; 87+ messages in thread
From: Jeff King @ 2018-11-12 14:34 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Duy Nguyen, gerardu, Junio C Hamano, Git Mailing List

On Sat, Nov 10, 2018 at 03:04:35PM +0100, Ævar Arnfjörð Bjarmason wrote:

> d) As shown in the linked E-Mails of mine you sometimes pay a 2-3 second
>    *fixed* cost even for a very small (think ~100-200 objects) push/fetch
>    that would otherwise take milliseconds with Jeff's version of this
>    optimization (and not with mine). This can be a hundred/thousands of
>    percent slowdown.
> 
>    Is that a big deal in itself in terms of absolute time spent? No. But
>    I'm also thinking about this from the perspective of getting noise
>    out of performance metrics. Some of this slowdown is also "user
>    waiting for the terminal to be usable again" not just some machine
>    somewhere wasting its own time.

IMHO the ultimate end-game in this direction is still "don't have a
bunch of loose objects".

Right now this can legitimately happen due to unreachable-but-recent
objects being exploded out (or never packed in the first place). But I
hope in the long run that we'll actually put these into packs. That will
make this case faster _and_ avoid extra work during gc _and_ fix the
"whoops, we just ran gc but you still have a lot of objects" problem.

Which doesn't invalidate your other four points, of course. ;)

-Peff


* [PATCH 0/9] caching loose objects
  2018-11-08 12:02                   ` Jeff King
  2018-11-08 20:58                     ` Geert Jansen
  2018-11-08 22:20                     ` Ævar Arnfjörð Bjarmason
@ 2018-11-12 14:46                     ` Jeff King
  2018-11-12 14:46                       ` [PATCH 1/9] fsck: do not reuse child_process structs Jeff King
                                         ` (9 more replies)
  2 siblings, 10 replies; 87+ messages in thread
From: Jeff King @ 2018-11-12 14:46 UTC (permalink / raw)
  To: Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

Here's the series I mentioned earlier in the thread to cache loose
objects when answering has_object_file(..., OBJECT_INFO_QUICK). For
those just joining us, this makes operations that look up a lot of
missing objects (like "index-pack" looking for collisions) faster. This
is mostly targeted at systems where stat() is slow, like over NFS, but
it seems to give a 2% speedup indexing a full git.git packfile into an
empty repository (i.e., what you'd see on a clone).

I'm adding René Scharfe and Takuto Ikuta to the cc for their previous
work in loose-object caching.

The interesting bit is patch 8. The rest of it is cleanup to let us
treat alternates and the main object directory similarly.

  [1/9]: fsck: do not reuse child_process structs
  [2/9]: submodule--helper: prefer strip_suffix() to ends_with()
  [3/9]: rename "alternate_object_database" to "object_directory"
  [4/9]: sha1_file_name(): overwrite buffer instead of appending
  [5/9]: handle alternates paths the same as the main object dir
  [6/9]: sha1-file: use an object_directory for the main object dir
  [7/9]: object-store: provide helpers for loose_objects_cache
  [8/9]: sha1-file: use loose object cache for quick existence check
  [9/9]: fetch-pack: drop custom loose object cache

 builtin/count-objects.c     |   4 +-
 builtin/fsck.c              |  35 +++---
 builtin/grep.c              |   2 +-
 builtin/submodule--helper.c |   9 +-
 commit-graph.c              |  13 +--
 environment.c               |   4 +-
 fetch-pack.c                |  39 +------
 http-walker.c               |   2 +-
 http.c                      |   4 +-
 object-store.h              |  60 +++++------
 object.c                    |  26 ++---
 packfile.c                  |  20 ++--
 path.c                      |   2 +-
 repository.c                |   8 +-
 sha1-file.c                 | 210 ++++++++++++++++++------------------
 sha1-name.c                 |  42 ++------
 transport.c                 |   2 +-
 17 files changed, 209 insertions(+), 273 deletions(-)

-Peff


* [PATCH 1/9] fsck: do not reuse child_process structs
  2018-11-12 14:46                     ` [PATCH 0/9] caching loose objects Jeff King
@ 2018-11-12 14:46                       ` Jeff King
  2018-11-12 15:26                         ` Derrick Stolee
  2018-11-12 14:47                       ` [PATCH 2/9] submodule--helper: prefer strip_suffix() to ends_with() Jeff King
                                         ` (8 subsequent siblings)
  9 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-11-12 14:46 UTC (permalink / raw)
  To: Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

The run-command API makes no promises about what is left in a struct
child_process after a command finishes, and it's not safe to simply
reuse it again for a similar command. In particular:

 - if you use child->args or child->env_array, they are cleared after
   finish_command()

 - likewise, start_command() may point child->argv at child->args->argv;
   reusing that would lead to accessing freed memory

 - the in/out/err may hold pipe descriptors from the previous run

These two calls are _probably_ OK because they do not use any of those
features. But it's only by chance, and may break in the future; let's
reinitialize our struct for each program we run.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fsck.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 06eb421720..b10f2b154c 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -841,6 +841,9 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 
 		prepare_alt_odb(the_repository);
 		for (alt =  the_repository->objects->alt_odb_list; alt; alt = alt->next) {
+			child_process_init(&commit_graph_verify);
+			commit_graph_verify.argv = verify_argv;
+			commit_graph_verify.git_cmd = 1;
 			verify_argv[2] = "--object-dir";
 			verify_argv[3] = alt->path;
 			if (run_command(&commit_graph_verify))
@@ -859,6 +862,9 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 
 		prepare_alt_odb(the_repository);
 		for (alt =  the_repository->objects->alt_odb_list; alt; alt = alt->next) {
+			child_process_init(&midx_verify);
+			midx_verify.argv = midx_argv;
+			midx_verify.git_cmd = 1;
 			midx_argv[2] = "--object-dir";
 			midx_argv[3] = alt->path;
 			if (run_command(&midx_verify))
-- 
2.19.1.1577.g2c5b293d4f



* [PATCH 2/9] submodule--helper: prefer strip_suffix() to ends_with()
  2018-11-12 14:46                     ` [PATCH 0/9] caching loose objects Jeff King
  2018-11-12 14:46                       ` [PATCH 1/9] fsck: do not reuse child_process structs Jeff King
@ 2018-11-12 14:47                       ` Jeff King
  2018-11-12 18:23                         ` Stefan Beller
  2018-11-12 14:48                       ` [PATCH 3/9] rename "alternate_object_database" to "object_directory" Jeff King
                                         ` (7 subsequent siblings)
  9 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-11-12 14:47 UTC (permalink / raw)
  To: Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

Using strip_suffix() lets us avoid repeating ourselves. It also makes
the handling of "/" a bit less subtle (we strip one less character than
we matched in order to leave it in place, but we can just as easily
include the "/" when we add more path components).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/submodule--helper.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
index 676175b9be..28b9449e82 100644
--- a/builtin/submodule--helper.c
+++ b/builtin/submodule--helper.c
@@ -1268,16 +1268,17 @@ static int add_possible_reference_from_superproject(
 		struct alternate_object_database *alt, void *sas_cb)
 {
 	struct submodule_alternate_setup *sas = sas_cb;
+	size_t len;
 
 	/*
 	 * If the alternate object store is another repository, try the
 	 * standard layout with .git/(modules/<name>)+/objects
 	 */
-	if (ends_with(alt->path, "/objects")) {
+	if (strip_suffix(alt->path, "/objects", &len)) {
 		char *sm_alternate;
 		struct strbuf sb = STRBUF_INIT;
 		struct strbuf err = STRBUF_INIT;
-		strbuf_add(&sb, alt->path, strlen(alt->path) - strlen("objects"));
+		strbuf_add(&sb, alt->path, len);
 
 		/*
 		 * We need to end the new path with '/' to mark it as a dir,
@@ -1285,7 +1286,7 @@ static int add_possible_reference_from_superproject(
 		 * as the last part of a missing submodule reference would
 		 * be taken as a file name.
 		 */
-		strbuf_addf(&sb, "modules/%s/", sas->submodule_name);
+		strbuf_addf(&sb, "/modules/%s/", sas->submodule_name);
 
 		sm_alternate = compute_alternate_path(sb.buf, &err);
 		if (sm_alternate) {
-- 
2.19.1.1577.g2c5b293d4f



* [PATCH 3/9] rename "alternate_object_database" to "object_directory"
  2018-11-12 14:46                     ` [PATCH 0/9] caching loose objects Jeff King
  2018-11-12 14:46                       ` [PATCH 1/9] fsck: do not reuse child_process structs Jeff King
  2018-11-12 14:47                       ` [PATCH 2/9] submodule--helper: prefer strip_suffix() to ends_with() Jeff King
@ 2018-11-12 14:48                       ` Jeff King
  2018-11-12 15:30                         ` Derrick Stolee
  2018-11-12 14:48                       ` [PATCH 4/9] sha1_file_name(): overwrite buffer instead of appending Jeff King
                                         ` (6 subsequent siblings)
  9 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-11-12 14:48 UTC (permalink / raw)
  To: Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

In preparation for unifying the handling of alt odb's and the normal
repo object directory, let's use a more neutral name. This patch is
purely mechanical, swapping the type name, and converting any variables
named "alt" to "odb". There should be no functional change, but it will
reduce the noise in subsequent diffs.

Signed-off-by: Jeff King <peff@peff.net>
---
I waffled on calling this object_database instead of object_directory.
But really, it is very specifically about the directory (packed
storage, including packs from alternates, is handled elsewhere).

 builtin/count-objects.c     |  4 ++--
 builtin/fsck.c              | 16 ++++++-------
 builtin/submodule--helper.c |  6 ++---
 commit-graph.c              | 10 ++++----
 object-store.h              | 14 +++++------
 object.c                    | 10 ++++----
 packfile.c                  |  8 +++----
 sha1-file.c                 | 48 ++++++++++++++++++-------------------
 sha1-name.c                 | 20 ++++++++--------
 transport.c                 |  2 +-
 10 files changed, 69 insertions(+), 69 deletions(-)

diff --git a/builtin/count-objects.c b/builtin/count-objects.c
index a7cad052c6..3fae474f6f 100644
--- a/builtin/count-objects.c
+++ b/builtin/count-objects.c
@@ -78,10 +78,10 @@ static int count_cruft(const char *basename, const char *path, void *data)
 	return 0;
 }
 
-static int print_alternate(struct alternate_object_database *alt, void *data)
+static int print_alternate(struct object_directory *odb, void *data)
 {
 	printf("alternate: ");
-	quote_c_style(alt->path, NULL, stdout, 0);
+	quote_c_style(odb->path, NULL, stdout, 0);
 	putchar('\n');
 	return 0;
 }
diff --git a/builtin/fsck.c b/builtin/fsck.c
index b10f2b154c..55153cf92a 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -688,7 +688,7 @@ static struct option fsck_opts[] = {
 int cmd_fsck(int argc, const char **argv, const char *prefix)
 {
 	int i;
-	struct alternate_object_database *alt;
+	struct object_directory *odb;
 
 	/* fsck knows how to handle missing promisor objects */
 	fetch_if_missing = 0;
@@ -725,14 +725,14 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 		for_each_loose_object(mark_loose_for_connectivity, NULL, 0);
 		for_each_packed_object(mark_packed_for_connectivity, NULL, 0);
 	} else {
-		struct alternate_object_database *alt_odb_list;
+		struct object_directory *alt_odb_list;
 
 		fsck_object_dir(get_object_directory());
 
 		prepare_alt_odb(the_repository);
 		alt_odb_list = the_repository->objects->alt_odb_list;
-		for (alt = alt_odb_list; alt; alt = alt->next)
-			fsck_object_dir(alt->path);
+		for (odb = alt_odb_list; odb; odb = odb->next)
+			fsck_object_dir(odb->path);
 
 		if (check_full) {
 			struct packed_git *p;
@@ -840,12 +840,12 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 			errors_found |= ERROR_COMMIT_GRAPH;
 
 		prepare_alt_odb(the_repository);
-		for (alt =  the_repository->objects->alt_odb_list; alt; alt = alt->next) {
+		for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) {
 			child_process_init(&commit_graph_verify);
 			commit_graph_verify.argv = verify_argv;
 			commit_graph_verify.git_cmd = 1;
 			verify_argv[2] = "--object-dir";
-			verify_argv[3] = alt->path;
+			verify_argv[3] = odb->path;
 			if (run_command(&commit_graph_verify))
 				errors_found |= ERROR_COMMIT_GRAPH;
 		}
@@ -861,12 +861,12 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 			errors_found |= ERROR_COMMIT_GRAPH;
 
 		prepare_alt_odb(the_repository);
-		for (alt =  the_repository->objects->alt_odb_list; alt; alt = alt->next) {
+		for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) {
 			child_process_init(&midx_verify);
 			midx_verify.argv = midx_argv;
 			midx_verify.git_cmd = 1;
 			midx_argv[2] = "--object-dir";
-			midx_argv[3] = alt->path;
+			midx_argv[3] = odb->path;
 			if (run_command(&midx_verify))
 				errors_found |= ERROR_COMMIT_GRAPH;
 		}
diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
index 28b9449e82..3ae451bc46 100644
--- a/builtin/submodule--helper.c
+++ b/builtin/submodule--helper.c
@@ -1265,7 +1265,7 @@ struct submodule_alternate_setup {
 	SUBMODULE_ALTERNATE_ERROR_IGNORE, NULL }
 
 static int add_possible_reference_from_superproject(
-		struct alternate_object_database *alt, void *sas_cb)
+		struct object_directory *odb, void *sas_cb)
 {
 	struct submodule_alternate_setup *sas = sas_cb;
 	size_t len;
@@ -1274,11 +1274,11 @@ static int add_possible_reference_from_superproject(
 	 * If the alternate object store is another repository, try the
 	 * standard layout with .git/(modules/<name>)+/objects
 	 */
-	if (strip_suffix(alt->path, "/objects", &len)) {
+	if (strip_suffix(odb->path, "/objects", &len)) {
 		char *sm_alternate;
 		struct strbuf sb = STRBUF_INIT;
 		struct strbuf err = STRBUF_INIT;
-		strbuf_add(&sb, alt->path, len);
+		strbuf_add(&sb, odb->path, len);
 
 		/*
 		 * We need to end the new path with '/' to mark it as a dir,
diff --git a/commit-graph.c b/commit-graph.c
index 40c855f185..5dd3f5b15c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -230,7 +230,7 @@ static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
  */
 static int prepare_commit_graph(struct repository *r)
 {
-	struct alternate_object_database *alt;
+	struct object_directory *odb;
 	char *obj_dir;
 	int config_value;
 
@@ -255,10 +255,10 @@ static int prepare_commit_graph(struct repository *r)
 	obj_dir = r->objects->objectdir;
 	prepare_commit_graph_one(r, obj_dir);
 	prepare_alt_odb(r);
-	for (alt = r->objects->alt_odb_list;
-	     !r->objects->commit_graph && alt;
-	     alt = alt->next)
-		prepare_commit_graph_one(r, alt->path);
+	for (odb = r->objects->alt_odb_list;
+	     !r->objects->commit_graph && odb;
+	     odb = odb->next)
+		prepare_commit_graph_one(r, odb->path);
 	return !!r->objects->commit_graph;
 }
 
diff --git a/object-store.h b/object-store.h
index 63b7605a3e..122d5f75e2 100644
--- a/object-store.h
+++ b/object-store.h
@@ -7,8 +7,8 @@
 #include "sha1-array.h"
 #include "strbuf.h"
 
-struct alternate_object_database {
-	struct alternate_object_database *next;
+struct object_directory {
+	struct object_directory *next;
 
 	/* see alt_scratch_buf() */
 	struct strbuf scratch;
@@ -32,14 +32,14 @@ struct alternate_object_database {
 };
 void prepare_alt_odb(struct repository *r);
 char *compute_alternate_path(const char *path, struct strbuf *err);
-typedef int alt_odb_fn(struct alternate_object_database *, void *);
+typedef int alt_odb_fn(struct object_directory *, void *);
 int foreach_alt_odb(alt_odb_fn, void*);
 
 /*
  * Allocate a "struct alternate_object_database" but do _not_ actually
  * add it to the list of alternates.
  */
-struct alternate_object_database *alloc_alt_odb(const char *dir);
+struct object_directory *alloc_alt_odb(const char *dir);
 
 /*
  * Add the directory to the on-disk alternates file; the new entry will also
@@ -60,7 +60,7 @@ void add_to_alternates_memory(const char *dir);
  * alternate. Always use this over direct access to alt->scratch, as it
  * cleans up any previous use of the scratch buffer.
  */
-struct strbuf *alt_scratch_buf(struct alternate_object_database *alt);
+struct strbuf *alt_scratch_buf(struct object_directory *odb);
 
 struct packed_git {
 	struct packed_git *next;
@@ -100,8 +100,8 @@ struct raw_object_store {
 	/* Path to extra alternate object database if not NULL */
 	char *alternate_db;
 
-	struct alternate_object_database *alt_odb_list;
-	struct alternate_object_database **alt_odb_tail;
+	struct object_directory *alt_odb_list;
+	struct object_directory **alt_odb_tail;
 
 	/*
 	 * Objects that should be substituted by other objects
diff --git a/object.c b/object.c
index e54160550c..6af8e908bb 100644
--- a/object.c
+++ b/object.c
@@ -482,17 +482,17 @@ struct raw_object_store *raw_object_store_new(void)
 	return o;
 }
 
-static void free_alt_odb(struct alternate_object_database *alt)
+static void free_alt_odb(struct object_directory *odb)
 {
-	strbuf_release(&alt->scratch);
-	oid_array_clear(&alt->loose_objects_cache);
-	free(alt);
+	strbuf_release(&odb->scratch);
+	oid_array_clear(&odb->loose_objects_cache);
+	free(odb);
 }
 
 static void free_alt_odbs(struct raw_object_store *o)
 {
 	while (o->alt_odb_list) {
-		struct alternate_object_database *next;
+		struct object_directory *next;
 
 		next = o->alt_odb_list->next;
 		free_alt_odb(o->alt_odb_list);
diff --git a/packfile.c b/packfile.c
index f2850a00b5..d6d511cfd2 100644
--- a/packfile.c
+++ b/packfile.c
@@ -966,16 +966,16 @@ static void prepare_packed_git_mru(struct repository *r)
 
 static void prepare_packed_git(struct repository *r)
 {
-	struct alternate_object_database *alt;
+	struct object_directory *odb;
 
 	if (r->objects->packed_git_initialized)
 		return;
 	prepare_multi_pack_index_one(r, r->objects->objectdir, 1);
 	prepare_packed_git_one(r, r->objects->objectdir, 1);
 	prepare_alt_odb(r);
-	for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
-		prepare_multi_pack_index_one(r, alt->path, 0);
-		prepare_packed_git_one(r, alt->path, 0);
+	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
+		prepare_multi_pack_index_one(r, odb->path, 0);
+		prepare_packed_git_one(r, odb->path, 0);
 	}
 	rearrange_packed_git(r);
 
diff --git a/sha1-file.c b/sha1-file.c
index dd0b6aa873..a3cc650a0a 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -353,16 +353,16 @@ void sha1_file_name(struct repository *r, struct strbuf *buf, const unsigned cha
 	fill_sha1_path(buf, sha1);
 }
 
-struct strbuf *alt_scratch_buf(struct alternate_object_database *alt)
+struct strbuf *alt_scratch_buf(struct object_directory *odb)
 {
-	strbuf_setlen(&alt->scratch, alt->base_len);
-	return &alt->scratch;
+	strbuf_setlen(&odb->scratch, odb->base_len);
+	return &odb->scratch;
 }
 
-static const char *alt_sha1_path(struct alternate_object_database *alt,
+static const char *alt_sha1_path(struct object_directory *odb,
 				 const unsigned char *sha1)
 {
-	struct strbuf *buf = alt_scratch_buf(alt);
+	struct strbuf *buf = alt_scratch_buf(odb);
 	fill_sha1_path(buf, sha1);
 	return buf->buf;
 }
@@ -374,7 +374,7 @@ static int alt_odb_usable(struct raw_object_store *o,
 			  struct strbuf *path,
 			  const char *normalized_objdir)
 {
-	struct alternate_object_database *alt;
+	struct object_directory *odb;
 
 	/* Detect cases where alternate disappeared */
 	if (!is_directory(path->buf)) {
@@ -388,8 +388,8 @@ static int alt_odb_usable(struct raw_object_store *o,
 	 * Prevent the common mistake of listing the same
 	 * thing twice, or object directory itself.
 	 */
-	for (alt = o->alt_odb_list; alt; alt = alt->next) {
-		if (!fspathcmp(path->buf, alt->path))
+	for (odb = o->alt_odb_list; odb; odb = odb->next) {
+		if (!fspathcmp(path->buf, odb->path))
 			return 0;
 	}
 	if (!fspathcmp(path->buf, normalized_objdir))
@@ -402,7 +402,7 @@ static int alt_odb_usable(struct raw_object_store *o,
  * Prepare alternate object database registry.
  *
  * The variable alt_odb_list points at the list of struct
- * alternate_object_database.  The elements on this list come from
+ * object_directory.  The elements on this list come from
  * non-empty elements from colon separated ALTERNATE_DB_ENVIRONMENT
  * environment variable, and $GIT_OBJECT_DIRECTORY/info/alternates,
  * whose contents is similar to that environment variable but can be
@@ -419,7 +419,7 @@ static void read_info_alternates(struct repository *r,
 static int link_alt_odb_entry(struct repository *r, const char *entry,
 	const char *relative_base, int depth, const char *normalized_objdir)
 {
-	struct alternate_object_database *ent;
+	struct object_directory *ent;
 	struct strbuf pathbuf = STRBUF_INIT;
 
 	if (!is_absolute_path(entry) && relative_base) {
@@ -540,9 +540,9 @@ static void read_info_alternates(struct repository *r,
 	free(path);
 }
 
-struct alternate_object_database *alloc_alt_odb(const char *dir)
+struct object_directory *alloc_alt_odb(const char *dir)
 {
-	struct alternate_object_database *ent;
+	struct object_directory *ent;
 
 	FLEX_ALLOC_STR(ent, path, dir);
 	strbuf_init(&ent->scratch, 0);
@@ -684,7 +684,7 @@ char *compute_alternate_path(const char *path, struct strbuf *err)
 
 int foreach_alt_odb(alt_odb_fn fn, void *cb)
 {
-	struct alternate_object_database *ent;
+	struct object_directory *ent;
 	int r = 0;
 
 	prepare_alt_odb(the_repository);
@@ -743,10 +743,10 @@ static int check_and_freshen_local(const struct object_id *oid, int freshen)
 
 static int check_and_freshen_nonlocal(const struct object_id *oid, int freshen)
 {
-	struct alternate_object_database *alt;
+	struct object_directory *odb;
 	prepare_alt_odb(the_repository);
-	for (alt = the_repository->objects->alt_odb_list; alt; alt = alt->next) {
-		const char *path = alt_sha1_path(alt, oid->hash);
+	for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) {
+		const char *path = alt_sha1_path(odb, oid->hash);
 		if (check_and_freshen_file(path, freshen))
 			return 1;
 	}
@@ -893,7 +893,7 @@ int git_open_cloexec(const char *name, int flags)
 static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
 			  struct stat *st, const char **path)
 {
-	struct alternate_object_database *alt;
+	struct object_directory *odb;
 	static struct strbuf buf = STRBUF_INIT;
 
 	strbuf_reset(&buf);
@@ -905,8 +905,8 @@ static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
 
 	prepare_alt_odb(r);
 	errno = ENOENT;
-	for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
-		*path = alt_sha1_path(alt, sha1);
+	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
+		*path = alt_sha1_path(odb, sha1);
 		if (!lstat(*path, st))
 			return 0;
 	}
@@ -922,7 +922,7 @@ static int open_sha1_file(struct repository *r,
 			  const unsigned char *sha1, const char **path)
 {
 	int fd;
-	struct alternate_object_database *alt;
+	struct object_directory *odb;
 	int most_interesting_errno;
 	static struct strbuf buf = STRBUF_INIT;
 
@@ -936,8 +936,8 @@ static int open_sha1_file(struct repository *r,
 	most_interesting_errno = errno;
 
 	prepare_alt_odb(r);
-	for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
-		*path = alt_sha1_path(alt, sha1);
+	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
+		*path = alt_sha1_path(odb, sha1);
 		fd = git_open(*path);
 		if (fd >= 0)
 			return fd;
@@ -2139,14 +2139,14 @@ struct loose_alt_odb_data {
 	void *data;
 };
 
-static int loose_from_alt_odb(struct alternate_object_database *alt,
+static int loose_from_alt_odb(struct object_directory *odb,
 			      void *vdata)
 {
 	struct loose_alt_odb_data *data = vdata;
 	struct strbuf buf = STRBUF_INIT;
 	int r;
 
-	strbuf_addstr(&buf, alt->path);
+	strbuf_addstr(&buf, odb->path);
 	r = for_each_loose_file_in_objdir_buf(&buf,
 					      data->cb, NULL, NULL,
 					      data->data);
diff --git a/sha1-name.c b/sha1-name.c
index faa60f69e3..2594aa79f8 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -95,8 +95,8 @@ static int match_sha(unsigned, const unsigned char *, const unsigned char *);
 static void find_short_object_filename(struct disambiguate_state *ds)
 {
 	int subdir_nr = ds->bin_pfx.hash[0];
-	struct alternate_object_database *alt;
-	static struct alternate_object_database *fakeent;
+	struct object_directory *odb;
+	static struct object_directory *fakeent;
 
 	if (!fakeent) {
 		/*
@@ -110,24 +110,24 @@ static void find_short_object_filename(struct disambiguate_state *ds)
 	}
 	fakeent->next = the_repository->objects->alt_odb_list;
 
-	for (alt = fakeent; alt && !ds->ambiguous; alt = alt->next) {
+	for (odb = fakeent; odb && !ds->ambiguous; odb = odb->next) {
 		int pos;
 
-		if (!alt->loose_objects_subdir_seen[subdir_nr]) {
-			struct strbuf *buf = alt_scratch_buf(alt);
+		if (!odb->loose_objects_subdir_seen[subdir_nr]) {
+			struct strbuf *buf = alt_scratch_buf(odb);
 			for_each_file_in_obj_subdir(subdir_nr, buf,
 						    append_loose_object,
 						    NULL, NULL,
-						    &alt->loose_objects_cache);
-			alt->loose_objects_subdir_seen[subdir_nr] = 1;
+						    &odb->loose_objects_cache);
+			odb->loose_objects_subdir_seen[subdir_nr] = 1;
 		}
 
-		pos = oid_array_lookup(&alt->loose_objects_cache, &ds->bin_pfx);
+		pos = oid_array_lookup(&odb->loose_objects_cache, &ds->bin_pfx);
 		if (pos < 0)
 			pos = -1 - pos;
-		while (!ds->ambiguous && pos < alt->loose_objects_cache.nr) {
+		while (!ds->ambiguous && pos < odb->loose_objects_cache.nr) {
 			const struct object_id *oid;
-			oid = alt->loose_objects_cache.oid + pos;
+			oid = odb->loose_objects_cache.oid + pos;
 			if (!match_sha(ds->len, ds->bin_pfx.hash, oid->hash))
 				break;
 			update_candidates(ds, oid);
diff --git a/transport.c b/transport.c
index 5a74b609ff..040e92c134 100644
--- a/transport.c
+++ b/transport.c
@@ -1433,7 +1433,7 @@ struct alternate_refs_data {
 	void *data;
 };
 
-static int refs_from_alternate_cb(struct alternate_object_database *e,
+static int refs_from_alternate_cb(struct object_directory *e,
 				  void *data)
 {
 	struct strbuf path = STRBUF_INIT;
-- 
2.19.1.1577.g2c5b293d4f



* [PATCH 4/9] sha1_file_name(): overwrite buffer instead of appending
  2018-11-12 14:46                     ` [PATCH 0/9] caching loose objects Jeff King
                                         ` (2 preceding siblings ...)
  2018-11-12 14:48                       ` [PATCH 3/9] rename "alternate_object_database" to "object_directory" Jeff King
@ 2018-11-12 14:48                       ` Jeff King
  2018-11-12 15:32                         ` Derrick Stolee
  2018-11-12 14:49                       ` [PATCH 5/9] handle alternates paths the same as the main object dir Jeff King
                                         ` (5 subsequent siblings)
  9 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-11-12 14:48 UTC (permalink / raw)
  To: Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

The sha1_file_name() function is used to generate the path to a loose
object in the object directory. It doesn't make much sense for it to
append, since the path we write may be absolute (i.e., you cannot
reliably build up a path with it). Because many callers use it with a
static buffer, they have to strbuf_reset() manually before each call
(and the other callers always use an empty buffer, so they don't care
either way). Let's handle this automatically.

Since we're changing the semantics, let's take the opportunity to give
it a more hash-neutral name (which will also catch any callers from
topics in flight).

Signed-off-by: Jeff King <peff@peff.net>
---
 http-walker.c  |  2 +-
 http.c         |  4 ++--
 object-store.h |  2 +-
 sha1-file.c    | 18 ++++++++----------
 4 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/http-walker.c b/http-walker.c
index b3334bf657..0a392c85b6 100644
--- a/http-walker.c
+++ b/http-walker.c
@@ -547,7 +547,7 @@ static int fetch_object(struct walker *walker, unsigned char *sha1)
 		ret = error("File %s has bad hash", hex);
 	} else if (req->rename < 0) {
 		struct strbuf buf = STRBUF_INIT;
-		sha1_file_name(the_repository, &buf, req->sha1);
+		loose_object_path(the_repository, &buf, req->sha1);
 		ret = error("unable to write sha1 filename %s", buf.buf);
 		strbuf_release(&buf);
 	}
diff --git a/http.c b/http.c
index 3dc8c560d6..46c2e7a275 100644
--- a/http.c
+++ b/http.c
@@ -2314,7 +2314,7 @@ struct http_object_request *new_http_object_request(const char *base_url,
 	hashcpy(freq->sha1, sha1);
 	freq->localfile = -1;
 
-	sha1_file_name(the_repository, &filename, sha1);
+	loose_object_path(the_repository, &filename, sha1);
 	strbuf_addf(&freq->tmpfile, "%s.temp", filename.buf);
 
 	strbuf_addf(&prevfile, "%s.prev", filename.buf);
@@ -2465,7 +2465,7 @@ int finish_http_object_request(struct http_object_request *freq)
 		unlink_or_warn(freq->tmpfile.buf);
 		return -1;
 	}
-	sha1_file_name(the_repository, &filename, freq->sha1);
+	loose_object_path(the_repository, &filename, freq->sha1);
 	freq->rename = finalize_object_file(freq->tmpfile.buf, filename.buf);
 	strbuf_release(&filename);
 
diff --git a/object-store.h b/object-store.h
index 122d5f75e2..fefa17e380 100644
--- a/object-store.h
+++ b/object-store.h
@@ -157,7 +157,7 @@ void raw_object_store_clear(struct raw_object_store *o);
  * Put in `buf` the name of the file in the local object database that
  * would be used to store a loose object with the specified sha1.
  */
-void sha1_file_name(struct repository *r, struct strbuf *buf, const unsigned char *sha1);
+void loose_object_path(struct repository *r, struct strbuf *buf, const unsigned char *sha1);
 
 void *map_sha1_file(struct repository *r, const unsigned char *sha1, unsigned long *size);
 
diff --git a/sha1-file.c b/sha1-file.c
index a3cc650a0a..478eac326b 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -346,8 +346,10 @@ static void fill_sha1_path(struct strbuf *buf, const unsigned char *sha1)
 	}
 }
 
-void sha1_file_name(struct repository *r, struct strbuf *buf, const unsigned char *sha1)
+void loose_object_path(struct repository *r, struct strbuf *buf,
+		       const unsigned char *sha1)
 {
+	strbuf_reset(buf);
 	strbuf_addstr(buf, r->objects->objectdir);
 	strbuf_addch(buf, '/');
 	fill_sha1_path(buf, sha1);
@@ -735,8 +737,7 @@ static int check_and_freshen_local(const struct object_id *oid, int freshen)
 {
 	static struct strbuf buf = STRBUF_INIT;
 
-	strbuf_reset(&buf);
-	sha1_file_name(the_repository, &buf, oid->hash);
+	loose_object_path(the_repository, &buf, oid->hash);
 
 	return check_and_freshen_file(buf.buf, freshen);
 }
@@ -888,7 +889,7 @@ int git_open_cloexec(const char *name, int flags)
  *
  * The "path" out-parameter will give the path of the object we found (if any).
  * Note that it may point to static storage and is only valid until another
- * call to sha1_file_name(), etc.
+ * call to loose_object_path(), etc.
  */
 static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
 			  struct stat *st, const char **path)
@@ -896,8 +897,7 @@ static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
 	struct object_directory *odb;
 	static struct strbuf buf = STRBUF_INIT;
 
-	strbuf_reset(&buf);
-	sha1_file_name(r, &buf, sha1);
+	loose_object_path(r, &buf, sha1);
 	*path = buf.buf;
 
 	if (!lstat(*path, st))
@@ -926,8 +926,7 @@ static int open_sha1_file(struct repository *r,
 	int most_interesting_errno;
 	static struct strbuf buf = STRBUF_INIT;
 
-	strbuf_reset(&buf);
-	sha1_file_name(r, &buf, sha1);
+	loose_object_path(r, &buf, sha1);
 	*path = buf.buf;
 
 	fd = git_open(*path);
@@ -1626,8 +1625,7 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
 	static struct strbuf tmp_file = STRBUF_INIT;
 	static struct strbuf filename = STRBUF_INIT;
 
-	strbuf_reset(&filename);
-	sha1_file_name(the_repository, &filename, oid->hash);
+	loose_object_path(the_repository, &filename, oid->hash);
 
 	fd = create_tmpfile(&tmp_file, filename.buf);
 	if (fd < 0) {
-- 
2.19.1.1577.g2c5b293d4f



* [PATCH 5/9] handle alternates paths the same as the main object dir
  2018-11-12 14:46                     ` [PATCH 0/9] caching loose objects Jeff King
                                         ` (3 preceding siblings ...)
  2018-11-12 14:48                       ` [PATCH 4/9] sha1_file_name(): overwrite buffer instead of appending Jeff King
@ 2018-11-12 14:49                       ` Jeff King
  2018-11-12 15:38                         ` Derrick Stolee
  2018-11-12 14:50                       ` [PATCH 6/9] sha1-file: use an object_directory for " Jeff King
                                         ` (4 subsequent siblings)
  9 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-11-12 14:49 UTC (permalink / raw)
  To: Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

When we generate loose file paths for the main object directory, the
caller provides a buffer to loose_object_path (formerly sha1_file_name).
The callers generally keep their own static buffer to avoid excessive
reallocations.

But for alternate directories, each struct carries its own scratch
buffer. This is needlessly different; let's unify them.

We could go either direction here, but this patch moves the alternates
struct over to the main directory style (rather than vice-versa).
Technically the alternates style is more efficient, as it avoids
rewriting the object directory name on each call. But this is unlikely
to matter in practice, as we avoid reallocations either way (and nobody
has ever noticed or complained that the main object directory is copying
a few extra bytes before making a much more expensive system call).

And this has the advantage that the reusable buffers are tied to
particular calls, which makes the invalidation rules simpler (for
example, the return value from stat_sha1_file() used to be invalidated
by basically any other object call, but now it is affected only by other
calls to stat_sha1_file()).

We do steal the trick from alt_sha1_path() of returning a pointer to the
filled buffer, which makes a few conversions more convenient.

Signed-off-by: Jeff King <peff@peff.net>
---
 object-store.h | 14 +-------------
 object.c       |  1 -
 sha1-file.c    | 44 ++++++++++++++++----------------------------
 sha1-name.c    |  8 ++++++--
 4 files changed, 23 insertions(+), 44 deletions(-)

diff --git a/object-store.h b/object-store.h
index fefa17e380..b2fa0d0df0 100644
--- a/object-store.h
+++ b/object-store.h
@@ -10,10 +10,6 @@
 struct object_directory {
 	struct object_directory *next;
 
-	/* see alt_scratch_buf() */
-	struct strbuf scratch;
-	size_t base_len;
-
 	/*
 	 * Used to store the results of readdir(3) calls when searching
 	 * for unique abbreviated hashes.  This cache is never
@@ -54,14 +50,6 @@ void add_to_alternates_file(const char *dir);
  */
 void add_to_alternates_memory(const char *dir);
 
-/*
- * Returns a scratch strbuf pre-filled with the alternate object directory,
- * including a trailing slash, which can be used to access paths in the
- * alternate. Always use this over direct access to alt->scratch, as it
- * cleans up any previous use of the scratch buffer.
- */
-struct strbuf *alt_scratch_buf(struct object_directory *odb);
-
 struct packed_git {
 	struct packed_git *next;
 	struct list_head mru;
@@ -157,7 +145,7 @@ void raw_object_store_clear(struct raw_object_store *o);
  * Put in `buf` the name of the file in the local object database that
  * would be used to store a loose object with the specified sha1.
  */
-void loose_object_path(struct repository *r, struct strbuf *buf, const unsigned char *sha1);
+const char *loose_object_path(struct repository *r, struct strbuf *buf, const unsigned char *sha1);
 
 void *map_sha1_file(struct repository *r, const unsigned char *sha1, unsigned long *size);
 
diff --git a/object.c b/object.c
index 6af8e908bb..dd485ac629 100644
--- a/object.c
+++ b/object.c
@@ -484,7 +484,6 @@ struct raw_object_store *raw_object_store_new(void)
 
 static void free_alt_odb(struct object_directory *odb)
 {
-	strbuf_release(&odb->scratch);
 	oid_array_clear(&odb->loose_objects_cache);
 	free(odb);
 }
diff --git a/sha1-file.c b/sha1-file.c
index 478eac326b..15db6b61a9 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -346,27 +346,20 @@ static void fill_sha1_path(struct strbuf *buf, const unsigned char *sha1)
 	}
 }
 
-void loose_object_path(struct repository *r, struct strbuf *buf,
-		       const unsigned char *sha1)
+static const char *odb_loose_path(const char *path, struct strbuf *buf,
+				  const unsigned char *sha1)
 {
 	strbuf_reset(buf);
-	strbuf_addstr(buf, r->objects->objectdir);
+	strbuf_addstr(buf, path);
 	strbuf_addch(buf, '/');
 	fill_sha1_path(buf, sha1);
+	return buf->buf;
 }
 
-struct strbuf *alt_scratch_buf(struct object_directory *odb)
+const char *loose_object_path(struct repository *r, struct strbuf *buf,
+			      const unsigned char *sha1)
 {
-	strbuf_setlen(&odb->scratch, odb->base_len);
-	return &odb->scratch;
-}
-
-static const char *alt_sha1_path(struct object_directory *odb,
-				 const unsigned char *sha1)
-{
-	struct strbuf *buf = alt_scratch_buf(odb);
-	fill_sha1_path(buf, sha1);
-	return buf->buf;
+	return odb_loose_path(r->objects->objectdir, buf, sha1);
 }
 
 /*
@@ -547,9 +540,6 @@ struct object_directory *alloc_alt_odb(const char *dir)
 	struct object_directory *ent;
 
 	FLEX_ALLOC_STR(ent, path, dir);
-	strbuf_init(&ent->scratch, 0);
-	strbuf_addf(&ent->scratch, "%s/", dir);
-	ent->base_len = ent->scratch.len;
 
 	return ent;
 }
@@ -745,10 +735,12 @@ static int check_and_freshen_local(const struct object_id *oid, int freshen)
 static int check_and_freshen_nonlocal(const struct object_id *oid, int freshen)
 {
 	struct object_directory *odb;
+	static struct strbuf path = STRBUF_INIT;
+
 	prepare_alt_odb(the_repository);
 	for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) {
-		const char *path = alt_sha1_path(odb, oid->hash);
-		if (check_and_freshen_file(path, freshen))
+		odb_loose_path(odb->path, &path, oid->hash);
+		if (check_and_freshen_file(path.buf, freshen))
 			return 1;
 	}
 	return 0;
@@ -889,7 +881,7 @@ int git_open_cloexec(const char *name, int flags)
  *
  * The "path" out-parameter will give the path of the object we found (if any).
  * Note that it may point to static storage and is only valid until another
- * call to loose_object_path(), etc.
+ * call to stat_sha1_file().
  */
 static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
 			  struct stat *st, const char **path)
@@ -897,16 +889,14 @@ static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
 	struct object_directory *odb;
 	static struct strbuf buf = STRBUF_INIT;
 
-	loose_object_path(r, &buf, sha1);
-	*path = buf.buf;
-
+	*path = loose_object_path(r, &buf, sha1);
 	if (!lstat(*path, st))
 		return 0;
 
 	prepare_alt_odb(r);
 	errno = ENOENT;
 	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
-		*path = alt_sha1_path(odb, sha1);
+		*path = odb_loose_path(odb->path, &buf, sha1);
 		if (!lstat(*path, st))
 			return 0;
 	}
@@ -926,9 +916,7 @@ static int open_sha1_file(struct repository *r,
 	int most_interesting_errno;
 	static struct strbuf buf = STRBUF_INIT;
 
-	loose_object_path(r, &buf, sha1);
-	*path = buf.buf;
-
+	*path = loose_object_path(r, &buf, sha1);
 	fd = git_open(*path);
 	if (fd >= 0)
 		return fd;
@@ -936,7 +924,7 @@ static int open_sha1_file(struct repository *r,
 
 	prepare_alt_odb(r);
 	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
-		*path = alt_sha1_path(odb, sha1);
+		*path = odb_loose_path(odb->path, &buf, sha1);
 		fd = git_open(*path);
 		if (fd >= 0)
 			return fd;
diff --git a/sha1-name.c b/sha1-name.c
index 2594aa79f8..96a8e71482 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -97,6 +97,7 @@ static void find_short_object_filename(struct disambiguate_state *ds)
 	int subdir_nr = ds->bin_pfx.hash[0];
 	struct object_directory *odb;
 	static struct object_directory *fakeent;
+	struct strbuf buf = STRBUF_INIT;
 
 	if (!fakeent) {
 		/*
@@ -114,8 +115,9 @@ static void find_short_object_filename(struct disambiguate_state *ds)
 		int pos;
 
 		if (!odb->loose_objects_subdir_seen[subdir_nr]) {
-			struct strbuf *buf = alt_scratch_buf(odb);
-			for_each_file_in_obj_subdir(subdir_nr, buf,
+			strbuf_reset(&buf);
+			strbuf_addstr(&buf, odb->path);
+			for_each_file_in_obj_subdir(subdir_nr, &buf,
 						    append_loose_object,
 						    NULL, NULL,
 						    &odb->loose_objects_cache);
@@ -134,6 +136,8 @@ static void find_short_object_filename(struct disambiguate_state *ds)
 			pos++;
 		}
 	}
+
+	strbuf_release(&buf);
 }
 
 static int match_sha(unsigned len, const unsigned char *a, const unsigned char *b)
-- 
2.19.1.1577.g2c5b293d4f



* [PATCH 6/9] sha1-file: use an object_directory for the main object dir
  2018-11-12 14:46                     ` [PATCH 0/9] caching loose objects Jeff King
                                         ` (4 preceding siblings ...)
  2018-11-12 14:49                       ` [PATCH 5/9] handle alternates paths the same as the main object dir Jeff King
@ 2018-11-12 14:50                       ` " Jeff King
  2018-11-12 15:48                         ` Derrick Stolee
  2018-11-12 14:50                       ` [PATCH 7/9] object-store: provide helpers for loose_objects_cache Jeff King
                                         ` (3 subsequent siblings)
  9 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-11-12 14:50 UTC (permalink / raw)
  To: Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

Our handling of alternate object directories is needlessly different
from the main object directory. As a result, many places in the code
basically look like this:

  do_something(r->objects->objectdir);

  for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
        do_something(odb->path);

That gets annoying when do_something() is non-trivial, and we've
resorted to gross hacks like creating fake alternates (see
find_short_object_filename()).

Instead, let's give each raw_object_store a unified list of
object_directory structs. The first will be the main store, and
everything after is an alternate. Very few callers even care about the
distinction, and can just loop over the whole list (and those who care
can just treat the first element differently).

A few observations:

  - we don't need r->objects->objectdir anymore, and can just
    mechanically convert that to r->objects->odb->path

  - object_directory's path field needs to become a real pointer rather
    than a FLEX_ARRAY, in order to fill it with expand_base_dir()

  - we'll call prepare_alt_odb() earlier in many functions (i.e.,
    outside of the loop). This may result in us calling it even when our
    function would be satisfied looking only at the main odb.

    But this doesn't matter in practice. It's not a very expensive
    operation in the first place, and in the majority of cases it will
    be a noop. We call it already (and cache its results) in
    prepare_packed_git(), and we'll generally check packs before loose
    objects. So essentially every program is going to call it once,
    almost immediately.

    Arguably we should just prepare_alt_odb() immediately upon setting
    up the repository's object directory, which would save us sprinkling
    calls throughout the code base (and forgetting to do so has been a
    source of subtle bugs in the past). But I've stopped short of that
    here, since there are already a lot of other moving parts in this
    patch.

  - Most call sites just get shorter. The check_and_freshen() functions
    are an exception, because they have entry points to handle local and
    nonlocal directories separately.

Signed-off-by: Jeff King <peff@peff.net>
---
If the "the first one is the main store, the rest are alternates" bit is
too subtle, we could mark each "struct object_directory" with a bit for
"is_local".

 builtin/fsck.c |  21 ++-------
 builtin/grep.c |   2 +-
 commit-graph.c |   5 +-
 environment.c  |   4 +-
 object-store.h |  27 ++++++-----
 object.c       |  19 ++++----
 packfile.c     |  10 ++--
 path.c         |   2 +-
 repository.c   |   8 +++-
 sha1-file.c    | 122 ++++++++++++++++++-------------------------------
 sha1-name.c    |  17 ++-----
 11 files changed, 90 insertions(+), 147 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 55153cf92a..15338bd178 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -725,13 +725,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 		for_each_loose_object(mark_loose_for_connectivity, NULL, 0);
 		for_each_packed_object(mark_packed_for_connectivity, NULL, 0);
 	} else {
-		struct object_directory *alt_odb_list;
-
-		fsck_object_dir(get_object_directory());
-
 		prepare_alt_odb(the_repository);
-		alt_odb_list = the_repository->objects->alt_odb_list;
-		for (odb = alt_odb_list; odb; odb = odb->next)
+		for (odb = the_repository->objects->odb; odb; odb = odb->next)
 			fsck_object_dir(odb->path);
 
 		if (check_full) {
@@ -834,13 +829,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 		struct child_process commit_graph_verify = CHILD_PROCESS_INIT;
 		const char *verify_argv[] = { "commit-graph", "verify", NULL, NULL, NULL };
 
-		commit_graph_verify.argv = verify_argv;
-		commit_graph_verify.git_cmd = 1;
-		if (run_command(&commit_graph_verify))
-			errors_found |= ERROR_COMMIT_GRAPH;
-
 		prepare_alt_odb(the_repository);
-		for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) {
+		for (odb = the_repository->objects->odb; odb; odb = odb->next) {
 			child_process_init(&commit_graph_verify);
 			commit_graph_verify.argv = verify_argv;
 			commit_graph_verify.git_cmd = 1;
@@ -855,13 +845,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 		struct child_process midx_verify = CHILD_PROCESS_INIT;
 		const char *midx_argv[] = { "multi-pack-index", "verify", NULL, NULL, NULL };
 
-		midx_verify.argv = midx_argv;
-		midx_verify.git_cmd = 1;
-		if (run_command(&midx_verify))
-			errors_found |= ERROR_COMMIT_GRAPH;
-
 		prepare_alt_odb(the_repository);
-		for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) {
+		for (odb = the_repository->objects->odb; odb; odb = odb->next) {
 			child_process_init(&midx_verify);
 			midx_verify.argv = midx_argv;
 			midx_verify.git_cmd = 1;
diff --git a/builtin/grep.c b/builtin/grep.c
index d8508ddf79..714c8d91ba 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -441,7 +441,7 @@ static int grep_submodule(struct grep_opt *opt, struct repository *superproject,
 	 * object.
 	 */
 	grep_read_lock();
-	add_to_alternates_memory(submodule.objects->objectdir);
+	add_to_alternates_memory(submodule.objects->odb->path);
 	grep_read_unlock();
 
 	if (oid) {
diff --git a/commit-graph.c b/commit-graph.c
index 5dd3f5b15c..99163c244b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -231,7 +231,6 @@ static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
 static int prepare_commit_graph(struct repository *r)
 {
 	struct object_directory *odb;
-	char *obj_dir;
 	int config_value;
 
 	if (r->objects->commit_graph_attempted)
@@ -252,10 +251,8 @@ static int prepare_commit_graph(struct repository *r)
 	if (!commit_graph_compatible(r))
 		return 0;
 
-	obj_dir = r->objects->objectdir;
-	prepare_commit_graph_one(r, obj_dir);
 	prepare_alt_odb(r);
-	for (odb = r->objects->alt_odb_list;
+	for (odb = r->objects->odb;
 	     !r->objects->commit_graph && odb;
 	     odb = odb->next)
 		prepare_commit_graph_one(r, odb->path);
diff --git a/environment.c b/environment.c
index 3f3c8746c2..441ce56690 100644
--- a/environment.c
+++ b/environment.c
@@ -274,9 +274,9 @@ const char *get_git_work_tree(void)
 
 char *get_object_directory(void)
 {
-	if (!the_repository->objects->objectdir)
+	if (!the_repository->objects->odb)
 		BUG("git environment hasn't been setup");
-	return the_repository->objects->objectdir;
+	return the_repository->objects->odb->path;
 }
 
 int odb_mkstemp(struct strbuf *temp_filename, const char *pattern)
diff --git a/object-store.h b/object-store.h
index b2fa0d0df0..30faf7b391 100644
--- a/object-store.h
+++ b/object-store.h
@@ -24,19 +24,14 @@ struct object_directory {
 	 * Path to the alternative object store. If this is a relative path,
 	 * it is relative to the current working directory.
 	 */
-	char path[FLEX_ARRAY];
+	char *path;
 };
+
 void prepare_alt_odb(struct repository *r);
 char *compute_alternate_path(const char *path, struct strbuf *err);
 typedef int alt_odb_fn(struct object_directory *, void *);
 int foreach_alt_odb(alt_odb_fn, void*);
 
-/*
- * Allocate a "struct alternate_object_database" but do _not_ actually
- * add it to the list of alternates.
- */
-struct object_directory *alloc_alt_odb(const char *dir);
-
 /*
  * Add the directory to the on-disk alternates file; the new entry will also
  * take effect in the current process.
@@ -80,17 +75,21 @@ struct multi_pack_index;
 
 struct raw_object_store {
 	/*
-	 * Path to the repository's object store.
-	 * Cannot be NULL after initialization.
+	 * Set of all object directories; the main directory is first (and
+	 * cannot be NULL after initialization). Subsequent directories are
+	 * alternates.
 	 */
-	char *objectdir;
+	struct object_directory *odb;
+	struct object_directory **odb_tail;
+	int loaded_alternates;
 
-	/* Path to extra alternate object database if not NULL */
+	/*
+	 * A list of alternate object directories loaded from the environment;
+	 * this should not generally need to be accessed directly, but will
+	 * populate the "odb" list when prepare_alt_odb() is run.
+	 */
 	char *alternate_db;
 
-	struct object_directory *alt_odb_list;
-	struct object_directory **alt_odb_tail;
-
 	/*
 	 * Objects that should be substituted by other objects
 	 * (see git-replace(1)).
diff --git a/object.c b/object.c
index dd485ac629..79d636091c 100644
--- a/object.c
+++ b/object.c
@@ -482,26 +482,26 @@ struct raw_object_store *raw_object_store_new(void)
 	return o;
 }
 
-static void free_alt_odb(struct object_directory *odb)
+static void free_object_directory(struct object_directory *odb)
 {
+	free(odb->path);
 	oid_array_clear(&odb->loose_objects_cache);
 	free(odb);
 }
 
-static void free_alt_odbs(struct raw_object_store *o)
+static void free_object_directories(struct raw_object_store *o)
 {
-	while (o->alt_odb_list) {
+	while (o->odb) {
 		struct object_directory *next;
 
-		next = o->alt_odb_list->next;
-		free_alt_odb(o->alt_odb_list);
-		o->alt_odb_list = next;
+		next = o->odb->next;
+		free_object_directory(o->odb);
+		o->odb = next;
 	}
 }
 
 void raw_object_store_clear(struct raw_object_store *o)
 {
-	FREE_AND_NULL(o->objectdir);
 	FREE_AND_NULL(o->alternate_db);
 
 	oidmap_free(o->replace_map, 1);
@@ -511,8 +511,9 @@ void raw_object_store_clear(struct raw_object_store *o)
 	o->commit_graph = NULL;
 	o->commit_graph_attempted = 0;
 
-	free_alt_odbs(o);
-	o->alt_odb_tail = NULL;
+	free_object_directories(o);
+	o->odb_tail = NULL;
+	o->loaded_alternates = 0;
 
 	INIT_LIST_HEAD(&o->packed_git_mru);
 	close_all_packs(o);
diff --git a/packfile.c b/packfile.c
index d6d511cfd2..1eda33247f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -970,12 +970,12 @@ static void prepare_packed_git(struct repository *r)
 
 	if (r->objects->packed_git_initialized)
 		return;
-	prepare_multi_pack_index_one(r, r->objects->objectdir, 1);
-	prepare_packed_git_one(r, r->objects->objectdir, 1);
+
 	prepare_alt_odb(r);
-	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
-		prepare_multi_pack_index_one(r, odb->path, 0);
-		prepare_packed_git_one(r, odb->path, 0);
+	for (odb = r->objects->odb; odb; odb = odb->next) {
+		int local = (odb == r->objects->odb);
+		prepare_multi_pack_index_one(r, odb->path, local);
+		prepare_packed_git_one(r, odb->path, local);
 	}
 	rearrange_packed_git(r);
 
diff --git a/path.c b/path.c
index ba06ec5b2d..e8609cf56d 100644
--- a/path.c
+++ b/path.c
@@ -383,7 +383,7 @@ static void adjust_git_path(const struct repository *repo,
 		strbuf_splice(buf, 0, buf->len,
 			      repo->index_file, strlen(repo->index_file));
 	else if (dir_prefix(base, "objects"))
-		replace_dir(buf, git_dir_len + 7, repo->objects->objectdir);
+		replace_dir(buf, git_dir_len + 7, repo->objects->odb->path);
 	else if (git_hooks_path && dir_prefix(base, "hooks"))
 		replace_dir(buf, git_dir_len + 5, git_hooks_path);
 	else if (repo->different_commondir)
diff --git a/repository.c b/repository.c
index 5dd1486718..7b02e1dffa 100644
--- a/repository.c
+++ b/repository.c
@@ -63,8 +63,14 @@ void repo_set_gitdir(struct repository *repo,
 	free(old_gitdir);
 
 	repo_set_commondir(repo, o->commondir);
-	expand_base_dir(&repo->objects->objectdir, o->object_dir,
+
+	if (!repo->objects->odb) {
+		repo->objects->odb = xcalloc(1, sizeof(*repo->objects->odb));
+		repo->objects->odb_tail = &repo->objects->odb->next;
+	}
+	expand_base_dir(&repo->objects->odb->path, o->object_dir,
 			repo->commondir, "objects");
+
 	free(repo->objects->alternate_db);
 	repo->objects->alternate_db = xstrdup_or_null(o->alternate_db);
 	expand_base_dir(&repo->graft_file, o->graft_file,
diff --git a/sha1-file.c b/sha1-file.c
index 15db6b61a9..503262edd2 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -346,11 +346,12 @@ static void fill_sha1_path(struct strbuf *buf, const unsigned char *sha1)
 	}
 }
 
-static const char *odb_loose_path(const char *path, struct strbuf *buf,
+static const char *odb_loose_path(struct object_directory *odb,
+				  struct strbuf *buf,
 				  const unsigned char *sha1)
 {
 	strbuf_reset(buf);
-	strbuf_addstr(buf, path);
+	strbuf_addstr(buf, odb->path);
 	strbuf_addch(buf, '/');
 	fill_sha1_path(buf, sha1);
 	return buf->buf;
@@ -359,7 +360,7 @@ static const char *odb_loose_path(const char *path, struct strbuf *buf,
 const char *loose_object_path(struct repository *r, struct strbuf *buf,
 			      const unsigned char *sha1)
 {
-	return odb_loose_path(r->objects->objectdir, buf, sha1);
+	return odb_loose_path(r->objects->odb, buf, sha1);
 }
 
 /*
@@ -383,7 +384,7 @@ static int alt_odb_usable(struct raw_object_store *o,
 	 * Prevent the common mistake of listing the same
 	 * thing twice, or object directory itself.
 	 */
-	for (odb = o->alt_odb_list; odb; odb = odb->next) {
+	for (odb = o->odb; odb; odb = odb->next) {
 		if (!fspathcmp(path->buf, odb->path))
 			return 0;
 	}
@@ -442,11 +443,12 @@ static int link_alt_odb_entry(struct repository *r, const char *entry,
 		return -1;
 	}
 
-	ent = alloc_alt_odb(pathbuf.buf);
+	ent = xcalloc(1, sizeof(*ent));
+	ent->path = xstrdup(pathbuf.buf);
 
 	/* add the alternate entry */
-	*r->objects->alt_odb_tail = ent;
-	r->objects->alt_odb_tail = &(ent->next);
+	*r->objects->odb_tail = ent;
+	r->objects->odb_tail = &(ent->next);
 	ent->next = NULL;
 
 	/* recursively add alternates */
@@ -500,7 +502,7 @@ static void link_alt_odb_entries(struct repository *r, const char *alt,
 		return;
 	}
 
-	strbuf_add_absolute_path(&objdirbuf, r->objects->objectdir);
+	strbuf_add_absolute_path(&objdirbuf, r->objects->odb->path);
 	if (strbuf_normalize_path(&objdirbuf) < 0)
 		die(_("unable to normalize object directory: %s"),
 		    objdirbuf.buf);
@@ -535,15 +537,6 @@ static void read_info_alternates(struct repository *r,
 	free(path);
 }
 
-struct object_directory *alloc_alt_odb(const char *dir)
-{
-	struct object_directory *ent;
-
-	FLEX_ALLOC_STR(ent, path, dir);
-
-	return ent;
-}
-
 void add_to_alternates_file(const char *reference)
 {
 	struct lock_file lock = LOCK_INIT;
@@ -580,7 +573,7 @@ void add_to_alternates_file(const char *reference)
 		fprintf_or_die(out, "%s\n", reference);
 		if (commit_lock_file(&lock))
 			die_errno(_("unable to move new alternates file into place"));
-		if (the_repository->objects->alt_odb_tail)
+		if (the_repository->objects->loaded_alternates)
 			link_alt_odb_entries(the_repository, reference,
 					     '\n', NULL, 0);
 	}
@@ -680,7 +673,7 @@ int foreach_alt_odb(alt_odb_fn fn, void *cb)
 	int r = 0;
 
 	prepare_alt_odb(the_repository);
-	for (ent = the_repository->objects->alt_odb_list; ent; ent = ent->next) {
+	for (ent = the_repository->objects->odb->next; ent; ent = ent->next) {
 		r = fn(ent, cb);
 		if (r)
 			break;
@@ -690,13 +683,13 @@ int foreach_alt_odb(alt_odb_fn fn, void *cb)
 
 void prepare_alt_odb(struct repository *r)
 {
-	if (r->objects->alt_odb_tail)
+	if (r->objects->loaded_alternates)
 		return;
 
-	r->objects->alt_odb_tail = &r->objects->alt_odb_list;
 	link_alt_odb_entries(r, r->objects->alternate_db, PATH_SEP, NULL, 0);
 
-	read_info_alternates(r, r->objects->objectdir, 0);
+	read_info_alternates(r, r->objects->odb->path, 0);
+	r->objects->loaded_alternates = 1;
 }
 
 /* Returns 1 if we have successfully freshened the file, 0 otherwise. */
@@ -723,24 +716,27 @@ int check_and_freshen_file(const char *fn, int freshen)
 	return 1;
 }
 
-static int check_and_freshen_local(const struct object_id *oid, int freshen)
+static int check_and_freshen_odb(struct object_directory *odb,
+				 const struct object_id *oid,
+				 int freshen)
 {
-	static struct strbuf buf = STRBUF_INIT;
-
-	loose_object_path(the_repository, &buf, oid->hash);
+	static struct strbuf path = STRBUF_INIT;
+	odb_loose_path(odb, &path, oid->hash);
+	return check_and_freshen_file(path.buf, freshen);
+}
 
-	return check_and_freshen_file(buf.buf, freshen);
+static int check_and_freshen_local(const struct object_id *oid, int freshen)
+{
+	return check_and_freshen_odb(the_repository->objects->odb, oid, freshen);
 }
 
 static int check_and_freshen_nonlocal(const struct object_id *oid, int freshen)
 {
 	struct object_directory *odb;
-	static struct strbuf path = STRBUF_INIT;
 
 	prepare_alt_odb(the_repository);
-	for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) {
-		odb_loose_path(odb->path, &path, oid->hash);
-		if (check_and_freshen_file(path.buf, freshen))
+	for (odb = the_repository->objects->odb->next; odb; odb = odb->next) {
+		if (check_and_freshen_odb(odb, oid, freshen))
 			return 1;
 	}
 	return 0;
@@ -889,14 +885,9 @@ static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
 	struct object_directory *odb;
 	static struct strbuf buf = STRBUF_INIT;
 
-	*path = loose_object_path(r, &buf, sha1);
-	if (!lstat(*path, st))
-		return 0;
-
 	prepare_alt_odb(r);
-	errno = ENOENT;
-	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
-		*path = odb_loose_path(odb->path, &buf, sha1);
+	for (odb = r->objects->odb; odb; odb = odb->next) {
+		*path = odb_loose_path(odb, &buf, sha1);
 		if (!lstat(*path, st))
 			return 0;
 	}
@@ -913,21 +904,16 @@ static int open_sha1_file(struct repository *r,
 {
 	int fd;
 	struct object_directory *odb;
-	int most_interesting_errno;
+	int most_interesting_errno = ENOENT;
 	static struct strbuf buf = STRBUF_INIT;
 
-	*path = loose_object_path(r, &buf, sha1);
-	fd = git_open(*path);
-	if (fd >= 0)
-		return fd;
-	most_interesting_errno = errno;
-
 	prepare_alt_odb(r);
-	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
-		*path = odb_loose_path(odb->path, &buf, sha1);
+	for (odb = r->objects->odb; odb; odb = odb->next) {
+		*path = odb_loose_path(odb, &buf, sha1);
 		fd = git_open(*path);
 		if (fd >= 0)
 			return fd;
+
 		if (most_interesting_errno == ENOENT)
 			most_interesting_errno = errno;
 	}
@@ -2120,43 +2106,23 @@ int for_each_loose_file_in_objdir(const char *path,
 	return r;
 }
 
-struct loose_alt_odb_data {
-	each_loose_object_fn *cb;
-	void *data;
-};
-
-static int loose_from_alt_odb(struct object_directory *odb,
-			      void *vdata)
-{
-	struct loose_alt_odb_data *data = vdata;
-	struct strbuf buf = STRBUF_INIT;
-	int r;
-
-	strbuf_addstr(&buf, odb->path);
-	r = for_each_loose_file_in_objdir_buf(&buf,
-					      data->cb, NULL, NULL,
-					      data->data);
-	strbuf_release(&buf);
-	return r;
-}
-
 int for_each_loose_object(each_loose_object_fn cb, void *data,
 			  enum for_each_object_flags flags)
 {
-	struct loose_alt_odb_data alt;
-	int r;
+	struct object_directory *odb;
 
-	r = for_each_loose_file_in_objdir(get_object_directory(),
-					  cb, NULL, NULL, data);
-	if (r)
-		return r;
+	prepare_alt_odb(the_repository);
+	for (odb = the_repository->objects->odb; odb; odb = odb->next) {
+		int r = for_each_loose_file_in_objdir(odb->path, cb, NULL,
+						      NULL, data);
+		if (r)
+			return r;
 
-	if (flags & FOR_EACH_OBJECT_LOCAL_ONLY)
-		return 0;
+		if (flags & FOR_EACH_OBJECT_LOCAL_ONLY)
+			break;
+	}
 
-	alt.cb = cb;
-	alt.data = data;
-	return foreach_alt_odb(loose_from_alt_odb, &alt);
+	return 0;
 }
 
 static int check_stream_sha1(git_zstream *stream,
diff --git a/sha1-name.c b/sha1-name.c
index 96a8e71482..358ca5e288 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -96,22 +96,11 @@ static void find_short_object_filename(struct disambiguate_state *ds)
 {
 	int subdir_nr = ds->bin_pfx.hash[0];
 	struct object_directory *odb;
-	static struct object_directory *fakeent;
 	struct strbuf buf = STRBUF_INIT;
 
-	if (!fakeent) {
-		/*
-		 * Create a "fake" alternate object database that
-		 * points to our own object database, to make it
-		 * easier to get a temporary working space in
-		 * alt->name/alt->base while iterating over the
-		 * object databases including our own.
-		 */
-		fakeent = alloc_alt_odb(get_object_directory());
-	}
-	fakeent->next = the_repository->objects->alt_odb_list;
-
-	for (odb = fakeent; odb && !ds->ambiguous; odb = odb->next) {
+	for (odb = the_repository->objects->odb;
+	     odb && !ds->ambiguous;
+	     odb = odb->next) {
 		int pos;
 
 		if (!odb->loose_objects_subdir_seen[subdir_nr]) {
-- 
2.19.1.1577.g2c5b293d4f
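
The list layout this diff introduces (main object directory at the head,
alternates appended through a tail pointer) can be sketched in isolation.
This is a minimal model, not git's code: struct objdir, struct store and
every store_*() helper below are hypothetical names chosen for
illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for git's structs. */
struct objdir {
	struct objdir *next;
	char *path;
};

struct store {
	struct objdir *odb;       /* head: the main object directory */
	struct objdir **odb_tail; /* where the next alternate gets linked */
};

static char *dup_string(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);
	if (!copy)
		abort();
	memcpy(copy, s, len);
	return copy;
}

/* Install (or replace the path of) the main directory at the head. */
static void store_set_main(struct store *s, const char *path)
{
	if (!s->odb) {
		s->odb = calloc(1, sizeof(*s->odb));
		if (!s->odb)
			abort();
		s->odb_tail = &s->odb->next;
	}
	free(s->odb->path);
	s->odb->path = dup_string(path);
}

/* Append an alternate; store_set_main() must have run first. */
static void store_add_alternate(struct store *s, const char *path)
{
	struct objdir *ent = calloc(1, sizeof(*ent));
	if (!ent)
		abort();
	assert(s->odb_tail);
	ent->path = dup_string(path);
	*s->odb_tail = ent;
	s->odb_tail = &ent->next;
}

/* Count directories; with local_only set, stop after the head entry. */
static int store_count(const struct store *s, int local_only)
{
	const struct objdir *d;
	int n = 0;

	for (d = s->odb; d; d = d->next) {
		n++;
		if (local_only)
			break;
	}
	return n;
}

/* Free every directory, main one included. */
static void store_clear(struct store *s)
{
	while (s->odb) {
		struct objdir *next = s->odb->next;
		free(s->odb->path);
		free(s->odb);
		s->odb = next;
	}
	s->odb_tail = NULL;
}
```

With one list, a caller that wants everything starts at the head, and a
caller that wants only alternates starts at the head's next pointer, so
no separate "local" code path is needed.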



* [PATCH 7/9] object-store: provide helpers for loose_objects_cache
  2018-11-12 14:46                     ` [PATCH 0/9] caching loose objects Jeff King
                                         ` (5 preceding siblings ...)
  2018-11-12 14:50                       ` [PATCH 6/9] sha1-file: use an object_directory for " Jeff King
@ 2018-11-12 14:50                       ` Jeff King
  2018-11-12 19:24                         ` René Scharfe
  2018-11-12 14:54                       ` [PATCH 8/9] sha1-file: use loose object cache for quick existence check Jeff King
                                         ` (2 subsequent siblings)
  9 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-11-12 14:50 UTC (permalink / raw)
  To: Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

Our object_directory struct has a loose objects cache that all users of
the struct can see. But the only one that knows how to load the cache is
find_short_object_filename(). Let's extract that logic into a reusable
function.

While we're at it, let's also reset the cache when we re-read the object
directories. This shouldn't have an impact on performance, as re-reads
are meant to be rare (and are already expensive, so we avoid them with
things like OBJECT_INFO_QUICK).

Since the cache is already meant to be an approximation, it's tempting
to skip even this bit of safety. But it's necessary to allow more code
to use it. For instance, fetch-pack explicitly re-reads the object
directory after performing its fetch, and would be confused if we didn't
clear the cache.
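
The load-once-then-reset behavior described above can be sketched as
follows. This is a simplified model under stated assumptions, not the
patch itself: struct loose_cache, cache_load(), cache_reset() and the
scan callback are hypothetical, and the callback stands in for the
readdir() scan of one object subdirectory.

```c
#include <assert.h>
#include <string.h>

#define NR_SUBDIRS 256

/* Hypothetical callback standing in for a readdir() scan of one subdir. */
typedef void scan_fn(int subdir_nr, void *data);

struct loose_cache {
	char subdir_seen[NR_SUBDIRS]; /* 1 once a subdirectory was scanned */
};

/* Scan a subdirectory at most once until the cache is reset. */
static void cache_load(struct loose_cache *c, int subdir_nr,
		       scan_fn *scan, void *data)
{
	assert(subdir_nr >= 0 && subdir_nr < NR_SUBDIRS);
	if (c->subdir_seen[subdir_nr])
		return;
	scan(subdir_nr, data);
	c->subdir_seen[subdir_nr] = 1;
}

/* Forget which subdirectories were scanned, as a re-read would. */
static void cache_reset(struct loose_cache *c)
{
	memset(c->subdir_seen, 0, sizeof(c->subdir_seen));
}

/* Demonstration scan: only counts how often it was invoked. */
static void count_scan(int subdir_nr, void *data)
{
	(void)subdir_nr;
	++*(int *)data;
}
```

Calling cache_load() twice for the same subdirectory runs the scan once;
after cache_reset() the next call scans again, which is the property a
caller that re-reads the object directories relies on.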

Signed-off-by: Jeff King <peff@peff.net>
---
 object-store.h | 18 +++++++++++++-----
 packfile.c     |  8 ++++++++
 sha1-file.c    | 26 ++++++++++++++++++++++++++
 sha1-name.c    | 21 +--------------------
 4 files changed, 48 insertions(+), 25 deletions(-)

diff --git a/object-store.h b/object-store.h
index 30faf7b391..bf1e0cb761 100644
--- a/object-store.h
+++ b/object-store.h
@@ -11,11 +11,12 @@ struct object_directory {
 	struct object_directory *next;
 
 	/*
-	 * Used to store the results of readdir(3) calls when searching
-	 * for unique abbreviated hashes.  This cache is never
-	 * invalidated, thus it's racy and not necessarily accurate.
-	 * That's fine for its purpose; don't use it for tasks requiring
-	 * greater accuracy!
+	 * Used to store the results of readdir(3) calls when we are OK
+	 * sacrificing accuracy due to races for speed. That includes
+	 * our search for unique abbreviated hashes. Don't use it for tasks
+	 * requiring greater accuracy!
+	 *
+	 * Be sure to call odb_load_loose_cache() before using.
 	 */
 	char loose_objects_subdir_seen[256];
 	struct oid_array loose_objects_cache;
@@ -45,6 +46,13 @@ void add_to_alternates_file(const char *dir);
  */
 void add_to_alternates_memory(const char *dir);
 
+/*
+ * Populate an odb's loose object cache for one particular subdirectory (i.e.,
+ * the one that corresponds to the first byte of objects you're interested in,
+ * from 0 to 255 inclusive).
+ */
+void odb_load_loose_cache(struct object_directory *odb, int subdir_nr);
+
 struct packed_git {
 	struct packed_git *next;
 	struct list_head mru;
diff --git a/packfile.c b/packfile.c
index 1eda33247f..91fd40efb0 100644
--- a/packfile.c
+++ b/packfile.c
@@ -987,6 +987,14 @@ static void prepare_packed_git(struct repository *r)
 
 void reprepare_packed_git(struct repository *r)
 {
+	struct object_directory *odb;
+
+	for (odb = r->objects->odb; odb; odb = odb->next) {
+		oid_array_clear(&odb->loose_objects_cache);
+		memset(&odb->loose_objects_subdir_seen, 0,
+		       sizeof(odb->loose_objects_subdir_seen));
+	}
+
 	r->objects->approximate_object_count_valid = 0;
 	r->objects->packed_git_initialized = 0;
 	prepare_packed_git(r);
diff --git a/sha1-file.c b/sha1-file.c
index 503262edd2..4aae716a37 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -2125,6 +2125,32 @@ int for_each_loose_object(each_loose_object_fn cb, void *data,
 	return 0;
 }
 
+static int append_loose_object(const struct object_id *oid, const char *path,
+			       void *data)
+{
+	oid_array_append(data, oid);
+	return 0;
+}
+
+void odb_load_loose_cache(struct object_directory *odb, int subdir_nr)
+{
+	struct strbuf buf = STRBUF_INIT;
+
+	if (subdir_nr < 0 ||
+	    subdir_nr >= ARRAY_SIZE(odb->loose_objects_subdir_seen))
+		BUG("subdir_nr out of range");
+
+	if (odb->loose_objects_subdir_seen[subdir_nr])
+		return;
+
+	strbuf_addstr(&buf, odb->path);
+	for_each_file_in_obj_subdir(subdir_nr, &buf,
+				    append_loose_object,
+				    NULL, NULL,
+				    &odb->loose_objects_cache);
+	odb->loose_objects_subdir_seen[subdir_nr] = 1;
+}
+
 static int check_stream_sha1(git_zstream *stream,
 			     const char *hdr,
 			     unsigned long size,
diff --git a/sha1-name.c b/sha1-name.c
index 358ca5e288..b24502811b 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -83,36 +83,19 @@ static void update_candidates(struct disambiguate_state *ds, const struct object
 	/* otherwise, current can be discarded and candidate is still good */
 }
 
-static int append_loose_object(const struct object_id *oid, const char *path,
-			       void *data)
-{
-	oid_array_append(data, oid);
-	return 0;
-}
-
 static int match_sha(unsigned, const unsigned char *, const unsigned char *);
 
 static void find_short_object_filename(struct disambiguate_state *ds)
 {
 	int subdir_nr = ds->bin_pfx.hash[0];
 	struct object_directory *odb;
-	struct strbuf buf = STRBUF_INIT;
 
 	for (odb = the_repository->objects->odb;
 	     odb && !ds->ambiguous;
 	     odb = odb->next) {
 		int pos;
 
-		if (!odb->loose_objects_subdir_seen[subdir_nr]) {
-			strbuf_reset(&buf);
-			strbuf_addstr(&buf, odb->path);
-			for_each_file_in_obj_subdir(subdir_nr, &buf,
-						    append_loose_object,
-						    NULL, NULL,
-						    &odb->loose_objects_cache);
-			odb->loose_objects_subdir_seen[subdir_nr] = 1;
-		}
-
+		odb_load_loose_cache(odb, subdir_nr);
 		pos = oid_array_lookup(&odb->loose_objects_cache, &ds->bin_pfx);
 		if (pos < 0)
 			pos = -1 - pos;
@@ -125,8 +108,6 @@ static void find_short_object_filename(struct disambiguate_state *ds)
 			pos++;
 		}
 	}
-
-	strbuf_release(&buf);
 }
 
 static int match_sha(unsigned len, const unsigned char *a, const unsigned char *b)
-- 
2.19.1.1577.g2c5b293d4f



* [PATCH 8/9] sha1-file: use loose object cache for quick existence check
  2018-11-12 14:46                     ` [PATCH 0/9] caching loose objects Jeff King
                                         ` (6 preceding siblings ...)
  2018-11-12 14:50                       ` [PATCH 7/9] object-store: provide helpers for loose_objects_cache Jeff King
@ 2018-11-12 14:54                       ` Jeff King
  2018-11-12 16:00                         ` Derrick Stolee
  2018-11-12 16:01                         ` Ævar Arnfjörð Bjarmason
  2018-11-12 14:55                       ` [PATCH 9/9] fetch-pack: drop custom loose object cache Jeff King
  2018-11-12 16:02                       ` [PATCH 0/9] caching loose objects Derrick Stolee
  9 siblings, 2 replies; 87+ messages in thread
From: Jeff King @ 2018-11-12 14:54 UTC (permalink / raw)
  To: Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

In cases where we expect to ask has_sha1_file() about a lot of objects
that we are not likely to have (e.g., during fetch negotiation), we
already use OBJECT_INFO_QUICK to sacrifice accuracy (due to racing with
a simultaneous write or repack) for speed (we avoid re-scanning the pack
directory).

However, even checking for loose objects can be expensive, as we will
stat() each one. On many systems this cost isn't too noticeable, but
stat() can be particularly slow on some operating systems, or due to
network filesystems.

Since the QUICK flag already tells us that we're OK with a slightly
stale answer, we can use that as a cue to look in our in-memory cache of
each object directory. That basically trades an in-memory binary search
for a stat() call.
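
As a rough illustration of that trade, here is a minimal sketch of an
in-memory existence check over a sorted array of object ids. It assumes
20-byte ids and sorts lazily on the first lookup; struct id_cache and
the id_cache_*() helpers are hypothetical names, not git's oid_array
API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HASH_LEN 20

/* Hypothetical growable array of raw object ids. */
struct id_cache {
	unsigned char (*ids)[HASH_LEN];
	size_t nr, alloc;
	int sorted;
};

static int cmp_id(const void *a, const void *b)
{
	return memcmp(a, b, HASH_LEN);
}

/* Append one id; the array is re-sorted on the next lookup. */
static void id_cache_append(struct id_cache *c, const unsigned char *id)
{
	assert(id);
	if (c->nr == c->alloc) {
		size_t alloc = c->alloc ? 2 * c->alloc : 16;
		void *p = realloc(c->ids, alloc * sizeof(*c->ids));
		if (!p)
			abort();
		c->ids = p;
		c->alloc = alloc;
	}
	memcpy(c->ids[c->nr++], id, HASH_LEN);
	c->sorted = 0;
}

/* Return 1 if the id is cached, 0 otherwise; no syscalls involved. */
static int id_cache_has(struct id_cache *c, const unsigned char *id)
{
	if (!c->nr)
		return 0;
	if (!c->sorted) {
		qsort(c->ids, c->nr, sizeof(*c->ids), cmp_id);
		c->sorted = 1;
	}
	return bsearch(id, c->ids, c->nr, sizeof(*c->ids), cmp_id) != NULL;
}

static void id_cache_clear(struct id_cache *c)
{
	free(c->ids);
	c->ids = NULL;
	c->nr = c->alloc = 0;
	c->sorted = 0;
}
```

A negative answer from such a cache only means the object was not
present when the cache was filled, which is why it suits callers that
already accept a slightly stale result.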

Note that it is possible for this to actually be _slower_. We'll do a
full readdir() to fill the cache, so if you have a very large number of
loose objects and a very small number of lookups, that readdir() may end
up more expensive.

This shouldn't be a big deal in practice. If you have a large number of
reachable loose objects, you'll already run into performance problems
(which you should remedy by repacking). You may have unreachable objects
which wouldn't otherwise impact performance. Usually these would go away
with the prune step of "git gc", but they may be held for up to 2 weeks
in the default configuration.

So it comes down to how many such objects you might reasonably expect to
have, and how much slower readdir() on N entries is than M stat() calls
(here we really care about the syscall backing readdir(), like
getdents() on Linux, but I'll just call it readdir() below).

If N is much smaller than M (a typical packed repo), we know this is a
big win (a few readdir() calls followed by many uses of the resulting cache).
When N and M are similar in size, it's also a win. We care about the
latency of making a syscall, and readdir() should be giving us many
values in a single call. How many?

On Linux, running "strace -e getdents ls" shows a 32k buffer getting 512
entries per call (which is 64 bytes per entry; the name itself is 38
bytes, plus there are some other fields). So we can imagine that this is
always a win as long as the number of loose objects in the repository is
a factor of 500 less than the number of lookups you make. It's hard to
auto-tune this because we don't generally know up front how many lookups
we're going to do. But it's unlikely for this to perform significantly
worse.
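
The per-call figures above can be turned into a rough syscall estimate.
The sketch below is a back-of-the-envelope model, not a measurement: it
assumes every entry takes 64 bytes of a 32k buffer, ignores the
open/close and final empty getdents() call for each directory, and both
function names are made up for illustration.

```c
#define GETDENTS_BUF_SIZE 32768UL /* buffer size seen in the strace output */
#define DIRENT_SIZE 64UL          /* approximate bytes per loose-object entry */

/* Estimated getdents() calls needed to list nr_loose directory entries. */
static unsigned long estimate_readdir_calls(unsigned long nr_loose)
{
	unsigned long per_call = GETDENTS_BUF_SIZE / DIRENT_SIZE; /* 512 */

	return (nr_loose + per_call - 1) / per_call;
}

/* Without the cache, each lookup costs one stat() call. */
static unsigned long estimate_stat_calls(unsigned long nr_lookups)
{
	return nr_lookups;
}
```

Under this model, 100,000 loose objects come to an estimated 196
getdents() calls to fill the cache, to be weighed against one stat()
per lookup without it.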

Signed-off-by: Jeff King <peff@peff.net>
---
There's some obvious hand-waving in the paragraphs above. I would love
it if somebody with an NFS system could do some before/after timings
with various numbers of loose objects, to get a sense of where the
breakeven point is.

My gut is that we do not need the complexity of a cache-size limit, nor
of a config option to disable this. But it would be nice to have a real
number where "reasonable" ends and "pathological" begins. :)

 object-store.h |  1 +
 sha1-file.c    | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/object-store.h b/object-store.h
index bf1e0cb761..60758efad8 100644
--- a/object-store.h
+++ b/object-store.h
@@ -13,6 +13,7 @@ struct object_directory {
 	/*
 	 * Used to store the results of readdir(3) calls when we are OK
 	 * sacrificing accuracy due to races for speed. That includes
+	 * object existence with OBJECT_INFO_QUICK, as well as
 	 * our search for unique abbreviated hashes. Don't use it for tasks
 	 * requiring greater accuracy!
 	 *
diff --git a/sha1-file.c b/sha1-file.c
index 4aae716a37..e53da0b701 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -921,6 +921,24 @@ static int open_sha1_file(struct repository *r,
 	return -1;
 }
 
+static int quick_has_loose(struct repository *r,
+			   const unsigned char *sha1)
+{
+	int subdir_nr = sha1[0];
+	struct object_id oid;
+	struct object_directory *odb;
+
+	hashcpy(oid.hash, sha1);
+
+	prepare_alt_odb(r);
+	for (odb = r->objects->odb; odb; odb = odb->next) {
+		odb_load_loose_cache(odb, subdir_nr);
+		if (oid_array_lookup(&odb->loose_objects_cache, &oid) >= 0)
+			return 1;
+	}
+	return 0;
+}
+
 /*
  * Map the loose object at "path" if it is not NULL, or the path found by
  * searching for a loose object named "sha1".
@@ -1171,6 +1189,8 @@ static int sha1_loose_object_info(struct repository *r,
 	if (!oi->typep && !oi->type_name && !oi->sizep && !oi->contentp) {
 		const char *path;
 		struct stat st;
+		if (!oi->disk_sizep && (flags & OBJECT_INFO_QUICK))
+			return quick_has_loose(r, sha1) ? 0 : -1;
 		if (stat_sha1_file(r, sha1, &st, &path) < 0)
 			return -1;
 		if (oi->disk_sizep)
-- 
2.19.1.1577.g2c5b293d4f



* [PATCH 9/9] fetch-pack: drop custom loose object cache
  2018-11-12 14:46                     ` [PATCH 0/9] caching loose objects Jeff King
                                         ` (7 preceding siblings ...)
  2018-11-12 14:54                       ` [PATCH 8/9] sha1-file: use loose object cache for quick existence check Jeff King
@ 2018-11-12 14:55                       ` Jeff King
  2018-11-12 19:25                         ` René Scharfe
  2018-11-12 16:02                       ` [PATCH 0/9] caching loose objects Derrick Stolee
  9 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-11-12 14:55 UTC (permalink / raw)
  To: Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

Commit 024aa4696c (fetch-pack.c: use oidset to check existence of loose
object, 2018-03-14) added a cache to avoid calling stat() for a bunch of
loose objects we don't have.

Now that OBJECT_INFO_QUICK handles this caching itself, we can drop the
custom solution.

Note that this might perform slightly differently, as the original code
stopped calling readdir() when we saw more loose objects than there were
refs. So:

  1. The old code might have spent work on readdir() to fill the cache,
     but then decided there were too many loose objects, wasting that
     effort.

  2. The new code might spend a lot of time on readdir() if you have a
     lot of loose objects, even though there are very few objects to
     ask about.

In practice it probably won't matter either way; see the previous commit
for some discussion of the tradeoff.
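
The early-stop heuristic of the old code (point 1 above) can be modeled
as below. This is a simplified sketch: struct ref_node, struct
budget_iter and try_fill_cache() are hypothetical, and the loop in
try_fill_cache() stands in for the loose-object enumeration that would
normally invoke the callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a linked list of refs. */
struct ref_node {
	struct ref_node *next;
};

struct budget_iter {
	const struct ref_node *refs; /* advances one step per loose object */
	size_t nr_collected;         /* objects put in the cache so far */
};

/* Returns nonzero to abort once loose objects outnumber the refs. */
static int collect_with_budget(struct budget_iter *it)
{
	it->nr_collected++;
	if (!it->refs)
		return 1;
	it->refs = it->refs->next;
	return 0;
}

/*
 * Feed nr_loose objects through the callback. Returns 1 if the cache
 * was filled completely (usable), 0 if the enumeration was aborted and
 * the work already done is thrown away.
 */
static int try_fill_cache(size_t nr_loose, const struct ref_node *refs,
			  size_t *collected)
{
	struct budget_iter it = { refs, 0 };
	size_t i;
	int aborted = 0;

	assert(collected);
	for (i = 0; i < nr_loose && !aborted; i++)
		aborted = collect_with_budget(&it);
	*collected = it.nr_collected;
	return !aborted;
}
```

With three refs, ten loose objects abort the fill after four insertions,
which is the wasted effort the first point refers to; three loose
objects complete it.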

Signed-off-by: Jeff King <peff@peff.net>
---
 fetch-pack.c | 39 ++-------------------------------------
 1 file changed, 2 insertions(+), 37 deletions(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index b3ed7121bc..25a88f4eb2 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -636,23 +636,6 @@ struct loose_object_iter {
 	struct ref *refs;
 };
 
-/*
- *  If the number of refs is not larger than the number of loose objects,
- *  this function stops inserting.
- */
-static int add_loose_objects_to_set(const struct object_id *oid,
-				    const char *path,
-				    void *data)
-{
-	struct loose_object_iter *iter = data;
-	oidset_insert(iter->loose_object_set, oid);
-	if (iter->refs == NULL)
-		return 1;
-
-	iter->refs = iter->refs->next;
-	return 0;
-}
-
 /*
  * Mark recent commits available locally and reachable from a local ref as
  * COMPLETE. If args->no_dependents is false, also mark COMPLETE remote refs as
@@ -670,30 +653,14 @@ static void mark_complete_and_common_ref(struct fetch_negotiator *negotiator,
 	struct ref *ref;
 	int old_save_commit_buffer = save_commit_buffer;
 	timestamp_t cutoff = 0;
-	struct oidset loose_oid_set = OIDSET_INIT;
-	int use_oidset = 0;
-	struct loose_object_iter iter = {&loose_oid_set, *refs};
-
-	/* Enumerate all loose objects or know refs are not so many. */
-	use_oidset = !for_each_loose_object(add_loose_objects_to_set,
-					    &iter, 0);
 
 	save_commit_buffer = 0;
 
 	for (ref = *refs; ref; ref = ref->next) {
 		struct object *o;
-		unsigned int flags = OBJECT_INFO_QUICK;
 
-		if (use_oidset &&
-		    !oidset_contains(&loose_oid_set, &ref->old_oid)) {
-			/*
-			 * I know this does not exist in the loose form,
-			 * so check if it exists in a non-loose form.
-			 */
-			flags |= OBJECT_INFO_IGNORE_LOOSE;
-		}
-
-		if (!has_object_file_with_flags(&ref->old_oid, flags))
+		if (!has_object_file_with_flags(&ref->old_oid,
+						OBJECT_INFO_QUICK))
 			continue;
 		o = parse_object(the_repository, &ref->old_oid);
 		if (!o)
@@ -710,8 +677,6 @@ static void mark_complete_and_common_ref(struct fetch_negotiator *negotiator,
 		}
 	}
 
-	oidset_clear(&loose_oid_set);
-
 	if (!args->deepen) {
 		for_each_ref(mark_complete_oid, NULL);
 		for_each_cached_alternate(NULL, mark_alternate_complete);
-- 
2.19.1.1577.g2c5b293d4f


* Re: [PATCH 1/9] fsck: do not reuse child_process structs
  2018-11-12 14:46                       ` [PATCH 1/9] fsck: do not reuse child_process structs Jeff King
@ 2018-11-12 15:26                         ` Derrick Stolee
  0 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-11-12 15:26 UTC (permalink / raw)
  To: Jeff King, Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

On 11/12/2018 9:46 AM, Jeff King wrote:
> The run-command API makes no promises about what is left in a struct
> child_process after a command finishes, and it's not safe to simply
> reuse it again for a similar command. In particular:
>
>   - if you use child->args or child->env_array, they are cleared after
>     finish_command()
>
>   - likewise, start_command() may point child->argv at child->args->argv;
>     reusing that would lead to accessing freed memory
>
>   - the in/out/err may hold pipe descriptors from the previous run

Thanks! This is helpful information.

> These two calls are _probably_ OK because they do not use any of those
> features. But it's only by chance, and may break in the future; let's
> reinitialize our struct for each program we run.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>   builtin/fsck.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index 06eb421720..b10f2b154c 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -841,6 +841,9 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>   
>   		prepare_alt_odb(the_repository);
>   		for (alt =  the_repository->objects->alt_odb_list; alt; alt = alt->next) {
> +			child_process_init(&commit_graph_verify);
> +			commit_graph_verify.argv = verify_argv;
> +			commit_graph_verify.git_cmd = 1;
>   			verify_argv[2] = "--object-dir";
>   			verify_argv[3] = alt->path;
>   			if (run_command(&commit_graph_verify))
> @@ -859,6 +862,9 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>   
>   		prepare_alt_odb(the_repository);
>   		for (alt =  the_repository->objects->alt_odb_list; alt; alt = alt->next) {
> +			child_process_init(&midx_verify);
> +			midx_verify.argv = midx_argv;
> +			midx_verify.git_cmd = 1;
>   			midx_argv[2] = "--object-dir";
>   			midx_argv[3] = alt->path;
>   			if (run_command(&midx_verify))

Looks good to me.

-Stolee


* Re: [PATCH 3/9] rename "alternate_object_database" to "object_directory"
  2018-11-12 14:48                       ` [PATCH 3/9] rename "alternate_object_database" to "object_directory" Jeff King
@ 2018-11-12 15:30                         ` Derrick Stolee
  2018-11-12 15:36                           ` Jeff King
  0 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee @ 2018-11-12 15:30 UTC (permalink / raw)
  To: Jeff King, Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

On 11/12/2018 9:48 AM, Jeff King wrote:
> In preparation for unifying the handling of alt odb's and the normal
> repo object directory, let's use a more neutral name. This patch is
> purely mechanical, swapping the type name, and converting any variables
> named "alt" to "odb". There should be no functional change, but it will
> reduce the noise in subsequent diffs.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> I waffled on calling this object_database instead of object_directory.
> But really, it is very specifically about the directory (packed
> storage, including packs from alternates, is handled elsewhere).

That makes sense. Each alternate makes its own object directory, but is 
part of a larger object database. It also helps clarify a difference 
from the object_store.

My only complaint is that you have a lot of variable names with "odb" 
which are now object_directory pointers. Perhaps "odb" -> "objdir"? Or 
is that just too much change?

>
>   builtin/count-objects.c     |  4 ++--
>   builtin/fsck.c              | 16 ++++++-------
>   builtin/submodule--helper.c |  6 ++---
>   commit-graph.c              | 10 ++++----
>   object-store.h              | 14 +++++------
>   object.c                    | 10 ++++----
>   packfile.c                  |  8 +++----
>   sha1-file.c                 | 48 ++++++++++++++++++-------------------
>   sha1-name.c                 | 20 ++++++++--------
>   transport.c                 |  2 +-
>   10 files changed, 69 insertions(+), 69 deletions(-)
>
> diff --git a/builtin/count-objects.c b/builtin/count-objects.c
> index a7cad052c6..3fae474f6f 100644
> --- a/builtin/count-objects.c
> +++ b/builtin/count-objects.c
> @@ -78,10 +78,10 @@ static int count_cruft(const char *basename, const char *path, void *data)
>   	return 0;
>   }
>   
> -static int print_alternate(struct alternate_object_database *alt, void *data)
> +static int print_alternate(struct object_directory *odb, void *data)
>   {
>   	printf("alternate: ");
> -	quote_c_style(alt->path, NULL, stdout, 0);
> +	quote_c_style(odb->path, NULL, stdout, 0);
>   	putchar('\n');
>   	return 0;
>   }
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index b10f2b154c..55153cf92a 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -688,7 +688,7 @@ static struct option fsck_opts[] = {
>   int cmd_fsck(int argc, const char **argv, const char *prefix)
>   {
>   	int i;
> -	struct alternate_object_database *alt;
> +	struct object_directory *odb;
>   
>   	/* fsck knows how to handle missing promisor objects */
>   	fetch_if_missing = 0;
> @@ -725,14 +725,14 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>   		for_each_loose_object(mark_loose_for_connectivity, NULL, 0);
>   		for_each_packed_object(mark_packed_for_connectivity, NULL, 0);
>   	} else {
> -		struct alternate_object_database *alt_odb_list;
> +		struct object_directory *alt_odb_list;
>   
>   		fsck_object_dir(get_object_directory());
>   
>   		prepare_alt_odb(the_repository);
>   		alt_odb_list = the_repository->objects->alt_odb_list;
> -		for (alt = alt_odb_list; alt; alt = alt->next)
> -			fsck_object_dir(alt->path);
> +		for (odb = alt_odb_list; odb; odb = odb->next)
> +			fsck_object_dir(odb->path);
>   
>   		if (check_full) {
>   			struct packed_git *p;
> @@ -840,12 +840,12 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>   			errors_found |= ERROR_COMMIT_GRAPH;
>   
>   		prepare_alt_odb(the_repository);
> -		for (alt =  the_repository->objects->alt_odb_list; alt; alt = alt->next) {
> +		for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) {
>   			child_process_init(&commit_graph_verify);
>   			commit_graph_verify.argv = verify_argv;
>   			commit_graph_verify.git_cmd = 1;
>   			verify_argv[2] = "--object-dir";
> -			verify_argv[3] = alt->path;
> +			verify_argv[3] = odb->path;
>   			if (run_command(&commit_graph_verify))
>   				errors_found |= ERROR_COMMIT_GRAPH;
>   		}
> @@ -861,12 +861,12 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>   			errors_found |= ERROR_COMMIT_GRAPH;
>   
>   		prepare_alt_odb(the_repository);
> -		for (alt =  the_repository->objects->alt_odb_list; alt; alt = alt->next) {
> +		for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) {
>   			child_process_init(&midx_verify);
>   			midx_verify.argv = midx_argv;
>   			midx_verify.git_cmd = 1;
>   			midx_argv[2] = "--object-dir";
> -			midx_argv[3] = alt->path;
> +			midx_argv[3] = odb->path;
>   			if (run_command(&midx_verify))
>   				errors_found |= ERROR_COMMIT_GRAPH;
>   		}
> diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
> index 28b9449e82..3ae451bc46 100644
> --- a/builtin/submodule--helper.c
> +++ b/builtin/submodule--helper.c
> @@ -1265,7 +1265,7 @@ struct submodule_alternate_setup {
>   	SUBMODULE_ALTERNATE_ERROR_IGNORE, NULL }
>   
>   static int add_possible_reference_from_superproject(
> -		struct alternate_object_database *alt, void *sas_cb)
> +		struct object_directory *odb, void *sas_cb)
>   {
>   	struct submodule_alternate_setup *sas = sas_cb;
>   	size_t len;
> @@ -1274,11 +1274,11 @@ static int add_possible_reference_from_superproject(
>   	 * If the alternate object store is another repository, try the
>   	 * standard layout with .git/(modules/<name>)+/objects
>   	 */
> -	if (strip_suffix(alt->path, "/objects", &len)) {
> +	if (strip_suffix(odb->path, "/objects", &len)) {
>   		char *sm_alternate;
>   		struct strbuf sb = STRBUF_INIT;
>   		struct strbuf err = STRBUF_INIT;
> -		strbuf_add(&sb, alt->path, len);
> +		strbuf_add(&sb, odb->path, len);
>   
>   		/*
>   		 * We need to end the new path with '/' to mark it as a dir,
> diff --git a/commit-graph.c b/commit-graph.c
> index 40c855f185..5dd3f5b15c 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -230,7 +230,7 @@ static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
>    */
>   static int prepare_commit_graph(struct repository *r)
>   {
> -	struct alternate_object_database *alt;
> +	struct object_directory *odb;
>   	char *obj_dir;
>   	int config_value;
>   
> @@ -255,10 +255,10 @@ static int prepare_commit_graph(struct repository *r)
>   	obj_dir = r->objects->objectdir;
>   	prepare_commit_graph_one(r, obj_dir);
>   	prepare_alt_odb(r);
> -	for (alt = r->objects->alt_odb_list;
> -	     !r->objects->commit_graph && alt;
> -	     alt = alt->next)
> -		prepare_commit_graph_one(r, alt->path);
> +	for (odb = r->objects->alt_odb_list;
> +	     !r->objects->commit_graph && odb;
> +	     odb = odb->next)
> +		prepare_commit_graph_one(r, odb->path);
>   	return !!r->objects->commit_graph;
>   }
>   
> diff --git a/object-store.h b/object-store.h
> index 63b7605a3e..122d5f75e2 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -7,8 +7,8 @@
>   #include "sha1-array.h"
>   #include "strbuf.h"
>   
> -struct alternate_object_database {
> -	struct alternate_object_database *next;
> +struct object_directory {
> +	struct object_directory *next;
>   
>   	/* see alt_scratch_buf() */
>   	struct strbuf scratch;
> @@ -32,14 +32,14 @@ struct alternate_object_database {
>   };
>   void prepare_alt_odb(struct repository *r);
>   char *compute_alternate_path(const char *path, struct strbuf *err);
> -typedef int alt_odb_fn(struct alternate_object_database *, void *);
> +typedef int alt_odb_fn(struct object_directory *, void *);
>   int foreach_alt_odb(alt_odb_fn, void*);
>   
>   /*
>    * Allocate a "struct alternate_object_database" but do _not_ actually
>    * add it to the list of alternates.
>    */
> -struct alternate_object_database *alloc_alt_odb(const char *dir);
> +struct object_directory *alloc_alt_odb(const char *dir);
>   
>   /*
>    * Add the directory to the on-disk alternates file; the new entry will also
> @@ -60,7 +60,7 @@ void add_to_alternates_memory(const char *dir);
>    * alternate. Always use this over direct access to alt->scratch, as it
>    * cleans up any previous use of the scratch buffer.
>    */
> -struct strbuf *alt_scratch_buf(struct alternate_object_database *alt);
> +struct strbuf *alt_scratch_buf(struct object_directory *odb);
>   
>   struct packed_git {
>   	struct packed_git *next;
> @@ -100,8 +100,8 @@ struct raw_object_store {
>   	/* Path to extra alternate object database if not NULL */
>   	char *alternate_db;
>   
> -	struct alternate_object_database *alt_odb_list;
> -	struct alternate_object_database **alt_odb_tail;
> +	struct object_directory *alt_odb_list;
> +	struct object_directory **alt_odb_tail;
>   
>   	/*
>   	 * Objects that should be substituted by other objects
> diff --git a/object.c b/object.c
> index e54160550c..6af8e908bb 100644
> --- a/object.c
> +++ b/object.c
> @@ -482,17 +482,17 @@ struct raw_object_store *raw_object_store_new(void)
>   	return o;
>   }
>   
> -static void free_alt_odb(struct alternate_object_database *alt)
> +static void free_alt_odb(struct object_directory *odb)
>   {
> -	strbuf_release(&alt->scratch);
> -	oid_array_clear(&alt->loose_objects_cache);
> -	free(alt);
> +	strbuf_release(&odb->scratch);
> +	oid_array_clear(&odb->loose_objects_cache);
> +	free(odb);
>   }
>   
>   static void free_alt_odbs(struct raw_object_store *o)
>   {
>   	while (o->alt_odb_list) {
> -		struct alternate_object_database *next;
> +		struct object_directory *next;
>   
>   		next = o->alt_odb_list->next;
>   		free_alt_odb(o->alt_odb_list);
> diff --git a/packfile.c b/packfile.c
> index f2850a00b5..d6d511cfd2 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -966,16 +966,16 @@ static void prepare_packed_git_mru(struct repository *r)
>   
>   static void prepare_packed_git(struct repository *r)
>   {
> -	struct alternate_object_database *alt;
> +	struct object_directory *odb;
>   
>   	if (r->objects->packed_git_initialized)
>   		return;
>   	prepare_multi_pack_index_one(r, r->objects->objectdir, 1);
>   	prepare_packed_git_one(r, r->objects->objectdir, 1);
>   	prepare_alt_odb(r);
> -	for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
> -		prepare_multi_pack_index_one(r, alt->path, 0);
> -		prepare_packed_git_one(r, alt->path, 0);
> +	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
> +		prepare_multi_pack_index_one(r, odb->path, 0);
> +		prepare_packed_git_one(r, odb->path, 0);
>   	}
>   	rearrange_packed_git(r);
>   
> diff --git a/sha1-file.c b/sha1-file.c
> index dd0b6aa873..a3cc650a0a 100644
> --- a/sha1-file.c
> +++ b/sha1-file.c
> @@ -353,16 +353,16 @@ void sha1_file_name(struct repository *r, struct strbuf *buf, const unsigned cha
>   	fill_sha1_path(buf, sha1);
>   }
>   
> -struct strbuf *alt_scratch_buf(struct alternate_object_database *alt)
> +struct strbuf *alt_scratch_buf(struct object_directory *odb)
>   {
> -	strbuf_setlen(&alt->scratch, alt->base_len);
> -	return &alt->scratch;
> +	strbuf_setlen(&odb->scratch, odb->base_len);
> +	return &odb->scratch;
>   }
>   
> -static const char *alt_sha1_path(struct alternate_object_database *alt,
> +static const char *alt_sha1_path(struct object_directory *odb,
>   				 const unsigned char *sha1)
>   {
> -	struct strbuf *buf = alt_scratch_buf(alt);
> +	struct strbuf *buf = alt_scratch_buf(odb);
>   	fill_sha1_path(buf, sha1);
>   	return buf->buf;
>   }
> @@ -374,7 +374,7 @@ static int alt_odb_usable(struct raw_object_store *o,
>   			  struct strbuf *path,
>   			  const char *normalized_objdir)
>   {
> -	struct alternate_object_database *alt;
> +	struct object_directory *odb;
>   
>   	/* Detect cases where alternate disappeared */
>   	if (!is_directory(path->buf)) {
> @@ -388,8 +388,8 @@ static int alt_odb_usable(struct raw_object_store *o,
>   	 * Prevent the common mistake of listing the same
>   	 * thing twice, or object directory itself.
>   	 */
> -	for (alt = o->alt_odb_list; alt; alt = alt->next) {
> -		if (!fspathcmp(path->buf, alt->path))
> +	for (odb = o->alt_odb_list; odb; odb = odb->next) {
> +		if (!fspathcmp(path->buf, odb->path))
>   			return 0;
>   	}
>   	if (!fspathcmp(path->buf, normalized_objdir))
> @@ -402,7 +402,7 @@ static int alt_odb_usable(struct raw_object_store *o,
>    * Prepare alternate object database registry.
>    *
>    * The variable alt_odb_list points at the list of struct
> - * alternate_object_database.  The elements on this list come from
> + * object_directory.  The elements on this list come from
>    * non-empty elements from colon separated ALTERNATE_DB_ENVIRONMENT
>    * environment variable, and $GIT_OBJECT_DIRECTORY/info/alternates,
>    * whose contents is similar to that environment variable but can be
> @@ -419,7 +419,7 @@ static void read_info_alternates(struct repository *r,
>   static int link_alt_odb_entry(struct repository *r, const char *entry,
>   	const char *relative_base, int depth, const char *normalized_objdir)
>   {
> -	struct alternate_object_database *ent;
> +	struct object_directory *ent;
>   	struct strbuf pathbuf = STRBUF_INIT;
>   
>   	if (!is_absolute_path(entry) && relative_base) {
> @@ -540,9 +540,9 @@ static void read_info_alternates(struct repository *r,
>   	free(path);
>   }
>   
> -struct alternate_object_database *alloc_alt_odb(const char *dir)
> +struct object_directory *alloc_alt_odb(const char *dir)
>   {
> -	struct alternate_object_database *ent;
> +	struct object_directory *ent;
>   
>   	FLEX_ALLOC_STR(ent, path, dir);
>   	strbuf_init(&ent->scratch, 0);
> @@ -684,7 +684,7 @@ char *compute_alternate_path(const char *path, struct strbuf *err)
>   
>   int foreach_alt_odb(alt_odb_fn fn, void *cb)
>   {
> -	struct alternate_object_database *ent;
> +	struct object_directory *ent;
>   	int r = 0;
>   
>   	prepare_alt_odb(the_repository);
> @@ -743,10 +743,10 @@ static int check_and_freshen_local(const struct object_id *oid, int freshen)
>   
>   static int check_and_freshen_nonlocal(const struct object_id *oid, int freshen)
>   {
> -	struct alternate_object_database *alt;
> +	struct object_directory *odb;
>   	prepare_alt_odb(the_repository);
> -	for (alt = the_repository->objects->alt_odb_list; alt; alt = alt->next) {
> -		const char *path = alt_sha1_path(alt, oid->hash);
> +	for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) {
> +		const char *path = alt_sha1_path(odb, oid->hash);
>   		if (check_and_freshen_file(path, freshen))
>   			return 1;
>   	}
> @@ -893,7 +893,7 @@ int git_open_cloexec(const char *name, int flags)
>   static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
>   			  struct stat *st, const char **path)
>   {
> -	struct alternate_object_database *alt;
> +	struct object_directory *odb;
>   	static struct strbuf buf = STRBUF_INIT;
>   
>   	strbuf_reset(&buf);
> @@ -905,8 +905,8 @@ static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
>   
>   	prepare_alt_odb(r);
>   	errno = ENOENT;
> -	for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
> -		*path = alt_sha1_path(alt, sha1);
> +	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
> +		*path = alt_sha1_path(odb, sha1);
>   		if (!lstat(*path, st))
>   			return 0;
>   	}
> @@ -922,7 +922,7 @@ static int open_sha1_file(struct repository *r,
>   			  const unsigned char *sha1, const char **path)
>   {
>   	int fd;
> -	struct alternate_object_database *alt;
> +	struct object_directory *odb;
>   	int most_interesting_errno;
>   	static struct strbuf buf = STRBUF_INIT;
>   
> @@ -936,8 +936,8 @@ static int open_sha1_file(struct repository *r,
>   	most_interesting_errno = errno;
>   
>   	prepare_alt_odb(r);
> -	for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
> -		*path = alt_sha1_path(alt, sha1);
> +	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
> +		*path = alt_sha1_path(odb, sha1);
>   		fd = git_open(*path);
>   		if (fd >= 0)
>   			return fd;
> @@ -2139,14 +2139,14 @@ struct loose_alt_odb_data {
>   	void *data;
>   };
>   
> -static int loose_from_alt_odb(struct alternate_object_database *alt,
> +static int loose_from_alt_odb(struct object_directory *odb,
>   			      void *vdata)
>   {
>   	struct loose_alt_odb_data *data = vdata;
>   	struct strbuf buf = STRBUF_INIT;
>   	int r;
>   
> -	strbuf_addstr(&buf, alt->path);
> +	strbuf_addstr(&buf, odb->path);
>   	r = for_each_loose_file_in_objdir_buf(&buf,
>   					      data->cb, NULL, NULL,
>   					      data->data);
> diff --git a/sha1-name.c b/sha1-name.c
> index faa60f69e3..2594aa79f8 100644
> --- a/sha1-name.c
> +++ b/sha1-name.c
> @@ -95,8 +95,8 @@ static int match_sha(unsigned, const unsigned char *, const unsigned char *);
>   static void find_short_object_filename(struct disambiguate_state *ds)
>   {
>   	int subdir_nr = ds->bin_pfx.hash[0];
> -	struct alternate_object_database *alt;
> -	static struct alternate_object_database *fakeent;
> +	struct object_directory *odb;
> +	static struct object_directory *fakeent;
>   
>   	if (!fakeent) {
>   		/*
> @@ -110,24 +110,24 @@ static void find_short_object_filename(struct disambiguate_state *ds)
>   	}
>   	fakeent->next = the_repository->objects->alt_odb_list;
>   
> -	for (alt = fakeent; alt && !ds->ambiguous; alt = alt->next) {
> +	for (odb = fakeent; odb && !ds->ambiguous; odb = odb->next) {
>   		int pos;
>   
> -		if (!alt->loose_objects_subdir_seen[subdir_nr]) {
> -			struct strbuf *buf = alt_scratch_buf(alt);
> +		if (!odb->loose_objects_subdir_seen[subdir_nr]) {
> +			struct strbuf *buf = alt_scratch_buf(odb);
>   			for_each_file_in_obj_subdir(subdir_nr, buf,
>   						    append_loose_object,
>   						    NULL, NULL,
> -						    &alt->loose_objects_cache);
> -			alt->loose_objects_subdir_seen[subdir_nr] = 1;
> +						    &odb->loose_objects_cache);
> +			odb->loose_objects_subdir_seen[subdir_nr] = 1;
>   		}
>   
> -		pos = oid_array_lookup(&alt->loose_objects_cache, &ds->bin_pfx);
> +		pos = oid_array_lookup(&odb->loose_objects_cache, &ds->bin_pfx);
>   		if (pos < 0)
>   			pos = -1 - pos;
> -		while (!ds->ambiguous && pos < alt->loose_objects_cache.nr) {
> +		while (!ds->ambiguous && pos < odb->loose_objects_cache.nr) {
>   			const struct object_id *oid;
> -			oid = alt->loose_objects_cache.oid + pos;
> +			oid = odb->loose_objects_cache.oid + pos;
>   			if (!match_sha(ds->len, ds->bin_pfx.hash, oid->hash))
>   				break;
>   			update_candidates(ds, oid);
> diff --git a/transport.c b/transport.c
> index 5a74b609ff..040e92c134 100644
> --- a/transport.c
> +++ b/transport.c
> @@ -1433,7 +1433,7 @@ struct alternate_refs_data {
>   	void *data;
>   };
>   
> -static int refs_from_alternate_cb(struct alternate_object_database *e,
> +static int refs_from_alternate_cb(struct object_directory *e,
>   				  void *data)
>   {
>   	struct strbuf path = STRBUF_INIT;


* Re: [PATCH 4/9] sha1_file_name(): overwrite buffer instead of appending
  2018-11-12 14:48                       ` [PATCH 4/9] sha1_file_name(): overwrite buffer instead of appending Jeff King
@ 2018-11-12 15:32                         ` Derrick Stolee
  0 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-11-12 15:32 UTC (permalink / raw)
  To: Jeff King, Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

On 11/12/2018 9:48 AM, Jeff King wrote:
> Since we're changing the semantics, let's take the opportunity to give
> it a more hash-neutral name (which will also catch any callers from
> topics in flight).

THANK YOU! This method name confused me so much when I was first looking 
at the code, but the new name is so much better.

-Stolee

* Re: [PATCH 3/9] rename "alternate_object_database" to "object_directory"
  2018-11-12 15:30                         ` Derrick Stolee
@ 2018-11-12 15:36                           ` Jeff King
  2018-11-12 19:41                             ` Ramsay Jones
  0 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-11-12 15:36 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Geert Jansen, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, git, René Scharfe, Takuto Ikuta

On Mon, Nov 12, 2018 at 10:30:55AM -0500, Derrick Stolee wrote:

> On 11/12/2018 9:48 AM, Jeff King wrote:
> > In preparation for unifying the handling of alt odb's and the normal
> > repo object directory, let's use a more neutral name. This patch is
> > purely mechanical, swapping the type name, and converting any variables
> > named "alt" to "odb". There should be no functional change, but it will
> > reduce the noise in subsequent diffs.
> > 
> > Signed-off-by: Jeff King <peff@peff.net>
> > ---
> > I waffled on calling this object_database instead of object_directory.
> > But really, it is very specifically about the directory (packed
> > storage, including packs from alternates, is handled elsewhere).
> 
> That makes sense. Each alternate makes its own object directory, but is part
> of a larger object database. It also helps clarify a difference from the
> object_store.
> 
> My only complaint is that you have a lot of variable names with "odb" which
> are now object_directory pointers. Perhaps "odb" -> "objdir"? Or is that
> just too much change?

Yeah, that was part of my waffling. ;)

From my conversions, usually "objdir" is a string holding the pathname,
though that's not set in stone. I also like that "odb" is the same short
length as "alt", which helps with conversion.

But I dunno.

-Peff

* Re: [PATCH 5/9] handle alternates paths the same as the main object dir
  2018-11-12 14:49                       ` [PATCH 5/9] handle alternates paths the same as the main object dir Jeff King
@ 2018-11-12 15:38                         ` Derrick Stolee
  2018-11-12 15:46                           ` Jeff King
  0 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee @ 2018-11-12 15:38 UTC (permalink / raw)
  To: Jeff King, Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

On 11/12/2018 9:49 AM, Jeff King wrote:
> When we generate loose file paths for the main object directory, the
> caller provides a buffer to loose_object_path (formerly sha1_file_name).
> The callers generally keep their own static buffer to avoid excessive
> reallocations.
>
> But for alternate directories, each struct carries its own scratch
> buffer. This is needlessly different; let's unify them.
>
> We could go either direction here, but this patch moves the alternates
> struct over to the main directory style (rather than vice-versa).
> Technically the alternates style is more efficient, as it avoids
> rewriting the object directory name on each call. But this is unlikely
> to matter in practice, as we avoid reallocations either way (and nobody
> has ever noticed or complained that the main object directory is copying
> a few extra bytes before making a much more expensive system call).

Hm. I've complained in the past [1] about the cost of a simple method like
strbuf_addf() when iterating over loose objects, but that was during
abbreviation checks, where we were building the path string for every
loose object without actually reading the objects.

[1] 
https://public-inbox.org/git/20171201174956.143245-1-dstolee@microsoft.com/

The other concern I have is for alternates that may have long-ish paths 
to their object directories.

So, this is worth keeping an eye on, but is likely to be fine.
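
To make the trade-off concrete, here is a minimal sketch of the two
path-building styles being compared. It is not git's code: "struct pathbuf"
and the helper names are hypothetical stand-ins (a fixed-size buffer instead
of a growable strbuf), chosen only to show that the old alternates style
keeps "dir/" in place and truncates back to it, while the main-directory
style recopies the directory name on every call.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-size buffer standing in for a growable strbuf. */
struct pathbuf {
	char buf[256];
	size_t len;
};

/* Append s, truncating silently if it does not fit (sketch only). */
static void pathbuf_add(struct pathbuf *pb, const char *s)
{
	size_t n = strlen(s);
	if (pb->len + n >= sizeof(pb->buf))
		n = sizeof(pb->buf) - 1 - pb->len;
	memcpy(pb->buf + pb->len, s, n);
	pb->len += n;
	pb->buf[pb->len] = '\0';
}

/* Alternates style: "dir/" stays in the buffer; cut back to base_len. */
static const char *path_truncate_style(struct pathbuf *pb, size_t base_len,
				       const char *name)
{
	pb->len = base_len;
	pb->buf[pb->len] = '\0';
	pathbuf_add(pb, name);
	return pb->buf;
}

/* Main-directory style: reset and rewrite "dir/" on every call. */
static const char *path_rewrite_style(struct pathbuf *pb, const char *dir,
				      const char *name)
{
	pb->len = 0;
	pb->buf[0] = '\0';
	pathbuf_add(pb, dir);
	pathbuf_add(pb, "/");
	pathbuf_add(pb, name);
	return pb->buf;
}
```

Both produce the same string; the second copies strlen(dir) + 1 extra bytes
per call, which is the cost that grows with long-ish alternate paths.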

> And this has the advantage that the reusable buffers are tied to
> particular calls, which makes the invalidation rules simpler (for
> example, the return value from stat_sha1_file() used to be invalidated
> by basically any other object call, but now it is affected only by other
> calls to stat_sha1_file()).
>
> We do steal the trick from alt_sha1_path() of returning a pointer to the
> filled buffer, which makes a few conversions more convenient.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>   object-store.h | 14 +-------------
>   object.c       |  1 -
>   sha1-file.c    | 44 ++++++++++++++++----------------------------
>   sha1-name.c    |  8 ++++++--
>   4 files changed, 23 insertions(+), 44 deletions(-)
>
> diff --git a/object-store.h b/object-store.h
> index fefa17e380..b2fa0d0df0 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -10,10 +10,6 @@
>   struct object_directory {
>   	struct object_directory *next;
>   
> -	/* see alt_scratch_buf() */
> -	struct strbuf scratch;
> -	size_t base_len;
> -
>   	/*
>   	 * Used to store the results of readdir(3) calls when searching
>   	 * for unique abbreviated hashes.  This cache is never
> @@ -54,14 +50,6 @@ void add_to_alternates_file(const char *dir);
>    */
>   void add_to_alternates_memory(const char *dir);
>   
> -/*
> - * Returns a scratch strbuf pre-filled with the alternate object directory,
> - * including a trailing slash, which can be used to access paths in the
> - * alternate. Always use this over direct access to alt->scratch, as it
> - * cleans up any previous use of the scratch buffer.
> - */
> -struct strbuf *alt_scratch_buf(struct object_directory *odb);
> -
>   struct packed_git {
>   	struct packed_git *next;
>   	struct list_head mru;
> @@ -157,7 +145,7 @@ void raw_object_store_clear(struct raw_object_store *o);
>    * Put in `buf` the name of the file in the local object database that
>    * would be used to store a loose object with the specified sha1.
>    */
> -void loose_object_path(struct repository *r, struct strbuf *buf, const unsigned char *sha1);
> +const char *loose_object_path(struct repository *r, struct strbuf *buf, const unsigned char *sha1);
>   
>   void *map_sha1_file(struct repository *r, const unsigned char *sha1, unsigned long *size);
>   
> diff --git a/object.c b/object.c
> index 6af8e908bb..dd485ac629 100644
> --- a/object.c
> +++ b/object.c
> @@ -484,7 +484,6 @@ struct raw_object_store *raw_object_store_new(void)
>   
>   static void free_alt_odb(struct object_directory *odb)
>   {
> -	strbuf_release(&odb->scratch);
>   	oid_array_clear(&odb->loose_objects_cache);
>   	free(odb);
>   }
> diff --git a/sha1-file.c b/sha1-file.c
> index 478eac326b..15db6b61a9 100644
> --- a/sha1-file.c
> +++ b/sha1-file.c
> @@ -346,27 +346,20 @@ static void fill_sha1_path(struct strbuf *buf, const unsigned char *sha1)
>   	}
>   }
>   
> -void loose_object_path(struct repository *r, struct strbuf *buf,
> -		       const unsigned char *sha1)
> +static const char *odb_loose_path(const char *path, struct strbuf *buf,
> +				  const unsigned char *sha1)
>   {
>   	strbuf_reset(buf);
> -	strbuf_addstr(buf, r->objects->objectdir);
> +	strbuf_addstr(buf, path);
>   	strbuf_addch(buf, '/');
>   	fill_sha1_path(buf, sha1);
> +	return buf->buf;
>   }
>   
> -struct strbuf *alt_scratch_buf(struct object_directory *odb)
> +const char *loose_object_path(struct repository *r, struct strbuf *buf,
> +			      const unsigned char *sha1)
>   {
> -	strbuf_setlen(&odb->scratch, odb->base_len);
> -	return &odb->scratch;
> -}
> -
> -static const char *alt_sha1_path(struct object_directory *odb,
> -				 const unsigned char *sha1)
> -{
> -	struct strbuf *buf = alt_scratch_buf(odb);
> -	fill_sha1_path(buf, sha1);
> -	return buf->buf;
> +	return odb_loose_path(r->objects->objectdir, buf, sha1);
>   }
>   
>   /*
> @@ -547,9 +540,6 @@ struct object_directory *alloc_alt_odb(const char *dir)
>   	struct object_directory *ent;
>   
>   	FLEX_ALLOC_STR(ent, path, dir);
> -	strbuf_init(&ent->scratch, 0);
> -	strbuf_addf(&ent->scratch, "%s/", dir);
> -	ent->base_len = ent->scratch.len;
>   
>   	return ent;
>   }
> @@ -745,10 +735,12 @@ static int check_and_freshen_local(const struct object_id *oid, int freshen)
>   static int check_and_freshen_nonlocal(const struct object_id *oid, int freshen)
>   {
>   	struct object_directory *odb;
> +	static struct strbuf path = STRBUF_INIT;
> +
>   	prepare_alt_odb(the_repository);
>   	for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) {
> -		const char *path = alt_sha1_path(odb, oid->hash);
> -		if (check_and_freshen_file(path, freshen))
> +		odb_loose_path(odb->path, &path, oid->hash);
> +		if (check_and_freshen_file(path.buf, freshen))
>   			return 1;
>   	}
>   	return 0;
> @@ -889,7 +881,7 @@ int git_open_cloexec(const char *name, int flags)
>    *
>    * The "path" out-parameter will give the path of the object we found (if any).
>    * Note that it may point to static storage and is only valid until another
> - * call to loose_object_path(), etc.
> + * call to stat_sha1_file().
>    */
>   static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
>   			  struct stat *st, const char **path)
> @@ -897,16 +889,14 @@ static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
>   	struct object_directory *odb;
>   	static struct strbuf buf = STRBUF_INIT;
>   
> -	loose_object_path(r, &buf, sha1);
> -	*path = buf.buf;
> -
> +	*path = loose_object_path(r, &buf, sha1);
>   	if (!lstat(*path, st))
>   		return 0;
>   
>   	prepare_alt_odb(r);
>   	errno = ENOENT;
>   	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
> -		*path = alt_sha1_path(odb, sha1);
> +		*path = odb_loose_path(odb->path, &buf, sha1);
>   		if (!lstat(*path, st))
>   			return 0;
>   	}
> @@ -926,9 +916,7 @@ static int open_sha1_file(struct repository *r,
>   	int most_interesting_errno;
>   	static struct strbuf buf = STRBUF_INIT;
>   
> -	loose_object_path(r, &buf, sha1);
> -	*path = buf.buf;
> -
> +	*path = loose_object_path(r, &buf, sha1);
>   	fd = git_open(*path);
>   	if (fd >= 0)
>   		return fd;
> @@ -936,7 +924,7 @@ static int open_sha1_file(struct repository *r,
>   
>   	prepare_alt_odb(r);
>   	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
> -		*path = alt_sha1_path(odb, sha1);
> +		*path = odb_loose_path(odb->path, &buf, sha1);
>   		fd = git_open(*path);
>   		if (fd >= 0)
>   			return fd;
> diff --git a/sha1-name.c b/sha1-name.c
> index 2594aa79f8..96a8e71482 100644
> --- a/sha1-name.c
> +++ b/sha1-name.c
> @@ -97,6 +97,7 @@ static void find_short_object_filename(struct disambiguate_state *ds)
>   	int subdir_nr = ds->bin_pfx.hash[0];
>   	struct object_directory *odb;
>   	static struct object_directory *fakeent;
> +	struct strbuf buf = STRBUF_INIT;
>   
>   	if (!fakeent) {
>   		/*
> @@ -114,8 +115,9 @@ static void find_short_object_filename(struct disambiguate_state *ds)
>   		int pos;
>   
>   		if (!odb->loose_objects_subdir_seen[subdir_nr]) {
> -			struct strbuf *buf = alt_scratch_buf(odb);
> -			for_each_file_in_obj_subdir(subdir_nr, buf,
> +			strbuf_reset(&buf);
> +			strbuf_addstr(&buf, odb->path);
> +			for_each_file_in_obj_subdir(subdir_nr, &buf,
>   						    append_loose_object,
>   						    NULL, NULL,
>   						    &odb->loose_objects_cache);
> @@ -134,6 +136,8 @@ static void find_short_object_filename(struct disambiguate_state *ds)
>   			pos++;
>   		}
>   	}
> +
> +	strbuf_release(&buf);
>   }
>   
>   static int match_sha(unsigned len, const unsigned char *a, const unsigned char *b)


* Re: [PATCH 5/9] handle alternates paths the same as the main object dir
  2018-11-12 15:38                         ` Derrick Stolee
@ 2018-11-12 15:46                           ` Jeff King
  2018-11-12 15:50                             ` Derrick Stolee
  0 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-11-12 15:46 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Geert Jansen, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, git, René Scharfe, Takuto Ikuta

On Mon, Nov 12, 2018 at 10:38:28AM -0500, Derrick Stolee wrote:

> > We could go either direction here, but this patch moves the alternates
> > struct over to the main directory style (rather than vice-versa).
> > Technically the alternates style is more efficient, as it avoids
> > rewriting the object directory name on each call. But this is unlikely
> > to matter in practice, as we avoid reallocations either way (and nobody
> > has ever noticed or complained that the main object directory is copying
> > a few extra bytes before making a much more expensive system call).
> 
> > Hm. I've complained in the past [1] about the cost of a simple method like
> > strbuf_addf() when iterating over loose objects, but that was during
> > abbreviation checks, where we were building the path string for every
> > loose object without actually reading the objects.
> 
> [1]
> https://public-inbox.org/git/20171201174956.143245-1-dstolee@microsoft.com/

I suspect that had more to do with the cost of snprintf() than the extra
bytes being copied. And here we'd still be using addstr and addch
exclusively. I'm open to numeric arguments to the contrary, though. :)

There's actually a lot of low-hanging fruit there for pre-sizing, too.
E.g., fill_sha1_path() calls strbuf_addch() in a loop, but it could
quite easily grow the 41 bytes it needs ahead of time. I wouldn't want
to change that without finding a measurable improvement, though. It
might not be a big deal due to fec501dae8 (strbuf_addch: avoid calling
strbuf_grow, 2015-04-16).
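
For illustration only, the pre-sizing idea could look roughly like the
sketch below. This is not git's strbuf: "struct gbuf", its growth policy,
and the "grows" counter are assumptions made up for the example, and the
20-byte hash length is hard-coded. It only shows reserving the 41 bytes
once before the per-character loop.

```c
#include <stdlib.h>

/* Hypothetical growable buffer; not git's strbuf. */
struct gbuf {
	char *buf;
	size_t len, alloc;
	unsigned grows;	/* number of reallocations, for illustration */
};

/* Make room for "extra" more bytes plus the trailing NUL. */
static void gbuf_grow(struct gbuf *b, size_t extra)
{
	size_t need = b->len + extra + 1;
	size_t want;

	if (need <= b->alloc)
		return;
	want = (b->alloc + 16) * 3 / 2;	/* assumed geometric growth */
	if (want < need)
		want = need;
	b->buf = realloc(b->buf, want);
	if (!b->buf)
		abort();
	b->alloc = want;
	b->grows++;
}

static void gbuf_addch(struct gbuf *b, char c)
{
	gbuf_grow(b, 1);
	b->buf[b->len++] = c;
	b->buf[b->len] = '\0';
}

/* Append "xx/xxxx..." for a 20-byte hash: 40 hex digits plus one slash. */
static void fill_hex_path(struct gbuf *b, const unsigned char *hash,
			  int presize)
{
	static const char hex[] = "0123456789abcdef";
	int i;

	if (presize)
		gbuf_grow(b, 41);	/* reserve everything up front */
	for (i = 0; i < 20; i++) {
		gbuf_addch(b, hex[hash[i] >> 4]);
		gbuf_addch(b, hex[hash[i] & 0xf]);
		if (!i)
			gbuf_addch(b, '/');
	}
}
```

In this toy, pre-sizing saves a reallocation only on a fresh buffer; once
the buffer is reused (as with the static buffers in the callers), neither
variant reallocates, which is consistent with not expecting a measurable
difference.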

-Peff

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 6/9] sha1-file: use an object_directory for the main object dir
  2018-11-12 14:50                       ` [PATCH 6/9] sha1-file: use an object_directory for " Jeff King
@ 2018-11-12 15:48                         ` Derrick Stolee
  2018-11-12 16:09                           ` Jeff King
  2018-11-12 18:48                           ` Stefan Beller
  0 siblings, 2 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-11-12 15:48 UTC (permalink / raw)
  To: Jeff King, Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

On 11/12/2018 9:50 AM, Jeff King wrote:
> Our handling of alternate object directories is needlessly different
> from the main object directory. As a result, many places in the code
> basically look like this:
>
>    do_something(r->objects->objdir);
>
>    for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
>          do_something(odb->path);
>
> That gets annoying when do_something() is non-trivial, and we've
> resorted to gross hacks like creating fake alternates (see
> find_short_object_filename()).
>
> Instead, let's give each raw_object_store a unified list of
> object_directory structs. The first will be the main store, and
> everything after is an alternate. Very few callers even care about the
> distinction, and can just loop over the whole list (and those who care
> can just treat the first element differently).
>
> A few observations:
>
>    - we don't need r->objects->objectdir anymore, and can just
>      mechanically convert that to r->objects->odb->path
>
>    - object_directory's path field needs to become a real pointer rather
>      than a FLEX_ARRAY, in order to fill it with expand_base_dir()
>
>    - we'll call prepare_alt_odb() earlier in many functions (i.e.,
>      outside of the loop). This may result in us calling it even when our
>      function would be satisfied looking only at the main odb.
>
>      But this doesn't matter in practice. It's not a very expensive
>      operation in the first place, and in the majority of cases it will
>      be a noop. We call it already (and cache its results) in
>      prepare_packed_git(), and we'll generally check packs before loose
>      objects. So essentially every program is going to call it
>      immediately once per program.
>
>      Arguably we should just prepare_alt_odb() immediately upon setting
>      up the repository's object directory, which would save us sprinkling
>      calls throughout the code base (and forgetting to do so has been a
>      source of subtle bugs in the past). But I've stopped short of that
>      here, since there are already a lot of other moving parts in this
>      patch.
>
>    - Most call sites just get shorter. The check_and_freshen() functions
>      are an exception, because they have entry points to handle local and
>      nonlocal directories separately.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> If the "the first one is the main store, the rest are alternates" bit is
> too subtle, we could mark each "struct object_directory" with a bit for
> "is_local".

This is probably a good thing to do proactively. We have the equivalent 
in the packed_git struct, but that's also because they get out of order. 
At the moment, I can't think of a read-only action that needs to treat 
the local object directory more carefully. The closest I know about is 
'git pack-objects --local', but that also writes a pack-file.

I assume that when we write a pack-file to the "default location" we use 
get_object_directory() instead of referring to the default object_directory?

>
>   builtin/fsck.c |  21 ++-------
>   builtin/grep.c |   2 +-
>   commit-graph.c |   5 +-
>   environment.c  |   4 +-
>   object-store.h |  27 ++++++-----
>   object.c       |  19 ++++----
>   packfile.c     |  10 ++--
>   path.c         |   2 +-
>   repository.c   |   8 +++-
>   sha1-file.c    | 122 ++++++++++++++++++-------------------------------
>   sha1-name.c    |  17 ++-----
>   11 files changed, 90 insertions(+), 147 deletions(-)
>
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index 55153cf92a..15338bd178 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -725,13 +725,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>   		for_each_loose_object(mark_loose_for_connectivity, NULL, 0);
>   		for_each_packed_object(mark_packed_for_connectivity, NULL, 0);
>   	} else {
> -		struct object_directory *alt_odb_list;
> -
> -		fsck_object_dir(get_object_directory());
> -
>   		prepare_alt_odb(the_repository);
> -		alt_odb_list = the_repository->objects->alt_odb_list;
> -		for (odb = alt_odb_list; odb; odb = odb->next)
> +		for (odb = the_repository->objects->odb; odb; odb = odb->next)
>   			fsck_object_dir(odb->path);
>   
>   		if (check_full) {
> @@ -834,13 +829,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>   		struct child_process commit_graph_verify = CHILD_PROCESS_INIT;
>   		const char *verify_argv[] = { "commit-graph", "verify", NULL, NULL, NULL };
>   
> -		commit_graph_verify.argv = verify_argv;
> -		commit_graph_verify.git_cmd = 1;
> -		if (run_command(&commit_graph_verify))
> -			errors_found |= ERROR_COMMIT_GRAPH;
> -
>   		prepare_alt_odb(the_repository);
> -		for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) {
> +		for (odb = the_repository->objects->odb; odb; odb = odb->next) {
>   			child_process_init(&commit_graph_verify);
>   			commit_graph_verify.argv = verify_argv;
>   			commit_graph_verify.git_cmd = 1;
> @@ -855,13 +845,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>   		struct child_process midx_verify = CHILD_PROCESS_INIT;
>   		const char *midx_argv[] = { "multi-pack-index", "verify", NULL, NULL, NULL };
>   
> -		midx_verify.argv = midx_argv;
> -		midx_verify.git_cmd = 1;
> -		if (run_command(&midx_verify))
> -			errors_found |= ERROR_COMMIT_GRAPH;
> -
>   		prepare_alt_odb(the_repository);
> -		for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) {
> +		for (odb = the_repository->objects->odb; odb; odb = odb->next) {
>   			child_process_init(&midx_verify);
>   			midx_verify.argv = midx_argv;
>   			midx_verify.git_cmd = 1;
> diff --git a/builtin/grep.c b/builtin/grep.c
> index d8508ddf79..714c8d91ba 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -441,7 +441,7 @@ static int grep_submodule(struct grep_opt *opt, struct repository *superproject,
>   	 * object.
>   	 */
>   	grep_read_lock();
> -	add_to_alternates_memory(submodule.objects->objectdir);
> +	add_to_alternates_memory(submodule.objects->odb->path);
>   	grep_read_unlock();
>   
>   	if (oid) {
> diff --git a/commit-graph.c b/commit-graph.c
> index 5dd3f5b15c..99163c244b 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -231,7 +231,6 @@ static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
>   static int prepare_commit_graph(struct repository *r)
>   {
>   	struct object_directory *odb;
> -	char *obj_dir;
>   	int config_value;
>   
>   	if (r->objects->commit_graph_attempted)
> @@ -252,10 +251,8 @@ static int prepare_commit_graph(struct repository *r)
>   	if (!commit_graph_compatible(r))
>   		return 0;
>   
> -	obj_dir = r->objects->objectdir;
> -	prepare_commit_graph_one(r, obj_dir);
>   	prepare_alt_odb(r);
> -	for (odb = r->objects->alt_odb_list;
> +	for (odb = r->objects->odb;
>   	     !r->objects->commit_graph && odb;
>   	     odb = odb->next)
>   		prepare_commit_graph_one(r, odb->path);
> diff --git a/environment.c b/environment.c
> index 3f3c8746c2..441ce56690 100644
> --- a/environment.c
> +++ b/environment.c
> @@ -274,9 +274,9 @@ const char *get_git_work_tree(void)
>   
>   char *get_object_directory(void)
>   {
> -	if (!the_repository->objects->objectdir)
> +	if (!the_repository->objects->odb)
>   		BUG("git environment hasn't been setup");
> -	return the_repository->objects->objectdir;
> +	return the_repository->objects->odb->path;
>   }
>   
>   int odb_mkstemp(struct strbuf *temp_filename, const char *pattern)
> diff --git a/object-store.h b/object-store.h
> index b2fa0d0df0..30faf7b391 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -24,19 +24,14 @@ struct object_directory {
>   	 * Path to the alternative object store. If this is a relative path,
>   	 * it is relative to the current working directory.
>   	 */
> -	char path[FLEX_ARRAY];
> +	char *path;
>   };
> +
>   void prepare_alt_odb(struct repository *r);
>   char *compute_alternate_path(const char *path, struct strbuf *err);
>   typedef int alt_odb_fn(struct object_directory *, void *);
>   int foreach_alt_odb(alt_odb_fn, void*);
>   
> -/*
> - * Allocate a "struct alternate_object_database" but do _not_ actually
> - * add it to the list of alternates.
> - */
> -struct object_directory *alloc_alt_odb(const char *dir);
> -
>   /*
>    * Add the directory to the on-disk alternates file; the new entry will also
>    * take effect in the current process.
> @@ -80,17 +75,21 @@ struct multi_pack_index;
>   
>   struct raw_object_store {
>   	/*
> -	 * Path to the repository's object store.
> -	 * Cannot be NULL after initialization.
> +	 * Set of all object directories; the main directory is first (and
> +	 * cannot be NULL after initialization). Subsequent directories are
> +	 * alternates.
>   	 */
> -	char *objectdir;
> +	struct object_directory *odb;
> +	struct object_directory **odb_tail;
> +	int loaded_alternates;
>   
> -	/* Path to extra alternate object database if not NULL */
> +	/*
> +	 * A list of alternate object directories loaded from the environment;
> +	 * this should not generally need to be accessed directly, but will
> +	 * populate the "odb" list when prepare_alt_odb() is run.
> +	 */
>   	char *alternate_db;
>   
> -	struct object_directory *alt_odb_list;
> -	struct object_directory **alt_odb_tail;
> -
>   	/*
>   	 * Objects that should be substituted by other objects
>   	 * (see git-replace(1)).
> diff --git a/object.c b/object.c
> index dd485ac629..79d636091c 100644
> --- a/object.c
> +++ b/object.c
> @@ -482,26 +482,26 @@ struct raw_object_store *raw_object_store_new(void)
>   	return o;
>   }
>   
> -static void free_alt_odb(struct object_directory *odb)
> +static void free_object_directory(struct object_directory *odb)
>   {
> +	free(odb->path);
>   	oid_array_clear(&odb->loose_objects_cache);
>   	free(odb);
>   }
>   
> -static void free_alt_odbs(struct raw_object_store *o)
> +static void free_object_directories(struct raw_object_store *o)
>   {
> -	while (o->alt_odb_list) {
> +	while (o->odb) {
>   		struct object_directory *next;
>   
> -		next = o->alt_odb_list->next;
> -		free_alt_odb(o->alt_odb_list);
> -		o->alt_odb_list = next;
> +		next = o->odb->next;
> +		free_object_directory(o->odb);
> +		o->odb = next;
>   	}
>   }
>   
>   void raw_object_store_clear(struct raw_object_store *o)
>   {
> -	FREE_AND_NULL(o->objectdir);
>   	FREE_AND_NULL(o->alternate_db);
>   
>   	oidmap_free(o->replace_map, 1);
> @@ -511,8 +511,9 @@ void raw_object_store_clear(struct raw_object_store *o)
>   	o->commit_graph = NULL;
>   	o->commit_graph_attempted = 0;
>   
> -	free_alt_odbs(o);
> -	o->alt_odb_tail = NULL;
> +	free_object_directories(o);
> +	o->odb_tail = NULL;
> +	o->loaded_alternates = 0;
>   
>   	INIT_LIST_HEAD(&o->packed_git_mru);
>   	close_all_packs(o);
> diff --git a/packfile.c b/packfile.c
> index d6d511cfd2..1eda33247f 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -970,12 +970,12 @@ static void prepare_packed_git(struct repository *r)
>   
>   	if (r->objects->packed_git_initialized)
>   		return;
> -	prepare_multi_pack_index_one(r, r->objects->objectdir, 1);
> -	prepare_packed_git_one(r, r->objects->objectdir, 1);
> +
>   	prepare_alt_odb(r);
> -	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
> -		prepare_multi_pack_index_one(r, odb->path, 0);
> -		prepare_packed_git_one(r, odb->path, 0);
> +	for (odb = r->objects->odb; odb; odb = odb->next) {
> +		int local = (odb == r->objects->odb);

Here seems to be a place where `odb->is_local` would help.

> +		prepare_multi_pack_index_one(r, odb->path, local);
> +		prepare_packed_git_one(r, odb->path, local);
>   	}
>   	rearrange_packed_git(r);
>   

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 5/9] handle alternates paths the same as the main object dir
  2018-11-12 15:46                           ` Jeff King
@ 2018-11-12 15:50                             ` Derrick Stolee
  0 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-11-12 15:50 UTC (permalink / raw)
  To: Jeff King
  Cc: Geert Jansen, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, git, René Scharfe, Takuto Ikuta



On 11/12/2018 10:46 AM, Jeff King wrote:
> On Mon, Nov 12, 2018 at 10:38:28AM -0500, Derrick Stolee wrote:
>
>>> We could go either direction here, but this patch moves the alternates
>>> struct over to the main directory style (rather than vice-versa).
>>> Technically the alternates style is more efficient, as it avoids
>>> rewriting the object directory name on each call. But this is unlikely
>>> to matter in practice, as we avoid reallocations either way (and nobody
>>> has ever noticed or complained that the main object directory is copying
>>> a few extra bytes before making a much more expensive system call).
>> Hm. I've complained in the past [1] about a simple method like strbuf_addf()
>> over loose objects, but that was during abbreviation checks so we were
>> adding the string for every loose object but not actually reading the
>> objects.
>>
>> [1]
>> https://public-inbox.org/git/20171201174956.143245-1-dstolee@microsoft.com/
> I suspect that had more to do with the cost of snprintf() than the extra
> bytes being copied. And here we'd still be using addstr and addch
> exclusively. I'm open to numeric arguments to the contrary, though. :)

I agree. I don't think it is worth investigating now, as the performance 
difference should be moot. I am making a mental note to take a look here 
if I notice a performance regression later. ;)

> There's actually a lot of low-hanging fruit there for pre-sizing, too.
> E.g., fill_sha1_path() calls strbuf_addch() in a loop, but it could
> quite easily grow the 41 bytes it needs ahead of time. I wouldn't want
> to change that without finding a measurable improvement, though. It
> might not be a big deal due to fec501dae8 (strbuf_addch: avoid calling
> strbuf_grow, 2015-04-16).
>
> -Peff


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 8/9] sha1-file: use loose object cache for quick existence check
  2018-11-12 14:54                       ` [PATCH 8/9] sha1-file: use loose object cache for quick existence check Jeff King
@ 2018-11-12 16:00                         ` Derrick Stolee
  2018-11-12 16:01                         ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-11-12 16:00 UTC (permalink / raw)
  To: Jeff King, Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

On 11/12/2018 9:54 AM, Jeff King wrote:
> In cases where we expect to ask has_sha1_file() about a lot of objects
> that we are not likely to have (e.g., during fetch negotiation), we
> already use OBJECT_INFO_QUICK to sacrifice accuracy (due to racing with
> a simultaneous write or repack) for speed (we avoid re-scanning the pack
> directory).
>
> However, even checking for loose objects can be expensive, as we will
> stat() each one. On many systems this cost isn't too noticeable, but
> stat() can be particularly slow on some operating systems, or due to
> network filesystems.
>
> Since the QUICK flag already tells us that we're OK with a slightly
> stale answer, we can use that as a cue to look in our in-memory cache of
> each object directory. That basically trades an in-memory binary search
> for a stat() call.
>
> Note that it is possible for this to actually be _slower_. We'll do a
> full readdir() to fill the cache, so if you have a very large number of
> loose objects and a very small number of lookups, that readdir() may end
> up more expensive.
>
> This shouldn't be a big deal in practice. If you have a large number of
> reachable loose objects, you'll already run into performance problems
> (which you should remedy by repacking). You may have unreachable objects
> which wouldn't otherwise impact performance. Usually these would go away
> with the prune step of "git gc", but they may be held for up to 2 weeks
> in the default configuration.
>
> So it comes down to how many such objects you might reasonably expect to
> have, how much slower is readdir() on N entries versus M stat() calls
> (and here we really care about the syscall backing readdir(), like
> getdents() on Linux, but I'll just call this readdir() below).
>
> If N is much smaller than M (a typical packed repo), we know this is a
> big win (few readdirs() followed by many uses of the resulting cache).
> When N and M are similar in size, it's also a win. We care about the
> latency of making a syscall, and readdir() should be giving us many
> values in a single call. How many?
>
> On Linux, running "strace -e getdents ls" shows a 32k buffer getting 512
> entries per call (which is 64 bytes per entry; the name itself is 38
> bytes, plus there are some other fields). So we can imagine that this is
> always a win as long as the number of loose objects in the repository is
> a factor of 500 less than the number of lookups you make. It's hard to
> auto-tune this because we don't generally know up front how many lookups
> we're going to do. But it's unlikely for this to perform significantly
> worse.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> There's some obvious hand-waving in the paragraphs above. I would love
> it if somebody with an NFS system could do some before/after timings
> with various numbers of loose objects, to get a sense of where the
> breakeven point is.
>
> My gut is that we do not need the complexity of a cache-size limit, nor
> of a config option to disable this. But it would be nice to have a real
> number where "reasonable" ends and "pathological" begins. :)

I'm interested in such numbers, but do not have the appropriate setup to 
test.

I think the tradeoffs you mention above are reasonable. There's also 
some chance that this isn't "extra" work but is just "earlier" work, as 
the abbreviation code would load these loose object directories.
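
As a rough way to reason about the breakeven point, the syscall counts can be modelled directly. This is only a sketch under a deliberately naive assumption: every syscall costs the same, and each directory-reading call returns a fixed number of entries (512 in the strace observation quoted above). The function names are made up for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of directory-reading syscalls needed to list n_loose entries
 * when each call returns up to entries_per_call of them.
 */
static size_t readdir_syscalls(size_t n_loose, size_t entries_per_call)
{
	assert(entries_per_call > 0);
	return (n_loose + entries_per_call - 1) / entries_per_call;
}

/*
 * Naive model: returns 1 if filling the cache takes fewer syscalls
 * than issuing one stat() per lookup, 0 otherwise. Real latencies
 * (especially over NFS) are not modelled here.
 */
static int cache_is_cheaper(size_t n_loose, size_t m_lookups,
			    size_t entries_per_call)
{
	return readdir_syscalls(n_loose, entries_per_call) < m_lookups;
}
```

For example, 1000 loose objects need 2 calls at 512 entries per call, so under this model the cache pays off after a handful of lookups. Real timings on NFS would still be needed to confirm where "pathological" begins.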

>
>   object-store.h |  1 +
>   sha1-file.c    | 20 ++++++++++++++++++++
>   2 files changed, 21 insertions(+)
>
> diff --git a/object-store.h b/object-store.h
> index bf1e0cb761..60758efad8 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -13,6 +13,7 @@ struct object_directory {
>   	/*
>   	 * Used to store the results of readdir(3) calls when we are OK
>   	 * sacrificing accuracy due to races for speed. That includes
> +	 * object existence with OBJECT_INFO_QUICK, as well as
>   	 * our search for unique abbreviated hashes. Don't use it for tasks
>   	 * requiring greater accuracy!
>   	 *
> diff --git a/sha1-file.c b/sha1-file.c
> index 4aae716a37..e53da0b701 100644
> --- a/sha1-file.c
> +++ b/sha1-file.c
> @@ -921,6 +921,24 @@ static int open_sha1_file(struct repository *r,
>   	return -1;
>   }
>   
> +static int quick_has_loose(struct repository *r,
> +			   const unsigned char *sha1)
> +{
> +	int subdir_nr = sha1[0];
> +	struct object_id oid;
> +	struct object_directory *odb;
> +
> +	hashcpy(oid.hash, sha1);
> +
> +	prepare_alt_odb(r);
> +	for (odb = r->objects->odb; odb; odb = odb->next) {
> +		odb_load_loose_cache(odb, subdir_nr);
> +		if (oid_array_lookup(&odb->loose_objects_cache, &oid) >= 0)
> +			return 1;
> +	}
> +	return 0;
> +}
> +
>   /*
>    * Map the loose object at "path" if it is not NULL, or the path found by
>    * searching for a loose object named "sha1".
> @@ -1171,6 +1189,8 @@ static int sha1_loose_object_info(struct repository *r,
>   	if (!oi->typep && !oi->type_name && !oi->sizep && !oi->contentp) {
>   		const char *path;
>   		struct stat st;
> +		if (!oi->disk_sizep && (flags & OBJECT_INFO_QUICK))
> +			return quick_has_loose(r, sha1) ? 0 : -1;
>   		if (stat_sha1_file(r, sha1, &st, &path) < 0)
>   			return -1;
>   		if (oi->disk_sizep)

LGTM.
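
For readers who want to see the "binary search instead of stat()" trade in isolation, here is a standalone sketch. It is not git's oid_array implementation: `struct hash_cache` and its helpers are hypothetical, and the cache is assumed to have been filled already (e.g. from a directory listing) with 20-byte hashes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HASH_LEN 20

/* Hypothetical in-memory cache: a flat array of fixed-size hashes. */
struct hash_cache {
	unsigned char (*hashes)[HASH_LEN];
	size_t nr;
};

/* Byte-wise ordering of two hashes, usable by qsort() and bsearch(). */
static int hash_cmp(const void *a, const void *b)
{
	return memcmp(a, b, HASH_LEN);
}

/* Sort once after filling the cache. */
static void hash_cache_sort(struct hash_cache *c)
{
	assert(c);
	if (c->nr)
		qsort(c->hashes, c->nr, HASH_LEN, hash_cmp);
}

/*
 * Returns 1 if "hash" is in the (sorted) cache, 0 otherwise.
 * No syscall is made, so the answer may be stale if another
 * process added or removed objects after the cache was filled.
 */
static int hash_cache_has(const struct hash_cache *c,
			  const unsigned char *hash)
{
	assert(c && hash);
	if (!c->nr)
		return 0;
	return bsearch(hash, c->hashes, c->nr, HASH_LEN, hash_cmp) != NULL;
}
```

The staleness noted in the comment is the same accuracy-for-speed trade that OBJECT_INFO_QUICK already accepts.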

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 8/9] sha1-file: use loose object cache for quick existence check
  2018-11-12 14:54                       ` [PATCH 8/9] sha1-file: use loose object cache for quick existence check Jeff King
  2018-11-12 16:00                         ` Derrick Stolee
@ 2018-11-12 16:01                         ` Ævar Arnfjörð Bjarmason
  2018-11-12 16:21                           ` Jeff King
  1 sibling, 1 reply; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-12 16:01 UTC (permalink / raw)
  To: Jeff King
  Cc: Geert Jansen, Junio C Hamano, git, René Scharfe, Takuto Ikuta


On Mon, Nov 12 2018, Jeff King wrote:

> In cases where we expect to ask has_sha1_file() about a lot of objects
> that we are not likely to have (e.g., during fetch negotiation), we
> already use OBJECT_INFO_QUICK to sacrifice accuracy (due to racing with
> a simultaneous write or repack) for speed (we avoid re-scanning the pack
> directory).
>
> However, even checking for loose objects can be expensive, as we will
> stat() each one. On many systems this cost isn't too noticeable, but
> stat() can be particularly slow on some operating systems, or due to
> network filesystems.
>
> Since the QUICK flag already tells us that we're OK with a slightly
> stale answer, we can use that as a cue to look in our in-memory cache of
> each object directory. That basically trades an in-memory binary search
> for a stat() call.
>
> Note that it is possible for this to actually be _slower_. We'll do a
> full readdir() to fill the cache, so if you have a very large number of
> loose objects and a very small number of lookups, that readdir() may end
> up more expensive.
>
> This shouldn't be a big deal in practice. If you have a large number of
> reachable loose objects, you'll already run into performance problems
> (which you should remedy by repacking). You may have unreachable objects
> which wouldn't otherwise impact performance. Usually these would go away
> with the prune step of "git gc", but they may be held for up to 2 weeks
> in the default configuration.
>
> So it comes down to how many such objects you might reasonably expect to
> have, how much slower is readdir() on N entries versus M stat() calls
> (and here we really care about the syscall backing readdir(), like
> getdents() on Linux, but I'll just call this readdir() below).
>
> If N is much smaller than M (a typical packed repo), we know this is a
> big win (few readdirs() followed by many uses of the resulting cache).
> When N and M are similar in size, it's also a win. We care about the
> latency of making a syscall, and readdir() should be giving us many
> values in a single call. How many?
>
> On Linux, running "strace -e getdents ls" shows a 32k buffer getting 512
> entries per call (which is 64 bytes per entry; the name itself is 38
> bytes, plus there are some other fields). So we can imagine that this is
> always a win as long as the number of loose objects in the repository is
> a factor of 500 less than the number of lookups you make. It's hard to
> auto-tune this because we don't generally know up front how many lookups
> we're going to do. But it's unlikely for this to perform significantly
> worse.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> There's some obvious hand-waving in the paragraphs above. I would love
> it if somebody with an NFS system could do some before/after timings
> with various numbers of loose objects, to get a sense of where the
> breakeven point is.
>
> My gut is that we do not need the complexity of a cache-size limit, nor
> of a config option to disable this. But it would be nice to have a real
> number where "reasonable" ends and "pathological" begins. :)

I'm happy to test this on some of the NFS we have locally, and started
out with a plan to write some for-loop using the low-level API (so it
would look up all 256), fake populate .git/objects/?? with N number of
objects etc, but ran out of time.

Do you have something ready that you think would be representative and I
could just run? If not I'll try to pick this up again...

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 0/9] caching loose objects
  2018-11-12 14:46                     ` [PATCH 0/9] caching loose objects Jeff King
                                         ` (8 preceding siblings ...)
  2018-11-12 14:55                       ` [PATCH 9/9] fetch-pack: drop custom loose object cache Jeff King
@ 2018-11-12 16:02                       ` Derrick Stolee
  2018-11-12 19:10                         ` Stefan Beller
  9 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee @ 2018-11-12 16:02 UTC (permalink / raw)
  To: Jeff King, Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

On 11/12/2018 9:46 AM, Jeff King wrote:
> Here's the series I mentioned earlier in the thread to cache loose
> objects when answering has_object_file(..., OBJECT_INFO_QUICK). For
> those just joining us, this makes operations that look up a lot of
> missing objects (like "index-pack" looking for collisions) faster. This
> is mostly targeted at systems where stat() is slow, like over NFS, but
> it seems to give a 2% speedup indexing a full git.git packfile into an
> empty repository (i.e., what you'd see on a clone).
>
> I'm adding René Scharfe and Takuto Ikuta to the cc for their previous
> work in loose-object caching.
>
> The interesting bit is patch 8. The rest of it is cleanup to let us
> treat alternates and the main object directory similarly.

This cleanup is actually really valuable, and affects much more than 
this application.

I really think it is a good idea, and hope it doesn't cause too much 
trouble as the topic is cooking.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 6/9] sha1-file: use an object_directory for the main object dir
  2018-11-12 15:48                         ` Derrick Stolee
@ 2018-11-12 16:09                           ` Jeff King
  2018-11-12 19:04                             ` Stefan Beller
  2018-11-12 18:48                           ` Stefan Beller
  1 sibling, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-11-12 16:09 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Geert Jansen, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, git, René Scharfe, Takuto Ikuta

On Mon, Nov 12, 2018 at 10:48:36AM -0500, Derrick Stolee wrote:

> > If the "the first one is the main store, the rest are alternates" bit is
> > too subtle, we could mark each "struct object_directory" with a bit for
> > "is_local".
> 
> This is probably a good thing to do proactively. We have the equivalent in
> the packed_git struct, but that's also because they get out of order. At the
> moment, I can't think of a read-only action that needs to treat the local
> object directory more carefully. The closest I know about is 'git
> pack-objects --local', but that also writes a pack-file.
> 
> I assume that when we write a pack-file to the "default location" we use
> get_object_directory() instead of referring to the default object_directory?

Generally, yes, though that should eventually be going away in favor of
accessing it via a "struct repository". And after my series,
get_object_directory() is just returning the_repository->objects->odb->path
(i.e., using the "first one is main" rule).

One thing that makes me nervous about a "local" flag in each struct is
that it implies that it's the source of truth for where to write to. So
what does get_object_directory() look like after that? Do we leave it
with the "first one is main" rule? Or does it become:

  for (odb = the_repository->objects->odb; odb; odb = odb->next) {
	if (odb->local)
		return odb->path;
  }
  return NULL; /* yikes? */

? That feels like it's making things more complicated, not less.

> > diff --git a/packfile.c b/packfile.c
> > index d6d511cfd2..1eda33247f 100644
> > --- a/packfile.c
> > +++ b/packfile.c
> > @@ -970,12 +970,12 @@ static void prepare_packed_git(struct repository *r)
> >   	if (r->objects->packed_git_initialized)
> >   		return;
> > -	prepare_multi_pack_index_one(r, r->objects->objectdir, 1);
> > -	prepare_packed_git_one(r, r->objects->objectdir, 1);
> > +
> >   	prepare_alt_odb(r);
> > -	for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
> > -		prepare_multi_pack_index_one(r, odb->path, 0);
> > -		prepare_packed_git_one(r, odb->path, 0);
> > +	for (odb = r->objects->odb; odb; odb = odb->next) {
> > +		int local = (odb == r->objects->odb);
> 
> Here seems to be a place where `odb->is_local` would help.

Yes, though I don't mind this spot in particular, as the check is pretty
straight-forward.

I think an example that would benefit more is the check_and_freshen()
stuff. There we have two almost-the-same wrappers, one of which operates
on just the first element of the list, and the other of which operates
on all of the elements after the first.

It could become:

  static int check_and_freshen_odb(struct object_directory *odb_list,
				   const struct object_id *oid,
				   int freshen,
				   int local)
  {
	struct object_directory *odb;

	for (odb = odb_list; odb; odb = odb->next) {
		static struct strbuf path = STRBUF_INIT;

		if (odb->local != local)
			continue;

		/* keep looking in later directories if this one misses */
		odb_loose_path(odb, &path, oid->hash);
		if (check_and_freshen_file(path.buf, freshen))
			return 1;
	}
	return 0;
  }

  int check_and_freshen_local(const struct object_id *oid, int freshen)
  {
	return check_and_freshen_odb(the_repository->objects->odb, oid,
				     freshen, 1);
  }

  int check_and_freshen_nonlocal(const struct object_id *oid, int freshen)
  {
	return check_and_freshen_odb(the_repository->objects->odb, oid,
				     freshen, 0);
  }

I'm not sure that is a big improvement over the patch we're replying to,
though.

-Peff

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 8/9] sha1-file: use loose object cache for quick existence check
  2018-11-12 16:01                         ` Ævar Arnfjörð Bjarmason
@ 2018-11-12 16:21                           ` Jeff King
  2018-11-12 22:18                             ` Ævar Arnfjörð Bjarmason
  2018-11-12 22:44                             ` Geert Jansen
  0 siblings, 2 replies; 87+ messages in thread
From: Jeff King @ 2018-11-12 16:21 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Geert Jansen, Junio C Hamano, git, René Scharfe, Takuto Ikuta

On Mon, Nov 12, 2018 at 05:01:02PM +0100, Ævar Arnfjörð Bjarmason wrote:

> > There's some obvious hand-waving in the paragraphs above. I would love
> > it if somebody with an NFS system could do some before/after timings
> > with various numbers of loose objects, to get a sense of where the
> > breakeven point is.
> >
> > My gut is that we do not need the complexity of a cache-size limit, nor
> > of a config option to disable this. But it would be nice to have a real
> > number where "reasonable" ends and "pathological" begins. :)
> 
> I'm happy to test this on some of the NFS we have locally, and started
> out with a plan to write some for-loop using the low-level API (so it
> would look up all 256), fake populate .git/objects/?? with N number of
> objects etc, but ran out of time.
> 
> Do you have something ready that you think would be representative and I
> could just run? If not I'll try to pick this up again...

No, but they don't even really need to be actual objects. So I suspect
something like:

  git init
  for i in $(seq 0 255); do
    i=$(printf %02x $i)
    mkdir -p .git/objects/$i
    for j in $(seq --format=%038g 1000); do
      echo foo >.git/objects/$i/$j
    done
  done
  git index-pack -v --stdin </path/to/git.git/objects/pack/XYZ.pack

might work (for various values of 1000). The shell loop would probably
be faster as perl, too. :)

Make sure you clear the object directory between runs, though (otherwise
the subsequent index-pack's really do find collisions and spend time
accessing the objects).

If you want real objects, you could probably just dump a bunch of
sequential blobs to fast-import, and then pipe the result to
unpack-objects.

-Peff

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 2/9] submodule--helper: prefer strip_suffix() to ends_with()
  2018-11-12 14:47                       ` [PATCH 2/9] submodule--helper: prefer strip_suffix() to ends_with() Jeff King
@ 2018-11-12 18:23                         ` Stefan Beller
  0 siblings, 0 replies; 87+ messages in thread
From: Stefan Beller @ 2018-11-12 18:23 UTC (permalink / raw)
  To: Jeff King
  Cc: gerardu, Ævar Arnfjörð Bjarmason, Junio C Hamano,
	git, René Scharfe, tikuta

On Mon, Nov 12, 2018 at 6:47 AM Jeff King <peff@peff.net> wrote:
>
> Using strip_suffix() lets us avoid repeating ourselves. It also makes
> the handling of "/" a bit less subtle (we strip one less character than
> we matched in order to leave it in place, but we can just as easily
> include the "/" when we add more path components).
>
> Signed-off-by: Jeff King <peff@peff.net>

This makes sense. Thanks!

(This patch caught my attention as it's a submodule thing,
but now looking at the rest of the series)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 6/9] sha1-file: use an object_directory for the main object dir
  2018-11-12 15:48                         ` Derrick Stolee
  2018-11-12 16:09                           ` Jeff King
@ 2018-11-12 18:48                           ` Stefan Beller
  1 sibling, 0 replies; 87+ messages in thread
From: Stefan Beller @ 2018-11-12 18:48 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jeff King, gerardu, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, git, René Scharfe, tikuta

On Mon, Nov 12, 2018 at 7:48 AM Derrick Stolee <stolee@gmail.com> wrote:
>
[... lots of quoted text...]

Some email readers are very good at recognizing unchanged quoted
text and collapsing it, but that is not the case at
https://public-inbox.org/git/421d3b43-3425-72c9-218e-facd86e28267@gmail.com/
which I use to read through this series. It would help if you'd cut most
of the (con)text that is not near your reply, as I read the context
email just before your reply.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 6/9] sha1-file: use an object_directory for the main object dir
  2018-11-12 16:09                           ` Jeff King
@ 2018-11-12 19:04                             ` Stefan Beller
  0 siblings, 0 replies; 87+ messages in thread
From: Stefan Beller @ 2018-11-12 19:04 UTC (permalink / raw)
  To: Jeff King
  Cc: Derrick Stolee, gerardu, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, git, René Scharfe, tikuta

On Mon, Nov 12, 2018 at 8:09 AM Jeff King <peff@peff.net> wrote:
>
> On Mon, Nov 12, 2018 at 10:48:36AM -0500, Derrick Stolee wrote:
>
> > > If the "the first one is the main store, the rest are alternates" bit is
> > > too subtle, we could mark each "struct object_directory" with a bit for
> > > "is_local".
> >
> > This is probably a good thing to do proactively. We have the equivalent in
> > the packed_git struct, but that's also because they get out of order. At the
> > moment, I can't think of a read-only action that needs to treat the local
> > object directory more carefully. The closest I know about is 'git
> > pack-objects --local', but that also writes a pack-file.
> >
> > I assume that when we write a pack-file to the "default location" we use
> > get_object_directory() instead of referring to the default object_directory?
>
> Generally, yes, though that should eventually be going away in favor of
> accessing it via a "struct repository". And after my series,
> get_object_directory() is just returning the_repository->objects->odb->path
> (i.e., using the "first one is main" rule).
>
> One thing that makes me nervous about a "local" flag in each struct is
> that it implies that it's the source of truth for where to write to. So
> what does git_object_directory() look like after that? Do we leave it
> with the "first one is main" rule? Or does it become:

s/git/get/ ;-)  get_object_directory is very old and was introduced in
e1b10391ea (Use git config file for committer name and email info,
2005-10-11) by Linus.

I would argue that we might actually want to get rid of that function
now, as it doesn't seem to add value to the code (assuming the
BUG never triggers), and using a_repo->objects->objectdir,
or after this series a_repo->objects->odb->path, is just as short.

    $ git grep get_object_directory |wc -l
    30
    $ git grep -- "->objects->objectdir"  |wc -l
    10

Ah well, we're not there yet.

>   for (odb = the_repository->objects->odb; odb; odb = odb->next) {
>         if (odb->local)
>                 return odb->path;
>   }
>   return NULL; /* yikes? */
>
> ? That feels like it's making things more complicated, not less.

It depends if the caller cares about the local flag.

I'd think we can have more than one local, eventually?
Just think of the partial clone stuff that may have a local
set of promised stuff and another set of actual objects,
which may be stored in different local odbs.

If the caller cares about the distinction, they would need
to write out this loop as above themselves.
If they don't care, we could migrate them to not
use this function, so we can get rid of it?

> > > -   for (odb = r->objects->alt_odb_list; odb; odb = odb->next) {
> > > -           prepare_multi_pack_index_one(r, odb->path, 0);
> > > -           prepare_packed_git_one(r, odb->path, 0);
> > > +   for (odb = r->objects->odb; odb; odb = odb->next) {
> > > +           int local = (odb == r->objects->odb);
> >
> > Here seems to be a place where `odb->is_local` would help.
>
> Yes, though I don't mind this spot in particular, as the check is pretty
> straight-forward.
>
> I think an example that would benefit more is the check_and_freshen()
> stuff. There we have two almost-the-same wrappers, one of which operates
> on just the first element of the list, and the other of which operates
> on all of the elements after the first.
>
> It could become:
>
>   static int check_and_freshen_odb(struct object_directory *odb_list,
>                                    const struct object_id *oid,
>                                    int freshen,
>                                    int local)
>   {
>         struct object_directory *odb;
>
>         for (odb = odb_list; odb; odb = odb->next) {
>                 static struct strbuf path = STRBUF_INIT;
>
>                 if (odb->local != local)
>                         continue;
>
>                 odb_loose_path(odb, &path, oid->hash);
>                 return check_and_freshen_file(path.buf, freshen);
>         }
>   }
>
>   int check_and_freshen_local(const struct object_id *oid, int freshen)
>   {
>         return check_and_freshen_odb(the_repository->objects->odb, oid,
>                                      freshen, 1);
>   }
>
>   int check_and_freshen_nonlocal(const struct object_id *oid, int freshen)
>   {
>         return check_and_freshen_odb(the_repository->objects->odb, oid,
>                                      freshen, 0);
>   }
>

I am fine with (a maybe better documented) "first is local" rule, but
the code above looks intriguing, except that it is a little wasteful
(we need two full loops in check_and_freshen, but ideally we
can get by with just one loop).
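
For illustration, a single pass could look roughly like the sketch
below. It is self-contained toy code, not a patch: "struct toy_odb",
its "has_object" field (standing in for check_and_freshen_file()
succeeding) and toy_check_all() are made-up names, and the real code
would build the loose object path for each odb instead.

```c
#include <stddef.h>

/* Toy stand-in for struct object_directory; all names are hypothetical. */
struct toy_odb {
	struct toy_odb *next;
	int local;       /* 1 for the repository's own object directory */
	int has_object;  /* pretend result of check_and_freshen_file() */
};

/*
 * Walk the list once. Returns 1 if any odb has the object and, if
 * found_local is non-NULL, stores whether the hit was in a local odb.
 * Returns 0 if no odb has the object.
 */
static int toy_check_all(const struct toy_odb *list, int *found_local)
{
	const struct toy_odb *odb;

	for (odb = list; odb; odb = odb->next) {
		if (!odb->has_object)
			continue; /* keep looking instead of returning early */
		if (found_local)
			*found_local = odb->local;
		return 1;
	}
	return 0;
}
```

A caller that cares about the distinction can look at the out
parameter; one that does not can pass NULL.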

What does the local flag mean anyway in a world where we
have many odbs in a repository, that are not distinguishable
except by their order? AFAICT it is actually to be used for differentiating
how much we care in fsck/cat-file/packing, as it may be borrowed
from an alternate, so maybe the flag should rather be named
after ownership, and not so much after its locality?
(I think "borrowed" or "owned" or even just "important"
or "external" or "alternate" may work)

Stefan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 0/9] caching loose objects
  2018-11-12 16:02                       ` [PATCH 0/9] caching loose objects Derrick Stolee
@ 2018-11-12 19:10                         ` Stefan Beller
  0 siblings, 0 replies; 87+ messages in thread
From: Stefan Beller @ 2018-11-12 19:10 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jeff King, gerardu, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, git, René Scharfe, tikuta

On Mon, Nov 12, 2018 at 8:02 AM Derrick Stolee <stolee@gmail.com> wrote:

> This cleanup is actually really valuable, and affects much more than
> this application.

I second this. I'd value this series more for the cleanup than its
application. ;-)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 7/9] object-store: provide helpers for loose_objects_cache
  2018-11-12 14:50                       ` [PATCH 7/9] object-store: provide helpers for loose_objects_cache Jeff King
@ 2018-11-12 19:24                         ` René Scharfe
  2018-11-12 20:16                           ` Jeff King
  0 siblings, 1 reply; 87+ messages in thread
From: René Scharfe @ 2018-11-12 19:24 UTC (permalink / raw)
  To: Jeff King, Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	Takuto Ikuta

Am 12.11.2018 um 15:50 schrieb Jeff King:
> --- a/sha1-file.c
> +++ b/sha1-file.c
> @@ -2125,6 +2125,32 @@ int for_each_loose_object(each_loose_object_fn cb, void *data,
>  	return 0;
>  }
>  
> +static int append_loose_object(const struct object_id *oid, const char *path,
> +			       void *data)
> +{
> +	oid_array_append(data, oid);
> +	return 0;
> +}
> +
> +void odb_load_loose_cache(struct object_directory *odb, int subdir_nr)
> +{
> +	struct strbuf buf = STRBUF_INIT;
> +
> +	if (subdir_nr < 0 ||

Why not make subdir_nr unsigned (like in for_each_file_in_obj_subdir()), and
get rid of this first check?

> +	    subdir_nr >= ARRAY_SIZE(odb->loose_objects_subdir_seen))

Using unsigned char for subdir_nr would allow removing the second check as
well, but might hide invalid values in implicit conversions, I guess.

> +		BUG("subdir_nr out of range");

Showing the invalid value (like in for_each_file_in_obj_subdir()) would make
debugging easier in case the impossible actually happens.

> +
> +	if (odb->loose_objects_subdir_seen[subdir_nr])
> +		return;
> +
> +	strbuf_addstr(&buf, odb->path);
> +	for_each_file_in_obj_subdir(subdir_nr, &buf,
> +				    append_loose_object,
> +				    NULL, NULL,
> +				    &odb->loose_objects_cache);
> +	odb->loose_objects_subdir_seen[subdir_nr] = 1;

About here would be the ideal new home for ...

> +}
> +
>  static int check_stream_sha1(git_zstream *stream,
>  			     const char *hdr,
>  			     unsigned long size,
> diff --git a/sha1-name.c b/sha1-name.c
> index 358ca5e288..b24502811b 100644
> --- a/sha1-name.c
> +++ b/sha1-name.c
> @@ -83,36 +83,19 @@ static void update_candidates(struct disambiguate_state *ds, const struct object
>  	/* otherwise, current can be discarded and candidate is still good */
>  }
>  
> -static int append_loose_object(const struct object_id *oid, const char *path,
> -			       void *data)
> -{
> -	oid_array_append(data, oid);
> -	return 0;
> -}
> -
>  static int match_sha(unsigned, const unsigned char *, const unsigned char *);
>  
>  static void find_short_object_filename(struct disambiguate_state *ds)
>  {
>  	int subdir_nr = ds->bin_pfx.hash[0];
>  	struct object_directory *odb;
> -	struct strbuf buf = STRBUF_INIT;
>  
>  	for (odb = the_repository->objects->odb;
>  	     odb && !ds->ambiguous;
>  	     odb = odb->next) {
>  		int pos;
>  
> -		if (!odb->loose_objects_subdir_seen[subdir_nr]) {
> -			strbuf_reset(&buf);
> -			strbuf_addstr(&buf, odb->path);
> -			for_each_file_in_obj_subdir(subdir_nr, &buf,
> -						    append_loose_object,
> -						    NULL, NULL,
> -						    &odb->loose_objects_cache);
> -			odb->loose_objects_subdir_seen[subdir_nr] = 1;
> -		}
> -
> +		odb_load_loose_cache(odb, subdir_nr);
>  		pos = oid_array_lookup(&odb->loose_objects_cache, &ds->bin_pfx);
>  		if (pos < 0)
>  			pos = -1 - pos;
> @@ -125,8 +108,6 @@ static void find_short_object_filename(struct disambiguate_state *ds)
>  			pos++;
>  		}
>  	}
> -
> -	strbuf_release(&buf);

... this line.

>  }
>  
>  static int match_sha(unsigned len, const unsigned char *a, const unsigned char *b)
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 9/9] fetch-pack: drop custom loose object cache
  2018-11-12 14:55                       ` [PATCH 9/9] fetch-pack: drop custom loose object cache Jeff King
@ 2018-11-12 19:25                         ` René Scharfe
  2018-11-12 19:32                           ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 87+ messages in thread
From: René Scharfe @ 2018-11-12 19:25 UTC (permalink / raw)
  To: Jeff King, Geert Jansen
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	Takuto Ikuta

Am 12.11.2018 um 15:55 schrieb Jeff King:
> Commit 024aa4696c (fetch-pack.c: use oidset to check existence of loose
> object, 2018-03-14) added a cache to avoid calling stat() for a bunch of
> loose objects we don't have.
> 
> Now that OBJECT_INFO_QUICK handles this caching itself, we can drop the
> custom solution.
> 
> Note that this might perform slightly differently, as the original code
> stopped calling readdir() when we saw more loose objects than there were
> refs. So:
> 
>   1. The old code might have spent work on readdir() to fill the cache,
>      but then decided there were too many loose objects, wasting that
>      effort.
> 
>   2. The new code might spend a lot of time on readdir() if you have a
>      lot of loose objects, even though there are very few objects to
>      ask about.

Plus the old code used an oidset while the new one uses an oid_array.

> In practice it probably won't matter either way; see the previous commit
> for some discussion of the tradeoff.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  fetch-pack.c | 39 ++-------------------------------------
>  1 file changed, 2 insertions(+), 37 deletions(-)
> 
> diff --git a/fetch-pack.c b/fetch-pack.c
> index b3ed7121bc..25a88f4eb2 100644
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -636,23 +636,6 @@ struct loose_object_iter {
>  	struct ref *refs;
>  };
>  
> -/*
> - *  If the number of refs is not larger than the number of loose objects,
> - *  this function stops inserting.
> - */
> -static int add_loose_objects_to_set(const struct object_id *oid,
> -				    const char *path,
> -				    void *data)
> -{
> -	struct loose_object_iter *iter = data;
> -	oidset_insert(iter->loose_object_set, oid);
> -	if (iter->refs == NULL)
> -		return 1;
> -
> -	iter->refs = iter->refs->next;
> -	return 0;
> -}
> -
>  /*
>   * Mark recent commits available locally and reachable from a local ref as
>   * COMPLETE. If args->no_dependents is false, also mark COMPLETE remote refs as
> @@ -670,30 +653,14 @@ static void mark_complete_and_common_ref(struct fetch_negotiator *negotiator,
>  	struct ref *ref;
>  	int old_save_commit_buffer = save_commit_buffer;
>  	timestamp_t cutoff = 0;
> -	struct oidset loose_oid_set = OIDSET_INIT;
> -	int use_oidset = 0;
> -	struct loose_object_iter iter = {&loose_oid_set, *refs};
> -
> -	/* Enumerate all loose objects or know refs are not so many. */
> -	use_oidset = !for_each_loose_object(add_loose_objects_to_set,
> -					    &iter, 0);
>  
>  	save_commit_buffer = 0;
>  
>  	for (ref = *refs; ref; ref = ref->next) {
>  		struct object *o;
> -		unsigned int flags = OBJECT_INFO_QUICK;
>  
> -		if (use_oidset &&
> -		    !oidset_contains(&loose_oid_set, &ref->old_oid)) {
> -			/*
> -			 * I know this does not exist in the loose form,
> -			 * so check if it exists in a non-loose form.
> -			 */
> -			flags |= OBJECT_INFO_IGNORE_LOOSE;

This removes the only user of OBJECT_INFO_IGNORE_LOOSE.  #leftoverbits

> -		}
> -
> -		if (!has_object_file_with_flags(&ref->old_oid, flags))
> +		if (!has_object_file_with_flags(&ref->old_oid,
> +						OBJECT_INFO_QUICK))
>  			continue;
>  		o = parse_object(the_repository, &ref->old_oid);
>  		if (!o)
> @@ -710,8 +677,6 @@ static void mark_complete_and_common_ref(struct fetch_negotiator *negotiator,
>  		}
>  	}
>  
> -	oidset_clear(&loose_oid_set);
> -
>  	if (!args->deepen) {
>  		for_each_ref(mark_complete_oid, NULL);
>  		for_each_cached_alternate(NULL, mark_alternate_complete);
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 9/9] fetch-pack: drop custom loose object cache
  2018-11-12 19:25                         ` René Scharfe
@ 2018-11-12 19:32                           ` Ævar Arnfjörð Bjarmason
  2018-11-12 20:07                             ` Jeff King
  2018-11-12 20:13                             ` René Scharfe
  0 siblings, 2 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-12 19:32 UTC (permalink / raw)
  To: René Scharfe
  Cc: Jeff King, Geert Jansen, Junio C Hamano, git, Takuto Ikuta


On Mon, Nov 12 2018, René Scharfe wrote:

> Am 12.11.2018 um 15:55 schrieb Jeff King:
>> Commit 024aa4696c (fetch-pack.c: use oidset to check existence of loose
>> object, 2018-03-14) added a cache to avoid calling stat() for a bunch of
>> loose objects we don't have.
>>
>> Now that OBJECT_INFO_QUICK handles this caching itself, we can drop the
>> custom solution.
>>
>> Note that this might perform slightly differently, as the original code
>> stopped calling readdir() when we saw more loose objects than there were
>> refs. So:
>>
>>   1. The old code might have spent work on readdir() to fill the cache,
>>      but then decided there were too many loose objects, wasting that
>>      effort.
>>
>>   2. The new code might spend a lot of time on readdir() if you have a
>>      lot of loose objects, even though there are very few objects to
>>      ask about.
>
> Plus the old code used an oidset while the new one uses an oid_array.
>
>> In practice it probably won't matter either way; see the previous commit
>> for some discussion of the tradeoff.
>>
>> Signed-off-by: Jeff King <peff@peff.net>
>> ---
>>  fetch-pack.c | 39 ++-------------------------------------
>>  1 file changed, 2 insertions(+), 37 deletions(-)
>>
>> diff --git a/fetch-pack.c b/fetch-pack.c
>> index b3ed7121bc..25a88f4eb2 100644
>> --- a/fetch-pack.c
>> +++ b/fetch-pack.c
>> @@ -636,23 +636,6 @@ struct loose_object_iter {
>>  	struct ref *refs;
>>  };
>>
>> -/*
>> - *  If the number of refs is not larger than the number of loose objects,
>> - *  this function stops inserting.
>> - */
>> -static int add_loose_objects_to_set(const struct object_id *oid,
>> -				    const char *path,
>> -				    void *data)
>> -{
>> -	struct loose_object_iter *iter = data;
>> -	oidset_insert(iter->loose_object_set, oid);
>> -	if (iter->refs == NULL)
>> -		return 1;
>> -
>> -	iter->refs = iter->refs->next;
>> -	return 0;
>> -}
>> -
>>  /*
>>   * Mark recent commits available locally and reachable from a local ref as
>>   * COMPLETE. If args->no_dependents is false, also mark COMPLETE remote refs as
>> @@ -670,30 +653,14 @@ static void mark_complete_and_common_ref(struct fetch_negotiator *negotiator,
>>  	struct ref *ref;
>>  	int old_save_commit_buffer = save_commit_buffer;
>>  	timestamp_t cutoff = 0;
>> -	struct oidset loose_oid_set = OIDSET_INIT;
>> -	int use_oidset = 0;
>> -	struct loose_object_iter iter = {&loose_oid_set, *refs};
>> -
>> -	/* Enumerate all loose objects or know refs are not so many. */
>> -	use_oidset = !for_each_loose_object(add_loose_objects_to_set,
>> -					    &iter, 0);
>>
>>  	save_commit_buffer = 0;
>>
>>  	for (ref = *refs; ref; ref = ref->next) {
>>  		struct object *o;
>> -		unsigned int flags = OBJECT_INFO_QUICK;
>>
>> -		if (use_oidset &&
>> -		    !oidset_contains(&loose_oid_set, &ref->old_oid)) {
>> -			/*
>> -			 * I know this does not exist in the loose form,
>> -			 * so check if it exists in a non-loose form.
>> -			 */
>> -			flags |= OBJECT_INFO_IGNORE_LOOSE;
>
> This removes the only user of OBJECT_INFO_IGNORE_LOOSE.  #leftoverbits

With this series applied there's still a use of it left in
oid_object_info_extended()

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 3/9] rename "alternate_object_database" to "object_directory"
  2018-11-12 15:36                           ` Jeff King
@ 2018-11-12 19:41                             ` Ramsay Jones
  0 siblings, 0 replies; 87+ messages in thread
From: Ramsay Jones @ 2018-11-12 19:41 UTC (permalink / raw)
  To: Jeff King, Derrick Stolee
  Cc: Geert Jansen, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, git, René Scharfe, Takuto Ikuta



On 12/11/2018 15:36, Jeff King wrote:
> On Mon, Nov 12, 2018 at 10:30:55AM -0500, Derrick Stolee wrote:
> 
>> On 11/12/2018 9:48 AM, Jeff King wrote:
>>> In preparation for unifying the handling of alt odb's and the normal
>>> repo object directory, let's use a more neutral name. This patch is
>>> purely mechanical, swapping the type name, and converting any variables
>>> named "alt" to "odb". There should be no functional change, but it will
>>> reduce the noise in subsequent diffs.
>>>
>>> Signed-off-by: Jeff King <peff@peff.net>
>>> ---
>>> I waffled on calling this object_database instead of object_directory.
>>> But really, it is very specifically about the directory (packed
>>> storage, including packs from alternates, is handled elsewhere).
>>
>> That makes sense. Each alternate makes its own object directory, but is part
>> of a larger object database. It also helps clarify a difference from the
>> object_store.
>>
>> My only complaint is that you have a lot of variable names with "odb" which
>> are now object_directory pointers. Perhaps "odb" -> "objdir"? Or is that
>> just too much change?
> 
> Yeah, that was part of my waffling. ;)
> 
> From my conversions, usually "objdir" is a string holding the pathname,
> though that's not set in stone. I also like that "odb" is the same short
> length as "alt", which helps with conversion.

While reading the patch, I keep thinking it should be 'obd' for
OBject Directory. ;-)

[Given my track record in naming things, please take with a _huge_
pinch of salt!]

ATB,
Ramsay Jones


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 9/9] fetch-pack: drop custom loose object cache
  2018-11-12 19:32                           ` Ævar Arnfjörð Bjarmason
@ 2018-11-12 20:07                             ` Jeff King
  2018-11-12 20:13                             ` René Scharfe
  1 sibling, 0 replies; 87+ messages in thread
From: Jeff King @ 2018-11-12 20:07 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: René Scharfe, Geert Jansen, Junio C Hamano, git, Takuto Ikuta

On Mon, Nov 12, 2018 at 08:32:43PM +0100, Ævar Arnfjörð Bjarmason wrote:

> >>  	for (ref = *refs; ref; ref = ref->next) {
> >>  		struct object *o;
> >> -		unsigned int flags = OBJECT_INFO_QUICK;
> >>
> >> -		if (use_oidset &&
> >> -		    !oidset_contains(&loose_oid_set, &ref->old_oid)) {
> >> -			/*
> >> -			 * I know this does not exist in the loose form,
> >> -			 * so check if it exists in a non-loose form.
> >> -			 */
> >> -			flags |= OBJECT_INFO_IGNORE_LOOSE;
> >
> > This removes the only user of OBJECT_INFO_IGNORE_LOOSE.  #leftoverbits
> 
> With this series applied there's still a use of it left in
> oid_object_info_extended()

That's just the code that does something with the flag. No callers pass
it in anymore, so we could drop the flag _and_ that code.

-Peff

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 9/9] fetch-pack: drop custom loose object cache
  2018-11-12 19:32                           ` Ævar Arnfjörð Bjarmason
  2018-11-12 20:07                             ` Jeff King
@ 2018-11-12 20:13                             ` René Scharfe
  1 sibling, 0 replies; 87+ messages in thread
From: René Scharfe @ 2018-11-12 20:13 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, Geert Jansen, Junio C Hamano, git, Takuto Ikuta

Am 12.11.2018 um 20:32 schrieb Ævar Arnfjörð Bjarmason:
> 
> On Mon, Nov 12 2018, René Scharfe wrote:
>> This removes the only user of OBJECT_INFO_IGNORE_LOOSE.  #leftoverbits
> 
> With this series applied there's still a use of it left in
> oid_object_info_extended()

OK, rephrasing: With that patch, OBJECT_INFO_IGNORE_LOOSE is never set
anymore, and its check in oid_object_info_extended() as well as its
definition can be removed.

René

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 7/9] object-store: provide helpers for loose_objects_cache
  2018-11-12 19:24                         ` René Scharfe
@ 2018-11-12 20:16                           ` Jeff King
  0 siblings, 0 replies; 87+ messages in thread
From: Jeff King @ 2018-11-12 20:16 UTC (permalink / raw)
  To: René Scharfe
  Cc: Geert Jansen, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, git, Takuto Ikuta

On Mon, Nov 12, 2018 at 08:24:59PM +0100, René Scharfe wrote:

> > +void odb_load_loose_cache(struct object_directory *odb, int subdir_nr)
> > +{
> > +	struct strbuf buf = STRBUF_INIT;
> > +
> > +	if (subdir_nr < 0 ||
> 
> Why not make subdir_nr unsigned (like in for_each_file_in_obj_subdir()), and
> get rid of this first check?

I stole the use of "int" from your code. ;)

More seriously, though, I wondered if callers might have sign issues
assigning from a "signed char". Usually we hold object ids in an
"unsigned char", but what happens if I do:

  signed char foo[] = { 1, 2, 3, 4 };
  odb_load_loose_cache(odb, foo[0]);

when the parameter is "unsigned"?

I'll admit I get lost in all of the integer promotion rules there, but
are we sure there's no way we can end up with a funky truncation?

If the answer is no, then I agree that your suggestion is a strict
improvement.
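
One way to poke at the question is a toy like the one below
(hypothetical names, not git code). As far as I understand the
conversion rules, a signed char argument is converted to the unsigned
parameter type by wrapping modulo UINT_MAX + 1, so a negative value
turns into a huge number rather than being folded into 0..255, and an
upper-bound check would still catch it:

```c
#include <limits.h>

/* Hypothetical stand-in for a helper taking an unsigned subdir number. */
static int toy_subdir_in_range(unsigned int subdir_nr)
{
	return subdir_nr < 256;
}

/* Returns the value a callee with an unsigned parameter would see. */
static unsigned int toy_as_unsigned(signed char c)
{
	return c; /* implicit conversion: -1 becomes UINT_MAX, 1 stays 1 */
}
```

So the small positive values in the example above pass through
unchanged, and only a negative signed char ends up out of range.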

> > +	    subdir_nr >= ARRAY_SIZE(odb->loose_objects_subdir_seen))
> 
> Using unsigned char for subdir_nr would allow removing the second check as
> well, but might hide invalid values in implicit conversions, I guess.

Yeah, I know that one could be a dangerous truncation.

I also considered just taking an object_id, which would make the
function "load the cache such that this oid would be valid". And it's
not necessarily the caller's business how much we load.

That's OK for OBJECT_INFO_QUICK, but it's pretty darn subtle for the
abbrev code. That code doesn't care about just one object, but wants all
objects that share its prefix. That works now because we know that the
prefix is always at least 2 hex chars, so it's OK to load just that
subset.
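
To make the subset argument concrete, here is a rough, self-contained
sketch (toy types and names, not the real oid_array API). The first
byte of the binary prefix picks the objects/XX subdirectory to load,
and a lookup in the sorted cache returns either the index of an exact
match or, encoded as -1 - pos, the position where entries sharing the
prefix would start:

```c
#include <string.h>

#define TOY_RAWSZ 20

struct toy_oid {
	unsigned char hash[TOY_RAWSZ];
};

/* The subdirectory to load is just the first byte of the (prefix) hash. */
static int toy_subdir_nr(const struct toy_oid *oid)
{
	return oid->hash[0];
}

/*
 * Binary search in an array sorted in memcmp() order. Returns the index
 * of an exact match, or -1 - pos where pos is the insertion point.
 */
static int toy_lookup(const struct toy_oid *sorted, int nr,
		      const struct toy_oid *key)
{
	int lo = 0, hi = nr;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = memcmp(key->hash, sorted[mid].hash, TOY_RAWSZ);

		if (!cmp)
			return mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -1 - lo;
}
```

With a zero-padded prefix as the key, a negative result can be turned
back into a starting index with "pos = -1 - pos", and the caller then
walks forward while the entries still match the prefix.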

> > +		BUG("subdir_nr out of range");
> 
> Showing the invalid value (like in for_each_file_in_obj_subdir()) would make
> debugging easier in case the impossible actually happens.

Good suggestion.

> > +	strbuf_addstr(&buf, odb->path);
> > +	for_each_file_in_obj_subdir(subdir_nr, &buf,
> > +				    append_loose_object,
> > +				    NULL, NULL,
> > +				    &odb->loose_objects_cache);
> > +	odb->loose_objects_subdir_seen[subdir_nr] = 1;
> 
> About here would be the ideal new home for ...
> [...]
> > -
> > -	strbuf_release(&buf);
> 
> ... this line.

Oops, thanks. I toyed with making the strbuf here static, which is why I
dropped the release. But since we only use it on a cache miss, I decided
it was better to avoid the hidden global (and then of course forgot to
re-add the release).

-Peff

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH 8/9] sha1-file: use loose object cache for quick existence check
  2018-11-12 16:21                           ` Jeff King
@ 2018-11-12 22:18                             ` Ævar Arnfjörð Bjarmason
  2018-11-12 22:30                               ` Ævar Arnfjörð Bjarmason
  2018-11-12 22:44                             ` Geert Jansen
  1 sibling, 1 reply; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-12 22:18 UTC (permalink / raw)
  To: Jeff King
  Cc: Geert Jansen, Junio C Hamano, git, René Scharfe, Takuto Ikuta


On Mon, Nov 12 2018, Jeff King wrote:

> On Mon, Nov 12, 2018 at 05:01:02PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> > There's some obvious hand-waving in the paragraphs above. I would love
>> > it if somebody with an NFS system could do some before/after timings
>> > with various numbers of loose objects, to get a sense of where the
>> > breakeven point is.
>> >
>> > My gut is that we do not need the complexity of a cache-size limit, nor
>> > of a config option to disable this. But it would be nice to have a real
>> > number where "reasonable" ends and "pathological" begins. :)
>>
>> I'm happy to test this on some of the NFS we have locally, and started
>> out with a plan to write some for-loop using the low-level API (so it
>> would look up all 256), fake populate .git/objects/?? with N number of
>> objects etc, but ran out of time.
>>
>> Do you have something ready that you think would be representative and I
>> could just run? If not I'll try to pick this up again...
>
> No, but they don't even really need to be actual objects. So I suspect
> something like:
>
>   git init
>   for i in $(seq 0 255); do
>     i=$(printf %02x $i)
>     mkdir -p .git/objects/$i
>     for j in $(seq --format=%038g 1000); do
>       echo foo >.git/objects/$i/$j
>     done
>   done
>   git index-pack -v --stdin </path/to/git.git/objects/pack/XYZ.pack
>
> might work (for various values of 1000). The shell loop would probably
> be faster as perl, too. :)
>
> Make sure you clear the object directory between runs, though (otherwise
> the subsequent index-pack's really do find collisions and spend time
> accessing the objects).
>
> If you want real objects, you could probably just dump a bunch of
> sequential blobs to fast-import, and then pipe the result to
> unpack-objects.
>
> -Peff

I did a very ad-hoc test against a NetApp filer using the test script
quoted at the end of this E-Mail. The test compared origin/master, this
branch of yours, and my core.checkCollisions=false branch.

When run with DBD-mysql.git (just some random ~1k commit repo I had):

    $ GIT_PERF_REPEAT_COUNT=3 GIT_PERF_MAKE_OPTS='-j56 CFLAGS="-O3"' ./run origin/master peff/jk/loose-cache avar/check-collisions-config p0008-index-pack.sh

I get:

    Test                                             origin/master     peff/jk/loose-cache      avar/check-collisions-config
    ------------------------------------------------------------------------------------------------------------------------
    0008.2: index-pack with 256*1 loose objects      4.31(0.55+0.18)   0.41(0.40+0.02) -90.5%   0.23(0.36+0.01) -94.7%
    0008.3: index-pack with 256*10 loose objects     4.37(0.45+0.21)   0.45(0.40+0.02) -89.7%   0.25(0.38+0.01) -94.3%
    0008.4: index-pack with 256*100 loose objects    4.47(0.53+0.23)   0.67(0.63+0.02) -85.0%   0.24(0.38+0.01) -94.6%
    0008.5: index-pack with 256*250 loose objects    5.01(0.67+0.30)   1.04(0.98+0.06) -79.2%   0.24(0.37+0.01) -95.2%
    0008.6: index-pack with 256*500 loose objects    5.11(0.57+0.21)   1.81(1.70+0.09) -64.6%   0.25(0.38+0.01) -95.1%
    0008.7: index-pack with 256*750 loose objects    5.12(0.60+0.22)   2.54(2.38+0.14) -50.4%   0.24(0.38+0.01) -95.3%
    0008.8: index-pack with 256*1000 loose objects   4.52(0.52+0.21)   3.36(3.17+0.17) -25.7%   0.23(0.36+0.01) -94.9%
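
For reading these tables: assuming the percentage column is the
relative change in elapsed time against the first column (which is
consistent with the numbers above), it can be reproduced with
something like this sketch. The function name is made up.

```c
/* Relative change of "other" against "base", in percent. */
static double toy_rel_change(double base, double other)
{
	return (other - base) / base * 100.0;
}
```

E.g. 4.31 against 0.41 comes out at about -90.5, as in the 256*1 row.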

I then hacked it to test against git.git, but skipped origin/master for
that one because it takes *ages*. So just mine vs. yours:

    $ GIT_PERF_REPEAT_COUNT=3 GIT_PERF_MAKE_OPTS='-j56 CFLAGS="-O3"' ./run peff/jk/loose-cache avar/check-collisions-config p0008-index-pack.sh
    [...]
    Test                                             peff/jk/loose-cache   avar/check-collisions-config
    ---------------------------------------------------------------------------------------------------
    0008.2: index-pack with 256*1 loose objects      12.57(28.72+0.61)     12.68(29.36+0.62) +0.9%
    0008.3: index-pack with 256*10 loose objects     12.77(28.75+0.61)     12.50(28.88+0.56) -2.1%
    0008.4: index-pack with 256*100 loose objects    13.20(29.49+0.66)     12.38(28.58+0.60) -6.2%
    0008.5: index-pack with 256*250 loose objects    14.10(30.59+0.64)     12.54(28.22+0.57) -11.1%
    0008.6: index-pack with 256*500 loose objects    14.48(31.06+0.74)     12.43(28.59+0.60) -14.2%
    0008.7: index-pack with 256*750 loose objects    15.31(31.91+0.74)     12.67(29.23+0.64) -17.2%
    0008.8: index-pack with 256*1000 loose objects   16.34(32.84+0.76)     13.11(30.19+0.68) -19.8%

So not much of a practical difference perhaps. But then again this isn't
a very realistic test case of anything. Rarely are you going to push a
history of something the size of git.git into a repo with this many
loose objects.

Using sha1collisiondetection.git is I think the most realistic scenario,
i.e. you'll often end up fetching/pushing something roughly the size of
its entire history on a big repo, and with it:

    Test                                             peff/jk/loose-cache   avar/check-collisions-config
    ---------------------------------------------------------------------------------------------------
    0008.2: index-pack with 256*1 loose objects      0.16(0.04+0.01)       0.05(0.03+0.00) -68.8%
    0008.3: index-pack with 256*10 loose objects     0.19(0.04+0.02)       0.05(0.02+0.00) -73.7%
    0008.4: index-pack with 256*100 loose objects    0.32(0.17+0.02)       0.04(0.02+0.00) -87.5%
    0008.5: index-pack with 256*250 loose objects    0.57(0.41+0.03)       0.04(0.02+0.00) -93.0%
    0008.6: index-pack with 256*500 loose objects    1.02(0.83+0.06)       0.04(0.03+0.00) -96.1%
    0008.7: index-pack with 256*750 loose objects    1.47(1.24+0.10)       0.04(0.02+0.00) -97.3%
    0008.8: index-pack with 256*1000 loose objects   1.94(1.70+0.10)       0.04(0.02+0.00) -97.9%

As noted in previous threads I have an in-house monorepo where (due to
expiry policies) loose objects hover around the 256*250 mark.

The script is below. It is hacky as hell and takes shortcuts so as not
to re-create the huge fake loose object collection every time (that
takes ages). Perhaps you're interested in incorporating some version of
this into a v2. To be useful it should take some target path as an env
variable.

$ cat t/perf/p0008-index-pack.sh
#!/bin/sh

test_description="Tests performance of index-pack with loose objects"

. ./perf-lib.sh

test_perf_fresh_repo

test_expect_success 'setup tests' '
	for count in 1 10 100 250 500 750 1000
	do
		if test -d /mnt/ontap_githackers/repo-$count.git
		then
			rm -rf /mnt/ontap_githackers/repo-$count.git/objects/pack
		else
			git init --bare /mnt/ontap_githackers/repo-$count.git &&
			(
				cd /mnt/ontap_githackers/repo-$count.git &&
				for i in $(seq 0 255)
				do
					i=$(printf %02x $i) &&
					mkdir objects/$i &&
					for j in $(seq --format=%038g $count)
					do
						>objects/$i/$j
					done
				done
			)
		fi
	done
'

for count in 1 10 100 250 500 750 1000
do
	echo 3 | sudo tee /proc/sys/vm/drop_caches
	test_perf "index-pack with 256*$count loose objects" "
		(
			cd /mnt/ontap_githackers/repo-$count.git &&
			rm -fv objects/pack/*;
			git -c core.checkCollisions=false index-pack -v --stdin </home/aearnfjord/g/DBD-mysql/.git/objects/pack/pack-*.pack
		)
	"
done

test_done
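
One possible shape for the "target path as an env variable" idea is
sketched below. This is an illustration only: PERF_LOOSE_TARGET and
populate_fake_loose are hypothetical names that appear neither in the
script above nor in git's perf framework, and the sketch covers just the
setup half (creating the fake loose objects), not the timed part.

```shell
#!/bin/sh
# Sketch only: PERF_LOOSE_TARGET and populate_fake_loose are hypothetical
# names, not part of the original script or of git's perf framework.

# Create 256 fan-out directories (00..ff) under $1, each holding $2 empty
# files with 38-character names, mimicking loose objects.
populate_fake_loose () {
	objdir=$1
	count=$2
	for i in $(seq 0 255)
	do
		dir=$objdir/$(printf %02x "$i")
		mkdir -p "$dir" || return 1
		for j in $(seq -f %038g "$count")
		do
			: >"$dir/$j" || return 1
		done
	done
}

# Only touch the filesystem when the caller supplied a target path.
if test -n "${PERF_LOOSE_TARGET:-}"
then
	for count in 1 10 100 250 500 750 1000
	do
		repo=$PERF_LOOSE_TARGET/repo-$count.git
		if test -d "$repo"
		then
			# Reuse the existing fake objects, drop old packs.
			rm -rf "$repo/objects/pack"
		else
			git init --bare "$repo" &&
			populate_fake_loose "$repo/objects" "$count"
		fi
	done
fi
```

Run as e.g. "PERF_LOOSE_TARGET=/mnt/some-nfs-mount sh setup.sh" (path
and file name made up). With the variable unset the script only defines
the function and does nothing else.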


* Re: [PATCH 8/9] sha1-file: use loose object cache for quick existence check
  2018-11-12 22:18                             ` Ævar Arnfjörð Bjarmason
@ 2018-11-12 22:30                               ` Ævar Arnfjörð Bjarmason
  2018-11-13 10:02                                 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-12 22:30 UTC (permalink / raw)
  To: Jeff King
  Cc: Geert Jansen, Junio C Hamano, git, René Scharfe, Takuto Ikuta


On Mon, Nov 12 2018, Ævar Arnfjörð Bjarmason wrote:

> On Mon, Nov 12 2018, Jeff King wrote:
>
>> On Mon, Nov 12, 2018 at 05:01:02PM +0100, Ævar Arnfjörð Bjarmason wrote:
>>
>>> > There's some obvious hand-waving in the paragraphs above. I would love
>>> > it if somebody with an NFS system could do some before/after timings
>>> > with various numbers of loose objects, to get a sense of where the
>>> > breakeven point is.
>>> >
>>> > My gut is that we do not need the complexity of a cache-size limit, nor
>>> > of a config option to disable this. But it would be nice to have a real
>>> > number where "reasonable" ends and "pathological" begins. :)
>>>
>>> I'm happy to test this on some of the NFS we have locally, and started
>>> out with a plan to write some for-loop using the low-level API (so it
>>> would look up all 256), fake populate .git/objects/?? with N number of
>>> objects etc, but ran out of time.
>>>
>>> Do you have something ready that you think would be representative and I
>>> could just run? If not I'll try to pick this up again...
>>
>> No, but they don't even really need to be actual objects. So I suspect
>> something like:
>>
>>   git init
>>   for i in $(seq 256); do
>>     i=$(printf %02x $i)
>>     mkdir -p .git/objects/$i
>>     for j in $(seq --format=%038g 1000); do
>>       echo foo >.git/objects/$i/$j
>>     done
>>   done
>>   git index-pack -v --stdin </path/to/git.git/objects/pack/XYZ.pack
>>
>> might work (for various values of 1000). The shell loop would probably
>> be faster as perl, too. :)
>>
>> Make sure you clear the object directory between runs, though (otherwise
>> the subsequent index-pack's really do find collisions and spend time
>> accessing the objects).
>>
>> If you want real objects, you could probably just dump a bunch of
>> sequential blobs to fast-import, and then pipe the result to
>> unpack-objects.
>>
>> -Peff
>
> I did a very ad-hoc test against a NetApp filer using the test script
> quoted at the end of this E-Mail. The test compared origin/master, this
> branch of yours, and my core.checkCollisions=false branch.
>
> When run with DBD-mysql.git (just some random ~1k commit repo I had):
>
>     $ GIT_PERF_REPEAT_COUNT=3 GIT_PERF_MAKE_OPTS='-j56 CFLAGS="-O3"' ./run origin/master peff/jk/loose-cache avar/check-collisions-config p0008-index-pack.sh
>
> I get:
>
>     Test                                             origin/master     peff/jk/loose-cache      avar/check-collisions-config
>     ------------------------------------------------------------------------------------------------------------------------
>     0008.2: index-pack with 256*1 loose objects      4.31(0.55+0.18)   0.41(0.40+0.02) -90.5%   0.23(0.36+0.01) -94.7%
>     0008.3: index-pack with 256*10 loose objects     4.37(0.45+0.21)   0.45(0.40+0.02) -89.7%   0.25(0.38+0.01) -94.3%
>     0008.4: index-pack with 256*100 loose objects    4.47(0.53+0.23)   0.67(0.63+0.02) -85.0%   0.24(0.38+0.01) -94.6%
>     0008.5: index-pack with 256*250 loose objects    5.01(0.67+0.30)   1.04(0.98+0.06) -79.2%   0.24(0.37+0.01) -95.2%
>     0008.6: index-pack with 256*500 loose objects    5.11(0.57+0.21)   1.81(1.70+0.09) -64.6%   0.25(0.38+0.01) -95.1%
>     0008.7: index-pack with 256*750 loose objects    5.12(0.60+0.22)   2.54(2.38+0.14) -50.4%   0.24(0.38+0.01) -95.3%
>     0008.8: index-pack with 256*1000 loose objects   4.52(0.52+0.21)   3.36(3.17+0.17) -25.7%   0.23(0.36+0.01) -94.9%
>
> I then hacked it to test against git.git, but skipped origin/master for
> that one because it takes *ages*. So just mine v.s. yours:
>
>     $ GIT_PERF_REPEAT_COUNT=3 GIT_PERF_MAKE_OPTS='-j56 CFLAGS="-O3"' ./run peff/jk/loose-cache avar/check-collisions-config p0008-index-pack.sh
>     [...]
>     Test                                             peff/jk/loose-cache   avar/check-collisions-config
>     ---------------------------------------------------------------------------------------------------
>     0008.2: index-pack with 256*1 loose objects      12.57(28.72+0.61)     12.68(29.36+0.62) +0.9%
>     0008.3: index-pack with 256*10 loose objects     12.77(28.75+0.61)     12.50(28.88+0.56) -2.1%
>     0008.4: index-pack with 256*100 loose objects    13.20(29.49+0.66)     12.38(28.58+0.60) -6.2%
>     0008.5: index-pack with 256*250 loose objects    14.10(30.59+0.64)     12.54(28.22+0.57) -11.1%
>     0008.6: index-pack with 256*500 loose objects    14.48(31.06+0.74)     12.43(28.59+0.60) -14.2%
>     0008.7: index-pack with 256*750 loose objects    15.31(31.91+0.74)     12.67(29.23+0.64) -17.2%
>     0008.8: index-pack with 256*1000 loose objects   16.34(32.84+0.76)     13.11(30.19+0.68) -19.8%
>
> So not much of a practical difference perhaps. But then again this isn't
> a very realistic test case of anything. Rarely are you going to push a
> history of something the size of git.git into a repo with this many
> loose objects.
>
> Using sha1collisiondetection.git is I think the most realistic scenario,
> i.e. you'll often end up fetching/pushing something roughly the size of
> its entire history on a big repo, and with it:
>
>     Test                                             peff/jk/loose-cache   avar/check-collisions-config
>     ---------------------------------------------------------------------------------------------------
>     0008.2: index-pack with 256*1 loose objects      0.16(0.04+0.01)       0.05(0.03+0.00) -68.8%
>     0008.3: index-pack with 256*10 loose objects     0.19(0.04+0.02)       0.05(0.02+0.00) -73.7%
>     0008.4: index-pack with 256*100 loose objects    0.32(0.17+0.02)       0.04(0.02+0.00) -87.5%
>     0008.5: index-pack with 256*250 loose objects    0.57(0.41+0.03)       0.04(0.02+0.00) -93.0%
>     0008.6: index-pack with 256*500 loose objects    1.02(0.83+0.06)       0.04(0.03+0.00) -96.1%
>     0008.7: index-pack with 256*750 loose objects    1.47(1.24+0.10)       0.04(0.02+0.00) -97.3%
>     0008.8: index-pack with 256*1000 loose objects   1.94(1.70+0.10)       0.04(0.02+0.00) -97.9%
>
> As noted in previous threads I have an in-house monorepo where (due to
> expiry policies) loose objects hover around the 256*250 mark.
>
> The script, which is hacky as hell and takes shortcuts not to re-create
> the huge fake loose object collection every time (takes ages). Perhaps
> you're interested in incorporating some version of this into a v2. To be
> useful it should take some target path as an env variable.

I forgot perhaps the most useful metric. Testing against origin/master
too on the sha1collisiondetection.git repo, which as noted above I think
is a good stand-in for making a medium sized push to a big repo. This
shows when the loose cache becomes counterproductive:

    Test                                             origin/master     peff/jk/loose-cache       avar/check-collisions-config
    -------------------------------------------------------------------------------------------------------------------------
    0008.2: index-pack with 256*1 loose objects      0.42(0.04+0.03)   0.17(0.04+0.00) -59.5%    0.04(0.03+0.00) -90.5%
    0008.3: index-pack with 256*10 loose objects     0.49(0.04+0.03)   0.19(0.04+0.01) -61.2%    0.04(0.02+0.00) -91.8%
    0008.4: index-pack with 256*100 loose objects    0.49(0.04+0.04)   0.33(0.18+0.01) -32.7%    0.05(0.02+0.00) -89.8%
    0008.5: index-pack with 256*250 loose objects    0.54(0.03+0.04)   0.59(0.43+0.02) +9.3%     0.04(0.02+0.01) -92.6%
    0008.6: index-pack with 256*500 loose objects    0.49(0.04+0.03)   1.04(0.83+0.07) +112.2%   0.04(0.02+0.00) -91.8%
    0008.7: index-pack with 256*750 loose objects    0.56(0.04+0.05)   1.50(1.28+0.08) +167.9%   0.04(0.02+0.00) -92.9%
    0008.8: index-pack with 256*1000 loose objects   0.54(0.05+0.03)   1.95(1.68+0.13) +261.1%   0.04(0.02+0.00) -92.6%

I still think it's best to take this patch series, since it's unlikely
we're making anything worse in practice: the >50k loose objects case is
a really high number, which I don't think is worth worrying about.
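
For reference, the percentage columns in the table above can be
reproduced from the wall-clock times with a small helper. This is a
sketch, not what the perf framework itself runs; pct_change is a
made-up name.

```shell
#!/bin/sh
# pct_change OLD NEW: relative change of NEW versus OLD, in percent,
# with an explicit sign and one decimal place.
# (Hypothetical helper for illustration; not part of t/perf.)
pct_change () {
	awk -v old="$1" -v new="$2" \
		'BEGIN { printf "%+.1f%%\n", (new - old) / old * 100 }'
}

# origin/master vs. peff/jk/loose-cache, wall-clock seconds from the table:
pct_change 0.42 0.17   # 256*1 loose objects
pct_change 0.49 0.33   # 256*100 loose objects
pct_change 0.54 0.59   # 256*250 loose objects
pct_change 0.49 1.04   # 256*500 loose objects
```

The sign flips between the 256*100 row (25,600 loose objects) and the
256*250 row (64,000 loose objects), which is where the ">50k" figure
comes from.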

But I am somewhat paranoid about the potential performance
regression. I.e. this is me testing against a really expensive and
relatively well performing NetApp NFS device where the ping stats are:

    rtt min/avg/max/mdev = 0.155/0.396/1.387/0.349 ms

So I suspect this might get a lot worse for setups which don't enjoy the
same performance or network locality.


> $ cat t/perf/p0008-index-pack.sh
> #!/bin/sh
>
> test_description="Tests performance of index-pack with loose objects"
>
> . ./perf-lib.sh
>
> test_perf_fresh_repo
>
> test_expect_success 'setup tests' '
> 	for count in 1 10 100 250 500 750 1000
> 	do
> 		if test -d /mnt/ontap_githackers/repo-$count.git
> 		then
> 			rm -rf /mnt/ontap_githackers/repo-$count.git/objects/pack
> 		else
> 			git init --bare /mnt/ontap_githackers/repo-$count.git &&
> 			(
> 				cd /mnt/ontap_githackers/repo-$count.git &&
> 				for i in $(seq 0 255)
> 				do
> 					i=$(printf %02x $i) &&
> 					mkdir objects/$i &&
> 					for j in $(seq --format=%038g $count)
> 					do
> 						>objects/$i/$j
> 					done
> 				done
> 			)
> 		fi
> 	done
> '
>
> for count in 1 10 100 250 500 750 1000
> do
> 	echo 3 | sudo tee /proc/sys/vm/drop_caches
> 	test_perf "index-pack with 256*$count loose objects" "
> 		(
> 			cd /mnt/ontap_githackers/repo-$count.git &&
> 			rm -fv objects/pack/*;
> 			git -c core.checkCollisions=false index-pack -v --stdin </home/aearnfjord/g/DBD-mysql/.git/objects/pack/pack-*.pack
> 		)
> 	"
> done
>
> test_done


* Re: [PATCH 8/9] sha1-file: use loose object cache for quick existence check
  2018-11-12 16:21                           ` Jeff King
  2018-11-12 22:18                             ` Ævar Arnfjörð Bjarmason
@ 2018-11-12 22:44                             ` Geert Jansen
  1 sibling, 0 replies; 87+ messages in thread
From: Geert Jansen @ 2018-11-12 22:44 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, git,
	René Scharfe, Takuto Ikuta

On Mon, Nov 12, 2018 at 11:21:51AM -0500, Jeff King wrote:

> No, but they don't even really need to be actual objects. So I suspect
> something like:
> 
>   git init
>   for i in $(seq 256); do
>     i=$(printf %02x $i)
>     mkdir -p .git/objects/$i
>     for j in $(seq --format=%038g 1000); do
>       echo foo >.git/objects/$i/$j
>     done
>   done
>   git index-pack -v --stdin </path/to/git.git/objects/pack/XYZ.pack
> 
> might work (for various values of 1000). The shell loop would probably
> be faster as perl, too. :)
> 
> Make sure you clear the object directory between runs, though (otherwise
> the subsequent index-pack's really do find collisions and spend time
> accessing the objects).

Below are my results. They are not as comprehensive as Ævar's tests. Similarly,
I kept the loose objects between tests and removed the packs instead. And I
also used the "echo 3 | sudo tee /proc/sys/vm/drop_caches" trick :)

This is with git.git:

                   origin/master    jk/loose-object-cache

256*100 objects    520s             13.5s (-97%)
256*1000 objects   826s             59s (-93%)

I've started a 256*10K setup but that's still creating the 2.5M loose objects.
I'll post the results when it's done. I would expect that jk/loose-object-cache
is still marginally faster than origin/master based on a simple linear
extrapolation.
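
One way to read "simple linear extrapolation" is to draw a straight line
through the two measured points of each column and evaluate it at
256*10K. The sketch below does that; the helper name is made up, and
whether either branch really scales linearly that far is an assumption,
not a measurement.

```shell
#!/bin/sh
# extrapolate X1 Y1 X2 Y2 X: value at X of the straight line through
# (X1, Y1) and (X2, Y2), rounded to whole seconds.
# (Hypothetical helper for illustration only.)
extrapolate () {
	awk -v x1="$1" -v y1="$2" -v x2="$3" -v y2="$4" -v x="$5" \
		'BEGIN { printf "%.0f\n", y1 + (y2 - y1) / (x2 - x1) * (x - x1) }'
}

# X is the per-directory object count (100, 1000, 10000); Y is seconds.
extrapolate 100 13.5 1000 59 10000    # jk/loose-object-cache
extrapolate 100 520 1000 826 10000    # origin/master
```

How large the resulting margin is depends heavily on how origin/master
is assumed to grow between 256*1000 and 256*10K, so the measured result
is the number to trust.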


* Re: [RFC PATCH] index-pack: improve performance on NFS
  2018-11-09 13:43                   ` [RFC PATCH] index-pack: improve performance on NFS Ævar Arnfjörð Bjarmason
  2018-11-09 16:08                     ` Duy Nguyen
@ 2018-11-12 22:58                     ` Geert Jansen
  1 sibling, 0 replies; 87+ messages in thread
From: Geert Jansen @ 2018-11-12 22:58 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Jeff King, Junio C Hamano, git

On Fri, Nov 09, 2018 at 02:43:52PM +0100, Ævar Arnfjörð Bjarmason wrote:

> As noted in
> https://public-inbox.org/git/87bm7clf4o.fsf@evledraar.gmail.com/ and
> https://public-inbox.org/git/87h8gq5zmc.fsf@evledraar.gmail.com/ I think
> it's regardless of Jeff's optimization is. O(nothing) is always faster
> than O(something), particularly (as explained in that E-Mail) on NFS.
> 
> You didn't answer my question in
> https://public-inbox.org/git/20181030024925.GC8325@amazon.com/ about
> whether for your purposes you're interested in this for something where
> it needs to work out of the box on some random Amazon's customer's
> "git", or if it's something in-house and you just don't want to turn off
> collision checking. That would be useful to know.

The reason I started this thread is to optimize performance for AWS customers
that run git on EFS. Therefore, my preference is that git would be fast out of
the box on NFS/EFS but without having to disable collision checking
unconditionally (disabling it for empty repos is fine as that's a no-op
anyway).


* Re: [PATCH 8/9] sha1-file: use loose object cache for quick existence check
  2018-11-12 22:30                               ` Ævar Arnfjörð Bjarmason
@ 2018-11-13 10:02                                 ` Ævar Arnfjörð Bjarmason
  2018-11-14 18:21                                   ` René Scharfe
  0 siblings, 1 reply; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-13 10:02 UTC (permalink / raw)
  To: Jeff King
  Cc: Geert Jansen, Junio C Hamano, git, René Scharfe, Takuto Ikuta


On Mon, Nov 12 2018, Ævar Arnfjörð Bjarmason wrote:

> On Mon, Nov 12 2018, Ævar Arnfjörð Bjarmason wrote:
>
>> On Mon, Nov 12 2018, Jeff King wrote:
>>
>>> On Mon, Nov 12, 2018 at 05:01:02PM +0100, Ævar Arnfjörð Bjarmason wrote:
>>>
>>>> > There's some obvious hand-waving in the paragraphs above. I would love
>>>> > it if somebody with an NFS system could do some before/after timings
>>>> > with various numbers of loose objects, to get a sense of where the
>>>> > breakeven point is.
>>>> >
>>>> > My gut is that we do not need the complexity of a cache-size limit, nor
>>>> > of a config option to disable this. But it would be nice to have a real
>>>> > number where "reasonable" ends and "pathological" begins. :)
>>>>
>>>> I'm happy to test this on some of the NFS we have locally, and started
>>>> out with a plan to write some for-loop using the low-level API (so it
>>>> would look up all 256), fake populate .git/objects/?? with N number of
>>>> objects etc, but ran out of time.
>>>>
>>>> Do you have something ready that you think would be representative and I
>>>> could just run? If not I'll try to pick this up again...
>>>
>>> No, but they don't even really need to be actual objects. So I suspect
>>> something like:
>>>
>>>   git init
>>>   for i in $(seq 256); do
>>>     i=$(printf %02x $i)
>>>     mkdir -p .git/objects/$i
>>>     for j in $(seq --format=%038g 1000); do
>>>       echo foo >.git/objects/$i/$j
>>>     done
>>>   done
>>>   git index-pack -v --stdin </path/to/git.git/objects/pack/XYZ.pack
>>>
>>> might work (for various values of 1000). The shell loop would probably
>>> be faster as perl, too. :)
>>>
>>> Make sure you clear the object directory between runs, though (otherwise
>>> the subsequent index-pack's really do find collisions and spend time
>>> accessing the objects).
>>>
>>> If you want real objects, you could probably just dump a bunch of
>>> sequential blobs to fast-import, and then pipe the result to
>>> unpack-objects.
>>>
>>> -Peff
>>
>> I did a very ad-hoc test against a NetApp filer using the test script
>> quoted at the end of this E-Mail. The test compared origin/master, this
>> branch of yours, and my core.checkCollisions=false branch.
>>
>> When run with DBD-mysql.git (just some random ~1k commit repo I had):
>>
>>     $ GIT_PERF_REPEAT_COUNT=3 GIT_PERF_MAKE_OPTS='-j56 CFLAGS="-O3"' ./run origin/master peff/jk/loose-cache avar/check-collisions-config p0008-index-pack.sh
>>
>> I get:
>>
>>     Test                                             origin/master     peff/jk/loose-cache      avar/check-collisions-config
>>     ------------------------------------------------------------------------------------------------------------------------
>>     0008.2: index-pack with 256*1 loose objects      4.31(0.55+0.18)   0.41(0.40+0.02) -90.5%   0.23(0.36+0.01) -94.7%
>>     0008.3: index-pack with 256*10 loose objects     4.37(0.45+0.21)   0.45(0.40+0.02) -89.7%   0.25(0.38+0.01) -94.3%
>>     0008.4: index-pack with 256*100 loose objects    4.47(0.53+0.23)   0.67(0.63+0.02) -85.0%   0.24(0.38+0.01) -94.6%
>>     0008.5: index-pack with 256*250 loose objects    5.01(0.67+0.30)   1.04(0.98+0.06) -79.2%   0.24(0.37+0.01) -95.2%
>>     0008.6: index-pack with 256*500 loose objects    5.11(0.57+0.21)   1.81(1.70+0.09) -64.6%   0.25(0.38+0.01) -95.1%
>>     0008.7: index-pack with 256*750 loose objects    5.12(0.60+0.22)   2.54(2.38+0.14) -50.4%   0.24(0.38+0.01) -95.3%
>>     0008.8: index-pack with 256*1000 loose objects   4.52(0.52+0.21)   3.36(3.17+0.17) -25.7%   0.23(0.36+0.01) -94.9%
>>
>> I then hacked it to test against git.git, but skipped origin/master for
>> that one because it takes *ages*. So just mine v.s. yours:
>>
>>     $ GIT_PERF_REPEAT_COUNT=3 GIT_PERF_MAKE_OPTS='-j56 CFLAGS="-O3"' ./run peff/jk/loose-cache avar/check-collisions-config p0008-index-pack.sh
>>     [...]
>>     Test                                             peff/jk/loose-cache   avar/check-collisions-config
>>     ---------------------------------------------------------------------------------------------------
>>     0008.2: index-pack with 256*1 loose objects      12.57(28.72+0.61)     12.68(29.36+0.62) +0.9%
>>     0008.3: index-pack with 256*10 loose objects     12.77(28.75+0.61)     12.50(28.88+0.56) -2.1%
>>     0008.4: index-pack with 256*100 loose objects    13.20(29.49+0.66)     12.38(28.58+0.60) -6.2%
>>     0008.5: index-pack with 256*250 loose objects    14.10(30.59+0.64)     12.54(28.22+0.57) -11.1%
>>     0008.6: index-pack with 256*500 loose objects    14.48(31.06+0.74)     12.43(28.59+0.60) -14.2%
>>     0008.7: index-pack with 256*750 loose objects    15.31(31.91+0.74)     12.67(29.23+0.64) -17.2%
>>     0008.8: index-pack with 256*1000 loose objects   16.34(32.84+0.76)     13.11(30.19+0.68) -19.8%
>>
>> So not much of a practical difference perhaps. But then again this isn't
>> a very realistic test case of anything. Rarely are you going to push a
>> history of something the size of git.git into a repo with this many
>> loose objects.
>>
>> Using sha1collisiondetection.git is I think the most realistic scenario,
>> i.e. you'll often end up fetching/pushing something roughly the size of
>> its entire history on a big repo, and with it:
>>
>>     Test                                             peff/jk/loose-cache   avar/check-collisions-config
>>     ---------------------------------------------------------------------------------------------------
>>     0008.2: index-pack with 256*1 loose objects      0.16(0.04+0.01)       0.05(0.03+0.00) -68.8%
>>     0008.3: index-pack with 256*10 loose objects     0.19(0.04+0.02)       0.05(0.02+0.00) -73.7%
>>     0008.4: index-pack with 256*100 loose objects    0.32(0.17+0.02)       0.04(0.02+0.00) -87.5%
>>     0008.5: index-pack with 256*250 loose objects    0.57(0.41+0.03)       0.04(0.02+0.00) -93.0%
>>     0008.6: index-pack with 256*500 loose objects    1.02(0.83+0.06)       0.04(0.03+0.00) -96.1%
>>     0008.7: index-pack with 256*750 loose objects    1.47(1.24+0.10)       0.04(0.02+0.00) -97.3%
>>     0008.8: index-pack with 256*1000 loose objects   1.94(1.70+0.10)       0.04(0.02+0.00) -97.9%
>>
>> As noted in previous threads I have an in-house monorepo where (due to
>> expiry policies) loose objects hover around the 256*250 mark.
>>
>> The script, which is hacky as hell and takes shortcuts not to re-create
>> the huge fake loose object collection every time (takes ages). Perhaps
>> you're interested in incorporating some version of this into a v2. To be
>> useful it should take some target path as an env variable.
>
> I forgot perhaps the most useful metric. Testing against origin/master
> too on the sha1collisiondetection.git repo, which as noted above I think
> is a good stand-in for making a medium sized push to a big repo. This
> shows when the loose cache becomes counterproductive:
>
>     Test                                             origin/master     peff/jk/loose-cache       avar/check-collisions-config
>     -------------------------------------------------------------------------------------------------------------------------
>     0008.2: index-pack with 256*1 loose objects      0.42(0.04+0.03)   0.17(0.04+0.00) -59.5%    0.04(0.03+0.00) -90.5%
>     0008.3: index-pack with 256*10 loose objects     0.49(0.04+0.03)   0.19(0.04+0.01) -61.2%    0.04(0.02+0.00) -91.8%
>     0008.4: index-pack with 256*100 loose objects    0.49(0.04+0.04)   0.33(0.18+0.01) -32.7%    0.05(0.02+0.00) -89.8%
>     0008.5: index-pack with 256*250 loose objects    0.54(0.03+0.04)   0.59(0.43+0.02) +9.3%     0.04(0.02+0.01) -92.6%
>     0008.6: index-pack with 256*500 loose objects    0.49(0.04+0.03)   1.04(0.83+0.07) +112.2%   0.04(0.02+0.00) -91.8%
>     0008.7: index-pack with 256*750 loose objects    0.56(0.04+0.05)   1.50(1.28+0.08) +167.9%   0.04(0.02+0.00) -92.9%
>     0008.8: index-pack with 256*1000 loose objects   0.54(0.05+0.03)   1.95(1.68+0.13) +261.1%   0.04(0.02+0.00) -92.6%
>
> I still think it's best to take this patch series since it's unlikely
> we're making anything worse in practice, the >50k objects case is a
> really high number, which I don't think is worth worrying about.
>
> But I am somewhat paranoid about the potential performance
> regression. I.e. this is me testing against a really expensive and
> relatively well performing NetApp NFS device where the ping stats are:
>
>     rtt min/avg/max/mdev = 0.155/0.396/1.387/0.349 ms
>
> So I suspect this might get a lot worse for setups which don't enjoy the
> same performance or network locality.

I tried this with the same filer mounted from another DC with ~30x the
average RTT:

    rtt min/avg/max/mdev = 11.553/11.618/11.739/0.121 ms

But otherwise the same setup (same machine type/specs mounting it). It
had the opposite results of what I was expecting:

    Test                                             origin/master     peff/jk/loose-cache      avar/check-collisions-config
    ------------------------------------------------------------------------------------------------------------------------
    0008.2: index-pack with 256*1 loose objects      7.78(0.04+0.03)   2.75(0.03+0.01) -64.7%   0.40(0.02+0.00) -94.9%
    0008.3: index-pack with 256*10 loose objects     7.75(0.04+0.04)   2.77(0.05+0.01) -64.3%   0.40(0.02+0.00) -94.8%
    0008.4: index-pack with 256*100 loose objects    7.75(0.05+0.02)   2.91(0.18+0.01) -62.5%   0.40(0.02+0.00) -94.8%
    0008.5: index-pack with 256*250 loose objects    7.73(0.04+0.04)   3.19(0.43+0.02) -58.7%   0.40(0.02+0.00) -94.8%
    0008.6: index-pack with 256*500 loose objects    7.73(0.04+0.04)   3.64(0.83+0.05) -52.9%   0.40(0.02+0.00) -94.8%
    0008.7: index-pack with 256*750 loose objects    7.73(0.04+0.02)   4.14(1.29+0.07) -46.4%   0.40(0.02+0.00) -94.8%
    0008.8: index-pack with 256*1000 loose objects   7.73(0.04+0.03)   4.55(1.72+0.09) -41.1%   0.40(0.02+0.01) -94.8%

I.e. there the cliff where the cache becomes counterproductive comes
much later, not earlier: even at 256*1000 loose objects the cache is
still 41.1% faster than origin/master. The sha1collisiondetection.git
repo has 418 objects.

So is it cheaper to fill a huge cache than look up those 418? I don't
know, haven't dug. But so far what this suggests is that we're helping
slow FSs to the detriment of faster ones.

So here's the same test not against NFS, but the local ext4 fs (CentOS 7;
Linux 3.10) for sha1collisiondetection.git:

    Test                                             origin/master     peff/jk/loose-cache        avar/check-collisions-config
    --------------------------------------------------------------------------------------------------------------------------
    0008.2: index-pack with 256*1 loose objects      0.02(0.02+0.00)   0.02(0.02+0.01) +0.0%      0.02(0.02+0.00) +0.0%
    0008.3: index-pack with 256*10 loose objects     0.02(0.02+0.00)   0.03(0.03+0.00) +50.0%     0.02(0.02+0.00) +0.0%
    0008.4: index-pack with 256*100 loose objects    0.02(0.02+0.00)   0.17(0.16+0.01) +750.0%    0.02(0.02+0.00) +0.0%
    0008.5: index-pack with 256*250 loose objects    0.02(0.02+0.00)   0.43(0.40+0.03) +2050.0%   0.02(0.02+0.00) +0.0%
    0008.6: index-pack with 256*500 loose objects    0.02(0.02+0.00)   0.88(0.80+0.09) +4300.0%   0.02(0.02+0.00) +0.0%
    0008.7: index-pack with 256*750 loose objects    0.02(0.02+0.00)   1.35(1.27+0.09) +6650.0%   0.02(0.02+0.00) +0.0%
    0008.8: index-pack with 256*1000 loose objects   0.02(0.02+0.00)   1.83(1.70+0.14) +9050.0%   0.02(0.02+0.00) +0.0%

And for mu.git, a ~20k object repo:

    Test                                             origin/master     peff/jk/loose-cache       avar/check-collisions-config
    -------------------------------------------------------------------------------------------------------------------------
    0008.2: index-pack with 256*1 loose objects      0.59(0.91+0.06)   0.58(0.93+0.03) -1.7%     0.57(0.89+0.04) -3.4%
    0008.3: index-pack with 256*10 loose objects     0.59(0.91+0.07)   0.59(0.92+0.03) +0.0%     0.57(0.89+0.03) -3.4%
    0008.4: index-pack with 256*100 loose objects    0.59(0.91+0.05)   0.81(1.13+0.04) +37.3%    0.58(0.91+0.04) -1.7%
    0008.5: index-pack with 256*250 loose objects    0.59(0.91+0.05)   1.23(1.51+0.08) +108.5%   0.58(0.91+0.04) -1.7%
    0008.6: index-pack with 256*500 loose objects    0.59(0.90+0.06)   1.96(2.20+0.12) +232.2%   0.58(0.91+0.04) -1.7%
    0008.7: index-pack with 256*750 loose objects    0.59(0.92+0.05)   2.72(2.92+0.17) +361.0%   0.58(0.90+0.04) -1.7%
    0008.8: index-pack with 256*1000 loose objects   0.59(0.90+0.06)   3.50(3.67+0.21) +493.2%   0.57(0.90+0.04) -3.4%

All of which is to say that I think it definitely makes sense to re-roll
this with a perf test, and a switch to toggle it + docs explaining the
caveats & pointing to the perf test. It's a clear win in some scenarios,
but a big loss in others.
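
A rough way to compare the three sha1collisiondetection.git tables above
is to look at how much wall-clock time the loose cache column gains per
additional fake loose object, going from the 256*1 row to the 256*1000
row. The sketch below does only that arithmetic; per_entry_us is a
made-up name, and treating the growth as a pure per-entry cost is an
assumption, not something that was profiled.

```shell
#!/bin/sh
# per_entry_us SMALL BIG: extra wall-clock time per additional loose
# object, in microseconds, between the 256*1 row (SMALL seconds) and
# the 256*1000 row (BIG seconds) of the peff/jk/loose-cache column.
# (Hypothetical helper for illustration only.)
per_entry_us () {
	awk -v small="$1" -v big="$2" \
		'BEGIN { printf "%.1f\n", (big - small) / (256 * 1000 - 256 * 1) * 1000000 }'
}

per_entry_us 0.17 1.95   # NFS, ~0.4 ms average RTT
per_entry_us 2.75 4.55   # NFS, ~11.6 ms average RTT
per_entry_us 0.02 1.83   # local ext4
```

All three come out at roughly 7 microseconds per entry, and the user
CPU time in the 256*1000 rows is about 1.7s in every setup. That would
fit a cache whose cost is mostly CPU-bound and independent of the
filesystem, while the cost it replaces scales with lookup latency; but
that reading is a guess from the tables.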

>> $ cat t/perf/p0008-index-pack.sh
>> #!/bin/sh
>>
>> test_description="Tests performance of index-pack with loose objects"
>>
>> . ./perf-lib.sh
>>
>> test_perf_fresh_repo
>>
>> test_expect_success 'setup tests' '
>> 	for count in 1 10 100 250 500 750 1000
>> 	do
>> 		if test -d /mnt/ontap_githackers/repo-$count.git
>> 		then
>> 			rm -rf /mnt/ontap_githackers/repo-$count.git/objects/pack
>> 		else
>> 			git init --bare /mnt/ontap_githackers/repo-$count.git &&
>> 			(
>> 				cd /mnt/ontap_githackers/repo-$count.git &&
>> 				for i in $(seq 0 255)
>> 				do
>> 					i=$(printf %02x $i) &&
>> 					mkdir objects/$i &&
>> 					for j in $(seq --format=%038g $count)
>> 					do
>> 						>objects/$i/$j
>> 					done
>> 				done
>> 			)
>> 		fi
>> 	done
>> '
>>
>> for count in 1 10 100 250 500 750 1000
>> do
>> 	echo 3 | sudo tee /proc/sys/vm/drop_caches
>> 	test_perf "index-pack with 256*$count loose objects" "
>> 		(
>> 			cd /mnt/ontap_githackers/repo-$count.git &&
>> 			rm -fv objects/pack/*;
>> 			git -c core.checkCollisions=false index-pack -v --stdin </home/aearnfjord/g/DBD-mysql/.git/objects/pack/pack-*.pack
>> 		)
>> 	"
>> done
>>
>> test_done


* [PATCH v3] index-pack: add ability to disable SHA-1 collision check
  2018-10-30 18:43             ` [PATCH v2 0/3] index-pack: test updates Ævar Arnfjörð Bjarmason
@ 2018-11-13 20:19               ` Ævar Arnfjörð Bjarmason
  2018-11-14  7:09                 ` Junio C Hamano
  0 siblings, 1 reply; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-13 20:19 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jeff King, Geert Jansen, Christian Couder,
	Nicolas Pitre, Linus Torvalds, Petr Baudis,
	Ævar Arnfjörð Bjarmason

Add a new core.checkCollisions setting. On by default, it can be set
to 'false' to disable the check for existing objects in sha1_object().

As noted in the documentation being added here this is done out of
paranoia about future SHA-1 collisions and as a canary (redundant to
"git fsck") for local object corruption.

For the history of SHA-1 collision checking see:

 - 5c2a7fbc36 ("[PATCH] SHA1 naive collision checking", 2005-04-13)

 - f864ba7448 ("Fix read-cache.c collission check logic.", 2005-04-13)

 - aac1794132 ("Improve sha1 object file writing.", 2005-05-03)

 - 8685da4256 ("don't ever allow SHA1 collisions to exist by fetching
   a pack", 2007-03-20)

 - 1421c5f274 ("write_loose_object: don't bother trying to read an old
   object", 2008-06-16)

 - 51054177b3 ("index-pack: detect local corruption in collision
   check", 2017-04-01)

As seen when going through that history there used to be a way to turn
this off at compile-time by using the -DCOLLISION_CHECK=0 option (see
f864ba7448), but this check later went away in favor of general "don't
write if exists" logic for loose objects, and was then brought back
for remotely fetched packs in 8685da4256.

I plan to turn this off by default in my own settings since I'll
appreciate the performance improvement, and because I think worrying
about SHA-1 collisions is insane paranoia. But others might disagree,
so the check is still on by default.

Also add a "GIT_TEST_CHECK_COLLISIONS" setting so the entire test
suite can be exercised with the collision check turned off.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---

Now that the v2 from which I peeled off this patch (just the tests) has
landed, here's a re-submission of the core.checkCollisions knob.

As noted in
https://public-inbox.org/git/878t1x2t3e.fsf@evledraar.gmail.com/ and
related messages this has a great impact on performance, and I'm
already using this in production. For the reasons explained there,
just having the loose object cache isn't enough in some scenarios.

Jeff: What do you think about the order of these going in & about the
config knob for the loose cache that I suggested in the E-Mail above?

I think it makes the most sense for this to land first, and to re-roll
the loose cache with a config knob & change to the docs here
explaining the trade-offs between the two settings and why you
would/wouldn't use them in combination.
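
To illustrate the value handling described in the documentation hunk
below (missing or "default" means on, otherwise a boolean, with
GIT_TEST_CHECK_COLLISIONS overriding the config), here is a minimal
standalone C sketch. It is not the code in this patch:
parse_check_collisions() and effective_check_collisions() are
hypothetical names, the accepted boolean spellings are a simplified
subset of what Git's real parser takes, and the NULL guard assumes
that a valueless key reaches the config callback as value == NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/*
 * Hypothetical helper, not part of the patch: map a raw
 * core.checkCollisions value to 1 (check), 0 (skip) or -1 (invalid).
 * Only a simplified subset of Git's boolean spellings is accepted.
 */
int parse_check_collisions(const char *value)
{
	/* Assumption: a valueless key arrives as NULL and means "true". */
	if (!value)
		return 1;
	/* "default" keeps the check enabled. */
	if (!strcasecmp(value, "default"))
		return 1;
	if (!strcasecmp(value, "true") || !strcasecmp(value, "yes") ||
	    !strcasecmp(value, "on") || !strcmp(value, "1"))
		return 1;
	if (!strcasecmp(value, "false") || !strcasecmp(value, "no") ||
	    !strcasecmp(value, "off") || !strcmp(value, "0"))
		return 0;
	return -1;
}

/*
 * Combine the config result with the test-suite override. from_env is
 * -1 when GIT_TEST_CHECK_COLLISIONS is unset, else 0 or 1; when set it
 * wins over the config value, as in the patch's accessor.
 */
int effective_check_collisions(int from_config, int from_env)
{
	assert(from_config == 0 || from_config == 1);
	return from_env != -1 ? from_env : from_config;
}
```

With these helpers,
effective_check_collisions(parse_check_collisions("false"), -1) yields
0, while a "default" value with no environment override yields 1.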

 Documentation/config/core.txt | 68 +++++++++++++++++++++++++++++++++++
 builtin/index-pack.c          |  7 ++--
 cache.h                       |  1 +
 config.c                      | 20 +++++++++++
 config.h                      |  1 +
 environment.c                 |  1 +
 t/README                      |  3 ++
 t/t1060-object-corruption.sh  | 33 +++++++++++++++++
 t/t5300-pack-object.sh        | 10 ++++--
 9 files changed, 138 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index d0e6635fe0..1f5c891ccf 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -88,6 +88,74 @@ core.untrackedCache::
 	properly on your system.
 	See linkgit:git-update-index[1]. `keep` by default.
 
+core.checkCollisions::
+	When missing or set to `default` Git will assert when writing
+	a given object that it doesn't exist already anywhere else in
+	the object store (also accounting for
+	`GIT_ALTERNATE_OBJECT_DIRECTORIES` et al, see
+	linkgit:git[1]).
++
+The reasons for why this is on by default are:
++
+--
+. If there's ever a new SHA-1 collision attack similar to the
+  SHAttered attack (see https://shattered.io) Git can't be fooled into
+  replacing an existing known-good object with a new one with the same
+  SHA-1.
++
+Note that Git by default is built with a hardened version of SHA-1
+function with collision detection for attacks like the SHAttered
+attack (see link:technical/hash-function-transition.html[the hash
+function transition documentation]), but new future attacks might not
+be detected by the hardened SHA-1 code.
+
+. It serves as a canary for detecting some instances of repository
+  corruption. The type and size of the existing and new objects are
+  compared, if they differ Git will panic and abort. This can happen
+  e.g. if a loose object's content has been truncated or otherwise
+  mangled by filesystem corruption.
+--
++
+The reasons to disable this are, respectively:
++
+--
+. Doing the "does this object exist already?" check can be expensive,
+  and it's always cheaper to do nothing.
++
+Even on a very fast local disk (e.g. SSD) cloning a repository like
+git.git spends around 5% of its time just in `lstat()`. This
+percentage can get much higher (up to even hundreds of percents!) on
+network filesystems like NFS where metadata operations can be much
+slower.
++
+This is because with the collision check every object in an incoming
+packfile must be checked against any existing packfiles, as well as
+the loose object store (most of the `lstat()` time is spent on the
+latter). Git doesn't guarantee that some concurrent process isn't
+writing to the same repository during a `clone`. The same sort of
+slowdowns can be seen when doing a big fetch (lots of objects to write
+out).
+
+. If you have a corrupt local repository this check can prevent
+  repairing it by fetching a known-good version of the same object
+  from a remote repository. See the "repair a corrupted repo with
+  index-pack" test in the `t1060-object-corruption.sh` test in the git
+  source code.
+--
++
+Consider turning this off if you're more concerned about performance
+than you are about hypothetical future SHA-1 collisions or object
+corruption (linkgit:git-fsck[1] will also catch object
+corruption). This setting can also be disabled during specific
+phases/commands that can be bottlenecks, e.g. with `git -c
+core.checkCollisions=false clone [...]` for an initial clone on NFS.
++
+Setting this to `false` will disable object collision
+checking. I.e. the value can either be "default" or a boolean. Other
+values might be added in the future (e.g. for selectively disabling
+this just for "clone"), but now any non-boolean non-"default" values
+error out.
+
 core.checkStat::
 	When missing or is set to `default`, many fields in the stat
 	structure are checked to detect if a file has been modified
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 2004e25da2..4a3508aa9f 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -791,23 +791,24 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
 {
 	void *new_data = NULL;
 	int collision_test_needed = 0;
+	int do_coll_check = git_config_get_collision_check();
 
 	assert(data || obj_entry);
 
-	if (startup_info->have_repository) {
+	if (do_coll_check && startup_info->have_repository) {
 		read_lock();
 		collision_test_needed =
 			has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);
 		read_unlock();
 	}
 
-	if (collision_test_needed && !data) {
+	if (do_coll_check && collision_test_needed && !data) {
 		read_lock();
 		if (!check_collison(obj_entry))
 			collision_test_needed = 0;
 		read_unlock();
 	}
-	if (collision_test_needed) {
+	if (do_coll_check && collision_test_needed) {
 		void *has_data;
 		enum object_type has_type;
 		unsigned long has_size;
diff --git a/cache.h b/cache.h
index ca36b44ee0..a5f215f1fc 100644
--- a/cache.h
+++ b/cache.h
@@ -863,6 +863,7 @@ extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
 extern unsigned long big_file_threshold;
 extern unsigned long pack_size_limit_cfg;
+extern int check_collisions;
 
 /*
  * Accessors for the core.sharedrepository config which lazy-load the value
diff --git a/config.c b/config.c
index 2ffd39c220..641d0c537f 100644
--- a/config.c
+++ b/config.c
@@ -1354,6 +1354,14 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.checkcollisions")) {
+		if (!strcasecmp(value, "default"))
+			check_collisions = 1;
+		else
+			check_collisions = git_config_bool(var, value);
+		return 0;
+	}
+
 	/* Add other config variables here and to Documentation/config.txt. */
 	return platform_core_config(var, value, cb);
 }
@@ -2310,6 +2318,18 @@ int git_config_get_index_threads(void)
 	return 0; /* auto */
 }
 
+int git_config_get_collision_check(void)
+{
+	static int checked_env = 0;
+	if (!checked_env) {
+		int v = git_env_bool("GIT_TEST_CHECK_COLLISIONS", -1);
+		checked_env = 1;
+		if (v != -1)
+			check_collisions = v;
+	}
+	return check_collisions;
+}
+
 NORETURN
 void git_die_config_linenr(const char *key, const char *filename, int linenr)
 {
diff --git a/config.h b/config.h
index a06027e69b..4c6f6d9ae4 100644
--- a/config.h
+++ b/config.h
@@ -251,6 +251,7 @@ extern int git_config_get_split_index(void);
 extern int git_config_get_max_percent_split_change(void);
 extern int git_config_get_fsmonitor(void);
 extern int git_config_get_index_threads(void);
+extern int git_config_get_collision_check(void);
 
 /* This dies if the configured or default date is in the future */
 extern int git_config_get_expiry(const char *key, const char **output);
diff --git a/environment.c b/environment.c
index 3465597707..4a55a1f05f 100644
--- a/environment.c
+++ b/environment.c
@@ -21,6 +21,7 @@
 int trust_executable_bit = 1;
 int trust_ctime = 1;
 int check_stat = 1;
+int check_collisions = 1;
 int has_symlinks = 1;
 int minimum_abbrev = 4, default_abbrev = -1;
 int ignore_case;
diff --git a/t/README b/t/README
index 242497455f..1862a30279 100644
--- a/t/README
+++ b/t/README
@@ -348,6 +348,9 @@ GIT_TEST_MULTI_PACK_INDEX=<boolean>, when true, forces the multi-pack-
 index to be written after every 'git repack' command, and overrides the
 'core.multiPackIndex' setting to true.
 
+GIT_TEST_CHECK_COLLISIONS=<boolean>, when false, exercises the
+core.checkCollisions=false codepath.
+
 Naming Tests
 ------------
 
diff --git a/t/t1060-object-corruption.sh b/t/t1060-object-corruption.sh
index 4feb65157d..87e395d2ba 100755
--- a/t/t1060-object-corruption.sh
+++ b/t/t1060-object-corruption.sh
@@ -117,6 +117,7 @@ test_expect_failure 'clone --local detects misnamed objects' '
 '
 
 test_expect_success 'fetch into corrupted repo with index-pack' '
+	sane_unset GIT_TEST_CHECK_COLLISIONS &&
 	cp -R bit-error bit-error-cp &&
 	test_when_finished "rm -rf bit-error-cp" &&
 	(
@@ -127,4 +128,36 @@ test_expect_success 'fetch into corrupted repo with index-pack' '
 	)
 '
 
+test_expect_success 'repair a corrupted repo with index-pack' '
+	sane_unset GIT_TEST_CHECK_COLLISIONS &&
+	cp -R bit-error bit-error-cp &&
+	test_when_finished "rm -rf bit-error-cp" &&
+	(
+		cd bit-error-cp &&
+
+		# Have the corrupt object still and fsck complains
+		test_must_fail git cat-file blob HEAD:content.t &&
+		test_must_fail git fsck 2>stderr &&
+		test_i18ngrep "corrupt or missing" stderr &&
+
+		# Fetch the new object (as a pack). The transfer.unpackLimit=1
+		# setting here is important, we must end up with a pack, not a
+		# loose object. The latter would fail due to "exists? Do not
+		# bother" semantics unrelated to the collision check.
+		git -c transfer.unpackLimit=1 \
+			-c core.checkCollisions=false \
+			fetch ../no-bit-error 2>stderr &&
+
+		# fsck still complains, but we have the non-corrupt object
+		# (we lookup in packs first)
+		test_must_fail git fsck 2>stderr &&
+		test_i18ngrep "corrupt or missing" stderr &&
+		git cat-file blob HEAD:content.t &&
+
+		# A "gc" will remove the now-redundant and corrupt object
+		git gc &&
+		git fsck
+	)
+'
+
 test_done
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 410a09b0dd..ca109fff84 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -481,18 +481,22 @@ test_expect_success 'setup: fake a SHA1 hash collision' '
 '
 
 test_expect_success 'make sure index-pack detects the SHA1 collision' '
+	sane_unset GIT_TEST_CHECK_COLLISIONS &&
 	(
 		cd corrupt &&
-		test_must_fail git index-pack -o ../bad.idx ../test-3.pack 2>msg &&
-		test_i18ngrep "SHA1 COLLISION FOUND" msg
+		test_must_fail git index-pack -o good.idx ../test-3.pack 2>msg &&
+		test_i18ngrep "SHA1 COLLISION FOUND" msg &&
+		git -c core.checkCollisions=false index-pack -o good.idx ../test-3.pack
 	)
 '
 
 test_expect_success 'make sure index-pack detects the SHA1 collision (large blobs)' '
+	sane_unset GIT_TEST_CHECK_COLLISIONS &&
 	(
 		cd corrupt &&
 		test_must_fail git -c core.bigfilethreshold=1 index-pack -o ../bad.idx ../test-3.pack 2>msg &&
-		test_i18ngrep "SHA1 COLLISION FOUND" msg
+		test_i18ngrep "SHA1 COLLISION FOUND" msg &&
+		git -c core.checkCollisions=false -c core.bigfilethreshold=1 index-pack -o good.idx ../test-3.pack
 	)
 '
 
-- 
2.19.1.1182.g4ecb1133ce



* Re: [PATCH v3] index-pack: add ability to disable SHA-1 collision check
  2018-11-13 20:19               ` [PATCH v3] index-pack: add ability to disable SHA-1 collision check Ævar Arnfjörð Bjarmason
@ 2018-11-14  7:09                 ` Junio C Hamano
  2018-11-14 12:40                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 87+ messages in thread
From: Junio C Hamano @ 2018-11-14  7:09 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Jeff King, Geert Jansen, Christian Couder, Nicolas Pitre,
	Linus Torvalds, Petr Baudis

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Add a new core.checkCollisions setting. On by default, it can be set
> to 'false' to disable the check for existing objects in sha1_object().
> ...
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index 2004e25da2..4a3508aa9f 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -791,23 +791,24 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
>  {
>  	void *new_data = NULL;
>  	int collision_test_needed = 0;
> +	int do_coll_check = git_config_get_collision_check();
>  
>  	assert(data || obj_entry);
>  
> -	if (startup_info->have_repository) {
> +	if (do_coll_check && startup_info->have_repository) {
>  		read_lock();
>  		collision_test_needed =
>  			has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);
>  		read_unlock();
>  	}
>  
> -	if (collision_test_needed && !data) {
> +	if (do_coll_check && collision_test_needed && !data) {
>  		read_lock();
>  		if (!check_collison(obj_entry))
>  			collision_test_needed = 0;
>  		read_unlock();
>  	}
> -	if (collision_test_needed) {
> +	if (do_coll_check && collision_test_needed) {

If I am reading the patch correctly, the latter two changes are
totally unnecessary.  collision_test_needed is true only when
do_coll_check allowed the initial "do we even have that object?"
check to kick in, and is never set otherwise.
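
The redundancy can be checked mechanically. The following is a small
standalone C sketch, not code from index-pack.c: it reduces the
patched control flow to its boolean inputs and compares it with a
variant that keeps only the first guard. The function names and the
precheck_clears flag (standing in for the "!data" branch clearing the
flag) are assumptions made for illustration.

```c
#include <assert.h>

/* Control flow as in the v3 patch, reduced to its boolean inputs. */
int full_compare_v3(int do_coll_check, int have_repository,
		    int object_exists, int precheck_clears)
{
	int collision_test_needed = 0;

	if (do_coll_check && have_repository)
		collision_test_needed = object_exists;
	if (do_coll_check && collision_test_needed && precheck_clears)
		collision_test_needed = 0;
	/* 1 if the expensive full comparison would run. */
	return do_coll_check && collision_test_needed;
}

/* Same flow with only the first guard kept. */
int full_compare_simplified(int do_coll_check, int have_repository,
			    int object_exists, int precheck_clears)
{
	int collision_test_needed = 0;

	if (do_coll_check && have_repository)
		collision_test_needed = object_exists;
	if (collision_test_needed && precheck_clears)
		collision_test_needed = 0;
	return collision_test_needed;
}

/* Compare both variants over all 16 input combinations. */
void verify_guards_equivalent(void)
{
	for (int bits = 0; bits < 16; bits++) {
		int c = bits & 1, r = (bits >> 1) & 1;
		int e = (bits >> 2) & 1, p = (bits >> 3) & 1;

		assert(full_compare_v3(c, r, e, p) ==
		       full_compare_simplified(c, r, e, p));
	}
}
```

Since the flag starts at 0 and is only ever set inside the first
guarded block, the two variants agree for every input.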

I am not all that enthused about the idea of sending the wrong
message to our users, i.e. that it is sometimes OK to sacrifice the
security of collision detection.

A small change like this is easy to adjust to apply to the codebase,
even after today's codebase undergoes extensive modifications; quite
honestly, I'd prefer not having to worry about it so close to the
pre-release feature freeze.


* Re: [PATCH v3] index-pack: add ability to disable SHA-1 collision check
  2018-11-14  7:09                 ` Junio C Hamano
@ 2018-11-14 12:40                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-11-14 12:40 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Jeff King, Geert Jansen, Christian Couder, Nicolas Pitre,
	Linus Torvalds, Petr Baudis


On Wed, Nov 14 2018, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> Add a new core.checkCollisions setting. On by default, it can be set
>> to 'false' to disable the check for existing objects in sha1_object().
>> ...
>> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
>> index 2004e25da2..4a3508aa9f 100644
>> --- a/builtin/index-pack.c
>> +++ b/builtin/index-pack.c
>> @@ -791,23 +791,24 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
>>  {
>>  	void *new_data = NULL;
>>  	int collision_test_needed = 0;
>> +	int do_coll_check = git_config_get_collision_check();
>>
>>  	assert(data || obj_entry);
>>
>> -	if (startup_info->have_repository) {
>> +	if (do_coll_check && startup_info->have_repository) {
>>  		read_lock();
>>  		collision_test_needed =
>>  			has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);
>>  		read_unlock();
>>  	}
>>
>> -	if (collision_test_needed && !data) {
>> +	if (do_coll_check && collision_test_needed && !data) {
>>  		read_lock();
>>  		if (!check_collison(obj_entry))
>>  			collision_test_needed = 0;
>>  		read_unlock();
>>  	}
>> -	if (collision_test_needed) {
>> +	if (do_coll_check && collision_test_needed) {
>
> If I am reading the patch correctly, the latter two changes are
> totally unnecessary.  collision_test_needed is true only when
> do_coll_check allowed the initial "do we even have that object?"
> check to kick in, and is never set otherwise.

You're right. I was trying to do this in a few different ways and didn't
simplify this part.

> I am not all that enthused about the idea of sending the wrong
> message to our users, i.e. that it is sometimes OK to sacrifice the
> security of collision detection.

I think I've made the case in side-threads that, given the performance
numbers and the danger of an actual SHA-1 collision, this is something
other power users would be interested in having.

> A small change like this is easy to adjust to apply to the codebase,
> even after today's codebase undergoes extensive modifications; quite
> honestly, I'd prefer not having to worry about it so close to the
> pre-release feature freeze.

Yeah, let's definitely wait with this until after 2.20. I sent this out more
because I re-rolled it for an internal deployment, and wanted to get
some thoughts on what the plan should be for queuing up these two
related (no collision detection && loose cache) features.


* Re: [PATCH 8/9] sha1-file: use loose object cache for quick existence check
  2018-11-13 10:02                                 ` Ævar Arnfjörð Bjarmason
@ 2018-11-14 18:21                                   ` René Scharfe
  0 siblings, 0 replies; 87+ messages in thread
From: René Scharfe @ 2018-11-14 18:21 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Jeff King
  Cc: Geert Jansen, Junio C Hamano, git, Takuto Ikuta

Am 13.11.2018 um 11:02 schrieb Ævar Arnfjörð Bjarmason:
> So here's the same test not against NFS, but the local ext4 fs (CO7;
> Linux 3.10) for sha1collisiondetection.git:
> 
>     Test                                             origin/master     peff/jk/loose-cache        avar/check-collisions-config
>     --------------------------------------------------------------------------------------------------------------------------
>     0008.2: index-pack with 256*1 loose objects      0.02(0.02+0.00)   0.02(0.02+0.01) +0.0%      0.02(0.02+0.00) +0.0%
>     0008.3: index-pack with 256*10 loose objects     0.02(0.02+0.00)   0.03(0.03+0.00) +50.0%     0.02(0.02+0.00) +0.0%
>     0008.4: index-pack with 256*100 loose objects    0.02(0.02+0.00)   0.17(0.16+0.01) +750.0%    0.02(0.02+0.00) +0.0%
>     0008.5: index-pack with 256*250 loose objects    0.02(0.02+0.00)   0.43(0.40+0.03) +2050.0%   0.02(0.02+0.00) +0.0%
>     0008.6: index-pack with 256*500 loose objects    0.02(0.02+0.00)   0.88(0.80+0.09) +4300.0%   0.02(0.02+0.00) +0.0%
>     0008.7: index-pack with 256*750 loose objects    0.02(0.02+0.00)   1.35(1.27+0.09) +6650.0%   0.02(0.02+0.00) +0.0%
>     0008.8: index-pack with 256*1000 loose objects   0.02(0.02+0.00)   1.83(1.70+0.14) +9050.0%   0.02(0.02+0.00) +0.0%

Ouch.

> And for mu.git, a ~20k object repo:
> 
>     Test                                             origin/master     peff/jk/loose-cache       avar/check-collisions-config
>     -------------------------------------------------------------------------------------------------------------------------
>     0008.2: index-pack with 256*1 loose objects      0.59(0.91+0.06)   0.58(0.93+0.03) -1.7%     0.57(0.89+0.04) -3.4%
>     0008.3: index-pack with 256*10 loose objects     0.59(0.91+0.07)   0.59(0.92+0.03) +0.0%     0.57(0.89+0.03) -3.4%
>     0008.4: index-pack with 256*100 loose objects    0.59(0.91+0.05)   0.81(1.13+0.04) +37.3%    0.58(0.91+0.04) -1.7%
>     0008.5: index-pack with 256*250 loose objects    0.59(0.91+0.05)   1.23(1.51+0.08) +108.5%   0.58(0.91+0.04) -1.7%
>     0008.6: index-pack with 256*500 loose objects    0.59(0.90+0.06)   1.96(2.20+0.12) +232.2%   0.58(0.91+0.04) -1.7%
>     0008.7: index-pack with 256*750 loose objects    0.59(0.92+0.05)   2.72(2.92+0.17) +361.0%   0.58(0.90+0.04) -1.7%
>     0008.8: index-pack with 256*1000 loose objects   0.59(0.90+0.06)   3.50(3.67+0.21) +493.2%   0.57(0.90+0.04) -3.4%
> 
> All of which is to say that I think it definitely makes sense to re-roll
> this with a perf test, and a switch to toggle it + docs explaining the
> caveats & pointing to the perf test. It's a clear win in some scenarios,
> but a big loss in others.

Right, but can we perhaps find a way to toggle it automatically, like
the special fetch-pack cache tried to do?

So the code needs to decide between using lstat() on individual loose
objects files or opendir()+readdir() to load the names in a whole
fan-out directory.  Intuitively I'd try to solve it using red tape, by
measuring the duration of both kinds of calls, and then try to find a
heuristic based on those numbers.  Is the overhead worth it?
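
As a rough illustration of the measurement idea above, here is a
standalone C sketch that times a single lstat() probe and a full
opendir()+readdir() pass over one directory. It is not taken from any
patch in this thread; the helper names and the break-even rule in
prefer_dir_scan() are assumptions made for illustration, and a real
heuristic would need repeated samples and cache effects taken into
account.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <dirent.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* Seconds between two CLOCK_MONOTONIC samples. */
static double seconds_between(const struct timespec *a,
			      const struct timespec *b)
{
	return (double)(b->tv_sec - a->tv_sec) +
	       (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/* Time one lstat() probe; *exists is set to 1 if path exists. */
double time_lstat_probe(const char *path, int *exists)
{
	struct timespec t0, t1;
	struct stat st;

	assert(path && exists);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	*exists = !lstat(path, &st);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return seconds_between(&t0, &t1);
}

/*
 * Time a full scan of dir; *entries receives the number of names
 * other than "." and "..", or -1 if the directory cannot be opened.
 */
double time_readdir_scan(const char *dir, long *entries)
{
	struct timespec t0, t1;
	struct dirent *de;
	DIR *d;

	assert(dir && entries);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	d = opendir(dir);
	if (!d) {
		*entries = -1;
	} else {
		*entries = 0;
		while ((de = readdir(d)) != NULL)
			if (strcmp(de->d_name, ".") && strcmp(de->d_name, ".."))
				(*entries)++;
		closedir(d);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return seconds_between(&t0, &t1);
}

/* Hypothetical rule: scan once if the expected probes cost more. */
int prefer_dir_scan(double lstat_cost, double scan_cost,
		    long expected_probes)
{
	return lstat_cost * (double)expected_probes > scan_cost;
}
```

The break-even rule only says that one scan pays off once the number
of expected existence checks in that fan-out directory, multiplied by
the per-probe cost, exceeds the cost of the scan.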

But first, may I interest you in some further complication?  We can
also use access(2) to check for the existence of files.  It doesn't
need to fill in struct stat, so may have a slight advantage if we
don't need any of that information.  The following patch is a
replacement for patch 8 and improves performance by ca. 3% with
git.git on an SSD for me; I'm curious to see how it does on NFS:

---
 sha1-file.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/sha1-file.c b/sha1-file.c
index b77dacedc7..5315c37cbc 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -888,8 +888,13 @@ static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
 	prepare_alt_odb(r);
 	for (odb = r->objects->odb; odb; odb = odb->next) {
 		*path = odb_loose_path(odb, &buf, sha1);
-		if (!lstat(*path, st))
-			return 0;
+		if (st) {
+			if (!lstat(*path, st))
+				return 0;
+		} else {
+			if (!access(*path, F_OK))
+				return 0;
+		}
 	}
 
 	return -1;
@@ -1171,7 +1176,8 @@ static int sha1_loose_object_info(struct repository *r,
 	if (!oi->typep && !oi->type_name && !oi->sizep && !oi->contentp) {
 		const char *path;
 		struct stat st;
-		if (stat_sha1_file(r, sha1, &st, &path) < 0)
+		struct stat *stp = oi->disk_sizep ? &st : NULL;
+		if (stat_sha1_file(r, sha1, stp, &path) < 0)
 			return -1;
 		if (oi->disk_sizep)
 			*oi->disk_sizep = st.st_size;
@@ -1382,7 +1388,6 @@ void *read_object_file_extended(const struct object_id *oid,
 	void *data;
 	const struct packed_git *p;
 	const char *path;
-	struct stat st;
 	const struct object_id *repl = lookup_replace ?
 		lookup_replace_object(the_repository, oid) : oid;
 
@@ -1399,7 +1404,7 @@ void *read_object_file_extended(const struct object_id *oid,
 		die(_("replacement %s not found for %s"),
 		    oid_to_hex(repl), oid_to_hex(oid));
 
-	if (!stat_sha1_file(the_repository, repl->hash, &st, &path))
+	if (!stat_sha1_file(the_repository, repl->hash, NULL, &path))
 		die(_("loose object %s (stored in %s) is corrupt"),
 		    oid_to_hex(repl), path);
 
-- 
2.19.1



Thread overview: 87+ messages
2018-10-25 18:38 [RFC PATCH] index-pack: improve performance on NFS Jansen, Geert
2018-10-26  0:21 ` Junio C Hamano
2018-10-26 20:38   ` Ævar Arnfjörð Bjarmason
2018-10-27  7:26     ` Junio C Hamano
2018-10-27  9:33       ` Jeff King
2018-10-27 11:22         ` Ævar Arnfjörð Bjarmason
2018-10-28 22:50           ` [PATCH 0/4] index-pack: optionally turn off SHA-1 collision checking Ævar Arnfjörð Bjarmason
2018-10-30  2:49             ` Geert Jansen
2018-10-30  9:04               ` Junio C Hamano
2018-10-30 18:43             ` [PATCH v2 0/3] index-pack: test updates Ævar Arnfjörð Bjarmason
2018-11-13 20:19               ` [PATCH v3] index-pack: add ability to disable SHA-1 collision check Ævar Arnfjörð Bjarmason
2018-11-14  7:09                 ` Junio C Hamano
2018-11-14 12:40                   ` Ævar Arnfjörð Bjarmason
2018-10-30 18:43             ` [PATCH v2 1/3] pack-objects test: modernize style Ævar Arnfjörð Bjarmason
2018-10-30 18:43             ` [PATCH v2 2/3] pack-objects tests: don't leave test .git corrupt at end Ævar Arnfjörð Bjarmason
2018-10-30 18:43             ` [PATCH v2 3/3] index-pack tests: don't leave test repo dirty " Ævar Arnfjörð Bjarmason
2018-10-28 22:50           ` [PATCH 1/4] pack-objects test: modernize style Ævar Arnfjörð Bjarmason
2018-10-28 22:50           ` [PATCH 2/4] pack-objects tests: don't leave test .git corrupt at end Ævar Arnfjörð Bjarmason
2018-10-28 22:50           ` [PATCH 3/4] index-pack tests: don't leave test repo dirty " Ævar Arnfjörð Bjarmason
2018-10-28 22:50           ` [PATCH 4/4] index-pack: add ability to disable SHA-1 collision check Ævar Arnfjörð Bjarmason
2018-10-29 15:04           ` [RFC PATCH] index-pack: improve performance on NFS Jeff King
2018-10-29 15:09             ` Jeff King
2018-10-29 19:36             ` Ævar Arnfjörð Bjarmason
2018-10-29 23:27               ` Jeff King
2018-11-07 22:55                 ` Geert Jansen
2018-11-08 12:02                   ` Jeff King
2018-11-08 20:58                     ` Geert Jansen
2018-11-08 21:18                       ` Jeff King
2018-11-08 21:55                         ` Geert Jansen
2018-11-08 22:20                     ` Ævar Arnfjörð Bjarmason
2018-11-09 10:11                       ` Ævar Arnfjörð Bjarmason
2018-11-12 14:31                       ` Jeff King
2018-11-12 14:46                     ` [PATCH 0/9] caching loose objects Jeff King
2018-11-12 14:46                       ` [PATCH 1/9] fsck: do not reuse child_process structs Jeff King
2018-11-12 15:26                         ` Derrick Stolee
2018-11-12 14:47                       ` [PATCH 2/9] submodule--helper: prefer strip_suffix() to ends_with() Jeff King
2018-11-12 18:23                         ` Stefan Beller
2018-11-12 14:48                       ` [PATCH 3/9] rename "alternate_object_database" to "object_directory" Jeff King
2018-11-12 15:30                         ` Derrick Stolee
2018-11-12 15:36                           ` Jeff King
2018-11-12 19:41                             ` Ramsay Jones
2018-11-12 14:48                       ` [PATCH 4/9] sha1_file_name(): overwrite buffer instead of appending Jeff King
2018-11-12 15:32                         ` Derrick Stolee
2018-11-12 14:49                       ` [PATCH 5/9] handle alternates paths the same as the main object dir Jeff King
2018-11-12 15:38                         ` Derrick Stolee
2018-11-12 15:46                           ` Jeff King
2018-11-12 15:50                             ` Derrick Stolee
2018-11-12 14:50                       ` [PATCH 6/9] sha1-file: use an object_directory for " Jeff King
2018-11-12 15:48                         ` Derrick Stolee
2018-11-12 16:09                           ` Jeff King
2018-11-12 19:04                             ` Stefan Beller
2018-11-12 18:48                           ` Stefan Beller
2018-11-12 14:50                       ` [PATCH 7/9] object-store: provide helpers for loose_objects_cache Jeff King
2018-11-12 19:24                         ` René Scharfe
2018-11-12 20:16                           ` Jeff King
2018-11-12 14:54                       ` [PATCH 8/9] sha1-file: use loose object cache for quick existence check Jeff King
2018-11-12 16:00                         ` Derrick Stolee
2018-11-12 16:01                         ` Ævar Arnfjörð Bjarmason
2018-11-12 16:21                           ` Jeff King
2018-11-12 22:18                             ` Ævar Arnfjörð Bjarmason
2018-11-12 22:30                               ` Ævar Arnfjörð Bjarmason
2018-11-13 10:02                                 ` Ævar Arnfjörð Bjarmason
2018-11-14 18:21                                   ` René Scharfe
2018-11-12 22:44                             ` Geert Jansen
2018-11-12 14:55                       ` [PATCH 9/9] fetch-pack: drop custom loose object cache Jeff King
2018-11-12 19:25                         ` René Scharfe
2018-11-12 19:32                           ` Ævar Arnfjörð Bjarmason
2018-11-12 20:07                             ` Jeff King
2018-11-12 20:13                             ` René Scharfe
2018-11-12 16:02                       ` [PATCH 0/9] caching loose objects Derrick Stolee
2018-11-12 19:10                         ` Stefan Beller
2018-11-09 13:43                   ` [RFC PATCH] index-pack: improve performance on NFS Ævar Arnfjörð Bjarmason
2018-11-09 16:08                     ` Duy Nguyen
2018-11-10 14:04                       ` Ævar Arnfjörð Bjarmason
2018-11-12 14:34                         ` Jeff King
2018-11-12 22:58                     ` Geert Jansen
2018-10-27 14:04         ` Duy Nguyen
2018-10-29 15:18           ` Jeff King
2018-10-29  0:48         ` Junio C Hamano
2018-10-29 15:20           ` Jeff King
2018-10-29 18:43             ` Ævar Arnfjörð Bjarmason
2018-10-29 21:34           ` Geert Jansen
2018-10-29 21:50             ` Jeff King
2018-10-29 22:21               ` Geert Jansen
2018-10-29 22:27             ` Jeff King
2018-10-29 22:35               ` Stefan Beller
2018-10-29 23:29                 ` Jeff King
