git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / Atom feed
* [PATCH] packfile.c: speed up loading lots of packfiles.
@ 2019-11-27 22:24 Colin Stolley
  2019-11-28  0:42 ` hashmap vs khash? " Eric Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Colin Stolley @ 2019-11-27 22:24 UTC (permalink / raw)
  To: git

When loading packfiles on start-up, we traverse the internal packfile
list once per file to avoid reloading packfiles that have already
been loaded. This check runs in quadratic time, so for poorly
maintained repos with a large number of packfiles, it can be pretty
slow.

Add a hashmap containing the packfile names as we load them so that
the average runtime cost of checking for already-loaded packs becomes
constant.

Add a perf test to p5303 to show speed-up.

The existing p5303 test runtimes are dominated by other factors and do
not show an appreciable speed-up. The new test in p5303 clearly exposes
a speed-up in bad cases. In this test we create 10,000 packfiles and
measure the start-up time of git rev-parse, which does little else
besides load in the packs.

Here are the numbers for the new p5303 test:

Test                         HEAD^             HEAD
---------------------------------------------------------------------
5303.12: load 10,000 packs   1.03(0.92+0.10)   0.12(0.02+0.09) -88.3%

Thanks-to: Jeff King <peff@peff.net>
Signed-off-by: Colin Stolley <cstolley@runbox.com>
---
 object-store.h             | 21 +++++++++++++++++++++
 object.c                   |  3 +++
 packfile.c                 | 21 +++++++++++----------
 t/perf/p5303-many-packs.sh | 18 ++++++++++++++++++
 4 files changed, 53 insertions(+), 10 deletions(-)

diff --git a/object-store.h b/object-store.h
index 7f7b3cdd80..55ee639350 100644
--- a/object-store.h
+++ b/object-store.h
@@ -60,6 +60,7 @@ struct oid_array *odb_loose_cache(struct object_directory *odb,
 void odb_clear_loose_cache(struct object_directory *odb);
 
 struct packed_git {
+	struct hashmap_entry packmap_ent;
 	struct packed_git *next;
 	struct list_head mru;
 	struct pack_window *windows;
@@ -88,6 +89,20 @@ struct packed_git {
 
 struct multi_pack_index;
 
+static inline int pack_map_entry_cmp(const void *unused_cmp_data,
+				     const struct hashmap_entry *entry,
+				     const struct hashmap_entry *entry2,
+				     const void *keydata)
+{
+	const char *key = keydata;
+	const struct packed_git *pg1, *pg2;
+
+	pg1 = container_of(entry, const struct packed_git, packmap_ent);
+	pg2 = container_of(entry2, const struct packed_git, packmap_ent);
+
+	return strcmp(pg1->pack_name, key ? key : pg2->pack_name);
+}
+
 struct raw_object_store {
 	/*
 	 * Set of all object directories; the main directory is first (and
@@ -131,6 +146,12 @@ struct raw_object_store {
 	/* A most-recently-used ordered version of the packed_git list. */
 	struct list_head packed_git_mru;
 
+	/*
+	 * A map of packfiles to packed_git structs for tracking which
+	 * packs have been loaded already.
+	 */
+	struct hashmap pack_map;
+
 	/*
 	 * A fast, rough count of the number of objects in the repository.
 	 * These two fields are not meant for direct access. Use
diff --git a/object.c b/object.c
index 3b8b8c55c9..142ef69399 100644
--- a/object.c
+++ b/object.c
@@ -479,6 +479,7 @@ struct raw_object_store *raw_object_store_new(void)
 
 	memset(o, 0, sizeof(*o));
 	INIT_LIST_HEAD(&o->packed_git_mru);
+	hashmap_init(&o->pack_map, pack_map_entry_cmp, NULL, 0);
 	return o;
 }
 
@@ -518,6 +519,8 @@ void raw_object_store_clear(struct raw_object_store *o)
 	INIT_LIST_HEAD(&o->packed_git_mru);
 	close_object_store(o);
 	o->packed_git = NULL;
+
+	hashmap_free(&o->pack_map);
 }
 
 void parsed_object_pool_clear(struct parsed_object_pool *o)
diff --git a/packfile.c b/packfile.c
index 355066de17..253559fa87 100644
--- a/packfile.c
+++ b/packfile.c
@@ -856,20 +856,21 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 
 	if (strip_suffix_mem(full_name, &base_len, ".idx") &&
 	    !(data->m && midx_contains_pack(data->m, file_name))) {
-		/* Don't reopen a pack we already have. */
-		for (p = data->r->objects->packed_git; p; p = p->next) {
-			size_t len;
-			if (strip_suffix(p->pack_name, ".pack", &len) &&
-			    len == base_len &&
-			    !memcmp(p->pack_name, full_name, len))
-				break;
-		}
+		struct hashmap_entry hent;
+		char *pack_name = xstrfmt("%.*s.pack", (int)base_len, full_name);
+		unsigned int hash = strhash(pack_name);
+		hashmap_entry_init(&hent, hash);
 
-		if (!p) {
+		/* Don't reopen a pack we already have. */
+		if (!hashmap_get(&data->r->objects->pack_map, &hent, pack_name)) {
 			p = add_packed_git(full_name, full_name_len, data->local);
-			if (p)
+			if (p) {
+				hashmap_entry_init(&p->packmap_ent, hash);
+				hashmap_add(&data->r->objects->pack_map, &p->packmap_ent);
 				install_packed_git(data->r, p);
+			}
 		}
+		free(pack_name);
 	}
 
 	if (!report_garbage)
diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index 3779851941..ede78e19e2 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -84,4 +84,22 @@ do
 	'
 done
 
+# Measure pack loading with 10,000 packs.
+test_expect_success 'generate lots of packs' '
+	for i in $(test_seq 10000); do
+		echo "blob"
+		echo "data <<EOF"
+		echo "blob $i"
+		echo "EOF"
+		echo "checkpoint"
+	done |
+	git -c fastimport.unpackLimit=0 fast-import
+'
+
+# The purpose of this test is to evaluate load time for a large number
+# of packs while doing as little other work as possible.
+test_perf "load 10,000 packs" '
+	git rev-parse --verify "HEAD^{commit}"
+'
+
 test_done
-- 
2.23.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* hashmap vs khash? Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  2019-11-27 22:24 [PATCH] packfile.c: speed up loading lots of packfiles Colin Stolley
@ 2019-11-28  0:42 ` " Eric Wong
  2019-11-30 17:36   ` Junio C Hamano
  2019-12-02 14:39   ` Jeff King
  2019-12-02 17:40 ` SZEDER Gábor
  2019-12-03  6:19 ` Taylor Blau
  2 siblings, 2 replies; 15+ messages in thread
From: Eric Wong @ 2019-11-28  0:42 UTC (permalink / raw)
  To: Colin Stolley; +Cc: git

Colin Stolley <cstolley@runbox.com> wrote:
> When loading packfiles on start-up, we traverse the internal packfile
> list once per file to avoid reloading packfiles that have already
> been loaded. This check runs in quadratic time, so for poorly
> maintained repos with a large number of packfiles, it can be pretty
> slow.

Cool!  Thanks for looking into this, and I've been having
trouble in that department with big alternates files.

> Add a hashmap containing the packfile names as we load them so that
> the average runtime cost of checking for already-loaded packs becomes
> constant.

Btw, would you have time to do a comparison against khash?

AFAIK hashmap predates khash in git; and hashmap was optimized
for removal.   Removals don't seem to be a problem for pack
loading.

I'm interested in exploring the removing of hashmap entirely in
favor of khash to keep our codebase smaller and easier-to-learn.
khash shows up more in other projects, and ought to have better
cache-locality.

Thanks again.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: hashmap vs khash? Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  2019-11-28  0:42 ` hashmap vs khash? " Eric Wong
@ 2019-11-30 17:36   ` Junio C Hamano
  2019-12-02 14:39   ` Jeff King
  1 sibling, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2019-11-30 17:36 UTC (permalink / raw)
  To: Eric Wong; +Cc: Colin Stolley, git

Eric Wong <e@80x24.org> writes:

> Colin Stolley <cstolley@runbox.com> wrote:
>> When loading packfiles on start-up, we traverse the internal packfile
>> list once per file to avoid reloading packfiles that have already
>> been loaded. This check runs in quadratic time, so for poorly
>> maintained repos with a large number of packfiles, it can be pretty
>> slow.
>
> Cool!  Thanks for looking into this, and I've been having
> trouble in that department with big alternates files.
>
>> Add a hashmap containing the packfile names as we load them so that
>> the average runtime cost of checking for already-loaded packs becomes
>> constant.
>
> Btw, would you have time to do a comparison against khash?
>
> AFAIK hashmap predates khash in git; and hashmap was optimized
> for removal.   Removals don't seem to be a problem for pack
> loading.
>
> I'm interested in exploring the removing of hashmap entirely in
> favor of khash to keep our codebase smaller and easier-to-learn.
> khash shows up more in other projects, and ought to have better
> cache-locality.

Sounds fun ;-) and useful.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: hashmap vs khash? Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  2019-11-28  0:42 ` hashmap vs khash? " Eric Wong
  2019-11-30 17:36   ` Junio C Hamano
@ 2019-12-02 14:39   ` Jeff King
  1 sibling, 0 replies; 15+ messages in thread
From: Jeff King @ 2019-12-02 14:39 UTC (permalink / raw)
  To: Eric Wong; +Cc: Colin Stolley, git

On Thu, Nov 28, 2019 at 12:42:02AM +0000, Eric Wong wrote:

> > Add a hashmap containing the packfile names as we load them so that
> > the average runtime cost of checking for already-loaded packs becomes
> > constant.
> 
> Btw, would you have time to do a comparison against khash?
> 
> AFAIK hashmap predates khash in git; and hashmap was optimized
> for removal.   Removals don't seem to be a problem for pack
> loading.

Actually, they came around simultaneously. I think hashmap.[ch] was
mostly a response to our open-coded hashes, like the one in object.c
(which still uses neither of the reusable forms!). Those didn't handle
removal at all. khash does handle removal, though you pay a price in
tombstone entries until the next resize operation.

> I'm interested in exploring the removing of hashmap entirely in
> favor of khash to keep our codebase smaller and easier-to-learn.
> khash shows up more in other projects, and ought to have better
> cache-locality.

I have been tempted to push for that, too. Every timing I have ever done
shows khash as faster (though for a trivial use like this one, I would
be quite surprised if it mattered either way).

My hesitation is that khash can be harder to debug because of the macro
implementation. But I have rarely needed to look beneath its API.

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  2019-11-27 22:24 [PATCH] packfile.c: speed up loading lots of packfiles Colin Stolley
  2019-11-28  0:42 ` hashmap vs khash? " Eric Wong
@ 2019-12-02 17:40 ` SZEDER Gábor
  2019-12-02 19:42   ` Jeff King
  2019-12-03  6:19 ` Taylor Blau
  2 siblings, 1 reply; 15+ messages in thread
From: SZEDER Gábor @ 2019-12-02 17:40 UTC (permalink / raw)
  To: Colin Stolley; +Cc: git

On Wed, Nov 27, 2019 at 04:24:53PM -0600, Colin Stolley wrote:
> When loading packfiles on start-up, we traverse the internal packfile
> list once per file to avoid reloading packfiles that have already
> been loaded. This check runs in quadratic time, so for poorly
> maintained repos with a large number of packfiles, it can be pretty
> slow.
> 
> Add a hashmap containing the packfile names as we load them so that
> the average runtime cost of checking for already-loaded packs becomes
> constant.
> 
> Add a perf test to p5303 to show speed-up.
> 
> The existing p5303 test runtimes are dominated by other factors and do
> not show an appreciable speed-up. The new test in p5303 clearly exposes
> a speed-up in bad cases. In this test we create 10,000 packfiles and
> measure the start-up time of git rev-parse, which does little else
> besides load in the packs.
> 
> Here are the numbers for the new p5303 test:
> 
> Test                         HEAD^             HEAD
> ---------------------------------------------------------------------
> 5303.12: load 10,000 packs   1.03(0.92+0.10)   0.12(0.02+0.09) -88.3%
> 
> Thanks-to: Jeff King <peff@peff.net>
> Signed-off-by: Colin Stolley <cstolley@runbox.com>
> ---

This patch breaks test 'gc --keep-largest-pack' in 't6500-gc.sh' when
run with GIT_TEST_MULTI_PACK_INDEX=1, because there is a duplicate
entry in '.git/objects/info/packs':

  expecting success of 6500.7 'gc --keep-largest-pack':
          test_create_repo keep-pack &&
          (
                  cd keep-pack &&
                  test_commit one &&
                  test_commit two &&
                  test_commit three &&
                  git gc &&
                  ( cd .git/objects/pack && ls *.pack ) >pack-list &&
                  test_line_count = 1 pack-list &&
                  BASE_PACK=.git/objects/pack/pack-*.pack &&
                  test_commit four &&
                  git repack -d &&
                  test_commit five &&
                  git repack -d &&
                  ( cd .git/objects/pack && ls *.pack ) >pack-list &&
                  test_line_count = 3 pack-list &&
                  git gc --keep-largest-pack &&
                  ( cd .git/objects/pack && ls *.pack ) >pack-list &&
                  test_line_count = 2 pack-list &&
                  awk "/^P /{print \$2}" <.git/objects/info/packs >pack-info &&
                  test_line_count = 2 pack-info &&
                  test_path_is_file $BASE_PACK &&
                  git fsck
          )
  
  + test_create_repo keep-pack
  Initialized empty Git repository in /home/szeder/src/git/t/trash directory.t6500-gc/keep-pack/.git/
  + cd keep-pack
  + test_commit one
  [master (root-commit) d79ce16] one
   Author: A U Thor <author@example.com>
   1 file changed, 1 insertion(+)
   create mode 100644 one.t
  + test_commit two
  [master 139b20d] two
   Author: A U Thor <author@example.com>
   1 file changed, 1 insertion(+)
   create mode 100644 two.t
  + test_commit three
  [master 7c7cd71] three
   Author: A U Thor <author@example.com>
   1 file changed, 1 insertion(+)
   create mode 100644 three.t
  + git gc
  Computing commit graph generation numbers:  33% (1/3)^MComputing commit graph generation numbers:  66% (2/3)^MComputing commit graph generation numbers: 100% (3/3)^MComputing commit graph generation numbers: 100% (3/3), done.
  + cd .git/objects/pack
  + ls pack-a4b37b9b5458e8116b1c1840185b39fb5e6b8726.pack
  + test_line_count = 1 pack-list
  + BASE_PACK=.git/objects/pack/pack-*.pack
  + test_commit four
  [master fd8d77e] four
   Author: A U Thor <author@example.com>
   1 file changed, 1 insertion(+)
   create mode 100644 four.t
  + git repack -d
  + test_commit five
  [master a383792] five
   Author: A U Thor <author@example.com>
   1 file changed, 1 insertion(+)
   create mode 100644 five.t
  + git repack -d
  + cd .git/objects/pack
  + ls pack-057d7f493a7c26d58090f4777ff66d4c226c4408.pack pack-54feec766fc7d2d204b03879d96f4595d7e48c37.pack pack-a4b37b9b5458e8116b1c1840185b39fb5e6b8726.pack
  + test_line_count = 3 pack-list
  + git gc --keep-largest-pack
  Computing commit graph generation numbers:  20% (1/5)^MComputing commit graph generation numbers:  40% (2/5)^MComputing commit graph generation numbers:  60% (3/5)^MComputing commit graph generation numbers:  80% (4/5)^MComputing commit graph generation numbers: 100% (5/5)^MComputing commit graph generation numbers: 100% (5/5), done.
  + cd .git/objects/pack
  + ls pack-390dbbb8e27c014b080c08dfc482d4982d4c6644.pack pack-a4b37b9b5458e8116b1c1840185b39fb5e6b8726.pack
  + test_line_count = 2 pack-list
  + awk /^P /{print $2}
  + test_line_count = 2 pack-info
  test_line_count: line count for pack-info != 2
  pack-a4b37b9b5458e8116b1c1840185b39fb5e6b8726.pack
  pack-a4b37b9b5458e8116b1c1840185b39fb5e6b8726.pack
  pack-390dbbb8e27c014b080c08dfc482d4982d4c6644.pack
  error: last command exited with $?=1
  not ok 7 - gc --keep-largest-pack



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  2019-12-02 17:40 ` SZEDER Gábor
@ 2019-12-02 19:42   ` Jeff King
  2019-12-03  6:17     ` Taylor Blau
  2019-12-03 16:04     ` Junio C Hamano
  0 siblings, 2 replies; 15+ messages in thread
From: Jeff King @ 2019-12-02 19:42 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Colin Stolley, git

On Mon, Dec 02, 2019 at 06:40:35PM +0100, SZEDER Gábor wrote:

> > When loading packfiles on start-up, we traverse the internal packfile
> > list once per file to avoid reloading packfiles that have already
> > been loaded. This check runs in quadratic time, so for poorly
> > maintained repos with a large number of packfiles, it can be pretty
> > slow.
> > 
> > Add a hashmap containing the packfile names as we load them so that
> > the average runtime cost of checking for already-loaded packs becomes
> > constant.
> [...]
> This patch breaks test 'gc --keep-largest-pack' in 't6500-gc.sh' when
> run with GIT_TEST_MULTI_PACK_INDEX=1, because there is a duplicate
> entry in '.git/objects/info/packs':

Good catch. The issue is that we only add entries to the hashmap in
prepare_packed_git(), but they may be added to the pack list by other
callers of install_packed_git(). It probably makes sense to just push
the hashmap maintenance down into that function, like below. That
requires an extra strhash() when inserting a new pack, but I don't think
that's a big deal.

diff --git a/packfile.c b/packfile.c
index 253559fa87..f0dc63e92f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -757,6 +757,9 @@ void install_packed_git(struct repository *r, struct packed_git *pack)
 
 	pack->next = r->objects->packed_git;
 	r->objects->packed_git = pack;
+
+	hashmap_entry_init(&pack->packmap_ent, strhash(pack->pack_name));
+	hashmap_add(&r->objects->pack_map, &pack->packmap_ent);
 }
 
 void (*report_garbage)(unsigned seen_bits, const char *path);
@@ -864,11 +867,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 		/* Don't reopen a pack we already have. */
 		if (!hashmap_get(&data->r->objects->pack_map, &hent, pack_name)) {
 			p = add_packed_git(full_name, full_name_len, data->local);
-			if (p) {
-				hashmap_entry_init(&p->packmap_ent, hash);
-				hashmap_add(&data->r->objects->pack_map, &p->packmap_ent);
+			if (p)
 				install_packed_git(data->r, p);
-			}
 		}
 		free(pack_name);
 	}

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  2019-12-02 19:42   ` Jeff King
@ 2019-12-03  6:17     ` Taylor Blau
  2019-12-03 15:34       ` Jeff King
  2019-12-03 16:04     ` Junio C Hamano
  1 sibling, 1 reply; 15+ messages in thread
From: Taylor Blau @ 2019-12-03  6:17 UTC (permalink / raw)
  To: Jeff King; +Cc: SZEDER Gábor, Colin Stolley, git

On Mon, Dec 02, 2019 at 02:42:31PM -0500, Jeff King wrote:
> On Mon, Dec 02, 2019 at 06:40:35PM +0100, SZEDER Gábor wrote:
>
> > > When loading packfiles on start-up, we traverse the internal packfile
> > > list once per file to avoid reloading packfiles that have already
> > > been loaded. This check runs in quadratic time, so for poorly
> > > maintained repos with a large number of packfiles, it can be pretty
> > > slow.
> > >
> > > Add a hashmap containing the packfile names as we load them so that
> > > the average runtime cost of checking for already-loaded packs becomes
> > > constant.
> > [...]
> > This patch breaks test 'gc --keep-largest-pack' in 't6500-gc.sh' when
> > run with GIT_TEST_MULTI_PACK_INDEX=1, because there is a duplicate
> > entry in '.git/objects/info/packs':
>
> Good catch. The issue is that we only add entries to the hashmap in
> prepare_packed_git(), but they may be added to the pack list by other
> callers of install_packed_git(). It probably makes sense to just push
> the hashmap maintenance down into that function, like below. That
> requires an extra strhash() when inserting a new pack, but I don't think
> that's a big deal.

Ah, great catch, and thanks for pointing it out. We have been running
this patch in production at GitHub for a few weeks now, but didn't
notice this because we never run tests with
'GIT_TEST_MULTI_PACK_INDEX=1' in the environment.

Perhaps in the future that might change, but I think that for now that
can explain why the failure wasn't noticed earlier.

> diff --git a/packfile.c b/packfile.c
> index 253559fa87..f0dc63e92f 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -757,6 +757,9 @@ void install_packed_git(struct repository *r, struct packed_git *pack)
>
>  	pack->next = r->objects->packed_git;
>  	r->objects->packed_git = pack;
> +
> +	hashmap_entry_init(&pack->packmap_ent, strhash(pack->pack_name));
> +	hashmap_add(&r->objects->pack_map, &pack->packmap_ent);
>  }
>
>  void (*report_garbage)(unsigned seen_bits, const char *path);
> @@ -864,11 +867,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
>  		/* Don't reopen a pack we already have. */
>  		if (!hashmap_get(&data->r->objects->pack_map, &hent, pack_name)) {
>  			p = add_packed_git(full_name, full_name_len, data->local);
> -			if (p) {
> -				hashmap_entry_init(&p->packmap_ent, hash);
> -				hashmap_add(&data->r->objects->pack_map, &p->packmap_ent);
> +			if (p)
>  				install_packed_git(data->r, p);
> -			}
>  		}
>  		free(pack_name);
>  	}
>
> -Peff
Thanks,
Taylor

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  2019-11-27 22:24 [PATCH] packfile.c: speed up loading lots of packfiles Colin Stolley
  2019-11-28  0:42 ` hashmap vs khash? " Eric Wong
  2019-12-02 17:40 ` SZEDER Gábor
@ 2019-12-03  6:19 ` Taylor Blau
  2 siblings, 0 replies; 15+ messages in thread
From: Taylor Blau @ 2019-12-03  6:19 UTC (permalink / raw)
  To: Colin Stolley; +Cc: git

Hi Colin,

On Wed, Nov 27, 2019 at 04:24:53PM -0600, Colin Stolley wrote:
> When loading packfiles on start-up, we traverse the internal packfile
> list once per file to avoid reloading packfiles that have already
> been loaded. This check runs in quadratic time, so for poorly
> maintained repos with a large number of packfiles, it can be pretty
> slow.

Thanks for sending your patches to the list, and for working on this! I
have been most excited to see this develop internally, and I'm glad that
the new paths have been exercised enough to send the patches.

So, I'm quite interested to see what you find in terms of wiring this
code up to work with khash. If that's something that you don't have time
for, please do let me know and I'd be happy to take it up myself.

Other than that, this has been thoroughly reviewed internally, and so I
don't have much more to add to this patch than what we have already
discussed.


Thanks,
Taylor

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  2019-12-03  6:17     ` Taylor Blau
@ 2019-12-03 15:34       ` Jeff King
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2019-12-03 15:34 UTC (permalink / raw)
  To: Taylor Blau; +Cc: SZEDER Gábor, Colin Stolley, git

On Mon, Dec 02, 2019 at 10:17:44PM -0800, Taylor Blau wrote:

> > Good catch. The issue is that we only add entries to the hashmap in
> > prepare_packed_git(), but they may be added to the pack list by other
> > callers of install_packed_git(). It probably makes sense to just push
> > the hashmap maintenance down into that function, like below. That
> > requires an extra strhash() when inserting a new pack, but I don't think
> > that's a big deal.
> 
> Ah, great catch, and thanks for pointing it out. We have been running
> this patch in production at GitHub for a few weeks now, but didn't
> notice this because we never run tests with
> 'GIT_TEST_MULTI_PACK_INDEX=1' in the environment.
> 
> Perhaps in the future that might change, but I think that for now that
> can explain why the failure wasn't noticed earlier.

It would trigger in other contexts, too: anywhere we call
install_packed_git() outside of prepare_packed_git(). I think the main
reason we didn't notice in the tests or in production is that:

  - it would generally require re-scanning the pack directory, which
    implies an unexpected missing object (or a racy repack), which is
    not likely to happen during the tests

  - in most cases duplicates don't impact the outcome of the command at
    all. A duplicate entry in the packed_git list would just mean an
    extra pack to check for lookups. Most objects would be found in
    other packs (including its doppelganger entry), so the duplicate
    would float to the bottom of the MRU order. It's only lookups for
    objects we don't have that would pay a penalty (and it would be
    relatively small).

But obviously it's still worth fixing.

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  2019-12-02 19:42   ` Jeff King
  2019-12-03  6:17     ` Taylor Blau
@ 2019-12-03 16:04     ` Junio C Hamano
  2019-12-03 17:33       ` Colin Stolley
  2019-12-03 22:17       ` Jeff King
  1 sibling, 2 replies; 15+ messages in thread
From: Junio C Hamano @ 2019-12-03 16:04 UTC (permalink / raw)
  To: Jeff King; +Cc: SZEDER Gábor, Colin Stolley, git

Jeff King <peff@peff.net> writes:

> Good catch. The issue is that we only add entries to the hashmap in
> prepare_packed_git(), but they may be added to the pack list by other
> callers of install_packed_git(). It probably makes sense to just push
> the hashmap maintenance down into that function, like below.

Makes sense to me.

Let me locally squash your fix in and credit you with helped-by
footer in the amended log message.  Strictly speaking, this may
invalidate the perf numbers, but I do not think the scenario p5303
sets up alone is all that interesting anyway---if you have 10,000
packs, not just registering them (which is improved with this patch)
but using objects from them would be slower than necessary X-<.

Thanks.

 -- >8 --

From: Colin Stolley <cstolley@runbox.com>
Date: Wed, 27 Nov 2019 16:24:53 -0600
Subject: [PATCH] packfile.c: speed up loading lots of packfiles

When loading packfiles on start-up, we traverse the internal packfile
list once per file to avoid reloading packfiles that have already
been loaded. This check runs in quadratic time, so for poorly
maintained repos with a large number of packfiles, it can be pretty
slow.

Add a hashmap containing the packfile names as we load them so that
the average runtime cost of checking for already-loaded packs becomes
constant.

Add a perf test to p5303 to show speed-up.

The existing p5303 test runtimes are dominated by other factors and do
not show an appreciable speed-up. The new test in p5303 clearly exposes
a speed-up in bad cases. In this test we create 10,000 packfiles and
measure the start-up time of git rev-parse, which does little else
besides load in the packs.

Here are the numbers for the new p5303 test:

Test                         HEAD^             HEAD
---------------------------------------------------------------------
5303.12: load 10,000 packs   1.03(0.92+0.10)   0.12(0.02+0.09) -88.3%

Signed-off-by: Colin Stolley <cstolley@runbox.com>
Helped-by: Jeff King <peff@peff.net>
[jc: squashed the change to call hashmap in install_packed_git() by peff]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 object-store.h             | 21 +++++++++++++++++++++
 object.c                   |  3 +++
 packfile.c                 | 19 ++++++++++---------
 t/perf/p5303-many-packs.sh | 18 ++++++++++++++++++
 4 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/object-store.h b/object-store.h
index 7f7b3cdd80..55ee639350 100644
--- a/object-store.h
+++ b/object-store.h
@@ -60,6 +60,7 @@ struct oid_array *odb_loose_cache(struct object_directory *odb,
 void odb_clear_loose_cache(struct object_directory *odb);
 
 struct packed_git {
+	struct hashmap_entry packmap_ent;
 	struct packed_git *next;
 	struct list_head mru;
 	struct pack_window *windows;
@@ -88,6 +89,20 @@ struct packed_git {
 
 struct multi_pack_index;
 
+static inline int pack_map_entry_cmp(const void *unused_cmp_data,
+				     const struct hashmap_entry *entry,
+				     const struct hashmap_entry *entry2,
+				     const void *keydata)
+{
+	const char *key = keydata;
+	const struct packed_git *pg1, *pg2;
+
+	pg1 = container_of(entry, const struct packed_git, packmap_ent);
+	pg2 = container_of(entry2, const struct packed_git, packmap_ent);
+
+	return strcmp(pg1->pack_name, key ? key : pg2->pack_name);
+}
+
 struct raw_object_store {
 	/*
 	 * Set of all object directories; the main directory is first (and
@@ -131,6 +146,12 @@ struct raw_object_store {
 	/* A most-recently-used ordered version of the packed_git list. */
 	struct list_head packed_git_mru;
 
+	/*
+	 * A map of packfiles to packed_git structs for tracking which
+	 * packs have been loaded already.
+	 */
+	struct hashmap pack_map;
+
 	/*
 	 * A fast, rough count of the number of objects in the repository.
 	 * These two fields are not meant for direct access. Use
diff --git a/object.c b/object.c
index 3b8b8c55c9..142ef69399 100644
--- a/object.c
+++ b/object.c
@@ -479,6 +479,7 @@ struct raw_object_store *raw_object_store_new(void)
 
 	memset(o, 0, sizeof(*o));
 	INIT_LIST_HEAD(&o->packed_git_mru);
+	hashmap_init(&o->pack_map, pack_map_entry_cmp, NULL, 0);
 	return o;
 }
 
@@ -518,6 +519,8 @@ void raw_object_store_clear(struct raw_object_store *o)
 	INIT_LIST_HEAD(&o->packed_git_mru);
 	close_object_store(o);
 	o->packed_git = NULL;
+
+	hashmap_free(&o->pack_map);
 }
 
 void parsed_object_pool_clear(struct parsed_object_pool *o)
diff --git a/packfile.c b/packfile.c
index 355066de17..f0dc63e92f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -757,6 +757,9 @@ void install_packed_git(struct repository *r, struct packed_git *pack)
 
 	pack->next = r->objects->packed_git;
 	r->objects->packed_git = pack;
+
+	hashmap_entry_init(&pack->packmap_ent, strhash(pack->pack_name));
+	hashmap_add(&r->objects->pack_map, &pack->packmap_ent);
 }
 
 void (*report_garbage)(unsigned seen_bits, const char *path);
@@ -856,20 +859,18 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 
 	if (strip_suffix_mem(full_name, &base_len, ".idx") &&
 	    !(data->m && midx_contains_pack(data->m, file_name))) {
-		/* Don't reopen a pack we already have. */
-		for (p = data->r->objects->packed_git; p; p = p->next) {
-			size_t len;
-			if (strip_suffix(p->pack_name, ".pack", &len) &&
-			    len == base_len &&
-			    !memcmp(p->pack_name, full_name, len))
-				break;
-		}
+		struct hashmap_entry hent;
+		char *pack_name = xstrfmt("%.*s.pack", (int)base_len, full_name);
+		unsigned int hash = strhash(pack_name);
+		hashmap_entry_init(&hent, hash);
 
-		if (!p) {
+		/* Don't reopen a pack we already have. */
+		if (!hashmap_get(&data->r->objects->pack_map, &hent, pack_name)) {
 			p = add_packed_git(full_name, full_name_len, data->local);
 			if (p)
 				install_packed_git(data->r, p);
 		}
+		free(pack_name);
 	}
 
 	if (!report_garbage)
diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index 3779851941..ede78e19e2 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -84,4 +84,22 @@ do
 	'
 done
 
+# Measure pack loading with 10,000 packs.
+test_expect_success 'generate lots of packs' '
+	for i in $(test_seq 10000); do
+		echo "blob"
+		echo "data <<EOF"
+		echo "blob $i"
+		echo "EOF"
+		echo "checkpoint"
+	done |
+	git -c fastimport.unpackLimit=0 fast-import
+'
+
+# The purpose of this test is to evaluate load time for a large number
+# of packs while doing as little other work as possible.
+test_perf "load 10,000 packs" '
+	git rev-parse --verify "HEAD^{commit}"
+'
+
 test_done
-- 
2.24.0-578-g4820254054


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  2019-12-03 16:04     ` Junio C Hamano
@ 2019-12-03 17:33       ` Colin Stolley
  2019-12-03 22:18         ` Jeff King
  2019-12-03 22:17       ` Jeff King
  1 sibling, 1 reply; 15+ messages in thread
From: Colin Stolley @ 2019-12-03 17:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, SZEDER Gábor, git

Junio C Hamano wrote:

> Let me locally squash your fix in and credit you with helped-by
> footer in the amended log message.  Strictly speaking, this may

I'm also happy to provide a khash version if that is still desired,
perhaps as a separate patch.

Thanks for spotting that bug and providing a fix.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  2019-12-03 16:04     ` Junio C Hamano
  2019-12-03 17:33       ` Colin Stolley
@ 2019-12-03 22:17       ` Jeff King
  2019-12-04  4:23         ` Jonathan Nieder
  1 sibling, 1 reply; 15+ messages in thread
From: Jeff King @ 2019-12-03 22:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: SZEDER Gábor, Colin Stolley, git

On Tue, Dec 03, 2019 at 08:04:15AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Good catch. The issue is that we only add entries to the hashmap in
> > prepare_packed_git(), but they may be added to the pack list by other
> > callers of install_packed_git(). It probably makes sense to just push
> > the hashmap maintenance down into that function, like below.
> 
> Makes sense to me.
> 
> Let me locally squash your fix in and credit you with helped-by
> footer in the amended log message.  Strictly speaking, this may
> invalidate the perf numbers, but I do not think the scenario p5303
> sets up alone is all that interesting anyway---if you have 10,000
> packs, not just registering them (which is improved with this patch)
> but using objects from them would be slower than necessary X-<.

Thanks, that sounds good.

I actually re-checked the perf numbers (mostly to make sure I didn't
screw anything up) and got similar results.

I agree that 10,000 packs is ridiculous, but we do see it (and worse)
occasionally from people pushing in a loop before our scheduled
maintenance kicks in.  And it's quadratic, so if you hit 30k packs, it's
another factor of 9 worse. It makes even diagnosing the repository
pretty painful. :)

Also a fun fact: Linux actually has a limit on the number of
simultaneous mmaps that a process can have, which defaults to ~64k. But
if you have if you have 32k packs, then we'll map both the packs and the
idx files. Plus whatever you need for mapping the binary, libraries,
etc, plus any maps opened by malloc() for large requests.

I have occasionally run into this trying to repack some very
out-of-control cases (the magic fix is to double the limit with `sysctl
-w vm.max_map_count=131060`, if you are curious). I also wondered if
this might be made worse by the recent change to drop
release_pack_memory(). But I ran into it even before that change,
because zlib calls malloc() directly. We're also pretty aggressive about
dying when mmap() returns an error (rather than closing packs and trying
again).

I think Git _could_ handle this more gracefully by just trying to
keep fewer packs open at one time (the way we similarly try not to use
up all of the file descriptors). But I have a hard time caring too much,
since it's such a ridiculous situation in the first place. Bumping the
limits is an easy operational fix.

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  2019-12-03 17:33       ` Colin Stolley
@ 2019-12-03 22:18         ` Jeff King
  2019-12-04 18:15           ` Junio C Hamano
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2019-12-03 22:18 UTC (permalink / raw)
  To: Colin Stolley; +Cc: Junio C Hamano, SZEDER Gábor, git

On Tue, Dec 03, 2019 at 11:33:34AM -0600, Colin Stolley wrote:

> Junio C Hamano wrote:
> 
> > Let me locally squash your fix in and credit you with helped-by
> > footer in the amended log message.  Strictly speaking, this may
> 
> I'm also happy to provide a khash version if that is still desired,
> perhaps as a separate patch.

Personally I'd be OK with it as-is, or with a conversion to khash (as I
said earlier, I prefer khash myself, but it's not like there aren't tons
of other hashmap.c users in the code).

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  2019-12-03 22:17       ` Jeff King
@ 2019-12-04  4:23         ` Jonathan Nieder
  0 siblings, 0 replies; 15+ messages in thread
From: Jonathan Nieder @ 2019-12-04  4:23 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, SZEDER Gábor, Colin Stolley, git, Martin Fick

Jeff King wrote:

> I agree that 10,000 packs is ridiculous, but we do see it (and worse)
> occasionally from people pushing in a loop before our scheduled
> maintenance kicks in.

On that subject: one thing Martin Fick (cc-ed) has suggested is moving
more of the garbage collection work "inline" into the push path.  That
is, instead of letting someone push 10,000 packs in a loop, build
concatenated packs ("exponential rollup") in the push path and don't
return success and commit the packs into the object store until we're
done.  That way a reasonably small amortized cost is paid up front by
the pusher instead of later by everyone.

Midx changes things a little: it might make sense to build
concatenated idxes instead of packs, which would still avoid the same
quadratic behavior.

Just a random thought,
Jonathan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  2019-12-03 22:18         ` Jeff King
@ 2019-12-04 18:15           ` Junio C Hamano
  0 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2019-12-04 18:15 UTC (permalink / raw)
  To: Jeff King; +Cc: Colin Stolley, SZEDER Gábor, git

Jeff King <peff@peff.net> writes:

> On Tue, Dec 03, 2019 at 11:33:34AM -0600, Colin Stolley wrote:
>
>> Junio C Hamano wrote:
>> 
>> > Let me locally squash your fix in and credit you with helped-by
>> > footer in the amended log message.  Strictly speaking, this may
>> 
>> I'm also happy to provide a khash version if that is still desired,
>> perhaps as a separate patch.
>
> Personally I'd be OK with it as-is, or with a conversion to khash (as I
> said earlier, I prefer khash myself, but it's not like there aren't tons
> of other hashmap.c users in the code).

Exactly my feeling.  Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, back to index

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-27 22:24 [PATCH] packfile.c: speed up loading lots of packfiles Colin Stolley
2019-11-28  0:42 ` hashmap vs khash? " Eric Wong
2019-11-30 17:36   ` Junio C Hamano
2019-12-02 14:39   ` Jeff King
2019-12-02 17:40 ` SZEDER Gábor
2019-12-02 19:42   ` Jeff King
2019-12-03  6:17     ` Taylor Blau
2019-12-03 15:34       ` Jeff King
2019-12-03 16:04     ` Junio C Hamano
2019-12-03 17:33       ` Colin Stolley
2019-12-03 22:18         ` Jeff King
2019-12-04 18:15           ` Junio C Hamano
2019-12-03 22:17       ` Jeff King
2019-12-04  4:23         ` Jonathan Nieder
2019-12-03  6:19 ` Taylor Blau

git@vger.kernel.org list mirror (unofficial, one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git