git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: "René Scharfe" <l.s.r@web.de>
Cc: "Git List" <git@vger.kernel.org>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Ramsay Jones" <ramsay@ramsayjones.plus.com>,
	"Johannes Schindelin" <johannes.schindelin@gmx.de>,
	"Junio C Hamano" <gitster@pobox.com>
Subject: Re: [PATCH 2/2] fsck: use oidset for skiplist
Date: Sat, 11 Aug 2018 13:23:50 -0400	[thread overview]
Message-ID: <20180811172350.GA2689@sigill.intra.peff.net> (raw)
In-Reply-To: <20180811170248.GC27393@sigill.intra.peff.net>

On Sat, Aug 11, 2018 at 01:02:48PM -0400, Jeff King wrote:

>   - we could probably improve the speed of oidset. Two things I notice
>     about its implementation:
> 
>       - it has to malloc for each entry, which I suspect is the main
> 	bottleneck. We could probably pool-allocate blocks, and when
> 	entries get removed just leave the allocations in place until we
> 	clear(). Most callers tend to build up a set and then query it a
> 	lot, or possibly remove items from the set until it's empty. But
> 	my guess is that few or none want a long-lived set that they add
> 	and remove from randomly.
> 
>       - insertion lets you do check-and-insert as a single operation
> 	(something I failed to notice in [1]). But it's not implemented
> 	as efficiently as it could be, since the "contains" and "put"
> 	operations do separate lookups. This doesn't matter for a set
> 	that's queried a lot more, but for something like de-duping
> 	(like I was doing in [1]) most operations are check-and-insert.
> [...]
> [1] https://public-inbox.org/git/20180810232457.GG19875@sigill.intra.peff.net/
>     but note that it's buried pretty deep.

Some notes on this, based on the cat-file patch that I linked to.

Before any optimizations, my best-of-five timing for:

  git cat-file --batch-all-objects --unordered --buffer \
               --batch-check='%(objectname)' >/dev/null

in git.git was:

  real	0m0.434s
  user	0m0.414s
  sys	0m0.020s

That's enumerating every object in the repo but not doing much more than
de-duping the names and printing them.

Applying this patch:

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 45992c9be9..04b5cda191 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -443,9 +443,8 @@ static int batch_unordered_object(const struct object_id *oid, void *vdata)
 {
 	struct object_cb_data *data = vdata;
 
-	if (oidset_contains(data->seen, oid))
+	if (oidset_insert(data->seen, oid))
 		return 0;
-	oidset_insert(data->seen, oid);
 
 	return batch_object_cb(oid, data);
 }

to use the single-call set-and-replace doesn't seem to help (any
improvement is within the run-to-run noise). So a single hash lookup per
object does not seem to be measurable. And thus teaching oidset_insert()
to do a single hash lookup for check-and-insert is unlikely to help us.

On top of that, I tried using a pool to store the set entries:

diff --git a/oidset.c b/oidset.c
index 454c54f933..504929f177 100644
--- a/oidset.c
+++ b/oidset.c
@@ -17,7 +17,10 @@ int oidset_insert(struct oidset *set, const struct object_id *oid)
 	else if (oidset_contains(set, oid))
 		return 1;
 
-	entry = xmalloc(sizeof(*entry));
+	if (!set->pool)
+		mem_pool_init(&set->pool, 0);
+
+	entry = mem_pool_alloc(set->pool, sizeof(*entry));
 	oidcpy(&entry->oid, oid);
 
 	oidmap_put(&set->map, entry);
@@ -29,12 +32,13 @@ int oidset_remove(struct oidset *set, const struct object_id *oid)
 	struct oidmap_entry *entry;
 
 	entry = oidmap_remove(&set->map, oid);
-	free(entry);
+	/* abandon pool memory for "entry" */
 
 	return (entry != NULL);
 }
 
 void oidset_clear(struct oidset *set)
 {
-	oidmap_free(&set->map, 1);
+	oidmap_free(&set->map, 0);
+	mem_pool_discard(set->pool, 0);
 }
diff --git a/oidset.h b/oidset.h
index 40ec5f87fe..6b8b802987 100644
--- a/oidset.h
+++ b/oidset.h
@@ -20,6 +20,7 @@
  */
 struct oidset {
 	struct oidmap map;
+	struct mem_pool *pool;
 };
 
 #define OIDSET_INIT { OIDMAP_INIT }

That drops my best-of-five to:

  real	0m0.300s
  user	0m0.288s
  sys	0m0.012s

which is over a 25% speedup. So that does seem worth pursuing.

For reference, the oid_array code path for cat-file is still:

  real	0m0.161s
  user	0m0.157s
  sys	0m0.004s

but that's not completely apples to apples. The oidset code is also
iterating the packfiles in a different order and generating a revidx
(which I know is about 25ms in this repo). So a better test would
actually swap one data structure out for the other with no other changes
(I just happened to have this test handy, and really wanted to know
whether the mem_pool stuff would help).

-Peff

  reply	other threads:[~2018-08-11 20:10 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-11 15:39 [PATCH 1/2] fsck: use strbuf_getline() to read skiplist file René Scharfe
2018-08-11 15:47 ` [PATCH 2/2] fsck: use oidset for skiplist René Scharfe
2018-08-11 16:54   ` Ævar Arnfjörð Bjarmason
2018-08-25 18:49     ` René Scharfe
2018-08-11 17:02   ` Jeff King
2018-08-11 17:23     ` Jeff King [this message]
2018-08-11 20:59       ` René Scharfe
2018-08-13 17:15         ` René Scharfe
2018-08-14  1:58           ` Jeff King
2018-08-14  2:03             ` Jeff King
2018-08-26 11:37             ` René Scharfe
2018-08-27 23:03               ` Jeff King
2018-10-01 19:15                 ` René Scharfe
2018-10-01 20:26                   ` Jeff King
2018-10-02 19:05                     ` René Scharfe
2018-10-02 19:19                       ` Jeff King
2018-08-13 17:15       ` René Scharfe
2018-08-14  2:01         ` Jeff King
2018-08-11 20:48   ` Ramsay Jones
2018-08-25 18:49     ` René Scharfe
2018-08-13 18:43   ` Junio C Hamano
2018-08-13 20:26     ` René Scharfe
2018-08-13 21:07       ` Junio C Hamano
2018-08-13 23:09         ` René Scharfe
2018-08-11 16:48 ` [PATCH 1/2] fsck: use strbuf_getline() to read skiplist file Jeff King
2018-08-11 21:00   ` René Scharfe
2018-08-25 18:50 ` [PATCH v2 " René Scharfe
2018-08-27 23:00   ` Jeff King
2018-08-25 18:50 ` [PATCH v2 2/2] fsck: use oidset for skiplist René Scharfe
2018-08-27  7:37   ` Ævar Arnfjörð Bjarmason
2018-08-27 15:23     ` René Scharfe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180811172350.GA2689@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=l.s.r@web.de \
    --cc=ramsay@ramsayjones.plus.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).