git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, dstolee@microsoft.com, gitster@pobox.com,
	jrnieder@gmail.com
Subject: Re: [PATCH v2 2/8] pack-write.c: prepare to write 'pack-*.rev' files
Date: Fri, 22 Jan 2021 18:24:59 -0500	[thread overview]
Message-ID: <YAtey9krU32mgEBV@coredump.intra.peff.net> (raw)
In-Reply-To: <8648c87fa71aec427dc11afcba6548fb66a1413b.1610576805.git.me@ttaylorr.com>

On Wed, Jan 13, 2021 at 05:28:11PM -0500, Taylor Blau wrote:

> +static void write_rev_index_positions(struct hashfile *f,
> +				      struct pack_idx_entry **objects,
> +				      uint32_t nr_objects)
> +{
> +	uint32_t *pack_order;
> +	uint32_t i;
> +
> +	ALLOC_ARRAY(pack_order, nr_objects);
> +	for (i = 0; i < nr_objects; i++)
> +		pack_order[i] = i;
> +	QSORT_S(pack_order, nr_objects, pack_order_cmp, objects);

qsort? Don't we have a perfectly good radix sort for this exact purpose? :)

I guess it is awkward to use because it hard-codes the assumption that
we are sorting revindex_entry structs, with their offsets directly
available.

We _could_ actually just generate an in-memory revindex array, and then
use that to write out the .rev file. That has the nice property that
we'd continue exercising the fallback code in the tests.

But a revindex_entry is much larger than the uint32_t's we need here. So
we'd be incurring extra memory costs during the generation, which is
probably not worth it. If we want better test coverage, we probably
should explicitly run a series of tests both with and without a .rev
file.

It's possible we'd still benefit from using a more generalized radix
sort, but it probably isn't worth the trouble. The timings from
8b8dfd5132 (pack-revindex: radix-sort the revindex, 2013-07-11) claim a
4x speedup on sorting 3M entries (the size matters because we are
comparing an O(n) sort versus an O(n log n) one). We can imagine that at
10M entries for the current kernel, it might even be 8x. But the
absolute numbers are pretty small. The radix sort takes ~150ms for
linux.git on my machine.  At 8x, that's 1.2s. For a repack of the
kernel, that is mostly a drop in the bucket. It mattered a lot more when
many processes were doing it on the fly.

> +static int pack_order_cmp(const void *va, const void *vb, void *ctx)
> +{
> +	struct pack_idx_entry **objects = ctx;
> +
> +	off_t oa = objects[*(uint32_t*)va]->offset;
> +	off_t ob = objects[*(uint32_t*)vb]->offset;

Dereferencing a pointer to index another array always makes me nervous
that we may have a bounds problem with bogus data.

In this case we know it is OK because we filled the array ourselves with
in-bound numbers in write_rev_index_positions.

> +#define RIDX_SIGNATURE 0x52494458 /* "RIDX" */
> +#define RIDX_VERSION 1

I was surprised we didn't define these already on the reading side, but
it looks like we didn't. In patch 1, we probably should be checking
RIDX_SIGNATURE in load_revindex_from_disk(). But much more importantly,
we should be checking that we find version 1, since that's what will
make it safe to later invent a version 2.

> +static void write_rev_header(struct hashfile *f)
> +{
> +	uint32_t oid_version;
> +	switch (hash_algo_by_ptr(the_hash_algo)) {
> +	case GIT_HASH_SHA1:
> +		oid_version = 1;
> +		break;
> +	case GIT_HASH_SHA256:
> +		oid_version = 2;
> +		break;
> +	default:
> +		die("write_rev_header: unknown hash version");
> +	}

I forgot to comment on this in patch 1, but: I think the format is
really independent of the hash size. The contents are identical for a
sha-1 versus sha-256 file.

That said, I don't overly mind having a hash identifier if it might help
debug things (OTOH, how the heck do you end up with one that matches the
trailer's packfile but  _doesn't_ match the trailer's contents?).

If we do have it, should we also be checking it in the loading function?

> +const char *write_rev_file(const char *rev_name,
> +			   struct pack_idx_entry **objects,
> +			   uint32_t nr_objects,
> +			   const unsigned char *hash,
> +			   unsigned flags)
> +{
> +	struct hashfile *f;
> +	int fd;
> +
> +	if ((flags & WRITE_REV) && (flags & WRITE_REV_VERIFY))
> +		die(_("cannot both write and verify reverse index"));
> +
> +	if (flags & WRITE_REV) {
> +		if (!rev_name) {
> +			struct strbuf tmp_file = STRBUF_INIT;
> +			fd = odb_mkstemp(&tmp_file, "pack/tmp_rev_XXXXXX");
> +			rev_name = strbuf_detach(&tmp_file, NULL);
> +		} else {
> +			unlink(rev_name);
> +			fd = open(rev_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
> +			if (fd < 0)
> +				die_errno("unable to create '%s'", rev_name);
> +		}

So if the caller gave us a name, we force-overwrite it. That seemed
weird to me at first, but it makes sense; the atomic rename-into-place
writers will not be passing in the name. And this is exactly how
write_idx_file() works.

I wonder if we could factor out some of this repeated logic, but I
suspect it is mostly diminishing returns. Maybe this "open a pack file
for writing" could become a helper function, though.

> diff --git a/pack.h b/pack.h
> index 9fc0945ac9..30439e0784 100644
> --- a/pack.h
> +++ b/pack.h
> @@ -42,6 +42,8 @@ struct pack_idx_option {
>  	/* flag bits */
>  #define WRITE_IDX_VERIFY 01 /* verify only, do not write the idx file */
>  #define WRITE_IDX_STRICT 02
> +#define WRITE_REV 04
> +#define WRITE_REV_VERIFY 010

It is a little funny that the write-rev function has both WRITE_REV and
WRITE_REV_VERIFY (which we must be sure are not both provided), but we
do not need both for the IDX.

I thought maybe the reason is that we'd pass these flags to
write_idx_file() or similar, and it would need to know whether to write
just an idx, or both. That doesn't seem to happen in this patch, but
maybe it does in a future one...

-Peff

  reply	other threads:[~2021-01-22 23:27 UTC|newest]

Thread overview: 82+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-08 18:19 [PATCH 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
2021-01-08 18:19 ` [PATCH 1/8] packfile: prepare for the existence of '*.rev' files Taylor Blau
2021-01-08 18:20 ` [PATCH 2/8] pack-write.c: prepare to write 'pack-*.rev' files Taylor Blau
2021-01-08 18:20 ` [PATCH 3/8] builtin/index-pack.c: write reverse indexes Taylor Blau
2021-01-08 18:20 ` [PATCH 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex' Taylor Blau
2021-01-08 18:20 ` [PATCH 5/8] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex' Taylor Blau
2021-01-08 18:20 ` [PATCH 6/8] t: prepare for GIT_TEST_WRITE_REV_INDEX Taylor Blau
2021-01-12 17:11   ` Ævar Arnfjörð Bjarmason
2021-01-12 18:40     ` Taylor Blau
2021-01-08 18:20 ` [PATCH 7/8] t: support GIT_TEST_WRITE_REV_INDEX Taylor Blau
2021-01-12 16:49   ` Derrick Stolee
2021-01-12 17:34     ` Taylor Blau
2021-01-12 17:18   ` Ævar Arnfjörð Bjarmason
2021-01-12 17:39     ` Derrick Stolee
2021-01-12 18:17       ` Taylor Blau
2021-01-08 18:20 ` [PATCH 8/8] pack-revindex: ensure that on-disk reverse indexes are given precedence Taylor Blau
2021-01-13 22:28 ` [PATCH v2 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
2021-01-13 22:28   ` [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files Taylor Blau
2021-01-14  7:22     ` Junio C Hamano
2021-01-14 12:07       ` Derrick Stolee
2021-01-14 19:57         ` Jeff King
2021-01-14 18:28       ` Taylor Blau
2021-01-14  7:26     ` Junio C Hamano
2021-01-14 18:13       ` Taylor Blau
2021-01-14 20:57         ` Junio C Hamano
2021-01-22 22:54     ` Jeff King
2021-01-25 17:44       ` Taylor Blau
2021-01-25 18:27         ` Jeff King
2021-01-25 19:04         ` Junio C Hamano
2021-01-25 19:23           ` Taylor Blau
2021-01-13 22:28   ` [PATCH v2 2/8] pack-write.c: prepare to write 'pack-*.rev' files Taylor Blau
2021-01-22 23:24     ` Jeff King [this message]
2021-01-25 19:15       ` Taylor Blau
2021-01-26 21:43         ` Jeff King
2021-01-13 22:28   ` [PATCH v2 3/8] builtin/index-pack.c: write reverse indexes Taylor Blau
2021-01-22 23:53     ` Jeff King
2021-01-25 20:03       ` Taylor Blau
2021-01-13 22:28   ` [PATCH v2 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex' Taylor Blau
2021-01-22 23:57     ` Jeff King
2021-01-23  0:08       ` Jeff King
2021-01-25 20:21         ` Taylor Blau
2021-01-25 20:50           ` Jeff King
2021-01-13 22:28   ` [PATCH v2 5/8] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex' Taylor Blau
2021-01-13 22:28   ` [PATCH v2 6/8] t: prepare for GIT_TEST_WRITE_REV_INDEX Taylor Blau
2021-01-13 22:28   ` [PATCH v2 7/8] t: support GIT_TEST_WRITE_REV_INDEX Taylor Blau
2021-01-13 22:28   ` [PATCH v2 8/8] pack-revindex: ensure that on-disk reverse indexes are given precedence Taylor Blau
2021-01-25 23:37 ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Taylor Blau
2021-01-25 23:37   ` [PATCH v3 01/10] packfile: prepare for the existence of '*.rev' files Taylor Blau
2021-01-29  0:27     ` Jeff King
2021-01-29  1:14       ` Taylor Blau
2021-01-30  8:39         ` Jeff King
2021-01-25 23:37   ` [PATCH v3 02/10] pack-write.c: prepare to write 'pack-*.rev' files Taylor Blau
2021-01-25 23:37   ` [PATCH v3 03/10] builtin/index-pack.c: allow stripping arbitrary extensions Taylor Blau
2021-01-29  0:28     ` Jeff King
2021-01-29  1:15       ` Taylor Blau
2021-01-25 23:37   ` [PATCH v3 04/10] builtin/index-pack.c: write reverse indexes Taylor Blau
2021-01-25 23:37   ` [PATCH v3 05/10] builtin/pack-objects.c: respect 'pack.writeReverseIndex' Taylor Blau
2021-01-25 23:37   ` [PATCH v3 06/10] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex' Taylor Blau
2021-01-29  0:30     ` Jeff King
2021-01-29  1:17       ` Taylor Blau
2021-01-30  8:41         ` Jeff King
2021-01-25 23:37   ` [PATCH v3 07/10] t: prepare for GIT_TEST_WRITE_REV_INDEX Taylor Blau
2021-01-29  0:45     ` Jeff King
2021-01-29  1:09       ` Eric Sunshine
2021-01-29  1:21       ` Taylor Blau
2021-01-30  8:43         ` Jeff King
2021-01-29  2:42       ` Junio C Hamano
2021-01-25 23:37   ` [PATCH v3 08/10] t: support GIT_TEST_WRITE_REV_INDEX Taylor Blau
2021-01-29  0:47     ` Jeff King
2021-01-25 23:37   ` [PATCH v3 09/10] pack-revindex: ensure that on-disk reverse indexes are given precedence Taylor Blau
2021-01-29  0:53     ` Jeff King
2021-01-29  1:25       ` Taylor Blau
2021-01-30  8:46         ` Jeff King
2021-01-25 23:37   ` [PATCH v3 10/10] t5325: check both on-disk and in-memory reverse index Taylor Blau
2021-01-29  1:04     ` Jeff King
2021-01-29  1:05     ` Jeff King
2021-01-29  1:32       ` Taylor Blau
2021-01-30  8:47         ` Jeff King
2021-01-26  2:36   ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Junio C Hamano
2021-01-26  2:49     ` Taylor Blau
2021-01-29  1:06   ` Jeff King
2021-01-29  1:34     ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YAtey9krU32mgEBV@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jrnieder@gmail.com \
    --cc=me@ttaylorr.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).