git@vger.kernel.org list mirror (unofficial, one of many)
* [PATCH 0/8] pack-revindex: introduce on-disk '.rev' format
@ 2021-01-08 18:19 Taylor Blau
  2021-01-08 18:19 ` [PATCH 1/8] packfile: prepare for the existence of '*.rev' files Taylor Blau
                   ` (9 more replies)
  0 siblings, 10 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-08 18:19 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Hi,

This is the second of two series to implement support for an on-disk format for
storing the reverse index. (It depends on the patches in the previous series
[1]).

The format is described in the first patch, but it is roughly as follows:

  - It begins with a 12-byte header, containing a magic string, a version
    identifier, and a hash function identifier.

  - It then contains a table of 4 * N bytes (where 'N' is the number of objects)
    holding index positions, sorted by each object's offset within the
    corresponding packfile.

  - Finally, a trailer contains a checksum of the corresponding packfile, and a
    checksum of the above contents.
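
The layout above fixes the size of a `.rev` file once the object count and hash
size are known. The following is a minimal sketch of that arithmetic, not code
from this series; the `sketch_` and `SKETCH_` names are hypothetical, and it
assumes each trailer checksum is one raw hash wide (20 bytes for SHA-1, 32 for
SHA-256).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes taken from the description above; the names are hypothetical. */
#define SKETCH_RIDX_HEADER_SIZE 12u /* magic + version + hash id, 4 bytes each */
#define SKETCH_RIDX_ENTRY_SIZE 4u   /* one 32-bit index position per object */

/*
 * Expected size in bytes of a .rev file for a pack holding nr_objects
 * objects. hash_len is the raw hash size (assumed: 20 for SHA-1, 32 for
 * SHA-256). The trailer holds two checksums of hash_len bytes each.
 */
static uint64_t sketch_rev_file_size(uint32_t nr_objects, size_t hash_len)
{
	assert(hash_len == 20 || hash_len == 32);
	return SKETCH_RIDX_HEADER_SIZE +
	       (uint64_t)SKETCH_RIDX_ENTRY_SIZE * nr_objects +
	       2 * (uint64_t)hash_len;
}
```

A reader can compare this value against the size reported by stat() to reject
truncated or oversized files before trusting the table.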

Since this is a large change, a new 'pack.writeReverseIndex' option is
introduced, which defaults to 'false'. When false, `*.rev` files are not
written, and Git gracefully falls back to generating each reverse index in
memory. This could optionally be tied to the "feature.experimental" option, with
the default eventually changed to 'true' in a couple of releases.

To test these new changes, the test suite now understands
'GIT_TEST_WRITE_REV_INDEX' to mean that 'pack.writeReverseIndex' should be
'true' everywhere. Some minor test fall-out is addressed in the sixth patch
before enabling this new mode in the seventh patch.

One option that is _not_ pursued in this series is to store the (pack) offset of
each object in the `.rev` file. This would at worst triple the size of the file
(by having to store an additional eight bytes per entry), and add complexity
(like storing an extended offset table as in the `*.idx` format). An extensive
discussion about why this option was not pursued can be found in the first
patch.

Thanks in advance for your review.

[1]: https://lore.kernel.org/git/cover.1610129796.git.me@ttaylorr.com/

Taylor Blau (8):
  packfile: prepare for the existence of '*.rev' files
  pack-write.c: prepare to write 'pack-*.rev' files
  builtin/index-pack.c: write reverse indexes
  builtin/pack-objects.c: respect 'pack.writeReverseIndex'
  Documentation/config/pack.txt: advertise 'pack.writeReverseIndex'
  t: prepare for GIT_TEST_WRITE_REV_INDEX
  t: support GIT_TEST_WRITE_REV_INDEX
  pack-revindex: ensure that on-disk reverse indexes are given
    precedence

 Documentation/config/pack.txt           |   7 ++
 Documentation/git-index-pack.txt        |  20 ++--
 Documentation/technical/pack-format.txt |  17 ++++
 builtin/index-pack.c                    |  67 +++++++++++--
 builtin/pack-objects.c                  |   9 ++
 builtin/repack.c                        |   1 +
 object-store.h                          |   3 +
 pack-revindex.c                         | 116 ++++++++++++++++++++--
 pack-revindex.h                         |   3 +
 pack-write.c                            | 123 +++++++++++++++++++++++-
 pack.h                                  |   4 +
 packfile.c                              |  13 ++-
 packfile.h                              |   1 +
 t/README                                |   3 +
 t/t5319-multi-pack-index.sh             |   2 +-
 t/t5325-reverse-index.sh                |  94 ++++++++++++++++++
 t/t5604-clone-reference.sh              |   2 +-
 t/t5702-protocol-v2.sh                  |   4 +-
 t/t6500-gc.sh                           |   4 +-
 t/t9300-fast-import.sh                  |   2 +-
 tmp-objdir.c                            |   4 +-
 21 files changed, 463 insertions(+), 36 deletions(-)
 create mode 100755 t/t5325-reverse-index.sh

-- 
2.30.0.138.g6d7191ea01


* [PATCH 1/8] packfile: prepare for the existence of '*.rev' files
  2021-01-08 18:19 [PATCH 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
@ 2021-01-08 18:19 ` Taylor Blau
  2021-01-08 18:20 ` [PATCH 2/8] pack-write.c: prepare to write 'pack-*.rev' files Taylor Blau
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-08 18:19 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Specify the format of the on-disk reverse index 'pack-*.rev' file, as
well as prepare the code for the existence of such files.

The reverse index maps from pack-relative positions (i.e., an index into
the array of objects, which is sorted by their offsets within the
packfile) to their position within the 'pack-*.idx' file. Today, this is
done by building up a list of (off_t, uint32_t) tuples for each object
(the off_t corresponding to that object's offset, and the uint32_t
corresponding to its position in the index). To convert between pack and
index position quickly, this array of tuples is radix sorted based on
its offset.

This has two major drawbacks:

First, the in-memory cost scales linearly with the number of objects in
a pack.  Each 'struct revindex_entry' is sizeof(off_t) +
sizeof(uint32_t) + padding bytes for a total of 16.

To observe this, force Git to load the reverse index by, e.g., running
'git cat-file --batch-check="%(objectsize:disk)"'. When asking for a
single object in a fresh clone of the kernel, Git needs to allocate
120+ MB in order to hold the reverse index in memory.
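
As a rough illustration of the linear cost described above, here is a sketch
that is not part of this patch. The struct is a stand-in for 'struct
revindex_entry', with int64_t assumed in place of off_t (8 bytes on common
64-bit platforms); the `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in for 'struct revindex_entry'. int64_t plays the role of off_t,
 * which is an assumption about the platform.
 */
struct sketch_revindex_entry {
	int64_t offset; /* object's offset within the packfile */
	uint32_t nr;    /* object's position in the pack index */
};

/* Bytes needed to hold an in-memory reverse index for nr_objects objects. */
static uint64_t sketch_revindex_bytes(uint32_t nr_objects)
{
	/* Padding may make the struct larger than the sum of its members. */
	assert(sizeof(struct sketch_revindex_entry) >=
	       sizeof(int64_t) + sizeof(uint32_t));
	return (uint64_t)nr_objects * sizeof(struct sketch_revindex_entry);
}
```

At 16 bytes per entry, 120 MB corresponds to something on the order of 7.5
million entries.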

Second, the cost to sort also scales with the size of the pack.
Luckily, this is a linear function since 'load_pack_revindex()' uses a
radix sort, but this cost still must be paid once per pack per process.

As an example, it takes ~60x longer to print the _size_ of an object than
it does to print that entire object's _contents_:

  Benchmark #1: git.compile cat-file --batch <obj
    Time (mean ± σ):       3.4 ms ±   0.1 ms    [User: 3.3 ms, System: 2.1 ms]
    Range (min … max):     3.2 ms …   3.7 ms    726 runs

  Benchmark #2: git.compile cat-file --batch-check="%(objectsize:disk)" <obj
    Time (mean ± σ):     210.3 ms ±   8.9 ms    [User: 188.2 ms, System: 23.2 ms]
    Range (min … max):   193.7 ms … 224.4 ms    13 runs

Instead, avoid computing and sorting the revindex once per process by
writing it to a file when the pack itself is generated.

The format is relatively straightforward. It contains an array of
uint32_t's, the length of which is equal to the number of objects in the
pack.  The ith entry in this table contains the index position of the
ith object in the pack, where "ith object in the pack" is determined by
pack offset.
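
A minimal sketch of reading that table, assuming the entries are stored as
32-bit big-endian values (which matches the get_be32() call in the diff below)
and that 'table' already points past the 12-byte header. The `sketch_` helpers
are hypothetical and only illustrate the direct indexing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 32-bit big-endian value (hypothetical helper). */
static uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Map a pack position to an index position. 'table' points at the first
 * table entry; the lookup is a single array access, so it is O(1).
 */
static uint32_t sketch_pack_pos_to_index(const unsigned char *table,
					 uint32_t nr_objects, uint32_t pos)
{
	assert(pos < nr_objects);
	return sketch_get_be32(table + (size_t)pos * 4);
}
```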

One thing that the on-disk format does _not_ contain is the full (up to)
eight-byte offset corresponding to each object. This is something that
the in-memory revindex contains (it stores an off_t in 'struct
revindex_entry' along with the same uint32_t that the on-disk format
has). Omit it in the on-disk format, since knowing the index position
for some object is sufficient to get a constant-time lookup in the
pack-*.idx file to ask for an object's offset within the pack.

This trades a smaller on-disk 'pack-*.rev' file for extra runtime to
chase down the offset for some object. Even though the lookup is
constant time, the constant is heavier, since it can potentially
involve two pointer walks in v2 indexes (one to access the 4-byte offset
table, and potentially a second to access the double-wide offset table).
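
The two pointer walks can be sketched as follows. This is an illustration
rather than Git's nth_packed_object_offset(); it assumes the v2 index
convention that a 4-byte entry with its high bit set uses the remaining 31 bits
to select an entry in the 8-byte table, and the `sketch_` names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 32-bit big-endian value (hypothetical helper). */
static uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode a 64-bit big-endian value (hypothetical helper). */
static uint64_t sketch_get_be64(const unsigned char *p)
{
	return ((uint64_t)sketch_get_be32(p) << 32) | sketch_get_be32(p + 4);
}

/*
 * Pack offset of the object at index position idx_pos. off32 points at
 * the 4-byte offset table, off64 at the 8-byte table (may be NULL when
 * no object needs it).
 */
static uint64_t sketch_nth_offset(const unsigned char *off32,
				  const unsigned char *off64,
				  uint32_t idx_pos)
{
	/* First walk: the 4-byte offset table. */
	uint32_t v = sketch_get_be32(off32 + (size_t)idx_pos * 4);

	if (!(v & 0x80000000u))
		return v;

	/* Second walk: the low 31 bits index the 8-byte table. */
	assert(off64 != NULL);
	return sketch_get_be64(off64 + (size_t)(v & 0x7fffffffu) * 8);
}
```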

Consider trying to map an object's pack offset to a relative position
within that pack. In a cold-cache scenario, more page faults occur while
switching between binary searching through the reverse index and
searching through the *.idx file for an object's offset. Sure enough,
with a cold cache (writing '3' into '/proc/sys/vm/drop_caches' after
'sync'ing), printing out the entire object's contents is still
marginally faster than printing its size:

  Benchmark #1: git.compile cat-file --batch-check="%(objectsize:disk)" <obj >/dev/null
    Time (mean ± σ):      22.6 ms ±   0.5 ms    [User: 2.4 ms, System: 7.9 ms]
    Range (min … max):    21.4 ms …  23.5 ms    41 runs

  Benchmark #2: git.compile cat-file --batch <obj >/dev/null
    Time (mean ± σ):      17.2 ms ±   0.7 ms    [User: 2.8 ms, System: 5.5 ms]
    Range (min … max):    15.6 ms …  18.2 ms    45 runs

(Numbers taken in the kernel after cheating and using the next patch to
generate a reverse index). There are a couple of approaches to improve
cold cache performance not pursued here:

  - We could include the object offsets in the reverse index format.
    Predictably, this does result in fewer page faults, but it triples
    the size of the file, while simultaneously duplicating a ton of data
    already available in the .idx file. (This was the original way I
    implemented the format, and it did show
    `--batch-check='%(objectsize:disk)'` winning out against `--batch`.)

    On the other hand, this increase in size also results in a large
    block-cache footprint, which could potentially hurt other workloads.

  - We could store the mapping from pack to index position in a more
    cache-friendly way, like constructing a binary search tree from the
    table and writing the values in breadth-first order. This would
    result in much better locality, but the price you pay is trading
    O(1) lookup in 'pack_pos_to_index()' for an O(log n) one (since you
    can no longer directly index the table).
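
For readers unfamiliar with the second idea, here is a sketch of a
breadth-first layout of a sorted array. It is not something this series
implements; the keys are assumed to be pack offsets purely for illustration,
and the `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy sorted[0..n) into bfs[1..n] in breadth-first order of an implicit
 * binary search tree, where the children of slot k are 2k and 2k+1.
 * bfs must have room for n + 1 elements; slot 0 is unused. Call with
 * next = 0 and k = 1. Returns the next unread position in 'sorted'.
 */
static size_t sketch_bfs_fill(const uint64_t *sorted, uint64_t *bfs,
			      size_t n, size_t next, size_t k)
{
	if (k <= n) {
		next = sketch_bfs_fill(sorted, bfs, n, next, 2 * k);
		assert(next < n);
		bfs[k] = sorted[next++];
		next = sketch_bfs_fill(sorted, bfs, n, next, 2 * k + 1);
	}
	return next;
}

/* Return the slot holding 'key', or 0 if absent. Takes O(log n) steps. */
static size_t sketch_bfs_find(const uint64_t *bfs, size_t n, uint64_t key)
{
	size_t k = 1;

	while (k <= n) {
		if (bfs[k] == key)
			return k;
		k = 2 * k + (bfs[k] < key); /* go left or right */
	}
	return 0;
}
```

The first few levels of the tree sit next to each other at the front of the
array, which is where the better locality comes from. The cost is that slot
numbers no longer correspond to pack positions.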

So, neither of these approaches is taken here. (Thankfully, the format
is versioned, so we are free to pursue these in the future.) But, cold
cache performance likely isn't interesting outside of one-off cases like
asking for the size of an object directly. In real-world usage, Git is
often performing many operations in the revindex.

The trade-off is worth it: we will avoid so much of the cost of
generating the revindex that the extra pointer chase will look like
noise in the following patch's benchmarks.

This patch describes the format and prepares callers (like in
pack-revindex.c) to be able to read *.rev files once they exist. An
implementation of the writer will appear in the next patch, and callers
will gradually begin using the writer in the patches that follow.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/pack-format.txt |  17 ++++
 builtin/repack.c                        |   1 +
 object-store.h                          |   3 +
 pack-revindex.c                         | 112 +++++++++++++++++++++---
 packfile.c                              |  13 ++-
 packfile.h                              |   1 +
 tmp-objdir.c                            |   4 +-
 7 files changed, 139 insertions(+), 12 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index f96b2e605f..9593f8bc68 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -259,6 +259,23 @@ Pack file entry: <+
 
     Index checksum of all of the above.
 
+== pack-*.rev files have the format:
+
+  - A 4-byte magic number '0x52494458' ('RIDX').
+
+  - A 4-byte version identifier (= 1)
+
+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256)
+
+  - A table of index positions, sorted by their corresponding offsets in the
+    packfile.
+
+  - A trailer, containing a:
+
+    checksum of the corresponding packfile, and
+
+    a checksum of all of the above.
+
 == multi-pack-index (MIDX) files have the following format:
 
 The multi-pack-index files refer to multiple pack-files and loose objects.
diff --git a/builtin/repack.c b/builtin/repack.c
index 279be11a16..8d643ddcb9 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -208,6 +208,7 @@ static struct {
 } exts[] = {
 	{".pack"},
 	{".idx"},
+	{".rev", 1},
 	{".bitmap", 1},
 	{".promisor", 1},
 };
diff --git a/object-store.h b/object-store.h
index c4fc9dd74e..3fbf11280f 100644
--- a/object-store.h
+++ b/object-store.h
@@ -85,6 +85,9 @@ struct packed_git {
 		 multi_pack_index:1;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	struct revindex_entry *revindex;
+	const void *revindex_data;
+	const void *revindex_map;
+	size_t revindex_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
 	char pack_name[FLEX_ARRAY]; /* more */
 };
diff --git a/pack-revindex.c b/pack-revindex.c
index 2cd9d632f1..1baaf2c42a 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -164,16 +164,98 @@ static void create_pack_revindex(struct packed_git *p)
 	sort_revindex(p->revindex, num_ent, p->pack_size);
 }
 
-int load_pack_revindex(struct packed_git *p)
+static int load_pack_revindex_from_memory(struct packed_git *p)
 {
-	if (!p->revindex) {
-		if (open_pack_index(p))
-			return -1;
-		create_pack_revindex(p);
-	}
+	if (open_pack_index(p))
+		return -1;
+	create_pack_revindex(p);
 	return 0;
 }
 
+static char *pack_revindex_filename(struct packed_git *p)
+{
+	size_t len;
+	if (!strip_suffix(p->pack_name, ".pack", &len))
+		BUG("pack_name does not end in .pack");
+	return xstrfmt("%.*s.rev", (int)len, p->pack_name);
+}
+
+#define RIDX_MIN_SIZE (12 + (2 * the_hash_algo->rawsz))
+
+static int load_revindex_from_disk(char *revindex_name,
+				   uint32_t num_objects,
+				   const void **data, size_t *len)
+{
+	int fd, ret = 0;
+	struct stat st;
+	size_t revindex_size;
+
+	fd = git_open(revindex_name);
+
+	if (fd < 0) {
+		ret = -1;
+		goto cleanup;
+	}
+	if (fstat(fd, &st)) {
+		ret = error_errno(_("failed to read %s"), revindex_name);
+		goto cleanup;
+	}
+
+	revindex_size = xsize_t(st.st_size);
+
+	if (revindex_size < RIDX_MIN_SIZE) {
+		ret = error(_("reverse-index file %s is too small"), revindex_name);
+		goto cleanup;
+	}
+
+	if (revindex_size - RIDX_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
+		ret = error(_("reverse-index file %s is corrupt"), revindex_name);
+		goto cleanup;
+	}
+
+	*len = revindex_size;
+	*data = xmmap(NULL, revindex_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+cleanup:
+	close(fd);
+	return ret;
+}
+
+static int load_pack_revindex_from_disk(struct packed_git *p)
+{
+	char *revindex_name;
+	int ret;
+	if (open_pack_index(p))
+		return -1;
+
+	revindex_name = pack_revindex_filename(p);
+
+	ret = load_revindex_from_disk(revindex_name,
+				      p->num_objects,
+				      &p->revindex_map,
+				      &p->revindex_size);
+	if (ret)
+		goto cleanup;
+
+	p->revindex_data = (char *)p->revindex_map + 12;
+
+cleanup:
+	free(revindex_name);
+	return ret;
+}
+
+int load_pack_revindex(struct packed_git *p)
+{
+	if (p->revindex || p->revindex_data)
+		return 0;
+
+	if (!load_pack_revindex_from_disk(p))
+		return 0;
+	else if (!load_pack_revindex_from_memory(p))
+		return 0;
+	return -1;
+}
+
 int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
 {
 	int lo = 0;
@@ -203,18 +285,28 @@ int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
 
 uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
 {
-	if (!p->revindex)
+	if (!(p->revindex || p->revindex_data))
 		BUG("pack_pos_to_index: reverse index not yet loaded");
 	if (pos >= p->num_objects)
 		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
-	return p->revindex[pos].nr;
+
+	if (p->revindex)
+		return p->revindex[pos].nr;
+	else
+		return get_be32((char *)p->revindex_data + (pos * sizeof(uint32_t)));
 }
 
 off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos)
 {
-	if (!p->revindex)
+	if (!(p->revindex || p->revindex_data))
 		BUG("pack_pos_to_index: reverse index not yet loaded");
 	if (pos > p->num_objects)
 		BUG("pack_pos_to_offset: out-of-bounds object at %"PRIu32, pos);
-	return p->revindex[pos].offset;
+
+	if (p->revindex)
+		return p->revindex[pos].offset;
+	else if (pos == p->num_objects)
+		return p->pack_size - the_hash_algo->rawsz;
+	else
+		return nth_packed_object_offset(p, pack_pos_to_index(p, pos));
 }
diff --git a/packfile.c b/packfile.c
index 46c9c7ea3c..e636e5ca17 100644
--- a/packfile.c
+++ b/packfile.c
@@ -324,11 +324,21 @@ void close_pack_index(struct packed_git *p)
 	}
 }
 
+void close_pack_revindex(struct packed_git *p) {
+	if (!p->revindex_map)
+		return;
+
+	munmap((void *)p->revindex_map, p->revindex_size);
+	p->revindex_map = NULL;
+	p->revindex_data = NULL;
+}
+
 void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
 	close_pack_index(p);
+	close_pack_revindex(p);
 }
 
 void close_object_store(struct raw_object_store *o)
@@ -351,7 +361,7 @@ void close_object_store(struct raw_object_store *o)
 
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -853,6 +863,7 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	if (!strcmp(file_name, "multi-pack-index"))
 		return;
 	if (ends_with(file_name, ".idx") ||
+	    ends_with(file_name, ".rev") ||
 	    ends_with(file_name, ".pack") ||
 	    ends_with(file_name, ".bitmap") ||
 	    ends_with(file_name, ".keep") ||
diff --git a/packfile.h b/packfile.h
index a58fc738e0..4cfec9e8d3 100644
--- a/packfile.h
+++ b/packfile.h
@@ -90,6 +90,7 @@ uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
 
 unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
 void close_pack_windows(struct packed_git *);
+void close_pack_revindex(struct packed_git *);
 void close_pack(struct packed_git *);
 void close_object_store(struct raw_object_store *o);
 void unuse_pack(struct pack_window **);
diff --git a/tmp-objdir.c b/tmp-objdir.c
index 42ed4db5d3..da414df14f 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -187,7 +187,9 @@ static int pack_copy_priority(const char *name)
 		return 2;
 	if (ends_with(name, ".idx"))
 		return 3;
-	return 4;
+	if (ends_with(name, ".rev"))
+		return 4;
+	return 5;
 }
 
 static int pack_copy_cmp(const char *a, const char *b)
-- 
2.30.0.138.g6d7191ea01



* [PATCH 2/8] pack-write.c: prepare to write 'pack-*.rev' files
  2021-01-08 18:19 [PATCH 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
  2021-01-08 18:19 ` [PATCH 1/8] packfile: prepare for the existence of '*.rev' files Taylor Blau
@ 2021-01-08 18:20 ` Taylor Blau
  2021-01-08 18:20 ` [PATCH 3/8] builtin/index-pack.c: write reverse indexes Taylor Blau
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-08 18:20 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

This patch prepares for callers to be able to write reverse index files
to disk.

It adds the necessary machinery to write a format-compliant .rev file
from within 'write_rev_file()', which is called from
'finish_tmp_packfile()'.
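
The core of the writer is producing index positions in pack order. The sketch
below shows that step in isolation, under some simplifying assumptions: it
sorts (offset, position) pairs with plain qsort() instead of the QSORT_S()
call used in the diff, writes into a caller-supplied buffer rather than a
hashfile, and uses hypothetical `sketch_` names throughout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_entry {
	uint64_t offset;  /* object's offset within the packfile */
	uint32_t idx_pos; /* object's position in the pack index */
};

static int sketch_offset_cmp(const void *va, const void *vb)
{
	const struct sketch_entry *a = va, *b = vb;

	if (a->offset < b->offset)
		return -1;
	if (a->offset > b->offset)
		return 1;
	return 0;
}

static void sketch_put_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/*
 * offsets[i] is the pack offset of the object at index position i.
 * Writes nr big-endian index positions, ordered by pack offset, into
 * out (which must hold 4 * nr bytes). Returns 0 on success, or -1 if
 * memory could not be allocated.
 */
static int sketch_write_rev_table(const uint64_t *offsets, uint32_t nr,
				  unsigned char *out)
{
	struct sketch_entry *entries;
	uint32_t i;

	if (!nr)
		return 0;
	assert(offsets && out);

	entries = malloc((size_t)nr * sizeof(*entries));
	if (!entries)
		return -1;
	for (i = 0; i < nr; i++) {
		entries[i].offset = offsets[i];
		entries[i].idx_pos = i;
	}
	qsort(entries, nr, sizeof(*entries), sketch_offset_cmp);

	for (i = 0; i < nr; i++)
		sketch_put_be32(out + (size_t)i * 4, entries[i].idx_pos);

	free(entries);
	return 0;
}
```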

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-write.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 pack.h       |   4 ++
 2 files changed, 126 insertions(+), 1 deletion(-)

diff --git a/pack-write.c b/pack-write.c
index 3513665e1e..68db5a9edf 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -166,6 +166,116 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
 	return index_name;
 }
 
+static int pack_order_cmp(const void *va, const void *vb, void *ctx)
+{
+	struct pack_idx_entry **objects = ctx;
+
+	off_t oa = objects[*(uint32_t*)va]->offset;
+	off_t ob = objects[*(uint32_t*)vb]->offset;
+
+	if (oa < ob)
+		return -1;
+	if (oa > ob)
+		return 1;
+	return 0;
+}
+
+#define RIDX_SIGNATURE 0x52494458 /* "RIDX" */
+#define RIDX_VERSION 1
+
+static void write_rev_header(struct hashfile *f)
+{
+	uint32_t oid_version;
+	switch (hash_algo_by_ptr(the_hash_algo)) {
+	case GIT_HASH_SHA1:
+		oid_version = 1;
+		break;
+	case GIT_HASH_SHA256:
+		oid_version = 2;
+		break;
+	default:
+		die("write_rev_header: unknown hash version");
+	}
+
+	hashwrite_be32(f, RIDX_SIGNATURE);
+	hashwrite_be32(f, RIDX_VERSION);
+	hashwrite_be32(f, oid_version);
+}
+
+static void write_rev_index_positions(struct hashfile *f,
+				      struct pack_idx_entry **objects,
+				      uint32_t nr_objects)
+{
+	uint32_t *pack_order;
+	uint32_t i;
+
+	ALLOC_ARRAY(pack_order, nr_objects);
+	for (i = 0; i < nr_objects; i++)
+		pack_order[i] = i;
+	QSORT_S(pack_order, nr_objects, pack_order_cmp, objects);
+
+	for (i = 0; i < nr_objects; i++)
+		hashwrite_be32(f, pack_order[i]);
+
+	free(pack_order);
+}
+
+static void write_rev_trailer(struct hashfile *f, const unsigned char *hash)
+{
+	hashwrite(f, hash, the_hash_algo->rawsz);
+}
+
+const char *write_rev_file(const char *rev_name,
+			   struct pack_idx_entry **objects,
+			   uint32_t nr_objects,
+			   const unsigned char *hash,
+			   unsigned flags)
+{
+	struct hashfile *f;
+	int fd;
+
+	if ((flags & WRITE_REV) && (flags & WRITE_REV_VERIFY))
+		die(_("cannot both write and verify reverse index"));
+
+	if (flags & WRITE_REV) {
+		if (!rev_name) {
+			struct strbuf tmp_file = STRBUF_INIT;
+			fd = odb_mkstemp(&tmp_file, "pack/tmp_rev_XXXXXX");
+			rev_name = strbuf_detach(&tmp_file, NULL);
+		} else {
+			unlink(rev_name);
+			fd = open(rev_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
+			if (fd < 0)
+				die_errno("unable to create '%s'", rev_name);
+		}
+		f = hashfd(fd, rev_name);
+	} else if (flags & WRITE_REV_VERIFY) {
+		struct stat statbuf;
+		if (stat(rev_name, &statbuf)) {
+			if (errno == ENOENT) {
+				/* .rev files are optional */
+				return NULL;
+			} else
+				die_errno(_("could not stat: %s"), rev_name);
+		}
+		f = hashfd_check(rev_name);
+	} else
+		return NULL;
+
+	write_rev_header(f);
+
+	write_rev_index_positions(f, objects, nr_objects);
+	write_rev_trailer(f, hash);
+
+	if (rev_name && adjust_shared_perm(rev_name) < 0)
+		die(_("failed to make %s readable"), rev_name);
+
+	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_CLOSE |
+				    ((flags & WRITE_IDX_VERIFY) ? 0 : CSUM_FSYNC));
+
+	return rev_name;
+}
+
 off_t write_pack_header(struct hashfile *f, uint32_t nr_entries)
 {
 	struct pack_header hdr;
@@ -341,7 +451,7 @@ void finish_tmp_packfile(struct strbuf *name_buffer,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[])
 {
-	const char *idx_tmp_name;
+	const char *idx_tmp_name, *rev_tmp_name = NULL;
 	int basename_len = name_buffer->len;
 
 	if (adjust_shared_perm(pack_tmp_name))
@@ -352,6 +462,9 @@ void finish_tmp_packfile(struct strbuf *name_buffer,
 	if (adjust_shared_perm(idx_tmp_name))
 		die_errno("unable to make temporary index file readable");
 
+	rev_tmp_name = write_rev_file(NULL, written_list, nr_written, hash,
+				      pack_idx_opts->flags);
+
 	strbuf_addf(name_buffer, "%s.pack", hash_to_hex(hash));
 
 	if (rename(pack_tmp_name, name_buffer->buf))
@@ -365,5 +478,13 @@ void finish_tmp_packfile(struct strbuf *name_buffer,
 
 	strbuf_setlen(name_buffer, basename_len);
 
+	if (rev_tmp_name) {
+		strbuf_addf(name_buffer, "%s.rev", hash_to_hex(hash));
+		if (rename(rev_tmp_name, name_buffer->buf))
+			die_errno("unable to rename temporary reverse-index file");
+	}
+
+	strbuf_setlen(name_buffer, basename_len);
+
 	free((void *)idx_tmp_name);
 }
diff --git a/pack.h b/pack.h
index 9fc0945ac9..30439e0784 100644
--- a/pack.h
+++ b/pack.h
@@ -42,6 +42,8 @@ struct pack_idx_option {
 	/* flag bits */
 #define WRITE_IDX_VERIFY 01 /* verify only, do not write the idx file */
 #define WRITE_IDX_STRICT 02
+#define WRITE_REV 04
+#define WRITE_REV_VERIFY 010
 
 	uint32_t version;
 	uint32_t off32_limit;
@@ -87,6 +89,8 @@ off_t write_pack_header(struct hashfile *f, uint32_t);
 void fixup_pack_header_footer(int, unsigned char *, const char *, uint32_t, unsigned char *, off_t);
 char *index_pack_lockfile(int fd);
 
+const char *write_rev_file(const char *rev_name, struct pack_idx_entry **objects, uint32_t nr_objects, const unsigned char *hash, unsigned flags);
+
 /*
  * The "hdr" output buffer should be at least this big, which will handle sizes
  * up to 2^67.
-- 
2.30.0.138.g6d7191ea01



* [PATCH 3/8] builtin/index-pack.c: write reverse indexes
  2021-01-08 18:19 [PATCH 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
  2021-01-08 18:19 ` [PATCH 1/8] packfile: prepare for the existence of '*.rev' files Taylor Blau
  2021-01-08 18:20 ` [PATCH 2/8] pack-write.c: prepare to write 'pack-*.rev' files Taylor Blau
@ 2021-01-08 18:20 ` Taylor Blau
  2021-01-08 18:20 ` [PATCH 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex' Taylor Blau
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-08 18:20 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Teach 'git index-pack' to optionally write and verify a reverse index
with '--[no-]rev-index', and to respect the 'pack.writeReverseIndex'
configuration option.
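
The precedence between the configuration and the command line can be
summarized with the following sketch. The flag values are copied from pack.h in
the previous patch; the tri-state 'cli_rev_index' parameter and the `sketch_`
function name are hypothetical simplifications of what cmd_index_pack() does
in the diff below.

```c
#include <assert.h>

/* Flag values as defined in pack.h by the previous patch. */
#define WRITE_REV 04
#define WRITE_REV_VERIFY 010

/*
 * flags:         option flags after reading the configuration
 * cli_rev_index: 1 for --rev-index, 0 for --no-rev-index, -1 if neither
 *                was given (a hypothetical tri-state)
 * verify:        non-zero when --verify was given
 *
 * Returns the flags with exactly one (or neither) of WRITE_REV and
 * WRITE_REV_VERIFY set; all other bits are left untouched.
 */
static unsigned sketch_resolve_rev_flags(unsigned flags, int cli_rev_index,
					 int verify)
{
	int rev_index = !!(flags & (WRITE_REV | WRITE_REV_VERIFY));

	assert(cli_rev_index >= -1 && cli_rev_index <= 1);
	if (cli_rev_index >= 0)
		rev_index = cli_rev_index; /* command line wins over config */

	flags &= ~(unsigned)(WRITE_REV | WRITE_REV_VERIFY);
	if (rev_index)
		flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
	return flags;
}
```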

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-index-pack.txt | 20 ++++++---
 builtin/index-pack.c             | 64 ++++++++++++++++++++++++-----
 t/t5325-reverse-index.sh         | 69 ++++++++++++++++++++++++++++++++
 3 files changed, 137 insertions(+), 16 deletions(-)
 create mode 100755 t/t5325-reverse-index.sh

diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index af0c26232c..b65f380269 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -9,17 +9,18 @@ git-index-pack - Build pack index file for an existing packed archive
 SYNOPSIS
 --------
 [verse]
-'git index-pack' [-v] [-o <index-file>] <pack-file>
+'git index-pack' [-v] [-o <index-file>] [--[no-]rev-index] <pack-file>
 'git index-pack' --stdin [--fix-thin] [--keep] [-v] [-o <index-file>]
-                 [<pack-file>]
+		  [--[no-]rev-index] [<pack-file>]
 
 
 DESCRIPTION
 -----------
 Reads a packed archive (.pack) from the specified file, and
-builds a pack index file (.idx) for it.  The packed archive
-together with the pack index can then be placed in the
-objects/pack/ directory of a Git repository.
+builds a pack index file (.idx) for it. Optionally writes a
+reverse-index (.rev) for the specified pack. The packed
+archive together with the pack index can then be placed in
+the objects/pack/ directory of a Git repository.
 
 
 OPTIONS
@@ -33,7 +34,14 @@ OPTIONS
 	file is constructed from the name of packed archive
 	file by replacing .pack with .idx (and the program
 	fails if the name of packed archive does not end
-	with .pack).
+	with .pack). Incompatible with `--rev-index`.
+
+--[no-]rev-index::
+	When this flag is provided, generate a reverse index
+	(a `.rev` file) corresponding to the given pack. If
+	`--verify` is given, ensure that the existing
+	reverse index is correct. Takes precedence over
+	`pack.writeReverseIndex`.
 
 --stdin::
 	When this flag is provided, the pack is read from stdin
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 4b8d86e0ad..03408250b1 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -17,7 +17,7 @@
 #include "promisor-remote.h"
 
 static const char index_pack_usage[] =
-"git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--verify] [--strict] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
+"git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--[no-]rev-index] [--verify] [--strict] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
 
 struct object_entry {
 	struct pack_idx_entry idx;
@@ -1436,13 +1436,13 @@ static void fix_unresolved_deltas(struct hashfile *f)
 	free(sorted_by_pos);
 }
 
-static const char *derive_filename(const char *pack_name, const char *suffix,
-				   struct strbuf *buf)
+static const char *derive_filename(const char *pack_name, const char *strip,
+				   const char *suffix, struct strbuf *buf)
 {
 	size_t len;
-	if (!strip_suffix(pack_name, ".pack", &len))
-		die(_("packfile name '%s' does not end with '.pack'"),
-		    pack_name);
+	if (!strip_suffix(pack_name, strip, &len))
+		die(_("packfile name '%s' does not end with '%s'"),
+		    pack_name, strip);
 	strbuf_add(buf, pack_name, len);
 	strbuf_addch(buf, '.');
 	strbuf_addstr(buf, suffix);
@@ -1459,7 +1459,7 @@ static void write_special_file(const char *suffix, const char *msg,
 	int msg_len = strlen(msg);
 
 	if (pack_name)
-		filename = derive_filename(pack_name, suffix, &name_buf);
+		filename = derive_filename(pack_name, ".pack", suffix, &name_buf);
 	else
 		filename = odb_pack_name(&name_buf, hash, suffix);
 
@@ -1484,12 +1484,14 @@ static void write_special_file(const char *suffix, const char *msg,
 
 static void final(const char *final_pack_name, const char *curr_pack_name,
 		  const char *final_index_name, const char *curr_index_name,
+		  const char *final_rev_index_name, const char *curr_rev_index_name,
 		  const char *keep_msg, const char *promisor_msg,
 		  unsigned char *hash)
 {
 	const char *report = "pack";
 	struct strbuf pack_name = STRBUF_INIT;
 	struct strbuf index_name = STRBUF_INIT;
+	struct strbuf rev_index_name = STRBUF_INIT;
 	int err;
 
 	if (!from_stdin) {
@@ -1524,6 +1526,16 @@ static void final(const char *final_pack_name, const char *curr_pack_name,
 	} else
 		chmod(final_index_name, 0444);
 
+	if (curr_rev_index_name) {
+		if (final_rev_index_name != curr_rev_index_name) {
+			if (!final_rev_index_name)
+				final_rev_index_name = odb_pack_name(&rev_index_name, hash, "rev");
+			if (finalize_object_file(curr_rev_index_name, final_rev_index_name))
+				die(_("cannot store reverse index file"));
+		} else
+			chmod(final_rev_index_name, 0444);
+	}
+
 	if (do_fsck_object) {
 		struct packed_git *p;
 		p = add_packed_git(final_index_name, strlen(final_index_name), 0);
@@ -1553,6 +1565,7 @@ static void final(const char *final_pack_name, const char *curr_pack_name,
 		}
 	}
 
+	strbuf_release(&rev_index_name);
 	strbuf_release(&index_name);
 	strbuf_release(&pack_name);
 }
@@ -1578,6 +1591,12 @@ static int git_index_pack_config(const char *k, const char *v, void *cb)
 		}
 		return 0;
 	}
+	if (!strcmp(k, "pack.writereverseindex")) {
+		if (git_config_bool(k, v))
+			opts->flags |= WRITE_REV;
+		else
+			opts->flags &= ~WRITE_REV;
+	}
 	return git_default_config(k, v, cb);
 }
 
@@ -1695,12 +1714,14 @@ static void show_pack_info(int stat_only)
 
 int cmd_index_pack(int argc, const char **argv, const char *prefix)
 {
-	int i, fix_thin_pack = 0, verify = 0, stat_only = 0;
+	int i, fix_thin_pack = 0, verify = 0, stat_only = 0, rev_index;
 	const char *curr_index;
-	const char *index_name = NULL, *pack_name = NULL;
+	const char *curr_rev_index = NULL;
+	const char *index_name = NULL, *pack_name = NULL, *rev_index_name = NULL;
 	const char *keep_msg = NULL;
 	const char *promisor_msg = NULL;
 	struct strbuf index_name_buf = STRBUF_INIT;
+	struct strbuf rev_index_name_buf = STRBUF_INIT;
 	struct pack_idx_entry **idx_objects;
 	struct pack_idx_option opts;
 	unsigned char pack_hash[GIT_MAX_RAWSZ];
@@ -1727,6 +1748,8 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (prefix && chdir(prefix))
 		die(_("Cannot come back to cwd"));
 
+	rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
+
 	for (i = 1; i < argc; i++) {
 		const char *arg = argv[i];
 
@@ -1805,6 +1828,10 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 				if (hash_algo == GIT_HASH_UNKNOWN)
 					die(_("unknown hash algorithm '%s'"), arg);
 				repo_set_hash_algo(the_repository, hash_algo);
+			} else if (!strcmp(arg, "--rev-index")) {
+				rev_index = 1;
+			} else if (!strcmp(arg, "--no-rev-index")) {
+				rev_index = 0;
 			} else
 				usage(index_pack_usage);
 			continue;
@@ -1824,7 +1851,16 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (from_stdin && hash_algo)
 		die(_("--object-format cannot be used with --stdin"));
 	if (!index_name && pack_name)
-		index_name = derive_filename(pack_name, "idx", &index_name_buf);
+		index_name = derive_filename(pack_name, ".pack", "idx", &index_name_buf);
+
+	opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY);
+	if (rev_index) {
+		opts.flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
+		if (index_name)
+			rev_index_name = derive_filename(index_name,
+							 ".idx", "rev",
+							 &rev_index_name_buf);
+	}
 
 	if (verify) {
 		if (!index_name)
@@ -1878,11 +1914,16 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	for (i = 0; i < nr_objects; i++)
 		idx_objects[i] = &objects[i].idx;
 	curr_index = write_idx_file(index_name, idx_objects, nr_objects, &opts, pack_hash);
+	if (rev_index)
+		curr_rev_index = write_rev_file(rev_index_name, idx_objects,
+						nr_objects, pack_hash,
+						opts.flags);
 	free(idx_objects);
 
 	if (!verify)
 		final(pack_name, curr_pack,
 		      index_name, curr_index,
+		      rev_index_name, curr_rev_index,
 		      keep_msg, promisor_msg,
 		      pack_hash);
 	else
@@ -1893,10 +1934,13 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 
 	free(objects);
 	strbuf_release(&index_name_buf);
+	strbuf_release(&rev_index_name_buf);
 	if (pack_name == NULL)
 		free((void *) curr_pack);
 	if (index_name == NULL)
 		free((void *) curr_index);
+	if (rev_index_name == NULL)
+		free((void *) curr_rev_index);
 
 	/*
 	 * Let the caller know this pack is not self contained
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
new file mode 100755
index 0000000000..f9afa698dc
--- /dev/null
+++ b/t/t5325-reverse-index.sh
@@ -0,0 +1,69 @@
+#!/bin/sh
+
+test_description='on-disk reverse index'
+. ./test-lib.sh
+
+packdir=.git/objects/pack
+
+test_expect_success 'setup' '
+	test_commit base &&
+
+	pack=$(git pack-objects --all $packdir/pack) &&
+	rev=$packdir/pack-$pack.rev &&
+
+	test_path_is_missing $rev
+'
+
+test_index_pack () {
+	rm -f $rev &&
+	conf=$1 &&
+	shift &&
+	git -c pack.writeReverseIndex=$conf index-pack "$@" \
+		$packdir/pack-$pack.pack
+}
+
+test_expect_success 'index-pack with pack.writeReverseIndex' '
+	test_index_pack "" &&
+	test_path_is_missing $rev &&
+
+	test_index_pack false &&
+	test_path_is_missing $rev &&
+
+	test_index_pack true &&
+	test_path_is_file $rev
+'
+
+test_expect_success 'index-pack with --[no-]rev-index' '
+	for conf in "" true false
+	do
+		test_index_pack "$conf" --rev-index &&
+		test_path_exists $rev &&
+
+		test_index_pack "$conf" --no-rev-index &&
+		test_path_is_missing $rev
+	done
+'
+
+test_expect_success 'index-pack can verify reverse indexes' '
+	test_when_finished "rm -f $rev" &&
+	test_index_pack true &&
+
+	test_path_is_file $rev &&
+	git index-pack --rev-index --verify $packdir/pack-$pack.pack &&
+
+	# Intentionally corrupt the reverse index.
+	chmod u+w $rev &&
+	printf "xxxx" | dd of=$rev bs=1 count=4 conv=notrunc &&
+
+	test_must_fail git index-pack --rev-index --verify \
+		$packdir/pack-$pack.pack 2>err &&
+	grep "validation error" err
+'
+
+test_expect_success 'index-pack infers reverse index name with -o' '
+	git index-pack --rev-index -o other.idx $packdir/pack-$pack.pack &&
+	test_path_is_file other.idx &&
+	test_path_is_file other.rev
+'
+
+test_done
-- 
2.30.0.138.g6d7191ea01
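
The two call sites in this patch pass derive_filename() a suffix to strip
and an extension to append (".pack" to "idx", then ".idx" to "rev"), which
is how '-o other.idx' ends up producing 'other.rev' in the last test. A
minimal sketch of that behavior follows, assuming a hypothetical
derive_filename_sketch() that returns a malloc'd string rather than filling
a strbuf. Returning NULL on a mismatched suffix is this sketch's own choice;
what the real helper does in that case is not visible in this part of the
diff.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for derive_filename(): strip `strip` from the end
 * of `name` and append "." plus `suffix`. Returns a malloc'd string that
 * the caller must free, or NULL if `name` does not end with `strip` (or if
 * allocation fails).
 */
char *derive_filename_sketch(const char *name, const char *strip,
			     const char *suffix)
{
	size_t name_len = strlen(name);
	size_t strip_len = strlen(strip);
	size_t suffix_len = strlen(suffix);
	size_t base_len;
	char *out;

	/* `name` must end with `strip`, e.g. "other.idx" ends with ".idx". */
	if (name_len < strip_len ||
	    strcmp(name + name_len - strip_len, strip))
		return NULL;

	base_len = name_len - strip_len;
	out = malloc(base_len + 1 + suffix_len + 1);
	if (!out)
		return NULL;

	memcpy(out, name, base_len);
	out[base_len] = '.';
	/* Copy the suffix together with its terminating NUL. */
	memcpy(out + base_len + 1, suffix, suffix_len + 1);
	return out;
}
```

Called as derive_filename_sketch("other.idx", ".idx", "rev"), this yields
"other.rev".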



* [PATCH 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex'
  2021-01-08 18:19 [PATCH 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                   ` (2 preceding siblings ...)
  2021-01-08 18:20 ` [PATCH 3/8] builtin/index-pack.c: write reverse indexes Taylor Blau
@ 2021-01-08 18:20 ` Taylor Blau
  2021-01-08 18:20 ` [PATCH 5/8] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex' Taylor Blau
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-08 18:20 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Now that we have an implementation that can write the new reverse index
format, enable writing a .rev file in 'git pack-objects' by consulting
the pack.writeReverseIndex configuration variable.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c   |  7 +++++++
 t/t5325-reverse-index.sh | 13 +++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a193cdaf2f..80adce154a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2951,6 +2951,13 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			    pack_idx_opts.version);
 		return 0;
 	}
+	if (!strcmp(k, "pack.writereverseindex")) {
+		if (git_config_bool(k, v))
+			pack_idx_opts.flags |= WRITE_REV;
+		else
+			pack_idx_opts.flags &= ~WRITE_REV;
+		return 0;
+	}
 	if (!strcmp(k, "uploadpack.blobpackfileuri")) {
 		struct configured_exclusion *ex = xmalloc(sizeof(*ex));
 		const char *oid_end, *pack_end;
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index f9afa698dc..431699ade2 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -66,4 +66,17 @@ test_expect_success 'index-pack infers reverse index name with -o' '
 	test_path_is_file other.rev
 '
 
+test_expect_success 'pack-objects respects pack.writeReverseIndex' '
+	test_when_finished "rm -fr pack-1-*" &&
+
+	git -c pack.writeReverseIndex= pack-objects --all pack-1 &&
+	test_path_is_missing pack-1-*.rev &&
+
+	git -c pack.writeReverseIndex=false pack-objects --all pack-1 &&
+	test_path_is_missing pack-1-*.rev &&
+
+	git -c pack.writeReverseIndex=true pack-objects --all pack-1 &&
+	test_path_is_file pack-1-*.rev
+'
+
 test_done
-- 
2.30.0.138.g6d7191ea01
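
The config handling in this patch sets or clears a single bit each time the
key is seen, so the last value wins and unrelated flags are left alone. A
rough sketch of that pattern, assuming a hypothetical WRITE_REV_SKETCH bit
and a deliberately simplified boolean parser (only a few lowercase spellings
are recognized; the empty string is false and a missing value is true, which
matches how the '-c pack.writeReverseIndex=' case in the test is expected to
behave):

```c
#include <string.h>

/* Hypothetical bit for illustration; not the value git uses for WRITE_REV. */
#define WRITE_REV_SKETCH (1u << 0)

/*
 * Simplified boolean parse. NULL means the key was given with no value at
 * all, which counts as true; "" and anything unrecognized count as false
 * in this sketch.
 */
int config_bool_sketch(const char *v)
{
	if (!v)
		return 1;
	return !strcmp(v, "true") || !strcmp(v, "yes") ||
	       !strcmp(v, "on") || !strcmp(v, "1");
}

/* Set or clear only the reverse-index bit, preserving every other flag. */
unsigned apply_write_rev_config(unsigned flags, const char *v)
{
	if (config_bool_sketch(v))
		flags |= WRITE_REV_SKETCH;
	else
		flags &= ~WRITE_REV_SKETCH;
	return flags;
}
```

Applying "true" and then "false" leaves the bit cleared, which is why a
later, more specific config source can turn the feature back off.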



* [PATCH 5/8] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex'
  2021-01-08 18:19 [PATCH 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                   ` (3 preceding siblings ...)
  2021-01-08 18:20 ` [PATCH 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex' Taylor Blau
@ 2021-01-08 18:20 ` Taylor Blau
  2021-01-08 18:20 ` [PATCH 6/8] t: prepare for GIT_TEST_WRITE_REV_INDEX Taylor Blau
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-08 18:20 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Now that the pack.writeReverseIndex configuration is respected in both
'git index-pack' and 'git pack-objects' (and therefore, all of their
callers), we can safely advertise it for use in the git-config manual.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index 837f1b1679..3da4ea98e2 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -133,3 +133,10 @@ pack.writeBitmapHashCache::
 	between an older, bitmapped pack and objects that have been
 	pushed since the last gc). The downside is that it consumes 4
 	bytes per object of disk space. Defaults to true.
+
+pack.writeReverseIndex::
+	When true, git will write a corresponding .rev file (see:
+	link:../technical/pack-format.html[Documentation/technical/pack-format.txt])
+	for each new packfile that it writes in all places except for
+	linkgit:git-fast-import[1] and in the bulk checkin mechanism.
+	Defaults to false.
-- 
2.30.0.138.g6d7191ea01



* [PATCH 6/8] t: prepare for GIT_TEST_WRITE_REV_INDEX
  2021-01-08 18:19 [PATCH 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                   ` (4 preceding siblings ...)
  2021-01-08 18:20 ` [PATCH 5/8] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex' Taylor Blau
@ 2021-01-08 18:20 ` Taylor Blau
  2021-01-12 17:11   ` Ævar Arnfjörð Bjarmason
  2021-01-08 18:20 ` [PATCH 7/8] t: support GIT_TEST_WRITE_REV_INDEX Taylor Blau
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2021-01-08 18:20 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

In the next patch, we'll add support for unconditionally enabling the
'pack.writeReverseIndex' setting with a new GIT_TEST_WRITE_REV_INDEX
environment variable.

This causes a little bit of fallout in tests that, for example, compare
the list of files in the pack directory and are unprepared to see .rev
files in that output.

For now, sprinkle these locations with a 'grep -v "\.rev$"' to ignore
them. Once the pack.writeReverseIndex option has been thoroughly
tested, we will default it to 'true', removing GIT_TEST_WRITE_REV_INDEX,
and making it possible to revert this patch.

At that time, we'll have to adjust the expected output to contain the
relevant .rev files, but for now this will do just fine.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t5319-multi-pack-index.sh | 2 +-
 t/t5325-reverse-index.sh    | 4 ++++
 t/t5604-clone-reference.sh  | 2 +-
 t/t5702-protocol-v2.sh      | 4 ++--
 t/t6500-gc.sh               | 4 ++--
 t/t9300-fast-import.sh      | 2 +-
 6 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 297de502a9..9696f88c2f 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -710,7 +710,7 @@ test_expect_success 'expire respects .keep files' '
 		PACKA=$(ls .git/objects/pack/a-pack*\.pack | sed s/\.pack\$//) &&
 		touch $PACKA.keep &&
 		git multi-pack-index expire &&
-		ls -S .git/objects/pack/a-pack* | grep $PACKA >a-pack-files &&
+		ls -S .git/objects/pack/a-pack* | grep $PACKA | grep -v "\.rev$" >a-pack-files &&
 		test_line_count = 3 a-pack-files &&
 		test-tool read-midx .git/objects | grep idx >midx-list &&
 		test_line_count = 2 midx-list
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index 431699ade2..5a59cc71e4 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -3,6 +3,10 @@
 test_description='on-disk reverse index'
 . ./test-lib.sh
 
+# The below tests want control over the 'pack.writeReverseIndex' setting
+# themselves to assert various combinations of it with other options.
+sane_unset GIT_TEST_WRITE_REV_INDEX
+
 packdir=.git/objects/pack
 
 test_expect_success 'setup' '
diff --git a/t/t5604-clone-reference.sh b/t/t5604-clone-reference.sh
index 2f7be23044..d064426d03 100755
--- a/t/t5604-clone-reference.sh
+++ b/t/t5604-clone-reference.sh
@@ -318,7 +318,7 @@ test_expect_success SYMLINKS 'clone repo with symlinked or unknown files at obje
 		test_cmp T.objects T$option.objects &&
 		(
 			cd T$option/.git/objects &&
-			find . -type f | sort >../../../T$option.objects-files.raw &&
+			find . -type f | grep -v \.rev$ | sort >../../../T$option.objects-files.raw &&
 			find . -type l | sort >../../../T$option.objects-symlinks.raw
 		)
 	done &&
diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index 7d5b17909b..9ebf045739 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -848,7 +848,7 @@ test_expect_success 'part of packfile response provided as URI' '
 	test -f h2found &&
 
 	# Ensure that there are exactly 6 files (3 .pack and 3 .idx).
-	ls http_child/.git/objects/pack/* >filelist &&
+	ls http_child/.git/objects/pack/* | grep -v \.rev$ >filelist &&
 	test_line_count = 6 filelist
 '
 
@@ -902,7 +902,7 @@ test_expect_success 'packfile-uri with transfer.fsckobjects' '
 		clone "$HTTPD_URL/smart/http_parent" http_child &&
 
 	# Ensure that there are exactly 4 files (2 .pack and 2 .idx).
-	ls http_child/.git/objects/pack/* >filelist &&
+	ls http_child/.git/objects/pack/* | grep -v \.rev$ >filelist &&
 	test_line_count = 4 filelist
 '
 
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 4a3b8f48ac..d52f92f5a1 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -106,13 +106,13 @@ test_expect_success 'auto gc with too many loose objects does not attempt to cre
 	test_commit "$(test_oid obj2)" &&
 	# Our first gc will create a pack; our second will create a second pack
 	git gc --auto &&
-	ls .git/objects/pack | sort >existing_packs &&
+	ls .git/objects/pack | grep -v \.rev$ | sort >existing_packs &&
 	test_commit "$(test_oid obj3)" &&
 	test_commit "$(test_oid obj4)" &&
 
 	git gc --auto 2>err &&
 	test_i18ngrep ! "^warning:" err &&
-	ls .git/objects/pack/ | sort >post_packs &&
+	ls .git/objects/pack/ | grep -v \.rev$ | sort >post_packs &&
 	comm -1 -3 existing_packs post_packs >new &&
 	comm -2 -3 existing_packs post_packs >del &&
 	test_line_count = 0 del && # No packs are deleted
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 308c1ef42c..100df52a71 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -1629,7 +1629,7 @@ test_expect_success 'O: blank lines not necessary after other commands' '
 	INPUT_END
 
 	git fast-import <input &&
-	test 8 = $(find .git/objects/pack -type f | grep -v multi-pack-index | wc -l) &&
+	test 8 = $(find .git/objects/pack -type f \( -name "*.idx" -o -name "*.pack" \) | wc -l) &&
 	test $(git rev-parse refs/tags/O3-2nd) = $(git rev-parse O3^) &&
 	git log --reverse --pretty=oneline O3 | sed s/^.*z// >actual &&
 	test_cmp expect actual
-- 
2.30.0.138.g6d7191ea01



* [PATCH 7/8] t: support GIT_TEST_WRITE_REV_INDEX
  2021-01-08 18:19 [PATCH 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                   ` (5 preceding siblings ...)
  2021-01-08 18:20 ` [PATCH 6/8] t: prepare for GIT_TEST_WRITE_REV_INDEX Taylor Blau
@ 2021-01-08 18:20 ` Taylor Blau
  2021-01-12 16:49   ` Derrick Stolee
  2021-01-12 17:18   ` Ævar Arnfjörð Bjarmason
  2021-01-08 18:20 ` [PATCH 8/8] pack-revindex: ensure that on-disk reverse indexes are given precedence Taylor Blau
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-08 18:20 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

Add a new option that unconditionally enables the pack.writeReverseIndex
setting in order to run the whole test suite in a mode that generates
on-disk reverse indexes.

Once on-disk reverse indexes are proven out over several releases, we
can change the default value of that configuration to 'true', and drop
this patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/index-pack.c   | 5 ++++-
 builtin/pack-objects.c | 2 ++
 pack-revindex.h        | 2 ++
 t/README               | 3 +++
 4 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 03408250b1..0bde325a8b 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1748,7 +1748,10 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (prefix && chdir(prefix))
 		die(_("Cannot come back to cwd"));
 
-	rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
+	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
+		rev_index = 1;
+	else
+		rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
 
 	for (i = 1; i < argc; i++) {
 		const char *arg = argv[i];
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 80adce154a..5b3395bd7a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3597,6 +3597,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	reset_pack_idx_option(&pack_idx_opts);
 	git_config(git_pack_config, NULL);
+	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
+		pack_idx_opts.flags |= WRITE_REV;
 
 	progress = isatty(2);
 	argc = parse_options(argc, argv, prefix, pack_objects_options,
diff --git a/pack-revindex.h b/pack-revindex.h
index b501a7cd62..d2d466e298 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -1,6 +1,8 @@
 #ifndef PACK_REVINDEX_H
 #define PACK_REVINDEX_H
 
+#define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
+
 struct packed_git;
 
 int load_pack_revindex(struct packed_git *p);
diff --git a/t/README b/t/README
index c730a70770..0f97a51640 100644
--- a/t/README
+++ b/t/README
@@ -439,6 +439,9 @@ GIT_TEST_DEFAULT_HASH=<hash-algo> specifies which hash algorithm to
 use in the test scripts. Recognized values for <hash-algo> are "sha1"
 and "sha256".
 
+GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
+'pack.writeReverseIndex' setting.
+
 Naming Tests
 ------------
 
-- 
2.30.0.138.g6d7191ea01



* [PATCH 8/8] pack-revindex: ensure that on-disk reverse indexes are given precedence
  2021-01-08 18:19 [PATCH 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                   ` (6 preceding siblings ...)
  2021-01-08 18:20 ` [PATCH 7/8] t: support GIT_TEST_WRITE_REV_INDEX Taylor Blau
@ 2021-01-08 18:20 ` Taylor Blau
  2021-01-13 22:28 ` [PATCH v2 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
  2021-01-25 23:37 ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Taylor Blau
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-08 18:20 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder

When an on-disk reverse index exists, there is no need to generate one
in memory. In fact, doing so can be slow, and require large amounts of
the heap.

Let's make sure that we treat the on-disk reverse index with precedence
(i.e., that when it exists, we don't bother trying to generate an
equivalent one in memory) by teaching Git how to conditionally die()
when generating a reverse index in memory.

Then, add a test to ensure that when an on-disk reverse index exists and
GIT_TEST_REV_INDEX_DIE_IN_MEMORY is set, we do not die, implying that we
read from the on-disk one.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c          | 4 ++++
 pack-revindex.h          | 1 +
 t/t5325-reverse-index.sh | 8 ++++++++
 3 files changed, 13 insertions(+)

diff --git a/pack-revindex.c b/pack-revindex.c
index 1baaf2c42a..683d22e898 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -2,6 +2,7 @@
 #include "pack-revindex.h"
 #include "object-store.h"
 #include "packfile.h"
+#include "config.h"
 
 struct revindex_entry {
 	off_t offset;
@@ -166,6 +167,9 @@ static void create_pack_revindex(struct packed_git *p)
 
 static int load_pack_revindex_from_memory(struct packed_git *p)
 {
+	if (git_env_bool(GIT_TEST_REV_INDEX_DIE_IN_MEMORY, 0))
+		die("dying as requested by '%s'",
+		    GIT_TEST_REV_INDEX_DIE_IN_MEMORY);
 	if (open_pack_index(p))
 		return -1;
 	create_pack_revindex(p);
diff --git a/pack-revindex.h b/pack-revindex.h
index d2d466e298..e271da871a 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -2,6 +2,7 @@
 #define PACK_REVINDEX_H
 
 #define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
+#define GIT_TEST_REV_INDEX_DIE_IN_MEMORY "GIT_TEST_REV_INDEX_DIE_IN_MEMORY"
 
 struct packed_git;
 
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index 5a59cc71e4..9d4eecccc9 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -83,4 +83,12 @@ test_expect_success 'pack-objects respects pack.writeReverseIndex' '
 	test_path_is_file pack-1-*.rev
 '
 
+test_expect_success 'reverse index is not generated when available on disk' '
+	git index-pack --rev-index $packdir/pack-$pack.pack &&
+
+	git rev-parse HEAD >tip &&
+	GIT_TEST_REV_INDEX_DIE_IN_MEMORY=1 git cat-file \
+		--batch-check="%(objectsize:disk)" <tip
+'
+
 test_done
-- 
2.30.0.138.g6d7191ea01
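
The precedence rule described in this patch's commit message can be
sketched as follows. Everything here is a stand-in: struct pack_sketch,
the enum, and load_revindex_sketch() are hypothetical names, and where the
real code calls die() the sketch returns a distinct value so that the
outcome can be inspected.

```c
/* Where the reverse index came from, or whether loading was aborted. */
enum revindex_source {
	REVINDEX_FROM_DISK,
	REVINDEX_FROM_MEMORY,
	REVINDEX_DIED
};

/* Hypothetical stand-in for the parts of struct packed_git that matter. */
struct pack_sketch {
	int has_rev_file; /* non-zero if a usable .rev file exists */
};

/* Returns 0 on success, like the loaders in pack-revindex.c. */
int load_from_disk_sketch(const struct pack_sketch *p)
{
	return p->has_rev_file ? 0 : -1;
}

/*
 * Try the on-disk reverse index first. Only if that fails do we reach the
 * in-memory path, which is where the GIT_TEST_REV_INDEX_DIE_IN_MEMORY
 * check lives.
 */
enum revindex_source load_revindex_sketch(const struct pack_sketch *p,
					  int die_in_memory)
{
	if (!load_from_disk_sketch(p))
		return REVINDEX_FROM_DISK;
	if (die_in_memory)
		return REVINDEX_DIED; /* the real code calls die() here */
	return REVINDEX_FROM_MEMORY;
}
```

The new test relies on the first branch: with a .rev file present, setting
the variable has no effect because the in-memory path is never entered.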


* Re: [PATCH 7/8] t: support GIT_TEST_WRITE_REV_INDEX
  2021-01-08 18:20 ` [PATCH 7/8] t: support GIT_TEST_WRITE_REV_INDEX Taylor Blau
@ 2021-01-12 16:49   ` Derrick Stolee
  2021-01-12 17:34     ` Taylor Blau
  2021-01-12 17:18   ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 54+ messages in thread
From: Derrick Stolee @ 2021-01-12 16:49 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, jrnieder

On 1/8/2021 1:20 PM, Taylor Blau wrote:
> Add a new option that unconditionally enables the pack.writeReverseIndex
> setting in order to run the whole test suite in a mode that generates
> on-disk reverse indexes.
> 
> Once on-disk reverse indexes are proven out over several releases, we
> can change the default value of that configuration to 'true', and drop
> this patch.

...

> +GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
> +'pack.writeReverseIndex' setting.
> +

Should this also be added to the second run of the test
suite with optional variables?

diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index 6c27b886b8f..d1cbf330a14 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -22,6 +22,7 @@ linux-gcc)
 	export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1
 	export GIT_TEST_MULTI_PACK_INDEX=1
 	export GIT_TEST_ADD_I_USE_BUILTIN=1
+	export GIT_TEST_WRITE_REV_INDEX=1
 	make test
 	;;
 linux-clang)

Thanks,
-Stolee


* Re: [PATCH 6/8] t: prepare for GIT_TEST_WRITE_REV_INDEX
  2021-01-08 18:20 ` [PATCH 6/8] t: prepare for GIT_TEST_WRITE_REV_INDEX Taylor Blau
@ 2021-01-12 17:11   ` Ævar Arnfjörð Bjarmason
  2021-01-12 18:40     ` Taylor Blau
  0 siblings, 1 reply; 54+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-01-12 17:11 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, jrnieder


On Fri, Jan 08 2021, Taylor Blau wrote:

> For now, sprinkle these locations with a 'grep -v "\.rev$"' to ignore
> them. Once the pack.writeReverseIndex option has been thoroughly
> tested, we will default it to 'true', removing GIT_TEST_WRITE_REV_INDEX,
> and making it possible to revert this patch.

Maybe some of it we can change/revert, but some of it just seems to be
test warts we can fix:

> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index 297de502a9..9696f88c2f 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -710,7 +710,7 @@ test_expect_success 'expire respects .keep files' '
>  		PACKA=$(ls .git/objects/pack/a-pack*\.pack | sed s/\.pack\$//) &&
>  		touch $PACKA.keep &&
>  		git multi-pack-index expire &&
> -		ls -S .git/objects/pack/a-pack* | grep $PACKA >a-pack-files &&
> +		ls -S .git/objects/pack/a-pack* | grep $PACKA | grep -v "\.rev$" >a-pack-files &&

This seems to be testing "a *.keep file made my pack not expire". Can't
it just check for *.{pack,idx,keep} or even just *.pack?

>  		test_line_count = 3 a-pack-files &&
>  		test-tool read-midx .git/objects | grep idx >midx-list &&
>  		test_line_count = 2 midx-list
> diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
> index 431699ade2..5a59cc71e4 100755
> --- a/t/t5325-reverse-index.sh
> +++ b/t/t5325-reverse-index.sh
> @@ -3,6 +3,10 @@
>  test_description='on-disk reverse index'
>  . ./test-lib.sh
>  
> +# The below tests want control over the 'pack.writeReverseIndex' setting
> +# themselves to assert various combinations of it with other options.
> +sane_unset GIT_TEST_WRITE_REV_INDEX
> +
>  packdir=.git/objects/pack
>  
>  test_expect_success 'setup' '
> diff --git a/t/t5604-clone-reference.sh b/t/t5604-clone-reference.sh
> index 2f7be23044..d064426d03 100755
> --- a/t/t5604-clone-reference.sh
> +++ b/t/t5604-clone-reference.sh
> @@ -318,7 +318,7 @@ test_expect_success SYMLINKS 'clone repo with symlinked or unknown files at obje
>  		test_cmp T.objects T$option.objects &&
>  		(
>  			cd T$option/.git/objects &&
> -			find . -type f | sort >../../../T$option.objects-files.raw &&
> +			find . -type f | grep -v \.rev$ | sort >../../../T$option.objects-files.raw &&
>  			find . -type l | sort >../../../T$option.objects-symlinks.raw

There's an existing loop just below that where we use sed to filter out
/commit-graph/, /multi-pack-index/ etc., which other test modes add to
the objects directory. Seems like this belongs there, not in the find
above it.

>  		)
>  	done &&
> diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
> index 7d5b17909b..9ebf045739 100755
> --- a/t/t5702-protocol-v2.sh
> +++ b/t/t5702-protocol-v2.sh
> @@ -848,7 +848,7 @@ test_expect_success 'part of packfile response provided as URI' '
>  	test -f h2found &&
>  
>  	# Ensure that there are exactly 6 files (3 .pack and 3 .idx).
> -	ls http_child/.git/objects/pack/* >filelist &&
> +	ls http_child/.git/objects/pack/* | grep -v \.rev$ >filelist &&
>  	test_line_count = 6 filelist
>  '

Maybe just check *.{pack,idx,keep}. I was looking at that code the other
day and it's really just being overly specific. It really just cares
about the *.pack files.

> @@ -902,7 +902,7 @@ test_expect_success 'packfile-uri with transfer.fsckobjects' '
>  		clone "$HTTPD_URL/smart/http_parent" http_child &&
>  
>  	# Ensure that there are exactly 4 files (2 .pack and 2 .idx).
> -	ls http_child/.git/objects/pack/* >filelist &&
> +	ls http_child/.git/objects/pack/* | grep -v \.rev$ >filelist &&

ditto.

>  	test_line_count = 4 filelist
>  '
>  
> diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
> index 4a3b8f48ac..d52f92f5a1 100755
> --- a/t/t6500-gc.sh
> +++ b/t/t6500-gc.sh
> @@ -106,13 +106,13 @@ test_expect_success 'auto gc with too many loose objects does not attempt to cre
>  	test_commit "$(test_oid obj2)" &&
>  	# Our first gc will create a pack; our second will create a second pack
>  	git gc --auto &&
> -	ls .git/objects/pack | sort >existing_packs &&
> +	ls .git/objects/pack | grep -v \.rev$ | sort >existing_packs &&
>  	test_commit "$(test_oid obj3)" &&
>  	test_commit "$(test_oid obj4)" &&
>  
>  	git gc --auto 2>err &&
>  	test_i18ngrep ! "^warning:" err &&
> -	ls .git/objects/pack/ | sort >post_packs &&
> +	ls .git/objects/pack/ | grep -v \.rev$ | sort >post_packs &&
>  	comm -1 -3 existing_packs post_packs >new &&
>  	comm -2 -3 existing_packs post_packs >del &&
>  	test_line_count = 0 del && # No packs are deleted

This is all part of the accounting where we later use comm/wc -l to check
how many new packs we have, so just check *.pack?

> diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
> index 308c1ef42c..100df52a71 100755
> --- a/t/t9300-fast-import.sh
> +++ b/t/t9300-fast-import.sh
> @@ -1629,7 +1629,7 @@ test_expect_success 'O: blank lines not necessary after other commands' '
>  	INPUT_END
>  
>  	git fast-import <input &&
> -	test 8 = $(find .git/objects/pack -type f | grep -v multi-pack-index | wc -l) &&
> +	test 8 = $(find .git/objects/pack -type f \( -name "*.idx" -o -name "*.pack" \) | wc -l) &&

Yay, there the existing multi-pack-index case is amended in a
future-proof way :)


* Re: [PATCH 7/8] t: support GIT_TEST_WRITE_REV_INDEX
  2021-01-08 18:20 ` [PATCH 7/8] t: support GIT_TEST_WRITE_REV_INDEX Taylor Blau
  2021-01-12 16:49   ` Derrick Stolee
@ 2021-01-12 17:18   ` Ævar Arnfjörð Bjarmason
  2021-01-12 17:39     ` Derrick Stolee
  1 sibling, 1 reply; 54+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-01-12 17:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, jrnieder


On Fri, Jan 08 2021, Taylor Blau wrote:

> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  builtin/index-pack.c   | 5 ++++-
>  builtin/pack-objects.c | 2 ++
>  pack-revindex.h        | 2 ++
>  t/README               | 3 +++
>  4 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index 03408250b1..0bde325a8b 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -1748,7 +1748,10 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
>  	if (prefix && chdir(prefix))
>  		die(_("Cannot come back to cwd"));
>  
> -	rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
> +	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
> +		rev_index = 1;
> +	else
> +		rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));

Why not make an explicit GIT_TEST_WRITE_REV_INDEX=false meaningful? It's
also sometimes handy to turn these off in the tests.

    rev_index = git_env_bool("GIT_TEST_WRITE_REV_INDEX",
    	!!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV)));

> +#define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"

Micro style nit: FWIW I'm not a fan of this macro->string indirection a
few GIT_TEST_* names have had since 859fdc0c3c (commit-graph: define
GIT_TEST_COMMIT_GRAPH, 2018-08-29).

Most of them just use git_env_bool("GIT_TEST_[...]") which IMO makes it
easier to eyeball a "git grep"

> +GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
> +'pack.writeReverseIndex' setting.
> +

Re the git_env_bool() default value comment above: I see our other
boolean GIT_TEST_* docs say "when true", but mostly they mean things
like:

    GIT_TEST_WRITE_REV_INDEX=<boolean>, when set, configures the
    'pack.writeReverseIndex' setting. Defaults to 'false'.


* Re: [PATCH 7/8] t: support GIT_TEST_WRITE_REV_INDEX
  2021-01-12 16:49   ` Derrick Stolee
@ 2021-01-12 17:34     ` Taylor Blau
  0 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-12 17:34 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, jrnieder

On Tue, Jan 12, 2021 at 11:49:40AM -0500, Derrick Stolee wrote:
> Should this also be added to the second run of the test
> suite with optional variables?

Ah, I wasn't aware of this script myself. Yes, it should be there,
thanks.

Thanks,
Taylor


* Re: [PATCH 7/8] t: support GIT_TEST_WRITE_REV_INDEX
  2021-01-12 17:18   ` Ævar Arnfjörð Bjarmason
@ 2021-01-12 17:39     ` Derrick Stolee
  2021-01-12 18:17       ` Taylor Blau
  0 siblings, 1 reply; 54+ messages in thread
From: Derrick Stolee @ 2021-01-12 17:39 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Taylor Blau; +Cc: git, peff, jrnieder

On 1/12/2021 12:18 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Jan 08 2021, Taylor Blau wrote:
>> -	rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
>> +	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
>> +		rev_index = 1;
>> +	else
>> +		rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
> 
> Why not make an explicit GIT_TEST_WRITE_REV_INDEX=false meaningful? It's
> also sometimes handy to turn these off in the tests.
> 
>     rev_index = git_env_bool("GIT_TEST_WRITE_REV_INDEX",
>     	!!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV)));
 
This will cause tests that explicitly request a rev-index to fail when
GIT_TEST_WRITE_REV_INDEX=false. I'm not sure that's a good pattern to
follow.

>> +#define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
> 
> Micro style nit: FWIW I'm not a fan of this macro->string indirection a
> few GIT_TEST_* names have had since 859fdc0c3c (commit-graph: define
> GIT_TEST_COMMIT_GRAPH, 2018-08-29).
> 
> Most of them just use git_env_bool("GIT_TEST_[...]") which IMO makes it
> easier to eyeball a "git grep"

In the case of GIT_TEST_COMMIT_GRAPH, there are multiple places that
check the environment variable, and it is probably best to have the
strings consistent through a macro.

For something like GIT_TEST_WRITE_REV_INDEX, this macro is less
important because it is checked only once.

Thanks,
-Stolee


* Re: [PATCH 7/8] t: support GIT_TEST_WRITE_REV_INDEX
  2021-01-12 17:39     ` Derrick Stolee
@ 2021-01-12 18:17       ` Taylor Blau
  0 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-12 18:17 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, git, peff, jrnieder

On Tue, Jan 12, 2021 at 12:39:37PM -0500, Derrick Stolee wrote:
> On 1/12/2021 12:18 PM, Ævar Arnfjörð Bjarmason wrote:
> >
> > On Fri, Jan 08 2021, Taylor Blau wrote:
> >> -	rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
> >> +	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
> >> +		rev_index = 1;
> >> +	else
> >> +		rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
> >
> > Why not make an explicit GIT_TEST_WRITE_REV_INDEX=false meaningful? It's
> > also sometimes handy to turn these off in the tests.
> >
> >     rev_index = git_env_bool("GIT_TEST_WRITE_REV_INDEX",
> >     	!!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV)));
>
> This will cause tests that explicitly request a rev-index to fail when
> GIT_TEST_WRITE_REV_INDEX=false. I'm not sure that's a good pattern to
> follow.

I agree that this wouldn't work; the second parameter is the default
value, not a substitute for "false".

It _would_ work currently, since there aren't any tests in this series
that explicitly

    export GIT_TEST_WRITE_REV_INDEX=false

t5325 does 'sane_unset GIT_TEST_WRITE_REV_INDEX', which accomplishes
the same thing without setting a value. If we instead used the export
form, this would break, so I think the implementation is more robust
as-is.
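
To make the difference between the two forms concrete, here is a minimal
sketch. Note that env_bool_sketch() is a hypothetical, simplified stand-in
for git_env_bool() (it only recognizes a handful of "false" spellings), and
the two helper names are made up for illustration:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for git_env_bool(): returns `def` when the
 * variable is unset, otherwise a simplified boolean parse of its value.
 * Git's real parser accepts more spellings than this sketch does.
 */
static int env_bool_sketch(const char *name, int def)
{
	const char *v = getenv(name);

	if (!v)
		return def;
	if (!*v || !strcmp(v, "0") || !strcmp(v, "false") ||
	    !strcmp(v, "no") || !strcmp(v, "off"))
		return 0;
	return 1;
}

/* Form used in the patch: the variable can only force the mode on. */
static int rev_index_as_posted(int requested)
{
	if (env_bool_sketch("GIT_TEST_WRITE_REV_INDEX", 0))
		return 1;
	return !!requested;
}

/* Suggested form: an explicit "false" overrides the caller's request. */
static int rev_index_as_suggested(int requested)
{
	return env_bool_sketch("GIT_TEST_WRITE_REV_INDEX", !!requested);
}
```

With the variable exported as "false", rev_index_as_posted(1) still honors
the explicit request, while rev_index_as_suggested(1) silently drops it.
With the variable unset, both forms behave identically.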

> >> +#define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
> >
> > Micro style nit: FWIW I'm not a fan of this macro->string indirection a
> > few GIT_TEST_* names have had since 859fdc0c3c (commit-graph: define
> > GIT_TEST_COMMIT_GRAPH, 2018-08-29).
> >
> > Most of them just use git_env_bool("GIT_TEST_[...]") which IMO makes it
> > easier to eyeball a "git grep"
>
> In the case of GIT_TEST_COMMIT_GRAPH, there are multiple places that
> check the environment variable, and it is probably best to have the
> strings consistent through a macro.
>
> For something like GIT_TEST_WRITE_REV_INDEX, this macro is less
> important because it is checked only once.

FWIW, I'm not a huge fan of the indirection either, but I think that its
use here is warranted, since it is checked twice (in index-pack, and
pack-objects).

It _could_ be checked lower in the call stack, but I tried this when
developing this series and it ended up being far messier than what is
presented here. The trickiness is with the extra verification mode
(which ensures that an existing '.rev' file was written correctly), and
having to indicate that at each caller.

> Thanks,
> -Stolee

Thanks,
Taylor


* Re: [PATCH 6/8] t: prepare for GIT_TEST_WRITE_REV_INDEX
  2021-01-12 17:11   ` Ævar Arnfjörð Bjarmason
@ 2021-01-12 18:40     ` Taylor Blau
  0 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-12 18:40 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, peff, jrnieder

Hi Ævar,

Your suggestions are quite helpful, and I'm glad to apply them,
especially if it means that this "clean up" patch can actually harden us
from similar changes in the future.

On Tue, Jan 12, 2021 at 06:11:00PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
> On Fri, Jan 08 2021, Taylor Blau wrote:
>
> > For now, sprinkle these locations with a 'grep -v "\.rev$"' to ignore
> > them. Once the pack.writeReverseIndex option has been thoroughly
> > tested, we will default it to 'true', removing GIT_TEST_WRITE_REV_INDEX,
> > and making it possible to revert this patch.
>
> Maybe some of it we can change/revert, but some of it just seems to be
> test warts we can fix:
>
> > diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> > index 297de502a9..9696f88c2f 100755
> > --- a/t/t5319-multi-pack-index.sh
> > +++ b/t/t5319-multi-pack-index.sh
> > @@ -710,7 +710,7 @@ test_expect_success 'expire respects .keep files' '
> >  		PACKA=$(ls .git/objects/pack/a-pack*\.pack | sed s/\.pack\$//) &&
> >  		touch $PACKA.keep &&
> >  		git multi-pack-index expire &&
> > -		ls -S .git/objects/pack/a-pack* | grep $PACKA >a-pack-files &&
> > +		ls -S .git/objects/pack/a-pack* | grep $PACKA | grep -v "\.rev$" >a-pack-files &&
>
> This seems to be testing "a *.keep file made my pack not expire". Can't
> it just check for *.{pack,idx,keep} or even just *.pack?

Yeah, and I think the simplest thing to do here is just check that these
files exist at all. Something like:

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 9696f88c2f..f5e50508c9 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -710,7 +710,9 @@ test_expect_success 'expire respects .keep files' '
 		PACKA=$(ls .git/objects/pack/a-pack*\.pack | sed s/\.pack\$//) &&
 		touch $PACKA.keep &&
 		git multi-pack-index expire &&
-		ls -S .git/objects/pack/a-pack* | grep $PACKA | grep -v "\.rev$" >a-pack-files &&
+		test_path_is_file $PACKA.idx &&
+		test_path_is_file $PACKA.keep &&
+		test_path_is_file $PACKA.pack &&
 		test_line_count = 3 a-pack-files &&
 		test-tool read-midx .git/objects | grep idx >midx-list &&
 		test_line_count = 2 midx-list

> > diff --git a/t/t5604-clone-reference.sh b/t/t5604-clone-reference.sh
> > index 2f7be23044..d064426d03 100755
> > --- a/t/t5604-clone-reference.sh
> > +++ b/t/t5604-clone-reference.sh
> > @@ -318,7 +318,7 @@ test_expect_success SYMLINKS 'clone repo with symlinked or unknown files at obje
> >  		test_cmp T.objects T$option.objects &&
> >  		(
> >  			cd T$option/.git/objects &&
> > -			find . -type f | sort >../../../T$option.objects-files.raw &&
> > +			find . -type f | grep -v \.rev$ | sort >../../../T$option.objects-files.raw &&
> >  			find . -type l | sort >../../../T$option.objects-symlinks.raw
>
> There's an existing loop just below that where we grep out
> /commit-graph/, /multi-pack-index/ etc. which other test modes add to
> the objects directory with sed. Seems like this belongs there, not in
> the find above it.

Much cleaner! Thank you.

> > diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
> > index 7d5b17909b..9ebf045739 100755
> > --- a/t/t5702-protocol-v2.sh
> > +++ b/t/t5702-protocol-v2.sh
> > @@ -848,7 +848,7 @@ test_expect_success 'part of packfile response provided as URI' '
> >  	test -f h2found &&
> >
> >  	# Ensure that there are exactly 6 files (3 .pack and 3 .idx).
> > -	ls http_child/.git/objects/pack/* >filelist &&
> > +	ls http_child/.git/objects/pack/* | grep -v \.rev$ >filelist &&
> >  	test_line_count = 6 filelist
> >  '
>
> Maybe just check *.{pack,idx,keep}. I was looking at that code the other
> day and it's really just being overly specific. It really just cares
> about the *.pack files.

I think here and in t6500 as well as t9300 the easiest thing to do is
just

diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index 9ebf045739..73cd9e3ff6 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -848,8 +848,10 @@ test_expect_success 'part of packfile response provided as URI' '
 	test -f h2found &&

 	# Ensure that there are exactly 6 files (3 .pack and 3 .idx).
-	ls http_child/.git/objects/pack/* | grep -v \.rev$ >filelist &&
-	test_line_count = 6 filelist
+	ls http_child/.git/objects/pack/*.pack >packlist &&
+	ls http_child/.git/objects/pack/*.idx >idxlist &&
+	test_line_count = 3 idxlist &&
+	test_line_count = 3 packlist
 '

> > diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
> > index 308c1ef42c..100df52a71 100755
> > --- a/t/t9300-fast-import.sh
> > +++ b/t/t9300-fast-import.sh
> > @@ -1629,7 +1629,7 @@ test_expect_success 'O: blank lines not necessary after other commands' '
> >  	INPUT_END
> >
> >  	git fast-import <input &&
> > -	test 8 = $(find .git/objects/pack -type f | grep -v multi-pack-index | wc -l) &&
> > +	test 8 = $(find .git/objects/pack -type f \( -name "*.idx" -o -name "*.pack" \) | wc -l) &&
>
> Yay, there the existing multi-pack-index case is amended in a
> future-proof way :)

Here's actually a spot where I'm unhappy with the resulting complexity,
and I think that it would be much cleaner if we did the same
packlist+idxlist thing and then checked the line count of each.

Thanks,
Taylor


* [PATCH v2 0/8] pack-revindex: introduce on-disk '.rev' format
  2021-01-08 18:19 [PATCH 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                   ` (7 preceding siblings ...)
  2021-01-08 18:20 ` [PATCH 8/8] pack-revindex: ensure that on-disk reverse indexes are given precedence Taylor Blau
@ 2021-01-13 22:28 ` Taylor Blau
  2021-01-13 22:28   ` [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files Taylor Blau
                     ` (7 more replies)
  2021-01-25 23:37 ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Taylor Blau
  9 siblings, 8 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-13 22:28 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Hi,

This is the second of two series to implement support for an on-disk format for
storing the reverse index. (Note that this depends on the patches in the previous
series [1], which was recently updated.)

This version is largely unchanged from the original, with the following
exceptions:

  - It has been rebased onto the patches in the first series.
  - The operands of two comparisons in 'offset_to_pack_pos()' were swapped so
    that the smaller of the two appears on the left-hand side of the comparison.
  - A brown-paper-bag bug was fixed in tests so that they pass on Windows (last
    night's integration broke 'seen' on Windows).
  - The GIT_TEST_WRITE_REV_INDEX mode was enabled in the "all-features" test.

Thanks in advance for your review.

[1]: https://lore.kernel.org/git/cover.1610129796.git.me@ttaylorr.com/

Taylor Blau (8):
  packfile: prepare for the existence of '*.rev' files
  pack-write.c: prepare to write 'pack-*.rev' files
  builtin/index-pack.c: write reverse indexes
  builtin/pack-objects.c: respect 'pack.writeReverseIndex'
  Documentation/config/pack.txt: advertise 'pack.writeReverseIndex'
  t: prepare for GIT_TEST_WRITE_REV_INDEX
  t: support GIT_TEST_WRITE_REV_INDEX
  pack-revindex: ensure that on-disk reverse indexes are given
    precedence

 Documentation/config/pack.txt           |   7 ++
 Documentation/git-index-pack.txt        |  20 ++--
 Documentation/technical/pack-format.txt |  17 ++++
 builtin/index-pack.c                    |  67 +++++++++++--
 builtin/pack-objects.c                  |   9 ++
 builtin/repack.c                        |   1 +
 ci/run-build-and-tests.sh               |   1 +
 object-store.h                          |   3 +
 pack-revindex.c                         | 116 ++++++++++++++++++++--
 pack-revindex.h                         |  10 +-
 pack-write.c                            | 123 +++++++++++++++++++++++-
 pack.h                                  |   4 +
 packfile.c                              |  13 ++-
 packfile.h                              |   1 +
 t/README                                |   3 +
 t/t5319-multi-pack-index.sh             |   5 +-
 t/t5325-reverse-index.sh                |  97 +++++++++++++++++++
 t/t5604-clone-reference.sh              |   2 +-
 t/t5702-protocol-v2.sh                  |  12 ++-
 t/t6500-gc.sh                           |   6 +-
 t/t9300-fast-import.sh                  |   5 +-
 tmp-objdir.c                            |   4 +-
 22 files changed, 485 insertions(+), 41 deletions(-)
 create mode 100755 t/t5325-reverse-index.sh

 -:  ---------- >  1:  e1aa89244a pack-revindex: introduce a new API
 -:  ---------- >  2:  0fca7d5812 write_reuse_object(): convert to new revindex API
 -:  ---------- >  3:  7676822a54 write_reused_pack_one(): convert to new revindex API
 -:  ---------- >  4:  dd7133fdb7 write_reused_pack_verbatim(): convert to new revindex API
 -:  ---------- >  5:  8e93ca3886 check_object(): convert to new revindex API
 -:  ---------- >  6:  084bbf2145 bitmap_position_packfile(): convert to new revindex API
 -:  ---------- >  7:  68794e9484 show_objects_for_type(): convert to new revindex API
 -:  ---------- >  8:  31ac6f5703 get_size_by_pos(): convert to new revindex API
 -:  ---------- >  9:  acd80069a2 try_partial_reuse(): convert to new revindex API
 -:  ---------- > 10:  569acdca7f rebuild_existing_bitmaps(): convert to new revindex API
 -:  ---------- > 11:  9881637724 get_delta_base_oid(): convert to new revindex API
 -:  ---------- > 12:  df8bb571a5 retry_bad_packed_offset(): convert to new revindex API
 -:  ---------- > 13:  41b2e00947 packed_object_info(): convert to new revindex API
 -:  ---------- > 14:  8ad49d231f unpack_entry(): convert to new revindex API
 -:  ---------- > 15:  e757476351 for_each_object_in_pack(): convert to new revindex API
 -:  ---------- > 16:  a500311e33 builtin/gc.c: guess the size of the revindex
 -:  ---------- > 17:  67d14da04a pack-revindex: remove unused 'find_pack_revindex()'
 -:  ---------- > 18:  3b5c92be68 pack-revindex: remove unused 'find_revindex_position()'
 -:  ---------- > 19:  cabafce4a1 pack-revindex: hide the definition of 'revindex_entry'
 -:  ---------- > 20:  8400ff6c96 pack-revindex.c: avoid direct revindex access in 'offset_to_pack_pos()'
 1:  ddf47a0a50 ! 21:  6742c15c84 packfile: prepare for the existence of '*.rev' files
    @@ pack-revindex.c: static void create_pack_revindex(struct packed_git *p)
     +
      int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
      {
    - 	int lo = 0;
    + 	unsigned lo, hi;
     @@ pack-revindex.c: int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)

      uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
    @@ pack-revindex.c: int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_
     -	if (!p->revindex)
     +	if (!(p->revindex || p->revindex_data))
      		BUG("pack_pos_to_index: reverse index not yet loaded");
    - 	if (pos >= p->num_objects)
    + 	if (p->num_objects <= pos)
      		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
     -	return p->revindex[pos].nr;
     +
    @@ pack-revindex.c: int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_
     -	if (!p->revindex)
     +	if (!(p->revindex || p->revindex_data))
      		BUG("pack_pos_to_index: reverse index not yet loaded");
    - 	if (pos > p->num_objects)
    + 	if (p->num_objects < pos)
      		BUG("pack_pos_to_offset: out-of-bounds object at %"PRIu32, pos);
     -	return p->revindex[pos].offset;
     +
    @@ pack-revindex.c: int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_
     +		return nth_packed_object_offset(p, pack_pos_to_index(p, pos));
      }

    + ## pack-revindex.h ##
    +@@ pack-revindex.h: struct packed_git;
    + /*
    +  * load_pack_revindex populates the revindex's internal data-structures for the
    +  * given pack, returning zero on success and a negative value otherwise.
    ++ *
    ++ * If a '.rev' file is present, it is checked for consistency, mmap'd, and
    ++ * pointers are assigned into it (instead of using the in-memory variant).
    +  */
    + int load_pack_revindex(struct packed_git *p);
    +
    +@@ pack-revindex.h: uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
    +  * If the reverse index has not yet been loaded, or the position is out of
    +  * bounds, this function aborts.
    +  *
    +- * This function runs in constant time.
    ++ * This function runs in constant time under both in-memory and on-disk reverse
    ++ * indexes, but an additional step is taken to consult the corresponding .idx
    ++ * file when using the on-disk format.
    +  */
    + off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos);
    +
    +
      ## packfile.c ##
     @@ packfile.c: void close_pack_index(struct packed_git *p)
      	}
 2:  88393e2662 = 22:  8648c87fa7 pack-write.c: prepare to write 'pack-*.rev' files
 3:  b0a7329824 ! 23:  5b18ada611 builtin/index-pack.c: write reverse indexes
    @@ t/t5325-reverse-index.sh (new)
     +	rm -f $rev &&
     +	conf=$1 &&
     +	shift &&
    ++	# remove the index since Windows won't overwrite an existing file
    ++	rm $packdir/pack-$pack.idx &&
     +	git -c pack.writeReverseIndex=$conf index-pack "$@" \
     +		$packdir/pack-$pack.pack
     +}
 4:  e297a31875 = 24:  68bde3ea97 builtin/pack-objects.c: respect 'pack.writeReverseIndex'
 5:  5d3e96a498 = 25:  38a253d0ce Documentation/config/pack.txt: advertise 'pack.writeReverseIndex'
 6:  2288571fbe <  -:  ---------- t: prepare for GIT_TEST_WRITE_REV_INDEX
 -:  ---------- > 26:  12cdf2d67a t: prepare for GIT_TEST_WRITE_REV_INDEX
 7:  3525c4d114 ! 27:  6b647d9775 t: support GIT_TEST_WRITE_REV_INDEX
    @@ Commit message

         Add a new option that unconditionally enables the pack.writeReverseIndex
         setting in order to run the whole test suite in a mode that generates
    -    on-disk reverse indexes.
    +    on-disk reverse indexes. Additionally, enable this mode in the second
    +    run of tests under linux-gcc in 'ci/run-build-and-tests.sh'.

         Once on-disk reverse indexes are proven out over several releases, we
         can change the default value of that configuration to 'true', and drop
    @@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const
      	progress = isatty(2);
      	argc = parse_options(argc, argv, prefix, pack_objects_options,

    + ## ci/run-build-and-tests.sh ##
    +@@ ci/run-build-and-tests.sh: linux-gcc)
    + 	export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1
    + 	export GIT_TEST_MULTI_PACK_INDEX=1
    + 	export GIT_TEST_ADD_I_USE_BUILTIN=1
    ++	export GIT_TEST_WRITE_REV_INDEX=1
    + 	make test
    + 	;;
    + linux-clang)
    +
      ## pack-revindex.h ##
     @@
    - #ifndef PACK_REVINDEX_H
    - #define PACK_REVINDEX_H
    +  *   can be found
    +  */

     +#define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
     +
      struct packed_git;

    - int load_pack_revindex(struct packed_git *p);
    + /*

      ## t/README ##
     @@ t/README: GIT_TEST_DEFAULT_HASH=<hash-algo> specifies which hash algorithm to
 8:  6e580d43d1 ! 28:  48926ae182 pack-revindex: ensure that on-disk reverse indexes are given precedence
    @@ pack-revindex.c: static void create_pack_revindex(struct packed_git *p)

      ## pack-revindex.h ##
     @@
    - #define PACK_REVINDEX_H
    +  */

      #define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
     +#define GIT_TEST_REV_INDEX_DIE_IN_MEMORY "GIT_TEST_REV_INDEX_DIE_IN_MEMORY"
    @@ t/t5325-reverse-index.sh: test_expect_success 'pack-objects respects pack.writeR
      '

     +test_expect_success 'reverse index is not generated when available on disk' '
    -+	git index-pack --rev-index $packdir/pack-$pack.pack &&
    ++	test_index_pack true &&
    ++	test_path_is_file $rev &&
     +
     +	git rev-parse HEAD >tip &&
     +	GIT_TEST_REV_INDEX_DIE_IN_MEMORY=1 git cat-file \
--
2.30.0.138.g6d7191ea01


* [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files
  2021-01-13 22:28 ` [PATCH v2 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
@ 2021-01-13 22:28   ` Taylor Blau
  2021-01-14  7:22     ` Junio C Hamano
                       ` (2 more replies)
  2021-01-13 22:28   ` [PATCH v2 2/8] pack-write.c: prepare to write 'pack-*.rev' files Taylor Blau
                     ` (6 subsequent siblings)
  7 siblings, 3 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-13 22:28 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Specify the format of the on-disk reverse index 'pack-*.rev' file, as
well as prepare the code for the existence of such files.

The reverse index maps from pack relative positions (i.e., an index into
the array of objects which is sorted by their offsets within the
packfile) to their position within the 'pack-*.idx' file. Today, this is
done by building up a list of (off_t, uint32_t) tuples for each object
(the off_t corresponding to that object's offset, and the uint32_t
corresponding to its position in the index). To convert between pack and
index position quickly, this array of tuples is radix sorted based on
its offset.

This has two major drawbacks:

First, the in-memory cost scales linearly with the number of objects in
a pack.  Each 'struct revindex_entry' is sizeof(off_t) +
sizeof(uint32_t) + padding bytes for a total of 16.

To observe this, force Git to load the reverse index by, e.g.,
running 'git cat-file --batch-check="%(objectsize:disk)"'. When asking
for a single object in a fresh clone of the kernel, Git needs to
allocate 120+ MB of memory in order to hold the reverse index in memory.
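
As a rough illustration of the arithmetic, the sketch below mirrors the
entry layout with a stand-in struct. The names are hypothetical, and
int64_t is used in place of off_t on the assumption that off_t is 64 bits
wide, which is platform-dependent:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative mirror of the in-memory entry. int64_t stands in for
 * off_t (an assumption; the real width depends on the platform).
 */
struct revindex_entry_sketch {
	int64_t offset;	/* object's byte offset within the packfile */
	uint32_t nr;	/* object's position in the .idx file */
	/* on common 64-bit ABIs, 4 bytes of tail padding follow */
};

/* Bytes needed to hold one in-memory entry per object. */
static size_t in_memory_revindex_bytes(size_t num_objects)
{
	return num_objects * sizeof(struct revindex_entry_sketch);
}

/* Bytes needed for a bare table of 4-byte index positions. */
static size_t position_table_bytes(size_t num_objects)
{
	return num_objects * sizeof(uint32_t);
}
```

At 16 bytes per entry, 120 MB corresponds to roughly 7.5 million entries; a
table holding only the 4-byte index positions would need a quarter of that.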

Second, the cost to sort also scales with the size of the pack.
Luckily, this is a linear function since 'load_pack_revindex()' uses a
radix sort, but this cost still must be paid once per pack per process.

As an example, it takes ~60x longer to print the _size_ of an object as
it does to print that entire object's _contents_:

  Benchmark #1: git.compile cat-file --batch <obj
    Time (mean ± σ):       3.4 ms ±   0.1 ms    [User: 3.3 ms, System: 2.1 ms]
    Range (min … max):     3.2 ms …   3.7 ms    726 runs

  Benchmark #2: git.compile cat-file --batch-check="%(objectsize:disk)" <obj
    Time (mean ± σ):     210.3 ms ±   8.9 ms    [User: 188.2 ms, System: 23.2 ms]
    Range (min … max):   193.7 ms … 224.4 ms    13 runs

Instead, avoid computing and sorting the revindex once per process by
writing it to a file when the pack itself is generated.

The format is relatively straightforward. It contains an array of
uint32_t's, the length of which is equal to the number of objects in the
pack.  The ith entry in this table contains the index position of the
ith object in the pack, where "ith object in the pack" is determined by
pack offset.
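
A minimal sketch of how such a table could be computed from the offsets
recorded in the index follows. It is not the writer from this series: the
function names are hypothetical, and qsort() is used for brevity where the
in-memory code uses a radix sort:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Offsets consulted by the comparator below. A file-scope pointer keeps
 * the sketch short, but it also makes the sketch non-reentrant.
 */
static const uint64_t *sort_offsets;

static int cmp_by_offset(const void *a, const void *b)
{
	uint64_t oa = sort_offsets[*(const uint32_t *)a];
	uint64_t ob = sort_offsets[*(const uint32_t *)b];

	return (oa > ob) - (oa < ob);
}

/*
 * Fill rev[0..n) so that rev[i] is the index position of the object
 * with the i-th smallest pack offset. offsets[k] is the pack offset of
 * the object at index position k.
 */
static void build_rev_table(const uint64_t *offsets, uint32_t n,
			    uint32_t *rev)
{
	uint32_t i;

	for (i = 0; i < n; i++)
		rev[i] = i;
	sort_offsets = offsets;
	qsort(rev, n, sizeof(*rev), cmp_by_offset);
}
```

For example, if the objects at index positions 0..3 sit at pack offsets
300, 12, 150, and 40, the resulting table is {1, 3, 2, 0}.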

One thing that the on-disk format does _not_ contain is the full (up to)
eight-byte offset corresponding to each object. This is something that
the in-memory revindex contains (it stores an off_t in 'struct
revindex_entry' along with the same uint32_t that the on-disk format
has). Omit it in the on-disk format, since knowing the index position
for some object is sufficient to get a constant-time lookup in the
pack-*.idx file to ask for an object's offset within the pack.

This trades off between the on-disk size of the 'pack-*.rev' file for
runtime to chase down the offset for some object. Even though the lookup
is constant time, the constant is heavier, since it can potentially
involve two pointer walks in v2 indexes (one to access the 4-byte offset
table, and potentially a second to access the double wide offset table).
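
The chain of lookups can be sketched as below. This is an illustration
only: 'struct tables_sketch' and the function names are hypothetical, and
the three tables are passed as bare byte arrays rather than located inside
mmap'd files. The high-bit convention for large offsets is the one used by
v2 indexes:

```c
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 32-bit value one byte at a time. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a big-endian 64-bit value as two 32-bit halves. */
static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/* Hypothetical, flattened view of the tables involved. */
struct tables_sketch {
	const unsigned char *rev;	/* 4 bytes per object, pack order */
	const unsigned char *ofs32;	/* 4 bytes per object, index order */
	const unsigned char *ofs64;	/* 8 bytes per large offset */
};

static uint64_t pos_to_offset_sketch(const struct tables_sketch *t,
				     uint32_t pos)
{
	/* reverse index: pack position -> index position */
	uint32_t idx_pos = read_be32(t->rev + 4 * (size_t)pos);
	/* first walk: the 4-byte offset table */
	uint32_t ofs = read_be32(t->ofs32 + 4 * (size_t)idx_pos);

	if (!(ofs & 0x80000000))
		return ofs;
	/* second walk: the low 31 bits index the 8-byte table */
	return read_be64(t->ofs64 + 8 * (size_t)(ofs & 0x7fffffff));
}
```

Every step is a direct array access, so the lookup stays constant time, but
each access may touch a different page of a different file.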

Consider trying to map an object's pack offset to a relative position
within that pack. In a cold-cache scenario, more page faults occur while
switching between binary searching through the reverse index and
searching through the *.idx file for an object's offset. Sure enough,
with a cold cache (writing '3' into '/proc/sys/vm/drop_caches' after
'sync'ing), printing out the entire object's contents is still
marginally faster than printing its size:

  Benchmark #1: git.compile cat-file --batch-check="%(objectsize:disk)" <obj >/dev/null
    Time (mean ± σ):      22.6 ms ±   0.5 ms    [User: 2.4 ms, System: 7.9 ms]
    Range (min … max):    21.4 ms …  23.5 ms    41 runs

  Benchmark #2: git.compile cat-file --batch <obj >/dev/null
    Time (mean ± σ):      17.2 ms ±   0.7 ms    [User: 2.8 ms, System: 5.5 ms]
    Range (min … max):    15.6 ms …  18.2 ms    45 runs

(Numbers taken in the kernel after cheating and using the next patch to
generate a reverse index). There are a couple of approaches to improve
cold cache performance not pursued here:

  - We could include the object offsets in the reverse index format.
    Predictably, this does result in fewer page faults, but it triples
    the size of the file, while simultaneously duplicating a ton of data
    already available in the .idx file. (This was the original way I
    implemented the format, and it did show
    `--batch-check='%(objectsize:disk)'` winning out against `--batch`.)

    On the other hand, this increase in size also results in a large
    block-cache footprint, which could potentially hurt other workloads.

  - We could store the mapping from pack to index position in a more
    cache-friendly way, like constructing a binary search tree from the
    table and writing the values in breadth-first order. This would
    result in much better locality, but the price you pay is trading
    O(1) lookup in 'pack_pos_to_index()' for an O(log n) one (since you
    can no longer directly index the table).
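
For reference, the breadth-first layout mentioned in the second item could
look roughly like the sketch below. Nothing like this is implemented in
this series; the function names are hypothetical and the array is 1-based,
with slot 0 unused:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy sorted[0..n) into bfs[1..n] by walking, in order, the implicit
 * tree in which node k has children 2k and 2k+1. Returns the number of
 * values consumed so far.
 */
static size_t fill_bfs(const uint64_t *sorted, uint64_t *bfs,
		       size_t n, size_t next, size_t k)
{
	if (k <= n) {
		next = fill_bfs(sorted, bfs, n, next, 2 * k);
		bfs[k] = sorted[next++];
		next = fill_bfs(sorted, bfs, n, next, 2 * k + 1);
	}
	return next;
}

/* Return the 1-based slot holding `key`, or 0 if it is absent. */
static size_t bfs_search(const uint64_t *bfs, size_t n, uint64_t key)
{
	size_t k = 1;

	while (k <= n) {
		if (bfs[k] == key)
			return k;
		/* descend right if the key is larger, left otherwise */
		k = 2 * k + (bfs[k] < key);
	}
	return 0;
}
```

The first few probes of every search land in the same few slots at the
front of the array, which is where the locality comes from. The slots no
longer follow pack order, though, so the entry for pack position 'i' can no
longer be read directly from slot 'i'.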

So, neither of these approaches is taken here. (Thankfully, the format
is versioned, so we are free to pursue these in the future.) But, cold
cache performance likely isn't interesting outside of one-off cases like
asking for the size of an object directly. In real-world usage, Git is
often performing many operations in the revindex.

The trade-off is worth it: we will avoid the vast majority of the cost
of generating the revindex, so the extra pointer chase will look like
noise in the following patch's benchmarks.

This patch describes the format and prepares callers (like in
pack-revindex.c) to be able to read *.rev files once they exist. An
implementation of the writer will appear in the next patch, and callers
will gradually begin using the writer in the patches that follow after
that.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/pack-format.txt |  17 ++++
 builtin/repack.c                        |   1 +
 object-store.h                          |   3 +
 pack-revindex.c                         | 112 +++++++++++++++++++++---
 pack-revindex.h                         |   7 +-
 packfile.c                              |  13 ++-
 packfile.h                              |   1 +
 tmp-objdir.c                            |   4 +-
 8 files changed, 145 insertions(+), 13 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index f96b2e605f..9593f8bc68 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -259,6 +259,23 @@ Pack file entry: <+
 
     Index checksum of all of the above.
 
+== pack-*.rev files have the format:
+
+  - A 4-byte magic number '0x52494458' ('RIDX').
+
+  - A 4-byte version identifier (= 1)
+
+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256)
+
+  - A table of index positions, sorted by their corresponding offsets in the
+    packfile.
+
+  - A trailer, containing a:
+
+    checksum of the corresponding packfile, and
+
+    a checksum of all of the above.
+
 == multi-pack-index (MIDX) files have the following format:
 
 The multi-pack-index files refer to multiple pack-files and loose objects.
diff --git a/builtin/repack.c b/builtin/repack.c
index 279be11a16..8d643ddcb9 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -208,6 +208,7 @@ static struct {
 } exts[] = {
 	{".pack"},
 	{".idx"},
+	{".rev", 1},
 	{".bitmap", 1},
 	{".promisor", 1},
 };
diff --git a/object-store.h b/object-store.h
index c4fc9dd74e..3fbf11280f 100644
--- a/object-store.h
+++ b/object-store.h
@@ -85,6 +85,9 @@ struct packed_git {
 		 multi_pack_index:1;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	struct revindex_entry *revindex;
+	const void *revindex_data;
+	const void *revindex_map;
+	size_t revindex_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
 	char pack_name[FLEX_ARRAY]; /* more */
 };
diff --git a/pack-revindex.c b/pack-revindex.c
index 5e69bc7372..369812dd21 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -164,16 +164,98 @@ static void create_pack_revindex(struct packed_git *p)
 	sort_revindex(p->revindex, num_ent, p->pack_size);
 }
 
-int load_pack_revindex(struct packed_git *p)
+static int load_pack_revindex_from_memory(struct packed_git *p)
 {
-	if (!p->revindex) {
-		if (open_pack_index(p))
-			return -1;
-		create_pack_revindex(p);
-	}
+	if (open_pack_index(p))
+		return -1;
+	create_pack_revindex(p);
 	return 0;
 }
 
+static char *pack_revindex_filename(struct packed_git *p)
+{
+	size_t len;
+	if (!strip_suffix(p->pack_name, ".pack", &len))
+		BUG("pack_name does not end in .pack");
+	return xstrfmt("%.*s.rev", (int)len, p->pack_name);
+}
+
+#define RIDX_MIN_SIZE (12 + (2 * the_hash_algo->rawsz))
+
+static int load_revindex_from_disk(char *revindex_name,
+				   uint32_t num_objects,
+				   const void **data, size_t *len)
+{
+	int fd, ret = 0;
+	struct stat st;
+	size_t revindex_size;
+
+	fd = git_open(revindex_name);
+
+	if (fd < 0) {
+		ret = -1;
+		goto cleanup;
+	}
+	if (fstat(fd, &st)) {
+		ret = error_errno(_("failed to read %s"), revindex_name);
+		goto cleanup;
+	}
+
+	revindex_size = xsize_t(st.st_size);
+
+	if (revindex_size < RIDX_MIN_SIZE) {
+		ret = error(_("reverse-index file %s is too small"), revindex_name);
+		goto cleanup;
+	}
+
+	if (revindex_size - RIDX_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
+		ret = error(_("reverse-index file %s is corrupt"), revindex_name);
+		goto cleanup;
+	}
+
+	*len = revindex_size;
+	*data = xmmap(NULL, revindex_size, PROT_READ, MAP_PRIVATE, fd, 0);
+
+cleanup:
+	close(fd);
+	return ret;
+}
+
+static int load_pack_revindex_from_disk(struct packed_git *p)
+{
+	char *revindex_name;
+	int ret;
+	if (open_pack_index(p))
+		return -1;
+
+	revindex_name = pack_revindex_filename(p);
+
+	ret = load_revindex_from_disk(revindex_name,
+				      p->num_objects,
+				      &p->revindex_map,
+				      &p->revindex_size);
+	if (ret)
+		goto cleanup;
+
+	p->revindex_data = (char *)p->revindex_map + 12;
+
+cleanup:
+	free(revindex_name);
+	return ret;
+}
+
+int load_pack_revindex(struct packed_git *p)
+{
+	if (p->revindex || p->revindex_data)
+		return 0;
+
+	if (!load_pack_revindex_from_disk(p))
+		return 0;
+	else if (!load_pack_revindex_from_memory(p))
+		return 0;
+	return -1;
+}
+
 int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
 {
 	unsigned lo, hi;
@@ -203,18 +285,28 @@ int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
 
 uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
 {
-	if (!p->revindex)
+	if (!(p->revindex || p->revindex_data))
 		BUG("pack_pos_to_index: reverse index not yet loaded");
 	if (p->num_objects <= pos)
 		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
-	return p->revindex[pos].nr;
+
+	if (p->revindex)
+		return p->revindex[pos].nr;
+	else
+		return get_be32((char *)p->revindex_data + (pos * sizeof(uint32_t)));
 }
 
 off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos)
 {
-	if (!p->revindex)
+	if (!(p->revindex || p->revindex_data))
 		BUG("pack_pos_to_index: reverse index not yet loaded");
 	if (p->num_objects < pos)
 		BUG("pack_pos_to_offset: out-of-bounds object at %"PRIu32, pos);
-	return p->revindex[pos].offset;
+
+	if (p->revindex)
+		return p->revindex[pos].offset;
+	else if (pos == p->num_objects)
+		return p->pack_size - the_hash_algo->rawsz;
+	else
+		return nth_packed_object_offset(p, pack_pos_to_index(p, pos));
 }
diff --git a/pack-revindex.h b/pack-revindex.h
index 6e0320b08b..01622cf21a 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -21,6 +21,9 @@ struct packed_git;
 /*
  * load_pack_revindex populates the revindex's internal data-structures for the
  * given pack, returning zero on success and a negative value otherwise.
+ *
+ * If a '.rev' file is present, it is checked for consistency, mmap'd, and
+ * pointers are assigned into it (instead of using the in-memory variant).
  */
 int load_pack_revindex(struct packed_git *p);
 
@@ -55,7 +58,9 @@ uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
  * If the reverse index has not yet been loaded, or the position is out of
  * bounds, this function aborts.
  *
- * This function runs in constant time.
+ * This function runs in constant time under both in-memory and on-disk reverse
+ * indexes, but an additional step is taken to consult the corresponding .idx
+ * file when using the on-disk format.
  */
 off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos);
 
diff --git a/packfile.c b/packfile.c
index 7bb1750934..b04eac9286 100644
--- a/packfile.c
+++ b/packfile.c
@@ -324,11 +324,21 @@ void close_pack_index(struct packed_git *p)
 	}
 }
 
+void close_pack_revindex(struct packed_git *p) {
+	if (!p->revindex_map)
+		return;
+
+	munmap((void *)p->revindex_map, p->revindex_size);
+	p->revindex_map = NULL;
+	p->revindex_data = NULL;
+}
+
 void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
 	close_pack_index(p);
+	close_pack_revindex(p);
 }
 
 void close_object_store(struct raw_object_store *o)
@@ -351,7 +361,7 @@ void close_object_store(struct raw_object_store *o)
 
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -853,6 +863,7 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	if (!strcmp(file_name, "multi-pack-index"))
 		return;
 	if (ends_with(file_name, ".idx") ||
+	    ends_with(file_name, ".rev") ||
 	    ends_with(file_name, ".pack") ||
 	    ends_with(file_name, ".bitmap") ||
 	    ends_with(file_name, ".keep") ||
diff --git a/packfile.h b/packfile.h
index a58fc738e0..4cfec9e8d3 100644
--- a/packfile.h
+++ b/packfile.h
@@ -90,6 +90,7 @@ uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
 
 unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
 void close_pack_windows(struct packed_git *);
+void close_pack_revindex(struct packed_git *);
 void close_pack(struct packed_git *);
 void close_object_store(struct raw_object_store *o);
 void unuse_pack(struct pack_window **);
diff --git a/tmp-objdir.c b/tmp-objdir.c
index 42ed4db5d3..da414df14f 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -187,7 +187,9 @@ static int pack_copy_priority(const char *name)
 		return 2;
 	if (ends_with(name, ".idx"))
 		return 3;
-	return 4;
+	if (ends_with(name, ".rev"))
+		return 4;
+	return 5;
 }
 
 static int pack_copy_cmp(const char *a, const char *b)
-- 
2.30.0.138.g6d7191ea01
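
As an illustration of the on-disk branch of pack_pos_to_index() in the
patch above, here is a minimal, self-contained sketch. It assumes the
12-byte header described in the cover letter sits at the start of the
mapping; the names 'rev_pos_to_index' and 'RIDX_HEADER_SIZE' are
hypothetical and are not part of this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4-byte magic + 4-byte version + 4-byte hash id. */
#define RIDX_HEADER_SIZE 12

/* Decode a 32-bit big-endian (network order) integer. */
static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Return the .idx position of the object stored at pack position 'pos',
 * reading straight from the mapped .rev contents in 'rev_map'.
 */
uint32_t rev_pos_to_index(const unsigned char *rev_map,
			  uint32_t num_objects, uint32_t pos)
{
	const unsigned char *table = rev_map + RIDX_HEADER_SIZE;

	assert(pos < num_objects);
	return get_be32(table + (size_t)pos * sizeof(uint32_t));
}
```

Each lookup is a single four-byte read at a computed address, which is
why the on-disk variant stays constant-time.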



* [PATCH v2 2/8] pack-write.c: prepare to write 'pack-*.rev' files
  2021-01-13 22:28 ` [PATCH v2 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
  2021-01-13 22:28   ` [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files Taylor Blau
@ 2021-01-13 22:28   ` Taylor Blau
  2021-01-22 23:24     ` Jeff King
  2021-01-13 22:28   ` [PATCH v2 3/8] builtin/index-pack.c: write reverse indexes Taylor Blau
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2021-01-13 22:28 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

This patch prepares for callers to be able to write reverse index files
to disk.

It adds the necessary machinery to write a format-compliant .rev file
from within 'write_rev_file()', which is called from
'finish_tmp_packfile()'.
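
For readers who want to see the layout without the hashfile API, the
following is a rough standalone sketch of the header and position table
that write_rev_file() emits. It writes into a caller-supplied buffer,
uses plain qsort() with a file-scope context pointer instead of
QSORT_S(), hardcodes the SHA-1 hash identifier, and omits the trailer.
'struct entry' and 'build_rev_body' are hypothetical names used only
for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for git's 'struct pack_idx_entry'. */
struct entry {
	uint64_t offset; /* byte offset of the object within the packfile */
};

/* plain qsort() takes no context argument, so use a file-scope pointer
 * (not thread-safe; fine for a sketch) */
static const struct entry *sort_ctx;

static int pack_order_cmp(const void *va, const void *vb)
{
	uint64_t oa = sort_ctx[*(const uint32_t *)va].offset;
	uint64_t ob = sort_ctx[*(const uint32_t *)vb].offset;

	return (oa > ob) - (oa < ob);
}

static unsigned char *put_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
	return out + 4;
}

/*
 * Fill 'out' (at least 12 + 4 * nr bytes) with the header followed by the
 * index positions sorted by pack offset. Returns the number of bytes
 * written, or 0 if the scratch array could not be allocated.
 */
size_t build_rev_body(unsigned char *out, const struct entry *objects,
		      uint32_t nr)
{
	unsigned char *p = out;
	uint32_t *pack_order;
	uint32_t i;

	assert(out != NULL);
	pack_order = malloc(nr ? nr * sizeof(*pack_order) : 1);
	if (!pack_order)
		return 0;

	for (i = 0; i < nr; i++)
		pack_order[i] = i;
	sort_ctx = objects;
	qsort(pack_order, nr, sizeof(*pack_order), pack_order_cmp);

	p = put_be32(p, 0x52494458); /* "RIDX" */
	p = put_be32(p, 1);          /* format version */
	p = put_be32(p, 1);          /* hash id: 1 = SHA-1 */
	for (i = 0; i < nr; i++)
		p = put_be32(p, pack_order[i]);

	free(pack_order);
	return (size_t)(p - out);
}
```

With three objects at offsets 300, 12 and 150, the table holds the
index positions 1, 2, 0, that is, the objects in the order they appear
in the pack.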

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-write.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 pack.h       |   4 ++
 2 files changed, 126 insertions(+), 1 deletion(-)

diff --git a/pack-write.c b/pack-write.c
index 3513665e1e..68db5a9edf 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -166,6 +166,116 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
 	return index_name;
 }
 
+static int pack_order_cmp(const void *va, const void *vb, void *ctx)
+{
+	struct pack_idx_entry **objects = ctx;
+
+	off_t oa = objects[*(uint32_t*)va]->offset;
+	off_t ob = objects[*(uint32_t*)vb]->offset;
+
+	if (oa < ob)
+		return -1;
+	if (oa > ob)
+		return 1;
+	return 0;
+}
+
+#define RIDX_SIGNATURE 0x52494458 /* "RIDX" */
+#define RIDX_VERSION 1
+
+static void write_rev_header(struct hashfile *f)
+{
+	uint32_t oid_version;
+	switch (hash_algo_by_ptr(the_hash_algo)) {
+	case GIT_HASH_SHA1:
+		oid_version = 1;
+		break;
+	case GIT_HASH_SHA256:
+		oid_version = 2;
+		break;
+	default:
+		die("write_rev_header: unknown hash version");
+	}
+
+	hashwrite_be32(f, RIDX_SIGNATURE);
+	hashwrite_be32(f, RIDX_VERSION);
+	hashwrite_be32(f, oid_version);
+}
+
+static void write_rev_index_positions(struct hashfile *f,
+				      struct pack_idx_entry **objects,
+				      uint32_t nr_objects)
+{
+	uint32_t *pack_order;
+	uint32_t i;
+
+	ALLOC_ARRAY(pack_order, nr_objects);
+	for (i = 0; i < nr_objects; i++)
+		pack_order[i] = i;
+	QSORT_S(pack_order, nr_objects, pack_order_cmp, objects);
+
+	for (i = 0; i < nr_objects; i++)
+		hashwrite_be32(f, pack_order[i]);
+
+	free(pack_order);
+}
+
+static void write_rev_trailer(struct hashfile *f, const unsigned char *hash)
+{
+	hashwrite(f, hash, the_hash_algo->rawsz);
+}
+
+const char *write_rev_file(const char *rev_name,
+			   struct pack_idx_entry **objects,
+			   uint32_t nr_objects,
+			   const unsigned char *hash,
+			   unsigned flags)
+{
+	struct hashfile *f;
+	int fd;
+
+	if ((flags & WRITE_REV) && (flags & WRITE_REV_VERIFY))
+		die(_("cannot both write and verify reverse index"));
+
+	if (flags & WRITE_REV) {
+		if (!rev_name) {
+			struct strbuf tmp_file = STRBUF_INIT;
+			fd = odb_mkstemp(&tmp_file, "pack/tmp_rev_XXXXXX");
+			rev_name = strbuf_detach(&tmp_file, NULL);
+		} else {
+			unlink(rev_name);
+			fd = open(rev_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
+			if (fd < 0)
+				die_errno("unable to create '%s'", rev_name);
+		}
+		f = hashfd(fd, rev_name);
+	} else if (flags & WRITE_REV_VERIFY) {
+		struct stat statbuf;
+		if (stat(rev_name, &statbuf)) {
+			if (errno == ENOENT) {
+				/* .rev files are optional */
+				return NULL;
+			} else
+				die_errno(_("could not stat: %s"), rev_name);
+		}
+		f = hashfd_check(rev_name);
+	} else
+		return NULL;
+
+	write_rev_header(f);
+
+	write_rev_index_positions(f, objects, nr_objects);
+	write_rev_trailer(f, hash);
+
+	if (rev_name && adjust_shared_perm(rev_name) < 0)
+		die(_("failed to make %s readable"), rev_name);
+
+	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_CLOSE |
+				    ((flags & WRITE_IDX_VERIFY) ? 0 : CSUM_FSYNC));
+
+	return rev_name;
+}
+
 off_t write_pack_header(struct hashfile *f, uint32_t nr_entries)
 {
 	struct pack_header hdr;
@@ -341,7 +451,7 @@ void finish_tmp_packfile(struct strbuf *name_buffer,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[])
 {
-	const char *idx_tmp_name;
+	const char *idx_tmp_name, *rev_tmp_name = NULL;
 	int basename_len = name_buffer->len;
 
 	if (adjust_shared_perm(pack_tmp_name))
@@ -352,6 +462,9 @@ void finish_tmp_packfile(struct strbuf *name_buffer,
 	if (adjust_shared_perm(idx_tmp_name))
 		die_errno("unable to make temporary index file readable");
 
+	rev_tmp_name = write_rev_file(NULL, written_list, nr_written, hash,
+				      pack_idx_opts->flags);
+
 	strbuf_addf(name_buffer, "%s.pack", hash_to_hex(hash));
 
 	if (rename(pack_tmp_name, name_buffer->buf))
@@ -365,5 +478,13 @@ void finish_tmp_packfile(struct strbuf *name_buffer,
 
 	strbuf_setlen(name_buffer, basename_len);
 
+	if (rev_tmp_name) {
+		strbuf_addf(name_buffer, "%s.rev", hash_to_hex(hash));
+		if (rename(rev_tmp_name, name_buffer->buf))
+			die_errno("unable to rename temporary reverse-index file");
+	}
+
+	strbuf_setlen(name_buffer, basename_len);
+
 	free((void *)idx_tmp_name);
 }
diff --git a/pack.h b/pack.h
index 9fc0945ac9..30439e0784 100644
--- a/pack.h
+++ b/pack.h
@@ -42,6 +42,8 @@ struct pack_idx_option {
 	/* flag bits */
 #define WRITE_IDX_VERIFY 01 /* verify only, do not write the idx file */
 #define WRITE_IDX_STRICT 02
+#define WRITE_REV 04
+#define WRITE_REV_VERIFY 010
 
 	uint32_t version;
 	uint32_t off32_limit;
@@ -87,6 +89,8 @@ off_t write_pack_header(struct hashfile *f, uint32_t);
 void fixup_pack_header_footer(int, unsigned char *, const char *, uint32_t, unsigned char *, off_t);
 char *index_pack_lockfile(int fd);
 
+const char *write_rev_file(const char *rev_name, struct pack_idx_entry **objects, uint32_t nr_objects, const unsigned char *hash, unsigned flags);
+
 /*
  * The "hdr" output buffer should be at least this big, which will handle sizes
  * up to 2^67.
-- 
2.30.0.138.g6d7191ea01



* [PATCH v2 3/8] builtin/index-pack.c: write reverse indexes
  2021-01-13 22:28 ` [PATCH v2 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
  2021-01-13 22:28   ` [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files Taylor Blau
  2021-01-13 22:28   ` [PATCH v2 2/8] pack-write.c: prepare to write 'pack-*.rev' files Taylor Blau
@ 2021-01-13 22:28   ` Taylor Blau
  2021-01-22 23:53     ` Jeff King
  2021-01-13 22:28   ` [PATCH v2 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex' Taylor Blau
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2021-01-13 22:28 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Teach 'git index-pack' to optionally write and verify reverse indexes
with '--[no-]rev-index', and to respect the 'pack.writeReverseIndex'
configuration option.
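
The precedence between the configuration, the command-line option, and
'--verify' can be summarized in a small standalone sketch. The flag
values are the ones this series adds to pack.h; the function
'resolve_rev_flags' and its tri-state 'cli' argument are hypothetical
and only approximate what cmd_index_pack() does inline.

```c
#include <assert.h>

#define WRITE_REV        04
#define WRITE_REV_VERIFY 010

/*
 * config_enabled: value of pack.writeReverseIndex (0 or 1)
 * cli:            -1 = option absent, 0 = --no-rev-index, 1 = --rev-index
 * verify:         non-zero when --verify was given
 *
 * Returns 'flags' with exactly one of the two reverse-index bits set, or
 * neither, leaving all unrelated bits untouched.
 */
unsigned resolve_rev_flags(unsigned flags, int config_enabled,
			   int cli, int verify)
{
	int rev_index = config_enabled;

	assert(cli >= -1 && cli <= 1);
	if (cli >= 0)
		rev_index = cli; /* the command line wins over config */

	flags &= ~(unsigned)(WRITE_REV | WRITE_REV_VERIFY);
	if (rev_index)
		flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
	return flags;
}
```

Clearing both bits before setting one keeps the "cannot both write and
verify" condition in write_rev_file() unreachable from this path.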

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-index-pack.txt | 20 ++++++---
 builtin/index-pack.c             | 64 +++++++++++++++++++++++-----
 t/t5325-reverse-index.sh         | 71 ++++++++++++++++++++++++++++++++
 3 files changed, 139 insertions(+), 16 deletions(-)
 create mode 100755 t/t5325-reverse-index.sh

diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index af0c26232c..b65f380269 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -9,17 +9,18 @@ git-index-pack - Build pack index file for an existing packed archive
 SYNOPSIS
 --------
 [verse]
-'git index-pack' [-v] [-o <index-file>] <pack-file>
+'git index-pack' [-v] [-o <index-file>] [--[no-]rev-index] <pack-file>
 'git index-pack' --stdin [--fix-thin] [--keep] [-v] [-o <index-file>]
-                 [<pack-file>]
+		  [--[no-]rev-index] [<pack-file>]
 
 
 DESCRIPTION
 -----------
 Reads a packed archive (.pack) from the specified file, and
-builds a pack index file (.idx) for it.  The packed archive
-together with the pack index can then be placed in the
-objects/pack/ directory of a Git repository.
+builds a pack index file (.idx) for it. Optionally writes a
+reverse-index (.rev) for the specified pack. The packed
+archive together with the pack index can then be placed in
+the objects/pack/ directory of a Git repository.
 
 
 OPTIONS
@@ -33,7 +34,14 @@ OPTIONS
 	file is constructed from the name of packed archive
 	file by replacing .pack with .idx (and the program
 	fails if the name of packed archive does not end
-	with .pack).
+	with .pack). Incompatible with `--rev-index`.
+
+--[no-]rev-index::
+	When this flag is provided, generate a reverse index
+	(a `.rev` file) corresponding to the given pack. If
+	`--verify` is given, ensure that the existing
+	reverse index is correct. Takes precedence over
+	`pack.writeReverseIndex`.
 
 --stdin::
 	When this flag is provided, the pack is read from stdin
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 4b8d86e0ad..03408250b1 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -17,7 +17,7 @@
 #include "promisor-remote.h"
 
 static const char index_pack_usage[] =
-"git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--verify] [--strict] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
+"git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--[no-]rev-index] [--verify] [--strict] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
 
 struct object_entry {
 	struct pack_idx_entry idx;
@@ -1436,13 +1436,13 @@ static void fix_unresolved_deltas(struct hashfile *f)
 	free(sorted_by_pos);
 }
 
-static const char *derive_filename(const char *pack_name, const char *suffix,
-				   struct strbuf *buf)
+static const char *derive_filename(const char *pack_name, const char *strip,
+				   const char *suffix, struct strbuf *buf)
 {
 	size_t len;
-	if (!strip_suffix(pack_name, ".pack", &len))
-		die(_("packfile name '%s' does not end with '.pack'"),
-		    pack_name);
+	if (!strip_suffix(pack_name, strip, &len))
+		die(_("packfile name '%s' does not end with '%s'"),
+		    pack_name, strip);
 	strbuf_add(buf, pack_name, len);
 	strbuf_addch(buf, '.');
 	strbuf_addstr(buf, suffix);
@@ -1459,7 +1459,7 @@ static void write_special_file(const char *suffix, const char *msg,
 	int msg_len = strlen(msg);
 
 	if (pack_name)
-		filename = derive_filename(pack_name, suffix, &name_buf);
+		filename = derive_filename(pack_name, ".pack", suffix, &name_buf);
 	else
 		filename = odb_pack_name(&name_buf, hash, suffix);
 
@@ -1484,12 +1484,14 @@ static void write_special_file(const char *suffix, const char *msg,
 
 static void final(const char *final_pack_name, const char *curr_pack_name,
 		  const char *final_index_name, const char *curr_index_name,
+		  const char *final_rev_index_name, const char *curr_rev_index_name,
 		  const char *keep_msg, const char *promisor_msg,
 		  unsigned char *hash)
 {
 	const char *report = "pack";
 	struct strbuf pack_name = STRBUF_INIT;
 	struct strbuf index_name = STRBUF_INIT;
+	struct strbuf rev_index_name = STRBUF_INIT;
 	int err;
 
 	if (!from_stdin) {
@@ -1524,6 +1526,16 @@ static void final(const char *final_pack_name, const char *curr_pack_name,
 	} else
 		chmod(final_index_name, 0444);
 
+	if (curr_rev_index_name) {
+		if (final_rev_index_name != curr_rev_index_name) {
+			if (!final_rev_index_name)
+				final_rev_index_name = odb_pack_name(&rev_index_name, hash, "rev");
+			if (finalize_object_file(curr_rev_index_name, final_rev_index_name))
+				die(_("cannot store reverse index file"));
+		} else
+			chmod(final_rev_index_name, 0444);
+	}
+
 	if (do_fsck_object) {
 		struct packed_git *p;
 		p = add_packed_git(final_index_name, strlen(final_index_name), 0);
@@ -1553,6 +1565,7 @@ static void final(const char *final_pack_name, const char *curr_pack_name,
 		}
 	}
 
+	strbuf_release(&rev_index_name);
 	strbuf_release(&index_name);
 	strbuf_release(&pack_name);
 }
@@ -1578,6 +1591,12 @@ static int git_index_pack_config(const char *k, const char *v, void *cb)
 		}
 		return 0;
 	}
+	if (!strcmp(k, "pack.writereverseindex")) {
+		if (git_config_bool(k, v))
+			opts->flags |= WRITE_REV;
+		else
+			opts->flags &= ~WRITE_REV;
+	}
 	return git_default_config(k, v, cb);
 }
 
@@ -1695,12 +1714,14 @@ static void show_pack_info(int stat_only)
 
 int cmd_index_pack(int argc, const char **argv, const char *prefix)
 {
-	int i, fix_thin_pack = 0, verify = 0, stat_only = 0;
+	int i, fix_thin_pack = 0, verify = 0, stat_only = 0, rev_index;
 	const char *curr_index;
-	const char *index_name = NULL, *pack_name = NULL;
+	const char *curr_rev_index = NULL;
+	const char *index_name = NULL, *pack_name = NULL, *rev_index_name = NULL;
 	const char *keep_msg = NULL;
 	const char *promisor_msg = NULL;
 	struct strbuf index_name_buf = STRBUF_INIT;
+	struct strbuf rev_index_name_buf = STRBUF_INIT;
 	struct pack_idx_entry **idx_objects;
 	struct pack_idx_option opts;
 	unsigned char pack_hash[GIT_MAX_RAWSZ];
@@ -1727,6 +1748,8 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (prefix && chdir(prefix))
 		die(_("Cannot come back to cwd"));
 
+	rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
+
 	for (i = 1; i < argc; i++) {
 		const char *arg = argv[i];
 
@@ -1805,6 +1828,10 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 				if (hash_algo == GIT_HASH_UNKNOWN)
 					die(_("unknown hash algorithm '%s'"), arg);
 				repo_set_hash_algo(the_repository, hash_algo);
+			} else if (!strcmp(arg, "--rev-index")) {
+				rev_index = 1;
+			} else if (!strcmp(arg, "--no-rev-index")) {
+				rev_index = 0;
 			} else
 				usage(index_pack_usage);
 			continue;
@@ -1824,7 +1851,16 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (from_stdin && hash_algo)
 		die(_("--object-format cannot be used with --stdin"));
 	if (!index_name && pack_name)
-		index_name = derive_filename(pack_name, "idx", &index_name_buf);
+		index_name = derive_filename(pack_name, ".pack", "idx", &index_name_buf);
+
+	opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY);
+	if (rev_index) {
+		opts.flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
+		if (index_name)
+			rev_index_name = derive_filename(index_name,
+							 ".idx", "rev",
+							 &rev_index_name_buf);
+	}
 
 	if (verify) {
 		if (!index_name)
@@ -1878,11 +1914,16 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	for (i = 0; i < nr_objects; i++)
 		idx_objects[i] = &objects[i].idx;
 	curr_index = write_idx_file(index_name, idx_objects, nr_objects, &opts, pack_hash);
+	if (rev_index)
+		curr_rev_index = write_rev_file(rev_index_name, idx_objects,
+						nr_objects, pack_hash,
+						opts.flags);
 	free(idx_objects);
 
 	if (!verify)
 		final(pack_name, curr_pack,
 		      index_name, curr_index,
+		      rev_index_name, curr_rev_index,
 		      keep_msg, promisor_msg,
 		      pack_hash);
 	else
@@ -1893,10 +1934,13 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 
 	free(objects);
 	strbuf_release(&index_name_buf);
+	strbuf_release(&rev_index_name_buf);
 	if (pack_name == NULL)
 		free((void *) curr_pack);
 	if (index_name == NULL)
 		free((void *) curr_index);
+	if (rev_index_name == NULL)
+		free((void *) curr_rev_index);
 
 	/*
 	 * Let the caller know this pack is not self contained
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
new file mode 100755
index 0000000000..2dae213126
--- /dev/null
+++ b/t/t5325-reverse-index.sh
@@ -0,0 +1,71 @@
+#!/bin/sh
+
+test_description='on-disk reverse index'
+. ./test-lib.sh
+
+packdir=.git/objects/pack
+
+test_expect_success 'setup' '
+	test_commit base &&
+
+	pack=$(git pack-objects --all $packdir/pack) &&
+	rev=$packdir/pack-$pack.rev &&
+
+	test_path_is_missing $rev
+'
+
+test_index_pack () {
+	rm -f $rev &&
+	conf=$1 &&
+	shift &&
+	# remove the index since Windows won't overwrite an existing file
+	rm $packdir/pack-$pack.idx &&
+	git -c pack.writeReverseIndex=$conf index-pack "$@" \
+		$packdir/pack-$pack.pack
+}
+
+test_expect_success 'index-pack with pack.writeReverseIndex' '
+	test_index_pack "" &&
+	test_path_is_missing $rev &&
+
+	test_index_pack false &&
+	test_path_is_missing $rev &&
+
+	test_index_pack true &&
+	test_path_is_file $rev
+'
+
+test_expect_success 'index-pack with --[no-]rev-index' '
+	for conf in "" true false
+	do
+		test_index_pack "$conf" --rev-index &&
+		test_path_exists $rev &&
+
+		test_index_pack "$conf" --no-rev-index &&
+		test_path_is_missing $rev
+	done
+'
+
+test_expect_success 'index-pack can verify reverse indexes' '
+	test_when_finished "rm -f $rev" &&
+	test_index_pack true &&
+
+	test_path_is_file $rev &&
+	git index-pack --rev-index --verify $packdir/pack-$pack.pack &&
+
+	# Intentionally corrupt the reverse index.
+	chmod u+w $rev &&
+	printf "xxxx" | dd of=$rev bs=1 count=4 conv=notrunc &&
+
+	test_must_fail git index-pack --rev-index --verify \
+		$packdir/pack-$pack.pack 2>err &&
+	grep "validation error" err
+'
+
+test_expect_success 'index-pack infers reverse index name with -o' '
+	git index-pack --rev-index -o other.idx $packdir/pack-$pack.pack &&
+	test_path_is_file other.idx &&
+	test_path_is_file other.rev
+'
+
+test_done
-- 
2.30.0.138.g6d7191ea01



* [PATCH v2 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex'
  2021-01-13 22:28 ` [PATCH v2 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                     ` (2 preceding siblings ...)
  2021-01-13 22:28   ` [PATCH v2 3/8] builtin/index-pack.c: write reverse indexes Taylor Blau
@ 2021-01-13 22:28   ` Taylor Blau
  2021-01-22 23:57     ` Jeff King
  2021-01-13 22:28   ` [PATCH v2 5/8] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex' Taylor Blau
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2021-01-13 22:28 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Now that we have an implementation that can write the new reverse index
format, enable writing a .rev file in 'git pack-objects' by consulting
the pack.writeReverseIndex configuration variable.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c   |  7 +++++++
 t/t5325-reverse-index.sh | 13 +++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5b0c4489e2..d784569200 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2955,6 +2955,13 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			    pack_idx_opts.version);
 		return 0;
 	}
+	if (!strcmp(k, "pack.writereverseindex")) {
+		if (git_config_bool(k, v))
+			pack_idx_opts.flags |= WRITE_REV;
+		else
+			pack_idx_opts.flags &= ~WRITE_REV;
+		return 0;
+	}
 	if (!strcmp(k, "uploadpack.blobpackfileuri")) {
 		struct configured_exclusion *ex = xmalloc(sizeof(*ex));
 		const char *oid_end, *pack_end;
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index 2dae213126..87040263b7 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -68,4 +68,17 @@ test_expect_success 'index-pack infers reverse index name with -o' '
 	test_path_is_file other.rev
 '
 
+test_expect_success 'pack-objects respects pack.writeReverseIndex' '
+	test_when_finished "rm -fr pack-1-*" &&
+
+	git -c pack.writeReverseIndex= pack-objects --all pack-1 &&
+	test_path_is_missing pack-1-*.rev &&
+
+	git -c pack.writeReverseIndex=false pack-objects --all pack-1 &&
+	test_path_is_missing pack-1-*.rev &&
+
+	git -c pack.writeReverseIndex=true pack-objects --all pack-1 &&
+	test_path_is_file pack-1-*.rev
+'
+
 test_done
-- 
2.30.0.138.g6d7191ea01



* [PATCH v2 5/8] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex'
  2021-01-13 22:28 ` [PATCH v2 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                     ` (3 preceding siblings ...)
  2021-01-13 22:28   ` [PATCH v2 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex' Taylor Blau
@ 2021-01-13 22:28   ` Taylor Blau
  2021-01-13 22:28   ` [PATCH v2 6/8] t: prepare for GIT_TEST_WRITE_REV_INDEX Taylor Blau
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-13 22:28 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Now that the pack.writeReverseIndex configuration is respected in both
'git index-pack' and 'git pack-objects' (and therefore, all of their
callers), we can safely advertise it for use in the git-config manual.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index 837f1b1679..3da4ea98e2 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -133,3 +133,10 @@ pack.writeBitmapHashCache::
 	between an older, bitmapped pack and objects that have been
 	pushed since the last gc). The downside is that it consumes 4
 	bytes per object of disk space. Defaults to true.
+
+pack.writeReverseIndex::
+	When true, git will write a corresponding .rev file (see:
+	link:../technical/pack-format.html[Documentation/technical/pack-format.txt])
+	for each new packfile that it writes in all places except for
+	linkgit:git-fast-import[1] and in the bulk checkin mechanism.
+	Defaults to false.
-- 
2.30.0.138.g6d7191ea01



* [PATCH v2 6/8] t: prepare for GIT_TEST_WRITE_REV_INDEX
  2021-01-13 22:28 ` [PATCH v2 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                     ` (4 preceding siblings ...)
  2021-01-13 22:28   ` [PATCH v2 5/8] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex' Taylor Blau
@ 2021-01-13 22:28   ` Taylor Blau
  2021-01-13 22:28   ` [PATCH v2 7/8] t: support GIT_TEST_WRITE_REV_INDEX Taylor Blau
  2021-01-13 22:28   ` [PATCH v2 8/8] pack-revindex: ensure that on-disk reverse indexes are given precedence Taylor Blau
  7 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-13 22:28 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

In the next patch, we'll add support for unconditionally enabling the
'pack.writeReverseIndex' setting with a new GIT_TEST_WRITE_REV_INDEX
environment variable.

This causes a little bit of fallout in tests that, for example, compare
the list of files in the pack directory against an expected list and are
unprepared to see .rev files in the output.

Those locations can be cleaned up to look for specific file extensions,
rather than take everything in the pack directory (for instance) and
then grep out unwanted items.

Once the pack.writeReverseIndex option has been thoroughly
tested, we will default it to 'true', removing GIT_TEST_WRITE_REV_INDEX,
and making it possible to revert this patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t5319-multi-pack-index.sh |  5 +++--
 t/t5325-reverse-index.sh    |  4 ++++
 t/t5604-clone-reference.sh  |  2 +-
 t/t5702-protocol-v2.sh      | 12 ++++++++----
 t/t6500-gc.sh               |  6 +++---
 t/t9300-fast-import.sh      |  5 ++++-
 6 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 297de502a9..2fc3aadbd1 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -710,8 +710,9 @@ test_expect_success 'expire respects .keep files' '
 		PACKA=$(ls .git/objects/pack/a-pack*\.pack | sed s/\.pack\$//) &&
 		touch $PACKA.keep &&
 		git multi-pack-index expire &&
-		ls -S .git/objects/pack/a-pack* | grep $PACKA >a-pack-files &&
-		test_line_count = 3 a-pack-files &&
+		test_path_is_file $PACKA.idx &&
+		test_path_is_file $PACKA.keep &&
+		test_path_is_file $PACKA.pack &&
 		test-tool read-midx .git/objects | grep idx >midx-list &&
 		test_line_count = 2 midx-list
 	)
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index 87040263b7..be452bb343 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -3,6 +3,10 @@
 test_description='on-disk reverse index'
 . ./test-lib.sh
 
+# The below tests want control over the 'pack.writeReverseIndex' setting
+# themselves to assert various combinations of it with other options.
+sane_unset GIT_TEST_WRITE_REV_INDEX
+
 packdir=.git/objects/pack
 
 test_expect_success 'setup' '
diff --git a/t/t5604-clone-reference.sh b/t/t5604-clone-reference.sh
index 2f7be23044..7d93588982 100755
--- a/t/t5604-clone-reference.sh
+++ b/t/t5604-clone-reference.sh
@@ -326,7 +326,7 @@ test_expect_success SYMLINKS 'clone repo with symlinked or unknown files at obje
 	for raw in $(ls T*.raw)
 	do
 		sed -e "s!/../!/Y/!; s![0-9a-f]\{38,\}!Z!" -e "/commit-graph/d" \
-		    -e "/multi-pack-index/d" <$raw >$raw.de-sha-1 &&
+		    -e "/multi-pack-index/d" -e "/rev/d" <$raw >$raw.de-sha-1 &&
 		sort $raw.de-sha-1 >$raw.de-sha || return 1
 	done &&
 
diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index 7d5b17909b..73cd9e3ff6 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -848,8 +848,10 @@ test_expect_success 'part of packfile response provided as URI' '
 	test -f h2found &&
 
 	# Ensure that there are exactly 6 files (3 .pack and 3 .idx).
-	ls http_child/.git/objects/pack/* >filelist &&
-	test_line_count = 6 filelist
+	ls http_child/.git/objects/pack/*.pack >packlist &&
+	ls http_child/.git/objects/pack/*.idx >idxlist &&
+	test_line_count = 3 idxlist &&
+	test_line_count = 3 packlist
 '
 
 test_expect_success 'fetching with valid packfile URI but invalid hash fails' '
@@ -902,8 +904,10 @@ test_expect_success 'packfile-uri with transfer.fsckobjects' '
 		clone "$HTTPD_URL/smart/http_parent" http_child &&
 
 	# Ensure that there are exactly 4 files (2 .pack and 2 .idx).
-	ls http_child/.git/objects/pack/* >filelist &&
-	test_line_count = 4 filelist
+	ls http_child/.git/objects/pack/*.pack >packlist &&
+	ls http_child/.git/objects/pack/*.idx >idxlist &&
+	test_line_count = 2 idxlist &&
+	test_line_count = 2 packlist
 '
 
 test_expect_success 'packfile-uri with transfer.fsckobjects fails on bad object' '
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 4a3b8f48ac..f76586f808 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -106,17 +106,17 @@ test_expect_success 'auto gc with too many loose objects does not attempt to cre
 	test_commit "$(test_oid obj2)" &&
 	# Our first gc will create a pack; our second will create a second pack
 	git gc --auto &&
-	ls .git/objects/pack | sort >existing_packs &&
+	ls .git/objects/pack/pack-*.pack | sort >existing_packs &&
 	test_commit "$(test_oid obj3)" &&
 	test_commit "$(test_oid obj4)" &&
 
 	git gc --auto 2>err &&
 	test_i18ngrep ! "^warning:" err &&
-	ls .git/objects/pack/ | sort >post_packs &&
+	ls .git/objects/pack/pack-*.pack | sort >post_packs &&
 	comm -1 -3 existing_packs post_packs >new &&
 	comm -2 -3 existing_packs post_packs >del &&
 	test_line_count = 0 del && # No packs are deleted
-	test_line_count = 2 new # There is one new pack and its .idx
+	test_line_count = 1 new # There is one new pack
 '
 
 test_expect_success 'gc --no-quiet' '
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 308c1ef42c..2cc1f43c1b 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -1629,7 +1629,10 @@ test_expect_success 'O: blank lines not necessary after other commands' '
 	INPUT_END
 
 	git fast-import <input &&
-	test 8 = $(find .git/objects/pack -type f | grep -v multi-pack-index | wc -l) &&
+	ls -la .git/objects/pack/pack-*.pack >packlist &&
+	ls -la .git/objects/pack/pack-*.idx >idxlist &&
+	test_line_count = 4 idxlist &&
+	test_line_count = 4 packlist &&
 	test $(git rev-parse refs/tags/O3-2nd) = $(git rev-parse O3^) &&
 	git log --reverse --pretty=oneline O3 | sed s/^.*z// >actual &&
 	test_cmp expect actual
-- 
2.30.0.138.g6d7191ea01



* [PATCH v2 7/8] t: support GIT_TEST_WRITE_REV_INDEX
  2021-01-13 22:28 ` [PATCH v2 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                     ` (5 preceding siblings ...)
  2021-01-13 22:28   ` [PATCH v2 6/8] t: prepare for GIT_TEST_WRITE_REV_INDEX Taylor Blau
@ 2021-01-13 22:28   ` Taylor Blau
  2021-01-13 22:28   ` [PATCH v2 8/8] pack-revindex: ensure that on-disk reverse indexes are given precedence Taylor Blau
  7 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-13 22:28 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Add a new option that unconditionally enables the pack.writeReverseIndex
setting in order to run the whole test suite in a mode that generates
on-disk reverse indexes. Additionally, enable this mode in the second
run of tests under linux-gcc in 'ci/run-build-and-tests.sh'.

Once on-disk reverse indexes are proven out over several releases, we
can change the default value of that configuration to 'true', and drop
this patch.
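
The override itself is small. A standalone approximation follows,
assuming a simplified boolean parser; git's own git_env_bool() accepts
more spellings, and the names 'parse_bool_value', 'apply_test_override'
and 'apply_env_override' are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define WRITE_REV 04 /* same value as in pack.h from this series */

/*
 * Simplified, hypothetical boolean parser: an unset variable (NULL) yields
 * the default, a few exact spellings count as true, anything else is false.
 */
static int parse_bool_value(const char *v, int def)
{
	if (!v)
		return def;
	return !strcmp(v, "1") || !strcmp(v, "true") ||
	       !strcmp(v, "yes") || !strcmp(v, "on");
}

/* Force WRITE_REV on when the test variable's value is true. */
unsigned apply_test_override(unsigned flags, const char *env_value)
{
	if (parse_bool_value(env_value, 0))
		flags |= WRITE_REV;
	return flags;
}

/* Convenience wrapper that reads the real environment. */
unsigned apply_env_override(unsigned flags)
{
	return apply_test_override(flags, getenv("GIT_TEST_WRITE_REV_INDEX"));
}
```

Note that the override only ever sets the bit. An unset or false
variable leaves whatever the configuration chose in place.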

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/index-pack.c      | 5 ++++-
 builtin/pack-objects.c    | 2 ++
 ci/run-build-and-tests.sh | 1 +
 pack-revindex.h           | 2 ++
 t/README                  | 3 +++
 5 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 03408250b1..0bde325a8b 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1748,7 +1748,10 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (prefix && chdir(prefix))
 		die(_("Cannot come back to cwd"));
 
-	rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
+	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
+		rev_index = 1;
+	else
+		rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
 
 	for (i = 1; i < argc; i++) {
 		const char *arg = argv[i];
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index d784569200..24df0c98f7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3601,6 +3601,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	reset_pack_idx_option(&pack_idx_opts);
 	git_config(git_pack_config, NULL);
+	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
+		pack_idx_opts.flags |= WRITE_REV;
 
 	progress = isatty(2);
 	argc = parse_options(argc, argv, prefix, pack_objects_options,
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index 6c27b886b8..d1cbf330a1 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -22,6 +22,7 @@ linux-gcc)
 	export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1
 	export GIT_TEST_MULTI_PACK_INDEX=1
 	export GIT_TEST_ADD_I_USE_BUILTIN=1
+	export GIT_TEST_WRITE_REV_INDEX=1
 	make test
 	;;
 linux-clang)
diff --git a/pack-revindex.h b/pack-revindex.h
index 01622cf21a..7237b2b6f8 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -16,6 +16,8 @@
  *   can be found
  */
 
+#define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
+
 struct packed_git;
 
 /*
diff --git a/t/README b/t/README
index c730a70770..0f97a51640 100644
--- a/t/README
+++ b/t/README
@@ -439,6 +439,9 @@ GIT_TEST_DEFAULT_HASH=<hash-algo> specifies which hash algorithm to
 use in the test scripts. Recognized values for <hash-algo> are "sha1"
 and "sha256".
 
+GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
+'pack.writeReverseIndex' setting.
+
 Naming Tests
 ------------
 
-- 
2.30.0.138.g6d7191ea01



* [PATCH v2 8/8] pack-revindex: ensure that on-disk reverse indexes are given precedence
  2021-01-13 22:28 ` [PATCH v2 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                     ` (6 preceding siblings ...)
  2021-01-13 22:28   ` [PATCH v2 7/8] t: support GIT_TEST_WRITE_REV_INDEX Taylor Blau
@ 2021-01-13 22:28   ` Taylor Blau
  7 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-13 22:28 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

When an on-disk reverse index exists, there is no need to generate one
in memory. In fact, doing so can be slow, and require large amounts of
the heap.

Let's make sure that we treat the on-disk reverse index with precedence
(i.e., that when it exists, we don't bother trying to generate an
equivalent one in memory) by teaching Git how to conditionally die()
when generating a reverse index in memory.

Then, add a test to ensure that when (a) an on-disk reverse index
exists, and (b) when setting GIT_TEST_REV_INDEX_DIE_IN_MEMORY, that we
do not die, implying that we read from the on-disk one.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c          | 4 ++++
 pack-revindex.h          | 1 +
 t/t5325-reverse-index.sh | 9 +++++++++
 3 files changed, 14 insertions(+)

diff --git a/pack-revindex.c b/pack-revindex.c
index 369812dd21..f264319f34 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -2,6 +2,7 @@
 #include "pack-revindex.h"
 #include "object-store.h"
 #include "packfile.h"
+#include "config.h"
 
 struct revindex_entry {
 	off_t offset;
@@ -166,6 +167,9 @@ static void create_pack_revindex(struct packed_git *p)
 
 static int load_pack_revindex_from_memory(struct packed_git *p)
 {
+	if (git_env_bool(GIT_TEST_REV_INDEX_DIE_IN_MEMORY, 0))
+		die("dying as requested by '%s'",
+		    GIT_TEST_REV_INDEX_DIE_IN_MEMORY);
 	if (open_pack_index(p))
 		return -1;
 	create_pack_revindex(p);
diff --git a/pack-revindex.h b/pack-revindex.h
index 7237b2b6f8..97f5893d3a 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -17,6 +17,7 @@
  */
 
 #define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
+#define GIT_TEST_REV_INDEX_DIE_IN_MEMORY "GIT_TEST_REV_INDEX_DIE_IN_MEMORY"
 
 struct packed_git;
 
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index be452bb343..a344b18d7e 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -85,4 +85,13 @@ test_expect_success 'pack-objects respects pack.writeReverseIndex' '
 	test_path_is_file pack-1-*.rev
 '
 
+test_expect_success 'reverse index is not generated when available on disk' '
+	test_index_pack true &&
+	test_path_is_file $rev &&
+
+	git rev-parse HEAD >tip &&
+	GIT_TEST_REV_INDEX_DIE_IN_MEMORY=1 git cat-file \
+		--batch-check="%(objectsize:disk)" <tip
+'
+
 test_done
-- 
2.30.0.138.g6d7191ea01


* Re: [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files
  2021-01-13 22:28   ` [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files Taylor Blau
@ 2021-01-14  7:22     ` Junio C Hamano
  2021-01-14 12:07       ` Derrick Stolee
  2021-01-14 18:28       ` Taylor Blau
  2021-01-14  7:26     ` Junio C Hamano
  2021-01-22 22:54     ` Jeff King
  2 siblings, 2 replies; 54+ messages in thread
From: Junio C Hamano @ 2021-01-14  7:22 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, jrnieder, peff

Taylor Blau <me@ttaylorr.com> writes:

> +== pack-*.rev files have the format:
> +
> +  - A 4-byte magic number '0x52494458' ('RIDX').
> +
> +  - A 4-byte version identifier (= 1)
> +
> +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256)

These two are presumably 4-byte-wide network byte order integers.
We should spell it out.

> +  - A table of index positions, sorted by their corresponding offsets in the
> +    packfile.

Likewise, how wide is each entry and in what byte order, and how
many entries are there in the table?

	... oh, what about the "one beyond the last"?  We cannot
	go back to the forward index to learn the offset of such
	a non-existent object, can we?

Again, I expect it to be 4-byte-wide network byte order integer.

> -int load_pack_revindex(struct packed_git *p)
> +static int load_pack_revindex_from_memory(struct packed_git *p)

I said it when I saw the beginning of v1 API patches, but it is a
bit unreasonable to call the act of "computing from the forward
index" "to load from memory".  Loading from disk perfectly works as
a phrase, though.

> +#define RIDX_MIN_SIZE (12 + (2 * the_hash_algo->rawsz))
> +
> +static int load_revindex_from_disk(char *revindex_name,
> +				   uint32_t num_objects,
> +				   const void **data, size_t *len)
> +{
> +	int fd, ret = 0;
> +	struct stat st;
> +	size_t revindex_size;
> +
> +	fd = git_open(revindex_name);
> +
> +	if (fd < 0) {
> +		ret = -1;
> +		goto cleanup;
> +	}
> +	if (fstat(fd, &st)) {
> +		ret = error_errno(_("failed to read %s"), revindex_name);
> +		goto cleanup;
> +	}
> +
> +	revindex_size = xsize_t(st.st_size);
> +
> +	if (revindex_size < RIDX_MIN_SIZE) {
> +		ret = error(_("reverse-index file %s is too small"), revindex_name);
> +		goto cleanup;
> +	}
> +
> +	if (revindex_size - RIDX_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
> +		ret = error(_("reverse-index file %s is corrupt"), revindex_name);
> +		goto cleanup;
> +	}
> +
> +	*len = revindex_size;
> +	*data = xmmap(NULL, revindex_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +
> +cleanup:
> +	close(fd);
> +	return ret;
> +}
> +
> +static int load_pack_revindex_from_disk(struct packed_git *p)
> +{
> +	char *revindex_name;
> +	int ret;
> +	if (open_pack_index(p))
> +		return -1;
> +
> +	revindex_name = pack_revindex_filename(p);
> +
> +	ret = load_revindex_from_disk(revindex_name,
> +				      p->num_objects,
> +				      &p->revindex_map,
> +				      &p->revindex_size);
> +	if (ret)
> +		goto cleanup;
> +
> +	p->revindex_data = (char *)p->revindex_map + 12;

We've seen hardcoded constant "12" twice so far in this patch.

We need a C preprocessor macro "#define RIDX_FILE_HEADER_SIZE 12" or
something, perhaps?

> +cleanup:
> +	free(revindex_name);
> +	return ret;
> +}
> +
> +int load_pack_revindex(struct packed_git *p)
> +{
> +	if (p->revindex || p->revindex_data)
> +		return 0;
> +
> +	if (!load_pack_revindex_from_disk(p))
> +		return 0;
> +	else if (!load_pack_revindex_from_memory(p))
> +		return 0;
> +	return -1;
> +}
> +
>  int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
>  {
>  	unsigned lo, hi;
> @@ -203,18 +285,28 @@ int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
>  
>  uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
>  {
> -	if (!p->revindex)
> +	if (!(p->revindex || p->revindex_data))
>  		BUG("pack_pos_to_index: reverse index not yet loaded");
>  	if (p->num_objects <= pos)
>  		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
> -	return p->revindex[pos].nr;
> +
> +	if (p->revindex)
> +		return p->revindex[pos].nr;
> +	else
> +		return get_be32((char *)p->revindex_data + (pos * sizeof(uint32_t)));

Good.  We are using 32-bit uint in network byte order.  We should
document it as such.

Let's not strip const away while casting, though.  get_be32()
ensures that it only reads and never writes thru the pointer, and
p->revindex_data is a "const void *".

>  }
>  
>  off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos)
>  {
> -	if (!p->revindex)
> +	if (!(p->revindex || p->revindex_data))
>  		BUG("pack_pos_to_index: reverse index not yet loaded");
>  	if (p->num_objects < pos)
>  		BUG("pack_pos_to_offset: out-of-bounds object at %"PRIu32, pos);
> -	return p->revindex[pos].offset;
> +
> +	if (p->revindex)
> +		return p->revindex[pos].offset;
> +	else if (pos == p->num_objects)
> +		return p->pack_size - the_hash_algo->rawsz;

OK, here is the answer to my previous question.  We should document
that the table has num_objects entries in the on-disk file (we do
not need to say that there is no sentinel entry in the table at the
end).
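
To make the "one beyond the last" case concrete, here is a minimal,
self-contained sketch of the lookup as the quoted hunk appears to do it:
the table holds exactly num_objects entries, and the position one past
the end maps to the start of the pack trailer. The names struct toy_pack
and toy_pos_to_offset() are made up for this illustration, and the table
is kept in host byte order for brevity (the on-disk table is not).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few packed_git fields involved. */
struct toy_pack {
	uint32_t num_objects;
	uint64_t pack_size;          /* total size of the .pack file */
	size_t hash_rawsz;           /* 20 for SHA-1, 32 for SHA-256 */
	const uint64_t *idx_offsets; /* offset by index position (the .idx) */
	const uint32_t *rev;         /* index positions sorted by offset */
};

/* Map a pack position to a byte offset; pos == num_objects is allowed. */
static uint64_t toy_pos_to_offset(const struct toy_pack *p, uint32_t pos)
{
	assert(pos <= p->num_objects);
	if (pos == p->num_objects)
		/* No sentinel entry is stored: compute the trailer start. */
		return p->pack_size - p->hash_rawsz;
	/* Otherwise go through the forward index for the real offset. */
	return p->idx_offsets[p->rev[pos]];
}
```

With three objects at offsets 12, 150 and 300 in a 420-byte SHA-1 pack,
position 3 would resolve to 400 without any fourth table entry.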

> +	else
> +		return nth_packed_object_offset(p, pack_pos_to_index(p, pos));
>  }
> diff --git a/pack-revindex.h b/pack-revindex.h
> index 6e0320b08b..01622cf21a 100644
> --- a/pack-revindex.h
> +++ b/pack-revindex.h
> @@ -21,6 +21,9 @@ struct packed_git;
>  /*
>   * load_pack_revindex populates the revindex's internal data-structures for the
>   * given pack, returning zero on success and a negative value otherwise.
> + *
> + * If a '.rev' file is present, it is checked for consistency, mmap'd, and
> + * pointers are assigned into it (instead of using the in-memory variant).

Hmph, I missed where it got checked for consistency, though.  If the
file is corrupt and has say duplicated entries, we'd happily grab
the data via get_be32(), for example.

> @@ -55,7 +58,9 @@ uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
>   * If the reverse index has not yet been loaded, or the position is out of
>   * bounds, this function aborts.
>   *
> - * This function runs in constant time.
> + * This function runs in constant time under both in-memory and on-disk reverse
> + * indexes, but an additional step is taken to consult the corresponding .idx
> + * file when using the on-disk format.

Again, I know this is a kind of detail that is interesting to those
who implemented the function, but I wonder how it would help those
who wonder if they should call it or use some other method to
achieve what they want.



* Re: [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files
  2021-01-13 22:28   ` [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files Taylor Blau
  2021-01-14  7:22     ` Junio C Hamano
@ 2021-01-14  7:26     ` Junio C Hamano
  2021-01-14 18:13       ` Taylor Blau
  2021-01-22 22:54     ` Jeff King
  2 siblings, 1 reply; 54+ messages in thread
From: Junio C Hamano @ 2021-01-14  7:26 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, jrnieder, peff

Taylor Blau <me@ttaylorr.com> writes:

> Specify the format of the on-disk reverse index 'pack-*.rev' file, as
> well as prepare the code for the existence of such files.

We've changed the pack .idx file format once as the file format is
versioned.  I wonder if you considered placing the reverse index
information in the same file, with version bump, in the .idx?


* Re: [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files
  2021-01-14  7:22     ` Junio C Hamano
@ 2021-01-14 12:07       ` Derrick Stolee
  2021-01-14 19:57         ` Jeff King
  2021-01-14 18:28       ` Taylor Blau
  1 sibling, 1 reply; 54+ messages in thread
From: Derrick Stolee @ 2021-01-14 12:07 UTC (permalink / raw)
  To: Junio C Hamano, Taylor Blau; +Cc: git, dstolee, jrnieder, peff

On 1/14/2021 2:22 AM, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> 
>> +== pack-*.rev files have the format:
>> +
>> +  - A 4-byte magic number '0x52494458' ('RIDX').
>> +
>> +  - A 4-byte version identifier (= 1)
>> +
>> +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256)
> 
> These two are presumably 4-byte-wide network byte order integers.
> We should spell it out.

In the past, we've included the sentence "All multi-byte numbers are
in network byte order." to clarify this. These two entries need to
specify that the "identifier" is actually an integer. Perhaps these
three values could be provided as:

+== pack-*.rev files have the format:
+
+  - A 4-byte identifier '0x52494458' ('RIDX').
+
+  - A 4-byte integer version (= 1)
+
+  - A 4-byte integer hash function version (= 1 for SHA-1, 2 for SHA-256)

>>  /*
>>   * load_pack_revindex populates the revindex's internal data-structures for the
>>   * given pack, returning zero on success and a negative value otherwise.
>> + *
>> + * If a '.rev' file is present, it is checked for consistency, mmap'd, and
>> + * pointers are assigned into it (instead of using the in-memory variant).
> 
> Hmph, I missed where it got checked for consistency, though.  If the
> file is corrupt and has say duplicated entries, we'd happily grab
> the data via get_be32(), for example.

Even if the consistency check is just verifying the trailing hash, that
seems like something that requires O(N) before performing a lookup. Perhaps
this was copied from somewhere else, or means something different?

Thanks,
-Stolee



* Re: [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files
  2021-01-14  7:26     ` Junio C Hamano
@ 2021-01-14 18:13       ` Taylor Blau
  2021-01-14 20:57         ` Junio C Hamano
  0 siblings, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2021-01-14 18:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, dstolee, jrnieder, peff

On Wed, Jan 13, 2021 at 11:26:59PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Specify the format of the on-disk reverse index 'pack-*.rev' file, as
> > well as prepare the code for the existence of such files.
>
> We've changed the pack .idx file format once as the file format is
> versioned.  I wonder if you considered placing the reverse index
> information in the same file, with version bump, in the .idx?

Funny enough, I couldn't remember whether I considered this or not.
Peff reminded me off-list that he and I *had* considered this, but
decided against it.

The main benefit to introducing a new '.rev' format is that we can have
packs that can be upgraded to write a reverse index without having to
rewrite their forward index. It would also allow us to avoid teaching
other implementations about a new version of the index file (since they
can ignore it and continue to build their equivalent of the reverse
index file in memory by reading the forward index).

(Peff reminds me that dumb-http does look at remote .idx files, so this
new format would leak across to clients, whether or not that's something
to be concerned about...).

Of course, having the contents of the .rev file be included in the .idx
file nets us one fewer file to manage, but I'm not sure that's a reason
to do things one way or another.

Your response did pique my interest, since I was wondering if we could
improve the cold cache performance if the .rev file's contents were
included in the .idx, but after giving it some thought I don't think we
can. Reasons are:

  - If the reverse index's contents appears at the end of the .idx file,
    then in any .idx file large enough to matter, we'll almost certainly
    still be evicting cache lines back and forth when swapping between
    reading the forward- and reverse-indexes. So, no gains to be had
    there.

  - If, on the other hand, we included the reverse index's contents by
    interleaving it with the forward index's offsets, then we'd be
    worsening the cache performance of the forward index.

So, I'm more in favor of a new .rev file rather than a v3 .idx version.
Apologies for not including more of a rationale "why" in the cover
letter (had I not forgotten that I'd even considered it, I would have).

Thanks,
Taylor


* Re: [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files
  2021-01-14  7:22     ` Junio C Hamano
  2021-01-14 12:07       ` Derrick Stolee
@ 2021-01-14 18:28       ` Taylor Blau
  1 sibling, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-14 18:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, dstolee, jrnieder, peff

On Wed, Jan 13, 2021 at 11:22:53PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > +== pack-*.rev files have the format:
> > +
> > +  - A 4-byte magic number '0x52494458' ('RIDX').
> > +
> > +  - A 4-byte version identifier (= 1)
> > +
> > +  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256)
>
> These two are presumably 4-byte-wide network byte order integers.
> We should spell it out.

Yep, all entries are network order. I added a sentence at the bottom of
this sub-section to say as much.
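
As a rough illustration of the header layout under discussion (magic,
version, and hash function identifier, each a 4-byte integer in network
byte order), here is a minimal, self-contained sketch in plain C. The
names read_be32(), parse_ridx_header() and struct ridx_header are made
up for this example and are not taken from the series; Git itself would
use get_be32().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RIDX_SIGNATURE 0x52494458u /* "RIDX" */
#define RIDX_HEADER_SIZE 12

/* Hypothetical container for the two integer header fields. */
struct ridx_header {
	uint32_t version;
	uint32_t hash_id; /* 1 for SHA-1, 2 for SHA-256 */
};

/* Read a 4-byte big-endian (network order) integer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 0 on success, -1 if buf is not a version-1 '.rev' header. */
static int parse_ridx_header(const unsigned char *buf, size_t len,
			     struct ridx_header *out)
{
	assert(buf && out);
	if (len < RIDX_HEADER_SIZE)
		return -1;
	if (read_be32(buf) != RIDX_SIGNATURE)
		return -1;
	out->version = read_be32(buf + 4);
	out->hash_id = read_be32(buf + 8);
	if (out->version != 1)
		return -1;
	if (out->hash_id != 1 && out->hash_id != 2)
		return -1;
	return 0;
}
```

Reading byte by byte keeps the sketch independent of host endianness
and of the alignment of the mapped buffer.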

> We've seen hardcoded constant "12" twice so far in this patch.
>
> We need a C preprocessor macro "#define RIDX_FILE_HEADER_SIZE 12" or
> something, perhaps?

Good idea, thanks.

> > -	return p->revindex[pos].nr;
> > +
> > +	if (p->revindex)
> > +		return p->revindex[pos].nr;
> > +	else
> > +		return get_be32((char *)p->revindex_data + (pos * sizeof(uint32_t)));
>
> Good.  We are using 32-bit uint in network byte order.  We should
> document it as such.
>
> Let's not strip const away while casting, though.  get_be32()
> ensures that it only reads and never writes thru the pointer, and
> p->revindex_data is a "const void *".

Agreed, and thanks for the suggestion. I take it that what you mean is:

-		return get_be32((char *)p->revindex_data + (pos * sizeof(uint32_t)));
+		return get_be32((const char *)p->revindex_data + (pos * sizeof(uint32_t)));

...yes?

> > diff --git a/pack-revindex.h b/pack-revindex.h
> > index 6e0320b08b..01622cf21a 100644
> > --- a/pack-revindex.h
> > +++ b/pack-revindex.h
> > @@ -21,6 +21,9 @@ struct packed_git;
> >  /*
> >   * load_pack_revindex populates the revindex's internal data-structures for the
> >   * given pack, returning zero on success and a negative value otherwise.
> > + *
> > + * If a '.rev' file is present, it is checked for consistency, mmap'd, and
> > + * pointers are assigned into it (instead of using the in-memory variant).
>
> Hmph, I missed where it got checked for consistency, though.  If the
> file is corrupt and has say duplicated entries, we'd happily grab
> the data via get_be32(), for example.

It doesn't, I'm mistaken. I removed that incorrect detail from this
comment. Thanks for catching it.

Thanks,
Taylor


* Re: [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files
  2021-01-14 12:07       ` Derrick Stolee
@ 2021-01-14 19:57         ` Jeff King
  0 siblings, 0 replies; 54+ messages in thread
From: Jeff King @ 2021-01-14 19:57 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Junio C Hamano, Taylor Blau, git, dstolee, jrnieder

On Thu, Jan 14, 2021 at 07:07:08AM -0500, Derrick Stolee wrote:

> >> + * If a '.rev' file is present, it is checked for consistency, mmap'd, and
> >> + * pointers are assigned into it (instead of using the in-memory variant).
> > 
> > Hmph, I missed where it got checked for consistency, though.  If the
> > file is corrupt and has say duplicated entries, we'd happily grab
> > the data via get_be32(), for example.
> 
> Even if the consistency check is just verifying the trailing hash, that
> seems like something that requires O(N) before performing a lookup. Perhaps
> this was copied from somewhere else, or means something different?

For the .idx file, we check that the size is what we expect. This is
important because it lets us access the mapped bytes in normal use
without having to do a bounds check.

It looks like we do the same for the .rev file here, which is good.  If
calling that "checked for consistency" is too strong, I don't think it's
a big deal to drop the wording (we do not make any such claim for
open_pack_index()).
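
For reference, the size check being discussed boils down to the
arithmetic below: a 12-byte header, one 4-byte entry per object, and two
trailing hashes. This is only a sketch of that arithmetic under those
assumptions; ridx_size_ok() is a hypothetical name, not a function from
the patch, which uses RIDX_MIN_SIZE and st_mult() instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RIDX_HEADER_SIZE 12

/*
 * Return 1 if file_size is exactly header + 4 * num_objects + two
 * trailing hashes, and 0 otherwise. hash_rawsz is the raw hash size
 * (20 for SHA-1, 32 for SHA-256).
 */
static int ridx_size_ok(uint64_t file_size, uint32_t num_objects,
			size_t hash_rawsz)
{
	uint64_t min_size;

	assert(hash_rawsz > 0);
	min_size = RIDX_HEADER_SIZE + 2 * (uint64_t)hash_rawsz;
	if (file_size < min_size)
		return 0; /* too small to hold even header and trailer */
	/* A 32-bit count times 4 cannot overflow 64-bit arithmetic. */
	return file_size - min_size ==
	       (uint64_t)num_objects * sizeof(uint32_t);
}
```

Once the size is known to match, any position below num_objects can be
read from the map without a further bounds check against the file.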

-Peff


* Re: [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files
  2021-01-14 18:13       ` Taylor Blau
@ 2021-01-14 20:57         ` Junio C Hamano
  0 siblings, 0 replies; 54+ messages in thread
From: Junio C Hamano @ 2021-01-14 20:57 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, jrnieder, peff

Taylor Blau <me@ttaylorr.com> writes:

> Your response did pique my interest, since I was wondering if we could
> improve the cold cache performance if the .rev file's contents were
> included in the .idx,...

Yes, that was the primary thing I was wondering.  As long as it was
considered and rejected for good reason, that is good.

Thanks.




* Re: [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files
  2021-01-13 22:28   ` [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files Taylor Blau
  2021-01-14  7:22     ` Junio C Hamano
  2021-01-14  7:26     ` Junio C Hamano
@ 2021-01-22 22:54     ` Jeff King
  2021-01-25 17:44       ` Taylor Blau
  2 siblings, 1 reply; 54+ messages in thread
From: Jeff King @ 2021-01-22 22:54 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jrnieder

On Wed, Jan 13, 2021 at 05:28:06PM -0500, Taylor Blau wrote:

> (Numbers taken in the kernel after cheating and using the next patch to
> generate a reverse index). There are a couple of approaches to improve
> cold cache performance not pursued here:
> 
>   - We could include the object offsets in the reverse index format.
>     Predictably, this does result in fewer page faults, but it triples
>     the size of the file, while simultaneously duplicating a ton of data
>     already available in the .idx file. (This was the original way I
>     implemented the format, and it did show
>     `--batch-check='%(objectsize:disk)'` winning out against `--batch`.)
> 
>     On the other hand, this increase in size also results in a large
>     block-cache footprint, which could potentially hurt other workloads.
> 
>   - We could store the mapping from pack to index position in a more
>     cache-friendly way, like constructing a binary search tree from the
>     table and writing the values in breadth-first order. This would
>     result in much better locality, but the price you pay is trading
>     O(1) lookup in 'pack_pos_to_index()' for an O(log n) one (since you
>     can no longer directly index the table).
> 
> So, neither of these approaches are taken here. (Thankfully, the format
> is versioned, so we are free to pursue these in the future.) But, cold
> cache performance likely isn't interesting outside of one-off cases like
> asking for the size of an object directly. In real-world usage, Git is
> often performing many operations in the revindex,

I think you've nicely covered the arguments for and against the extra
offset here. This final paragraph ends in a comma, which makes me wonder
if you wanted to say something more. I'd guess it is along the lines
that most commands will be looking up more than one object, so that
cold-cache effort is amortized.

Or another way of thinking about it: 17ms versus 25ms in the cold-cache
for a _single_ object is not that big a deal, because the extra 8ms does
not scale as we ask about more objects. Here's an actual argument in
numbers (test repo is linux.git after building a .rev file using your
series):

For a single object, the extra cold-cache costs give --batch a slight
edge:

  $ git rev-parse HEAD >obj
  $ hyperfine -p 'echo 3 | sudo tee /proc/sys/vm/drop_caches' \
                 'git cat-file --buffer --batch-check="%(objectsize:disk)" <obj' \
                 'git cat-file --buffer --batch <obj'

  Benchmark #1: git cat-file --buffer --batch-check="%(objectsize:disk)" <obj
    Time (mean ± σ):      37.2 ms ±   8.3 ms    [User: 2.6 ms, System: 4.6 ms]
    Range (min … max):    28.5 ms …  55.6 ms    10 runs
   
  Benchmark #2: git cat-file --buffer --batch <obj
    Time (mean ± σ):      27.4 ms ±   3.4 ms    [User: 2.9 ms, System: 2.5 ms]
    Range (min … max):    23.2 ms …  37.1 ms    51 runs
   
  Summary
    'git cat-file --buffer --batch <obj' ran
      1.36 ± 0.35 times faster than 'git cat-file --buffer --batch-check="%(objectsize:disk)" <obj'

But with even a moderate number of objects, that's reversed:

  $ git cat-file --batch-all-objects --batch-check='%(objectname)' |
    shuffle | head -1000 >obj-1000
  $ hyperfine -p 'echo 3 | sudo tee /proc/sys/vm/drop_caches' \
                 'git cat-file --buffer --batch-check="%(objectsize:disk)" <obj-1000' \
                 'git cat-file --buffer --batch <obj-1000'
  
  Benchmark #1: git cat-file --buffer --batch-check="%(objectsize:disk)" <obj-1000
    Time (mean ± σ):      1.599 s ±  0.285 s    [User: 22.4 ms, System: 334.5 ms]
    Range (min … max):    0.816 s …  1.762 s    10 runs
   
  Benchmark #2: git cat-file --buffer --batch <obj-1000
    Time (mean ± σ):      1.972 s ±  0.225 s    [User: 343.5 ms, System: 404.2 ms]
    Range (min … max):    1.691 s …  2.283 s    10 runs
   
  Summary
    'git cat-file --buffer --batch-check="%(objectsize:disk)" <obj-1000' ran
      1.23 ± 0.26 times faster than 'git cat-file --buffer --batch <obj-1000'


Of course this isn't exactly an apples-to-apples comparison in the first
place, since the --batch one is doing a lot more. So "winning" with
objectsize:disk is not much of an accomplishment. A more interesting
comparison would be the same operation on a repo with your series,
versus one with the offset embedded in the .rev file, as the number of
objects grows.

But since we don't have that readily available, another interesting
comparison is stock git (with no .rev file) against your new .rev file,
with a cold cache.

At 1000 objects, the old code has a slight win, because it has less to
fault in from the disk (instead it's recreating the same data in RAM).
"git.compile" is your branch below; "git" is a stock build of "next":

  Benchmark #1: git cat-file --buffer --batch-check="%(objectsize:disk)" <obj-1000
    Time (mean ± σ):      1.483 s ±  0.260 s    [User: 148.9 ms, System: 301.2 ms]
    Range (min … max):    0.792 s …  1.725 s    10 runs
   
  Benchmark #2: git.compile cat-file --buffer --batch-check="%(objectsize:disk)" <obj-1000
    Time (mean ± σ):      1.820 s ±  0.138 s    [User: 27.7 ms, System: 399.3 ms]
    Range (min … max):    1.610 s …  2.012 s    10 runs
   
  Summary
    'git cat-file --buffer --batch-check="%(objectsize:disk)" <obj-1000' ran
      1.23 ± 0.23 times faster than 'git.compile cat-file --buffer --batch-check="%(objectsize:disk)" <obj-1000'

But that edge drops to 1.08x at 10,000 objects, and then at 100,000
objects your code is a win (by 1.16x). And of course it's a giant win
when the cache is already warm.

And in a cold cache, we'd expect a .rev file with offsets in it to be
much worse, since there are many more bytes to pull from the disk.

All of which is a really verbose way of saying: you might want to add a
few words after the comma:

  In real-world usage, Git is often performing many operations in the
  revindex (i.e., rather than asking about a single object, we'd
  generally ask about a range of history).

:) But hopefully it shows that including the offsets is not really
making things better for the cold cache anyway.

>  Documentation/technical/pack-format.txt |  17 ++++
>  builtin/repack.c                        |   1 +
>  object-store.h                          |   3 +
>  pack-revindex.c                         | 112 +++++++++++++++++++++---
>  pack-revindex.h                         |   7 +-
>  packfile.c                              |  13 ++-
>  packfile.h                              |   1 +
>  tmp-objdir.c                            |   4 +-
>  8 files changed, 145 insertions(+), 13 deletions(-)

Oh, there's a patch here, too. :)

It mostly looks good to me. I agree with Junio that "compute" is a
better verb than "load" for generating the in-memory revindex.

> +static int load_pack_revindex_from_disk(struct packed_git *p)
> +{
> +	char *revindex_name;
> +	int ret;
> +	if (open_pack_index(p))
> +		return -1;
> +
> +	revindex_name = pack_revindex_filename(p);
> +
> +	ret = load_revindex_from_disk(revindex_name,
> +				      p->num_objects,
> +				      &p->revindex_map,
> +				      &p->revindex_size);
> +	if (ret)
> +		goto cleanup;
> +
> +	p->revindex_data = (char *)p->revindex_map + 12;

Junio mentioned one spot where we lose constness through a cast. This
is another. I wonder if revindex_map should just be a "char *" to make
pointer arithmetic easier without having to cast.

But also...

> +	if (p->revindex)
> +		return p->revindex[pos].nr;
> +	else
> +		return get_be32((char *)p->revindex_data + (pos * sizeof(uint32_t)));

If p->revindex_data were "const uint32_t *", then this line would just
be:

  return get_be32(p->revindex_data + pos);

Not a huge deal either way since the whole point is to abstract this
behind a function where it only has to be written once. I don't think
there is any downside from the compiler's view (and we already use this
trick for the bitmap name-hash cache).
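
A small standalone sketch of what that typed-pointer variant could look
like follows. It is an illustration only: read_be32() stands in for
Git's get_be32(), and rev_entry() is a hypothetical name. The table
pointer is "const uint32_t *", so the pointer arithmetic needs no cast
and constness is preserved.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 4-byte big-endian integer without writing through p. */
static uint32_t read_be32(const void *p)
{
	const unsigned char *b = p;
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/*
 * Return the index position stored at pack position pos. The table
 * holds nr big-endian 32-bit entries; "table + pos" already advances
 * in 4-byte steps because of the pointer's type.
 */
static uint32_t rev_entry(const uint32_t *table, uint32_t nr, uint32_t pos)
{
	assert(pos < nr);
	return read_be32(table + pos);
}
```

The entries are still decoded byte by byte, so the result does not
depend on the byte order of the host.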

> diff --git a/packfile.c b/packfile.c
> index 7bb1750934..b04eac9286 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -324,11 +324,21 @@ void close_pack_index(struct packed_git *p)
>  	}
>  }
>  
> +void close_pack_revindex(struct packed_git *p) {
> +	if (!p->revindex_map)
> +		return;
> +
> +	munmap((void *)p->revindex_map, p->revindex_size);
> +	p->revindex_map = NULL;
> +	p->revindex_data = NULL;
> +}
> +
>  void close_pack(struct packed_git *p)
>  {
>  	close_pack_windows(p);
>  	close_pack_fd(p);
>  	close_pack_index(p);
> +	close_pack_revindex(p);
>  }

Thinking out loud a bit: a .rev file means we're spending an extra map
per pack (but not a descriptor, since we close after mmap). And like the
.idx files (but unlike .pack file maps), we don't keep track of these
and try to close them when under memory pressure. I think that's
probably OK in terms of bytes. It may mean running up against operating
system number-of-mmap limits more quickly when you have a very large
number of packs, as mentioned in:

  https://lore.kernel.org/git/20200601044511.GA2529317@coredump.intra.peff.net/

But this is probably bumping the number of problematic packs from 30k to
20k. Both are sufficiently ridiculous that I don't think it matters in
practice.

> diff --git a/tmp-objdir.c b/tmp-objdir.c
> index 42ed4db5d3..da414df14f 100644
> --- a/tmp-objdir.c
> +++ b/tmp-objdir.c
> @@ -187,7 +187,9 @@ static int pack_copy_priority(const char *name)
>  		return 2;
>  	if (ends_with(name, ".idx"))
>  		return 3;
> -	return 4;
> +	if (ends_with(name, ".rev"))
> +		return 4;
> +	return 5;
>  }

Probably not super important, but: should the .idx file still come last
here? Simultaneous readers won't start using the pack until the .idx
file is present. We'd probably prefer they see the whole thing
atomically, than see a .idx missing its .rev (they won't ever produce a
wrong answer, but they'll generate the in-core revindex on the fly when
they don't need to).

I guess one could argue that .bitmap files should get similar treatment,
but we'd not generally see those in the quarantine objdir anyway, so
nobody ever gave it much thought.

-Peff

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/8] pack-write.c: prepare to write 'pack-*.rev' files
  2021-01-13 22:28   ` [PATCH v2 2/8] pack-write.c: prepare to write 'pack-*.rev' files Taylor Blau
@ 2021-01-22 23:24     ` Jeff King
  0 siblings, 0 replies; 54+ messages in thread
From: Jeff King @ 2021-01-22 23:24 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jrnieder

On Wed, Jan 13, 2021 at 05:28:11PM -0500, Taylor Blau wrote:

> +static void write_rev_index_positions(struct hashfile *f,
> +				      struct pack_idx_entry **objects,
> +				      uint32_t nr_objects)
> +{
> +	uint32_t *pack_order;
> +	uint32_t i;
> +
> +	ALLOC_ARRAY(pack_order, nr_objects);
> +	for (i = 0; i < nr_objects; i++)
> +		pack_order[i] = i;
> +	QSORT_S(pack_order, nr_objects, pack_order_cmp, objects);

qsort? Don't we have a perfectly good radix sort for this exact purpose? :)

I guess it is awkward to use because it hard-codes the assumption that
we are sorting revindex_entry structs, with their offsets directly
available.

We _could_ actually just generate an in-memory revindex array, and then
use that to write out the .rev file. That has the nice property that
we'd continue exercising the fallback code in the tests.

But a revindex_entry is much larger than the uint32_t's we need here. So
we'd be incurring extra memory costs during the generation, which is
probably not worth it. If we want better test coverage, we probably
should explicitly run a series of tests both with and without a .rev
file.

It's possible we'd still benefit from using a more generalized radix
sort, but it probably isn't worth the trouble. The timings from
8b8dfd5132 (pack-revindex: radix-sort the revindex, 2013-07-11) claim a
4x speedup on sorting 3M entries (the size matters because we are
comparing an O(n) sort versus an O(n log n) one). We can imagine that at
10M entries for the current kernel, it might even be 8x. But the
absolute numbers are pretty small. The radix sort takes ~150ms for
linux.git on my machine.  At 8x, that's 1.2s. For a repack of the
kernel, that is mostly a drop in the bucket. It mattered a lot more when
many processes were doing it on the fly.

> +static int pack_order_cmp(const void *va, const void *vb, void *ctx)
> +{
> +	struct pack_idx_entry **objects = ctx;
> +
> +	off_t oa = objects[*(uint32_t*)va]->offset;
> +	off_t ob = objects[*(uint32_t*)vb]->offset;

Dereferencing a pointer to index another array always makes me nervous
that we may have a bounds problem with bogus data.

In this case we know it is OK because we filled the array ourselves with
in-bound numbers in write_rev_index_positions.
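
A standalone sketch of the same idea, for readers who want to see why
the indexing is safe. It is not the patch's code: it uses plain qsort()
with a file-scope context pointer in place of Git's QSORT_S (an
assumption, made only to stay within standard C), and struct entry and
build_pack_order() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct pack_idx_entry: only the offset matters. */
struct entry {
	uint64_t offset;
};

/* Context for the comparator; plain qsort() has no context argument. */
static const struct entry *sort_ctx;

static int pack_order_cmp(const void *va, const void *vb)
{
	uint64_t oa = sort_ctx[*(const uint32_t *)va].offset;
	uint64_t ob = sort_ctx[*(const uint32_t *)vb].offset;

	return (oa > ob) - (oa < ob);
}

/*
 * Return a newly allocated array of index positions, sorted by each
 * object's offset in the pack. Returns NULL if allocation fails.
 */
static uint32_t *build_pack_order(const struct entry *objects, uint32_t nr)
{
	uint32_t *pack_order;
	uint32_t i;

	assert(objects != NULL);
	pack_order = malloc((size_t)nr * sizeof(*pack_order));
	if (!pack_order)
		return NULL;

	/* Every value stored is < nr, so the comparator stays in bounds. */
	for (i = 0; i < nr; i++)
		pack_order[i] = i;

	sort_ctx = objects;
	qsort(pack_order, nr, sizeof(*pack_order), pack_order_cmp);
	return pack_order;
}
```

The array being sorted only ever holds the values 0..nr-1 that the loop
wrote into it, which is the property the comparator relies on.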

> +#define RIDX_SIGNATURE 0x52494458 /* "RIDX" */
> +#define RIDX_VERSION 1

I was surprised we didn't define these already on the reading side, but
it looks like we didn't. In patch 1, we probably should be checking
RIDX_SIGNATURE in load_revindex_from_disk(). But much more importantly,
we should be checking that we find version 1, since that's what will
make it safe to later invent a version 2.

> +static void write_rev_header(struct hashfile *f)
> +{
> +	uint32_t oid_version;
> +	switch (hash_algo_by_ptr(the_hash_algo)) {
> +	case GIT_HASH_SHA1:
> +		oid_version = 1;
> +		break;
> +	case GIT_HASH_SHA256:
> +		oid_version = 2;
> +		break;
> +	default:
> +		die("write_rev_header: unknown hash version");
> +	}

I forgot to comment on this in patch 1, but: I think the format is
really independent of the hash size. The contents are identical for a
sha-1 versus sha-256 file.

That said, I don't overly mind having a hash identifier if it might help
debug things (OTOH, how the heck do you end up with one that matches the
trailer's packfile but _doesn't_ match the trailer's contents?).

If we do have it, should we also be checking it in the loading function?
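
To make the reading-side checks concrete, here is a rough sketch of
what validating the header could look like. It is not the code from
patch 1: check_rev_header() is a hypothetical helper, get_be32() is a
local stand-in for Git's, and the layout (three 4-byte big-endian
fields in the order magic, version, hash id) is an assumption based on
the 12-byte header described in the cover letter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RIDX_SIGNATURE 0x52494458 /* "RIDX" */
#define RIDX_VERSION 1
#define RIDX_HEADER_SIZE 12 /* assumed: magic, version, hash id */

/* Local stand-in for Git's get_be32(): read a big-endian 32-bit value. */
static uint32_t get_be32(const void *ptr)
{
	const unsigned char *p = ptr;
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: 0 if the header is acceptable, -1 otherwise. */
static int check_rev_header(const unsigned char *map, size_t size)
{
	uint32_t hash_id;

	assert(map != NULL);
	if (size < RIDX_HEADER_SIZE)
		return -1; /* too small to hold a header */
	if (get_be32(map) != RIDX_SIGNATURE)
		return -1; /* not a .rev file */
	if (get_be32(map + 4) != RIDX_VERSION)
		return -1; /* unknown version; leaves room for a version 2 */

	/* 1 is SHA-1 and 2 is SHA-256, matching write_rev_header(). */
	hash_id = get_be32(map + 8);
	if (hash_id != 1 && hash_id != 2)
		return -1;
	return 0;
}
```

A real loader would presumably report which check failed rather than
returning a bare -1, and might also compare the hash id against the
repository's algorithm.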

> +const char *write_rev_file(const char *rev_name,
> +			   struct pack_idx_entry **objects,
> +			   uint32_t nr_objects,
> +			   const unsigned char *hash,
> +			   unsigned flags)
> +{
> +	struct hashfile *f;
> +	int fd;
> +
> +	if ((flags & WRITE_REV) && (flags & WRITE_REV_VERIFY))
> +		die(_("cannot both write and verify reverse index"));
> +
> +	if (flags & WRITE_REV) {
> +		if (!rev_name) {
> +			struct strbuf tmp_file = STRBUF_INIT;
> +			fd = odb_mkstemp(&tmp_file, "pack/tmp_rev_XXXXXX");
> +			rev_name = strbuf_detach(&tmp_file, NULL);
> +		} else {
> +			unlink(rev_name);
> +			fd = open(rev_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
> +			if (fd < 0)
> +				die_errno("unable to create '%s'", rev_name);
> +		}

So if the caller gave us a name, we force-overwrite it. That seemed
weird to me at first, but it makes sense; the atomic rename-into-place
writers will not be passing in the name. And this is exactly how
write_idx_file() works.

I wonder if we could factor out some of this repeated logic, but I
suspect it is mostly diminishing returns. Maybe this "open a pack file
for writing" could become a helper function, though.

> diff --git a/pack.h b/pack.h
> index 9fc0945ac9..30439e0784 100644
> --- a/pack.h
> +++ b/pack.h
> @@ -42,6 +42,8 @@ struct pack_idx_option {
>  	/* flag bits */
>  #define WRITE_IDX_VERIFY 01 /* verify only, do not write the idx file */
>  #define WRITE_IDX_STRICT 02
> +#define WRITE_REV 04
> +#define WRITE_REV_VERIFY 010

It is a little funny that the write-rev function has both WRITE_REV and
WRITE_REV_VERIFY (which we must be sure are not both provided), but we
do not need both for the IDX.

I thought maybe the reason is that we'd pass these flags to
write_idx_file() or similar, and it would need to know whether to write
just an idx, or both. That doesn't seem to happen in this patch, but
maybe it does in a future one...
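
As a small illustration of how the two bits interact, here is a sketch
that classifies a flags word. The #defines are copied from the hunk
above (note they are octal, so WRITE_REV_VERIFY is 8); enum rev_action
and rev_action_for() are hypothetical, and treating "neither bit set"
as a no-op is an assumption about write_rev_file(), not something this
hunk shows.

```c
#include <assert.h>

/* Flag bits as in the pack.h hunk (octal literals). */
#define WRITE_IDX_VERIFY 01
#define WRITE_IDX_STRICT 02
#define WRITE_REV 04
#define WRITE_REV_VERIFY 010

enum rev_action {
	REV_NOOP,    /* neither bit set: assumed to do nothing */
	REV_WRITE,   /* WRITE_REV only */
	REV_VERIFY,  /* WRITE_REV_VERIFY only */
	REV_INVALID  /* both set: write_rev_file() die()s in this case */
};

/* Hypothetical helper: decide what a write_rev_file()-like call would do. */
static enum rev_action rev_action_for(unsigned flags)
{
	/* The two bits must be distinct for the tests below to make sense. */
	assert((WRITE_REV & WRITE_REV_VERIFY) == 0);

	if ((flags & WRITE_REV) && (flags & WRITE_REV_VERIFY))
		return REV_INVALID;
	if (flags & WRITE_REV)
		return REV_WRITE;
	if (flags & WRITE_REV_VERIFY)
		return REV_VERIFY;
	return REV_NOOP;
}
```

Because the rev bits share a flags word with the IDX bits, a caller
that only cares about the .idx can pass the same word around without
triggering any .rev work.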

-Peff

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 3/8] builtin/index-pack.c: write reverse indexes
  2021-01-13 22:28   ` [PATCH v2 3/8] builtin/index-pack.c: write reverse indexes Taylor Blau
@ 2021-01-22 23:53     ` Jeff King
  2021-01-25 20:03       ` Taylor Blau
  0 siblings, 1 reply; 54+ messages in thread
From: Jeff King @ 2021-01-22 23:53 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jrnieder

On Wed, Jan 13, 2021 at 05:28:15PM -0500, Taylor Blau wrote:

>  OPTIONS
> @@ -33,7 +34,14 @@ OPTIONS
>  	file is constructed from the name of packed archive
>  	file by replacing .pack with .idx (and the program
>  	fails if the name of packed archive does not end
> -	with .pack).
> +	with .pack). Incompatible with `--rev-index`.

I wondered which option was incompatible, but couldn't see from the
context. It is "index-pack -o", which kind of makes sense. We can derive
"foo.rev" from "foo.idx", but normally "-o" does not do any deriving.

So I was all set to say "OK, we can live without it", but...

> @@ -1824,7 +1851,16 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
>  	if (from_stdin && hash_algo)
>  		die(_("--object-format cannot be used with --stdin"));
>  	if (!index_name && pack_name)
> -		index_name = derive_filename(pack_name, "idx", &index_name_buf);
> +		index_name = derive_filename(pack_name, ".pack", "idx", &index_name_buf);
> +
> +	opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY);
> +	if (rev_index) {
> +		opts.flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
> +		if (index_name)
> +			rev_index_name = derive_filename(index_name,
> +							 ".idx", "rev",
> +							 &rev_index_name_buf);
> +	}

...here we do end up deriving ".rev" from ".idx" anyway. So I guess we
probably could support "-o".  I also wonder what happens with "git
index-pack -o foo.idx" when pack.writeReverseIndex is set. It looks like
it would just work because of this block. But then shouldn't
"--rev-index" work, too? And indeed, there is a test for that at the end
of the patch! So is the documentation just wrong?

I admit to finding the use of opts.flags versus the rev_index option a
bit confusing. It seems like they are doing roughly the same thing, but
influenced by different sources. It seems like we should be able to have
a single local variable (then that goes on to set opts.flags for any
sub-functions we call). Or maybe two, if we need to distinguish config
versus command-line, but then they should have clear names
(rev_index_config and rev_index_cmdline or something).

As an aside, looking at derive_filename(), it seems a bit weird that one
argument has a dot in the suffix and the other does not. I guess you are
following the convention from write_special_file(), which omits it in
the newly-added suffix. But it is slightly awkward to omit it for the
old suffix in derive_filename(), because we want to strip_suffix() it
all at once.

Probably not that big a deal, but if anybody feels strongly, then
derive_filename() could do:

  if (!strip_suffix(pack_name, strip, &len) ||
      !len || pack_name[len] != '.')
	die("does not end in .%s", strip);

> @@ -1578,6 +1591,12 @@ static int git_index_pack_config(const char *k, const char *v, void *cb)
>  		}
>  		return 0;
>  	}
> +	if (!strcmp(k, "pack.writereverseindex")) {
> +		if (git_config_bool(k, v))
> +			opts->flags |= WRITE_REV;
> +		else
> +			opts->flags &= ~WRITE_REV;
> +	}
>  	return git_default_config(k, v, cb);
>  }

IMHO we'll eventually want to turn this feature on by default. In which
case we'll have to update every caller which is checking the config
manually. Should we hide this in a function that looks up the config,
and sets the default? Or alternatively, I guess, they could all use some
shared initializer for "flags".

> +	# Intentionally corrupt the reverse index.
> +	chmod u+w $rev &&
> +	printf "xxxx" | dd of=$rev bs=1 count=4 conv=notrunc &&
> +
> +	test_must_fail git index-pack --rev-index --verify \
> +		$packdir/pack-$pack.pack 2>err &&
> +	grep "validation error" err
> +'

This isn't that subtle of a corruption, because we are corrupting the
first 4 bytes, which is the magic signature. Maybe something further in
the actual data would be interesting instead of or in addition?

I dunno. There are a lot of edge cases around corruption (likewise, we
might care how the normal reading code-path perceives a signature
corruption like this). I'm not sure it's all that interesting to test
all of them.

> +test_expect_success 'index-pack infers reverse index name with -o' '
> +	git index-pack --rev-index -o other.idx $packdir/pack-$pack.pack &&
> +	test_path_is_file other.idx &&
> +	test_path_is_file other.rev
> +'

Hey, we _do_ support "--rev-index -o". Is it just the documentation that
is wrong?

-Peff

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex'
  2021-01-13 22:28   ` [PATCH v2 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex' Taylor Blau
@ 2021-01-22 23:57     ` Jeff King
  2021-01-23  0:08       ` Jeff King
  0 siblings, 1 reply; 54+ messages in thread
From: Jeff King @ 2021-01-22 23:57 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jrnieder

On Wed, Jan 13, 2021 at 05:28:19PM -0500, Taylor Blau wrote:

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 5b0c4489e2..d784569200 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -2955,6 +2955,13 @@ static int git_pack_config(const char *k, const char *v, void *cb)
>  			    pack_idx_opts.version);
>  		return 0;
>  	}
> +	if (!strcmp(k, "pack.writereverseindex")) {
> +		if (git_config_bool(k, v))
> +			pack_idx_opts.flags |= WRITE_REV;
> +		else
> +			pack_idx_opts.flags &= ~WRITE_REV;
> +		return 0;
> +	}

This turned out delightfully simple. And I guess this is the "why is
WRITE_REV" caller I asked about from patch 2. It is
finish_tmp_packfile() where the magic happens. That unconditionally
calls write_rev_file(), but it's a noop if WRITE_REV isn't specified.

Makes sense.

-Peff

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex'
  2021-01-22 23:57     ` Jeff King
@ 2021-01-23  0:08       ` Jeff King
  2021-01-25 20:21         ` Taylor Blau
  0 siblings, 1 reply; 54+ messages in thread
From: Jeff King @ 2021-01-23  0:08 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jrnieder

On Fri, Jan 22, 2021 at 06:57:35PM -0500, Jeff King wrote:

> > +	if (!strcmp(k, "pack.writereverseindex")) {
> > +		if (git_config_bool(k, v))
> > +			pack_idx_opts.flags |= WRITE_REV;
> > +		else
> > +			pack_idx_opts.flags &= ~WRITE_REV;
> > +		return 0;
> > +	}
> 
> This turned out delightfully simple. And I guess this is the "why is
> WRITE_REV" caller I asked about from patch 2. It is
> finish_tmp_packfile() where the magic happens. That unconditionally
> calls write_rev_file(), but it's a noop if WRITE_REV isn't specified.
> 
> Makes sense.

Oh, one subtlety here: this is in pack-objects itself, _not_ in
git-repack. This has bit us before with options like
repack.writebitmaps, which was originally pack.writebitmaps and
introduced all sorts of awkwardness (because pack-objects serves many
other purposes besides repacks).

I think this _might_ be OK, because we wouldn't even hit the code-paths
that handle this unless we are also writing a .idx (and really, the
point is that the two should always be written together or not at all).

So probably it's fine, but I wonder if we should err on the side of
conservatism by saying that pack-objects will not change behavior
without receiving a new command-line option, and that repack should
trigger that option. I dunno. I guess that it makes it weird with
respect to index-pack, which wants to see a very low-level option, too.
So maybe this is the best way forward.

(Sorry for being non-committal; I'm mostly thinking out loud).

-Peff

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files
  2021-01-22 22:54     ` Jeff King
@ 2021-01-25 17:44       ` Taylor Blau
  2021-01-25 18:27         ` Jeff King
  2021-01-25 19:04         ` Junio C Hamano
  0 siblings, 2 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-25 17:44 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jrnieder

On Fri, Jan 22, 2021 at 05:54:18PM -0500, Jeff King wrote:
> All of which is a really verbose way of saying: you might want to add a
> few words after the comma:
>
>   In real-world usage, Git is often performing many operations in the
>   revindex (i.e., rather than asking about a single object, we'd
>   generally ask about a range of history).
>
> :) But hopefully it shows that including the offsets is not really
> making things better for the cold cache anyway.

Thanks for including a compelling argument in favor of the approach that
I took in this patch.

I added something along the lines of what you suggested to the final
paragraph, so now it concludes nicely instead of ending in a comma. I
briefly considered whether I should add something about how these
operations scale and how the warming efforts are really amortized across
all of the objects, but I decided against it.

I think that this argument is already documented here, and that there's
no way to concisely state it in an already long patch. Interested
readers will easily be able to find our discussion here, which is good.

> >  Documentation/technical/pack-format.txt |  17 ++++
> >  builtin/repack.c                        |   1 +
> >  object-store.h                          |   3 +
> >  pack-revindex.c                         | 112 +++++++++++++++++++++---
> >  pack-revindex.h                         |   7 +-
> >  packfile.c                              |  13 ++-
> >  packfile.h                              |   1 +
> >  tmp-objdir.c                            |   4 +-
> >  8 files changed, 145 insertions(+), 13 deletions(-)
>
> Oh, there's a patch here, too. :)

:-).

> It mostly looks good to me. I agree with Junio that "compute" is a
> better verb than "load" for generating the in-memory revindex.

Yeah, I settled on load_pack_revindex() either calling
"create_pack_revindex_in_memory()" or "load_pack_revindex_from_disk()".

> > +static int load_pack_revindex_from_disk(struct packed_git *p)
> > +{
> > +	char *revindex_name;
> > +	int ret;
> > +	if (open_pack_index(p))
> > +		return -1;
> > +
> > +	revindex_name = pack_revindex_filename(p);
> > +
> > +	ret = load_revindex_from_disk(revindex_name,
> > +				      p->num_objects,
> > +				      &p->revindex_map,
> > +				      &p->revindex_size);
> > +	if (ret)
> > +		goto cleanup;
> > +
> > +	p->revindex_data = (char *)p->revindex_map + 12;
>
> > Junio mentioned one spot where we lose constness through a cast. This
> is another. I wonder if revindex_map should just be a "char *" to make
> pointer arithmetic easier without having to cast.
>
> But also...
>
> > +	if (p->revindex)
> > +		return p->revindex[pos].nr;
> > +	else
> > +		return get_be32((char *)p->revindex_data + (pos * sizeof(uint32_t)));
>
> If p->revindex_data were "const uint32_t *", then this line would just
> be:
>
>   return get_be32(p->revindex_data + pos);
>
> Not a huge deal either way since the whole point is to abstract this
> behind a function where it only has to be written once. I don't think
> there is any downside from the compiler's view (and we already use this
> trick for the bitmap name-hash cache).

Honestly, I'm not a huge fan of implicitly scaling pos by
sizeof(*p->revindex_data), but I can understand why it reads more
clearly here. I don't really feel strongly either way, so I'm happy to
change it in favor of your suggestion.

Of course, since RIDX_HEADER_SIZE is in bytes, not uint32_t's (and it
has to be, since it's also used in the RIDX_MIN_SIZE macro, which is
compared against the st_size of stating the .rev file), you have to do
gross stuff like:

  p->revindex_data = (const uint32_t *)((const char *)p->revindex_map + RIDX_HEADER_SIZE);

But I guess the tradeoff is worth it, since the readers are easier to
parse.

> Thinking out loud a bit: a .rev file means we're spending an extra map
> per pack (but not a descriptor, since we close after mmap). And like the
> .idx files (but unlike .pack file maps), we don't keep track of these
> and try to close them when under memory pressure. I think that's
> probably OK in terms of bytes. It may mean running up against operating
> system number-of-mmap limits more quickly when you have a very large
> number of packs, as mentioned in:
>
>   https://lore.kernel.org/git/20200601044511.GA2529317@coredump.intra.peff.net/
>
> But this is probably bumping the number of problematic packs from 30k to
> 20k. Both are sufficiently ridiculous that I don't think it matters in
> practice.

Agreed.

> > diff --git a/tmp-objdir.c b/tmp-objdir.c
> > index 42ed4db5d3..da414df14f 100644
> > --- a/tmp-objdir.c
> > +++ b/tmp-objdir.c
> > @@ -187,7 +187,9 @@ static int pack_copy_priority(const char *name)
> >  		return 2;
> >  	if (ends_with(name, ".idx"))
> >  		return 3;
> > -	return 4;
> > +	if (ends_with(name, ".rev"))
> > +		return 4;
> > +	return 5;
> >  }
>
> Probably not super important, but: should the .idx file still come last
> here? Simultaneous readers won't start using the pack until the .idx
> file is present. We'd probably prefer they see the whole thing
> atomically, than see a .idx missing its .rev (they won't ever produce a
> wrong answer, but they'll generate the in-core revindex on the fly when
> they don't need to).
>
> I guess one could argue that .bitmap files should get similar treatment,
> but we'd not generally see those in the quarantine objdir anyway, so
> nobody ever gave it much thought.

Yeah, you're right: .idx files should come last. There is probably an
argument for including .bitmap files here, too, but I'll leave that as
#leftoverbits.
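
Roughly, the reordering could look like the sketch below. This is not
the final patch: the branches above ".idx" are elided in the hunk, so
the ".pack" case shown here is an assumption, ends_with() is a local
stand-in for Git's helper, and the sketch assumes that lower values are
migrated first.

```c
#include <assert.h>
#include <string.h>

/* Local stand-in for Git's ends_with(). */
static int ends_with(const char *str, const char *suffix)
{
	size_t sl = strlen(str), xl = strlen(suffix);

	return sl >= xl && !strcmp(str + sl - xl, suffix);
}

/*
 * Sketch: copy the .rev before the .idx, so that a reader who sees
 * the .idx also finds its .rev already in place.
 */
static int pack_copy_priority(const char *name)
{
	assert(name != NULL);
	if (ends_with(name, ".pack"))
		return 2; /* assumed; this branch is not visible in the hunk */
	if (ends_with(name, ".rev"))
		return 3;
	if (ends_with(name, ".idx"))
		return 4;
	return 5;
}
```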

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files
  2021-01-25 17:44       ` Taylor Blau
@ 2021-01-25 18:27         ` Jeff King
  2021-01-25 19:04         ` Junio C Hamano
  1 sibling, 0 replies; 54+ messages in thread
From: Jeff King @ 2021-01-25 18:27 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jrnieder

On Mon, Jan 25, 2021 at 12:44:53PM -0500, Taylor Blau wrote:

> Thanks for including a compelling argument in favor of the approach that
> I took in this patch.
> 
> I added something along the lines of what you suggested to the final
> paragraph, so now it concludes nicely instead of ending in a comma. I
> briefly considered whether I should add something about how these
> operations scale and how the warming efforts are really amortized across
> all of the objects, but I decided against it.
> 
> I think that this argument is already documented here, and that there's
> no way to concisely state it in an already long patch. Interested
> readers will easily be able to find our discussion here, which is good.

That sounds good. It is sort of arguing against a strawman anyway.

> > It mostly looks good to me. I agree with Junio that "compute" is a
> > better verb than "load" for generating the in-memory revindex.
> 
> Yeah, I settled on load_pack_revindex() either calling
> "create_pack_revindex_in_memory()" or "load_pack_revindex_from_disk()".

Perfect.

> > If p->revindex_data were "const uint32_t *", then this line would just
> > be:
> >
> >   return get_be32(p->revindex_data + pos);
> >
> > Not a huge deal either way since the whole point is to abstract this
> > behind a function where it only has to be written once. I don't think
> > there is any downside from the compiler's view (and we already use this
> > trick for the bitmap name-hash cache).
> 
> Honestly, I'm not a huge fan of implicitly scaling pos by
> sizeof(*p->revindex_data), but I can understand why it reads more
> clearly here. I don't really feel strongly either way, so I'm happy to
> change it in favor of your suggestion.
> 
> Of course, since RIDX_HEADER_SIZE is in bytes, not uint32_t's (and it
> has to be, since it's also used in the RIDX_MIN_SIZE macro, which is
> compared against the st_size of stating the .rev file), you have to do
> gross stuff like:
> 
>   p->revindex_data = (const uint32_t *)((const char *)p->revindex_map + RIDX_HEADER_SIZE);
> 
> But I guess the tradeoff is worth it, since the readers are easier to
> parse.

Yeah, that is definitely a downside. Perhaps keeping everything in bytes
makes things a bit more obvious. In which case I might suggest that
revindex_data just be a "const char *". You'd have to scale any pointer
computations at the point of use then, but you'd avoid needing to do any
extra casting.

-Peff

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files
  2021-01-25 17:44       ` Taylor Blau
  2021-01-25 18:27         ` Jeff King
@ 2021-01-25 19:04         ` Junio C Hamano
  1 sibling, 0 replies; 54+ messages in thread
From: Junio C Hamano @ 2021-01-25 19:04 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jeff King, git, dstolee, jrnieder

Taylor Blau <me@ttaylorr.com> writes:

>> Thinking out loud a bit: a .rev file means we're spending an extra map
>> per pack (but not a descriptor, since we close after mmap). And like the
>> .idx files (but unlike .pack file maps), we don't keep track of these
>> and try to close them when under memory pressure. I think that's
>> probably OK in terms of bytes. It may mean running up against operating
>> system number-of-mmap limits more quickly ...
>> ...
>> >  	if (ends_with(name, ".idx"))
>> >  		return 3;
>> > -	return 4;
>> > +	if (ends_with(name, ".rev"))
>> > +		return 4;
>> > +	return 5;
>> >  }
>>
>> Probably not super important, but: should the .idx file still come last
>> here? Simultaneous readers won't start using the pack until the .idx
>> file is present. We'd probably prefer they see the whole thing
>> atomically, than see a .idx missing its .rev (they won't ever produce a
>> wrong answer, but they'll generate the in-core revindex on the fly when
>> they don't need to).

At some point, we may want to 

 - introduce .idx version 3 that is more extensible, so that the
   reverse info is included in one of its chunks;

 - make the .rev data for all packs stored as a chunk in .midx, so
   we can first check with .midx and not open any .rev files.

either of which would reduce the number from 30k down to 10k ;-)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 3/8] builtin/index-pack.c: write reverse indexes
  2021-01-22 23:53     ` Jeff King
@ 2021-01-25 20:03       ` Taylor Blau
  0 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-25 20:03 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jrnieder

On Fri, Jan 22, 2021 at 06:53:11PM -0500, Jeff King wrote:
> On Wed, Jan 13, 2021 at 05:28:15PM -0500, Taylor Blau wrote:
>
> >  OPTIONS
> > @@ -33,7 +34,14 @@ OPTIONS
> >  	file is constructed from the name of packed archive
> >  	file by replacing .pack with .idx (and the program
> >  	fails if the name of packed archive does not end
> > -	with .pack).
> > +	with .pack). Incompatible with `--rev-index`.
>
> I wondered which option was incompatible, but couldn't see from the
> context. It is "index-pack -o", which kind of makes sense. We can derive
> "foo.rev" from "foo.idx", but normally "-o" does not do any deriving.
>
> So I was all set to say "OK, we can live without it", but...
>
> > @@ -1824,7 +1851,16 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
> >  	if (from_stdin && hash_algo)
> >  		die(_("--object-format cannot be used with --stdin"));
> >  	if (!index_name && pack_name)
> > -		index_name = derive_filename(pack_name, "idx", &index_name_buf);
> > +		index_name = derive_filename(pack_name, ".pack", "idx", &index_name_buf);
> > +
> > +	opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY);
> > +	if (rev_index) {
> > +		opts.flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
> > +		if (index_name)
> > +			rev_index_name = derive_filename(index_name,
> > +							 ".idx", "rev",
> > +							 &rev_index_name_buf);
> > +	}
>
> ...here we do end up deriving ".rev" from ".idx" anyway. So I guess we
> probably could support "-o".  I also wonder what happens with "git
> index-pack -o foo.idx" when pack.writeReverseIndex is set. It looks like
> it would just work because of this block. But then shouldn't
> "--rev-index" work, too? And indeed, there is a test for that at the end
> of the patch! So is the documentation just wrong?

Hah! The documentation is just plain wrong. It's been a while, but I
have a vague recollection of writing this documentation before changing
the implementation of index-pack to allow this. Clearly, I forgot to go
back to update the broken documentation.

Hilariously, there is even a test in t5325 that demonstrates this
working! 'git index-pack --rev-index -o other.idx' writes both
'other.idx' and 'other.rev'. That was easy :-).

> I admit to finding the use of opts.flags versus the rev_index option a
> bit confusing. It seems like they are doing roughly the same thing, but
> influenced by different sources. It seems like we should be able to have
> a single local variable (then that goes on to set opts.flags for any
> sub-functions we call). Or maybe two, if we need to distinguish config
> versus command-line, but then they should have clear names
> (rev_index_config and rev_index_cmdline or something).

Yeah, I know. It's because we already pass a pointer to a struct
pack_idx_option to git_index_pack_config(), so in effect the 'flags' on
that struct *is* rev_index_config.

It's a little ugly, I agree, but I'm skeptical that the effort to clean
it up is worth it, mostly because the pack_idx_option struct probably
shouldn't be part of the index-pack builtin in the first place.

> As an aside, looking at derive_filename(), it seems a bit weird that one
> argument has a dot in the suffix and the other does not. I guess you are
> following the convention from write_special_file(), which omits it in
> the newly-added suffix. But it is slightly awkward to omit it for the
> old suffix in derive_filename(), because we want to strip_suffix() it
> all at once.

> Probably not that big a deal, but if anybody feels strongly, then
> derive_filename() could do:
>
>   if (!strip_suffix(pack_name, strip, &len) ||
>       !len || pack_name[len] != '.')
> 	die("does not end in .%s", strip);

That does make the callers look nicer, but it needs an extra two things:

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index ef2874a8e6..c758f3b8e9 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1441,11 +1441,10 @@ static const char *derive_filename(const char *pack_name, const char *strip,
 {
        size_t len;
        if (!strip_suffix(pack_name, strip, &len) || !len ||
-           pack_name[len] != '.')
+           pack_name[len - 1] != '.')
                die(_("packfile name '%s' does not end with '.%s'"),
                    pack_name, strip);
        strbuf_add(buf, pack_name, len);
-       strbuf_addch(buf, '.');
        strbuf_addstr(buf, suffix);
        return buf->buf;
 }

And then it does what you are looking for. I'll pull that change out
with your Suggested-by as a preparatory commit right before this one.
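
To spell out why both tweaks are needed, here is a standalone sketch of
the resulting logic. It is only an approximation: strip_suffix() is a
local stand-in for Git's helper, and this derive_filename() writes into
a caller-supplied buffer and returns -1 on error, where the real one
uses a strbuf and die()s. With strip given without its dot, the first
"len" bytes of the name already end in '.', so the check looks at
pack_name[len - 1] and no extra '.' is appended.

```c
#include <assert.h>
#include <string.h>

/*
 * Local stand-in for Git's strip_suffix(): if str ends with suffix,
 * store the length of str without it in *len and return 1.
 */
static int strip_suffix(const char *str, const char *suffix, size_t *len)
{
	size_t sl = strlen(str), xl = strlen(suffix);

	if (sl < xl || memcmp(str + sl - xl, suffix, xl))
		return 0;
	*len = sl - xl;
	return 1;
}

/*
 * Sketch: replace a trailing ".<strip>" with ".<suffix>". Returns 0 on
 * success, or -1 if the name does not end in ".<strip>" or the result
 * does not fit in out.
 */
static int derive_filename(const char *pack_name, const char *strip,
			   const char *suffix, char *out, size_t outsz)
{
	size_t len;

	assert(out != NULL);
	if (!strip_suffix(pack_name, strip, &len) || !len ||
	    pack_name[len - 1] != '.')
		return -1;
	if (len + strlen(suffix) + 1 > outsz)
		return -1;

	/* pack_name[0..len) already ends in '.', so just add the suffix. */
	memcpy(out, pack_name, len);
	strcpy(out + len, suffix);
	return 0;
}
```

Callers would then read derive_filename(pack_name, "pack", "idx", ...)
and derive_filename(index_name, "idx", "rev", ...), with neither suffix
carrying a dot.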

> > @@ -1578,6 +1591,12 @@ static int git_index_pack_config(const char *k, const char *v, void *cb)
> >  		}
> >  		return 0;
> >  	}
> > +	if (!strcmp(k, "pack.writereverseindex")) {
> > +		if (git_config_bool(k, v))
> > +			opts->flags |= WRITE_REV;
> > +		else
> > +			opts->flags &= ~WRITE_REV;
> > +	}
> >  	return git_default_config(k, v, cb);
> >  }
>
> IMHO we'll eventually want to turn this feature on by default. In which
> case we'll have to update every caller which is checking the config
> manually. Should we hide this in a function that looks up the config,
> and sets the default? Or alternatively, I guess, they could all use some
> shared initializer for "flags".

Note that there are only two such callers, so I'm not sure the effort to
extract this would be worth it.

> > +	# Intentionally corrupt the reverse index.
> > +	chmod u+w $rev &&
> > +	printf "xxxx" | dd of=$rev bs=1 count=4 conv=notrunc &&
> > +
> > +	test_must_fail git index-pack --rev-index --verify \
> > +		$packdir/pack-$pack.pack 2>err &&
> > +	grep "validation error" err
> > +'
>
> This isn't that subtle of a corruption, because we are corrupting the
> first 4 bytes, which is the magic signature. Maybe something further in
> the actual data would be interesting instead of or in addition?
>
> I dunno. There are a lot of edge cases around corruption (likewise, we
> might care how the normal reading code-path perceives a signature
> corruption like this). I'm not sure it's all that interesting to test
> all of them.

Agreed, I think what is here (even though it's not a severe corruption)
would be sufficient to make me feel good about our error handling in
general.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex'
  2021-01-23  0:08       ` Jeff King
@ 2021-01-25 20:21         ` Taylor Blau
  2021-01-25 20:50           ` Jeff King
  0 siblings, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2021-01-25 20:21 UTC (permalink / raw)
  To: Jeff King; +Cc: git, dstolee, gitster, jrnieder

On Fri, Jan 22, 2021 at 07:08:30PM -0500, Jeff King wrote:
> On Fri, Jan 22, 2021 at 06:57:35PM -0500, Jeff King wrote:
>
> > > +	if (!strcmp(k, "pack.writereverseindex")) {
> > > +		if (git_config_bool(k, v))
> > > +			pack_idx_opts.flags |= WRITE_REV;
> > > +		else
> > > +			pack_idx_opts.flags &= ~WRITE_REV;
> > > +		return 0;
> > > +	}
> >
> > This turned out delightfully simple. And I guess this is the "why is
> > WRITE_REV" caller I asked about from patch 2. It is
> > finish_tmp_packfile() where the magic happens. That unconditionally
> > calls write_rev_file(), but it's a noop if WRITE_REV isn't specified.
> >
> > Makes sense.
>
> Oh, one subtlety here: this is in pack-objects itself, _not_ in
> git-repack. This has bit us before with options like
> repack.writebitmaps, which was originally pack.writebitmaps and
> introduced all sorts of awkwardness (because pack-objects serves many
> other purposes besides repacks).

I'd think that we'd want a single option to control whether or not
reverse indexes are written to disk. I briefly considered (and I believe
that you and I even discussed) having options to control this behavior
per command, but it got out of hand quickly.

And that might have been OK, but I don't think the complexity was even
warranted, because really on-disk reverse indexes are an all-or-nothing
thing. That is: either you want to have revindexes accompanying .packs,
or you don't. IOW, it doesn't matter whether those packs were pushed to
us, or generated during repack, or from another pack-objects or what
have you.

Thanks,
Taylor

* Re: [PATCH v2 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex'
  2021-01-25 20:21         ` Taylor Blau
@ 2021-01-25 20:50           ` Jeff King
  0 siblings, 0 replies; 54+ messages in thread
From: Jeff King @ 2021-01-25 20:50 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, jrnieder

On Mon, Jan 25, 2021 at 03:21:17PM -0500, Taylor Blau wrote:

> > Oh, one subtlety here: this is in pack-objects itself, _not_ in
> > git-repack. This has bit us before with options like
> > repack.writebitmaps, which was originally pack.writebitmaps and
> > introduced all sorts of awkwardness (because pack-objects serves many
> > other purposes besides repacks).
> 
> I'd think that we'd want a single option to control whether or not
> reverse indexes are written to disk. I briefly considered (and I believe
> that you and I even discussed) having options to control this behavior
> per command, but it got out of hand quickly.

Yeah, I think per-command is something we should avoid.

My concern is mostly that pack-objects in a non-repack setting would
accidentally try to write a .rev file (or complain that it cannot). But
again, we already have the same potential issue for .idx files, so by
following those code paths as you did, we should be OK.

And certainly we test that code path a lot in the normal test suite
(i.e., every fetch or push), so your run-the-test-suite-with-rev-files
patches from later would presumably catch it otherwise.

> And that might have been OK, but I don't think the complexity was even
> warranted, because really on-disk reverse indexes are an all-or-nothing
> thing. That is: either you want to have revindexes accompanying .packs,
> or you don't. IOW, it doesn't matter whether those packs were pushed to
> us, or generated during repack, or from another pack-objects or what
> have you.

Right, I agree with all of that.

-Peff

* [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format
  2021-01-08 18:19 [PATCH 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                   ` (8 preceding siblings ...)
  2021-01-13 22:28 ` [PATCH v2 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
@ 2021-01-25 23:37 ` Taylor Blau
  2021-01-25 23:37   ` [PATCH v3 01/10] packfile: prepare for the existence of '*.rev' files Taylor Blau
                     ` (9 more replies)
  9 siblings, 10 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-25 23:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Hi,

Here is a third reroll of my series to introduce an on-disk format for the
reverse index. Since the first series (to introduce a new API) has been merged
to 'next', this series has been rebased onto 'next', too.

This version is largely unchanged from the previous, with the following
exceptions:

  - Some commit messages and documentation have been clarified to address
    reviewer questions (including why we don't want to store offsets in the .rev
    file, and that all four-byte numbers are in network order, etc.).

  - A few more things are done on opening the revindex file, namely checking
    that the version/hash ID are known.

  - The idiom "load in memory" has been removed.

  - Other minor changes.

A range diff is below.

Thanks in advance for your review.

Taylor Blau (10):
  packfile: prepare for the existence of '*.rev' files
  pack-write.c: prepare to write 'pack-*.rev' files
  builtin/index-pack.c: allow stripping arbitrary extensions
  builtin/index-pack.c: write reverse indexes
  builtin/pack-objects.c: respect 'pack.writeReverseIndex'
  Documentation/config/pack.txt: advertise 'pack.writeReverseIndex'
  t: prepare for GIT_TEST_WRITE_REV_INDEX
  t: support GIT_TEST_WRITE_REV_INDEX
  pack-revindex: ensure that on-disk reverse indexes are given
    precedence
  t5325: check both on-disk and in-memory reverse index

 Documentation/config/pack.txt           |   7 ++
 Documentation/git-index-pack.txt        |  18 ++-
 Documentation/technical/pack-format.txt |  20 ++++
 builtin/index-pack.c                    |  69 +++++++++--
 builtin/pack-objects.c                  |   9 ++
 builtin/repack.c                        |   1 +
 ci/run-build-and-tests.sh               |   1 +
 object-store.h                          |   3 +
 pack-revindex.c                         | 148 ++++++++++++++++++++++--
 pack-revindex.h                         |  14 ++-
 pack-write.c                            | 120 ++++++++++++++++++-
 pack.h                                  |   4 +
 packfile.c                              |  13 ++-
 packfile.h                              |   1 +
 t/README                                |   3 +
 t/t5319-multi-pack-index.sh             |   5 +-
 t/t5325-reverse-index.sh                | 142 +++++++++++++++++++++++
 t/t5604-clone-reference.sh              |   2 +-
 t/t5702-protocol-v2.sh                  |  12 +-
 t/t6500-gc.sh                           |   6 +-
 t/t9300-fast-import.sh                  |   5 +-
 tmp-objdir.c                            |   6 +-
 22 files changed, 567 insertions(+), 42 deletions(-)
 create mode 100755 t/t5325-reverse-index.sh

Range-diff against v2:
21:  6742c15c84 !  1:  6f8b70ab27 packfile: prepare for the existence of '*.rev' files
    @@ Commit message
         is versioned, so we are free to pursue these in the future.) But, cold
         cache performance likely isn't interesting outside of one-off cases like
         asking for the size of an object directly. In real-world usage, Git is
    -    often performing many operations in the revindex,
    +    often performing many operations in the revindex (i.e., asking about
    +    many objects rather than a single one).

         The trade-off is worth it, since we will avoid the vast majority of the
         cost of generating the revindex that the extra pointer chase will look
    @@ Documentation/technical/pack-format.txt: Pack file entry: <+
     +
     +  - A 4-byte magic number '0x52494458' ('RIDX').
     +
    -+  - A 4-byte version identifier (= 1)
    ++  - A 4-byte version identifier (= 1).
     +
    -+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256)
    ++  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
     +
    -+  - A table of index positions, sorted by their corresponding offsets in the
    -+    packfile.
    ++  - A table of index positions (one per packed object, num_objects in
    ++    total, each a 4-byte unsigned integer in network order), sorted by
    ++    their corresponding offsets in the packfile.
     +
     +  - A trailer, containing a:
     +
     +    checksum of the corresponding packfile, and
     +
     +    a checksum of all of the above.
    ++
    ++All 4-byte numbers are in network order.
     +
      == multi-pack-index (MIDX) files have the following format:

    @@ object-store.h: struct packed_git {
      		 multi_pack_index:1;
      	unsigned char hash[GIT_MAX_RAWSZ];
      	struct revindex_entry *revindex;
    -+	const void *revindex_data;
    -+	const void *revindex_map;
    ++	const uint32_t *revindex_data;
    ++	const uint32_t *revindex_map;
     +	size_t revindex_size;
      	/* something like ".git/objects/pack/xxxxx.pack" */
      	char pack_name[FLEX_ARRAY]; /* more */
    @@ pack-revindex.c: static void create_pack_revindex(struct packed_git *p)
      }

     -int load_pack_revindex(struct packed_git *p)
    -+static int load_pack_revindex_from_memory(struct packed_git *p)
    ++static int create_pack_revindex_in_memory(struct packed_git *p)
      {
     -	if (!p->revindex) {
     -		if (open_pack_index(p))
    @@ pack-revindex.c: static void create_pack_revindex(struct packed_git *p)
     +	return xstrfmt("%.*s.rev", (int)len, p->pack_name);
     +}
     +
    -+#define RIDX_MIN_SIZE (12 + (2 * the_hash_algo->rawsz))
    ++#define RIDX_HEADER_SIZE (12)
    ++#define RIDX_MIN_SIZE (RIDX_HEADER_SIZE + (2 * the_hash_algo->rawsz))
    ++
    ++struct revindex_header {
    ++	uint32_t signature;
    ++	uint32_t version;
    ++	uint32_t hash_id;
    ++};
     +
     +static int load_revindex_from_disk(char *revindex_name,
     +				   uint32_t num_objects,
    -+				   const void **data, size_t *len)
    ++				   const uint32_t **data_p, size_t *len_p)
     +{
     +	int fd, ret = 0;
     +	struct stat st;
    ++	void *data = NULL;
     +	size_t revindex_size;
    ++	struct revindex_header *hdr;
     +
     +	fd = git_open(revindex_name);
     +
    @@ pack-revindex.c: static void create_pack_revindex(struct packed_git *p)
     +		goto cleanup;
     +	}
     +
    -+	*len = revindex_size;
    -+	*data = xmmap(NULL, revindex_size, PROT_READ, MAP_PRIVATE, fd, 0);
    ++	data = xmmap(NULL, revindex_size, PROT_READ, MAP_PRIVATE, fd, 0);
    ++	hdr = data;
    ++
    ++	if (ntohl(hdr->signature) != RIDX_SIGNATURE) {
    ++		ret = error(_("reverse-index file %s has unknown signature"), revindex_name);
    ++		goto cleanup;
    ++	}
    ++	if (ntohl(hdr->version) != 1) {
    ++		ret = error(_("reverse-index file %s has unsupported version %"PRIu32),
    ++			    revindex_name, ntohl(hdr->version));
    ++		goto cleanup;
    ++	}
    ++	if (!(ntohl(hdr->hash_id) == 1 || ntohl(hdr->hash_id) == 2)) {
    ++		ret = error(_("reverse-index file %s has unsupported hash id %"PRIu32),
    ++			    revindex_name, ntohl(hdr->hash_id));
    ++		goto cleanup;
    ++	}
     +
     +cleanup:
    ++	if (ret) {
    ++		if (data)
    ++			munmap(data, revindex_size);
    ++	} else {
    ++		*len_p = revindex_size;
    ++		*data_p = (const uint32_t *)data;
    ++	}
    ++
     +	close(fd);
     +	return ret;
     +}
    @@ pack-revindex.c: static void create_pack_revindex(struct packed_git *p)
     +	if (ret)
     +		goto cleanup;
     +
    -+	p->revindex_data = (char *)p->revindex_map + 12;
    ++	p->revindex_data = (const uint32_t *)((const char *)p->revindex_map + RIDX_HEADER_SIZE);
     +
     +cleanup:
     +	free(revindex_name);
    @@ pack-revindex.c: static void create_pack_revindex(struct packed_git *p)
     +
     +	if (!load_pack_revindex_from_disk(p))
     +		return 0;
    -+	else if (!load_pack_revindex_from_memory(p))
    ++	else if (!create_pack_revindex_in_memory(p))
     +		return 0;
     +	return -1;
     +}
    @@ pack-revindex.c: int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_
     +	if (p->revindex)
     +		return p->revindex[pos].nr;
     +	else
    -+		return get_be32((char *)p->revindex_data + (pos * sizeof(uint32_t)));
    ++		return get_be32(p->revindex_data + pos);
      }

      off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos)
    @@ pack-revindex.c: int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_
      }

      ## pack-revindex.h ##
    -@@ pack-revindex.h: struct packed_git;
    +@@
    +  *   can be found
    +  */
    +
    ++#define RIDX_SIGNATURE 0x52494458 /* "RIDX" */
    ++#define RIDX_VERSION 1
    ++
    + struct packed_git;
    +
      /*
       * load_pack_revindex populates the revindex's internal data-structures for the
       * given pack, returning zero on success and a negative value otherwise.
     + *
    -+ * If a '.rev' file is present, it is checked for consistency, mmap'd, and
    -+ * pointers are assigned into it (instead of using the in-memory variant).
    ++ * If a '.rev' file is present it is mmap'd, and pointers are assigned into it
    ++ * (instead of using the in-memory variant).
       */
      int load_pack_revindex(struct packed_git *p);

    @@ packfile.h: uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);

      ## tmp-objdir.c ##
     @@ tmp-objdir.c: static int pack_copy_priority(const char *name)
    + 		return 1;
    + 	if (ends_with(name, ".pack"))
      		return 2;
    - 	if (ends_with(name, ".idx"))
    +-	if (ends_with(name, ".idx"))
    ++	if (ends_with(name, ".rev"))
      		return 3;
     -	return 4;
    -+	if (ends_with(name, ".rev"))
    ++	if (ends_with(name, ".idx"))
     +		return 4;
     +	return 5;
      }
22:  8648c87fa7 !  2:  5efc870742 pack-write.c: prepare to write 'pack-*.rev' files
    @@ Commit message
         from within 'write_rev_file()', which is called from
         'finish_tmp_packfile()'.

    +    Similar to the process by which the reverse index is computed in memory,
    +    these new paths also have to sort a list of objects by their offsets
    +    within a packfile. These new paths use a qsort() (as opposed to a radix
    +    sort), since our specialized radix sort requires a full revindex_entry
    +    struct per object, which is more memory than we need to allocate.
    +
    +    The qsort is obviously slower, but the theoretical slowdown would
    +    require a repository with a large amount of objects, likely implying
    +    that the time spent in, say, pack-objects during a repack would dominate
    +    the overall runtime.
    +
         Signed-off-by: Taylor Blau <me@ttaylorr.com>

      ## pack-write.c ##
    @@ pack-write.c: const char *write_idx_file(const char *index_name, struct pack_idx
     +	return 0;
     +}
     +
    -+#define RIDX_SIGNATURE 0x52494458 /* "RIDX" */
    -+#define RIDX_VERSION 1
    -+
     +static void write_rev_header(struct hashfile *f)
     +{
     +	uint32_t oid_version;
    @@ pack-write.c: void finish_tmp_packfile(struct strbuf *name_buffer,
     +
      	free((void *)idx_tmp_name);
      }
    +

      ## pack.h ##
     @@ pack.h: struct pack_idx_option {
    @@ pack.h: struct pack_idx_option {

      	uint32_t version;
      	uint32_t off32_limit;
    -@@ pack.h: off_t write_pack_header(struct hashfile *f, uint32_t);
    - void fixup_pack_header_footer(int, unsigned char *, const char *, uint32_t, unsigned char *, off_t);
    - char *index_pack_lockfile(int fd);
    +@@ pack.h: struct ref;
    +
    + void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought);

     +const char *write_rev_file(const char *rev_name, struct pack_idx_entry **objects, uint32_t nr_objects, const unsigned char *hash, unsigned flags);
     +
 -:  ---------- >  3:  8a3e70454b builtin/index-pack.c: allow stripping arbitrary extensions
23:  5b18ada611 !  4:  a8ee59fccf builtin/index-pack.c: write reverse indexes
    @@ Documentation/git-index-pack.txt: git-index-pack - Build pack index file for an

      OPTIONS
     @@ Documentation/git-index-pack.txt: OPTIONS
    - 	file is constructed from the name of packed archive
    - 	file by replacing .pack with .idx (and the program
      	fails if the name of packed archive does not end
    --	with .pack).
    -+	with .pack). Incompatible with `--rev-index`.
    -+
    + 	with .pack).
    +
     +--[no-]rev-index::
     +	When this flag is provided, generate a reverse index
     +	(a `.rev` file) corresponding to the given pack. If
     +	`--verify` is given, ensure that the existing
     +	reverse index is correct. Takes precedence over
     +	`pack.writeReverseIndex`.
    -
    ++
      --stdin::
      	When this flag is provided, the pack is read from stdin
    + 	instead and a copy is then written to <pack-file>. If

      ## builtin/index-pack.c ##
     @@
    @@ builtin/index-pack.c

      struct object_entry {
      	struct pack_idx_entry idx;
    -@@ builtin/index-pack.c: static void fix_unresolved_deltas(struct hashfile *f)
    - 	free(sorted_by_pos);
    - }
    -
    --static const char *derive_filename(const char *pack_name, const char *suffix,
    --				   struct strbuf *buf)
    -+static const char *derive_filename(const char *pack_name, const char *strip,
    -+				   const char *suffix, struct strbuf *buf)
    - {
    - 	size_t len;
    --	if (!strip_suffix(pack_name, ".pack", &len))
    --		die(_("packfile name '%s' does not end with '.pack'"),
    --		    pack_name);
    -+	if (!strip_suffix(pack_name, strip, &len))
    -+		die(_("packfile name '%s' does not end with '%s'"),
    -+		    pack_name, strip);
    - 	strbuf_add(buf, pack_name, len);
    - 	strbuf_addch(buf, '.');
    - 	strbuf_addstr(buf, suffix);
    -@@ builtin/index-pack.c: static void write_special_file(const char *suffix, const char *msg,
    - 	int msg_len = strlen(msg);
    -
    - 	if (pack_name)
    --		filename = derive_filename(pack_name, suffix, &name_buf);
    -+		filename = derive_filename(pack_name, ".pack", suffix, &name_buf);
    - 	else
    - 		filename = odb_pack_name(&name_buf, hash, suffix);
    -
     @@ builtin/index-pack.c: static void write_special_file(const char *suffix, const char *msg,

      static void final(const char *final_pack_name, const char *curr_pack_name,
    @@ builtin/index-pack.c: int cmd_index_pack(int argc, const char **argv, const char
      				usage(index_pack_usage);
      			continue;
     @@ builtin/index-pack.c: int cmd_index_pack(int argc, const char **argv, const char *prefix)
    - 	if (from_stdin && hash_algo)
    - 		die(_("--object-format cannot be used with --stdin"));
      	if (!index_name && pack_name)
    --		index_name = derive_filename(pack_name, "idx", &index_name_buf);
    -+		index_name = derive_filename(pack_name, ".pack", "idx", &index_name_buf);
    -+
    + 		index_name = derive_filename(pack_name, "pack", "idx", &index_name_buf);
    +
     +	opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY);
     +	if (rev_index) {
     +		opts.flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
     +		if (index_name)
     +			rev_index_name = derive_filename(index_name,
    -+							 ".idx", "rev",
    ++							 "idx", "rev",
     +							 &rev_index_name_buf);
     +	}
    -
    ++
      	if (verify) {
      		if (!index_name)
    + 			die(_("--verify with no packfile name given"));
     @@ builtin/index-pack.c: int cmd_index_pack(int argc, const char **argv, const char *prefix)
      	for (i = 0; i < nr_objects; i++)
      		idx_objects[i] = &objects[i].idx;
24:  68bde3ea97 =  5:  5bebe05a16 builtin/pack-objects.c: respect 'pack.writeReverseIndex'
25:  38a253d0ce =  6:  7e29f2d3a0 Documentation/config/pack.txt: advertise 'pack.writeReverseIndex'
26:  12cdf2d67a =  7:  7cf16485cc t: prepare for GIT_TEST_WRITE_REV_INDEX
27:  6b647d9775 !  8:  02550a251d t: support GIT_TEST_WRITE_REV_INDEX
    @@ builtin/pack-objects.c: int cmd_pack_objects(int argc, const char **argv, const

      ## ci/run-build-and-tests.sh ##
     @@ ci/run-build-and-tests.sh: linux-gcc)
    - 	export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1
      	export GIT_TEST_MULTI_PACK_INDEX=1
      	export GIT_TEST_ADD_I_USE_BUILTIN=1
    + 	export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
     +	export GIT_TEST_WRITE_REV_INDEX=1
      	make test
      	;;
    @@ pack-revindex.h
       *   can be found
       */

    ++
    + #define RIDX_SIGNATURE 0x52494458 /* "RIDX" */
    + #define RIDX_VERSION 1
    +
     +#define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
     +
      struct packed_git;
28:  48926ae182 !  9:  a66d2f9f7c pack-revindex: ensure that on-disk reverse indexes are given precedence
    @@ pack-revindex.c
      	off_t offset;
     @@ pack-revindex.c: static void create_pack_revindex(struct packed_git *p)

    - static int load_pack_revindex_from_memory(struct packed_git *p)
    + static int create_pack_revindex_in_memory(struct packed_git *p)
      {
     +	if (git_env_bool(GIT_TEST_REV_INDEX_DIE_IN_MEMORY, 0))
     +		die("dying as requested by '%s'",
    @@ pack-revindex.c: static void create_pack_revindex(struct packed_git *p)

      ## pack-revindex.h ##
     @@
    -  */
    + #define RIDX_VERSION 1

      #define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
     +#define GIT_TEST_REV_INDEX_DIE_IN_MEMORY "GIT_TEST_REV_INDEX_DIE_IN_MEMORY"
 -:  ---------- > 10:  38c8afabf2 t5325: check both on-disk and in-memory reverse index
--
2.30.0.138.g6d7191ea01

* [PATCH v3 01/10] packfile: prepare for the existence of '*.rev' files
  2021-01-25 23:37 ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Taylor Blau
@ 2021-01-25 23:37   ` Taylor Blau
  2021-01-25 23:37   ` [PATCH v3 02/10] pack-write.c: prepare to write 'pack-*.rev' files Taylor Blau
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-25 23:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Specify the format of the on-disk reverse index 'pack-*.rev' file, as
well as prepare the code for the existence of such files.

The reverse index maps from pack-relative positions (i.e., an index into
the array of objects, which is sorted by their offsets within the
packfile) to their position within the 'pack-*.idx' file. Today, this is
done by building up a list of (off_t, uint32_t) tuples for each object
(the off_t corresponding to that object's offset, and the uint32_t
corresponding to its position in the index). To convert between pack and
index position quickly, this array of tuples is radix sorted based on
its offset.

This has two major drawbacks:

First, the in-memory cost scales linearly with the number of objects in
a pack.  Each 'struct revindex_entry' is sizeof(off_t) +
sizeof(uint32_t) + padding bytes for a total of 16.

To observe this, force Git to load the reverse index by, e.g.,
running 'git cat-file --batch-check="%(objectsize:disk)"'. When asking
for a single object in a fresh clone of the kernel, Git needs to
allocate 120+ MB of memory in order to hold the reverse index in memory.
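
As a rough illustration of that arithmetic, the sketch below mirrors the
shape of 'struct revindex_entry' with a stand-in type. The names
prefixed with 'example_' are hypothetical, an int64_t is assumed in
place of a 64-bit off_t, and the 16-byte total assumes a typical 64-bit
ABI where the struct is padded to 8-byte alignment.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative stand-in for git's 'struct revindex_entry'. An int64_t
 * is assumed here in place of a 64-bit off_t.
 */
struct example_revindex_entry {
	int64_t offset;	/* object's offset within the packfile */
	uint32_t nr;	/* object's position in the .idx file */
};

/* Bytes needed to hold an in-memory reverse index for nr_objects objects. */
static uint64_t example_revindex_bytes(uint64_t nr_objects)
{
	return nr_objects * sizeof(struct example_revindex_entry);
}
```

With 16 bytes per entry, a hypothetical pack of 7.5 million objects (a
round figure picked for illustration, not a measurement) would need
120,000,000 bytes, which is the order of magnitude quoted above.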

Second, the cost to sort also scales with the size of the pack.
Luckily, this is a linear function since 'load_pack_revindex()' uses a
radix sort, but this cost still must be paid once per pack per process.

As an example, it takes ~60x longer to print the _size_ of an object as
it does to print that entire object's _contents_:

  Benchmark #1: git.compile cat-file --batch <obj
    Time (mean ± σ):       3.4 ms ±   0.1 ms    [User: 3.3 ms, System: 2.1 ms]
    Range (min … max):     3.2 ms …   3.7 ms    726 runs

  Benchmark #2: git.compile cat-file --batch-check="%(objectsize:disk)" <obj
    Time (mean ± σ):     210.3 ms ±   8.9 ms    [User: 188.2 ms, System: 23.2 ms]
    Range (min … max):   193.7 ms … 224.4 ms    13 runs

Instead, avoid computing and sorting the revindex once per process by
writing it to a file when the pack itself is generated.

The format is relatively straightforward. It contains an array of
uint32_t's, the length of which is equal to the number of objects in the
pack.  The ith entry in this table contains the index position of the
ith object in the pack, where "ith object in the pack" is determined by
pack offset.
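
A minimal sketch of reading one entry out of such a file, assuming the
layout documented in pack-format.txt by this patch (12-byte header,
table of 4-byte network-order entries, trailer of two checksums). All
'example_' names are hypothetical and are not part of Git. The sketch
only checks sizes and header fields; it does not verify either
checksum.

```c
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_RIDX_SIGNATURE 0x52494458u /* "RIDX" */
#define EXAMPLE_RIDX_HEADER_SIZE 12

/* Read a 4-byte big-endian (network order) integer. */
static uint32_t example_read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Store the index position of the object at pack position 'pos' in
 * '*index_pos' and return 0, or return -1 if 'buf' does not look like
 * a .rev file for 'nr_objects' objects with 'hash_size'-byte checksums.
 */
static int example_rev_lookup(const unsigned char *buf, size_t len,
			      uint32_t nr_objects, size_t hash_size,
			      uint32_t pos, uint32_t *index_pos)
{
	size_t min_size = EXAMPLE_RIDX_HEADER_SIZE + 2 * hash_size;
	uint32_t hash_id;

	if (len < min_size)
		return -1;	/* too small for header plus trailer */
	if ((len - min_size) % 4 || (len - min_size) / 4 != nr_objects)
		return -1;	/* table is not exactly 4 * nr_objects bytes */
	if (example_read_be32(buf) != EXAMPLE_RIDX_SIGNATURE)
		return -1;	/* unknown signature */
	if (example_read_be32(buf + 4) != 1)
		return -1;	/* unsupported version */
	hash_id = example_read_be32(buf + 8);
	if (hash_id != 1 && hash_id != 2)
		return -1;	/* unsupported hash function identifier */
	if (pos >= nr_objects)
		return -1;	/* out-of-bounds pack position */

	*index_pos = example_read_be32(buf + EXAMPLE_RIDX_HEADER_SIZE +
				       4 * (size_t)pos);
	return 0;
}
```

For a two-object SHA-1 pack, such a buffer would be 12 + 2 * 4 + 2 * 20
= 60 bytes long, and looking up pack position 0 returns whatever index
position was written as the first table entry.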

One thing that the on-disk format does _not_ contain is the full (up to)
eight-byte offset corresponding to each object. This is something that
the in-memory revindex contains (it stores an off_t in 'struct
revindex_entry' along with the same uint32_t that the on-disk format
has). Omit it in the on-disk format, since knowing the index position
for some object is sufficient to get a constant-time lookup in the
pack-*.idx file to ask for an object's offset within the pack.

This trades the on-disk size of the 'pack-*.rev' file against the
runtime needed to chase down the offset for some object. Even though the lookup
is constant time, the constant is heavier, since it can potentially
involve two pointer walks in v2 indexes (one to access the 4-byte offset
table, and potentially a second to access the double wide offset table).
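
The two walks can be sketched as follows. This assumes the v2 index
convention that a 4-byte offset entry with its most significant bit set
holds, in its lower 31 bits, a position in the 8-byte offset table. The
'example_' names are hypothetical, the two table pointers are assumed to
already point at the start of each table, and no bounds checking is
done.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 4-byte big-endian (network order) integer. */
static uint32_t example_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read an 8-byte big-endian integer. */
static uint64_t example_be64(const unsigned char *p)
{
	return ((uint64_t)example_be32(p) << 32) | example_be32(p + 4);
}

/*
 * Look up the pack offset of the object at 'index_pos', given pointers
 * to the 4-byte and 8-byte offset tables of a v2 index.
 */
static uint64_t example_idx_offset(const unsigned char *off32,
				   const unsigned char *off64,
				   uint32_t index_pos)
{
	/* First walk: the 4-byte offset table. */
	uint32_t off = example_be32(off32 + 4 * (size_t)index_pos);

	if (!(off & 0x80000000u))
		return off;	/* small offset, stored directly */

	/* Second walk: the lower 31 bits index the 8-byte table. */
	off &= 0x7fffffffu;
	return example_be64(off64 + 8 * (size_t)off);
}
```

Only objects whose offsets do not fit in 31 bits pay for the second
walk, so for packs smaller than 2 GB every lookup stops after the first
table.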

Consider trying to map an object's pack offset to a relative position
within that pack. In a cold-cache scenario, more page faults occur while
switching between binary searching through the reverse index and
searching through the *.idx file for an object's offset. Sure enough,
with a cold cache (writing '3' into '/proc/sys/vm/drop_caches' after
'sync'ing), printing out the entire object's contents is still
marginally faster than printing its size:

  Benchmark #1: git.compile cat-file --batch-check="%(objectsize:disk)" <obj >/dev/null
    Time (mean ± σ):      22.6 ms ±   0.5 ms    [User: 2.4 ms, System: 7.9 ms]
    Range (min … max):    21.4 ms …  23.5 ms    41 runs

  Benchmark #2: git.compile cat-file --batch <obj >/dev/null
    Time (mean ± σ):      17.2 ms ±   0.7 ms    [User: 2.8 ms, System: 5.5 ms]
    Range (min … max):    15.6 ms …  18.2 ms    45 runs

(Numbers taken in the kernel after cheating and using the next patch to
generate a reverse index). There are a couple of approaches to improve
cold cache performance not pursued here:

  - We could include the object offsets in the reverse index format.
    Predictably, this does result in fewer page faults, but it triples
    the size of the file, while simultaneously duplicating a ton of data
    already available in the .idx file. (This was the original way I
    implemented the format, and it did show
    `--batch-check='%(objectsize:disk)'` winning out against `--batch`.)

    On the other hand, this increase in size also results in a large
    block-cache footprint, which could potentially hurt other workloads.

  - We could store the mapping from pack to index position in a more
    cache-friendly way, like constructing a binary search tree from the
    table and writing the values in breadth-first order. This would
    result in much better locality, but the price you pay is trading
    O(1) lookup in 'pack_pos_to_index()' for an O(log n) one (since you
    can no longer directly index the table).

So, neither of these approaches is taken here. (Thankfully, the format
is versioned, so we are free to pursue these in the future.) But, cold
cache performance likely isn't interesting outside of one-off cases like
asking for the size of an object directly. In real-world usage, Git is
often performing many operations in the revindex (i.e., asking about
many objects rather than a single one).

The trade-off is worth it, since we will avoid so much of the cost of
generating the revindex that the extra pointer chase will look like
noise in the following patch's benchmarks.

This patch describes the format and prepares callers (like in
pack-revindex.c) to be able to read *.rev files once they exist. An
implementation of the writer will appear in the next patch, and callers
will gradually begin to start using the writer in the patches that
follow after that.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/pack-format.txt |  20 ++++
 builtin/repack.c                        |   1 +
 object-store.h                          |   3 +
 pack-revindex.c                         | 144 ++++++++++++++++++++++--
 pack-revindex.h                         |  10 +-
 packfile.c                              |  13 ++-
 packfile.h                              |   1 +
 tmp-objdir.c                            |   6 +-
 8 files changed, 184 insertions(+), 14 deletions(-)

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 96d2fc589f..8833b71c8b 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -274,6 +274,26 @@ Pack file entry: <+
 
     Index checksum of all of the above.
 
+== pack-*.rev files have the format:
+
+  - A 4-byte magic number '0x52494458' ('RIDX').
+
+  - A 4-byte version identifier (= 1).
+
+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
+
+  - A table of index positions (one per packed object, num_objects in
+    total, each a 4-byte unsigned integer in network order), sorted by
+    their corresponding offsets in the packfile.
+
+  - A trailer, containing a:
+
+    checksum of the corresponding packfile, and
+
+    a checksum of all of the above.
+
+All 4-byte numbers are in network order.
+
 == multi-pack-index (MIDX) files have the following format:
 
 The multi-pack-index files refer to multiple pack-files and loose objects.
diff --git a/builtin/repack.c b/builtin/repack.c
index 2158b48f4c..01440de2d5 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -209,6 +209,7 @@ static struct {
 } exts[] = {
 	{".pack"},
 	{".idx"},
+	{".rev", 1},
 	{".bitmap", 1},
 	{".promisor", 1},
 };
diff --git a/object-store.h b/object-store.h
index c4fc9dd74e..541dab0858 100644
--- a/object-store.h
+++ b/object-store.h
@@ -85,6 +85,9 @@ struct packed_git {
 		 multi_pack_index:1;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	struct revindex_entry *revindex;
+	const uint32_t *revindex_data;
+	const uint32_t *revindex_map;
+	size_t revindex_size;
 	/* something like ".git/objects/pack/xxxxx.pack" */
 	char pack_name[FLEX_ARRAY]; /* more */
 };
diff --git a/pack-revindex.c b/pack-revindex.c
index 5e69bc7372..a174fa5388 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -164,16 +164,130 @@ static void create_pack_revindex(struct packed_git *p)
 	sort_revindex(p->revindex, num_ent, p->pack_size);
 }
 
-int load_pack_revindex(struct packed_git *p)
+static int create_pack_revindex_in_memory(struct packed_git *p)
 {
-	if (!p->revindex) {
-		if (open_pack_index(p))
-			return -1;
-		create_pack_revindex(p);
-	}
+	if (open_pack_index(p))
+		return -1;
+	create_pack_revindex(p);
 	return 0;
 }
 
+static char *pack_revindex_filename(struct packed_git *p)
+{
+	size_t len;
+	if (!strip_suffix(p->pack_name, ".pack", &len))
+		BUG("pack_name does not end in .pack");
+	return xstrfmt("%.*s.rev", (int)len, p->pack_name);
+}
+
+#define RIDX_HEADER_SIZE (12)
+#define RIDX_MIN_SIZE (RIDX_HEADER_SIZE + (2 * the_hash_algo->rawsz))
+
+struct revindex_header {
+	uint32_t signature;
+	uint32_t version;
+	uint32_t hash_id;
+};
+
+static int load_revindex_from_disk(char *revindex_name,
+				   uint32_t num_objects,
+				   const uint32_t **data_p, size_t *len_p)
+{
+	int fd, ret = 0;
+	struct stat st;
+	void *data = NULL;
+	size_t revindex_size;
+	struct revindex_header *hdr;
+
+	fd = git_open(revindex_name);
+
+	if (fd < 0) {
+		ret = -1;
+		goto cleanup;
+	}
+	if (fstat(fd, &st)) {
+		ret = error_errno(_("failed to read %s"), revindex_name);
+		goto cleanup;
+	}
+
+	revindex_size = xsize_t(st.st_size);
+
+	if (revindex_size < RIDX_MIN_SIZE) {
+		ret = error(_("reverse-index file %s is too small"), revindex_name);
+		goto cleanup;
+	}
+
+	if (revindex_size - RIDX_MIN_SIZE != st_mult(sizeof(uint32_t), num_objects)) {
+		ret = error(_("reverse-index file %s is corrupt"), revindex_name);
+		goto cleanup;
+	}
+
+	data = xmmap(NULL, revindex_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	hdr = data;
+
+	if (ntohl(hdr->signature) != RIDX_SIGNATURE) {
+		ret = error(_("reverse-index file %s has unknown signature"), revindex_name);
+		goto cleanup;
+	}
+	if (ntohl(hdr->version) != 1) {
+		ret = error(_("reverse-index file %s has unsupported version %"PRIu32),
+			    revindex_name, ntohl(hdr->version));
+		goto cleanup;
+	}
+	if (!(ntohl(hdr->hash_id) == 1 || ntohl(hdr->hash_id) == 2)) {
+		ret = error(_("reverse-index file %s has unsupported hash id %"PRIu32),
+			    revindex_name, ntohl(hdr->hash_id));
+		goto cleanup;
+	}
+
+cleanup:
+	if (ret) {
+		if (data)
+			munmap(data, revindex_size);
+	} else {
+		*len_p = revindex_size;
+		*data_p = (const uint32_t *)data;
+	}
+
+	close(fd);
+	return ret;
+}
+
+static int load_pack_revindex_from_disk(struct packed_git *p)
+{
+	char *revindex_name;
+	int ret;
+	if (open_pack_index(p))
+		return -1;
+
+	revindex_name = pack_revindex_filename(p);
+
+	ret = load_revindex_from_disk(revindex_name,
+				      p->num_objects,
+				      &p->revindex_map,
+				      &p->revindex_size);
+	if (ret)
+		goto cleanup;
+
+	p->revindex_data = (const uint32_t *)((const char *)p->revindex_map + RIDX_HEADER_SIZE);
+
+cleanup:
+	free(revindex_name);
+	return ret;
+}
+
+int load_pack_revindex(struct packed_git *p)
+{
+	if (p->revindex || p->revindex_data)
+		return 0;
+
+	if (!load_pack_revindex_from_disk(p))
+		return 0;
+	else if (!create_pack_revindex_in_memory(p))
+		return 0;
+	return -1;
+}
+
 int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
 {
 	unsigned lo, hi;
@@ -203,18 +317,28 @@ int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos)
 
 uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos)
 {
-	if (!p->revindex)
+	if (!(p->revindex || p->revindex_data))
 		BUG("pack_pos_to_index: reverse index not yet loaded");
 	if (p->num_objects <= pos)
 		BUG("pack_pos_to_index: out-of-bounds object at %"PRIu32, pos);
-	return p->revindex[pos].nr;
+
+	if (p->revindex)
+		return p->revindex[pos].nr;
+	else
+		return get_be32(p->revindex_data + pos);
 }
 
 off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos)
 {
-	if (!p->revindex)
+	if (!(p->revindex || p->revindex_data))
 		BUG("pack_pos_to_index: reverse index not yet loaded");
 	if (p->num_objects < pos)
 		BUG("pack_pos_to_offset: out-of-bounds object at %"PRIu32, pos);
-	return p->revindex[pos].offset;
+
+	if (p->revindex)
+		return p->revindex[pos].offset;
+	else if (pos == p->num_objects)
+		return p->pack_size - the_hash_algo->rawsz;
+	else
+		return nth_packed_object_offset(p, pack_pos_to_index(p, pos));
 }
diff --git a/pack-revindex.h b/pack-revindex.h
index 6e0320b08b..61b2f3ab75 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -16,11 +16,17 @@
  *   can be found
  */
 
+#define RIDX_SIGNATURE 0x52494458 /* "RIDX" */
+#define RIDX_VERSION 1
+
 struct packed_git;
 
 /*
  * load_pack_revindex populates the revindex's internal data-structures for the
  * given pack, returning zero on success and a negative value otherwise.
+ *
+ * If a '.rev' file is present it is mmap'd, and pointers are assigned into it
+ * (instead of using the in-memory variant).
  */
 int load_pack_revindex(struct packed_git *p);
 
@@ -55,7 +61,9 @@ uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos);
  * If the reverse index has not yet been loaded, or the position is out of
  * bounds, this function aborts.
  *
- * This function runs in constant time.
+ * This function runs in constant time under both in-memory and on-disk reverse
+ * indexes, but an additional step is taken to consult the corresponding .idx
+ * file when using the on-disk format.
  */
 off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos);
 
diff --git a/packfile.c b/packfile.c
index 4b938b4372..1fec12ac5f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -324,11 +324,21 @@ void close_pack_index(struct packed_git *p)
 	}
 }
 
+void close_pack_revindex(struct packed_git *p) {
+	if (!p->revindex_map)
+		return;
+
+	munmap((void *)p->revindex_map, p->revindex_size);
+	p->revindex_map = NULL;
+	p->revindex_data = NULL;
+}
+
 void close_pack(struct packed_git *p)
 {
 	close_pack_windows(p);
 	close_pack_fd(p);
 	close_pack_index(p);
+	close_pack_revindex(p);
 }
 
 void close_object_store(struct raw_object_store *o)
@@ -351,7 +361,7 @@ void close_object_store(struct raw_object_store *o)
 
 void unlink_pack_path(const char *pack_name, int force_delete)
 {
-	static const char *exts[] = {".pack", ".idx", ".keep", ".bitmap", ".promisor"};
+	static const char *exts[] = {".pack", ".idx", ".rev", ".keep", ".bitmap", ".promisor"};
 	int i;
 	struct strbuf buf = STRBUF_INIT;
 	size_t plen;
@@ -853,6 +863,7 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	if (!strcmp(file_name, "multi-pack-index"))
 		return;
 	if (ends_with(file_name, ".idx") ||
+	    ends_with(file_name, ".rev") ||
 	    ends_with(file_name, ".pack") ||
 	    ends_with(file_name, ".bitmap") ||
 	    ends_with(file_name, ".keep") ||
diff --git a/packfile.h b/packfile.h
index a58fc738e0..4cfec9e8d3 100644
--- a/packfile.h
+++ b/packfile.h
@@ -90,6 +90,7 @@ uint32_t get_pack_fanout(struct packed_git *p, uint32_t value);
 
 unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t, unsigned long *);
 void close_pack_windows(struct packed_git *);
+void close_pack_revindex(struct packed_git *);
 void close_pack(struct packed_git *);
 void close_object_store(struct raw_object_store *o);
 void unuse_pack(struct pack_window **);
diff --git a/tmp-objdir.c b/tmp-objdir.c
index 42ed4db5d3..b8d880e362 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -185,9 +185,11 @@ static int pack_copy_priority(const char *name)
 		return 1;
 	if (ends_with(name, ".pack"))
 		return 2;
-	if (ends_with(name, ".idx"))
+	if (ends_with(name, ".rev"))
 		return 3;
-	return 4;
+	if (ends_with(name, ".idx"))
+		return 4;
+	return 5;
 }
 
 static int pack_copy_cmp(const char *a, const char *b)
-- 
2.30.0.138.g6d7191ea01



* [PATCH v3 02/10] pack-write.c: prepare to write 'pack-*.rev' files
  2021-01-25 23:37 ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Taylor Blau
  2021-01-25 23:37   ` [PATCH v3 01/10] packfile: prepare for the existence of '*.rev' files Taylor Blau
@ 2021-01-25 23:37   ` Taylor Blau
  2021-01-25 23:37   ` [PATCH v3 03/10] builtin/index-pack.c: allow stripping arbitrary extensions Taylor Blau
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-25 23:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

This patch prepares for callers to be able to write reverse index files
to disk.

It adds the necessary machinery to write a format-compliant .rev file
from within 'write_rev_file()', which is called from
'finish_tmp_packfile()'.
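
As a rough illustration of the 12-byte header that precedes the table of
index positions, here is a standalone sketch. It is not the code in this
patch: put_be32_sketch() and fill_rev_header_sketch() are hypothetical
helpers that write into a plain buffer, whereas the patch itself goes
through hashwrite_be32() on a hashfile. Only the three fields, their
order, and their big-endian encoding are taken from the series.

```c
#include <stdint.h>
#include <stddef.h>

#define RIDX_SIGNATURE 0x52494458 /* "RIDX" */
#define RIDX_VERSION 1
#define RIDX_HEADER_SIZE 12

/* Hypothetical helper: store a 32-bit value in big-endian byte order. */
static void put_be32_sketch(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/*
 * Hypothetical helper: fill the 12-byte header with the signature, the
 * format version, and the hash id (1 for SHA-1, 2 for SHA-256).
 */
static void fill_rev_header_sketch(unsigned char hdr[RIDX_HEADER_SIZE],
				   uint32_t hash_id)
{
	put_be32_sketch(hdr, RIDX_SIGNATURE);
	put_be32_sketch(hdr + 4, RIDX_VERSION);
	put_be32_sketch(hdr + 8, hash_id);
}
```

Because every field is stored big-endian, a reader on any platform can
check the first four bytes against the ASCII string "RIDX" before
trusting the rest of the file.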

Similar to the process by which the reverse index is computed in memory,
these new paths also have to sort a list of objects by their offsets
within a packfile. These new paths use a qsort() (as opposed to a radix
sort), since our specialized radix sort requires a full revindex_entry
struct per object, which is more memory than we need to allocate.

The qsort is obviously slower, but the slowdown would only become
noticeable in a repository with a very large number of objects, where
the time spent elsewhere (say, in pack-objects during a repack) would
likely dominate the overall runtime anyway.
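
To make the memory trade-off concrete, here is a standalone sketch of the
sorting step, assuming the object offsets are available as a plain array
indexed by index position. It is not the patch's code: the patch passes
its context through QSORT_S(), while this sketch uses standard qsort()
with a file-scope pointer (sketch_offsets, a hypothetical name), and the
sketch_* functions are likewise made up for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical comparator context; the patch passes this via QSORT_S(). */
static const uint64_t *sketch_offsets;

/* Order two index positions by the pack offset of their objects. */
static int sketch_offset_cmp(const void *va, const void *vb)
{
	uint64_t oa = sketch_offsets[*(const uint32_t *)va];
	uint64_t ob = sketch_offsets[*(const uint32_t *)vb];

	if (oa < ob)
		return -1;
	if (oa > ob)
		return 1;
	return 0;
}

/*
 * Fill pack_order[] with the index positions 0..nr-1, sorted by the
 * pack offset of the object found at each index position.
 */
static void sketch_pack_order(const uint64_t *offsets, uint32_t *pack_order,
			      uint32_t nr)
{
	uint32_t i;

	for (i = 0; i < nr; i++)
		pack_order[i] = i;
	sketch_offsets = offsets;
	qsort(pack_order, nr, sizeof(*pack_order), sketch_offset_cmp);
}
```

The only scratch memory is one 32-bit integer per object, which is the
point of preferring this over the radix sort and its array of full
revindex_entry structs.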

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-write.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 pack.h       |   4 ++
 2 files changed, 123 insertions(+), 1 deletion(-)

diff --git a/pack-write.c b/pack-write.c
index e9bb3fd949..680c36755d 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -167,6 +167,113 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
 	return index_name;
 }
 
+static int pack_order_cmp(const void *va, const void *vb, void *ctx)
+{
+	struct pack_idx_entry **objects = ctx;
+
+	off_t oa = objects[*(uint32_t*)va]->offset;
+	off_t ob = objects[*(uint32_t*)vb]->offset;
+
+	if (oa < ob)
+		return -1;
+	if (oa > ob)
+		return 1;
+	return 0;
+}
+
+static void write_rev_header(struct hashfile *f)
+{
+	uint32_t oid_version;
+	switch (hash_algo_by_ptr(the_hash_algo)) {
+	case GIT_HASH_SHA1:
+		oid_version = 1;
+		break;
+	case GIT_HASH_SHA256:
+		oid_version = 2;
+		break;
+	default:
+		die("write_rev_header: unknown hash version");
+	}
+
+	hashwrite_be32(f, RIDX_SIGNATURE);
+	hashwrite_be32(f, RIDX_VERSION);
+	hashwrite_be32(f, oid_version);
+}
+
+static void write_rev_index_positions(struct hashfile *f,
+				      struct pack_idx_entry **objects,
+				      uint32_t nr_objects)
+{
+	uint32_t *pack_order;
+	uint32_t i;
+
+	ALLOC_ARRAY(pack_order, nr_objects);
+	for (i = 0; i < nr_objects; i++)
+		pack_order[i] = i;
+	QSORT_S(pack_order, nr_objects, pack_order_cmp, objects);
+
+	for (i = 0; i < nr_objects; i++)
+		hashwrite_be32(f, pack_order[i]);
+
+	free(pack_order);
+}
+
+static void write_rev_trailer(struct hashfile *f, const unsigned char *hash)
+{
+	hashwrite(f, hash, the_hash_algo->rawsz);
+}
+
+const char *write_rev_file(const char *rev_name,
+			   struct pack_idx_entry **objects,
+			   uint32_t nr_objects,
+			   const unsigned char *hash,
+			   unsigned flags)
+{
+	struct hashfile *f;
+	int fd;
+
+	if ((flags & WRITE_REV) && (flags & WRITE_REV_VERIFY))
+		die(_("cannot both write and verify reverse index"));
+
+	if (flags & WRITE_REV) {
+		if (!rev_name) {
+			struct strbuf tmp_file = STRBUF_INIT;
+			fd = odb_mkstemp(&tmp_file, "pack/tmp_rev_XXXXXX");
+			rev_name = strbuf_detach(&tmp_file, NULL);
+		} else {
+			unlink(rev_name);
+			fd = open(rev_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
+			if (fd < 0)
+				die_errno("unable to create '%s'", rev_name);
+		}
+		f = hashfd(fd, rev_name);
+	} else if (flags & WRITE_REV_VERIFY) {
+		struct stat statbuf;
+		if (stat(rev_name, &statbuf)) {
+			if (errno == ENOENT) {
+				/* .rev files are optional */
+				return NULL;
+			} else
+				die_errno(_("could not stat: %s"), rev_name);
+		}
+		f = hashfd_check(rev_name);
+	} else
+		return NULL;
+
+	write_rev_header(f);
+
+	write_rev_index_positions(f, objects, nr_objects);
+	write_rev_trailer(f, hash);
+
+	if (rev_name && adjust_shared_perm(rev_name) < 0)
+		die(_("failed to make %s readable"), rev_name);
+
+	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_CLOSE |
+				    ((flags & WRITE_IDX_VERIFY) ? 0 : CSUM_FSYNC));
+
+	return rev_name;
+}
+
 off_t write_pack_header(struct hashfile *f, uint32_t nr_entries)
 {
 	struct pack_header hdr;
@@ -342,7 +449,7 @@ void finish_tmp_packfile(struct strbuf *name_buffer,
 			 struct pack_idx_option *pack_idx_opts,
 			 unsigned char hash[])
 {
-	const char *idx_tmp_name;
+	const char *idx_tmp_name, *rev_tmp_name = NULL;
 	int basename_len = name_buffer->len;
 
 	if (adjust_shared_perm(pack_tmp_name))
@@ -353,6 +460,9 @@ void finish_tmp_packfile(struct strbuf *name_buffer,
 	if (adjust_shared_perm(idx_tmp_name))
 		die_errno("unable to make temporary index file readable");
 
+	rev_tmp_name = write_rev_file(NULL, written_list, nr_written, hash,
+				      pack_idx_opts->flags);
+
 	strbuf_addf(name_buffer, "%s.pack", hash_to_hex(hash));
 
 	if (rename(pack_tmp_name, name_buffer->buf))
@@ -366,6 +476,14 @@ void finish_tmp_packfile(struct strbuf *name_buffer,
 
 	strbuf_setlen(name_buffer, basename_len);
 
+	if (rev_tmp_name) {
+		strbuf_addf(name_buffer, "%s.rev", hash_to_hex(hash));
+		if (rename(rev_tmp_name, name_buffer->buf))
+			die_errno("unable to rename temporary reverse-index file");
+	}
+
+	strbuf_setlen(name_buffer, basename_len);
+
 	free((void *)idx_tmp_name);
 }
 
diff --git a/pack.h b/pack.h
index 9ae640f417..afdcf8f5c7 100644
--- a/pack.h
+++ b/pack.h
@@ -42,6 +42,8 @@ struct pack_idx_option {
 	/* flag bits */
 #define WRITE_IDX_VERIFY 01 /* verify only, do not write the idx file */
 #define WRITE_IDX_STRICT 02
+#define WRITE_REV 04
+#define WRITE_REV_VERIFY 010
 
 	uint32_t version;
 	uint32_t off32_limit;
@@ -91,6 +93,8 @@ struct ref;
 
 void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought);
 
+const char *write_rev_file(const char *rev_name, struct pack_idx_entry **objects, uint32_t nr_objects, const unsigned char *hash, unsigned flags);
+
 /*
  * The "hdr" output buffer should be at least this big, which will handle sizes
  * up to 2^67.
-- 
2.30.0.138.g6d7191ea01



* [PATCH v3 03/10] builtin/index-pack.c: allow stripping arbitrary extensions
  2021-01-25 23:37 ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Taylor Blau
  2021-01-25 23:37   ` [PATCH v3 01/10] packfile: prepare for the existence of '*.rev' files Taylor Blau
  2021-01-25 23:37   ` [PATCH v3 02/10] pack-write.c: prepare to write 'pack-*.rev' files Taylor Blau
@ 2021-01-25 23:37   ` Taylor Blau
  2021-01-25 23:37   ` [PATCH v3 04/10] builtin/index-pack.c: write reverse indexes Taylor Blau
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-25 23:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

To derive the filename for a .idx file, 'git index-pack' uses
derive_filename() to strip the '.pack' suffix and add the new suffix.

Prepare for stripping off suffixes other than '.pack' by making the
suffix to strip a parameter of derive_filename(). To keep this
consistent with the "suffix" parameter, which does not begin with a
".", the new parameter omits the leading "." as well, and
derive_filename() gains an additional check that the stripped suffix
is preceded by one.
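
The intended behavior can be sketched outside of Git roughly as follows.
This is an illustration only: sketch_derive_filename() is a hypothetical
function that writes into a caller-supplied buffer and returns an error
code, whereas the real derive_filename() uses strip_suffix(), a strbuf,
and die().

```c
#include <stdio.h>
#include <string.h>

/*
 * Replace the extension `strip` (given without a leading ".") at the
 * end of `name` with `suffix`, writing the result into `out`.
 * Returns 0 on success, or -1 if `name` does not end in ".<strip>" or
 * the result does not fit in `out`.
 */
static int sketch_derive_filename(const char *name, const char *strip,
				  const char *suffix,
				  char *out, size_t outsz)
{
	size_t name_len = strlen(name);
	size_t strip_len = strlen(strip);
	size_t len;
	int n;

	if (name_len < strip_len ||
	    strcmp(name + name_len - strip_len, strip))
		return -1;
	len = name_len - strip_len;
	/* The byte just before the stripped extension must be the ".". */
	if (!len || name[len - 1] != '.')
		return -1;

	/* Keep everything up to and including the ".", then append. */
	n = snprintf(out, outsz, "%.*s%s", (int)len, name, suffix);
	if (n < 0 || (size_t)n >= outsz)
		return -1;
	return 0;
}
```

Since the "." is kept as part of the prefix rather than re-added, a name
such as "mypack" is rejected even though it technically ends in "pack".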

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/index-pack.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 557bd2f348..c758f3b8e9 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1436,15 +1436,15 @@ static void fix_unresolved_deltas(struct hashfile *f)
 	free(sorted_by_pos);
 }
 
-static const char *derive_filename(const char *pack_name, const char *suffix,
-				   struct strbuf *buf)
+static const char *derive_filename(const char *pack_name, const char *strip,
+				   const char *suffix, struct strbuf *buf)
 {
 	size_t len;
-	if (!strip_suffix(pack_name, ".pack", &len))
-		die(_("packfile name '%s' does not end with '.pack'"),
-		    pack_name);
+	if (!strip_suffix(pack_name, strip, &len) || !len ||
+	    pack_name[len - 1] != '.')
+		die(_("packfile name '%s' does not end with '.%s'"),
+		    pack_name, strip);
 	strbuf_add(buf, pack_name, len);
-	strbuf_addch(buf, '.');
 	strbuf_addstr(buf, suffix);
 	return buf->buf;
 }
@@ -1459,7 +1459,7 @@ static void write_special_file(const char *suffix, const char *msg,
 	int msg_len = strlen(msg);
 
 	if (pack_name)
-		filename = derive_filename(pack_name, suffix, &name_buf);
+		filename = derive_filename(pack_name, "pack", suffix, &name_buf);
 	else
 		filename = odb_pack_name(&name_buf, hash, suffix);
 
@@ -1824,7 +1824,7 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (from_stdin && hash_algo)
 		die(_("--object-format cannot be used with --stdin"));
 	if (!index_name && pack_name)
-		index_name = derive_filename(pack_name, "idx", &index_name_buf);
+		index_name = derive_filename(pack_name, "pack", "idx", &index_name_buf);
 
 	if (verify) {
 		if (!index_name)
-- 
2.30.0.138.g6d7191ea01



* [PATCH v3 04/10] builtin/index-pack.c: write reverse indexes
  2021-01-25 23:37 ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                     ` (2 preceding siblings ...)
  2021-01-25 23:37   ` [PATCH v3 03/10] builtin/index-pack.c: allow stripping arbitrary extensions Taylor Blau
@ 2021-01-25 23:37   ` Taylor Blau
  2021-01-25 23:37   ` [PATCH v3 05/10] builtin/pack-objects.c: respect 'pack.writeReverseIndex' Taylor Blau
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-25 23:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Teach 'git index-pack' to optionally write and verify a reverse index
with '--[no-]rev-index', and to respect the 'pack.writeReverseIndex'
configuration option.
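
The precedence between the configuration, the command-line option, and
'--verify' can be summarized with the standalone sketch below. It is a
condensed restatement, not the patch's code: sketch_resolve_rev_flags()
and its cli_choice encoding are hypothetical, and only the two flag
values and the order of the steps are taken from the series.

```c
#define WRITE_REV 04
#define WRITE_REV_VERIFY 010

/*
 * cli_choice is a hypothetical encoding for this sketch: -1 if neither
 * option was given, 1 for --rev-index, 0 for --no-rev-index.
 */
static unsigned sketch_resolve_rev_flags(unsigned flags, int cli_choice,
					 int verify)
{
	/* Start from whatever the configuration asked for. */
	int rev_index = !!(flags & (WRITE_REV_VERIFY | WRITE_REV));

	/* An explicit command-line option overrides the configuration. */
	if (cli_choice >= 0)
		rev_index = cli_choice;

	/* Writing and verifying are mutually exclusive; pick one. */
	flags &= ~(unsigned)(WRITE_REV | WRITE_REV_VERIFY);
	if (rev_index)
		flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
	return flags;
}
```

Clearing both bits before setting one of them means write_rev_file()
never sees WRITE_REV and WRITE_REV_VERIFY at the same time, a
combination it refuses with die().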

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-index-pack.txt | 18 +++++---
 builtin/index-pack.c             | 50 ++++++++++++++++++++--
 t/t5325-reverse-index.sh         | 71 ++++++++++++++++++++++++++++++++
 3 files changed, 131 insertions(+), 8 deletions(-)
 create mode 100755 t/t5325-reverse-index.sh

diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index af0c26232c..69ba904d44 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -9,17 +9,18 @@ git-index-pack - Build pack index file for an existing packed archive
 SYNOPSIS
 --------
 [verse]
-'git index-pack' [-v] [-o <index-file>] <pack-file>
+'git index-pack' [-v] [-o <index-file>] [--[no-]rev-index] <pack-file>
 'git index-pack' --stdin [--fix-thin] [--keep] [-v] [-o <index-file>]
-                 [<pack-file>]
+		  [--[no-]rev-index] [<pack-file>]
 
 
 DESCRIPTION
 -----------
 Reads a packed archive (.pack) from the specified file, and
-builds a pack index file (.idx) for it.  The packed archive
-together with the pack index can then be placed in the
-objects/pack/ directory of a Git repository.
+builds a pack index file (.idx) for it. Optionally writes a
+reverse-index (.rev) for the specified pack. The packed
+archive together with the pack index can then be placed in
+the objects/pack/ directory of a Git repository.
 
 
 OPTIONS
@@ -35,6 +36,13 @@ OPTIONS
 	fails if the name of packed archive does not end
 	with .pack).
 
+--[no-]rev-index::
+	When this flag is provided, generate a reverse index
+	(a `.rev` file) corresponding to the given pack. If
+	`--verify` is given, ensure that the existing
+	reverse index is correct. Takes precedence over
+	`pack.writeReverseIndex`.
+
 --stdin::
 	When this flag is provided, the pack is read from stdin
 	instead and a copy is then written to <pack-file>. If
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index c758f3b8e9..d5cd665b98 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -17,7 +17,7 @@
 #include "promisor-remote.h"
 
 static const char index_pack_usage[] =
-"git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--verify] [--strict] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
+"git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--[no-]rev-index] [--verify] [--strict] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
 
 struct object_entry {
 	struct pack_idx_entry idx;
@@ -1484,12 +1484,14 @@ static void write_special_file(const char *suffix, const char *msg,
 
 static void final(const char *final_pack_name, const char *curr_pack_name,
 		  const char *final_index_name, const char *curr_index_name,
+		  const char *final_rev_index_name, const char *curr_rev_index_name,
 		  const char *keep_msg, const char *promisor_msg,
 		  unsigned char *hash)
 {
 	const char *report = "pack";
 	struct strbuf pack_name = STRBUF_INIT;
 	struct strbuf index_name = STRBUF_INIT;
+	struct strbuf rev_index_name = STRBUF_INIT;
 	int err;
 
 	if (!from_stdin) {
@@ -1524,6 +1526,16 @@ static void final(const char *final_pack_name, const char *curr_pack_name,
 	} else
 		chmod(final_index_name, 0444);
 
+	if (curr_rev_index_name) {
+		if (final_rev_index_name != curr_rev_index_name) {
+			if (!final_rev_index_name)
+				final_rev_index_name = odb_pack_name(&rev_index_name, hash, "rev");
+			if (finalize_object_file(curr_rev_index_name, final_rev_index_name))
+				die(_("cannot store reverse index file"));
+		} else
+			chmod(final_rev_index_name, 0444);
+	}
+
 	if (do_fsck_object) {
 		struct packed_git *p;
 		p = add_packed_git(final_index_name, strlen(final_index_name), 0);
@@ -1553,6 +1565,7 @@ static void final(const char *final_pack_name, const char *curr_pack_name,
 		}
 	}
 
+	strbuf_release(&rev_index_name);
 	strbuf_release(&index_name);
 	strbuf_release(&pack_name);
 }
@@ -1578,6 +1591,12 @@ static int git_index_pack_config(const char *k, const char *v, void *cb)
 		}
 		return 0;
 	}
+	if (!strcmp(k, "pack.writereverseindex")) {
+		if (git_config_bool(k, v))
+			opts->flags |= WRITE_REV;
+		else
+			opts->flags &= ~WRITE_REV;
+	}
 	return git_default_config(k, v, cb);
 }
 
@@ -1695,12 +1714,14 @@ static void show_pack_info(int stat_only)
 
 int cmd_index_pack(int argc, const char **argv, const char *prefix)
 {
-	int i, fix_thin_pack = 0, verify = 0, stat_only = 0;
+	int i, fix_thin_pack = 0, verify = 0, stat_only = 0, rev_index;
 	const char *curr_index;
-	const char *index_name = NULL, *pack_name = NULL;
+	const char *curr_rev_index = NULL;
+	const char *index_name = NULL, *pack_name = NULL, *rev_index_name = NULL;
 	const char *keep_msg = NULL;
 	const char *promisor_msg = NULL;
 	struct strbuf index_name_buf = STRBUF_INIT;
+	struct strbuf rev_index_name_buf = STRBUF_INIT;
 	struct pack_idx_entry **idx_objects;
 	struct pack_idx_option opts;
 	unsigned char pack_hash[GIT_MAX_RAWSZ];
@@ -1727,6 +1748,8 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (prefix && chdir(prefix))
 		die(_("Cannot come back to cwd"));
 
+	rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
+
 	for (i = 1; i < argc; i++) {
 		const char *arg = argv[i];
 
@@ -1805,6 +1828,10 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 				if (hash_algo == GIT_HASH_UNKNOWN)
 					die(_("unknown hash algorithm '%s'"), arg);
 				repo_set_hash_algo(the_repository, hash_algo);
+			} else if (!strcmp(arg, "--rev-index")) {
+				rev_index = 1;
+			} else if (!strcmp(arg, "--no-rev-index")) {
+				rev_index = 0;
 			} else
 				usage(index_pack_usage);
 			continue;
@@ -1826,6 +1853,15 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (!index_name && pack_name)
 		index_name = derive_filename(pack_name, "pack", "idx", &index_name_buf);
 
+	opts.flags &= ~(WRITE_REV | WRITE_REV_VERIFY);
+	if (rev_index) {
+		opts.flags |= verify ? WRITE_REV_VERIFY : WRITE_REV;
+		if (index_name)
+			rev_index_name = derive_filename(index_name,
+							 "idx", "rev",
+							 &rev_index_name_buf);
+	}
+
 	if (verify) {
 		if (!index_name)
 			die(_("--verify with no packfile name given"));
@@ -1878,11 +1914,16 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	for (i = 0; i < nr_objects; i++)
 		idx_objects[i] = &objects[i].idx;
 	curr_index = write_idx_file(index_name, idx_objects, nr_objects, &opts, pack_hash);
+	if (rev_index)
+		curr_rev_index = write_rev_file(rev_index_name, idx_objects,
+						nr_objects, pack_hash,
+						opts.flags);
 	free(idx_objects);
 
 	if (!verify)
 		final(pack_name, curr_pack,
 		      index_name, curr_index,
+		      rev_index_name, curr_rev_index,
 		      keep_msg, promisor_msg,
 		      pack_hash);
 	else
@@ -1893,10 +1934,13 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 
 	free(objects);
 	strbuf_release(&index_name_buf);
+	strbuf_release(&rev_index_name_buf);
 	if (pack_name == NULL)
 		free((void *) curr_pack);
 	if (index_name == NULL)
 		free((void *) curr_index);
+	if (rev_index_name == NULL)
+		free((void *) curr_rev_index);
 
 	/*
 	 * Let the caller know this pack is not self contained
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
new file mode 100755
index 0000000000..2dae213126
--- /dev/null
+++ b/t/t5325-reverse-index.sh
@@ -0,0 +1,71 @@
+#!/bin/sh
+
+test_description='on-disk reverse index'
+. ./test-lib.sh
+
+packdir=.git/objects/pack
+
+test_expect_success 'setup' '
+	test_commit base &&
+
+	pack=$(git pack-objects --all $packdir/pack) &&
+	rev=$packdir/pack-$pack.rev &&
+
+	test_path_is_missing $rev
+'
+
+test_index_pack () {
+	rm -f $rev &&
+	conf=$1 &&
+	shift &&
+	# remove the index since Windows won't overwrite an existing file
+	rm $packdir/pack-$pack.idx &&
+	git -c pack.writeReverseIndex=$conf index-pack "$@" \
+		$packdir/pack-$pack.pack
+}
+
+test_expect_success 'index-pack with pack.writeReverseIndex' '
+	test_index_pack "" &&
+	test_path_is_missing $rev &&
+
+	test_index_pack false &&
+	test_path_is_missing $rev &&
+
+	test_index_pack true &&
+	test_path_is_file $rev
+'
+
+test_expect_success 'index-pack with --[no-]rev-index' '
+	for conf in "" true false
+	do
+		test_index_pack "$conf" --rev-index &&
+		test_path_exists $rev &&
+
+		test_index_pack "$conf" --no-rev-index &&
+		test_path_is_missing $rev
+	done
+'
+
+test_expect_success 'index-pack can verify reverse indexes' '
+	test_when_finished "rm -f $rev" &&
+	test_index_pack true &&
+
+	test_path_is_file $rev &&
+	git index-pack --rev-index --verify $packdir/pack-$pack.pack &&
+
+	# Intentionally corrupt the reverse index.
+	chmod u+w $rev &&
+	printf "xxxx" | dd of=$rev bs=1 count=4 conv=notrunc &&
+
+	test_must_fail git index-pack --rev-index --verify \
+		$packdir/pack-$pack.pack 2>err &&
+	grep "validation error" err
+'
+
+test_expect_success 'index-pack infers reverse index name with -o' '
+	git index-pack --rev-index -o other.idx $packdir/pack-$pack.pack &&
+	test_path_is_file other.idx &&
+	test_path_is_file other.rev
+'
+
+test_done
-- 
2.30.0.138.g6d7191ea01



* [PATCH v3 05/10] builtin/pack-objects.c: respect 'pack.writeReverseIndex'
  2021-01-25 23:37 ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                     ` (3 preceding siblings ...)
  2021-01-25 23:37   ` [PATCH v3 04/10] builtin/index-pack.c: write reverse indexes Taylor Blau
@ 2021-01-25 23:37   ` Taylor Blau
  2021-01-25 23:37   ` [PATCH v3 06/10] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex' Taylor Blau
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-25 23:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Now that we have an implementation that can write the new reverse index
format, enable writing a .rev file in 'git pack-objects' by consulting
the pack.writeReverseIndex configuration variable.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c   |  7 +++++++
 t/t5325-reverse-index.sh | 13 +++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5b0c4489e2..d784569200 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2955,6 +2955,13 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			    pack_idx_opts.version);
 		return 0;
 	}
+	if (!strcmp(k, "pack.writereverseindex")) {
+		if (git_config_bool(k, v))
+			pack_idx_opts.flags |= WRITE_REV;
+		else
+			pack_idx_opts.flags &= ~WRITE_REV;
+		return 0;
+	}
 	if (!strcmp(k, "uploadpack.blobpackfileuri")) {
 		struct configured_exclusion *ex = xmalloc(sizeof(*ex));
 		const char *oid_end, *pack_end;
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index 2dae213126..87040263b7 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -68,4 +68,17 @@ test_expect_success 'index-pack infers reverse index name with -o' '
 	test_path_is_file other.rev
 '
 
+test_expect_success 'pack-objects respects pack.writeReverseIndex' '
+	test_when_finished "rm -fr pack-1-*" &&
+
+	git -c pack.writeReverseIndex= pack-objects --all pack-1 &&
+	test_path_is_missing pack-1-*.rev &&
+
+	git -c pack.writeReverseIndex=false pack-objects --all pack-1 &&
+	test_path_is_missing pack-1-*.rev &&
+
+	git -c pack.writeReverseIndex=true pack-objects --all pack-1 &&
+	test_path_is_file pack-1-*.rev
+'
+
 test_done
-- 
2.30.0.138.g6d7191ea01



* [PATCH v3 06/10] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex'
  2021-01-25 23:37 ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                     ` (4 preceding siblings ...)
  2021-01-25 23:37   ` [PATCH v3 05/10] builtin/pack-objects.c: respect 'pack.writeReverseIndex' Taylor Blau
@ 2021-01-25 23:37   ` Taylor Blau
  2021-01-25 23:37   ` [PATCH v3 07/10] t: prepare for GIT_TEST_WRITE_REV_INDEX Taylor Blau
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-25 23:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Now that the pack.writeReverseIndex configuration is respected in both
'git index-pack' and 'git pack-objects' (and therefore, all of their
callers), we can safely advertise it for use in the git-config manual.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.txt | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index 837f1b1679..3da4ea98e2 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -133,3 +133,10 @@ pack.writeBitmapHashCache::
 	between an older, bitmapped pack and objects that have been
 	pushed since the last gc). The downside is that it consumes 4
 	bytes per object of disk space. Defaults to true.
+
+pack.writeReverseIndex::
+	When true, git will write a corresponding .rev file (see:
+	link:../technical/pack-format.html[Documentation/technical/pack-format.txt])
+	for each new packfile that it writes in all places except for
+	linkgit:git-fast-import[1] and in the bulk checkin mechanism.
+	Defaults to false.
-- 
2.30.0.138.g6d7191ea01



* [PATCH v3 07/10] t: prepare for GIT_TEST_WRITE_REV_INDEX
  2021-01-25 23:37 ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                     ` (5 preceding siblings ...)
  2021-01-25 23:37   ` [PATCH v3 06/10] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex' Taylor Blau
@ 2021-01-25 23:37   ` Taylor Blau
  2021-01-25 23:37   ` [PATCH v3 08/10] t: support GIT_TEST_WRITE_REV_INDEX Taylor Blau
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-25 23:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

In the next patch, we'll add support for unconditionally enabling the
'pack.writeReverseIndex' setting with a new GIT_TEST_WRITE_REV_INDEX
environment variable.

This causes a little bit of fallout in tests that, for example, list
the files in the pack directory and are not prepared to see .rev files
in the output.

Those locations can be cleaned up to look for specific file extensions,
rather than take everything in the pack directory (for instance) and
then grep out unwanted items.

Once the pack.writeReverseIndex option has been thoroughly tested, we
will default it to 'true' and remove GIT_TEST_WRITE_REV_INDEX, at which
point this patch can be reverted.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t5319-multi-pack-index.sh |  5 +++--
 t/t5325-reverse-index.sh    |  4 ++++
 t/t5604-clone-reference.sh  |  2 +-
 t/t5702-protocol-v2.sh      | 12 ++++++++----
 t/t6500-gc.sh               |  6 +++---
 t/t9300-fast-import.sh      |  5 ++++-
 6 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 297de502a9..2fc3aadbd1 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -710,8 +710,9 @@ test_expect_success 'expire respects .keep files' '
 		PACKA=$(ls .git/objects/pack/a-pack*\.pack | sed s/\.pack\$//) &&
 		touch $PACKA.keep &&
 		git multi-pack-index expire &&
-		ls -S .git/objects/pack/a-pack* | grep $PACKA >a-pack-files &&
-		test_line_count = 3 a-pack-files &&
+		test_path_is_file $PACKA.idx &&
+		test_path_is_file $PACKA.keep &&
+		test_path_is_file $PACKA.pack &&
 		test-tool read-midx .git/objects | grep idx >midx-list &&
 		test_line_count = 2 midx-list
 	)
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index 87040263b7..be452bb343 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -3,6 +3,10 @@
 test_description='on-disk reverse index'
 . ./test-lib.sh
 
+# The below tests want control over the 'pack.writeReverseIndex' setting
+# themselves to assert various combinations of it with other options.
+sane_unset GIT_TEST_WRITE_REV_INDEX
+
 packdir=.git/objects/pack
 
 test_expect_success 'setup' '
diff --git a/t/t5604-clone-reference.sh b/t/t5604-clone-reference.sh
index 5d682706ae..e845d621f6 100755
--- a/t/t5604-clone-reference.sh
+++ b/t/t5604-clone-reference.sh
@@ -329,7 +329,7 @@ test_expect_success SYMLINKS 'clone repo with symlinked or unknown files at obje
 	for raw in $(ls T*.raw)
 	do
 		sed -e "s!/../!/Y/!; s![0-9a-f]\{38,\}!Z!" -e "/commit-graph/d" \
-		    -e "/multi-pack-index/d" <$raw >$raw.de-sha-1 &&
+		    -e "/multi-pack-index/d" -e "/rev/d" <$raw >$raw.de-sha-1 &&
 		sort $raw.de-sha-1 >$raw.de-sha || return 1
 	done &&
 
diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index 3d994e0b1b..e8f0b4a299 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -851,8 +851,10 @@ test_expect_success 'part of packfile response provided as URI' '
 	test -f h2found &&
 
 	# Ensure that there are exactly 6 files (3 .pack and 3 .idx).
-	ls http_child/.git/objects/pack/* >filelist &&
-	test_line_count = 6 filelist
+	ls http_child/.git/objects/pack/*.pack >packlist &&
+	ls http_child/.git/objects/pack/*.idx >idxlist &&
+	test_line_count = 3 idxlist &&
+	test_line_count = 3 packlist
 '
 
 test_expect_success 'fetching with valid packfile URI but invalid hash fails' '
@@ -905,8 +907,10 @@ test_expect_success 'packfile-uri with transfer.fsckobjects' '
 		clone "$HTTPD_URL/smart/http_parent" http_child &&
 
 	# Ensure that there are exactly 4 files (2 .pack and 2 .idx).
-	ls http_child/.git/objects/pack/* >filelist &&
-	test_line_count = 4 filelist
+	ls http_child/.git/objects/pack/*.pack >packlist &&
+	ls http_child/.git/objects/pack/*.idx >idxlist &&
+	test_line_count = 2 idxlist &&
+	test_line_count = 2 packlist
 '
 
 test_expect_success 'packfile-uri with transfer.fsckobjects fails on bad object' '
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index 4a3b8f48ac..f76586f808 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -106,17 +106,17 @@ test_expect_success 'auto gc with too many loose objects does not attempt to cre
 	test_commit "$(test_oid obj2)" &&
 	# Our first gc will create a pack; our second will create a second pack
 	git gc --auto &&
-	ls .git/objects/pack | sort >existing_packs &&
+	ls .git/objects/pack/pack-*.pack | sort >existing_packs &&
 	test_commit "$(test_oid obj3)" &&
 	test_commit "$(test_oid obj4)" &&
 
 	git gc --auto 2>err &&
 	test_i18ngrep ! "^warning:" err &&
-	ls .git/objects/pack/ | sort >post_packs &&
+	ls .git/objects/pack/pack-*.pack | sort >post_packs &&
 	comm -1 -3 existing_packs post_packs >new &&
 	comm -2 -3 existing_packs post_packs >del &&
 	test_line_count = 0 del && # No packs are deleted
-	test_line_count = 2 new # There is one new pack and its .idx
+	test_line_count = 1 new # There is one new pack
 '
 
 test_expect_success 'gc --no-quiet' '
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 3d17e932a0..8f1caf8025 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -1632,7 +1632,10 @@ test_expect_success 'O: blank lines not necessary after other commands' '
 	INPUT_END
 
 	git fast-import <input &&
-	test 8 = $(find .git/objects/pack -type f | grep -v multi-pack-index | wc -l) &&
+	ls -la .git/objects/pack/pack-*.pack >packlist &&
+	ls -la .git/objects/pack/pack-*.idx >idxlist &&
+	test_line_count = 4 idxlist &&
+	test_line_count = 4 packlist &&
 	test $(git rev-parse refs/tags/O3-2nd) = $(git rev-parse O3^) &&
 	git log --reverse --pretty=oneline O3 | sed s/^.*z// >actual &&
 	test_cmp expect actual
-- 
2.30.0.138.g6d7191ea01



* [PATCH v3 08/10] t: support GIT_TEST_WRITE_REV_INDEX
  2021-01-25 23:37 ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                     ` (6 preceding siblings ...)
  2021-01-25 23:37   ` [PATCH v3 07/10] t: prepare for GIT_TEST_WRITE_REV_INDEX Taylor Blau
@ 2021-01-25 23:37   ` Taylor Blau
  2021-01-25 23:37   ` [PATCH v3 09/10] pack-revindex: ensure that on-disk reverse indexes are given precedence Taylor Blau
  2021-01-25 23:37   ` [PATCH v3 10/10] t5325: check both on-disk and in-memory reverse index Taylor Blau
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-25 23:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Add a new option that unconditionally enables the pack.writeReverseIndex
setting in order to run the whole test suite in a mode that generates
on-disk reverse indexes. Additionally, enable this mode in the second
run of tests under linux-gcc in 'ci/run-build-and-tests.sh'.
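
The intended precedence can be modelled with a small shell sketch. This
is a toy under stated assumptions: 'want_rev_index' is a hypothetical
name, and only a few spellings of "true" are recognized here, which is
narrower than what git_env_bool() accepts in the C code.

```shell
#!/bin/sh
# Toy model, not Git code: decide whether a .rev file would be written.
# $1 is the value that configuration or command-line flags would give
# ("true" or "false"). A true GIT_TEST_WRITE_REV_INDEX forces "true";
# anything else leaves the configured value alone.
want_rev_index () {
	config_value=$1
	case "${GIT_TEST_WRITE_REV_INDEX:-}" in
	1|true|yes|on)
		echo true
		;;
	*)
		echo "$config_value"
		;;
	esac
}

# Subshells keep the assignments from leaking into the caller.
(GIT_TEST_WRITE_REV_INDEX=1; want_rev_index false)
(unset GIT_TEST_WRITE_REV_INDEX; want_rev_index false)
```

Note that a false value of the variable does not turn the feature off in
this model; it only means the configured value is used unchanged.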

Once on-disk reverse indexes are proven out over several releases, we
can change the default value of that configuration to 'true', and drop
this patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/index-pack.c      | 5 ++++-
 builtin/pack-objects.c    | 2 ++
 ci/run-build-and-tests.sh | 1 +
 pack-revindex.h           | 3 +++
 t/README                  | 3 +++
 5 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index d5cd665b98..54f74c4874 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1748,7 +1748,10 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (prefix && chdir(prefix))
 		die(_("Cannot come back to cwd"));
 
-	rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
+	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
+		rev_index = 1;
+	else
+		rev_index = !!(opts.flags & (WRITE_REV_VERIFY | WRITE_REV));
 
 	for (i = 1; i < argc; i++) {
 		const char *arg = argv[i];
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index d784569200..24df0c98f7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3601,6 +3601,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	reset_pack_idx_option(&pack_idx_opts);
 	git_config(git_pack_config, NULL);
+	if (git_env_bool(GIT_TEST_WRITE_REV_INDEX, 0))
+		pack_idx_opts.flags |= WRITE_REV;
 
 	progress = isatty(2);
 	argc = parse_options(argc, argv, prefix, pack_objects_options,
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index 50e0b90073..a66b5e8c75 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -24,6 +24,7 @@ linux-gcc)
 	export GIT_TEST_MULTI_PACK_INDEX=1
 	export GIT_TEST_ADD_I_USE_BUILTIN=1
 	export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
+	export GIT_TEST_WRITE_REV_INDEX=1
 	make test
 	;;
 linux-clang)
diff --git a/pack-revindex.h b/pack-revindex.h
index 61b2f3ab75..d1a0595e89 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -16,9 +16,12 @@
  *   can be found
  */
 
+
 #define RIDX_SIGNATURE 0x52494458 /* "RIDX" */
 #define RIDX_VERSION 1
 
+#define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
+
 struct packed_git;
 
 /*
diff --git a/t/README b/t/README
index c730a70770..0f97a51640 100644
--- a/t/README
+++ b/t/README
@@ -439,6 +439,9 @@ GIT_TEST_DEFAULT_HASH=<hash-algo> specifies which hash algorithm to
 use in the test scripts. Recognized values for <hash-algo> are "sha1"
 and "sha256".
 
+GIT_TEST_WRITE_REV_INDEX=<boolean>, when true, enables the
+'pack.writeReverseIndex' setting.
+
 Naming Tests
 ------------
 
-- 
2.30.0.138.g6d7191ea01



* [PATCH v3 09/10] pack-revindex: ensure that on-disk reverse indexes are given precedence
  2021-01-25 23:37 ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                     ` (7 preceding siblings ...)
  2021-01-25 23:37   ` [PATCH v3 08/10] t: support GIT_TEST_WRITE_REV_INDEX Taylor Blau
@ 2021-01-25 23:37   ` Taylor Blau
  2021-01-25 23:37   ` [PATCH v3 10/10] t5325: check both on-disk and in-memory reverse index Taylor Blau
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-25 23:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

When an on-disk reverse index exists, there is no need to generate one
in memory. In fact, doing so can be slow, and require large amounts of
the heap.

Let's make sure that we treat the on-disk reverse index with precedence
(i.e., that when it exists, we don't bother trying to generate an
equivalent one in memory) by teaching Git how to conditionally die()
when generating a reverse index in memory.

Then, add a test to ensure that when (a) an on-disk reverse index
exists and (b) GIT_TEST_REV_INDEX_DIE_IN_MEMORY is set, we do not die,
implying that we read from the on-disk one.
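
The reasoning behind the test can be sketched as a toy shell model. It
is not Git code: 'load_revindex', its output strings, and the exit
status 128 are hypothetical stand-ins for the C code path, and only the
value "1" is treated as true here.

```shell
#!/bin/sh
# Toy model of the load order: use the on-disk file when it exists,
# otherwise fall back to the in-memory path, which can be told to fail.
load_revindex () {
	rev_file=$1
	if test -f "$rev_file"
	then
		echo "on-disk"
		return 0
	fi
	if test "${GIT_TEST_REV_INDEX_DIE_IN_MEMORY:-0}" = 1
	then
		echo "dying as requested by 'GIT_TEST_REV_INDEX_DIE_IN_MEMORY'" >&2
		return 128
	fi
	echo "in-memory"
}

tmp=$(mktemp -d)
touch "$tmp/pack-1.rev"
(
	GIT_TEST_REV_INDEX_DIE_IN_MEMORY=1
	# Succeeds: the .rev file exists, so the in-memory path is never hit.
	load_revindex "$tmp/pack-1.rev"
	# Fails: no .rev file, so the in-memory path runs and "dies".
	load_revindex "$tmp/missing.rev" || echo "exit status $?"
)
rm -rf "$tmp"
```

With the variable set, success can only mean that the fallback was never
reached, which is what the new test asserts.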

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-revindex.c          | 4 ++++
 pack-revindex.h          | 1 +
 t/t5325-reverse-index.sh | 9 +++++++++
 3 files changed, 14 insertions(+)

diff --git a/pack-revindex.c b/pack-revindex.c
index a174fa5388..83fe4de773 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -2,6 +2,7 @@
 #include "pack-revindex.h"
 #include "object-store.h"
 #include "packfile.h"
+#include "config.h"
 
 struct revindex_entry {
 	off_t offset;
@@ -166,6 +167,9 @@ static void create_pack_revindex(struct packed_git *p)
 
 static int create_pack_revindex_in_memory(struct packed_git *p)
 {
+	if (git_env_bool(GIT_TEST_REV_INDEX_DIE_IN_MEMORY, 0))
+		die("dying as requested by '%s'",
+		    GIT_TEST_REV_INDEX_DIE_IN_MEMORY);
 	if (open_pack_index(p))
 		return -1;
 	create_pack_revindex(p);
diff --git a/pack-revindex.h b/pack-revindex.h
index d1a0595e89..ba7c82c125 100644
--- a/pack-revindex.h
+++ b/pack-revindex.h
@@ -21,6 +21,7 @@
 #define RIDX_VERSION 1
 
 #define GIT_TEST_WRITE_REV_INDEX "GIT_TEST_WRITE_REV_INDEX"
+#define GIT_TEST_REV_INDEX_DIE_IN_MEMORY "GIT_TEST_REV_INDEX_DIE_IN_MEMORY"
 
 struct packed_git;
 
diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index be452bb343..a344b18d7e 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -85,4 +85,13 @@ test_expect_success 'pack-objects respects pack.writeReverseIndex' '
 	test_path_is_file pack-1-*.rev
 '
 
+test_expect_success 'reverse index is not generated when available on disk' '
+	test_index_pack true &&
+	test_path_is_file $rev &&
+
+	git rev-parse HEAD >tip &&
+	GIT_TEST_REV_INDEX_DIE_IN_MEMORY=1 git cat-file \
+		--batch-check="%(objectsize:disk)" <tip
+'
+
 test_done
-- 
2.30.0.138.g6d7191ea01



* [PATCH v3 10/10] t5325: check both on-disk and in-memory reverse index
  2021-01-25 23:37 ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Taylor Blau
                     ` (8 preceding siblings ...)
  2021-01-25 23:37   ` [PATCH v3 09/10] pack-revindex: ensure that on-disk reverse indexes are given precedence Taylor Blau
@ 2021-01-25 23:37   ` Taylor Blau
  9 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2021-01-25 23:37 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, jrnieder, peff

Right now, the test suite can be run with 'GIT_TEST_WRITE_REV_INDEX=1'
in the environment, which causes all operations which write a pack to
also write a .rev file.

To prepare for when that eventually becomes the default, we should
continue to test the in-memory reverse index, too, in order to avoid
losing existing coverage. Unfortunately, explicit existing coverage is
rather sparse, so only a basic test is added.

The new test is parameterized over whether or not the .rev file should
be written, and is run in both modes by t5325 (without having to touch
GIT_TEST_WRITE_REV_INDEX).

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t5325-reverse-index.sh | 45 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/t/t5325-reverse-index.sh b/t/t5325-reverse-index.sh
index a344b18d7e..b1dd726d0e 100755
--- a/t/t5325-reverse-index.sh
+++ b/t/t5325-reverse-index.sh
@@ -94,4 +94,49 @@ test_expect_success 'reverse index is not generated when available on disk' '
 		--batch-check="%(objectsize:disk)" <tip
 '
 
+revindex_tests () {
+	on_disk="$1"
+
+	test_expect_success "setup revindex tests (on_disk=$on_disk)" "
+		test_oid_cache <<-EOF &&
+		disklen sha1:47
+		disklen sha256:59
+		EOF
+
+		git init repo &&
+		(
+			cd repo &&
+
+			if test ztrue = \"z$on_disk\"
+			then
+				git config pack.writeReverseIndex true
+			fi &&
+
+			test_commit commit &&
+			git repack -ad
+		)
+
+	"
+
+	test_expect_success "check objectsize:disk (on_disk=$on_disk)" '
+		(
+			cd repo &&
+			git rev-parse HEAD^{tree} >tree &&
+			git cat-file --batch-check="%(objectsize:disk)" <tree >actual &&
+
+			git cat-file -p HEAD^{tree} &&
+
+			printf "%s\n" "$(test_oid disklen)" >expect &&
+			test_cmp expect actual
+		)
+	'
+
+	test_expect_success "cleanup revindex tests (on_disk=$on_disk)" '
+		rm -fr repo
+	'
+}
+
+revindex_tests "true"
+revindex_tests "false"
+
 test_done
-- 
2.30.0.138.g6d7191ea01


end of thread, other threads:[~2021-01-25 23:47 UTC | newest]

Thread overview: 54+ messages
2021-01-08 18:19 [PATCH 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
2021-01-08 18:19 ` [PATCH 1/8] packfile: prepare for the existence of '*.rev' files Taylor Blau
2021-01-08 18:20 ` [PATCH 2/8] pack-write.c: prepare to write 'pack-*.rev' files Taylor Blau
2021-01-08 18:20 ` [PATCH 3/8] builtin/index-pack.c: write reverse indexes Taylor Blau
2021-01-08 18:20 ` [PATCH 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex' Taylor Blau
2021-01-08 18:20 ` [PATCH 5/8] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex' Taylor Blau
2021-01-08 18:20 ` [PATCH 6/8] t: prepare for GIT_TEST_WRITE_REV_INDEX Taylor Blau
2021-01-12 17:11   ` Ævar Arnfjörð Bjarmason
2021-01-12 18:40     ` Taylor Blau
2021-01-08 18:20 ` [PATCH 7/8] t: support GIT_TEST_WRITE_REV_INDEX Taylor Blau
2021-01-12 16:49   ` Derrick Stolee
2021-01-12 17:34     ` Taylor Blau
2021-01-12 17:18   ` Ævar Arnfjörð Bjarmason
2021-01-12 17:39     ` Derrick Stolee
2021-01-12 18:17       ` Taylor Blau
2021-01-08 18:20 ` [PATCH 8/8] pack-revindex: ensure that on-disk reverse indexes are given precedence Taylor Blau
2021-01-13 22:28 ` [PATCH v2 0/8] pack-revindex: introduce on-disk '.rev' format Taylor Blau
2021-01-13 22:28   ` [PATCH v2 1/8] packfile: prepare for the existence of '*.rev' files Taylor Blau
2021-01-14  7:22     ` Junio C Hamano
2021-01-14 12:07       ` Derrick Stolee
2021-01-14 19:57         ` Jeff King
2021-01-14 18:28       ` Taylor Blau
2021-01-14  7:26     ` Junio C Hamano
2021-01-14 18:13       ` Taylor Blau
2021-01-14 20:57         ` Junio C Hamano
2021-01-22 22:54     ` Jeff King
2021-01-25 17:44       ` Taylor Blau
2021-01-25 18:27         ` Jeff King
2021-01-25 19:04         ` Junio C Hamano
2021-01-13 22:28   ` [PATCH v2 2/8] pack-write.c: prepare to write 'pack-*.rev' files Taylor Blau
2021-01-22 23:24     ` Jeff King
2021-01-13 22:28   ` [PATCH v2 3/8] builtin/index-pack.c: write reverse indexes Taylor Blau
2021-01-22 23:53     ` Jeff King
2021-01-25 20:03       ` Taylor Blau
2021-01-13 22:28   ` [PATCH v2 4/8] builtin/pack-objects.c: respect 'pack.writeReverseIndex' Taylor Blau
2021-01-22 23:57     ` Jeff King
2021-01-23  0:08       ` Jeff King
2021-01-25 20:21         ` Taylor Blau
2021-01-25 20:50           ` Jeff King
2021-01-13 22:28   ` [PATCH v2 5/8] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex' Taylor Blau
2021-01-13 22:28   ` [PATCH v2 6/8] t: prepare for GIT_TEST_WRITE_REV_INDEX Taylor Blau
2021-01-13 22:28   ` [PATCH v2 7/8] t: support GIT_TEST_WRITE_REV_INDEX Taylor Blau
2021-01-13 22:28   ` [PATCH v2 8/8] pack-revindex: ensure that on-disk reverse indexes are given precedence Taylor Blau
2021-01-25 23:37 ` [PATCH v3 00/10] pack-revindex: introduce on-disk '.rev' format Taylor Blau
2021-01-25 23:37   ` [PATCH v3 01/10] packfile: prepare for the existence of '*.rev' files Taylor Blau
2021-01-25 23:37   ` [PATCH v3 02/10] pack-write.c: prepare to write 'pack-*.rev' files Taylor Blau
2021-01-25 23:37   ` [PATCH v3 03/10] builtin/index-pack.c: allow stripping arbitrary extensions Taylor Blau
2021-01-25 23:37   ` [PATCH v3 04/10] builtin/index-pack.c: write reverse indexes Taylor Blau
2021-01-25 23:37   ` [PATCH v3 05/10] builtin/pack-objects.c: respect 'pack.writeReverseIndex' Taylor Blau
2021-01-25 23:37   ` [PATCH v3 06/10] Documentation/config/pack.txt: advertise 'pack.writeReverseIndex' Taylor Blau
2021-01-25 23:37   ` [PATCH v3 07/10] t: prepare for GIT_TEST_WRITE_REV_INDEX Taylor Blau
2021-01-25 23:37   ` [PATCH v3 08/10] t: support GIT_TEST_WRITE_REV_INDEX Taylor Blau
2021-01-25 23:37   ` [PATCH v3 09/10] pack-revindex: ensure that on-disk reverse indexes are given precedence Taylor Blau
2021-01-25 23:37   ` [PATCH v3 10/10] t5325: check both on-disk and in-memory reverse index Taylor Blau
