git@vger.kernel.org mailing list mirror (one of many)
* [PATCH 00/15] Refactor chunk-format into an API
@ 2020-12-03 16:16 Derrick Stolee via GitGitGadget
  2020-12-03 16:16 ` [PATCH 01/15] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
                   ` (15 more replies)
  0 siblings, 16 replies; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee

I was thinking about file formats recently and realized that the "chunks"
that are common to the commit-graph and multi-pack-index could inform future
file formats. To make that process easier, let's combine the process of
writing and reading chunks into a common API that both of these existing
formats use.

There is an immediate benefit: the writing and reading code for each
format gets a bit cleaner. Each format also had its own checks that made
the process more robust; now both share a common set of checks.

In particular, Szeder made some updates to the commit-graph writing process
that form the model for this API.

Thanks, -Stolee

Derrick Stolee (15):
  commit-graph: anonymize data in chunk_write_fn
  chunk-format: add API for writing table of contents
  midx: rename pack_info to write_midx_context
  midx: use context in write_midx_pack_names()
  midx: add entries to write_midx_context
  midx: add pack_perm to write_midx_context
  midx: add num_large_offsets to write_midx_context
  midx: convert chunk write methods to return int
  midx: drop chunk progress during write
  midx: use chunk-format API in write_midx_internal()
  midx: use 64-bit multiplication for chunk sizes
  chunk-format: create write_chunks()
  chunk-format: create chunk reading API
  commit-graph: restore duplicate chunk checks
  chunk-format: add technical docs

 Documentation/technical/chunk-format.txt      |  54 ++
 .../technical/commit-graph-format.txt         |   3 +
 Documentation/technical/pack-format.txt       |   3 +
 Makefile                                      |   1 +
 chunk-format.c                                | 105 ++++
 chunk-format.h                                |  69 +++
 commit-graph.c                                | 298 ++++++-----
 midx.c                                        | 466 ++++++++----------
 t/t5318-commit-graph.sh                       |   2 +-
 t/t5319-multi-pack-index.sh                   |   6 +-
 10 files changed, 623 insertions(+), 384 deletions(-)
 create mode 100644 Documentation/technical/chunk-format.txt
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h


base-commit: 72ffeb997eaf999f6938b2a7e0d9a75dcceaa311
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-804%2Fderrickstolee%2Fchunk-format%2Frefactor-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-804/derrickstolee/chunk-format/refactor-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/804
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 01/15] commit-graph: anonymize data in chunk_write_fn
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-03 16:16 ` [PATCH 02/15] chunk-format: add API for writing table of contents Derrick Stolee via GitGitGadget
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In preparation for creating an API around file formats using chunks and
tables of contents, prepare the commit-graph write code to use
prototypes that will match this new API.

Specifically, convert chunk_write_fn to take a "void *data" parameter
instead of the commit-graph-specific "struct write_commit_graph_context"
pointer.
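
The conversion follows a common C pattern: a generic callback type takes
"void *data", and each implementation casts back to its concrete context
type, so the code that invokes the callback no longer needs to know the
context's layout. A standalone sketch of the idea (not git's code: FILE *
stands in for git's "struct hashfile", and the names here are illustrative):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Generic callback: the driver only sees an anonymous pointer. */
typedef int (*chunk_write_fn)(FILE *f, void *data);

struct write_context {
	int value;
};

static int write_value_chunk(FILE *f, void *data)
{
	/* Each implementation recovers its concrete context type. */
	struct write_context *ctx = (struct write_context *)data;
	return fprintf(f, "%d", ctx->value) < 0 ? -1 : 0;
}

/* A driver that works for any context type, via the void pointer. */
static int write_one_chunk(FILE *f, chunk_write_fn fn, void *data)
{
	return fn(f, data);
}
```

This is what lets a later patch hoist the chunk_write_fn typedef out of
commit-graph.c into a shared header without dragging the
write_commit_graph_context definition along with it.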

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 6f62a07313..6b5bb8b6b8 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -980,8 +980,10 @@ struct write_commit_graph_context {
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
-				    struct write_commit_graph_context *ctx)
+				    void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	int i, count = 0;
 	struct commit **list = ctx->commits.list;
 
@@ -1006,8 +1008,10 @@ static int write_graph_chunk_fanout(struct hashfile *f,
 }
 
 static int write_graph_chunk_oids(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				  void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	struct commit **list = ctx->commits.list;
 	int count;
 	for (count = 0; count < ctx->commits.nr; count++, list++) {
@@ -1025,8 +1029,10 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
 }
 
 static int write_graph_chunk_data(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				  void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	uint32_t num_extra_edges = 0;
@@ -1127,8 +1133,10 @@ static int write_graph_chunk_data(struct hashfile *f,
 }
 
 static int write_graph_chunk_extra_edges(struct hashfile *f,
-					 struct write_commit_graph_context *ctx)
+					 void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	struct commit_list *parent;
@@ -1181,8 +1189,10 @@ static int write_graph_chunk_extra_edges(struct hashfile *f,
 }
 
 static int write_graph_chunk_bloom_indexes(struct hashfile *f,
-					   struct write_commit_graph_context *ctx)
+					   void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	uint32_t cur_pos = 0;
@@ -1216,8 +1226,10 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 }
 
 static int write_graph_chunk_bloom_data(struct hashfile *f,
-					struct write_commit_graph_context *ctx)
+					void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 
@@ -1670,8 +1682,10 @@ static int write_graph_chunk_base_1(struct hashfile *f,
 }
 
 static int write_graph_chunk_base(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				    void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	int num = write_graph_chunk_base_1(f, ctx->new_base_graph);
 
 	if (num != ctx->num_commit_graphs_after - 1) {
@@ -1683,7 +1697,7 @@ static int write_graph_chunk_base(struct hashfile *f,
 }
 
 typedef int (*chunk_write_fn)(struct hashfile *f,
-			      struct write_commit_graph_context *ctx);
+			      void *data);
 
 struct chunk_info {
 	uint32_t id;
-- 
gitgitgadget



* [PATCH 02/15] chunk-format: add API for writing table of contents
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
  2020-12-03 16:16 ` [PATCH 01/15] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-08 17:56   ` Taylor Blau
  2020-12-03 16:16 ` [PATCH 03/15] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The commit-graph and multi-pack-index formats share a concept of
"chunks" that are described by a table of contents near the beginning of
the file.

The table of contents consists of rows of 12 bytes. Each row starts with
a 4-byte ID that signals the type of data stored in the chunk. The row
then continues with an 8-byte offset describing the position in the file
where that data starts. The table of contents lists the chunks in
position order so the length of a chunk can be determined by subtracting
its start position from the start position of the next chunk.

The table of contents always ends with an entry whose ID is zero, which
marks the end of the last "real" chunk. The offset in this final entry
typically points to the trailing hash of the file.

Convert the chunk-writing loop in commit-graph.c to use the new
write_table_of_contents() method in chunk-format.c.

The most subtle part of this conversion is the use of 'cur_offset' to
allow the caller to specify how many bytes were written in the file's
header before the table of contents. This may differ between formats.
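
The layout can be sketched outside of git with a plain byte buffer
(hashwrite_be32(), hashwrite_be64(), and "struct hashfile" are git
internals; put_be32()/put_be64() below are local stand-ins, and
'header_len' plays the role of 'cur_offset'). Each of the 'nr + 1' rows
is 12 bytes: a 4-byte ID followed by an 8-byte big-endian file offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static void put_be32(unsigned char *out, uint32_t v)
{
	out[0] = v >> 24; out[1] = v >> 16; out[2] = v >> 8; out[3] = v;
}

static void put_be64(unsigned char *out, uint64_t v)
{
	put_be32(out, v >> 32);
	put_be32(out + 4, (uint32_t)v);
}

struct toc_chunk {
	uint32_t id;
	uint64_t size;
};

/*
 * Write 'nr' rows plus the terminating zero-ID row into 'out';
 * returns the number of bytes used. The first chunk begins right
 * after the header and the table of contents itself.
 */
static size_t write_toc(unsigned char *out, uint64_t header_len,
			const struct toc_chunk *chunks, int nr)
{
	uint64_t offset = header_len + (uint64_t)(nr + 1) * 12;
	size_t pos = 0;
	int i;

	for (i = 0; i < nr; i++) {
		put_be32(out + pos, chunks[i].id);
		put_be64(out + pos + 4, offset);
		pos += 12;
		offset += chunks[i].size;
	}
	put_be32(out + pos, 0);          /* terminating zero ID */
	put_be64(out + pos + 4, offset); /* start of the trailing hash */
	return pos + 12;
}
```

A reader can recover each chunk's length by subtracting its offset from
the offset in the next row, which is why the terminating row is needed
even though its ID carries no data.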

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile       |  1 +
 chunk-format.c | 26 ++++++++++++++++++++++++++
 chunk-format.h | 36 ++++++++++++++++++++++++++++++++++++
 commit-graph.c | 23 ++---------------------
 4 files changed, 65 insertions(+), 21 deletions(-)
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h

diff --git a/Makefile b/Makefile
index d3a531d3c6..cdbcadac14 100644
--- a/Makefile
+++ b/Makefile
@@ -854,6 +854,7 @@ LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += chdir-notify.o
 LIB_OBJS += checkout.o
+LIB_OBJS += chunk-format.o
 LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
diff --git a/chunk-format.c b/chunk-format.c
new file mode 100644
index 0000000000..771b6d98d0
--- /dev/null
+++ b/chunk-format.c
@@ -0,0 +1,26 @@
+#include "git-compat-util.h"
+#include "chunk-format.h"
+#include "csum-file.h"
+#define CHUNK_LOOKUP_WIDTH 12
+
+void write_table_of_contents(struct hashfile *f,
+			     uint64_t cur_offset,
+			     struct chunk_info *chunks,
+			     int nr)
+{
+	int i;
+
+	/* Add the table of contents to the current offset */
+	cur_offset += (nr + 1) * CHUNK_LOOKUP_WIDTH;
+
+	for (i = 0; i < nr; i++) {
+		hashwrite_be32(f, chunks[i].id);
+		hashwrite_be64(f, cur_offset);
+
+		cur_offset += chunks[i].size;
+	}
+
+	/* Trailing entry marks the end of the chunks */
+	hashwrite_be32(f, 0);
+	hashwrite_be64(f, cur_offset);
+}
diff --git a/chunk-format.h b/chunk-format.h
new file mode 100644
index 0000000000..4b9cbeb372
--- /dev/null
+++ b/chunk-format.h
@@ -0,0 +1,36 @@
+#ifndef CHUNK_FORMAT_H
+#define CHUNK_FORMAT_H
+
+#include "git-compat-util.h"
+
+struct hashfile;
+
+typedef int (*chunk_write_fn)(struct hashfile *f,
+			      void *data);
+
+/*
+ * When writing a chunk-based file format, collect the chunks in
+ * an array of chunk_info structs. The size stores the _expected_
+ * amount of data that will be written by write_fn.
+ */
+struct chunk_info {
+	uint32_t id;
+	uint64_t size;
+	chunk_write_fn write_fn;
+};
+
+/*
+ * Write the chunk data into the supplied hashfile.
+ *
+ * * 'cur_offset' indicates the number of bytes written to the hashfile
+ *   before the table of contents starts.
+ *
+ * * 'nr' is the number of chunks with non-zero IDs, so 'nr + 1'
+ *   chunks are written in total.
+ */
+void write_table_of_contents(struct hashfile *f,
+			     uint64_t cur_offset,
+			     struct chunk_info *chunks,
+			     int nr);
+
+#endif
diff --git a/commit-graph.c b/commit-graph.c
index 6b5bb8b6b8..5494fda1d3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -19,6 +19,7 @@
 #include "shallow.h"
 #include "json-writer.h"
 #include "trace2.h"
+#include "chunk-format.h"
 
 void git_test_write_commit_graph_or_die(void)
 {
@@ -1696,15 +1697,6 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
-typedef int (*chunk_write_fn)(struct hashfile *f,
-			      void *data);
-
-struct chunk_info {
-	uint32_t id;
-	uint64_t size;
-	chunk_write_fn write_fn;
-};
-
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
@@ -1715,7 +1707,6 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
-	uint64_t chunk_offset;
 	struct object_id file_hash;
 
 	if (ctx->split) {
@@ -1805,17 +1796,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	hashwrite_u8(f, num_chunks);
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
-	chunk_offset = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
-	for (i = 0; i <= num_chunks; i++) {
-		uint32_t chunk_write[3];
-
-		chunk_write[0] = htonl(chunks[i].id);
-		chunk_write[1] = htonl(chunk_offset >> 32);
-		chunk_write[2] = htonl(chunk_offset & 0xffffffff);
-		hashwrite(f, chunk_write, 12);
-
-		chunk_offset += chunks[i].size;
-	}
+	write_table_of_contents(f, /* cur_offset */ 8, chunks, num_chunks);
 
 	if (ctx->report_progress) {
 		strbuf_addf(&progress_title,
-- 
gitgitgadget



* [PATCH 03/15] midx: rename pack_info to write_midx_context
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
  2020-12-03 16:16 ` [PATCH 01/15] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
  2020-12-03 16:16 ` [PATCH 02/15] chunk-format: add API for writing table of contents Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-03 16:16 ` [PATCH 04/15] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to streamline our chunk-based file formats, align some of
the code structure in write_midx_internal() to be similar to the
patterns in write_commit_graph_file().

Specifically, let's create a "struct write_midx_context" that can be
used as a data parameter to abstract function types.

This change only renames "struct pack_info" to "struct
write_midx_context" and the names of instances from "packs" to "ctx". In
future changes, we will expand the data inside "struct
write_midx_context" and align our chunk-writing method with the
chunk-format API.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 130 ++++++++++++++++++++++++++++-----------------------------
 1 file changed, 65 insertions(+), 65 deletions(-)

diff --git a/midx.c b/midx.c
index da03c1449a..ded4d394bb 100644
--- a/midx.c
+++ b/midx.c
@@ -451,7 +451,7 @@ static int pack_info_compare(const void *_a, const void *_b)
 	return strcmp(a->pack_name, b->pack_name);
 }
 
-struct pack_list {
+struct write_midx_context {
 	struct pack_info *info;
 	uint32_t nr;
 	uint32_t alloc;
@@ -463,37 +463,37 @@ struct pack_list {
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 			     const char *file_name, void *data)
 {
-	struct pack_list *packs = (struct pack_list *)data;
+	struct write_midx_context *ctx = (struct write_midx_context *)data;
 
 	if (ends_with(file_name, ".idx")) {
-		display_progress(packs->progress, ++packs->pack_paths_checked);
-		if (packs->m && midx_contains_pack(packs->m, file_name))
+		display_progress(ctx->progress, ++ctx->pack_paths_checked);
+		if (ctx->m && midx_contains_pack(ctx->m, file_name))
 			return;
 
-		ALLOC_GROW(packs->info, packs->nr + 1, packs->alloc);
+		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
 
-		packs->info[packs->nr].p = add_packed_git(full_path,
-							  full_path_len,
-							  0);
+		ctx->info[ctx->nr].p = add_packed_git(full_path,
+						      full_path_len,
+						      0);
 
-		if (!packs->info[packs->nr].p) {
+		if (!ctx->info[ctx->nr].p) {
 			warning(_("failed to add packfile '%s'"),
 				full_path);
 			return;
 		}
 
-		if (open_pack_index(packs->info[packs->nr].p)) {
+		if (open_pack_index(ctx->info[ctx->nr].p)) {
 			warning(_("failed to open pack-index '%s'"),
 				full_path);
-			close_pack(packs->info[packs->nr].p);
-			FREE_AND_NULL(packs->info[packs->nr].p);
+			close_pack(ctx->info[ctx->nr].p);
+			FREE_AND_NULL(ctx->info[ctx->nr].p);
 			return;
 		}
 
-		packs->info[packs->nr].pack_name = xstrdup(file_name);
-		packs->info[packs->nr].orig_pack_int_id = packs->nr;
-		packs->info[packs->nr].expired = 0;
-		packs->nr++;
+		ctx->info[ctx->nr].pack_name = xstrdup(file_name);
+		ctx->info[ctx->nr].orig_pack_int_id = ctx->nr;
+		ctx->info[ctx->nr].expired = 0;
+		ctx->nr++;
 	}
 }
 
@@ -801,7 +801,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint32_t i;
 	struct hashfile *f = NULL;
 	struct lock_file lk;
-	struct pack_list packs;
+	struct write_midx_context ctx = { 0 };
 	uint32_t *pack_perm = NULL;
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
@@ -820,40 +820,40 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 			  midx_name);
 
 	if (m)
-		packs.m = m;
+		ctx.m = m;
 	else
-		packs.m = load_multi_pack_index(object_dir, 1);
-
-	packs.nr = 0;
-	packs.alloc = packs.m ? packs.m->num_packs : 16;
-	packs.info = NULL;
-	ALLOC_ARRAY(packs.info, packs.alloc);
-
-	if (packs.m) {
-		for (i = 0; i < packs.m->num_packs; i++) {
-			ALLOC_GROW(packs.info, packs.nr + 1, packs.alloc);
-
-			packs.info[packs.nr].orig_pack_int_id = i;
-			packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
-			packs.info[packs.nr].p = NULL;
-			packs.info[packs.nr].expired = 0;
-			packs.nr++;
+		ctx.m = load_multi_pack_index(object_dir, 1);
+
+	ctx.nr = 0;
+	ctx.alloc = ctx.m ? ctx.m->num_packs : 16;
+	ctx.info = NULL;
+	ALLOC_ARRAY(ctx.info, ctx.alloc);
+
+	if (ctx.m) {
+		for (i = 0; i < ctx.m->num_packs; i++) {
+			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
+
+			ctx.info[ctx.nr].orig_pack_int_id = i;
+			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
+			ctx.info[ctx.nr].p = NULL;
+			ctx.info[ctx.nr].expired = 0;
+			ctx.nr++;
 		}
 	}
 
-	packs.pack_paths_checked = 0;
+	ctx.pack_paths_checked = 0;
 	if (flags & MIDX_PROGRESS)
-		packs.progress = start_delayed_progress(_("Adding packfiles to multi-pack-index"), 0);
+		ctx.progress = start_delayed_progress(_("Adding packfiles to multi-pack-index"), 0);
 	else
-		packs.progress = NULL;
+		ctx.progress = NULL;
 
-	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
-	stop_progress(&packs.progress);
+	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
+	stop_progress(&ctx.progress);
 
-	if (packs.m && packs.nr == packs.m->num_packs && !packs_to_drop)
+	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
-	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
+	entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &nr_entries);
 
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
@@ -862,19 +862,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 			large_offsets_needed = 1;
 	}
 
-	QSORT(packs.info, packs.nr, pack_info_compare);
+	QSORT(ctx.info, ctx.nr, pack_info_compare);
 
 	if (packs_to_drop && packs_to_drop->nr) {
 		int drop_index = 0;
 		int missing_drops = 0;
 
-		for (i = 0; i < packs.nr && drop_index < packs_to_drop->nr; i++) {
-			int cmp = strcmp(packs.info[i].pack_name,
+		for (i = 0; i < ctx.nr && drop_index < packs_to_drop->nr; i++) {
+			int cmp = strcmp(ctx.info[i].pack_name,
 					 packs_to_drop->items[drop_index].string);
 
 			if (!cmp) {
 				drop_index++;
-				packs.info[i].expired = 1;
+				ctx.info[i].expired = 1;
 			} else if (cmp > 0) {
 				error(_("did not see pack-file %s to drop"),
 				      packs_to_drop->items[drop_index].string);
@@ -882,7 +882,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				missing_drops++;
 				i--;
 			} else {
-				packs.info[i].expired = 0;
+				ctx.info[i].expired = 0;
 			}
 		}
 
@@ -898,19 +898,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	 *
 	 * pack_perm[old_id] = new_id
 	 */
-	ALLOC_ARRAY(pack_perm, packs.nr);
-	for (i = 0; i < packs.nr; i++) {
-		if (packs.info[i].expired) {
+	ALLOC_ARRAY(pack_perm, ctx.nr);
+	for (i = 0; i < ctx.nr; i++) {
+		if (ctx.info[i].expired) {
 			dropped_packs++;
-			pack_perm[packs.info[i].orig_pack_int_id] = PACK_EXPIRED;
+			pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
 		} else {
-			pack_perm[packs.info[i].orig_pack_int_id] = i - dropped_packs;
+			pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
 		}
 	}
 
-	for (i = 0; i < packs.nr; i++) {
-		if (!packs.info[i].expired)
-			pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
+	for (i = 0; i < ctx.nr; i++) {
+		if (!ctx.info[i].expired)
+			pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
 	}
 
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
@@ -921,19 +921,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	if (packs.m)
-		close_midx(packs.m);
+	if (ctx.m)
+		close_midx(ctx.m);
 
 	cur_chunk = 0;
 	num_chunks = large_offsets_needed ? 5 : 4;
 
-	if (packs.nr - dropped_packs == 0) {
+	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
 		result = 1;
 		goto cleanup;
 	}
 
-	written = write_midx_header(f, num_chunks, packs.nr - dropped_packs);
+	written = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
 
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
@@ -990,7 +990,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, packs.info, packs.nr);
+				written += write_midx_pack_names(f, ctx.info, ctx.nr);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
@@ -1027,15 +1027,15 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	commit_lock_file(&lk);
 
 cleanup:
-	for (i = 0; i < packs.nr; i++) {
-		if (packs.info[i].p) {
-			close_pack(packs.info[i].p);
-			free(packs.info[i].p);
+	for (i = 0; i < ctx.nr; i++) {
+		if (ctx.info[i].p) {
+			close_pack(ctx.info[i].p);
+			free(ctx.info[i].p);
 		}
-		free(packs.info[i].pack_name);
+		free(ctx.info[i].pack_name);
 	}
 
-	free(packs.info);
+	free(ctx.info);
 	free(entries);
 	free(pack_perm);
 	free(midx_name);
-- 
gitgitgadget



* [PATCH 04/15] midx: use context in write_midx_pack_names()
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2020-12-03 16:16 ` [PATCH 03/15] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-03 16:16 ` [PATCH 05/15] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format
API, start converting chunk writing methods to match chunk_write_fn. The
first case is to convert write_midx_pack_names() to take "void *data".
We already have the necessary data in "struct write_midx_context", so
this conversion is rather mechanical.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/midx.c b/midx.c
index ded4d394bb..6ab655ddda 100644
--- a/midx.c
+++ b/midx.c
@@ -643,27 +643,26 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 	return deduplicated_entries;
 }
 
-static size_t write_midx_pack_names(struct hashfile *f,
-				    struct pack_info *info,
-				    uint32_t num_packs)
+static size_t write_midx_pack_names(struct hashfile *f, void *data)
 {
+	struct write_midx_context *ctx = (struct write_midx_context *)data;
 	uint32_t i;
 	unsigned char padding[MIDX_CHUNK_ALIGNMENT];
 	size_t written = 0;
 
-	for (i = 0; i < num_packs; i++) {
+	for (i = 0; i < ctx->nr; i++) {
 		size_t writelen;
 
-		if (info[i].expired)
+		if (ctx->info[i].expired)
 			continue;
 
-		if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
+		if (i && strcmp(ctx->info[i].pack_name, ctx->info[i - 1].pack_name) <= 0)
 			BUG("incorrect pack-file order: %s before %s",
-			    info[i - 1].pack_name,
-			    info[i].pack_name);
+			    ctx->info[i - 1].pack_name,
+			    ctx->info[i].pack_name);
 
-		writelen = strlen(info[i].pack_name) + 1;
-		hashwrite(f, info[i].pack_name, writelen);
+		writelen = strlen(ctx->info[i].pack_name) + 1;
+		hashwrite(f, ctx->info[i].pack_name, writelen);
 		written += writelen;
 	}
 
@@ -990,7 +989,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, ctx.info, ctx.nr);
+				written += write_midx_pack_names(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-- 
gitgitgadget



* [PATCH 05/15] midx: add entries to write_midx_context
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2020-12-03 16:16 ` [PATCH 04/15] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-03 21:42   ` Junio C Hamano
  2020-12-08 18:00   ` Taylor Blau
  2020-12-03 16:16 ` [PATCH 06/15] midx: add pack_perm " Derrick Stolee via GitGitGadget
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "struct pack_midx_entry *entries" list and its count
into the context.

Update write_midx_oid_fanout() and write_midx_oid_lookup() to take the
context directly, as these are easy conversions with this new data.

Only the callers of write_midx_object_offsets() and
write_midx_large_offsets() are updated here, since additional data must
be added to the context before those methods can match chunk_write_fn.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 49 ++++++++++++++++++++++++++-----------------------
 1 file changed, 26 insertions(+), 23 deletions(-)

diff --git a/midx.c b/midx.c
index 6ab655ddda..2af4452165 100644
--- a/midx.c
+++ b/midx.c
@@ -458,6 +458,9 @@ struct write_midx_context {
 	struct multi_pack_index *m;
 	struct progress *progress;
 	unsigned pack_paths_checked;
+
+	struct pack_midx_entry *entries;
+	uint32_t entries_nr;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -678,11 +681,11 @@ static size_t write_midx_pack_names(struct hashfile *f, void *data)
 }
 
 static size_t write_midx_oid_fanout(struct hashfile *f,
-				    struct pack_midx_entry *objects,
-				    uint32_t nr_objects)
+				    void *data)
 {
-	struct pack_midx_entry *list = objects;
-	struct pack_midx_entry *last = objects + nr_objects;
+	struct write_midx_context *ctx = (struct write_midx_context *)data;
+	struct pack_midx_entry *list = ctx->entries;
+	struct pack_midx_entry *last = ctx->entries + ctx->entries_nr;
 	uint32_t count = 0;
 	uint32_t i;
 
@@ -706,18 +709,19 @@ static size_t write_midx_oid_fanout(struct hashfile *f,
 	return MIDX_CHUNK_FANOUT_SIZE;
 }
 
-static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
-				    struct pack_midx_entry *objects,
-				    uint32_t nr_objects)
+static size_t write_midx_oid_lookup(struct hashfile *f,
+				    void *data)
 {
-	struct pack_midx_entry *list = objects;
+	struct write_midx_context *ctx = (struct write_midx_context *)data;
+	unsigned char hash_len = the_hash_algo->rawsz;
+	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i;
 	size_t written = 0;
 
-	for (i = 0; i < nr_objects; i++) {
+	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
 
-		if (i < nr_objects - 1) {
+		if (i < ctx->entries_nr - 1) {
 			struct pack_midx_entry *next = list;
 			if (oidcmp(&obj->oid, &next->oid) >= 0)
 				BUG("OIDs not in order: %s >= %s",
@@ -805,8 +809,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	uint32_t nr_entries, num_large_offsets = 0;
-	struct pack_midx_entry *entries = NULL;
+	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
 	int large_offsets_needed = 0;
 	int pack_name_concat_len = 0;
@@ -852,12 +855,12 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
-	entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &nr_entries);
+	ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr);
 
-	for (i = 0; i < nr_entries; i++) {
-		if (entries[i].offset > 0x7fffffff)
+	for (i = 0; i < ctx.entries_nr; i++) {
+		if (ctx.entries[i].offset > 0x7fffffff)
 			num_large_offsets++;
-		if (entries[i].offset > 0xffffffff)
+		if (ctx.entries[i].offset > 0xffffffff)
 			large_offsets_needed = 1;
 	}
 
@@ -947,10 +950,10 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * the_hash_algo->rawsz;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;
 
 	cur_chunk++;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_CHUNK_OFFSET_WIDTH;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
 	if (large_offsets_needed) {
 		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
 
@@ -993,19 +996,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-				written += write_midx_oid_fanout(f, entries, nr_entries);
+				written += write_midx_oid_fanout(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDLOOKUP:
-				written += write_midx_oid_lookup(f, the_hash_algo->rawsz, entries, nr_entries);
+				written += write_midx_oid_lookup(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, entries, nr_entries);
+				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, ctx.entries, ctx.entries_nr);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, num_large_offsets, entries, nr_entries);
+				written += write_midx_large_offsets(f, num_large_offsets, ctx.entries, ctx.entries_nr);
 				break;
 
 			default:
@@ -1035,7 +1038,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	}
 
 	free(ctx.info);
-	free(entries);
+	free(ctx.entries);
 	free(pack_perm);
 	free(midx_name);
 	return result;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 06/15] midx: add pack_perm to write_midx_context
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2020-12-03 16:16 ` [PATCH 05/15] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-03 16:16 ` [PATCH 07/15] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t *pack_perm" array and the
"large_offsets_needed" bit into the context.

Update write_midx_object_offsets() to match chunk_write_fn.
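
The chunk_write_fn shape that write_midx_object_offsets() is being
aligned to threads all writer state through a single opaque pointer. A
minimal sketch of that callback pattern (all names here are hypothetical
stand-ins, not Git's actual types):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for Git's hashfile and context types. */
struct hashfile_sim {
	uint64_t written; /* bytes "written" so far */
};

struct demo_context {
	const uint32_t *values;
	uint32_t nr;
};

/* The callback shape: all state arrives through one void pointer. */
typedef int (*chunk_write_fn)(struct hashfile_sim *f, void *data);

static int write_values(struct hashfile_sim *f, void *data)
{
	struct demo_context *ctx = (struct demo_context *)data;
	uint32_t i;

	for (i = 0; i < ctx->nr; i++)
		f->written += sizeof(ctx->values[i]); /* pretend hashwrite_be32() */
	return 0;
}

/* A driver needs only the function pointer and the opaque context. */
static uint64_t run_writer(chunk_write_fn fn, void *data)
{
	struct hashfile_sim f = { 0 };

	if (fn(&f, data))
		return 0;
	return f.written;
}
```

Casting the void pointer back to the context struct is what lets
otherwise different writers share one function-pointer type.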

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 40 +++++++++++++++++++++-------------------
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/midx.c b/midx.c
index 2af4452165..f7c3a54a33 100644
--- a/midx.c
+++ b/midx.c
@@ -461,6 +461,9 @@ struct write_midx_context {
 
 	struct pack_midx_entry *entries;
 	uint32_t entries_nr;
+
+	uint32_t *pack_perm;
+	unsigned large_offsets_needed:1;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -736,27 +739,27 @@ static size_t write_midx_oid_lookup(struct hashfile *f,
 	return written;
 }
 
-static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_needed,
-					uint32_t *perm,
-					struct pack_midx_entry *objects, uint32_t nr_objects)
+static size_t write_midx_object_offsets(struct hashfile *f,
+					void *data)
 {
-	struct pack_midx_entry *list = objects;
+	struct write_midx_context *ctx = (struct write_midx_context *)data;
+	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i, nr_large_offset = 0;
 	size_t written = 0;
 
-	for (i = 0; i < nr_objects; i++) {
+	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
 
-		if (perm[obj->pack_int_id] == PACK_EXPIRED)
+		if (ctx->pack_perm[obj->pack_int_id] == PACK_EXPIRED)
 			BUG("object %s is in an expired pack with int-id %d",
 			    oid_to_hex(&obj->oid),
 			    obj->pack_int_id);
 
-		hashwrite_be32(f, perm[obj->pack_int_id]);
+		hashwrite_be32(f, ctx->pack_perm[obj->pack_int_id]);
 
-		if (large_offset_needed && obj->offset >> 31)
+		if (ctx->large_offsets_needed && obj->offset >> 31)
 			hashwrite_be32(f, MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);
-		else if (!large_offset_needed && obj->offset >> 32)
+		else if (!ctx->large_offsets_needed && obj->offset >> 32)
 			BUG("object %s requires a large offset (%"PRIx64") but the MIDX is not writing large offsets!",
 			    oid_to_hex(&obj->oid),
 			    obj->offset);
@@ -805,13 +808,11 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
-	uint32_t *pack_perm = NULL;
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
-	int large_offsets_needed = 0;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -857,11 +858,12 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr);
 
+	ctx.large_offsets_needed = 0;
 	for (i = 0; i < ctx.entries_nr; i++) {
 		if (ctx.entries[i].offset > 0x7fffffff)
 			num_large_offsets++;
 		if (ctx.entries[i].offset > 0xffffffff)
-			large_offsets_needed = 1;
+			ctx.large_offsets_needed = 1;
 	}
 
 	QSORT(ctx.info, ctx.nr, pack_info_compare);
@@ -900,13 +902,13 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	 *
 	 * pack_perm[old_id] = new_id
 	 */
-	ALLOC_ARRAY(pack_perm, ctx.nr);
+	ALLOC_ARRAY(ctx.pack_perm, ctx.nr);
 	for (i = 0; i < ctx.nr; i++) {
 		if (ctx.info[i].expired) {
 			dropped_packs++;
-			pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
+			ctx.pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
 		} else {
-			pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
+			ctx.pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
 		}
 	}
 
@@ -927,7 +929,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		close_midx(ctx.m);
 
 	cur_chunk = 0;
-	num_chunks = large_offsets_needed ? 5 : 4;
+	num_chunks = ctx.large_offsets_needed ? 5 : 4;
 
 	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
@@ -954,7 +956,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	cur_chunk++;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
-	if (large_offsets_needed) {
+	if (ctx.large_offsets_needed) {
 		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
 
 		cur_chunk++;
@@ -1004,7 +1006,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, ctx.entries, ctx.entries_nr);
+				written += write_midx_object_offsets(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
@@ -1039,7 +1041,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	free(ctx.info);
 	free(ctx.entries);
-	free(pack_perm);
+	free(ctx.pack_perm);
 	free(midx_name);
 	return result;
 }
-- 
gitgitgadget



* [PATCH 07/15] midx: add num_large_offsets to write_midx_context
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2020-12-03 16:16 ` [PATCH 06/15] midx: add pack_perm " Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-03 16:16 ` [PATCH 08/15] midx: convert chunk write methods to return int Derrick Stolee via GitGitGadget
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t num_large_offsets" into the context. With
this new data, write_midx_large_offsets() now matches the
chunk_write_fn type.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/midx.c b/midx.c
index f7c3a54a33..d7da358a3f 100644
--- a/midx.c
+++ b/midx.c
@@ -464,6 +464,7 @@ struct write_midx_context {
 
 	uint32_t *pack_perm;
 	unsigned large_offsets_needed:1;
+	uint32_t num_large_offsets;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -772,11 +773,14 @@ static size_t write_midx_object_offsets(struct hashfile *f,
 	return written;
 }
 
-static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_offset,
-				       struct pack_midx_entry *objects, uint32_t nr_objects)
+static size_t write_midx_large_offsets(struct hashfile *f,
+				       void *data)
 {
-	struct pack_midx_entry *list = objects, *end = objects + nr_objects;
+	struct write_midx_context *ctx = (struct write_midx_context *)data;
+	struct pack_midx_entry *list = ctx->entries;
+	struct pack_midx_entry *end = ctx->entries + ctx->entries_nr;
 	size_t written = 0;
+	uint32_t nr_large_offset = ctx->num_large_offsets;
 
 	while (nr_large_offset) {
 		struct pack_midx_entry *obj;
@@ -811,7 +815,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
@@ -861,7 +864,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	ctx.large_offsets_needed = 0;
 	for (i = 0; i < ctx.entries_nr; i++) {
 		if (ctx.entries[i].offset > 0x7fffffff)
-			num_large_offsets++;
+			ctx.num_large_offsets++;
 		if (ctx.entries[i].offset > 0xffffffff)
 			ctx.large_offsets_needed = 1;
 	}
@@ -961,7 +964,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		cur_chunk++;
 		chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
-					   num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
+					   ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
 	}
 
 	chunk_ids[cur_chunk] = 0;
@@ -1010,7 +1013,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, num_large_offsets, ctx.entries, ctx.entries_nr);
+				written += write_midx_large_offsets(f, &ctx);
 				break;
 
 			default:
-- 
gitgitgadget



* [PATCH 08/15] midx: convert chunk write methods to return int
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2020-12-03 16:16 ` [PATCH 07/15] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-03 21:50   ` Junio C Hamano
  2020-12-03 16:16 ` [PATCH 09/15] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Historically, the chunk-writing methods in midx.c have returned the
amount of data written so the caller could compare it against the
table of contents. This presents some interesting issues:

1. If a chunk writing method has a bug that miscalculates the written
   bytes, then we can satisfy the table of contents without actually
   writing the right amount of data to the hashfile. The commit-graph
   writing code checks the hashfile struct directly for a more robust
   verification.

2. There is no way for a chunk writing method to gracefully fail.
   Returning an int presents an opportunity to fail without a die().

3. The current pattern doesn't match the chunk_write_fn type exactly, so
   we cannot share code with commit-graph.c.

For these reasons, convert the midx chunk writer methods to return an
'int'. Since none of them fail at the moment, they all return 0.
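
The convention being adopted can be sketched in isolation (hypothetical
names; Git's real writers operate on a struct hashfile): a writer
returns 0 on success and a non-zero code on failure, and the caller
propagates that code instead of die()-ing:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical error code for this sketch. */
#define DEMO_ERR_TOO_BIG 1

static int write_demo_chunk(uint32_t nr_entries, uint32_t max_entries)
{
	if (nr_entries > max_entries)
		return DEMO_ERR_TOO_BIG; /* report, don't die() */
	/* ... write nr_entries records here ... */
	return 0;
}

/* The caller can now clean up and fail gracefully. */
static int write_demo_file(uint32_t nr_entries)
{
	int result = write_demo_chunk(nr_entries, 1000);

	if (result)
		return result; /* propagate instead of aborting */
	return 0;
}
```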

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 63 +++++++++++++++++++++++++---------------------------------
 1 file changed, 27 insertions(+), 36 deletions(-)

diff --git a/midx.c b/midx.c
index d7da358a3f..5eb1b01946 100644
--- a/midx.c
+++ b/midx.c
@@ -650,7 +650,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 	return deduplicated_entries;
 }
 
-static size_t write_midx_pack_names(struct hashfile *f, void *data)
+static int write_midx_pack_names(struct hashfile *f, void *data)
 {
 	struct write_midx_context *ctx = (struct write_midx_context *)data;
 	uint32_t i;
@@ -678,14 +678,13 @@ static size_t write_midx_pack_names(struct hashfile *f, void *data)
 	if (i < MIDX_CHUNK_ALIGNMENT) {
 		memset(padding, 0, sizeof(padding));
 		hashwrite(f, padding, i);
-		written += i;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_oid_fanout(struct hashfile *f,
-				    void *data)
+static int write_midx_oid_fanout(struct hashfile *f,
+				 void *data)
 {
 	struct write_midx_context *ctx = (struct write_midx_context *)data;
 	struct pack_midx_entry *list = ctx->entries;
@@ -710,17 +709,16 @@ static size_t write_midx_oid_fanout(struct hashfile *f,
 		list = next;
 	}
 
-	return MIDX_CHUNK_FANOUT_SIZE;
+	return 0;
 }
 
-static size_t write_midx_oid_lookup(struct hashfile *f,
-				    void *data)
+static int write_midx_oid_lookup(struct hashfile *f,
+				 void *data)
 {
 	struct write_midx_context *ctx = (struct write_midx_context *)data;
 	unsigned char hash_len = the_hash_algo->rawsz;
 	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i;
-	size_t written = 0;
 
 	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
@@ -734,19 +732,17 @@ static size_t write_midx_oid_lookup(struct hashfile *f,
 		}
 
 		hashwrite(f, obj->oid.hash, (int)hash_len);
-		written += hash_len;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_object_offsets(struct hashfile *f,
-					void *data)
+static int write_midx_object_offsets(struct hashfile *f,
+				     void *data)
 {
 	struct write_midx_context *ctx = (struct write_midx_context *)data;
 	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i, nr_large_offset = 0;
-	size_t written = 0;
 
 	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
@@ -766,20 +762,17 @@ static size_t write_midx_object_offsets(struct hashfile *f,
 			    obj->offset);
 		else
 			hashwrite_be32(f, (uint32_t)obj->offset);
-
-		written += MIDX_CHUNK_OFFSET_WIDTH;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_large_offsets(struct hashfile *f,
-				       void *data)
+static int write_midx_large_offsets(struct hashfile *f,
+				    void *data)
 {
 	struct write_midx_context *ctx = (struct write_midx_context *)data;
 	struct pack_midx_entry *list = ctx->entries;
 	struct pack_midx_entry *end = ctx->entries + ctx->entries_nr;
-	size_t written = 0;
 	uint32_t nr_large_offset = ctx->num_large_offsets;
 
 	while (nr_large_offset) {
@@ -795,12 +788,12 @@ static size_t write_midx_large_offsets(struct hashfile *f,
 		if (!(offset >> 31))
 			continue;
 
-		written += hashwrite_be64(f, offset);
+		hashwrite_be64(f, offset);
 
 		nr_large_offset--;
 	}
 
-	return written;
+	return 0;
 }
 
 static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
@@ -812,7 +805,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
-	uint64_t written = 0;
+	uint64_t header_size = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	struct progress *progress = NULL;
@@ -940,10 +933,10 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		goto cleanup;
 	}
 
-	written = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
+	header_size = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
 
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
-	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
+	chunk_offsets[cur_chunk] = header_size + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
@@ -981,39 +974,37 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		hashwrite_be32(f, chunk_ids[i]);
 		hashwrite_be64(f, chunk_offsets[i]);
-
-		written += MIDX_CHUNKLOOKUP_WIDTH;
 	}
 
 	if (flags & MIDX_PROGRESS)
 		progress = start_delayed_progress(_("Writing chunks to multi-pack-index"),
 					  num_chunks);
 	for (i = 0; i < num_chunks; i++) {
-		if (written != chunk_offsets[i])
+		if (f->total + f->offset != chunk_offsets[i])
 			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
 			    chunk_offsets[i],
-			    written,
+			    f->total + f->offset,
 			    chunk_ids[i]);
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, &ctx);
+				write_midx_pack_names(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-				written += write_midx_oid_fanout(f, &ctx);
+				write_midx_oid_fanout(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDLOOKUP:
-				written += write_midx_oid_lookup(f, &ctx);
+				write_midx_oid_lookup(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, &ctx);
+				write_midx_object_offsets(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, &ctx);
+				write_midx_large_offsets(f, &ctx);
 				break;
 
 			default:
@@ -1025,9 +1016,9 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	}
 	stop_progress(&progress);
 
-	if (written != chunk_offsets[num_chunks])
+	if (f->total + f->offset != chunk_offsets[num_chunks])
 		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-		    written,
+		    f->total + f->offset,
 		    chunk_offsets[num_chunks]);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
-- 
gitgitgadget



* [PATCH 09/15] midx: drop chunk progress during write
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (7 preceding siblings ...)
  2020-12-03 16:16 ` [PATCH 08/15] midx: convert chunk write methods to return int Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-03 16:16 ` [PATCH 10/15] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Most of the expensive operations in write_midx_internal() use the
context struct's progress member, which reports progress from within
the chunk writing methods. However, there is a competing progress
struct that counts progress over all chunks. This is not very helpful
compared to the others, so drop it.

This also reduces our barriers to combining the chunk writing code with
chunk-format.c.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/midx.c b/midx.c
index 5eb1b01946..ce6d4339bd 100644
--- a/midx.c
+++ b/midx.c
@@ -808,7 +808,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t header_size = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	struct progress *progress = NULL;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -976,9 +975,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		hashwrite_be64(f, chunk_offsets[i]);
 	}
 
-	if (flags & MIDX_PROGRESS)
-		progress = start_delayed_progress(_("Writing chunks to multi-pack-index"),
-					  num_chunks);
 	for (i = 0; i < num_chunks; i++) {
 		if (f->total + f->offset != chunk_offsets[i])
 			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
@@ -1011,10 +1007,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				BUG("trying to write unknown chunk id %"PRIx32,
 				    chunk_ids[i]);
 		}
-
-		display_progress(progress, i + 1);
 	}
-	stop_progress(&progress);
 
 	if (f->total + f->offset != chunk_offsets[num_chunks])
 		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-- 
gitgitgadget



* [PATCH 10/15] midx: use chunk-format API in write_midx_internal()
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (8 preceding siblings ...)
  2020-12-03 16:16 ` [PATCH 09/15] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-08 18:42   ` Taylor Blau
  2020-12-03 16:16 ` [PATCH 11/15] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The chunk-format API allows automatically writing the table of contents
for a chunk-based file format when using an array of "struct
chunk_info"s. Update write_midx_internal() to use this strategy, which
also simplifies the chunk writing loop. This loop will be replaced with
a chunk-format API call in an upcoming change.
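
The table of contents produced here can be understood as a running sum:
each chunk's offset is the previous chunk's offset plus its size,
starting just past the header and the TOC itself. A sketch of that
computation (hypothetical names; the 12-byte entry width mirrors
MIDX_CHUNKLOOKUP_WIDTH's 4-byte id plus 8-byte offset):

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_TOC_ENTRY_WIDTH 12 /* 4-byte id + 8-byte offset */

/* Derive chunk offsets from an array of chunk sizes. The TOC has
 * nr entries plus a terminating entry, so the first chunk starts at
 * header_size + (nr + 1) * DEMO_TOC_ENTRY_WIDTH. */
static void demo_toc_offsets(uint64_t header_size,
			     const uint64_t *sizes, int nr,
			     uint64_t *offsets)
{
	int i;
	uint64_t cur = header_size + (uint64_t)(nr + 1) * DEMO_TOC_ENTRY_WIDTH;

	for (i = 0; i < nr; i++) {
		offsets[i] = cur;
		cur += sizes[i];
	}
	offsets[nr] = cur; /* terminating offset = end of last chunk */
}
```

Storing id, size, and write_fn together in one struct per chunk is what
lets the API compute these offsets without a hand-maintained
cur_chunk counter.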

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 96 +++++++++++++---------------------------------------------
 1 file changed, 21 insertions(+), 75 deletions(-)

diff --git a/midx.c b/midx.c
index ce6d4339bd..0548266bea 100644
--- a/midx.c
+++ b/midx.c
@@ -11,6 +11,7 @@
 #include "trace2.h"
 #include "run-command.h"
 #include "repository.h"
+#include "chunk-format.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -799,15 +800,14 @@ static int write_midx_large_offsets(struct hashfile *f,
 static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
 			       struct string_list *packs_to_drop, unsigned flags)
 {
-	unsigned char cur_chunk, num_chunks = 0;
+	unsigned char num_chunks = 0;
 	char *midx_name;
 	uint32_t i;
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
 	uint64_t header_size = 0;
-	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
-	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
+	struct chunk_info chunks[MIDX_MAX_CHUNKS];
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -923,7 +923,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.m)
 		close_midx(ctx.m);
 
-	cur_chunk = 0;
 	num_chunks = ctx.large_offsets_needed ? 5 : 4;
 
 	if (ctx.nr - dropped_packs == 0) {
@@ -934,85 +933,32 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	header_size = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
 
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
-	chunk_offsets[cur_chunk] = header_size + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
+	chunks[0].id = MIDX_CHUNKID_PACKNAMES;
+	chunks[0].size = pack_name_concat_len;
+	chunks[0].write_fn = write_midx_pack_names;
 
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
+	chunks[1].id = MIDX_CHUNKID_OIDFANOUT;
+	chunks[1].size = MIDX_CHUNK_FANOUT_SIZE;
+	chunks[1].write_fn = write_midx_oid_fanout;
 
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
+	chunks[2].id = MIDX_CHUNKID_OIDLOOKUP;
+	chunks[2].size = ctx.entries_nr * the_hash_algo->rawsz;
+	chunks[2].write_fn = write_midx_oid_lookup;
 
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;
+	chunks[3].id = MIDX_CHUNKID_OBJECTOFFSETS;
+	chunks[3].size = ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
+	chunks[3].write_fn = write_midx_object_offsets;
 
-	cur_chunk++;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
 	if (ctx.large_offsets_needed) {
-		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
-
-		cur_chunk++;
-		chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
-					   ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
+		chunks[4].id = MIDX_CHUNKID_LARGEOFFSETS;
+		chunks[4].size = ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
+		chunks[4].write_fn = write_midx_large_offsets;
 	}
 
-	chunk_ids[cur_chunk] = 0;
-
-	for (i = 0; i <= num_chunks; i++) {
-		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
-			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
-			    chunk_offsets[i - 1],
-			    chunk_offsets[i]);
-
-		if (chunk_offsets[i] % MIDX_CHUNK_ALIGNMENT)
-			BUG("chunk offset %"PRIu64" is not properly aligned",
-			    chunk_offsets[i]);
-
-		hashwrite_be32(f, chunk_ids[i]);
-		hashwrite_be64(f, chunk_offsets[i]);
-	}
-
-	for (i = 0; i < num_chunks; i++) {
-		if (f->total + f->offset != chunk_offsets[i])
-			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
-			    chunk_offsets[i],
-			    f->total + f->offset,
-			    chunk_ids[i]);
-
-		switch (chunk_ids[i]) {
-			case MIDX_CHUNKID_PACKNAMES:
-				write_midx_pack_names(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_OIDFANOUT:
-				write_midx_oid_fanout(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_OIDLOOKUP:
-				write_midx_oid_lookup(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_OBJECTOFFSETS:
-				write_midx_object_offsets(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_LARGEOFFSETS:
-				write_midx_large_offsets(f, &ctx);
-				break;
-
-			default:
-				BUG("trying to write unknown chunk id %"PRIx32,
-				    chunk_ids[i]);
-		}
-	}
+	write_table_of_contents(f, header_size, chunks, num_chunks);
 
-	if (f->total + f->offset != chunk_offsets[num_chunks])
-		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-		    f->total + f->offset,
-		    chunk_offsets[num_chunks]);
+	for (i = 0; i < num_chunks; i++)
+		chunks[i].write_fn(f, &ctx);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
-- 
gitgitgadget



* [PATCH 11/15] midx: use 64-bit multiplication for chunk sizes
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (9 preceding siblings ...)
  2020-12-03 16:16 ` [PATCH 10/15] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-03 22:00   ` Junio C Hamano
  2020-12-03 16:16 ` [PATCH 12/15] chunk-format: create write_chunks() Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When calculating the sizes of certain chunks, we should always use
64-bit multiplication. This allows us to predict the chunk sizes
correctly, without risk of integer overflow.
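
The failure mode being avoided can be demonstrated in isolation: when
both operands are 32-bit, C performs the multiplication in 32 bits and
the product wraps before it is ever stored in a 64-bit variable. A
small sketch (hypothetical helper names):

```c
#include <assert.h>
#include <stdint.h>

/* Both operands are 32-bit, mirroring ctx.entries_nr and a per-entry
 * record width. Without the cast, the multiply happens in 32 bits and
 * wraps (unsigned overflow is well-defined but still wrong here). */
static uint64_t chunk_size_narrow(uint32_t nr, uint32_t width)
{
	return nr * width; /* 32-bit multiply, may wrap */
}

static uint64_t chunk_size_wide(uint32_t nr, uint32_t width)
{
	return (uint64_t)nr * width; /* 64-bit multiply, safe */
}
```

With roughly 300 million entries and a 20-byte OID record, the narrow
product wraps to 1705032704 instead of the intended 6000000000.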

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index 0548266bea..47f5f60fcd 100644
--- a/midx.c
+++ b/midx.c
@@ -946,12 +946,12 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	chunks[2].write_fn = write_midx_oid_lookup;
 
 	chunks[3].id = MIDX_CHUNKID_OBJECTOFFSETS;
-	chunks[3].size = ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
+	chunks[3].size = (uint64_t)ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
 	chunks[3].write_fn = write_midx_object_offsets;
 
 	if (ctx.large_offsets_needed) {
 		chunks[4].id = MIDX_CHUNKID_LARGEOFFSETS;
-		chunks[4].size = ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
+		chunks[4].size = (uint64_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
 		chunks[4].write_fn = write_midx_large_offsets;
 	}
 
-- 
gitgitgadget



* [PATCH 12/15] chunk-format: create write_chunks()
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (10 preceding siblings ...)
  2020-12-03 16:16 ` [PATCH 11/15] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-08 18:45   ` Taylor Blau
  2020-12-03 16:16 ` [PATCH 13/15] chunk-format: create chunk reading API Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The commit-graph and multi-pack-index files both use a chunk-based file
format. They have already unified on using write_table_of_contents(),
but we expand upon that by unifying their chunk writing loop.

This restores concepts already present in the commit-graph code that
were dropped from the multi-pack-index code during refactoring,
including:

* Check the hashfile for how much data was written by each write_fn.

* Allow write_fn() to report an error that results in a failure
  without using die() in the low-level commands.

This simplifies the code in commit-graph.c and midx.c while laying the
foundation for future formats using similar ideas.
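
A condensed sketch of the checking loop described above (hypothetical
stand-in types; Git's version operates on struct hashfile and calls
BUG() on a mismatch): after each write_fn returns, the file position
must have advanced by exactly the advertised chunk size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct hashfile_sim {
	uint64_t pos; /* stand-in for f->total + f->offset */
};

typedef int (*demo_write_fn)(struct hashfile_sim *f, void *data);

struct demo_chunk {
	uint32_t id;
	uint64_t size;
	demo_write_fn write_fn;
};

static int demo_write_chunks(struct hashfile_sim *f,
			     struct demo_chunk *chunks, int nr, void *data)
{
	int i;

	for (i = 0; i < nr; i++) {
		uint64_t start = f->pos;
		int result = chunks[i].write_fn(f, data);

		if (result)
			return result; /* propagate writer failure */
		if (f->pos - start != chunks[i].size)
			return -1;     /* size mismatch (BUG() in Git) */
	}
	return 0;
}

static int write_eight_bytes(struct hashfile_sim *f, void *data)
{
	(void)data;
	f->pos += 8; /* pretend to emit eight bytes */
	return 0;
}
```

Checking the file position, rather than trusting a byte count returned
by each writer, catches a writer that miscounts what it emitted.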

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 chunk-format.c | 23 +++++++++++++++++++++++
 chunk-format.h | 13 +++++++++++++
 commit-graph.c | 13 ++-----------
 midx.c         |  3 +--
 4 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index 771b6d98d0..a6643a4fc8 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -24,3 +24,26 @@ void write_table_of_contents(struct hashfile *f,
 	hashwrite_be32(f, 0);
 	hashwrite_be64(f, cur_offset);
 }
+
+int write_chunks(struct hashfile *f,
+		 struct chunk_info *chunks,
+		 int nr,
+		 void *data)
+{
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		uint64_t start_offset = f->total + f->offset;
+		int result = chunks[i].write_fn(f, data);
+
+		if (result)
+			return result;
+
+		if (f->total + f->offset != start_offset + chunks[i].size)
+			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
+			    chunks[i].size, chunks[i].id,
+			    f->total + f->offset - start_offset);
+	}
+
+	return 0;
+}
diff --git a/chunk-format.h b/chunk-format.h
index 4b9cbeb372..a2c7ddb23b 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -33,4 +33,17 @@ void write_table_of_contents(struct hashfile *f,
 			     struct chunk_info *chunks,
 			     int nr);
 
+/*
+ * Write the data for the given chunk list using the provided
+ * write_fn values. The given 'data' parameter is passed to those
+ * methods.
+ *
+ * The data that is written by each write_fn is checked to be of
+ * the expected size, and a BUG() is thrown if not specified correctly.
+ */
+int write_chunks(struct hashfile *f,
+		 struct chunk_info *chunks,
+		 int nr,
+		 void *data);
+
 #endif
diff --git a/commit-graph.c b/commit-graph.c
index 5494fda1d3..10dcef9d6b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1809,17 +1809,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			num_chunks * ctx->commits.nr);
 	}
 
-	for (i = 0; i < num_chunks; i++) {
-		uint64_t start_offset = f->total + f->offset;
-
-		if (chunks[i].write_fn(f, ctx))
-			return -1;
-
-		if (f->total + f->offset != start_offset + chunks[i].size)
-			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
-			    chunks[i].size, chunks[i].id,
-			    f->total + f->offset - start_offset);
-	}
+	if (write_chunks(f, chunks, num_chunks, ctx))
+		return -1;
 
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
diff --git a/midx.c b/midx.c
index 47f5f60fcd..67ac232a81 100644
--- a/midx.c
+++ b/midx.c
@@ -957,8 +957,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	write_table_of_contents(f, header_size, chunks, num_chunks);
 
-	for (i = 0; i < num_chunks; i++)
-		chunks[i].write_fn(f, &ctx);
+	result = write_chunks(f, chunks, num_chunks, &ctx);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 13/15] chunk-format: create chunk reading API
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (11 preceding siblings ...)
  2020-12-03 16:16 ` [PATCH 12/15] chunk-format: create write_chunks() Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-03 22:17   ` Junio C Hamano
  2020-12-03 22:43   ` Junio C Hamano
  2020-12-03 16:16 ` [PATCH 14/15] commit-graph: restore duplicate chunk checks Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Now that the chunk-format API has a consistent mechanism for writing
file formats based on chunks, let's extend it to also parse chunk-based
files during read.

Similar to the write scenario, the caller supplies some context
information, such as a memory location, the offset of the table of
contents, and some information of what to do with each chunk. The table
of contents parsing will skip any unspecified chunks and will leave the
specifics of each chunk to a function pointer.

This implementation handles some of the important error cases, such as
chunk offsets that escape the size of the file. However, we drop the
case of duplicate chunks and leave that to the given functions. It may
be helpful to allow multiple instances of the same chunk ID for some
formats. The new location of these error checks changes the error strings,
and therefore also the tests that check for corruption in the table of contents.
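The reading side can likewise be sketched standalone. Everything below is a
simplified, hypothetical analogue of the API in this patch: the memory-mapped
file is a plain byte array and the trailing-hash reservation is noted but
omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define SK_LOOKUP_WIDTH 12

static uint32_t sk_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t sk_be64(const unsigned char *p)
{
	return ((uint64_t)sk_be32(p) << 32) | sk_be32(p + 4);
}

struct sk_read_chunk {
	uint32_t id;
	int (*read_fn)(const unsigned char *start, size_t size, void *data);
};

/*
 * Mirrors read_table_of_contents(): walk the (C+1)-row table, skip
 * unrecognized ids, reject offsets that run backwards or escape the
 * file, and require a zero id on the terminating row.  (The real code
 * additionally reserves room for the trailing file hash.)
 */
static int sk_read_toc(const unsigned char *file, size_t file_size,
		       uint64_t toc_offset, int toc_length,
		       struct sk_read_chunk *chunks, int nr, void *data)
{
	const unsigned char *toc = file + toc_offset;

	while (toc_length--) {
		int i;
		uint32_t id = sk_be32(toc);
		uint64_t off = sk_be64(toc + 4);
		uint64_t next;

		if (!id)
			return 1;	/* terminator appeared early */

		toc += SK_LOOKUP_WIDTH;
		next = sk_be64(toc + 4);
		if (next < off || next > file_size)
			return 1;	/* improper chunk offsets */

		for (i = 0; i < nr; i++) {
			if (chunks[i].id == id) {
				int result = chunks[i].read_fn(file + off,
						(size_t)(next - off), data);
				if (result)
					return result;
				break;
			}
		}
	}
	return sk_be32(toc) ? 1 : 0; /* final row must carry the zero id */
}

/* Example read_fn: record the size of the chunk it was handed. */
static int sk_capture_size(const unsigned char *start, size_t size, void *data)
{
	(void)start;
	*(size_t *)data = size;
	return 0;
}
```

Note how the chunk size never appears in the table itself: each read_fn
receives `OFFSET[i+1] - OFFSET[i]`, computed from adjacent rows.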

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 chunk-format.c              |  56 ++++++++++
 chunk-format.h              |  20 ++++
 commit-graph.c              | 200 ++++++++++++++++++++----------------
 midx.c                      | 114 ++++++++++++--------
 t/t5318-commit-graph.sh     |   2 +-
 t/t5319-multi-pack-index.sh |   6 +-
 6 files changed, 261 insertions(+), 137 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index a6643a4fc8..d888ef6ec7 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -1,6 +1,7 @@
 #include "git-compat-util.h"
 #include "chunk-format.h"
 #include "csum-file.h"
+#include "cache.h"
 #define CHUNK_LOOKUP_WIDTH 12
 
 void write_table_of_contents(struct hashfile *f,
@@ -47,3 +48,58 @@ int write_chunks(struct hashfile *f,
 
 	return 0;
 }
+
+int read_table_of_contents(const unsigned char *mfile,
+			   size_t mfile_size,
+			   uint64_t toc_offset,
+			   int toc_length,
+			   struct read_chunk_info *chunks,
+			   int nr,
+			   void *data)
+{
+	uint32_t chunk_id;
+	const unsigned char *table_of_contents = mfile + toc_offset;
+
+	while (toc_length--) {
+		int i;
+		uint64_t chunk_offset, next_chunk_offset;
+
+		chunk_id = get_be32(table_of_contents);
+		chunk_offset = get_be64(table_of_contents + 4);
+
+		if (!chunk_id) {
+			error(_("terminating chunk id appears earlier than expected"));
+			return 1;
+		}
+
+		table_of_contents += CHUNK_LOOKUP_WIDTH;
+		next_chunk_offset = get_be64(table_of_contents + 4);
+
+		if (next_chunk_offset < chunk_offset ||
+		    next_chunk_offset > mfile_size - the_hash_algo->rawsz) {
+			error(_("improper chunk offset(s) %"PRIx64" and %"PRIx64""),
+			      chunk_offset, next_chunk_offset);
+			return 1;
+		}
+		for (i = 0; i < nr; i++) {
+			if (chunks[i].id == chunk_id) {
+				int result = chunks[i].read_fn(
+						mfile + chunk_offset,
+						next_chunk_offset - chunk_offset,
+						data);
+
+				if (result)
+					return result;
+				break;
+			}
+		}
+	}
+
+	chunk_id = get_be32(table_of_contents);
+	if (chunk_id) {
+		error(_("final chunk has non-zero id %"PRIx32""), chunk_id);
+		return 1;
+	}
+
+	return 0;
+}
diff --git a/chunk-format.h b/chunk-format.h
index a2c7ddb23b..7049800f73 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -46,4 +46,24 @@ int write_chunks(struct hashfile *f,
 		 int nr,
 		 void *data);
 
+/*
+ * When reading a table of contents, we find the chunk with matching 'id'
+ * then call its read_fn to populate the necessary 'data' based on the
+ * chunk start and size.
+ */
+typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
+			     size_t chunk_size, void *data);
+struct read_chunk_info {
+	uint32_t id;
+	chunk_read_fn read_fn;
+};
+
+int read_table_of_contents(const unsigned char *mfile,
+			   size_t mfile_size,
+			   uint64_t toc_offset,
+			   int toc_length,
+			   struct read_chunk_info *chunks,
+			   int nr,
+			   void *data);
+
 #endif
diff --git a/commit-graph.c b/commit-graph.c
index 10dcef9d6b..0a3ba147df 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -289,15 +289,114 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
+static int graph_read_oid_fanout(const unsigned char *chunk_start,
+				 size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	g->chunk_oid_fanout = (uint32_t*)chunk_start;
+	return 0;
+}
+
+static int graph_read_oid_lookup(const unsigned char *chunk_start,
+				 size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	g->chunk_oid_lookup = chunk_start;
+	g->num_commits = chunk_size / g->hash_len;
+	return 0;
+}
+
+static int graph_read_data(const unsigned char *chunk_start,
+				 size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	g->chunk_commit_data = chunk_start;
+	return 0;
+}
+
+static int graph_read_extra_edges(const unsigned char *chunk_start,
+				  size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	g->chunk_extra_edges = chunk_start;
+	return 0;
+}
+
+static int graph_read_base_graphs(const unsigned char *chunk_start,
+				  size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	g->chunk_base_graphs = chunk_start;
+	return 0;
+}
+
+static int graph_read_bloom_indices(const unsigned char *chunk_start,
+				    size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	g->chunk_bloom_indexes = chunk_start;
+	return 0;
+}
+
+static int graph_read_bloom_data(const unsigned char *chunk_start,
+				  size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	uint32_t hash_version;
+	g->chunk_bloom_data = chunk_start;
+	hash_version = get_be32(chunk_start);
+
+	if (hash_version != 1)
+		return 0;
+
+	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
+	g->bloom_filter_settings->hash_version = hash_version;
+	g->bloom_filter_settings->num_hashes = get_be32(chunk_start + 4);
+	g->bloom_filter_settings->bits_per_entry = get_be32(chunk_start + 8);
+	g->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
+
+	return 0;
+}
+
+static struct read_chunk_info read_chunks[] = {
+	[0] = {
+		GRAPH_CHUNKID_OIDFANOUT,
+		graph_read_oid_fanout
+	},
+	[1] = {
+		GRAPH_CHUNKID_OIDLOOKUP,
+		graph_read_oid_lookup
+	},
+	[2] = {
+		GRAPH_CHUNKID_DATA,
+		graph_read_data
+	},
+	[3] = {
+		GRAPH_CHUNKID_EXTRAEDGES,
+		graph_read_extra_edges
+	},
+	[4] = {
+		GRAPH_CHUNKID_BASE,
+		graph_read_base_graphs
+	},
+	[5] = {
+		GRAPH_CHUNKID_BLOOMINDEXES,
+		graph_read_bloom_indices
+	},
+	[6] = {
+		GRAPH_CHUNKID_BLOOMDATA,
+		graph_read_bloom_data
+	}
+};
+
 struct commit_graph *parse_commit_graph(struct repository *r,
 					void *graph_map, size_t graph_size)
 {
-	const unsigned char *data, *chunk_lookup;
-	uint32_t i;
+	const unsigned char *data;
 	struct commit_graph *graph;
-	uint64_t next_chunk_offset;
 	uint32_t graph_signature;
 	unsigned char graph_version, hash_version;
+	int chunks_nr = MAX_NUM_CHUNKS;
 
 	if (!graph_map)
 		return NULL;
@@ -346,95 +445,14 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		return NULL;
 	}
 
-	chunk_lookup = data + 8;
-	next_chunk_offset = get_be64(chunk_lookup + 4);
-	for (i = 0; i < graph->num_chunks; i++) {
-		uint32_t chunk_id;
-		uint64_t chunk_offset = next_chunk_offset;
-		int chunk_repeated = 0;
-
-		chunk_id = get_be32(chunk_lookup + 0);
-
-		chunk_lookup += GRAPH_CHUNKLOOKUP_WIDTH;
-		next_chunk_offset = get_be64(chunk_lookup + 4);
-
-		if (chunk_offset > graph_size - the_hash_algo->rawsz) {
-			error(_("commit-graph improper chunk offset %08x%08x"), (uint32_t)(chunk_offset >> 32),
-			      (uint32_t)chunk_offset);
-			goto free_and_return;
-		}
-
-		switch (chunk_id) {
-		case GRAPH_CHUNKID_OIDFANOUT:
-			if (graph->chunk_oid_fanout)
-				chunk_repeated = 1;
-			else
-				graph->chunk_oid_fanout = (uint32_t*)(data + chunk_offset);
-			break;
-
-		case GRAPH_CHUNKID_OIDLOOKUP:
-			if (graph->chunk_oid_lookup)
-				chunk_repeated = 1;
-			else {
-				graph->chunk_oid_lookup = data + chunk_offset;
-				graph->num_commits = (next_chunk_offset - chunk_offset)
-						     / graph->hash_len;
-			}
-			break;
-
-		case GRAPH_CHUNKID_DATA:
-			if (graph->chunk_commit_data)
-				chunk_repeated = 1;
-			else
-				graph->chunk_commit_data = data + chunk_offset;
-			break;
+	/* limit the chunk-format list if we are ignoring Bloom filters */
+	if (!r->settings.commit_graph_read_changed_paths)
+		chunks_nr = 5;
 
-		case GRAPH_CHUNKID_EXTRAEDGES:
-			if (graph->chunk_extra_edges)
-				chunk_repeated = 1;
-			else
-				graph->chunk_extra_edges = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BASE:
-			if (graph->chunk_base_graphs)
-				chunk_repeated = 1;
-			else
-				graph->chunk_base_graphs = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BLOOMINDEXES:
-			if (graph->chunk_bloom_indexes)
-				chunk_repeated = 1;
-			else if (r->settings.commit_graph_read_changed_paths)
-				graph->chunk_bloom_indexes = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BLOOMDATA:
-			if (graph->chunk_bloom_data)
-				chunk_repeated = 1;
-			else if (r->settings.commit_graph_read_changed_paths) {
-				uint32_t hash_version;
-				graph->chunk_bloom_data = data + chunk_offset;
-				hash_version = get_be32(data + chunk_offset);
-
-				if (hash_version != 1)
-					break;
-
-				graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
-				graph->bloom_filter_settings->hash_version = hash_version;
-				graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
-				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
-				graph->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
-			}
-			break;
-		}
-
-		if (chunk_repeated) {
-			error(_("commit-graph chunk id %08x appears multiple times"), chunk_id);
-			goto free_and_return;
-		}
-	}
+	if (read_table_of_contents(
+		graph->data, graph_size, GRAPH_HEADER_SIZE, graph->num_chunks,
+		read_chunks, chunks_nr, graph))
+		goto free_and_return;
 
 	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
 		init_bloom_filters();
diff --git a/midx.c b/midx.c
index 67ac232a81..786b3b51c3 100644
--- a/midx.c
+++ b/midx.c
@@ -54,6 +54,74 @@ static char *get_midx_filename(const char *object_dir)
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
 
+static int midx_read_pack_names(const unsigned char *chunk_start,
+				size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = (struct multi_pack_index *)data;
+	m->chunk_pack_names = chunk_start;
+	return 0;
+}
+
+static int midx_read_oid_fanout(const unsigned char *chunk_start,
+				size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = (struct multi_pack_index *)data;
+	m->chunk_oid_fanout = (uint32_t *)chunk_start;
+
+	if (chunk_size != 4 * 256) {
+		error(_("multi-pack-index OID fanout is of the wrong size"));
+		return 1;
+	}
+	return 0;
+}
+
+static int midx_read_oid_lookup(const unsigned char *chunk_start,
+				size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = (struct multi_pack_index *)data;
+	m->chunk_oid_lookup = chunk_start;
+	return 0;
+}
+
+static int midx_read_offsets(const unsigned char *chunk_start,
+			     size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = (struct multi_pack_index *)data;
+	m->chunk_object_offsets = chunk_start;
+	return 0;
+}
+
+static int midx_read_large_offsets(const unsigned char *chunk_start,
+				   size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = (struct multi_pack_index *)data;
+	m->chunk_large_offsets = chunk_start;
+	return 0;
+}
+
+static struct read_chunk_info read_chunks[] = {
+	[0] = {
+		MIDX_CHUNKID_PACKNAMES,
+		midx_read_pack_names
+	},
+	[1] = {
+		MIDX_CHUNKID_OIDFANOUT,
+		midx_read_oid_fanout
+	},
+	[2] = {
+		MIDX_CHUNKID_OIDLOOKUP,
+		midx_read_oid_lookup
+	},
+	[3] = {
+		MIDX_CHUNKID_OBJECTOFFSETS,
+		midx_read_offsets
+	},
+	[4] = {
+		MIDX_CHUNKID_LARGEOFFSETS,
+		midx_read_large_offsets
+	}
+};
+
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local)
 {
 	struct multi_pack_index *m = NULL;
@@ -114,48 +182,10 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 
 	m->num_packs = get_be32(m->data + MIDX_BYTE_NUM_PACKS);
 
-	for (i = 0; i < m->num_chunks; i++) {
-		uint32_t chunk_id = get_be32(m->data + MIDX_HEADER_SIZE +
-					     MIDX_CHUNKLOOKUP_WIDTH * i);
-		uint64_t chunk_offset = get_be64(m->data + MIDX_HEADER_SIZE + 4 +
-						 MIDX_CHUNKLOOKUP_WIDTH * i);
-
-		if (chunk_offset >= m->data_len)
-			die(_("invalid chunk offset (too large)"));
-
-		switch (chunk_id) {
-			case MIDX_CHUNKID_PACKNAMES:
-				m->chunk_pack_names = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_OIDFANOUT:
-				m->chunk_oid_fanout = (uint32_t *)(m->data + chunk_offset);
-				break;
-
-			case MIDX_CHUNKID_OIDLOOKUP:
-				m->chunk_oid_lookup = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_OBJECTOFFSETS:
-				m->chunk_object_offsets = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_LARGEOFFSETS:
-				m->chunk_large_offsets = m->data + chunk_offset;
-				break;
-
-			case 0:
-				die(_("terminating multi-pack-index chunk id appears earlier than expected"));
-				break;
-
-			default:
-				/*
-				 * Do nothing on unrecognized chunks, allowing future
-				 * extensions to add optional chunks.
-				 */
-				break;
-		}
-	}
+	if (read_table_of_contents(m->data, midx_size, MIDX_HEADER_SIZE,
+				   m->num_chunks, read_chunks,
+				   MIDX_MAX_CHUNKS, m))
+		goto cleanup_fail;
 
 	if (!m->chunk_pack_names)
 		die(_("multi-pack-index missing required pack-name chunk"));
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2ed0c1544d..65879af6c0 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -563,7 +563,7 @@ test_expect_success 'detect bad hash version' '
 
 test_expect_success 'detect low chunk count' '
 	corrupt_graph_and_verify $GRAPH_BYTE_CHUNK_COUNT "\01" \
-		"missing the .* chunk"
+		"final chunk has non-zero id"
 '
 
 test_expect_success 'detect missing OID fanout chunk' '
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index ace469c95c..a02d612f4d 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -314,12 +314,12 @@ test_expect_success 'verify bad OID version' '
 
 test_expect_success 'verify truncated chunk count' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\01" $objdir \
-		"missing required"
+		"final chunk has non-zero id"
 '
 
 test_expect_success 'verify extended chunk count' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\07" $objdir \
-		"terminating multi-pack-index chunk id appears earlier than expected"
+		"terminating chunk id appears earlier than expected"
 '
 
 test_expect_success 'verify missing required chunk' '
@@ -329,7 +329,7 @@ test_expect_success 'verify missing required chunk' '
 
 test_expect_success 'verify invalid chunk offset' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_OFFSET "\01" $objdir \
-		"invalid chunk offset (too large)"
+		"improper chunk offset(s)"
 '
 
 test_expect_success 'verify packnames out of order' '
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 14/15] commit-graph: restore duplicate chunk checks
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (12 preceding siblings ...)
  2020-12-03 16:16 ` [PATCH 13/15] chunk-format: create chunk reading API Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-07 13:43   ` Derrick Stolee
  2020-12-03 16:16 ` [PATCH 15/15] chunk-format: add technical docs Derrick Stolee via GitGitGadget
  2020-12-04 12:48 ` [PATCH 00/15] Refactor chunk-format into an API René Scharfe
  15 siblings, 1 reply; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The previous change introduced read_table_of_contents() in the
chunk-format API, but dropped the duplicate chunk check from the
commit-graph parsing logic. This was done to keep flexibility in the
chunk-format API.

One way to restore this check is to have each chunk_read_fn method check
if it has run before. This is somewhat repetitive. If we determine that
the chunk-format API would be better off with a hard requirement that
chunks are never repeated, then this could be replaced with a check in
chunk-format.c.

For now, only restore the duplicate checks that previously existed in
the commit-graph parsing logic.
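The guard being restored is small enough to show in a standalone sketch. The
struct and function names here are hypothetical stand-ins for the
commit-graph ones:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical slice of struct commit_graph: one chunk pointer. */
struct sk_graph {
	const unsigned char *chunk_oid_fanout;
};

/*
 * The check each read_fn gains in this patch: if the pointer for this
 * chunk is already populated, a second chunk with the same id is a
 * duplicate and parsing fails instead of silently overwriting.
 */
static int sk_read_oid_fanout(const unsigned char *start, size_t size,
			      void *data)
{
	struct sk_graph *g = data;

	(void)size;
	if (g->chunk_oid_fanout)
		return 1; /* report_duplicate() in the real code */
	g->chunk_oid_fanout = start;
	return 0;
}
```

The first call populates the pointer; any repeat call for the same chunk id
returns nonzero, which read_table_of_contents() propagates to the caller.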

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 0a3ba147df..c0102fceba 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -289,10 +289,20 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
+static int report_duplicate(void)
+{
+	warning(_("duplicate chunk detected"));
+	return 1;
+}
+
 static int graph_read_oid_fanout(const unsigned char *chunk_start,
 				 size_t chunk_size, void *data)
 {
 	struct commit_graph *g = (struct commit_graph *)data;
+
+	if (g->chunk_oid_fanout)
+		return report_duplicate();
+
 	g->chunk_oid_fanout = (uint32_t*)chunk_start;
 	return 0;
 }
@@ -301,6 +311,10 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
 				 size_t chunk_size, void *data)
 {
 	struct commit_graph *g = (struct commit_graph *)data;
+
+	if (g->chunk_oid_lookup)
+		return report_duplicate();
+
 	g->chunk_oid_lookup = chunk_start;
 	g->num_commits = chunk_size / g->hash_len;
 	return 0;
@@ -310,6 +324,10 @@ static int graph_read_data(const unsigned char *chunk_start,
 				 size_t chunk_size, void *data)
 {
 	struct commit_graph *g = (struct commit_graph *)data;
+
+	if (g->chunk_commit_data)
+		return report_duplicate();
+
 	g->chunk_commit_data = chunk_start;
 	return 0;
 }
@@ -318,6 +336,10 @@ static int graph_read_extra_edges(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
 	struct commit_graph *g = (struct commit_graph *)data;
+
+	if (g->chunk_extra_edges)
+		return report_duplicate();
+
 	g->chunk_extra_edges = chunk_start;
 	return 0;
 }
@@ -326,6 +348,10 @@ static int graph_read_base_graphs(const unsigned char *chunk_start,
 				  size_t chunk_size, void *data)
 {
 	struct commit_graph *g = (struct commit_graph *)data;
+
+	if (g->chunk_base_graphs)
+		return report_duplicate();
+
 	g->chunk_base_graphs = chunk_start;
 	return 0;
 }
@@ -334,6 +360,10 @@ static int graph_read_bloom_indices(const unsigned char *chunk_start,
 				    size_t chunk_size, void *data)
 {
 	struct commit_graph *g = (struct commit_graph *)data;
+
+	if (g->chunk_bloom_indexes)
+		return report_duplicate();
+
 	g->chunk_bloom_indexes = chunk_start;
 	return 0;
 }
@@ -343,6 +373,10 @@ static int graph_read_bloom_data(const unsigned char *chunk_start,
 {
 	struct commit_graph *g = (struct commit_graph *)data;
 	uint32_t hash_version;
+
+	if (g->chunk_bloom_data)
+		return report_duplicate();
+
 	g->chunk_bloom_data = chunk_start;
 	hash_version = get_be32(chunk_start);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 15/15] chunk-format: add technical docs
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (13 preceding siblings ...)
  2020-12-03 16:16 ` [PATCH 14/15] commit-graph: restore duplicate chunk checks Derrick Stolee via GitGitGadget
@ 2020-12-03 16:16 ` Derrick Stolee via GitGitGadget
  2020-12-04 12:48 ` [PATCH 00/15] Refactor chunk-format into an API René Scharfe
  15 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-12-03 16:16 UTC (permalink / raw)
  To: git; +Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The chunk-based file format is now an API in the code, but we should
also take time to document it as a file format. Specifically, it matches
the CHUNK LOOKUP sections of the commit-graph and multi-pack-index
files, but there are some commonalities that should be grouped in this
document.
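The table-of-contents layout the new document describes can be sketched as a
small writer. The helper names are hypothetical and the layout is simplified
from the format description:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Emit one 12-byte table-of-contents row (4-byte id, 8-byte offset),
 * both in network byte order as the document specifies. */
static void sk_toc_row(unsigned char *out, uint32_t id, uint64_t offset)
{
	int i;

	for (i = 0; i < 4; i++)
		out[i] = (unsigned char)(id >> (8 * (3 - i)));
	for (i = 0; i < 8; i++)
		out[4 + i] = (unsigned char)(offset >> (8 * (7 - i)));
}

/*
 * Write the full (C+1)-row table: one row per chunk, then the
 * terminating row with a zero id and the end-of-data offset.  The
 * first chunk begins right after the table, so its offset is the
 * header size plus (nr + 1) rows of 12 bytes.
 */
static size_t sk_write_toc(unsigned char *out, const uint32_t *ids,
			   const uint64_t *sizes, int nr, uint64_t header_size)
{
	uint64_t cur = header_size + (uint64_t)(nr + 1) * 12;
	size_t pos = 0;
	int i;

	for (i = 0; i < nr; i++) {
		sk_toc_row(out + pos, ids[i], cur);
		cur += sizes[i];
		pos += 12;
	}
	sk_toc_row(out + pos, 0, cur); /* terminator */
	return pos + 12;
}
```

With a 12-byte header and one 16-byte chunk, the chunk lands at offset 36
(header plus two table rows) and the terminator records offset 52, which is
exactly where the trailing hash would begin.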

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/chunk-format.txt      | 54 +++++++++++++++++++
 .../technical/commit-graph-format.txt         |  3 ++
 Documentation/technical/pack-format.txt       |  3 ++
 3 files changed, 60 insertions(+)
 create mode 100644 Documentation/technical/chunk-format.txt

diff --git a/Documentation/technical/chunk-format.txt b/Documentation/technical/chunk-format.txt
new file mode 100644
index 0000000000..3db3792dea
--- /dev/null
+++ b/Documentation/technical/chunk-format.txt
@@ -0,0 +1,54 @@
+Chunk-based file formats
+========================
+
+Some file formats in Git use a common concept of "chunks" to describe
+sections of the file. This allows structured access to a large file by
+scanning a small "table of contents" for the remaining data. This common
+format is used by the `commit-graph` and `multi-pack-index` files. See
+link:technical/pack-format.html[the `multi-pack-index` format] and
+link:technical/commit-graph-format.html[the `commit-graph` format] for
+how they use the chunks to describe structured data.
+
+A chunk-based file format begins with some header information custom to
+that format. That header should include enough information to identify
+the file type, format version, and number of chunks in the file. From this
+information, a reader can determine the start of the chunk-based region.
+
+The chunk-based region starts with a table of contents describing where
+each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
+where C is the number of chunks. Consider the following table:
+
+  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
+  |--------------------|------------------------|
+  | ID[0]              | OFFSET[0]              |
+  | ...                | ...                    |
+  | ID[C]              | OFFSET[C]              |
+  | 0x0000             | OFFSET[C+1]            |
+
+Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
+Each integer is stored in network-byte order.
+
+The chunk identifier `ID[i]` is a label for the data stored within this
+file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
+size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
+and `OFFSET[i]`. This requires that the chunk data appears contiguously
+in the same order as the table of contents.
+
+The final entry in the table of contents must be four zero bytes. This
+confirms that the table of contents is ending and provides the offset for
+the end of the chunk-based data.
+
+Note: The chunk-based format expects that the file contains _at least_ a
+trailing hash after `OFFSET[C+1]`.
+
+Functions for working with chunk-based file formats are declared in
+`chunk-format.h`. Using these methods provides extra checks that assist
+developers when creating new file formats, including:
+
+ 1. Writing and reading the table of contents.
+
+ 2. Verifying that the data written in a chunk matches the expected size
+    that was recorded in the table of contents.
+
+ 3. Checking that a table of contents describes offsets properly within
+    the file boundaries.
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b3b58880b9..b92442780e 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -65,6 +65,9 @@ CHUNK LOOKUP:
       the length using the next chunk position if necessary.) Each chunk
       ID appears at most once.
 
+  The CHUNK LOOKUP matches the table of contents from
+  link:technical/chunk-format.html[the chunk-based file format].
+
   The remaining data in the body is described one chunk at a time, and
   these chunks may be given in any order. Chunks are required unless
   otherwise specified.
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index f96b2e605f..2fb1e60d29 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -301,6 +301,9 @@ CHUNK LOOKUP:
 	    (Chunks are provided in file-order, so you can infer the length
 	    using the next chunk position if necessary.)
 
+	The CHUNK LOOKUP matches the table of contents from
+	link:technical/chunk-format.html[the chunk-based file format].
+
 	The remaining data in the body is described one chunk at a time, and
 	these chunks may be given in any order. Chunks are required unless
 	otherwise specified.
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 05/15] midx: add entries to write_midx_context
  2020-12-03 16:16 ` [PATCH 05/15] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
@ 2020-12-03 21:42   ` Junio C Hamano
  2020-12-04 13:39     ` Derrick Stolee
  2020-12-08 18:00   ` Taylor Blau
  1 sibling, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2020-12-03 21:42 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, szeder.dev, me, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> -static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
> -				    struct pack_midx_entry *objects,
> -				    uint32_t nr_objects)
> +static size_t write_midx_oid_lookup(struct hashfile *f,
> +				    void *data)
>  {
> -	struct pack_midx_entry *list = objects;
> +	struct write_midx_context *ctx = (struct write_midx_context *)data;
> +	unsigned char hash_len = the_hash_algo->rawsz;
> +	struct pack_midx_entry *list = ctx->entries;

I know this is meant to be a faithful rewrite, but can we lose this
"length of the hash function output must be smaller than 256"
imposed by "unsigned char" at some point, perhaps after this series
settles?  .rawsz field is size_t IIRC.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 08/15] midx: convert chunk write methods to return int
  2020-12-03 16:16 ` [PATCH 08/15] midx: convert chunk write methods to return int Derrick Stolee via GitGitGadget
@ 2020-12-03 21:50   ` Junio C Hamano
  2020-12-04 13:40     ` Derrick Stolee
  0 siblings, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2020-12-03 21:50 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, szeder.dev, me, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>

> Subject: Re: [PATCH 08/15] midx: convert chunk write methods to return int

Please rephrase the title to avoid misleading the readers; it is not
like we used to return the number of bytes written in size_t but we
realized that we never write more than INTMAX and narrowed the
return type to "int".

In other words, returning "int" is not the most important thing in
this change.  What is more important is that the function will
eventually signal success/failure with its return value.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/15] midx: use 64-bit multiplication for chunk sizes
  2020-12-03 16:16 ` [PATCH 11/15] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
@ 2020-12-03 22:00   ` Junio C Hamano
  2020-12-08 18:43     ` Taylor Blau
  0 siblings, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2020-12-03 22:00 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, szeder.dev, me, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> When calculating the sizes of certain chunks, we should use 64-bit
> multiplication always. This allows us to properly predict the chunk
> sizes without risk of overflow.

That's quite an obvious fix, applicable even before this entire
series.  I think we saw quite similar bugfixes in different parts of
the codebase recently.

So far, everything looks good.


> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 0548266bea..47f5f60fcd 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -946,12 +946,12 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  	chunks[2].write_fn = write_midx_oid_lookup;
>  
>  	chunks[3].id = MIDX_CHUNKID_OBJECTOFFSETS;
> -	chunks[3].size = ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
> +	chunks[3].size = (uint64_t)ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
>  	chunks[3].write_fn = write_midx_object_offsets;
>  
>  	if (ctx.large_offsets_needed) {
>  		chunks[4].id = MIDX_CHUNKID_LARGEOFFSETS;
> -		chunks[4].size = ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
> +		chunks[4].size = (uint64_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
>  		chunks[4].write_fn = write_midx_large_offsets;
>  	}

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 13/15] chunk-format: create chunk reading API
  2020-12-03 16:16 ` [PATCH 13/15] chunk-format: create chunk reading API Derrick Stolee via GitGitGadget
@ 2020-12-03 22:17   ` Junio C Hamano
  2020-12-04 13:47     ` Derrick Stolee
  2020-12-03 22:43   ` Junio C Hamano
  1 sibling, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2020-12-03 22:17 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, Abhishek Kumar
  Cc: git, szeder.dev, me, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> Now that the chunk-format API has a consistent mechanism for writing
> file formats based on chunks, let's extend it to also parse chunk-based
> files during read.

Heads-up.  The generation-data work by Abhishek would conflict with
how this series refactors the way the beginning offset of each
chunk is found.


* Re: [PATCH 13/15] chunk-format: create chunk reading API
  2020-12-03 16:16 ` [PATCH 13/15] chunk-format: create chunk reading API Derrick Stolee via GitGitGadget
  2020-12-03 22:17   ` Junio C Hamano
@ 2020-12-03 22:43   ` Junio C Hamano
  2020-12-04 13:45     ` Derrick Stolee
  1 sibling, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2020-12-03 22:43 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, szeder.dev, me, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
> ...
> -		case GRAPH_CHUNKID_DATA:
> -			if (graph->chunk_commit_data)
> -				chunk_repeated = 1;
> -			else
> -				graph->chunk_commit_data = data + chunk_offset;
> -			break;
> +	/* limit the chunk-format list if we are ignoring Bloom filters */
> +	if (!r->settings.commit_graph_read_changed_paths)
> +		chunks_nr = 5;

I presume that this magic number 5 is directly connected to ...

> +static struct read_chunk_info read_chunks[] = {
> +	[0] = {
> +		GRAPH_CHUNKID_OIDFANOUT,
> +		graph_read_oid_fanout
> +	},
> +	[1] = {
> +		GRAPH_CHUNKID_OIDLOOKUP,
> +		graph_read_oid_lookup
> +	},
> +	[2] = {
> +		GRAPH_CHUNKID_DATA,
> +		graph_read_data
> +	},
> +	[3] = {
> +		GRAPH_CHUNKID_EXTRAEDGES,
> +		graph_read_extra_edges
> +	},
> +	[4] = {
> +		GRAPH_CHUNKID_BASE,
> +		graph_read_base_graphs
> +	},
> +	[5] = {
> +		GRAPH_CHUNKID_BLOOMINDEXES,
> +		graph_read_bloom_indices
> +	},
> +	[6] = {
> +		GRAPH_CHUNKID_BLOOMDATA,
> +		graph_read_bloom_data
> +	}
> +};

... the index of these entries in the table.

I find this arrangement brittle.  For one thing, it means that new
chunks cannot be added at an arbitrary place, and BLOOMDATA must be
at the end for the "limit the list when ignoring" logic to work
correctly (even when the developer who is adding a new chunk type
realizes that new one must be inserted before [5], and the magic
number 5 above must be updated), and it won't work at all if a
similar "we can optionally ignore reading the data" treatment needs
to be made for new chunk types, since two or more things cannot be
at the end of the list at the same time X-<.

Would it be a cleaner way to implement this "when member X in the
settings structure is not set, ignore BLOOMINDEXES and BLOOMDATA"
feature to let these graph_read_*() functions have access to the
settings structure, so that bloom_indices and bloom_data callback
functions can make decisions for themselves?

Once we do that, as far as I can tell, there is no longer any reason
to initialize read_chunks[] array with designated initializer.  The
implementation of read_table_of_contents() does not tolerate this
array being sparsely populated anyway, and except for the "do we ignore
bloom?" hack, nothing really depends on the order of entries in the
array, right?
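As a rough, self-contained model of the callback-decides approach (the struct and function names below are simplified stand-ins invented for illustration, not git's actual types), each read callback consults the settings itself, so the chunk stays "known" to the table-of-contents reader even when its payload is ignored:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for git's structures (illustrative only). */
struct gr_settings {
	int commit_graph_read_changed_paths;
};

struct gr_graph {
	struct gr_settings settings;
	int fanout_loaded;
	int bloom_loaded;
};

static int read_oid_fanout(const unsigned char *start, size_t len, void *data)
{
	struct gr_graph *g = data;
	(void)start;
	(void)len;
	g->fanout_loaded = 1;
	return 0;
}

/* The callback decides for itself whether to honor the chunk. */
static int read_bloom_indices(const unsigned char *start, size_t len, void *data)
{
	struct gr_graph *g = data;
	(void)start;
	(void)len;
	if (!g->settings.commit_graph_read_changed_paths)
		return 0; /* known chunk type, intentionally ignored */
	g->bloom_loaded = 1;
	return 0;
}

/* Dispatch both callbacks and report whether the Bloom data was taken. */
static int bloom_loaded_with_setting(int enabled)
{
	struct gr_graph g = { { enabled }, 0, 0 };
	read_oid_fanout(NULL, 0, &g);
	read_bloom_indices(NULL, 0, &g);
	assert(g.fanout_loaded);
	return g.bloom_loaded;
}
```

With this shape the dispatch table needs no ordering constraint and no magic cutoff index; turning a chunk off is local to its own callback.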

Thanks.


* Re: [PATCH 00/15] Refactor chunk-format into an API
  2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (14 preceding siblings ...)
  2020-12-03 16:16 ` [PATCH 15/15] chunk-format: add technical docs Derrick Stolee via GitGitGadget
@ 2020-12-04 12:48 ` René Scharfe
  2020-12-04 13:57   ` Derrick Stolee
                     ` (2 more replies)
  15 siblings, 3 replies; 40+ messages in thread
From: René Scharfe @ 2020-12-04 12:48 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git; +Cc: szeder.dev, me, Derrick Stolee

Am 03.12.20 um 17:16 schrieb Derrick Stolee via GitGitGadget:
> I was thinking about file formats recently and realized that the "chunks"
> that are common to the commit-graph and multi-pack-index could inform future
> file formats. To make that process easier, let's combine the process of
> writing and reading chunks into a common API that both of these existing
> formats use.
>
> There is some extra benefit immediately: the writing and reading code for
> each gets a bit cleaner. Also, there were different checks in each that made
> the process more robust. Now, these share a common set of checks.

>  Documentation/technical/chunk-format.txt      |  54 ++
>  .../technical/commit-graph-format.txt         |   3 +
>  Documentation/technical/pack-format.txt       |   3 +
>  Makefile                                      |   1 +
>  chunk-format.c                                | 105 ++++
>  chunk-format.h                                |  69 +++
>  commit-graph.c                                | 298 ++++++-----
>  midx.c                                        | 466 ++++++++----------
>  t/t5318-commit-graph.sh                       |   2 +-
>  t/t5319-multi-pack-index.sh                   |   6 +-
>  10 files changed, 623 insertions(+), 384 deletions(-)

623-384-54-3-3-1-69-2-6 = 101

So if we ignore changes to documentation, headers, tests and build
script this spends ca. 100 more lines of code than the current version.
That's roughly the size of the new file chunk-format.c -- from this
bird's-eye-view the new API seems to be pure overhead.

In the new code I see several magic numbers, use of void pointers and
casting as well as repetition -- is this really going in the right
direction?  I get the feeling that YAGNI.

René


* Re: [PATCH 05/15] midx: add entries to write_midx_context
  2020-12-03 21:42   ` Junio C Hamano
@ 2020-12-04 13:39     ` Derrick Stolee
  0 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee @ 2020-12-04 13:39 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, szeder.dev, me, Derrick Stolee, Derrick Stolee

On 12/3/2020 4:42 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> -static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
>> -				    struct pack_midx_entry *objects,
>> -				    uint32_t nr_objects)
>> +static size_t write_midx_oid_lookup(struct hashfile *f,
>> +				    void *data)
>>  {
>> -	struct pack_midx_entry *list = objects;
>> +	struct write_midx_context *ctx = (struct write_midx_context *)data;
>> +	unsigned char hash_len = the_hash_algo->rawsz;
>> +	struct pack_midx_entry *list = ctx->entries;
> 
> I know this is meant to be a faithful rewrite, but can we lose this
> "length of the hash function output must be smaller than 256"
> imposed by "unsigned char" at some point, perhaps after this series
> settles?  .rawsz field is size_t IIRC.

There's no reason why this should be "unsigned char" except that
I was probably thinking about the "hash version" byte in the
header when I added this line. Clearly there isn't a failure (yet),
but it's better to be safe.

Thanks,
-Stolee


* Re: [PATCH 08/15] midx: convert chunk write methods to return int
  2020-12-03 21:50   ` Junio C Hamano
@ 2020-12-04 13:40     ` Derrick Stolee
  0 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee @ 2020-12-04 13:40 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, szeder.dev, me, Derrick Stolee, Derrick Stolee

On 12/3/2020 4:50 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
> 
>> Subject: Re: [PATCH 08/15] midx: convert chunk write methods to return int
> 
> Please rephrase the title to avoid misleading the readers; it is not
> like we used to return the number of bytes written in size_t but we
> realized that we never write more than INT_MAX and narrowed the
> return type to "int".
> 
> In other words, returning "int" is not the most important thing in
> this change.  What is more important is that the function will
> eventually signal success/failure with its return value.
 
I suppose "return int" is what I meant, but this is probably better:

 midx: return success/failure in chunk write methods

Thanks,
-Stolee


* Re: [PATCH 13/15] chunk-format: create chunk reading API
  2020-12-03 22:43   ` Junio C Hamano
@ 2020-12-04 13:45     ` Derrick Stolee
  0 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee @ 2020-12-04 13:45 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, szeder.dev, me, Derrick Stolee, Derrick Stolee

On 12/3/2020 5:43 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>> ...
>> -		case GRAPH_CHUNKID_DATA:
>> -			if (graph->chunk_commit_data)
>> -				chunk_repeated = 1;
>> -			else
>> -				graph->chunk_commit_data = data + chunk_offset;
>> -			break;
>> +	/* limit the chunk-format list if we are ignoring Bloom filters */
>> +	if (!r->settings.commit_graph_read_changed_paths)
>> +		chunks_nr = 5;
> 
> I presume that this magic number 5 is directly connected to ...
> 
>> +static struct read_chunk_info read_chunks[] = {
>> +	[0] = {
>> +		GRAPH_CHUNKID_OIDFANOUT,
>> +		graph_read_oid_fanout
>> +	},
>> +	[1] = {
>> +		GRAPH_CHUNKID_OIDLOOKUP,
>> +		graph_read_oid_lookup
>> +	},
>> +	[2] = {
>> +		GRAPH_CHUNKID_DATA,
>> +		graph_read_data
>> +	},
>> +	[3] = {
>> +		GRAPH_CHUNKID_EXTRAEDGES,
>> +		graph_read_extra_edges
>> +	},
>> +	[4] = {
>> +		GRAPH_CHUNKID_BASE,
>> +		graph_read_base_graphs
>> +	},
>> +	[5] = {
>> +		GRAPH_CHUNKID_BLOOMINDEXES,
>> +		graph_read_bloom_indices
>> +	},
>> +	[6] = {
>> +		GRAPH_CHUNKID_BLOOMDATA,
>> +		graph_read_bloom_data
>> +	}
>> +};
> 
> ... the index of these entries in the table.
> 
> I find this arrangement brittle.  For one thing, it means that new
> chunks cannot be added at an arbitrary place, and BLOOMDATA must be
> at the end for the "limit the list when ignoring" logic to work
> correctly (even when the developer who is adding a new chunk type
> realizes that new one must be inserted before [5], and the magic
> number 5 above must be updated), and it won't work at all if a
> similar "we can optionally ignore reading the data" treatment needs
> to be made for new chunk types, since two or more things cannot be
> at the end of the list at the same time X-<.
> 
> Would it be a cleaner way to implement this "when member X in the
> settings structure is not set, ignore BLOOMINDEXES and BLOOMDATA"
> feature to let these graph_read_*() functions have access to the
> settings structure, so that bloom_indices and bloom_data callback
> functions can make decisions for themselves?
> 
> Once we do that, as far as I can tell, there is no longer any reason
> to initialize read_chunks[] array with designated initializer.  The
> implementation of read_table_of_contents() does not tolerate this
> array being sparsely populated anyway, and except for the "do we ignore
> bloom?" hack, nothing really depends on the order of entries in the
> array, right?

This is a good point, and it's my fault for not noticing this behavior.
In an earlier version, I was initializing read_chunks dynamically
inside parse_commit_graph(), where the change made more sense (the
Bloom chunks were not initialized at all based on this condition). I
think the better way to handle this is to check the config within
the reading functions graph_read_bloom_(indices|data).

The added benefit of checking in the read_chunk_fn is that the
chunk-format API can see that these chunks are "known" chunk types,
in case we were to add logging for "I don't understand this chunk
type".

Thanks,
-Stolee


* Re: [PATCH 13/15] chunk-format: create chunk reading API
  2020-12-03 22:17   ` Junio C Hamano
@ 2020-12-04 13:47     ` Derrick Stolee
  2020-12-04 20:17       ` Junio C Hamano
  0 siblings, 1 reply; 40+ messages in thread
From: Derrick Stolee @ 2020-12-04 13:47 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget, Abhishek Kumar
  Cc: git, szeder.dev, me, Derrick Stolee, Derrick Stolee,
	René Scharfe

On 12/3/2020 5:17 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Now that the chunk-format API has a consistent mechanism for writing
>> file formats based on chunks, let's extend it to also parse chunk-based
>> files during read.
> 
> Heads-up.  The generation-data work by Abhishek would conflict with
> how this series refactors the way the beginning offset of each
> chunk is found.

Right! I should have checked against 'seen'. Abhishek's work should
take precedence and I can adapt on top of his topic. For now, feel
free to keep this topic out of 'seen' while we address the worth
of the topic (re: René's concerns).

Thanks,
-Stolee



* Re: [PATCH 00/15] Refactor chunk-format into an API
  2020-12-04 12:48 ` [PATCH 00/15] Refactor chunk-format into an API René Scharfe
@ 2020-12-04 13:57   ` Derrick Stolee
  2020-12-04 19:42   ` Junio C Hamano
  2020-12-08 18:49   ` Taylor Blau
  2 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee @ 2020-12-04 13:57 UTC (permalink / raw)
  To: René Scharfe, Derrick Stolee via GitGitGadget, git
  Cc: szeder.dev, me, Derrick Stolee, Abhishek Kumar

On 12/4/2020 7:48 AM, René Scharfe wrote:
> Am 03.12.20 um 17:16 schrieb Derrick Stolee via GitGitGadget:
...
>>  Documentation/technical/chunk-format.txt      |  54 ++
>>  .../technical/commit-graph-format.txt         |   3 +
>>  Documentation/technical/pack-format.txt       |   3 +
>>  Makefile                                      |   1 +
>>  chunk-format.c                                | 105 ++++
>>  chunk-format.h                                |  69 +++
>>  commit-graph.c                                | 298 ++++++-----
>>  midx.c                                        | 466 ++++++++----------
>>  t/t5318-commit-graph.sh                       |   2 +-
>>  t/t5319-multi-pack-index.sh                   |   6 +-
>>  10 files changed, 623 insertions(+), 384 deletions(-)
> 
> 623-384-54-3-3-1-69-2-6 = 101
> 
> So if we ignore changes to documentation, headers, tests and build
> script this spends ca. 100 more lines of code than the current version.
> That's roughly the size of the new file chunk-format.c -- from this
> bird's-eye-view the new API seems to be pure overhead.

Overhead in terms of lines of code, but many of those are function
prototypes and single lines containing only "{" and "}". So yes,
the code files are a bit longer, but the amount of executed code is
not meaningfully different.

Extra lines of code is an expected cost of refactoring. The remaining
question is, "is it worth the cost?" I believe it is.
 
> In the new code I see several magic numbers, use of void pointers and
> casting as well as repetition -- is this really going in the right
> direction?  I get the feeling that YAGNI.

void pointers are a cost of abstraction in C that we use all over the
codebase.

You (and Junio) are right to point out my magic numbers. Those should
be replaced with something better when possible.

As far as YAGNI, I doubt that very much. First, we have already seen
extensions to the commit-graph that added several new chunks, and
plugging into this (documented) API should be easier than the previous
ad-hoc mechanism.

I've CC'd Abhishek to get his opinion, since he's recently added chunks
to the commit-graph file. Outside of the fact that this series conflicts
with his series (which I will fix), it would be good to see if he
appreciates this model.

>> I was thinking about file formats recently and realized that the "chunks"
>> that are common to the commit-graph and multi-pack-index could inform future
>> file formats. To make that process easier, let's combine the process of
>> writing and reading chunks into a common API that both of these existing
>> formats use.

And another point on YAGNI: I'm literally prototyping a new file format and
want to use this API to build it instead of repeating myself. Specifically,
I noticed that the commit-graph and multi-pack-index were inconsistent in
how they protected the file format in different ways during writes and reads.
This leads to...

>> There is some extra benefit immediately: the writing and reading code for
>> each gets a bit cleaner. Also, there were different checks in each that made
>> the process more robust. Now, these share a common set of checks.

...my point that combining these checks makes both codepaths slightly more
robust. I didn't even include the potential extension of storing the size
of each chunk in "struct commit_graph" and "struct multi_pack_index" for
run-time bound checks during lookups. That seemed like too much new
behavior for a series that intends to only refactor.
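A hypothetical sketch of such a run-time bound check (the function name and signature are invented for illustration; the idea is only that a recorded chunk size lets a lookup reject out-of-range positions instead of reading past the mapped region):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 0 and set *out to record 'pos' in a chunk of fixed-width
 * records, or return -1 if that record would extend past the chunk
 * size recorded at parse time.
 */
static int chunk_get_record(const unsigned char *chunk, uint64_t chunk_size,
			    uint32_t record_width, uint32_t pos,
			    const unsigned char **out)
{
	uint64_t start = (uint64_t)pos * record_width;

	if (start + record_width > chunk_size)
		return -1; /* lookup would read past the chunk boundary */
	*out = chunk + start;
	return 0;
}

/* Probe a 40-byte chunk of 8-byte records at a given position. */
static int lookup_ok(uint32_t pos)
{
	static const unsigned char chunk[40];
	const unsigned char *rec;

	return chunk_get_record(chunk, sizeof(chunk), 8, pos, &rec);
}
```

A 40-byte chunk holds five 8-byte records, so positions 0 through 4 succeed and position 5 is rejected.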

Thanks,
-Stolee


* Re: [PATCH 00/15] Refactor chunk-format into an API
  2020-12-04 12:48 ` [PATCH 00/15] Refactor chunk-format into an API René Scharfe
  2020-12-04 13:57   ` Derrick Stolee
@ 2020-12-04 19:42   ` Junio C Hamano
  2020-12-08 18:49   ` Taylor Blau
  2 siblings, 0 replies; 40+ messages in thread
From: Junio C Hamano @ 2020-12-04 19:42 UTC (permalink / raw)
  To: René Scharfe
  Cc: Derrick Stolee via GitGitGadget, git, szeder.dev, me,
	Derrick Stolee

René Scharfe <l.s.r@web.de> writes:

>>  Documentation/technical/chunk-format.txt      |  54 ++
>>  .../technical/commit-graph-format.txt         |   3 +
>>  Documentation/technical/pack-format.txt       |   3 +
>>  Makefile                                      |   1 +
>>  chunk-format.c                                | 105 ++++
>>  chunk-format.h                                |  69 +++
>>  commit-graph.c                                | 298 ++++++-----
>>  midx.c                                        | 466 ++++++++----------
>>  t/t5318-commit-graph.sh                       |   2 +-
>>  t/t5319-multi-pack-index.sh                   |   6 +-
>>  10 files changed, 623 insertions(+), 384 deletions(-)
>
> 623-384-54-3-3-1-69-2-6 = 101
>
> So if we ignore changes to documentation, headers, tests and build
> script this spends ca. 100 more lines of code than the current version.
> That's roughly the size of the new file chunk-format.c -- from this
> bird's-eye-view the new API seems to be pure overhead.
>
> In the new code I see several magic numbers, use of void pointers and
> casting as well as repetition -- is this really going in the right
> direction?  I get the feeling that YAGNI.

Hmph, two existing users consolidated into one and still not losing
lines is not a very convincing sign.  Perhaps a third existing user
would purely lose lines when converted to use this (do we have a
third or fourth one?)  I dunno.



* Re: [PATCH 13/15] chunk-format: create chunk reading API
  2020-12-04 13:47     ` Derrick Stolee
@ 2020-12-04 20:17       ` Junio C Hamano
  0 siblings, 0 replies; 40+ messages in thread
From: Junio C Hamano @ 2020-12-04 20:17 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, Abhishek Kumar, git, szeder.dev,
	me, Derrick Stolee, Derrick Stolee, René Scharfe

Derrick Stolee <stolee@gmail.com> writes:

> On 12/3/2020 5:17 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> 
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>
>>> Now that the chunk-format API has a consistent mechanism for writing
>>> file formats based on chunks, let's extend it to also parse chunk-based
>>> files during read.
>> 
>> Heads-up.  The generation-data work by Abhishek would conflict with
>> the way how this refactors the way the beginning offset of each
>> chunk is found.
>
> Right! I should have checked against 'seen'. Abhishek's work should
> take precedence and I can adapt on top of his topic. For now, feel
> free to keep this topic out of 'seen' while we address the worth
> of the topic (re: René's concerns).

Sure.  We may want to do [PATCH 11/15] even if the "worth of the
topic" discussion ends up negative, though.

Thanks.


* Re: [PATCH 14/15] commit-graph: restore duplicate chunk checks
  2020-12-03 16:16 ` [PATCH 14/15] commit-graph: restore duplicate chunk checks Derrick Stolee via GitGitGadget
@ 2020-12-07 13:43   ` Derrick Stolee
  0 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee @ 2020-12-07 13:43 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: szeder.dev, me, Derrick Stolee, Derrick Stolee

On 12/3/2020 11:16 AM, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The previous change introduced read_table_of_contents() in the
> chunk-format API, but dropped the duplicate chunk check from the
> commit-graph parsing logic. This was done to keep flexibility in the
> chunk-format API.

"keep flexibility" is bogus. This is the biggest YAGNI of this
series. Instead, consider the patch below instead which restores
duplicate checks for the commit-graph file AND adds them to the
multi-pack-index file due to the shared API.

This is also roughly half of the added lines from the previous
patch.

Thanks,
-Stolee

-- >8 --

From 0df4959d59d7f9df3e9f6326bb0acb7b84f84980 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Mon, 7 Dec 2020 08:36:42 -0500
Subject: [PATCH] chunk-format: restore duplicate chunk checks

Before refactoring into the chunk-format API, the commit-graph parsing
logic included checks for duplicate chunks. It is unlikely that we would
desire a chunk-based file format that allows duplicate chunk IDs in the
table of contents, so add duplicate checks into
read_table_of_contents().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 chunk-format.c | 15 ++++++++++++++-
 chunk-format.h |  3 +++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/chunk-format.c b/chunk-format.c
index d888ef6ec73..a4891bbd28a 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -57,9 +57,13 @@ int read_table_of_contents(const unsigned char *mfile,
 			   int nr,
 			   void *data)
 {
+	int i;
 	uint32_t chunk_id;
 	const unsigned char *table_of_contents = mfile + toc_offset;
 
+	for (i = 0; i < nr; i++)
+		chunks[i].found = 0;
+
 	while (toc_length--) {
 		int i;
 		uint64_t chunk_offset, next_chunk_offset;
@@ -83,7 +87,16 @@ int read_table_of_contents(const unsigned char *mfile,
 		}
 		for (i = 0; i < nr; i++) {
 			if (chunks[i].id == chunk_id) {
-				int result = chunks[i].read_fn(
+				int result;
+
+				if (chunks[i].found) {
+					error(_("duplicate chunk ID %"PRIx32" found"),
+					      chunk_id);
+					return 1;
+				}
+
+				chunks[i].found = 1;
+				result = chunks[i].read_fn(
 						mfile + chunk_offset,
 						next_chunk_offset - chunk_offset,
 						data);
diff --git a/chunk-format.h b/chunk-format.h
index 7049800f734..de45797223a 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -56,6 +56,9 @@ typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
 struct read_chunk_info {
 	uint32_t id;
 	chunk_read_fn read_fn;
+
+	/* used internally */
+	unsigned found:1;
 };
 
 int read_table_of_contents(const unsigned char *mfile,
-- 
2.29.0.vfs.0.0





* Re: [PATCH 02/15] chunk-format: add API for writing table of contents
  2020-12-03 16:16 ` [PATCH 02/15] chunk-format: add API for writing table of contents Derrick Stolee via GitGitGadget
@ 2020-12-08 17:56   ` Taylor Blau
  0 siblings, 0 replies; 40+ messages in thread
From: Taylor Blau @ 2020-12-08 17:56 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, szeder.dev, Derrick Stolee, Derrick Stolee

On Thu, Dec 03, 2020 at 04:16:41PM +0000, Derrick Stolee via GitGitGadget wrote:
> @@ -0,0 +1,26 @@
> +#include "git-compat-util.h"
> +#include "chunk-format.h"
> +#include "csum-file.h"
> +#define CHUNK_LOOKUP_WIDTH 12
> +
> +void write_table_of_contents(struct hashfile *f,
> +			     uint64_t cur_offset,

Hmmph. Could we not use hashfile_total() here instead? I wonder how much
the callers could be cleaned up if they did not have to keep track
of the offset themselves.

Is this to indicate that we might call hashfd() on a file that already
has data in it at an offset that doesn't cause us to overwrite that
data? I doubt it (since both callers currently write all of their data
into a hashfile), but if so that should be documented, too.

(In either case, off_t is a more appropriate type than uint64_t here.)
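To sketch what that could look like, here is a toy model where the table-of-contents writer derives the starting offset from the file's own byte count instead of a caller-supplied cur_offset (the "toyfile" counter stands in for git's hashfile, and the 12-byte entry width matches the series; everything else is invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define TOC_ENTRY_WIDTH 12 /* 4-byte chunk ID + 8-byte offset */

/* Toy stand-in for a hashfile: just counts bytes written. */
struct toyfile {
	uint64_t total;
};

static void toy_write32(struct toyfile *f, uint32_t v)
{
	(void)v;
	f->total += 4;
}

static void toy_write64(struct toyfile *f, uint64_t v)
{
	(void)v;
	f->total += 8;
}

/*
 * Write the table of contents, computing the first chunk's offset
 * from the bytes already in the file (the hashfile_total() idea),
 * and record each computed offset in 'offsets' for inspection.
 */
static void write_toc(struct toyfile *f, const uint32_t *ids,
		      const uint64_t *sizes, int nr, uint64_t *offsets)
{
	int i;
	uint64_t off = f->total + (uint64_t)(nr + 1) * TOC_ENTRY_WIDTH;

	for (i = 0; i < nr; i++) {
		toy_write32(f, ids[i]);
		toy_write64(f, off);
		offsets[i] = off;
		off += sizes[i];
	}

	/* trailing entry marks the end of the last chunk */
	toy_write32(f, 0);
	toy_write64(f, off);
	offsets[nr] = off;
}

/* Example: a 12-byte header already written, then two chunks. */
static uint64_t toc_offset_of(int idx)
{
	struct toyfile f = { 12 };
	uint32_t ids[2] = { 0x504e414d, 0x4f494446 }; /* arbitrary IDs */
	uint64_t sizes[2] = { 100, 8 };
	uint64_t offsets[3];

	write_toc(&f, ids, sizes, 2, offsets);
	assert(f.total == offsets[0]); /* TOC ends where chunk 0 begins */
	return offsets[idx];
}
```

Here the caller never threads an offset through: the 12-byte header plus a 36-byte TOC puts the first chunk at byte 48, the second at 148, and the trailing entry at 156.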

> @@ -0,0 +1,36 @@
> +#ifndef CHUNK_FORMAT_H
> +#define CHUNK_FORMAT_H
> +
> +#include "git-compat-util.h"
> +
> +struct hashfile;
> +
> +typedef int (*chunk_write_fn)(struct hashfile *f,
> +			      void *data);
> +
> +/*
> + * When writing a chunk-based file format, collect the chunks in
> + * an array of chunk_info structs. The size stores the _expected_

I'm not clear on what "_expected_" means here. Does that mean that the
actual data written may be over- or under-sized? If not, does violating
the expectation cause an error?

I realize that I'm being pedantic, but documenting this would be useful
to callers.

> + * amount of data that will be written by write_fn.
> + */
> +struct chunk_info {
> +	uint32_t id;
> +	uint64_t size;
> +	chunk_write_fn write_fn;
> +};
> +
> +/*
> + * Write the chunk data into the supplied hashfile.
> + *
> + * * 'cur_offset' indicates the number of bytes written to the hashfile
> + *   before the table of contents starts.
> + *
> + * * 'nr' is the number of chunks with non-zero IDs, so 'nr + 1'
> + *   chunks are written in total.
> + */
> +void write_table_of_contents(struct hashfile *f,
> +			     uint64_t cur_offset,

Same goes here about cur_offset.

Thanks,
Taylor


* Re: [PATCH 05/15] midx: add entries to write_midx_context
  2020-12-03 16:16 ` [PATCH 05/15] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
  2020-12-03 21:42   ` Junio C Hamano
@ 2020-12-08 18:00   ` Taylor Blau
  1 sibling, 0 replies; 40+ messages in thread
From: Taylor Blau @ 2020-12-08 18:00 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, szeder.dev, Derrick Stolee, Derrick Stolee

On Thu, Dec 03, 2020 at 04:16:44PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> diff --git a/midx.c b/midx.c
> index 6ab655ddda..2af4452165 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -458,6 +458,9 @@ struct write_midx_context {
>  	struct multi_pack_index *m;
>  	struct progress *progress;
>  	unsigned pack_paths_checked;
> +
> +	struct pack_midx_entry *entries;
> +	uint32_t entries_nr;
>  };

I like that you have added a newline between the members used for each
of the two (so far) callback functions.

It may be helpful to add a comment to the effect of, "these are used to
write the XXX chunk via write_midx_xxx()", or something.

Thanks,
Taylor


* Re: [PATCH 10/15] midx: use chunk-format API in write_midx_internal()
  2020-12-03 16:16 ` [PATCH 10/15] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
@ 2020-12-08 18:42   ` Taylor Blau
  2020-12-10 14:36     ` Derrick Stolee
  0 siblings, 1 reply; 40+ messages in thread
From: Taylor Blau @ 2020-12-08 18:42 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, szeder.dev, Derrick Stolee, Derrick Stolee

On Thu, Dec 03, 2020 at 04:16:49PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The chunk-format API allows automatically writing the table of contents
> for a chunk-based file format when using an array of "struct
> chunk_info"s. Update write_midx_internal() to use this strategy, which
> also simplifies the chunk writing loop. This loop will be replaced with
> a chunk-format API call in an upcoming change.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c | 96 +++++++++++++---------------------------------------------
>  1 file changed, 21 insertions(+), 75 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index ce6d4339bd..0548266bea 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -11,6 +11,7 @@
>  #include "trace2.h"
>  #include "run-command.h"
>  #include "repository.h"
> +#include "chunk-format.h"
>
>  #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
>  #define MIDX_VERSION 1
> @@ -799,15 +800,14 @@ static int write_midx_large_offsets(struct hashfile *f,
>  static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
>  			       struct string_list *packs_to_drop, unsigned flags)
>  {
> -	unsigned char cur_chunk, num_chunks = 0;
> +	unsigned char num_chunks = 0;
>  	char *midx_name;
>  	uint32_t i;
>  	struct hashfile *f = NULL;
>  	struct lock_file lk;
>  	struct write_midx_context ctx = { 0 };
>  	uint64_t header_size = 0;
> -	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
> -	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
> +	struct chunk_info chunks[MIDX_MAX_CHUNKS];
>  	int pack_name_concat_len = 0;
>  	int dropped_packs = 0;
>  	int result = 0;
> @@ -923,7 +923,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  	if (ctx.m)
>  		close_midx(ctx.m);
>
> -	cur_chunk = 0;
>  	num_chunks = ctx.large_offsets_needed ? 5 : 4;
>
>  	if (ctx.nr - dropped_packs == 0) {
> @@ -934,85 +933,32 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>
>  	header_size = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
>
> -	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
> -	chunk_offsets[cur_chunk] = header_size + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
> +	chunks[0].id = MIDX_CHUNKID_PACKNAMES;
> +	chunks[0].size = pack_name_concat_len;
> +	chunks[0].write_fn = write_midx_pack_names;

[...]

Hmm. The caller has to do quite a lot of work in order to feed chunks
into the new chunk-format code.

I wonder what it would look like if we introduced a new 'struct
chunkfile' type. It would know about the underlying hashfile that it's
writing to, as well as the chunks it's supposed to write. It would
likely support three operations: add a new chunk, write the TOC, and
write the contents (the latter two could probably be combined into one).

Below is a patch to do just that. It seems to work OK on t5319, but
fails some tests in t5318. I haven't looked into the commit-graph test
failures, but I'd be happy to do so if you think this is a direction
worth going in.

Before I show the patch, though, let's take a look at the stat:

 chunk-format.c | 61 ++++++++++++++++++++++++++++++++++++++----------
 chunk-format.h | 25 ++++++++++++--------
 commit-graph.c | 90 ++++++++++++++++++++++-------------------------------------------------
 midx.c         | 38 +++++++++++++-----------------
 4 files changed, 107 insertions(+), 107 deletions(-)

That's a little bit more code in the chunk-format code, which I can live
with, but the diff in commit-graph.c and midx.c is a net-negative, as I
would expect it to be. So, I'm happy that this seems to make things
easier on the callers.

OK, enough rambling. Here's the patch:

diff --git a/chunk-format.c b/chunk-format.c
index 771b6d98d0..14f2fe5c9a 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -1,26 +1,63 @@
 #include "git-compat-util.h"
 #include "chunk-format.h"
 #include "csum-file.h"
+#include "cache.h"
 #define CHUNK_LOOKUP_WIDTH 12

-void write_table_of_contents(struct hashfile *f,
-			     uint64_t cur_offset,
-			     struct chunk_info *chunks,
-			     int nr)
+void chunkfile_init(struct chunkfile *out, struct hashfile *f)
 {
-	int i;
+	if (!out)
+		BUG("cannot initialize null chunkfile");
+
+	memset(out, 0, sizeof(*out));
+	out->f = f;
+}
+
+void chunkfile_push_chunk(struct chunkfile *c,
+			  uint32_t id, uint64_t size, chunk_write_fn write_fn)
+{
+	ALLOC_GROW(c->chunks, c->chunks_nr + 1, c->chunks_alloc);
+	c->chunks[c->chunks_nr].id = id;
+	c->chunks[c->chunks_nr].size = size;
+	c->chunks[c->chunks_nr].write_fn = write_fn;
+	c->chunks_nr++;
+}
+
+void chunkfile_write_toc(struct chunkfile *c)
+{
+	size_t i;
+	off_t cur_offset = hashfile_total(c->f);

 	/* Add the table of contents to the current offset */
-	cur_offset += (nr + 1) * CHUNK_LOOKUP_WIDTH;
+	cur_offset += (c->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;

-	for (i = 0; i < nr; i++) {
-		hashwrite_be32(f, chunks[i].id);
-		hashwrite_be64(f, cur_offset);
+	for (i = 0; i < c->chunks_nr; i++) {
+		hashwrite_be32(c->f, c->chunks[i].id);
+		hashwrite_be64(c->f, cur_offset);

-		cur_offset += chunks[i].size;
+		cur_offset += c->chunks[i].size;
 	}

 	/* Trailing entry marks the end of the chunks */
-	hashwrite_be32(f, 0);
-	hashwrite_be64(f, cur_offset);
+	hashwrite_be32(c->f, 0);
+	hashwrite_be64(c->f, cur_offset);
+}
+
+void chunkfile_write_chunks(struct chunkfile *c, void *ctx)
+{
+	size_t i;
+
+	for (i = 0; i < c->chunks_nr; i++) {
+		off_t before, after;
+		struct chunk_info *chunk = &c->chunks[i];
+		if (!chunk->write_fn)
+			continue;
+
+		before = hashfile_total(c->f);
+		chunk->write_fn(c->f, ctx);
+		after = hashfile_total(c->f);
+
+		if (after - before != chunk->size)
+			BUG("unexpected chunk size");
+	}
 }
diff --git a/chunk-format.h b/chunk-format.h
index 4b9cbeb372..b0fb515425 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -19,18 +19,23 @@ struct chunk_info {
 	chunk_write_fn write_fn;
 };

+struct chunkfile {
+	struct hashfile *f;
+
+	struct chunk_info *chunks;
+	size_t chunks_nr;
+	size_t chunks_alloc;
+};
+
+void chunkfile_init(struct chunkfile *out, struct hashfile *f);
+void chunkfile_push_chunk(struct chunkfile *c,
+			  uint32_t id, uint64_t size, chunk_write_fn write_fn);
+
 /*
  * Write the chunk data into the supplied hashfile.
- *
- * * 'cur_offset' indicates the number of bytes written to the hashfile
- *   before the table of contents starts.
- *
- * * 'nr' is the number of chunks with non-zero IDs, so 'nr + 1'
- *   chunks are written in total.
  */
-void write_table_of_contents(struct hashfile *f,
-			     uint64_t cur_offset,
-			     struct chunk_info *chunks,
-			     int nr);
+void chunkfile_write_toc(struct chunkfile *c);
+
+void chunkfile_write_chunks(struct chunkfile *c, void *ctx);

 #endif
diff --git a/commit-graph.c b/commit-graph.c
index 5494fda1d3..abc0c9e46f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1702,11 +1702,9 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	uint32_t i;
 	int fd;
 	struct hashfile *f;
+	struct chunkfile cf;
 	struct lock_file lk = LOCK_INIT;
-	struct chunk_info chunks[MAX_NUM_CHUNKS + 1];
 	const unsigned hashsz = the_hash_algo->rawsz;
-	struct strbuf progress_title = STRBUF_INIT;
-	int num_chunks = 3;
 	struct object_id file_hash;

 	if (ctx->split) {
@@ -1753,76 +1751,42 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	}

-	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
-	chunks[0].size = GRAPH_FANOUT_SIZE;
-	chunks[0].write_fn = write_graph_chunk_fanout;
-	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
-	chunks[1].size = hashsz * ctx->commits.nr;
-	chunks[1].write_fn = write_graph_chunk_oids;
-	chunks[2].id = GRAPH_CHUNKID_DATA;
-	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
-	chunks[2].write_fn = write_graph_chunk_data;
-	if (ctx->num_extra_edges) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
-		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
-		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
-		num_chunks++;
-	}
+	chunkfile_init(&cf, f);
+
+	chunkfile_push_chunk(&cf, GRAPH_CHUNKID_OIDFANOUT,
+			     GRAPH_FANOUT_SIZE, write_graph_chunk_fanout);
+	chunkfile_push_chunk(&cf, GRAPH_CHUNKID_OIDLOOKUP,
+			     hashsz * ctx->commits.nr, write_graph_chunk_oids);
+	chunkfile_push_chunk(&cf, GRAPH_CHUNKID_DATA,
+			     (hashsz + 16) * ctx->commits.nr, write_graph_chunk_data);
+	if (ctx->num_extra_edges)
+		chunkfile_push_chunk(&cf, GRAPH_CHUNKID_EXTRAEDGES,
+				     4 * ctx->num_extra_edges,
+				     write_graph_chunk_extra_edges);
 	if (ctx->changed_paths) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
-		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
-		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
-		num_chunks++;
-		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
-		chunks[num_chunks].size = sizeof(uint32_t) * 3
-					  + ctx->total_bloom_filter_data_size;
-		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
-		num_chunks++;
-	}
-	if (ctx->num_commit_graphs_after > 1) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
-		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
-		chunks[num_chunks].write_fn = write_graph_chunk_base;
-		num_chunks++;
+		chunkfile_push_chunk(&cf, GRAPH_CHUNKID_BLOOMINDEXES,
+				     sizeof(uint32_t) * ctx->commits.nr,
+				     write_graph_chunk_bloom_indexes);
+		chunkfile_push_chunk(&cf, GRAPH_CHUNKID_BLOOMDATA,
+				     sizeof(uint32_t) * 3 + ctx->total_bloom_filter_data_size,
+				     write_graph_chunk_bloom_data);
 	}
+	if (ctx->num_commit_graphs_after > 1)
+		chunkfile_push_chunk(&cf, GRAPH_CHUNKID_BASE,
+				     hashsz * (ctx->num_commit_graphs_after - 1),
+				     write_graph_chunk_base);

-	chunks[num_chunks].id = 0;
-	chunks[num_chunks].size = 0;
+	chunkfile_push_chunk(&cf, 0, 0, NULL);

 	hashwrite_be32(f, GRAPH_SIGNATURE);

 	hashwrite_u8(f, GRAPH_VERSION);
 	hashwrite_u8(f, oid_version());
-	hashwrite_u8(f, num_chunks);
+	hashwrite_u8(f, cf.chunks_nr - 1);
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);

-	write_table_of_contents(f, /* cur_offset */ 8, chunks, num_chunks);
-
-	if (ctx->report_progress) {
-		strbuf_addf(&progress_title,
-			    Q_("Writing out commit graph in %d pass",
-			       "Writing out commit graph in %d passes",
-			       num_chunks),
-			    num_chunks);
-		ctx->progress = start_delayed_progress(
-			progress_title.buf,
-			num_chunks * ctx->commits.nr);
-	}
-
-	for (i = 0; i < num_chunks; i++) {
-		uint64_t start_offset = f->total + f->offset;
-
-		if (chunks[i].write_fn(f, ctx))
-			return -1;
-
-		if (f->total + f->offset != start_offset + chunks[i].size)
-			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
-			    chunks[i].size, chunks[i].id,
-			    f->total + f->offset - start_offset);
-	}
-
-	stop_progress(&ctx->progress);
-	strbuf_release(&progress_title);
+	chunkfile_write_toc(&cf);
+	chunkfile_write_chunks(&cf, ctx);

 	if (ctx->split && ctx->base_graph_name && ctx->num_commit_graphs_after > 1) {
 		char *new_base_hash = xstrdup(oid_to_hex(&ctx->new_base_graph->oid));
diff --git a/midx.c b/midx.c
index 0548266bea..6315ea7555 100644
--- a/midx.c
+++ b/midx.c
@@ -807,7 +807,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
 	uint64_t header_size = 0;
-	struct chunk_info chunks[MIDX_MAX_CHUNKS];
+	struct chunkfile cf;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -933,33 +933,27 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *

 	header_size = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);

-	chunks[0].id = MIDX_CHUNKID_PACKNAMES;
-	chunks[0].size = pack_name_concat_len;
-	chunks[0].write_fn = write_midx_pack_names;
+	chunkfile_init(&cf, f);

-	chunks[1].id = MIDX_CHUNKID_OIDFANOUT;
-	chunks[1].size = MIDX_CHUNK_FANOUT_SIZE;
-	chunks[1].write_fn = write_midx_oid_fanout;
-
-	chunks[2].id = MIDX_CHUNKID_OIDLOOKUP;
-	chunks[2].size = ctx.entries_nr * the_hash_algo->rawsz;
-	chunks[2].write_fn = write_midx_oid_lookup;
-
-	chunks[3].id = MIDX_CHUNKID_OBJECTOFFSETS;
-	chunks[3].size = ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
-	chunks[3].write_fn = write_midx_object_offsets;
+	chunkfile_push_chunk(&cf, MIDX_CHUNKID_PACKNAMES,
+			     pack_name_concat_len, write_midx_pack_names);
+	chunkfile_push_chunk(&cf, MIDX_CHUNKID_OIDFANOUT,
+			     MIDX_CHUNK_FANOUT_SIZE, write_midx_oid_fanout);
+	chunkfile_push_chunk(&cf, MIDX_CHUNKID_OIDLOOKUP,
+			     ctx.entries_nr * the_hash_algo->rawsz, write_midx_oid_lookup);
+	chunkfile_push_chunk(&cf, MIDX_CHUNKID_OBJECTOFFSETS,
+			     ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH, write_midx_object_offsets);

 	if (ctx.large_offsets_needed) {
-		chunks[4].id = MIDX_CHUNKID_LARGEOFFSETS;
-		chunks[4].size = ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
-		chunks[4].write_fn = write_midx_large_offsets;
+		chunkfile_push_chunk(&cf, MIDX_CHUNKID_LARGEOFFSETS,
+				     ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
+				     write_midx_large_offsets);
 	}

-	write_table_of_contents(f, header_size, chunks, num_chunks);
-
-	for (i = 0; i < num_chunks; i++)
-		chunks[i].write_fn(f, &ctx);
+	chunkfile_write_toc(&cf);
+	chunkfile_write_chunks(&cf, &ctx);

+	/* maybe move this into chunkfile??? */
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 11/15] midx: use 64-bit multiplication for chunk sizes
  2020-12-03 22:00   ` Junio C Hamano
@ 2020-12-08 18:43     ` Taylor Blau
  0 siblings, 0 replies; 40+ messages in thread
From: Taylor Blau @ 2020-12-08 18:43 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, szeder.dev, Derrick Stolee,
	Derrick Stolee

On Thu, Dec 03, 2020 at 02:00:50PM -0800, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Derrick Stolee <dstolee@microsoft.com>
> >
> > When calculating the sizes of certain chunks, we should use 64-bit
> > multiplication always. This allows us to properly predict the chunk
> > sizes without risk of overflow.
>
> That's quite an obvious fix, applicable even before this entire
> series.  I think we saw quite similar bugfixes in different parts of
> the codebase recently.

:-). Indeed, Stolee and I were probably looking at output from the same
static analysis tool, which seems to be eager to catch these sorts of
multiplication-widening errors.

I'd be happy to see this patch get applied on its own, regardless of
what happens with the rest of the series.

Thanks,
Taylor

* Re: [PATCH 12/15] chunk-format: create write_chunks()
  2020-12-03 16:16 ` [PATCH 12/15] chunk-format: create write_chunks() Derrick Stolee via GitGitGadget
@ 2020-12-08 18:45   ` Taylor Blau
  0 siblings, 0 replies; 40+ messages in thread
From: Taylor Blau @ 2020-12-08 18:45 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, szeder.dev, Derrick Stolee, Derrick Stolee

On Thu, Dec 03, 2020 at 04:16:51PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The commit-graph and multi-pack-index files both use a chunk-based file
> format. They have already unified on using write_table_of_contents(),
> but we expand upon that by unifying their chunk writing loop.
>
> This takes the concepts already present in the commit-graph that were
> dropped in the multi-pack-index code during refactoring, including:
>
> * Check the hashfile for how much data was written by each write_fn.
>
> * Allow write_fn() to report an error that results in a failure
>   without using die() in the low-level commands.
>
> This simplifies the code in commit-graph.c and midx.c while laying the
> foundation for future formats using similar ideas.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  chunk-format.c | 23 +++++++++++++++++++++++
>  chunk-format.h | 13 +++++++++++++
>  commit-graph.c | 13 ++-----------
>  midx.c         |  3 +--
>  4 files changed, 39 insertions(+), 13 deletions(-)
>
> diff --git a/chunk-format.c b/chunk-format.c
> index 771b6d98d0..a6643a4fc8 100644
> --- a/chunk-format.c
> +++ b/chunk-format.c
> @@ -24,3 +24,26 @@ void write_table_of_contents(struct hashfile *f,
>  	hashwrite_be32(f, 0);
>  	hashwrite_be64(f, cur_offset);
>  }
> +
> +int write_chunks(struct hashfile *f,
> +		 struct chunk_info *chunks,
> +		 int nr,
> +		 void *data)
> +{

Serves me right for thinking that a function like write_chunks() would
be a good addition to this series without... actually reading the whole
series first ;-).

I'm glad to see that you're adding such a function, but I think that I
prefer my version which moves these four parameters into the chunkfile
struct. That allows for other functions to be created which can, for
e.g., manage the chunks themselves (like my chunkfile_push_chunk()).

Thanks,
Taylor

* Re: [PATCH 00/15] Refactor chunk-format into an API
  2020-12-04 12:48 ` [PATCH 00/15] Refactor chunk-format into an API René Scharfe
  2020-12-04 13:57   ` Derrick Stolee
  2020-12-04 19:42   ` Junio C Hamano
@ 2020-12-08 18:49   ` Taylor Blau
  2020-12-09 17:13     ` René Scharfe
  2 siblings, 1 reply; 40+ messages in thread
From: Taylor Blau @ 2020-12-08 18:49 UTC (permalink / raw)
  To: René Scharfe
  Cc: Derrick Stolee via GitGitGadget, git, szeder.dev, Derrick Stolee

On Fri, Dec 04, 2020 at 01:48:31PM +0100, René Scharfe wrote:
> Am 03.12.20 um 17:16 schrieb Derrick Stolee via GitGitGadget:
> > I was thinking about file formats recently and realized that the "chunks"
> > that are common to the commit-graph and multi-pack-index could inform future
> > file formats. To make that process easier, let's combine the process of
> > writing and reading chunks into a common API that both of these existing
> > formats use.
> >
> > There is some extra benefit immediately: the writing and reading code for
> > each gets a bit cleaner. Also, there were different checks in each that made
> > the process more robust. Now, these share a common set of checks.
>
> >  Documentation/technical/chunk-format.txt      |  54 ++
> >  .../technical/commit-graph-format.txt         |   3 +
> >  Documentation/technical/pack-format.txt       |   3 +
> >  Makefile                                      |   1 +
> >  chunk-format.c                                | 105 ++++
> >  chunk-format.h                                |  69 +++
> >  commit-graph.c                                | 298 ++++++-----
> >  midx.c                                        | 466 ++++++++----------
> >  t/t5318-commit-graph.sh                       |   2 +-
> >  t/t5319-multi-pack-index.sh                   |   6 +-
> >  10 files changed, 623 insertions(+), 384 deletions(-)
>
> 623-384-54-3-3-1-69-2-6 = 101
>
> So if we ignore changes to documentation, headers, tests and build
> script this spends ca. 100 more lines of code than the current version.
> That's roughly the size of the new file chunk-format.c -- from this
> bird's-eye-view the new API seems to be pure overhead.
>
> In the new code I see several magic numbers, use of void pointers and
> casting as well as repetition -- is this really going in the right
> direction?  I get the feeling that YAGNI.

I think that Stolee is going in the right direction. I suggested earlier
in the thread making a new "chunkfile" type which can handle allocating
new chunks, writing their tables of contents, and so on.

So, I think that we should pursue that direction a little further
before deciding whether or not this is worth continuing. My early
experiments showed that it does add a little more code to the
chunk-format.{c,h} files, but you get negative diffs in midx.c and
commit-graph.c, which is more in line with what I would expect from this
series.

I do think that the "overhead" here is more tolerable than we might
think; I'd rather have a well-documented "chunkfile" implementation
written once and called twice, than two near-identical implementations
of _writing_ the chunks / table of contents at each of the call sites.
So, even if this does end up being a net-lines-added kind of diff, I'd
still say that it's worth it.

With regards to the "YAGNI" comment... I do have thoughts about
extending the reachability bitmap format to use chunks (of course, this
would break compatibility with JGit, and it isn't something that I plan
to do in the short-term, or even necessarily in the future).

In any event, I'm sure that these two won't be the last chunk-based
formats that we have in Git.

> René

Thanks,
Taylor

* Re: [PATCH 00/15] Refactor chunk-format into an API
  2020-12-08 18:49   ` Taylor Blau
@ 2020-12-09 17:13     ` René Scharfe
  2020-12-10  0:50       ` Taylor Blau
  0 siblings, 1 reply; 40+ messages in thread
From: René Scharfe @ 2020-12-09 17:13 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Derrick Stolee via GitGitGadget, git, szeder.dev, Derrick Stolee

Am 08.12.20 um 19:49 schrieb Taylor Blau:
> So, I think that we should pursue that direction a little further
> before deciding whether or not this is worth continuing. My early
> experiments showed that it does add a little more code to the
> chunk-format.{c,h} files, but you get negative diffs in midx.c and
> commit-graph.c, which is more in line with what I would expect from this
> series.

OK.

> I do think that the "overhead" here is more tolerable than we might
> think; I'd rather have a well-documented "chunkfile" implementation
> written once and called twice, than two near-identical implementations
> of _writing_ the chunks / table of contents at each of the call sites.
> So, even if this does end up being a net-lines-added kind of diff, I'd
> still say that it's worth it.

Well, interfaces are hard, and having two similar-but-not-quite-equal
pieces of code instead of a central API implementation trying to
serve two callers can actually be better.

I'm not too familiar with the chunk producers and consumers, so I can
only offer some high-level observations.  And I don't have to use the
API, so go wild! ;)  I was just triggered by the appearance of two
working pieces of code being replaced by two slightly different pieces
of code plus a third one on top.

> With regards to the "YAGNI" comment... I do have thoughts about
> extending the reachability bitmap format to use chunks (of course, this
> would break compatibility with JGit, and it isn't something that I plan
> to do in the short-term, or even necessarily in the future).
>
> In any event, I'm sure that these two won't be the last chunk-based
> formats that we have in Git.

OK, so perhaps we can do better before this scheme is copied.  The write
side is complicated by the fact that the table of contents (TOC) is
written first, followed by the actual chunks.  That requires two passes
over the data.

The ZIP format solved a similar issue by placing the TOC at the end,
which allows for one-pass streaming.  Another way to achieve that would
be to put the TOC in a separate file, like we do for .pack and .idx
files.  This way you could have a single write function for chunks, and
writers would just be a single sequence of calls for the different
types.
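To sketch what that one-pass, trailing-TOC layout could look like (the
helper names, in-memory buffer, and struct layout here are all invented
for illustration; this is not the proposed API or an on-disk format):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_CHUNKS 8

struct toc_entry {
	uint32_t id;
	uint64_t offset; /* recorded as each chunk is written */
};

struct onepass_writer {
	unsigned char buf[1024]; /* stand-in for the output file */
	size_t len;
	struct toc_entry toc[MAX_CHUNKS + 1];
	int nr;
};

/* Append a chunk's payload and remember where it started. No size has
 * to be predicted up front, which is the point of a trailing TOC. */
static void write_chunk(struct onepass_writer *w, uint32_t id,
			const void *data, size_t size)
{
	w->toc[w->nr].id = id;
	w->toc[w->nr].offset = w->len;
	w->nr++;
	memcpy(w->buf + w->len, data, size);
	w->len += size;
}

/* Emit the TOC after all chunks, ZIP-style, with a terminating entry
 * whose id is 0 and whose offset marks the end of the chunk data. */
static void write_trailing_toc(struct onepass_writer *w)
{
	int i;
	w->toc[w->nr].id = 0;
	w->toc[w->nr].offset = w->len;
	for (i = 0; i <= w->nr; i++) {
		memcpy(w->buf + w->len, &w->toc[i], sizeof(w->toc[i]));
		w->len += sizeof(w->toc[i]);
	}
}
```

The writer streams every chunk in a single pass and only needs to buffer
the small TOC, at the cost of readers having to seek to the end first.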

But seeing that the read side just loads all of the chunks anyway
(skipping unknown IDs) I wonder why we need a TOC at all.  That would
only be useful if callers were trying to read just some small subset
of the whole file.  A collection of chunks for easy dumping and loading
could be serialized by writing just a small header for each chunk
containing its type and size followed by its payload.
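The TOC-less scheme described above would make reading a simple
sequential scan over (id, size, payload) records. A rough sketch, with
a record layout invented purely for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Walk a buffer of records, each a small header (4-byte id, 4-byte
 * size) followed by its payload, counting records with a known id and
 * skipping unknown ones -- the same tolerance the current readers have.
 * Returns -1 on a truncated payload. Hypothetical layout, not Git's. */
static int count_known_chunks(const unsigned char *buf, size_t len,
			      uint32_t known_id)
{
	size_t pos = 0;
	int found = 0;

	while (pos + 8 <= len) {
		uint32_t id, size;
		memcpy(&id, buf + pos, 4);
		memcpy(&size, buf + pos + 4, 4);
		pos += 8;
		if (pos + size > len)
			return -1; /* truncated payload */
		if (id == known_id)
			found++;
		pos += size; /* skip payload, known or not */
	}
	return found;
}
```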

René

* Re: [PATCH 00/15] Refactor chunk-format into an API
  2020-12-09 17:13     ` René Scharfe
@ 2020-12-10  0:50       ` Taylor Blau
  2020-12-10 14:30         ` Derrick Stolee
  0 siblings, 1 reply; 40+ messages in thread
From: Taylor Blau @ 2020-12-10  0:50 UTC (permalink / raw)
  To: René Scharfe
  Cc: Taylor Blau, Derrick Stolee via GitGitGadget, git, szeder.dev,
	Derrick Stolee

On Wed, Dec 09, 2020 at 06:13:18PM +0100, René Scharfe wrote:
> I'm not too familiar with the chunk producers and consumers, so I can
> only offer some high-level observations.  And I don't have to use the
> API, so go wild! ;)  I was just triggered by the appearance of two
> working pieces of code being replaced by two slightly different pieces
> of code plus a third one on top.

:-).

> > With regards to the "YAGNI" comment... I do have thoughts about
> > extending the reachability bitmap format to use chunks (of course, this
> > would break compatibility with JGit, and it isn't something that I plan
> > to do in the short-term, or even necessarily in the future).
> >
> > In any event, I'm sure that these two won't be the last chunk-based
> > formats that we have in Git.
>
> OK, so perhaps we can do better before this scheme is copied.  The write
> side is complicated by the fact that the table of contents (TOC) is
> written first, followed by the actual chunks.  That requires two passes
> over the data.

"Two passes" meaning that we have to both compute the size of and then
write the data? This is relatively cheap to do, at least so I think.

For e.g., the OIDLOOKUP commit-graph chunk is just the_hash_algo->hashsz
* commits->nr bytes wide, so that can be done in constant time. A more
heavyweight case might be for e.g., the Bloom data section, where Bloom
filters have to first be computed, their lengths accounted for, and
_then_ written when we eventually get to writing that chunk.

This happens in compute_bloom_filters(); and write_chunk_bloom_indexes()
+ write_chunk_bloom_data(), respectively. Those Bloom filters are all
stored in a commit slab until they are written, so these "two passes"
are just paid for in memory.
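The constant-time size computations mentioned above look roughly like
this (entry widths follow the ones used in the series' commit-graph
code, but treat this as a sketch rather than the exact implementation):

```c
#include <assert.h>
#include <stdint.h>

/* The fixed-width commit-graph chunks can be sized before any data is
 * written: each is a constant per-commit entry width times the number
 * of commits, plus a fixed 256-entry fanout table. */
static uint64_t oid_lookup_size(uint64_t hashsz, uint64_t nr_commits)
{
	return hashsz * nr_commits;
}

static uint64_t commit_data_size(uint64_t hashsz, uint64_t nr_commits)
{
	/* per commit: OID of tree, two parent slots, generation/date */
	return (hashsz + 16) * nr_commits;
}

static uint64_t fanout_size(void)
{
	return 256 * sizeof(uint32_t);
}
```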

> The ZIP format solved a similar issue by placing the TOC at the end,
> which allows for one-pass streaming.  Another way to achieve that would
> be to put the TOC in a separate file, like we do for .pack and .idx
> files.  This way you could have a single write function for chunks, and
> writers would just be a single sequence of calls for the different
> types.

Interesting. I'm not opposed to changing any of these formats (and maybe
there is some compelling argument for doing so, I am not sure) but I
think that unifying the implementation for reading / writing the chunk
format _before_ changing it is a positive step.

> But seeing that the read side just loads all of the chunks anyway
> (skipping unknown IDs) I wonder why we need a TOC at all.  That would
> only be useful if callers were trying to read just some small subset
> of the whole file.  A collection of chunks for easy dumping and loading
> could be serialized by writing just a small header for each chunk
> containing its type and size followed by its payload.

AFAIK, we do use the table of contents to locate where the chunks are so
that we can for e.g., set up the commit_graph structure's pointers to
point at each chunk appropriately.
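That reader-side use of the TOC amounts to a short scan. A simplified
sketch (the entry struct here is illustrative; on disk the fields are
packed big-endian, 12 bytes per entry):

```c
#include <assert.h>
#include <stdint.h>

struct toc_entry {
	uint32_t id;
	uint64_t offset;
};

/* Scan TOC entries (terminated by an entry with id 0) and return the
 * payload offset for the requested chunk, or -1 if absent -- roughly
 * what the commit-graph parser does to point its struct members at
 * each chunk after reading the table of contents. */
static int64_t find_chunk_offset(const struct toc_entry *toc, uint32_t id)
{
	int i;
	for (i = 0; toc[i].id; i++)
		if (toc[i].id == id)
			return (int64_t)toc[i].offset;
	return -1;
}
```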

> René

Thanks,
Taylor

* Re: [PATCH 00/15] Refactor chunk-format into an API
  2020-12-10  0:50       ` Taylor Blau
@ 2020-12-10 14:30         ` Derrick Stolee
  0 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee @ 2020-12-10 14:30 UTC (permalink / raw)
  To: Taylor Blau, René Scharfe
  Cc: Derrick Stolee via GitGitGadget, git, szeder.dev, Derrick Stolee

On 12/9/2020 7:50 PM, Taylor Blau wrote:
> On Wed, Dec 09, 2020 at 06:13:18PM +0100, René Scharfe wrote:
>> I'm not too familiar with the chunk producers and consumers, so I can
>> only offer some high-level observations.  And I don't have to use the
>> API, so go wild! ;)  I was just triggered by the appearance of two
>> working pieces of code being replaced by two slightly different pieces
>> of code plus a third one on top.
> 
> :-).
> 
>>> With regards to the "YAGNI" comment... I do have thoughts about
>>> extending the reachability bitmap format to use chunks (of course, this
>>> would break compatibility with JGit, and it isn't something that I plan
>>> to do in the short-term, or even necessarily in the future).
>>>
>>> In any event, I'm sure that these two won't be the last chunk-based
>>> formats that we have in Git.
>>
>> OK, so perhaps we can do better before this scheme is copied.  The write
>> side is complicated by the fact that the table of contents (TOC) is
>> written first, followed by the actual chunks.  That requires two passes
>> over the data.
> 
> "Two passes" meaning that we have to both compute the size of and then
> write the data? This is relatively cheap to do, at least so I think.
> 
> For e.g., the OIDLOOKUP commit-graph chunk is just the_hash_algo->hashsz
> * commits->nr bytes wide, so that can be done in constant time. A more
> heavyweight case might be for e.g., the Bloom data section, where Bloom
> filters have to first be computed, their lengths accounted for, and
> _then_ written when we eventually get to writing that chunk.
> 
> This happens in compute_bloom_filters(); and write_chunk_bloom_indexes()
> + write_chunk_bloom_data(), respectively. Those Bloom filters are all
> stored in a commit slab until they are written, so these "two passes"
> are just paid for in memory.

The current design of the format (TOC first) does require that we can
predict the chunk sizes before we start writing the file. But this also
has _some_ desirable properties for the writer. Specifically, we keep
the file write handle open for a shorter amount of time. How valuable is that?
:shrug:

>> The ZIP format solved a similar issue by placing the TOC at the end,
>> which allows for one-pass streaming.  Another way to achieve that would
>> be to put the TOC in a separate file, like we do for .pack and .idx
>> files.  This way you could have a single write function for chunks, and
>> writers would just be a single sequence of calls for the different
>> types.
> 
> Interesting. I'm not opposed to changing any of these formats (and maybe
> there is some compelling argument for doing so, I am not sure) but I
> think that unifying the implementation for reading / writing the chunk
> format _before_ changing it is a positive step.

Changing the TOC location would require a version increment in the file
formats, which is a bit painful.

I understand why the ZIP format does this, it is trying to stream data
through a compression algorithm and cannot store the result in memory
before writing. Perhaps we would want to consider that as a way to
reduce the memory load for things like changed-path Bloom filters, but
let's wait for that to be an actually noticeable problem before making
the change.

The TOC location is definitely optimized for readers, who are already
reading the initial header to discover some info about the file.

>> But seeing that the read side just loads all of the chunks anyway
>> (skipping unknown IDs) I wonder why we need a TOC at all.  That would
>> only be useful if callers were trying to read just some small subset
>> of the whole file.  A collection of chunks for easy dumping and loading
>> could be serialized by writing just a small header for each chunk
>> containing its type and size followed by its payload.
> 
> AFAIK, we do use the table of contents to locate where the chunks are so
> that we can for e.g., set up the commit_graph structure's pointers to
> point at each chunk appropriately.

The chunk-based format is really optimized for _structured data_ where
these sizes are mostly predictable. The chunk sizes that are not
predictable are things like the "extra edges" or changed-path Bloom
filter data, but that data is indexed by a structured chunk.

A natural fit for the chunk-based format is the reachability bitmap
format, since the current format requires a scan to discover which
commits have bitmaps. If we used chunks, then we could quickly
search a lookup table for the commits that have bitmaps, then navigate
to the binary data chunk for the bitmap itself.

Thanks,
-Stolee

* Re: [PATCH 10/15] midx: use chunk-format API in write_midx_internal()
  2020-12-08 18:42   ` Taylor Blau
@ 2020-12-10 14:36     ` Derrick Stolee
  0 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee @ 2020-12-10 14:36 UTC (permalink / raw)
  To: Taylor Blau, Derrick Stolee via GitGitGadget
  Cc: git, szeder.dev, Derrick Stolee, Derrick Stolee

On 12/8/2020 1:42 PM, Taylor Blau wrote:
> I wonder what it would look like if we introduced a new 'struct
> chunkfile' type. It would know about the underlying hashfile that it's
> writing to, as well as the chunks it's supposed to write. It would
> likely support three operations: add a new chunk, write the TOC, and
> write the contents (the latter two could probably be combined into one).
> 
> Below is a patch to do just that. It seems to work OK on t5319, but
> fails some tests in t5318. I haven't looked into the commit-graph test
> failures, but I'd be happy to do so if you think this is a direction
> worth going in.

I like this direction. It makes it easier to adapt future formats or
add new chunks. I wonder how this write interface can adapt to the
reading portion, since that is where some of the magic numbers came
about. Likely there is an equivalent that we could do to have the
chunk-format API construct the "struct chunkfile" and then the
commit-graph and multi-pack-index parsers can interpret the chunks
_after_ the TOC parse is complete.

As discussed, I'm putting this series on hold for a while so
generation number v2 is not interrupted. When I come back to it,
I'll incorporate these design ideas and build on top of that
topic.

Thanks,
-Stolee

end of thread, other threads:[~2020-12-10 18:58 UTC | newest]

Thread overview: 40+ messages
2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 01/15] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 02/15] chunk-format: add API for writing table of contents Derrick Stolee via GitGitGadget
2020-12-08 17:56   ` Taylor Blau
2020-12-03 16:16 ` [PATCH 03/15] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 04/15] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 05/15] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2020-12-03 21:42   ` Junio C Hamano
2020-12-04 13:39     ` Derrick Stolee
2020-12-08 18:00   ` Taylor Blau
2020-12-03 16:16 ` [PATCH 06/15] midx: add pack_perm " Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 07/15] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 08/15] midx: convert chunk write methods to return int Derrick Stolee via GitGitGadget
2020-12-03 21:50   ` Junio C Hamano
2020-12-04 13:40     ` Derrick Stolee
2020-12-03 16:16 ` [PATCH 09/15] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 10/15] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2020-12-08 18:42   ` Taylor Blau
2020-12-10 14:36     ` Derrick Stolee
2020-12-03 16:16 ` [PATCH 11/15] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2020-12-03 22:00   ` Junio C Hamano
2020-12-08 18:43     ` Taylor Blau
2020-12-03 16:16 ` [PATCH 12/15] chunk-format: create write_chunks() Derrick Stolee via GitGitGadget
2020-12-08 18:45   ` Taylor Blau
2020-12-03 16:16 ` [PATCH 13/15] chunk-format: create chunk reading API Derrick Stolee via GitGitGadget
2020-12-03 22:17   ` Junio C Hamano
2020-12-04 13:47     ` Derrick Stolee
2020-12-04 20:17       ` Junio C Hamano
2020-12-03 22:43   ` Junio C Hamano
2020-12-04 13:45     ` Derrick Stolee
2020-12-03 16:16 ` [PATCH 14/15] commit-graph: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2020-12-07 13:43   ` Derrick Stolee
2020-12-03 16:16 ` [PATCH 15/15] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2020-12-04 12:48 ` [PATCH 00/15] Refactor chunk-format into an API René Scharfe
2020-12-04 13:57   ` Derrick Stolee
2020-12-04 19:42   ` Junio C Hamano
2020-12-08 18:49   ` Taylor Blau
2020-12-09 17:13     ` René Scharfe
2020-12-10  0:50       ` Taylor Blau
2020-12-10 14:30         ` Derrick Stolee
