[PATCH v3 00/17] Refactor chunk-format into an API - Derrick Stolee via GitGitGadget


From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: me@ttaylorr.com, gitster@pobox.com, l.s.r@web.de,
	szeder.dev@gmail.com, Chris Torek <chris.torek@gmail.com>,
	Derrick Stolee <stolee@gmail.com>,
	Derrick Stolee <derrickstolee@github.com>
Subject: [PATCH v3 00/17] Refactor chunk-format into an API
Date: Fri, 05 Feb 2021 14:30:35 +0000
Message-ID: <pull.848.v3.git.1612535452.gitgitgadget@gmail.com>
In-Reply-To: <pull.848.v2.git.1611759716.gitgitgadget@gmail.com>

This is a restart on the topic previously submitted [1] but dropped because
ak/corrected-commit-date was still in progress. This version is based on
that branch.

[1]
https://lore.kernel.org/git/pull.804.git.1607012215.gitgitgadget@gmail.com/

This version also changes the approach to use a more dynamic interaction
with a struct chunkfile pointer. This idea is credited to Taylor Blau [2],
but I started again from scratch. I also go further and make struct chunkfile
opaque to API consumers: it is defined only in chunk-format.c, which
should deter future users from interacting with that data
directly.
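
To illustrate the idea of hiding the struct definition from consumers,
here is a minimal single-file sketch of the pattern. It is not the code
from chunk-format.c; every demo_* name is hypothetical. In a real split,
the first part would live in the header and the second part only in the
.c file, so callers could hold a pointer but never touch the fields.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* --- What a header would expose (hypothetical names) --- */
struct demo_chunkfile;	/* incomplete type: callers only see a pointer */

struct demo_chunkfile *demo_init(void);
int demo_add(struct demo_chunkfile *cf, uint32_t id, size_t size);
size_t demo_count(const struct demo_chunkfile *cf);
void demo_free(struct demo_chunkfile *cf);

/* --- What only the implementation file would contain --- */
struct demo_chunk {
	uint32_t id;
	size_t size;
};

struct demo_chunkfile {
	struct demo_chunk *chunks;
	size_t nr, alloc;
};

struct demo_chunkfile *demo_init(void)
{
	/* calloc() zeroes the fields, so the list starts empty */
	return calloc(1, sizeof(struct demo_chunkfile));
}

int demo_add(struct demo_chunkfile *cf, uint32_t id, size_t size)
{
	if (cf->nr == cf->alloc) {
		/* grow geometrically; keep the old array if realloc() fails */
		size_t new_alloc = cf->alloc ? cf->alloc * 2 : 4;
		struct demo_chunk *grown =
			realloc(cf->chunks, new_alloc * sizeof(*grown));
		if (!grown)
			return -1;
		cf->chunks = grown;
		cf->alloc = new_alloc;
	}
	cf->chunks[cf->nr].id = id;
	cf->chunks[cf->nr].size = size;
	cf->nr++;
	return 0;
}

size_t demo_count(const struct demo_chunkfile *cf)
{
	return cf->nr;
}

void demo_free(struct demo_chunkfile *cf)
{
	if (!cf)
		return;
	free(cf->chunks);
	free(cf);
}
```

With this layout, a consumer that includes only the declarations cannot
write cf->nr directly; the compiler rejects it because the type is
incomplete there.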

[2] https://lore.kernel.org/git/X8%2FI%2FRzXZksio+ri@nand.local/

This combined API reduces duplicated logic and, more importantly,
ensures that similar file formats have similar protections against bad data.
The multi-pack-index code did not have as many guards as the commit-graph
code did, but now they both share a common base that checks for things like
duplicate chunks or offsets outside the size of the file.
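
As a rough illustration of the kind of shared guard described above, the
sketch below validates an in-memory table of contents. It is an
assumption-laden example, not the checks from chunk-format.c: the
toc_entry layout, the status codes, and check_toc() itself are all
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parsed table-of-contents entry */
struct toc_entry {
	uint32_t id;
	uint64_t offset;
};

enum toc_status {
	TOC_OK = 0,
	TOC_DUPLICATE = -1,
	TOC_OUT_OF_BOUNDS = -2,
	TOC_NOT_ASCENDING = -3
};

/*
 * Validate 'nr' entries. 'end_offset' is the offset recorded by the
 * terminating entry, i.e. where the last chunk ends.
 */
int check_toc(const struct toc_entry *entries, size_t nr,
	      uint64_t end_offset, uint64_t file_size)
{
	for (size_t i = 0; i < nr; i++) {
		uint64_t next = (i + 1 < nr) ? entries[i + 1].offset
					     : end_offset;

		/* each chunk must end at or after the point where it starts */
		if (entries[i].offset > next)
			return TOC_NOT_ASCENDING;

		/* no chunk may extend past the end of the file */
		if (next > file_size)
			return TOC_OUT_OF_BOUNDS;

		/* quadratic scan is fine for a handful of chunks */
		for (size_t j = 0; j < i; j++)
			if (entries[j].id == entries[i].id)
				return TOC_DUPLICATE;
	}
	return TOC_OK;
}
```

Because every reader goes through one function like this, a format that
previously skipped a check picks it up without any format-specific code.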

Here are some stats for the end-to-end change:

 * 570 insertions(+), 456 deletions(-).
 * commit-graph.c: 107 insertions(+), 192 deletions(-)
 * midx.c: 164 insertions(+), 260 deletions(-)

While the overall code size increases, the consumers do get smaller.
Boilerplate, such as adapting methods to match the chunk_write_fn and
chunk_read_fn signatures, makes up many of these insertions. The
"interesting" code gets a lot smaller and cleaner.


Updates in V3
=============

 * API methods use better types and changed their order to match internal
   data more closely.

 * Use hashfile_total() instead of internal data values.

 * The implementation of pair_chunk() uses read_chunk().

 * init_chunkfile() has an in-code doc comment warning against using the
   same struct chunkfile for reads and writes.

 * More multiplications are correctly cast in midx.c.

 * The chunk-format technical docs are expanded.
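
The relationship between pair_chunk() and read_chunk() noted in the list
above can be sketched in isolation. The example below mirrors the shape
shown in the range-diff but runs against a hypothetical in-memory table;
the demo_* names and DEMO_CHUNK_NOT_FOUND are assumptions for
illustration, not the API from chunk-format.h.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_CHUNK_NOT_FOUND (-2)	/* hypothetical status code */

typedef int (*demo_read_fn)(const unsigned char *start,
			    size_t size, void *data);

struct demo_chunk {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

struct demo_table {
	const struct demo_chunk *chunks;
	size_t nr;
};

/* Find the chunk by id and hand its location to the callback. */
int demo_read_chunk(const struct demo_table *t, uint32_t id,
		    demo_read_fn fn, void *data)
{
	for (size_t i = 0; i < t->nr; i++)
		if (t->chunks[i].id == id)
			return fn(t->chunks[i].start, t->chunks[i].size, data);
	return DEMO_CHUNK_NOT_FOUND;
}

/* Callback that only records where the chunk starts. */
static int demo_pair_fn(const unsigned char *start, size_t size, void *data)
{
	const unsigned char **p = data;
	(void)size;	/* the size is ignored when only pairing a pointer */
	*p = start;
	return 0;
}

/* The pointer-assigning variant is a thin wrapper over the callback one. */
int demo_pair_chunk(const struct demo_table *t, uint32_t id,
		    const unsigned char **p)
{
	return demo_read_chunk(t, id, demo_pair_fn, p);
}
```

Only one lookup loop exists in this arrangement, so any check added to
the callback path also covers callers of the pointer-assigning variant.
When the chunk is missing, the callback never runs and the caller's
pointer is left unmodified.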


Updates in V2
=============

 * The method pair_chunk() now automatically sets a pointer while
   read_chunk() uses the callback. This greatly reduces the code size.

 * Pointer casts are now implicit instead of explicit.

 * Extra care is taken to not overflow when verifying chunk sizes on write.
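
The overflow concern can be shown with a small standalone sketch. The
function names below are hypothetical, and the wrapping behavior in the
first function assumes a typical platform where int is 32 bits wide, so
that two uint32_t operands are multiplied in 32-bit unsigned arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Both operands are 32-bit, so the product wraps modulo 2^32 before
 * it is widened to the 64-bit return type.
 */
uint64_t chunk_size_narrow(uint32_t nr, uint32_t width)
{
	return nr * width;
}

/* Casting one operand first makes the multiplication itself 64-bit. */
uint64_t chunk_size_wide(uint32_t nr, uint32_t width)
{
	return (uint64_t)nr * width;
}

/*
 * Store a * b in *out and return 0, or return -1 without touching
 * *out if the product does not fit in size_t.
 */
int checked_mul(size_t a, size_t b, size_t *out)
{
	if (b && a > SIZE_MAX / b)
		return -1;
	*out = a * b;
	return 0;
}
```

For example, 0x20000000 entries of width 8 need exactly 2^32 bytes: the
narrow version reports 0 on such a platform, while the wide version
reports the full size.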

Thanks, -Stolee

Derrick Stolee (17):
  commit-graph: anonymize data in chunk_write_fn
  chunk-format: create chunk format write API
  commit-graph: use chunk-format write API
  midx: rename pack_info to write_midx_context
  midx: use context in write_midx_pack_names()
  midx: add entries to write_midx_context
  midx: add pack_perm to write_midx_context
  midx: add num_large_offsets to write_midx_context
  midx: return success/failure in chunk write methods
  midx: drop chunk progress during write
  midx: use chunk-format API in write_midx_internal()
  chunk-format: create read chunk API
  commit-graph: use chunk-format read API
  midx: use chunk-format read API
  midx: use 64-bit multiplication for chunk sizes
  chunk-format: restore duplicate chunk checks
  chunk-format: add technical docs

 Documentation/technical/chunk-format.txt      | 116 +++++
 .../technical/commit-graph-format.txt         |   3 +
 Documentation/technical/pack-format.txt       |   3 +
 Makefile                                      |   1 +
 chunk-format.c                                | 180 ++++++++
 chunk-format.h                                |  65 +++
 commit-graph.c                                | 299 +++++-------
 midx.c                                        | 431 +++++++-----------
 t/t5318-commit-graph.sh                       |   2 +-
 t/t5319-multi-pack-index.sh                   |   6 +-
 10 files changed, 648 insertions(+), 458 deletions(-)
 create mode 100644 Documentation/technical/chunk-format.txt
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h


base-commit: 5a3b130cad0d5c770f766e3af6d32b41766374c0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-848%2Fderrickstolee%2Fchunk-format%2Frefactor-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-848/derrickstolee/chunk-format/refactor-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/848

Range-diff vs v2:

  1:  243dcec94368 =  1:  243dcec94368 commit-graph: anonymize data in chunk_write_fn
  2:  814512f21671 !  2:  16c37d2370cf chunk-format: create chunk format write API
     @@ Commit message
           5. free the chunkfile struct using free_chunkfile().
      
          Helped-by: Taylor Blau <me@ttaylorr.com>
     +    Helped-by: Junio C Hamano <gitster@pobox.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## Makefile ##
     @@ chunk-format.c (new)
      +}
      +
      +void add_chunk(struct chunkfile *cf,
     -+	       uint64_t id,
     -+	       chunk_write_fn fn,
     -+	       size_t size)
     ++	       uint32_t id,
     ++	       size_t size,
     ++	       chunk_write_fn fn)
      +{
      +	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
      +
     @@ chunk-format.c (new)
      +int write_chunkfile(struct chunkfile *cf, void *data)
      +{
      +	int i;
     -+	size_t cur_offset = cf->f->offset + cf->f->total;
     ++	uint64_t cur_offset = hashfile_total(cf->f);
      +
      +	/* Add the table of contents to the current offset */
      +	cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;
     @@ chunk-format.c (new)
      +	hashwrite_be64(cf->f, cur_offset);
      +
      +	for (i = 0; i < cf->chunks_nr; i++) {
     -+		uint64_t start_offset = cf->f->total + cf->f->offset;
     ++		off_t start_offset = hashfile_total(cf->f);
      +		int result = cf->chunks[i].write_fn(cf->f, data);
      +
      +		if (result)
      +			return result;
      +
     -+		if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size)
     ++		if (hashfile_total(cf->f) - start_offset != cf->chunks[i].size)
      +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
      +			    cf->chunks[i].size, cf->chunks[i].id,
     -+			    cf->f->total + cf->f->offset - start_offset);
     ++			    hashfile_total(cf->f) - start_offset);
      +	}
      +
      +	return 0;
     @@ chunk-format.h (new)
      +struct chunkfile *init_chunkfile(struct hashfile *f);
      +void free_chunkfile(struct chunkfile *cf);
      +int get_num_chunks(struct chunkfile *cf);
     -+typedef int (*chunk_write_fn)(struct hashfile *f,
     -+			      void *data);
     ++typedef int (*chunk_write_fn)(struct hashfile *f, void *data);
      +void add_chunk(struct chunkfile *cf,
     -+	       uint64_t id,
     -+	       chunk_write_fn fn,
     -+	       size_t size);
     ++	       uint32_t id,
     ++	       size_t size,
     ++	       chunk_write_fn fn);
      +int write_chunkfile(struct chunkfile *cf, void *data);
      +
      +#endif
  3:  70af6e3083f4 !  3:  e549e24d79af commit-graph: use chunk-format write API
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
      -	chunks[2].write_fn = write_graph_chunk_data;
      +	cf = init_chunkfile(f);
      +
     -+	add_chunk(cf, GRAPH_CHUNKID_OIDFANOUT,
     -+		  write_graph_chunk_fanout, GRAPH_FANOUT_SIZE);
     -+	add_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP,
     -+		  write_graph_chunk_oids, hashsz * ctx->commits.nr);
     -+	add_chunk(cf, GRAPH_CHUNKID_DATA,
     -+		  write_graph_chunk_data, (hashsz + 16) * ctx->commits.nr);
     ++	add_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, GRAPH_FANOUT_SIZE,
     ++		  write_graph_chunk_fanout);
     ++	add_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, hashsz * ctx->commits.nr,
     ++		  write_graph_chunk_oids);
     ++	add_chunk(cf, GRAPH_CHUNKID_DATA, (hashsz + 16) * ctx->commits.nr,
     ++		  write_graph_chunk_data);
       
       	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
       		ctx->write_generation_data = 0;
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
      -	}
      +	if (ctx->write_generation_data)
      +		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
     -+			  write_graph_chunk_generation_data,
     -+			  sizeof(uint32_t) * ctx->commits.nr);
     ++			  sizeof(uint32_t) * ctx->commits.nr,
     ++			  write_graph_chunk_generation_data);
      +	if (ctx->num_generation_data_overflows)
      +		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
     -+			  write_graph_chunk_generation_data_overflow,
     -+			  sizeof(timestamp_t) * ctx->num_generation_data_overflows);
     ++			  sizeof(timestamp_t) * ctx->num_generation_data_overflows,
     ++			  write_graph_chunk_generation_data_overflow);
      +	if (ctx->num_extra_edges)
      +		add_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES,
     -+			  write_graph_chunk_extra_edges,
     -+			  4 * ctx->num_extra_edges);
     ++			  4 * ctx->num_extra_edges,
     ++			  write_graph_chunk_extra_edges);
       	if (ctx->changed_paths) {
      -		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
      -		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
      -	chunks[num_chunks].id = 0;
      -	chunks[num_chunks].size = 0;
      +		add_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
     -+			  write_graph_chunk_bloom_indexes,
     -+			  sizeof(uint32_t) * ctx->commits.nr);
     ++			  sizeof(uint32_t) * ctx->commits.nr,
     ++			  write_graph_chunk_bloom_indexes);
      +		add_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
     -+			  write_graph_chunk_bloom_data,
      +			  sizeof(uint32_t) * 3
     -+				+ ctx->total_bloom_filter_data_size);
     ++				+ ctx->total_bloom_filter_data_size,
     ++			  write_graph_chunk_bloom_data);
      +	}
      +	if (ctx->num_commit_graphs_after > 1)
      +		add_chunk(cf, GRAPH_CHUNKID_BASE,
     -+			  write_graph_chunk_base,
     -+			  hashsz * (ctx->num_commit_graphs_after - 1));
     ++			  hashsz * (ctx->num_commit_graphs_after - 1),
     ++			  write_graph_chunk_base);
       
       	hashwrite_be32(f, GRAPH_SIGNATURE);
       
  4:  0cac7890bed7 =  4:  66ff49ed9309 midx: rename pack_info to write_midx_context
  5:  4a4e90b129ae =  5:  1d7484c0cffa midx: use context in write_midx_pack_names()
  6:  30ad423997b7 =  6:  ea0e7d40e537 midx: add entries to write_midx_context
  7:  2f1c496f3ab5 =  7:  b283a38fb775 midx: add pack_perm to write_midx_context
  8:  c4939548e51c =  8:  e7064512ab7f midx: add num_large_offsets to write_midx_context
  9:  b3cc73c22567 !  9:  7aa3242e15b7 midx: return success/failure in chunk write methods
     @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
       	stop_progress(&progress);
       
      -	if (written != chunk_offsets[num_chunks])
     -+	if (f->total + f->offset != chunk_offsets[num_chunks])
     ++	if (hashfile_total(f) != chunk_offsets[num_chunks])
       		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
      -		    written,
     -+		    f->total + f->offset,
     ++		    hashfile_total(f),
       		    chunk_offsets[num_chunks]);
       
       	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 10:  78744d3b7016 ! 10:  70f68c95e479 midx: drop chunk progress during write
     @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
       	}
      -	stop_progress(&progress);
       
     - 	if (f->total + f->offset != chunk_offsets[num_chunks])
     + 	if (hashfile_total(f) != chunk_offsets[num_chunks])
       		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
 11:  07dc0cf8c683 ! 11:  787cd7f18d2e midx: use chunk-format API in write_midx_internal()
     @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
      -			case MIDX_CHUNKID_PACKNAMES:
      -				write_midx_pack_names(f, &ctx);
      -				break;
     -+	add_chunk(cf, MIDX_CHUNKID_PACKNAMES,
     -+		  write_midx_pack_names, pack_name_concat_len);
     -+	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
     -+		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
     ++	add_chunk(cf, MIDX_CHUNKID_PACKNAMES, pack_name_concat_len,
     ++		  write_midx_pack_names);
     ++	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT, MIDX_CHUNK_FANOUT_SIZE,
     ++		  write_midx_oid_fanout);
      +	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
     -+		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
     ++		  ctx.entries_nr * the_hash_algo->rawsz,
     ++		  write_midx_oid_lookup);
      +	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
     -+		  write_midx_object_offsets,
     -+		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
     ++		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
     ++		  write_midx_object_offsets);
       
      -			case MIDX_CHUNKID_OIDFANOUT:
      -				write_midx_oid_fanout(f, &ctx);
     @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
      -	}
      +	if (ctx.large_offsets_needed)
      +		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
     -+			write_midx_large_offsets,
     -+			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
     ++			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
     ++			write_midx_large_offsets);
       
     --	if (f->total + f->offset != chunk_offsets[num_chunks])
     +-	if (hashfile_total(f) != chunk_offsets[num_chunks])
      -		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
     --		    f->total + f->offset,
     +-		    hashfile_total(f),
      -		    chunk_offsets[num_chunks]);
      +	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
      +	write_chunkfile(cf, &ctx);
 12:  d8d8e9e2aa3f ! 12:  366eb2afee83 chunk-format: create read chunk API
     @@ Commit message
          read. If the same struct instance was used for both reads and writes,
          then there would be failures.
      
     +    Helped-by: Junio C Hamano <gitster@pobox.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## chunk-format.c ##
     @@ chunk-format.c: int write_chunkfile(struct chunkfile *cf, void *data)
      +	return 0;
      +}
      +
     ++static int pair_chunk_fn(const unsigned char *chunk_start,
     ++			 size_t chunk_size,
     ++			 void *data)
     ++{
     ++	const unsigned char **p = data;
     ++	*p = chunk_start;
     ++	return 0;
     ++}
     ++
      +int pair_chunk(struct chunkfile *cf,
      +	       uint32_t chunk_id,
      +	       const unsigned char **p)
      +{
     -+	int i;
     -+
     -+	for (i = 0; i < cf->chunks_nr; i++) {
     -+		if (cf->chunks[i].id == chunk_id) {
     -+			*p = cf->chunks[i].start;
     -+			return 0;
     -+		}
     -+	}
     -+
     -+	return CHUNK_NOT_FOUND;
     ++	return read_chunk(cf, chunk_id, pair_chunk_fn, p);
      +}
      +
      +int read_chunk(struct chunkfile *cf,
     @@ chunk-format.c: int write_chunkfile(struct chunkfile *cf, void *data)
      +}
      
       ## chunk-format.h ##
     +@@
     + struct hashfile;
     + struct chunkfile;
     + 
     ++/*
     ++ * Initialize a 'struct chunkfile' for writing _or_ reading a file
     ++ * with the chunk format.
     ++ *
     ++ * If writing a file, supply a non-NULL 'struct hashfile *' that will
     ++ * be used to write.
     ++ *
     ++ * If reading a file, then supply the memory-mapped data to the
     ++ * pair_chunk() or read_chunk() methods, as appropriate.
     ++ *
     ++ * DO NOT MIX THESE MODES. Use different 'struct chunkfile' instances
     ++ * for reading and writing.
     ++ */
     + struct chunkfile *init_chunkfile(struct hashfile *f);
     + void free_chunkfile(struct chunkfile *cf);
     + int get_num_chunks(struct chunkfile *cf);
      @@ chunk-format.h: void add_chunk(struct chunkfile *cf,
     - 	       size_t size);
     + 	       chunk_write_fn fn);
       int write_chunkfile(struct chunkfile *cf, void *data);
       
      +int read_table_of_contents(struct chunkfile *cf,
 13:  8744d2785965 = 13:  7838ad32e2e0 commit-graph: use chunk-format read API
 14:  750c03253c95 ! 14:  6bddd9e63b9b midx: use chunk-format read API
     @@ midx.c: struct multi_pack_index *load_multi_pack_index(const char *object_dir, i
       	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
       
       	m->pack_names = xcalloc(m->num_packs, sizeof(*m->pack_names));
     +@@ midx.c: struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
     + cleanup_fail:
     + 	free(m);
     + 	free(midx_name);
     ++	free(cf);
     + 	if (midx_map)
     + 		munmap(midx_map, midx_size);
     + 	if (0 <= fd)
      
       ## t/t5319-multi-pack-index.sh ##
      @@ t/t5319-multi-pack-index.sh: test_expect_success 'verify bad OID version' '
 15:  83d292532a0f ! 15:  3cd97f389f1f midx: use 64-bit multiplication for chunk sizes
     @@ Commit message
          multiplication always. This allows us to properly predict the chunk
          sizes without risk of overflow.
      
     +    Other possible overflows were discovered by evaluating each
     +    multiplication in midx.c and ensuring that at least one side of the
     +    operator was of type size_t or off_t.
     +
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## midx.c ##
     +@@ midx.c: static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
     + 	const unsigned char *offset_data;
     + 	uint32_t offset32;
     + 
     +-	offset_data = m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH;
     ++	offset_data = m->chunk_object_offsets + (off_t)pos * MIDX_CHUNK_OFFSET_WIDTH;
     + 	offset32 = get_be32(offset_data + sizeof(uint32_t));
     + 
     + 	if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) {
     +@@ midx.c: static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
     + 
     + static uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos)
     + {
     +-	return get_be32(m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH);
     ++	return get_be32(m->chunk_object_offsets +
     ++			(off_t)pos * MIDX_CHUNK_OFFSET_WIDTH);
     + }
     + 
     + static int nth_midxed_pack_entry(struct repository *r,
      @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
     - 	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
     - 		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
     + 	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT, MIDX_CHUNK_FANOUT_SIZE,
     + 		  write_midx_oid_fanout);
       	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
     --		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
     -+		  write_midx_oid_lookup, (uint64_t)ctx.entries_nr * the_hash_algo->rawsz);
     +-		  ctx.entries_nr * the_hash_algo->rawsz,
     ++		  (size_t)ctx.entries_nr * the_hash_algo->rawsz,
     + 		  write_midx_oid_lookup);
       	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
     - 		  write_midx_object_offsets,
     - 		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
     -@@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
     +-		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
     ++		  (size_t)ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
     + 		  write_midx_object_offsets);
     + 
       	if (ctx.large_offsets_needed)
       		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
     - 			write_midx_large_offsets,
     --			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
     -+			(uint64_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
     +-			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
     ++			(size_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
     + 			write_midx_large_offsets);
       
       	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
     - 	write_chunkfile(cf, &ctx);
 16:  669eeec707ab ! 16:  b9a1bddf615f chunk-format: restore duplicate chunk checks
     @@ Commit message
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## chunk-format.c ##
     -@@ chunk-format.c: struct chunk_info {
     - 	chunk_write_fn write_fn;
     - 
     - 	const void *start;
     -+	unsigned found:1;
     - };
     - 
     - struct chunkfile {
      @@ chunk-format.c: int read_table_of_contents(struct chunkfile *cf,
       			   uint64_t toc_offset,
       			   int toc_length)
 17:  8f3985ab5df3 ! 17:  4c7d751f1e39 chunk-format: add technical docs
     @@ Documentation/technical/chunk-format.txt (new)
      +
      +Functions for working with chunk-based file formats are declared in
      +`chunk-format.h`. Using these methods provide extra checks that assist
     -+developers when creating new file formats, including:
     ++developers when creating new file formats.
      +
     -+ 1. Writing and reading the table of contents.
     ++Writing chunk-based file formats
     ++--------------------------------
      +
     -+ 2. Verifying that the data written in a chunk matches the expected size
     -+    that was recorded in the table of contents.
     ++To write a chunk-based file format, create a `struct chunkfile` by
     ++calling `init_chunkfile()` and pass a `struct hashfile` pointer. The
     ++caller is responsible for opening the `hashfile` and writing header
     ++information so the file format is identifiable before the chunk-based
     ++format begins.
      +
     -+ 3. Checking that a table of contents describes offsets properly within
     -+    the file boundaries.
     ++Then, call `add_chunk()` for each chunk that is intended for write. This
     ++populates the `chunkfile` with information about the order and size of
     ++each chunk to write. Provide a `chunk_write_fn` function pointer to
     ++perform the write of the chunk data upon request.
     ++
     ++Call `write_chunkfile()` to write the table of contents to the `hashfile`
     ++followed by each of the chunks. This will verify that each chunk wrote
     ++the expected amount of data so the table of contents is correct.
     ++
     ++Finally, call `free_chunkfile()` to clear the `struct chunkfile` data. The
     ++caller is responsible for finalizing the `hashfile` by writing the trailing
     ++hash and closing the file.
     ++
     ++Reading chunk-based file formats
     ++--------------------------------
     ++
     ++To read a chunk-based file format, the file must be opened as a
     ++memory-mapped region. The chunk-format API expects that the entire file
     ++is mapped as a contiguous memory region.
     ++
     ++Initialize a `struct chunkfile` pointer with `init_chunkfile(NULL)`.
     ++
     ++After reading the header information from the beginning of the file,
     ++including the chunk count, call `read_table_of_contents()` to populate
     ++the `struct chunkfile` with the list of chunks, their offsets, and their
     ++sizes.
     ++
     ++Extract the data information for each chunk using `pair_chunk()` or
     ++`read_chunk()`:
     ++
     ++* `pair_chunk()` assigns a given pointer with the location inside the
     ++  memory-mapped file corresponding to that chunk's offset. If the chunk
     ++  does not exist, then the pointer is not modified.
     ++
     ++* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
     ++  with the appropriate initial pointer and size information. The function
     ++  is not called if the chunk does not exist. Use this method to read chunks
     ++  if you need to perform immediate parsing or if you need to execute logic
     ++  based on the size of the chunk.
     ++
     ++After calling these methods, call `free_chunkfile()` to clear the
     ++`struct chunkfile` data. This will not close the memory-mapped region.
     ++Callers are expected to own that data for the timeframe the pointers into
     ++the region are needed.
     ++
     ++Examples
     ++--------
     ++
     ++These file formats use the chunk-format API, and can be used as examples
     ++for future formats:
     ++
     ++* *commit-graph:* see `write_commit_graph_file()` and `parse_commit_graph()`
     ++  in `commit-graph.c` for how the chunk-format API is used to write and
     ++  parse the commit-graph file format documented in
     ++  link:technical/commit-graph-format.html[the commit-graph file format].
     ++
     ++* *multi-pack-index:* see `write_midx_internal()` and `load_multi_pack_index()`
     ++  in `midx.c` for how the chunk-format API is used to write and
     ++  parse the multi-pack-index file format documented in
     ++  link:technical/pack-format.html[the multi-pack-index file format].
      
       ## Documentation/technical/commit-graph-format.txt ##
      @@ Documentation/technical/commit-graph-format.txt: CHUNK LOOKUP:

-- 
gitgitgadget

2021-02-18 14:07       ` [PATCH v4 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 14/17] midx: " Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-02-18 21:47         ` Junio C Hamano
2021-02-19 12:42           ` Derrick Stolee
