git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Junio C Hamano <gitster@pobox.com>,
	Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, me@ttaylorr.com, l.s.r@web.de,
	szeder.dev@gmail.com, Chris Torek <chris.torek@gmail.com>,
	Derrick Stolee <derrickstolee@github.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH v2 12/17] chunk-format: create read chunk API
Date: Fri, 5 Feb 2021 07:19:52 -0500	[thread overview]
Message-ID: <1278de82-417c-a6ee-a5fe-055fa0ef1903@gmail.com> (raw)
In-Reply-To: <xmqqczxf4d2k.fsf@gitster.c.googlers.com>

On 2/4/2021 6:40 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> We are re-using the anonymous 'struct chunkfile' data, as it is internal
>> to the chunk-format API. This gives it essentially two modes: write and
>> read. If the same struct instance was used for both reads and writes,
>> then there would be failures.
> 
> Writing it here won't help future developers very much.  I think
> that belongs to API/function doc for init_chunkfile().

I will add a clear comment there. Thanks.

>> +int read_table_of_contents(struct chunkfile *cf,
>> +			   const unsigned char *mfile,
>> +			   size_t mfile_size,
>> +			   uint64_t toc_offset,
>> +			   int toc_length)
> 
> It's a bit of mystery, having seen how the table-of-contents is laid
> out by reading the writing side of the code, how toc_offset and
> toc_length are discovered by the caller.  IIRC, the size to cover
> everything from the beginning of the file to the end of the
> table-of-contents was recorded as the length of a non-existent chunk
> with id 0, but we need to be able to somehow find it to use that as
> a way to get to the (end of) table-of-contents from the beginning of
> the file.   I guess we'll learn enough when we get to read the code
> that calls this function.

The existing formats have a byte in their header specifying how
many chunks are in the table of contents. That's how this information
is known in advance.

If we want to instead rely on the terminating chunk id with value 0,
then this method could be modified in the future. It does complicate
the allocation of cf->chunks slightly.

>> +	chunk_id = get_be32(table_of_contents);
>> +	if (chunk_id) {
>> +		error(_("final chunk has non-zero id %"PRIx32""), chunk_id);
>> +		return -1;
>> +	}
> 
> Shouldn't we be validating the size component associated with this
> "id=0" fake chunk that appears at the end as well?  IIRC, the size
> field should be pointing at the byte immediately after the TOC entry
> for the last true chunk, immediately before this zero chunk id?

During the loop, we scanned ahead to find the offset of the
terminating chunk and compute the size of the last "real" chunk.

Any size validation here would be to check that the offset points
_exactly_ to the terminating hash, but that might be too restrictive
on the format. (Who knows if there is a legit reason to have non-
chunked data between the chunked data and the trailing hash?)

As I re-roll, I plan to skip this final check. It would be easy to
add it in a forward fix, thought.

>> +int pair_chunk(struct chunkfile *cf,
>> +	       uint32_t chunk_id,
>> +	       const unsigned char **p)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < cf->chunks_nr; i++) {
>> +		if (cf->chunks[i].id == chunk_id) {
>> +			*p = cf->chunks[i].start;
>> +			return 0;
>> +		}
> 
> OK, the assumption here is that there will be at most one chunk that
> has the chunk_id we seek to find (or putting it differently, second
> and subsequent chunks with the same ID are ignored).  We may want to
> write it down somewhere.

This is enforced by a later patch, as you discovered.

>> +	for (i = 0; i < cf->chunks_nr; i++) {
>> +		if (cf->chunks[i].id == chunk_id)
>> +			return fn(cf->chunks[i].start, cf->chunks[i].size, data);
> 
> It is curious why pair_chunk() exists in the first place.  With
> something like this:
> 
>         int pair_chunk_fn(const unsigned char *chunk_start,
>                           size_t chunk_size,
>                           void *data)
>         {
>                 const unsigned char **p = data;
>                 *p = chunk_start;
>                 return 0;
>         }
> 
> instead of
> 
> 	const unsigned char *location;
> 
> 	pair_chunk(cf, chunk_id, &location);
> 
> we can write
> 
> 	const unsigned char *location;
> 
> 	read_chunk(cf, chunk_id, pair_chunk_fn, &location);
> 
> no?  That would allow us to reorganize the in-core TOC more easily
> if it turns out to be necessary in the future.

I like this, but why not just use pair_chunk_fn inside of
the implementation of pair_chunk() so callers have an easy
interface. This results in the implementation:

static int pair_chunk_fn(const unsigned char *chunk_start,
			 size_t chunk_size,
			 void *data)
{
	const unsigned char **p = data;
	*p = chunk_start;
	return 0;
}

int pair_chunk(struct chunkfile *cf,
	       uint32_t chunk_id,
	       const unsigned char **p)
{
	return read_chunk(cf, chunk_id, pair_chunk_fn, p);
}

>> +typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
>> +			     size_t chunk_size, void *data);
> 
> Is the assumption throughout the chunked file API that reading is
> actually not done as a "seek+read+process", but as a normal memory
> access over mmapped region?  The reason I ask is because the answer
> affects the choice of the right type for the offset.  The function
> signature of read_table_of_contents() uses size_t to represent the
> length of the entire mmapped region that holds the data from the
> file, and that is better than off_t, especially if size_t were
> smaller than off_t (i.e. we may not be able to mmap a huge size that
> filesystem can handle and let us access with seek+read).
> 
> But the assumption that the whole mfile can be mmapped in as a whole
> is only in read_table_of_contents(), and users of read_chunk() API
> can be oblivious, I think---IOW, we could "page in" the chunk's data
> in read_chunk() while the callback function works on it in-core, and
> then discard it, if we wanted to change the implementation [*].
> 
> 	Side note: for that to work, the API must say that the
> 	callback function MUST NOT assume that the memory region
> 	starting at chunk_start it is given will stay in memory
> 	after it returns.  Otherwise, we cannot "page in" and "page
> 	out".
> 
> I am not advocating that we should not assume the entire file can be
> mapped in.  I would however advocate to be explicit in documenting
> what the users of API can and cannot do (e.g. if we want the "read"
> callbacks to take advantage of the fact that mfile will stay mapped
> until the chunkfile is discarded, we should say so, so that they will
> not make unnecessary copies out of the mmapped region).

I will add this expectation to the documentation.

>> +/*
>> + * Find 'chunk_id' in the given chunkfile and call the
>> + * given chunk_read_fn method with the information for
>> + * that chunk.
>> + *
>> + * Returns CHUNK_NOT_FOUND if the chunk does not exist.
>> + */
>> +int read_chunk(struct chunkfile *cf,
>> +	       uint32_t chunk_id,
>> +	       chunk_read_fn fn,
>> +	       void *data);
> 
> Did I miss an update to free_chunkfile() to release resources used
> to read this file?  For some reason, unlike the writing side, the
> reading side of this API feels a bit incomplete to me.

free_chunkfile(cf) already frees cf->chunks (and cf itself). No
other resources are allocated during the read. The caller is
responsible for the mmap'd file resource.

The reading side could use more documentation. I will think about
this.

Thanks,
-Stolee

  reply	other threads:[~2021-02-05 12:25 UTC|newest]

Thread overview: 120+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-01-27  1:53   ` Chris Torek
2021-01-27  2:36     ` Taylor Blau
2021-01-26 16:01 ` [PATCH 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-01-27  2:42   ` Taylor Blau
2021-01-27 13:49     ` Derrick Stolee
2021-01-26 16:01 ` [PATCH 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-01-27  2:47   ` Taylor Blau
2021-01-26 16:01 ` [PATCH 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-01-27  2:49   ` Taylor Blau
2021-01-26 16:01 ` [PATCH 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-01-27  3:02   ` Taylor Blau
2021-01-26 16:01 ` [PATCH 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 14/17] midx: " Derrick Stolee via GitGitGadget
2021-01-27  3:06   ` Taylor Blau
2021-01-27 13:50     ` Derrick Stolee
2021-01-26 16:01 ` [PATCH 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-01-26 22:37 ` [PATCH 00/17] Refactor chunk-format into an API Junio C Hamano
2021-01-27  2:29 ` Taylor Blau
2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-02-04 21:24     ` Junio C Hamano
2021-02-04 22:40       ` Junio C Hamano
2021-02-05 11:37       ` Derrick Stolee
2021-02-05 19:25         ` Junio C Hamano
2021-01-27 15:01   ` [PATCH v2 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-02-04 22:59     ` Junio C Hamano
2021-02-05 11:42       ` Derrick Stolee
2021-01-27 15:01   ` [PATCH v2 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-02-04 23:40     ` Junio C Hamano
2021-02-05 12:19       ` Derrick Stolee [this message]
2021-02-05 19:37         ` Junio C Hamano
2021-02-08 22:26           ` Junio C Hamano
2021-02-09  1:33             ` Derrick Stolee
2021-02-09 20:47               ` Junio C Hamano
2021-01-27 15:01   ` [PATCH v2 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 14/17] midx: " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-02-05  0:00     ` Junio C Hamano
2021-02-05 10:59       ` Chris Torek
2021-02-05 20:41         ` Junio C Hamano
2021-02-06 20:35           ` Chris Torek
2021-02-05 12:30       ` Derrick Stolee
2021-02-05 19:42         ` Junio C Hamano
2021-02-07 19:50       ` SZEDER Gábor
2021-02-08  5:41         ` Junio C Hamano
2021-01-27 15:01   ` [PATCH v2 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-02-05  0:05     ` Junio C Hamano
2021-02-05 12:31       ` Derrick Stolee
2021-01-27 15:01   ` [PATCH v2 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-02-05  0:15     ` Junio C Hamano
2021-01-27 16:03   ` [PATCH v2 00/17] Refactor chunk-format into an API Taylor Blau
2021-02-05  2:08   ` Junio C Hamano
2021-02-05  2:27     ` Derrick Stolee
2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-02-07 21:13       ` SZEDER Gábor
2021-02-08 13:44         ` Derrick Stolee
2021-02-11 19:43           ` SZEDER Gábor
2021-02-05 14:30     ` [PATCH v3 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-02-07 20:20       ` SZEDER Gábor
2021-02-08 13:35         ` Derrick Stolee
2021-02-05 14:30     ` [PATCH v3 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 14/17] midx: " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-02-24 16:52         ` SZEDER Gábor
2021-02-24 17:12           ` Taylor Blau
2021-02-24 17:52             ` Derrick Stolee
2021-02-24 19:44               ` Junio C Hamano
2021-02-18 14:07       ` [PATCH v4 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 14/17] midx: " Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-02-18 21:47         ` Junio C Hamano
2021-02-19 12:42           ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1278de82-417c-a6ee-a5fe-055fa0ef1903@gmail.com \
    --to=stolee@gmail.com \
    --cc=chris.torek@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=me@ttaylorr.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).