git@vger.kernel.org mailing list mirror (one of many)
* [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee


Introduction
============

I became interested in our packed-ref format based on the asymmetry between
ref updates and ref deletions: if we delete a packed ref, then the
packed-refs file needs to be rewritten. Compared to writing a loose ref,
this is an O(N) cost instead of O(1).

With this in mind, I set out with some goals:

 * (Primary) Make packed ref deletions be nearly as fast as loose ref
   updates.
 * (Secondary) Allow using a packed ref format for all refs, dropping loose
   refs and creating a clear way to snapshot all refs at a given point in
   time.

I also had one major non-goal to keep things focused:

 * (Non-goal) Update the reflog format.

After carefully considering several options, it seemed that there are two
solutions that can solve this effectively:

 1. Wait for reftable to be integrated into Git.
 2. Update the packed-refs backend to have a stacked version.

The reftable work seems currently dormant. The format is pretty complicated
and I have a difficult time seeing a way forward for it to be fully
integrated into Git. Personally, I'd prefer a more incremental approach with
formats that are built for a basic filesystem. During the process, we can
create APIs within Git that can benefit other file formats within Git.

Further, there is a simpler model that satisfies my primary goal without the
complication required for the secondary goal. Suppose we create a stacked
packed-refs file but only have two layers: the first (base) layer is created
when git pack-refs collapses the full stack and adds the loose ref updates
to the packed-refs file; the second (top) layer contains only ref deletions
(allowing null OIDs to indicate a deleted ref). Then, ref deletions would
only need to rewrite that top layer, making ref deletions take O(deletions)
time instead of O(all refs) time. With a reasonable schedule to squash the
packed-refs stack, this would be a dramatic improvement. (A prototype
implementation showed that updating a layer of 1,000 deletions takes only
twice as long as writing a single loose ref.)
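
As an illustration only (this is not code from the series), a lookup across
such a two-layer stack could be sketched as below. The names ref_entry,
find_in_layer and lookup_ref are hypothetical, the layers are assumed to be
sorted in-memory arrays, and the 20-byte OID assumes SHA-1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OID_RAWSZ 20 /* assumes SHA-1; illustrative only */

/* Hypothetical in-memory view of one entry in a packed-refs layer. */
struct ref_entry {
	const char *name;
	unsigned char oid[OID_RAWSZ];
};

/* An all-zero OID in the top layer marks the ref as deleted. */
static int is_null_oid(const unsigned char *oid)
{
	size_t i;
	for (i = 0; i < OID_RAWSZ; i++)
		if (oid[i])
			return 0;
	return 1;
}

/* Binary search in one layer; entries must be sorted by name. */
static const struct ref_entry *find_in_layer(const struct ref_entry *layer,
					     size_t nr, const char *name)
{
	size_t lo = 0, hi = nr;
	assert(layer || !nr);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, layer[mid].name);
		if (!cmp)
			return &layer[mid];
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return NULL;
}

/*
 * Consult the top (deletions) layer first, then fall back to the base.
 * Returns NULL if the ref is missing or has been deleted.
 */
static const struct ref_entry *lookup_ref(const struct ref_entry *top,
					  size_t top_nr,
					  const struct ref_entry *base,
					  size_t base_nr, const char *name)
{
	const struct ref_entry *e = find_in_layer(top, top_nr, name);
	if (e)
		return is_null_oid(e->oid) ? NULL : e;
	return find_in_layer(base, base_nr, name);
}
```

In this sketch a deletion only adds an entry to the small top layer, which
is why the cost scales with the number of deletions rather than with the
number of refs in the base layer.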

If we want to satisfy the secondary goal of passing all ref updates through
the packed storage, then more complicated layering would be necessary. The
point of bringing this up is that we have incremental goals along the way to
that final state that give us good stopping points to test the benefits of
each step.

Stacking the packed-refs format introduces several interesting strategy
points that are complicated to resolve. Before we can do that, we first need
to establish a way to modify the ref format of a Git repository. Hence, we
need a new extension for the ref formats.

To simplify the first update to the ref formats, it seemed better to add a
new file format version to the existing packed-refs file format. This format
has the exact lock/write/rename mechanics of the current packed-refs format,
but uses a file format that structures the information in a more compact
way. It uses the chunk-format API, with some tweaks. This format update is
useful to the final goal of a stacked packed-refs API, since each layer will
have faster reads and writes. The main reason to do this first is that it is
much simpler to understand the value-add (smaller files mean faster
performance).


RFC Organization
================

This RFC is quite long, but the length seemed necessary to actually provide
an end-to-end implementation that demonstrates the packed-refs v2 format
along with test coverage (via the new GIT_TEST_PACKED_REFS_VERSION
variable).

For convenience, I've broken the full RFC into parts, which resemble how I
intend to submit the pieces for full review. These parts are available as
pull requests in my fork, but here is a breakdown:


Part I: Optionally hash the index
=================================

[1] https://github.com/derrickstolee/git/pull/23 Packed-refs v2 Part I:
Optionally hash the index (Patches 1-2)

The chunk-format API uses the hashfile API as a buffered write, and all
existing formats that use the chunk-format API also have a trailing hash as
part of the format. Since the packed-refs file has a critical path involving
its write speed (deleting a packed ref), it seemed important to allow
apples-to-apples comparison between the v1 and v2 format by skipping the
hashing. This is later toggled by a config option.

In this part, the focus is on allowing the hashfile API to ignore updating
the hash during the buffered writes. We've been using this in microsoft/git
to optionally speed up index writes, which patch 2 introduces here. The file
format instead writes a null OID which would look like a corrupt file to an
older 'git fsck'. Before submitting a full version, I would update 'git
fsck' to ignore a null OID in all of our file formats that include a
trailing hash. Since the index is more short-lived than other formats (such
as pack-files) this trailing hash is less useful. The write time is also
critical as the performance tests demonstrate.


Part II: Create extensions.refFormat
====================================

[2] https://github.com/derrickstolee/git/pull/24 Packed-refs v2 Part II:
create extensions.refFormat (Patches 3-7)

This part is a critical concept that has yet to be defined in the Git
codebase. We have no way to incrementally modify the ref format. Since refs
are so critical, we cannot add an optionally-understood layer on top (like
we did with the multi-pack-index and commit-graph files). The reftable draft
[6] proposes the same extension name (extensions.refFormat) but focuses
instead on only a single value. This means that the reftable must be defined
at git init or git clone time and cannot be upgraded from the files backend.

In this RFC, I propose a different model that allows for more customization
and incremental updates. The extensions.refFormat config key is multi-valued
and defaults to the list of files and packed. In the context of this RFC,
the intention is to be able to add packed-v2 so the list of all three values
would allow Git to write and read either file format version (v1 or v2). In
the larger scheme, the extension could allow restricting to only loose refs
(just files) or only packed-refs (just packed) or even later when reftable
is complete, files and reftable could mean that loose refs are the primary
ref storage, but the reftable format serves as a drop-in replacement for the
packed-refs file. Not all combinations need to be understood by Git, but
having them available as an option could be useful for flexibility,
especially when trying to upgrade existing repositories to new formats.
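
To make the multi-valued model concrete, here is a minimal sketch of
folding the configured values into a set of enabled formats. This is not the
code from the patches: the flag names and both functions are hypothetical,
and rejecting unknown values is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flags; the actual patches may represent this differently. */
enum ref_format_flags {
	REF_FORMAT_FILES     = 1 << 0,
	REF_FORMAT_PACKED    = 1 << 1,
	REF_FORMAT_PACKED_V2 = 1 << 2
};

/* Default when the extension is unset: loose refs plus packed-refs v1. */
#define REF_FORMAT_DEFAULT (REF_FORMAT_FILES | REF_FORMAT_PACKED)

/*
 * Fold one configured value into the mask. Returns 0 on success and -1
 * for an unrecognized value, which this sketch treats as an error.
 */
static int add_ref_format(unsigned int *mask, const char *value)
{
	assert(mask && value);
	if (!strcmp(value, "files"))
		*mask |= REF_FORMAT_FILES;
	else if (!strcmp(value, "packed"))
		*mask |= REF_FORMAT_PACKED;
	else if (!strcmp(value, "packed-v2"))
		*mask |= REF_FORMAT_PACKED_V2;
	else
		return -1;
	return 0;
}

/* Combine every value of the multi-valued key; no values means default. */
static int parse_ref_formats(const char **values, size_t nr,
			     unsigned int *mask)
{
	size_t i;
	*mask = 0;
	for (i = 0; i < nr; i++)
		if (add_ref_format(mask, values[i]) < 0)
			return -1;
	if (!nr)
		*mask = REF_FORMAT_DEFAULT;
	return 0;
}
```

A repository listing files, packed and packed-v2 would end up with all three
bits set, while one listing only packed would have loose refs disabled.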

In the future, beyond the scope of this RFC, it would be good to add a
stacked value that allows a stack of files in packed-refs format (whose
version is specified by the packed or packed-v2 values) so we can further
speed up writes to the packed layer. Depending on how well that works, we
could focus on speeding up ref deletions or sending all ref writes straight
to the packed-refs layer. With the option to keep the loose refs storage, we
have flexibility to explore that space incrementally when we have time to
get to it.


Part III: Allow a trailing table-of-contents in the chunk-format API
====================================================================

[3] https://github.com/derrickstolee/git/pull/25 Packed-refs v2 Part III:
trailing table of contents in chunk-format (Patches 8-17)

In order to optimize the write speed of the packed-refs v2 file format, we
want to write immediately to the file as we stream existing refs from the
current refs. The current chunk-format API requires computing the chunk
lengths in advance, which can slow down the write and take more memory than
necessary. Using a trailing table of contents solves this problem, and was
recommended earlier [7]. We just didn't have enough evidence to justify the
work to update the existing chunk formats. Here, we update the API in
advance of using it in the packed-refs v2 format.
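
The idea can be sketched as follows. This is a simplified in-memory writer,
not the chunk-format API itself: struct toc_writer and its helpers are
hypothetical, and the row layout (4-byte chunk ID, 8-byte offset, plus a
terminating row) is assumed to mirror the existing leading table of
contents and may differ from what the patches write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CHUNKS 8

/* Hypothetical streaming writer; not Git's chunk-format API. */
struct toc_writer {
	unsigned char *buf; /* destination buffer (stands in for the file) */
	size_t cap, len;    /* capacity and bytes written so far */
	uint32_t ids[MAX_CHUNKS];
	uint64_t offsets[MAX_CHUNKS];
	size_t nr;
};

static void put_bytes(struct toc_writer *w, const void *data, size_t n)
{
	assert(w->len + n <= w->cap);
	memcpy(w->buf + w->len, data, n);
	w->len += n;
}

static void put_be32(struct toc_writer *w, uint32_t v)
{
	unsigned char b[4];
	b[0] = (unsigned char)(v >> 24);
	b[1] = (unsigned char)(v >> 16);
	b[2] = (unsigned char)(v >> 8);
	b[3] = (unsigned char)v;
	put_bytes(w, b, 4);
}

static void put_be64(struct toc_writer *w, uint64_t v)
{
	put_be32(w, (uint32_t)(v >> 32));
	put_be32(w, (uint32_t)v);
}

/* Record where a chunk starts; its length need not be known up front. */
static void begin_chunk(struct toc_writer *w, uint32_t id)
{
	assert(w->nr < MAX_CHUNKS);
	w->ids[w->nr] = id;
	w->offsets[w->nr] = w->len;
	w->nr++;
}

/*
 * After all chunk data has been streamed, append one (id, offset) row per
 * chunk, then a terminating row whose offset marks the end of chunk data.
 */
static void write_trailing_toc(struct toc_writer *w)
{
	uint64_t end = w->len;
	size_t i;
	for (i = 0; i < w->nr; i++) {
		put_be32(w, w->ids[i]);
		put_be64(w, w->offsets[i]);
	}
	put_be32(w, 0);
	put_be64(w, end);
}
```

Because offsets are recorded as chunks begin, the writer never needs a
chunk's final size before emitting its first byte.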

We could consider updating the commit-graph and multi-pack-index formats to
use trailing table of contents, but it requires a version bump. That might
be worth it in the case of the commit-graph where computing the size of the
changed-path Bloom filters chunk requires a lot of memory at the moment.
After this chunk-format API update is reviewed and merged, we can pursue
those directions more closely. We would want to investigate the formats more
carefully to see if we want to update the chunks themselves as well as some
header information.


Part IV: Abstract some parts of the v1 file format
==================================================

[4] https://github.com/derrickstolee/git/pull/26 Packed-refs v2 Part IV:
abstract some parts of the v1 file format (Patches 18-21)

These patches move the parts of the refs/packed-backend.c file that deal with
the specifics of the packed-refs v1 file format into a new file:
refs/packed-format-v1.c. This also creates an abstraction layer that will
allow inserting the v2 format more easily.

One thing that doesn't exist currently is a documentation file describing
the packed-refs file format. I would add that file in this part before
submitting it for full review. (I haven't written the file format doc
for the packed-refs v2 format, either.)


Part V: Implement the v2 file format
====================================

[5] https://github.com/derrickstolee/git/pull/27 Packed-refs v2 Part V: the
v2 file format (Patches 22-35)

This is the real meat of the work. Perhaps there are ways to split it
further, but for now this is what I have ready. The very last patch does a
complete performance comparison for a repo with many refs.

The format is not yet documented, but is broken up into these pieces:

 1. The refs data chunk stores the same data as the packed-refs file, but
    each ref is broken down as follows: the ref name (with trailing zero),
    the OID for the ref in its raw bytes, and (if necessary) the peeled OID
    for the ref in its raw bytes. The refs are sorted lexicographically.

 2. The ref offsets chunk is a single column of 64-bit offsets into the refs
    chunk indicating where each ref starts. The most-significant bit of that
    value indicates whether or not there is a peeled OID.

 3. The prefix data chunk lists a set of ref prefixes (currently writes only
    allow depth-2 prefixes, such as refs/heads/ and refs/tags/). When
    present, these prefixes are written in this chunk and not in the refs
    data chunk. The prefixes are sorted lexicographically.

 4. The prefix offset chunk has two 32-bit integer columns. The first column
    stores the offset within the prefix data chunk to the start of the
    prefix string. The second column points to the row position for the
    first ref that has a name greater than this prefix (the 0th prefix is
    assumed to start at row 0, so we can interpret the prefix range from
    row[i-1] and row[i]).
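
For example, one row of the ref offsets chunk could be decoded as in the
sketch below. The struct and function names are hypothetical, and network
byte order is an assumption based on other Git file formats rather than
something stated above.

```c
#include <assert.h>
#include <stdint.h>

#define REF_OFFSET_PEELED_BIT (UINT64_C(1) << 63)

/* Hypothetical decoded form of one row of the ref offsets chunk. */
struct ref_offset {
	uint64_t offset; /* byte position of the ref in the refs data chunk */
	int has_peeled;  /* nonzero if a peeled OID follows the ref's OID */
};

/* Read a 64-bit big-endian (network order) value. */
static uint64_t get_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;
	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Split a raw row into the offset and the most-significant-bit flag. */
static struct ref_offset decode_ref_offset(const unsigned char *row)
{
	uint64_t raw;
	struct ref_offset r;
	assert(row);
	raw = get_be64(row);
	r.offset = raw & ~REF_OFFSET_PEELED_BIT;
	r.has_peeled = !!(raw & REF_OFFSET_PEELED_BIT);
	return r;
}
```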

Between using raw OIDs and storing the depth-2 prefixes only once, this
format compresses the file to ~60% of its v1 size. (The format allows not
writing the prefix chunks, and the prefix chunks are implemented after the
basics of the ref chunks are complete.)
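
A rough per-ref estimate shows where the savings come from. The helpers
below are hypothetical and ignore peeled OIDs, file headers, the prefix
chunks themselves and the table of contents, so they only approximate the
real file sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* v1 stores one text line per ref: "<hex oid> <refname>\n". */
static size_t v1_entry_size(const char *refname, size_t rawsz)
{
	return 2 * rawsz + 1 + strlen(refname) + 1;
}

/*
 * v2 (as described above): name suffix plus NUL, raw OID, and one 64-bit
 * row in the ref offsets chunk. prefix_len is how many leading bytes of
 * the name are stored once in the prefix data chunk instead.
 */
static size_t v2_entry_size(const char *refname, size_t prefix_len,
			    size_t rawsz)
{
	size_t len = strlen(refname);
	assert(prefix_len <= len);
	return (len - prefix_len) + 1 + rawsz + 8;
}
```

For refs/heads/main with a 20-byte SHA-1 OID and the refs/heads/ prefix
factored out, this gives 57 bytes in v1 and 33 bytes in v2, or about 58%,
which is in line with the ~60% figure.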

The write times are reduced by a similar fraction to the size difference.
Reads are sped up somewhat, and we have the potential to do a ref count by
prefix much faster by doing a binary search for the start and end of the
prefix and then subtracting the row positions instead of scanning the file
between to count refs.
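
A minimal sketch of that counting approach follows. It searches a sorted
array of ref names held in memory, which is a simplification: a real reader
would go through the offsets chunk instead, and both function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the first row whose name compares >= prefix (past_prefix == 0)
 * or the first row that sorts after every name starting with prefix
 * (past_prefix == 1). Names must be sorted bytewise.
 */
static size_t bound_row(const char **names, size_t nr,
			const char *prefix, int past_prefix)
{
	size_t plen = strlen(prefix), lo = 0, hi = nr;
	assert(names || !nr);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = strncmp(names[mid], prefix, plen);
		if (cmp < 0 || (past_prefix && !cmp))
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Count refs under a prefix with two binary searches and a subtraction. */
static size_t count_refs_with_prefix(const char **names, size_t nr,
				     const char *prefix)
{
	size_t start = bound_row(names, nr, prefix, 0);
	size_t end = bound_row(names, nr, prefix, 1);
	return end - start;
}
```

The count takes O(log N) name comparisons instead of a scan over every ref
between the two positions.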


Relationship to Reftable
========================

I mentioned earlier that I had considered using reftable as a way to achieve
the stated goals. With the current state of that work, I'm not confident
that it is the right approach here.

My main worry is that the reftable is more complicated than we need for a
typical Git repository that is based on a typical filesystem. This makes
testing the format critical, and we do not seem to be near that level of
testing yet. The v2 format here is very similar to existing Git file formats
since it uses the chunk-format API. This means that the amount of code
custom to just the v2 format is quite small.

As mentioned, the current extension plan [6] only allows reftable or files
and does not allow for a mix of both. This RFC introduces the possibility
that both could co-exist. Using that multi-valued approach means that I'm
able to test the v2 packed-refs file format almost as well as the v1 file
format within this RFC. (More tests need to be added that are specific to
this format, but I'm waiting for confirmation that this is an acceptable
direction.) At the very least, this multi-valued approach could be used as a
way to allow using the reftable format as a drop-in replacement for the
packed-refs file, as well as upgrading an existing repo to use reftable.
That might even help the integration process to allow the reftable format to
be tested at least by some subset of tests instead of waiting for a full
test suite update.

I'm interested to hear from people more involved in the reftable work to see
the status of that project and how it matches or differs from my
perspective.

The one thing I can say is that if the reftable work had not already begun,
then this RFC is how I would have approached a new ref format.

I look forward to your feedback!

Thanks,

-Stolee

[6]
https://github.com/git/git/pull/1215/files#diff-a30f88b458b1f01e7a67e72576584b5b77ddb0362e40da6f7bf4a9ddf79db7b8R41-R48
The draft version of extensions.refFormat for reftable.

[7] https://lore.kernel.org/git/4696bd93-9406-0abd-25ec-a739665a24d5@web.de/
Re: [PATCH 00/15] Refactor chunk-format into an API (where René recommends a
trailing table of contents)

Derrick Stolee (30):
  hashfile: allow skipping the hash function
  read-cache: add index.computeHash config option
  extensions: add refFormat extension
  config: fix multi-level bulleted list
  repository: wire ref extensions to ref backends
  refs: allow loose files without packed-refs
  chunk-format: number of chunks is optional
  chunk-format: document trailing table of contents
  chunk-format: store chunk offset during write
  chunk-format: allow trailing table of contents
  chunk-format: parse trailing table of contents
  refs: extract packfile format to new file
  packed-backend: extract add_write_error()
  packed-backend: extract iterator/updates merge
  packed-backend: create abstraction for writing refs
  config: add config values for packed-refs v2
  packed-backend: create shell of v2 writes
  packed-refs: write file format version 2
  packed-refs: read file format v2
  packed-refs: read optional prefix chunks
  packed-refs: write prefix chunks
  packed-backend: create GIT_TEST_PACKED_REFS_VERSION
  t1409: test with packed-refs v2
  t5312: allow packed-refs v2 format
  t5502: add PACKED_REFS_V1 prerequisite
  t3210: require packed-refs v1 for some tests
  t*: skip packed-refs v2 over http tests
  ci: run GIT_TEST_PACKED_REFS_VERSION=2 in some builds
  p1401: create performance test for ref operations
  refs: skip hashing when writing packed-refs v2

 Documentation/config.txt            |   2 +
 Documentation/config/extensions.txt |  76 ++-
 Documentation/config/index.txt      |   8 +
 Documentation/config/refs.txt       |  13 +
 Documentation/gitformat-chunk.txt   |  26 +-
 Makefile                            |   2 +
 cache.h                             |   2 +
 chunk-format.c                      | 109 +++-
 chunk-format.h                      |  18 +-
 ci/run-build-and-tests.sh           |   1 +
 commit-graph.c                      |   2 +-
 csum-file.c                         |  14 +-
 csum-file.h                         |   7 +
 midx.c                              |   2 +-
 read-cache.c                        |  22 +-
 refs.c                              |  24 +-
 refs/files-backend.c                |   8 +-
 refs/packed-backend.c               | 880 +++++++---------------------
 refs/packed-backend.h               | 281 +++++++++
 refs/packed-format-v1.c             | 456 ++++++++++++++
 refs/packed-format-v2.c             | 624 ++++++++++++++++++++
 refs/refs-internal.h                |   9 +
 repository.c                        |   2 +
 repository.h                        |   7 +
 setup.c                             |  26 +
 t/perf/p1401-ref-operations.sh      |  52 ++
 t/t1409-avoid-packing-refs.sh       |  22 +-
 t/t1600-index.sh                    |   8 +
 t/t3210-pack-refs.sh                |   8 +-
 t/t3212-ref-formats.sh              | 100 ++++
 t/t5502-quickfetch.sh               |   2 +-
 t/t5539-fetch-http-shallow.sh       |   7 +
 t/t5541-http-push-smart.sh          |   7 +
 t/t5542-push-http-shallow.sh        |   7 +
 t/t5551-http-fetch-smart.sh         |   7 +
 t/t5558-clone-bundle-uri.sh         |   7 +
 t/test-lib.sh                       |   4 +
 37 files changed, 2157 insertions(+), 695 deletions(-)
 create mode 100644 Documentation/config/refs.txt
 create mode 100644 refs/packed-format-v1.c
 create mode 100644 refs/packed-format-v2.c
 create mode 100755 t/perf/p1401-ref-operations.sh
 create mode 100755 t/t3212-ref-formats.sh


base-commit: c03801e19cb8ab36e9c0d17ff3d5e0c3b0f24193
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1408%2Fderrickstolee%2Frefs%2Frfc-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1408/derrickstolee/refs/rfc-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1408
-- 
gitgitgadget


* [PATCH 01/30] hashfile: allow skipping the hash function
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The hashfile API is useful for generating files that include a trailing
hash of the file's contents up to that point. Using such a hash is
helpful for verifying the file for corruption-at-rest, such as a faulty
drive causing flipped bits.

Since the commit-graph and multi-pack-index files both use this trailing
hash, the chunk-format API uses a 'struct hashfile' to handle the I/O to
the file. This was very convenient to allow using the hashfile methods
during these operations.

However, hashing the file contents during write comes at a performance
penalty. It's slower to hash the bytes on their way to the disk than
without that step. If we wish to use the chunk-format API to upgrade
other file types, then this hashing is a performance penalty that might
not be worth the benefit of a trailing hash.

For example, if we create a chunk-format version of the packed-refs
file, then the file format could shrink by using raw object IDs instead
of hexadecimal representations in ASCII. That reduction in size is not
enough to counteract the performance penalty of hashing the file
contents. In cases such as deleting a reference that appears in the
packed-refs file, that write-time performance is critical. This is in
contrast to the commit-graph and multi-pack-index files which are mainly
updated in non-critical paths such as background maintenance.

One way to allow future chunked formats to not suffer this penalty would
be to create an abstraction layer around the 'struct hashfile' using a
vtable of function pointers. This would allow placing a different
representation in place of the hashfile. This option would be cumbersome
for a few reasons. First, the hashfile's buffered writes are already
highly optimized and would need to be duplicated in another code path.
Second, the chunk-format API calls the chunk_write_fn
pointers using a hashfile. If we change that to an abstraction layer,
then those that _do_ use the hashfile API would need to change all of
their instances of hashwrite(), hashwrite_be32(), and others to use the
new abstraction layer.

Instead, this change opts for a simpler change. Introduce a new
'skip_hash' option to 'struct hashfile'. When set, the update_fn and
final_fn members of the_hash_algo are skipped. When finalizing the
hashfile, the trailing hash is replaced with the null hash.

This use of a trailing null hash would be desirable in either case,
since we do not want to special case a file format to have a different
length depending on whether it was hashed or not. When the final bytes
of a file are all zero, we can infer that it was written without
hashing, and thus that verification is not available as a check for file
consistency. This also means that we could easily toggle hashing for any
file format we desire. For the commit-graph and multi-pack-index file,
it may be possible to allow the null hash without incrementing the file
format version, since it technically fits the structure of the file
format. The only issue is that older versions would trigger a failure
during 'git fsck'. For these file formats, we may want to delay such a
change until it is justified.

However, the index file is written in critical paths. It is also
frequently updated, so corruption at rest is less likely to be an issue
than in those other file formats. This could be a good candidate to
create an option that skips the hashing operation.

A version of this patch has existed in the microsoft/git fork since
2017 [1] (the linked commit was rebased in 2018, but the original dates
back to January 2017). Here, the change to make the index use this fast
path is delayed until a later change.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

Co-authored-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 csum-file.c | 14 +++++++++++---
 csum-file.h |  7 +++++++
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 59ef3398ca2..3243473c3d7 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -45,7 +45,8 @@ void hashflush(struct hashfile *f)
 	unsigned offset = f->offset;
 
 	if (offset) {
-		the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
+		if (!f->skip_hash)
+			the_hash_algo->update_fn(&f->ctx, f->buffer, offset);
 		flush(f, f->buffer, offset);
 		f->offset = 0;
 	}
@@ -64,7 +65,12 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result,
 	int fd;
 
 	hashflush(f);
-	the_hash_algo->final_fn(f->buffer, &f->ctx);
+
+	if (f->skip_hash)
+		memset(f->buffer, 0, the_hash_algo->rawsz);
+	else
+		the_hash_algo->final_fn(f->buffer, &f->ctx);
+
 	if (result)
 		hashcpy(result, f->buffer);
 	if (flags & CSUM_HASH_IN_STREAM)
@@ -108,7 +114,8 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 			 * the hashfile's buffer. In this block,
 			 * f->offset is necessarily zero.
 			 */
-			the_hash_algo->update_fn(&f->ctx, buf, nr);
+			if (!f->skip_hash)
+				the_hash_algo->update_fn(&f->ctx, buf, nr);
 			flush(f, buf, nr);
 		} else {
 			/*
@@ -153,6 +160,7 @@ static struct hashfile *hashfd_internal(int fd, const char *name,
 	f->tp = tp;
 	f->name = name;
 	f->do_crc = 0;
+	f->skip_hash = 0;
 	the_hash_algo->init_fn(&f->ctx);
 
 	f->buffer_len = buffer_len;
diff --git a/csum-file.h b/csum-file.h
index 0d29f528fbc..29468067f81 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -20,6 +20,13 @@ struct hashfile {
 	size_t buffer_len;
 	unsigned char *buffer;
 	unsigned char *check_buffer;
+
+	/**
+	 * If set to 1, skip_hash indicates that we should
+	 * not actually compute the hash for this hashfile and
+	 * instead only use it as a buffered write.
+	 */
+	unsigned int skip_hash;
 };
 
 /* Checkpoint */
-- 
gitgitgadget



* [PATCH 02/30] read-cache: add index.computeHash config option
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change allowed skipping the hashing portion of the
hashwrite API, using it instead as a buffered write API. Disabling the
hashing can be particularly helpful when the write operation is in a
critical path.

One such critical path is the writing of the index. This operation is so
critical that the sparse index was created specifically to reduce the
size of the index to make these writes (and reads) faster.

Following a similar approach to one used in the microsoft/git fork [1],
add a new config option that allows disabling this hashing during the
index write. The cost is that we can no longer validate the contents for
corruption-at-rest using the trailing hash.

[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201

While older Git versions will not recognize the null hash as a special
case, the file still conforms to the structure of the format. Using this
null hash will still allow Git operations to function across older
versions.

The one exception is 'git fsck' which checks the hash of the index file.
Here, we disable this check if the trailing hash is all zeroes. We add a
warning to the config option that this may cause undesirable behavior
with older Git versions.

As a quick comparison, I tested 'git update-index --force-write' with
and without index.computeHash=false on a copy of the Linux kernel
repository.

Benchmark 1: with hash
  Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
  Range (min … max):    34.3 ms …  79.1 ms    82 runs

Benchmark 2: without hash
  Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
  Range (min … max):    16.3 ms …  42.0 ms    69 runs

Summary
  'without hash' ran
    1.78 ± 0.76 times faster than 'with hash'

These performance benefits are substantial enough to justify letting
users opt in to this feature, even with the potential confusion
with older 'git fsck' versions.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/index.txt |  8 ++++++++
 read-cache.c                   | 22 +++++++++++++++++++++-
 t/t1600-index.sh               |  8 ++++++++
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt
index 75f3a2d1054..709ba72f622 100644
--- a/Documentation/config/index.txt
+++ b/Documentation/config/index.txt
@@ -30,3 +30,11 @@ index.version::
 	Specify the version with which new index files should be
 	initialized.  This does not affect existing repositories.
 	If `feature.manyFiles` is enabled, then the default is 4.
+
+index.computeHash::
+	When enabled, compute the hash of the index file as it is written
+	and store the hash at the end of the content. This is enabled by
+	default.
++
+If you disable `index.computeHash`, then older Git clients may report that
+your index is corrupt during `git fsck`.
diff --git a/read-cache.c b/read-cache.c
index 32024029274..f24d96de4d3 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	git_hash_ctx c;
 	unsigned char hash[GIT_MAX_RAWSZ];
 	int hdr_version;
+	int all_zeroes = 1;
+	unsigned char *start, *end;
 
 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
 		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
@@ -1827,10 +1829,23 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
 	if (!verify_index_checksum)
 		return 0;
 
+	end = (unsigned char *)hdr + size;
+	start = end - the_hash_algo->rawsz;
+	while (start < end) {
+		if (*start != 0) {
+			all_zeroes = 0;
+			break;
+		}
+		start++;
+	}
+
+	if (all_zeroes)
+		return 0;
+
 	the_hash_algo->init_fn(&c);
 	the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
 	the_hash_algo->final_fn(hash, &c);
-	if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
+	if (!hasheq(hash, end - the_hash_algo->rawsz))
 		return error(_("bad index file sha1 signature"));
 	return 0;
 }
@@ -2917,9 +2932,14 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	int ieot_entries = 1;
 	struct index_entry_offset_table *ieot = NULL;
 	int nr, nr_threads;
+	int compute_hash;
 
 	f = hashfd(tempfile->fd, tempfile->filename.buf);
 
+	if (!git_config_get_maybe_bool("index.computehash", &compute_hash) &&
+	    !compute_hash)
+		f->skip_hash = 1;
+
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
 			removed++;
diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 010989f90e6..24ab90ca047 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -103,4 +103,12 @@ test_expect_success 'index version config precedence' '
 	test_index_version 0 true 2 2
 '
 
+test_expect_success 'index.computeHash config option' '
+	(
+		rm -f .git/index &&
+		git -c index.computeHash=false add a &&
+		git fsck
+	)
+'
+
 test_done
-- 
gitgitgadget



* [PATCH 03/30] extensions: add refFormat extension
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Git's reference storage is critical to its function. Creating new
storage formats for references requires adding an extension. This
prevents third-party tools that do not understand that format from
operating incorrectly on the repository. This makes updating ref formats
more difficult than other optional indexes, such as the commit-graph or
multi-pack-index.

However, there are a number of potential ref storage enhancements that
are underway or could be created. Git needs an established mechanism for
coordinating between these different options.

The first obvious format update is the reftable format as documented in
Documentation/technical/reftable.txt. This format has much of its
implementation already in Git, but its connection as a ref backend is
not complete. This change is similar to some changes within one of the
patches intended for the reftable effort [1].

[1] https://lore.kernel.org/git/pull.1215.git.git.1644351400761.gitgitgadget@gmail.com/

However, this change makes a distinct strategy change from the one
recommended by reftable. Here, the extensions.refFormat extension is
provided as a multi-valued list. In the reftable RFC, the extension has
a single value, "files" or "reftable" and explicitly states that this
should not change after 'git init' or 'git clone'.

The single-valued approach has some major drawbacks, including the idea
that the "files" backend cannot coexist with the "reftable" backend at
the same time. In this way, it would not be possible to create a
repository that can write loose references and combine them into a
reftable in the background. With the multi-valued approach, we could
integrate reftable as a drop-in replacement for the packed-refs file.
That would be a faster path to integration, since the test suite would
only need updates where a test explicitly exercises packed-refs.

When upgrading a repository from the "files" backend to the "reftable"
backend, it can help to have a transition period where both are present,
and then to remove the "files" backend once all loose refs have been
collected into the reftable.

But the reftable is not the only approach available.

One obvious improvement could be a new file format version for the
packed-refs file. Its current plaintext-based format is inefficient due
to storing object IDs as hexadecimal representations instead of in
their raw format. This extra cost will get worse with SHA-256. In
addition, binary searches need to guess a position and scan to find
newlines for a refname entry. A structured binary format could allow for
more compact representation and faster access. Adding such a format
could be seen as "files-v2", but it is really "packed-v2".
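
To make the cost argument concrete, here is a small sketch that is not
part of this patch. It assumes the v1 entry layout "<hex-oid> SP
<refname> LF" and ignores peeled ("^") lines; the helper names
hex_oid_len(), v1_entry_len() and line_start() are made up for
illustration. The last helper shows the extra scan a binary search has
to do after guessing a byte position in a newline-delimited file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes needed to store an object ID as hexadecimal text. */
static size_t hex_oid_len(size_t raw_len)
{
	return 2 * raw_len;
}

/*
 * Bytes used by one v1 entry, assuming "<hex-oid> SP <refname> LF".
 * Peeled lines are ignored in this sketch.
 */
static size_t v1_entry_len(size_t raw_len, const char *refname)
{
	assert(refname != NULL);
	return hex_oid_len(raw_len) + 1 + strlen(refname) + 1;
}

/*
 * Given a guessed position inside a newline-delimited buffer, walk
 * backwards to the first byte of the line containing that position.
 */
static const char *line_start(const char *buf, const char *pos)
{
	assert(buf != NULL && pos >= buf);
	while (pos > buf && pos[-1] != '\n')
		pos--;
	return pos;
}
```

With SHA-1 (20 raw bytes) the object ID alone takes 40 bytes as text,
and with SHA-256 (32 raw bytes) it takes 64. A fixed-width binary
layout could avoid both the doubling and the backwards scan.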

The reftable approach has a concept of a "stack" of reftable files. This
idea would also work for a stack of packed-refs files (in v1 or v2
format). It would be helpful to describe that the refs could be stored
in a stack of packed-ref files independently of whether that is in file
format v1 or v2.

Even in these two options, it might be helpful to indicate whether or
not loose ref files are present. That is one reason to not make them
appear as "files-v2" or "files-v3" options in a single-valued extension.
Even as "packed-v2" or "packed-v3" options, this approach would require
third-party tools to understand the "v2" version if they want to support
the "v3" options. Instead, by splitting the format from the layout, we
can allow third-party tools to integrate only with the most-desired
format options.

For these reasons, this change is defining the extensions.refFormat
extension as well as how the two existing values interact. By default,
Git will assume "files" and "packed" in the list. If any other value
is provided, then the extension is marked as unrecognized.

Add tests that check the behavior of extensions.refFormat, both in that
it requires core.repositoryFormatVersion=1, and Git will refuse to work
with an unknown value of the extension.

There is a gap in the current implementation, though. What happens if
exactly one of "files" or "packed" is provided? The presence of only one
would imply that the other is not available. A later change can
communicate the list contents to the repository struct and then the
reference backend could ignore one of these two layers.

Specifically, having only "files" would mean that Git should not read or
write the packed-refs file and instead only read and write loose ref
files. By contrast, having only "packed" would mean that Git should not
read or write loose ref files and instead always update the packed-refs
file on every ref update.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/extensions.txt | 41 +++++++++++++++++++++++++++++
 setup.c                             |  5 ++++
 t/t3212-ref-formats.sh              | 27 +++++++++++++++++++
 3 files changed, 73 insertions(+)
 create mode 100755 t/t3212-ref-formats.sh

diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
index bccaec7a963..ce8185adf53 100644
--- a/Documentation/config/extensions.txt
+++ b/Documentation/config/extensions.txt
@@ -7,6 +7,47 @@ Note that this setting should only be set by linkgit:git-init[1] or
 linkgit:git-clone[1].  Trying to change it after initialization will not
 work and will produce hard-to-diagnose issues.
 
+extensions.refFormat::
+	Specify the reference storage mechanisms used by the repository as a
+	multi-valued list. The acceptable values are `files` and `packed`.
+	If not specified, the list of `files` and `packed` is assumed. It
+	is an error to specify this key unless `core.repositoryFormatVersion`
+	is 1.
++
+As new ref formats are added, Git commands may modify this list before and
+after upgrading the on-disk reference storage files. The specific values
+indicate the existence of different layers:
++
+--
+`files`;;
+	When present, references may be stored as "loose" reference files
+	in the `$GIT_DIR/refs/` directory. The name of the reference
+	corresponds to the filename after `$GIT_DIR` and the file contains
+	an object ID as a hexadecimal string. If a loose reference file
+	exists, then its value takes precedence over all other formats.
+
+`packed`;;
+	When present, references may be stored as a group in a
+	`packed-refs` file in its version 1 format. When grouped with
+	`files` or provided on its own, this file is located at
+	`$GIT_DIR/packed-refs`. This file contains a list of distinct
+	reference names, paired with their object IDs. When combined with
+	`files`, the `packed` format will only be used to group multiple
+	loose reference files upon request via the `git pack-refs` command or
+	via the `pack-refs` maintenance task.
+--
++
+The following combinations are supported by this version of Git:
++
+--
+`files` and `packed`;;
+	This set of values indicates that references are stored both as
+	loose reference files and in the `packed-refs` file in its v1
+	format. Loose references are preferred, and the `packed-refs` file
+	is updated only when deleting a reference that is stored in the
+	`packed-refs` file or during a `git pack-refs` command.
+--
+
 extensions.worktreeConfig::
 	If enabled, then worktrees will load config settings from the
 	`$GIT_DIR/config.worktree` file in addition to the
diff --git a/setup.c b/setup.c
index cefd5f63c46..f5eb50c969a 100644
--- a/setup.c
+++ b/setup.c
@@ -577,6 +577,11 @@ static enum extension_result handle_extension(const char *var,
 				     "extensions.objectformat", value);
 		data->hash_algo = format;
 		return EXTENSION_OK;
+	} else if (!strcmp(ext, "refformat")) {
+		if (strcmp(value, "files") && strcmp(value, "packed"))
+			return error(_("invalid value for '%s': '%s'"),
+				     "extensions.refFormat", value);
+		return EXTENSION_OK;
 	}
 	return EXTENSION_UNKNOWN;
 }
diff --git a/t/t3212-ref-formats.sh b/t/t3212-ref-formats.sh
new file mode 100755
index 00000000000..bc554e7c701
--- /dev/null
+++ b/t/t3212-ref-formats.sh
@@ -0,0 +1,27 @@
+#!/bin/sh
+
+test_description='test across ref formats'
+
+. ./test-lib.sh
+
+test_expect_success 'extensions.refFormat requires core.repositoryFormatVersion=1' '
+	test_when_finished rm -rf broken &&
+
+	# Force sha1 to ensure GIT_TEST_DEFAULT_HASH does
+	# not imply a value of core.repositoryFormatVersion.
+	git init --object-format=sha1 broken &&
+	git -C broken config extensions.refFormat files &&
+	test_must_fail git -C broken status 2>err &&
+	grep "repo version is 0, but v1-only extension found" err
+'
+
+test_expect_success 'invalid extensions.refFormat' '
+	test_when_finished rm -rf broken &&
+	git init broken &&
+	git -C broken config core.repositoryFormatVersion 1 &&
+	git -C broken config extensions.refFormat bogus &&
+	test_must_fail git -C broken status 2>err &&
+	grep "invalid value for '\''extensions.refFormat'\'': '\''bogus'\''" err
+'
+
+test_done
-- 
gitgitgadget


* [PATCH 04/30] config: fix multi-level bulleted list
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 03/30] extensions: add refFormat extension Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 05/30] repository: wire ref extensions to ref backends Derrick Stolee via GitGitGadget
                   ` (28 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The documentation for 'extensions.worktreeConfig' includes a bulleted
list describing certain config values that need to be moved into the
worktree config instead of the repository config file. However, since we
are already in a bulleted list, the documentation tools do not know
when that inner list is complete. Paragraphs intended to not be part of
that inner list are rendered as part of the last bullet.

Modify the format to match a similar doubly-nested list from the
'column.ui' config documentation. Reword the descriptions slightly to
make the config keys appear as their own heading in the inner list.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/extensions.txt | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
index ce8185adf53..18ed1c58126 100644
--- a/Documentation/config/extensions.txt
+++ b/Documentation/config/extensions.txt
@@ -62,10 +62,15 @@ When enabling `extensions.worktreeConfig`, you must be careful to move
 certain values from the common config file to the main working tree's
 `config.worktree` file, if present:
 +
-* `core.worktree` must be moved from `$GIT_COMMON_DIR/config` to
-  `$GIT_COMMON_DIR/config.worktree`.
-* If `core.bare` is true, then it must be moved from `$GIT_COMMON_DIR/config`
-  to `$GIT_COMMON_DIR/config.worktree`.
+--
+`core.worktree`;;
+	This config value must be moved from `$GIT_COMMON_DIR/config` to
+	`$GIT_COMMON_DIR/config.worktree`.
+
+`core.bare`;;
+	If true, then this value must be moved from
+	`$GIT_COMMON_DIR/config` to `$GIT_COMMON_DIR/config.worktree`.
+--
 +
 It may also be beneficial to adjust the locations of `core.sparseCheckout`
 and `core.sparseCheckoutCone` depending on your desire for customizable
-- 
gitgitgadget


* [PATCH 05/30] repository: wire ref extensions to ref backends
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 04/30] config: fix multi-level bulleted list Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 06/30] refs: allow loose files without packed-refs Derrick Stolee via GitGitGadget
                   ` (27 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change introduced the extensions.refFormat config option.
It is a multi-valued config option that currently understands "files"
and "packed", with both values assumed by default. If any value is
provided explicitly, this default is ignored and the provided settings
are used instead.
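
The rule in the paragraph above can be sketched in isolation as
follows. This is only an illustration under assumed names
(sketch_parse_ref_formats() and the SKETCH_REF_* flags do not exist in
Git); the real wiring in this patch goes through handle_extension()
and read_repository_format().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag names, used only in this sketch. */
enum sketch_ref_format {
	SKETCH_REF_FILES  = 1 << 0,
	SKETCH_REF_PACKED = 1 << 1,
};

/*
 * Fold a multi-valued list into a bitmask. An empty list yields the
 * default of both formats; any explicit value replaces that default.
 * Returns 0 on success and -1 if an unknown value is seen.
 */
static int sketch_parse_ref_formats(const char *const *values, size_t nr,
				    unsigned *out)
{
	unsigned flags = 0;
	size_t i;

	assert(out != NULL);
	for (i = 0; i < nr; i++) {
		if (!strcmp(values[i], "files"))
			flags |= SKETCH_REF_FILES;
		else if (!strcmp(values[i], "packed"))
			flags |= SKETCH_REF_PACKED;
		else
			return -1; /* unrecognized format name */
	}

	if (!nr)
		flags = SKETCH_REF_FILES | SKETCH_REF_PACKED;

	*out = flags;
	return 0;
}
```

Passing an empty list gives both flags, passing only "files" gives
only the loose-ref flag, and any other string is rejected.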

The multi-valued nature of this extension presents a way to allow a user
to specify that they never want a packed-refs file (only use "files") or
that they never want loose reference files (only use "packed"). However,
that functionality is not currently connected.

Before actually modifying the files backend to understand these
extension settings, do the basic wiring that connects the
extensions.refFormat parsing to the creation of the ref backend. A
future change will actually change the ref backend initialization based
on these settings, but this communication of the extension is
sufficiently complicated to be worth an isolated change.

For now, also forbid the setting of only "packed". This is done by
redirecting the choice of backend to the packed backend when that
selection is made. A later change will make the "files"-only extension
value ignore the packed backend.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 cache.h                |  2 ++
 refs.c                 | 22 ++++++++++++++++++++--
 refs/files-backend.c   |  2 +-
 refs/refs-internal.h   |  3 +++
 repository.c           |  2 ++
 repository.h           |  6 ++++++
 setup.c                | 18 +++++++++++++++++-
 t/t3212-ref-formats.sh | 12 ++++++++++++
 8 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/cache.h b/cache.h
index 26ed03bd6de..13e9c251ac3 100644
--- a/cache.h
+++ b/cache.h
@@ -1155,6 +1155,8 @@ struct repository_format {
 	int hash_algo;
 	int sparse_index;
 	char *work_tree;
+	int ref_format_count;
+	enum ref_format_flags ref_format;
 	struct string_list unknown_extensions;
 	struct string_list v1_only_extensions;
 };
diff --git a/refs.c b/refs.c
index 1491ae937eb..21441ddb162 100644
--- a/refs.c
+++ b/refs.c
@@ -1982,6 +1982,15 @@ static struct ref_store *lookup_ref_store_map(struct hashmap *map,
 	return entry ? entry->refs : NULL;
 }
 
+static int add_ref_format_flags(enum ref_format_flags flags, int caps) {
+	if (flags & REF_FORMAT_FILES)
+		caps |= REF_STORE_FORMAT_FILES;
+	if (flags & REF_FORMAT_PACKED)
+		caps |= REF_STORE_FORMAT_PACKED;
+
+	return caps;
+}
+
 /*
  * Create, record, and return a ref_store instance for the specified
  * gitdir.
@@ -1991,9 +2000,17 @@ static struct ref_store *ref_store_init(struct repository *repo,
 					unsigned int flags)
 {
 	const char *be_name = "files";
-	struct ref_storage_be *be = find_ref_storage_backend(be_name);
+	struct ref_storage_be *be;
 	struct ref_store *refs;
 
+	flags = add_ref_format_flags(repo->ref_format, flags);
+
+	if (!(flags & REF_STORE_FORMAT_FILES) &&
+	    (flags & REF_STORE_FORMAT_PACKED))
+		be_name = "packed";
+
+	be = find_ref_storage_backend(be_name);
+
 	if (!be)
 		BUG("reference backend %s is unknown", be_name);
 
@@ -2009,7 +2026,8 @@ struct ref_store *get_main_ref_store(struct repository *r)
 	if (!r->gitdir)
 		BUG("attempting to get main_ref_store outside of repository");
 
-	r->refs_private = ref_store_init(r, r->gitdir, REF_STORE_ALL_CAPS);
+	r->refs_private = ref_store_init(r, r->gitdir,
+					 REF_STORE_ALL_CAPS);
 	r->refs_private = maybe_debug_wrap_ref_store(r->gitdir, r->refs_private);
 	return r->refs_private;
 }
diff --git a/refs/files-backend.c b/refs/files-backend.c
index b89954355de..db6c8e434c6 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -3274,7 +3274,7 @@ static int files_init_db(struct ref_store *ref_store, struct strbuf *err UNUSED)
 }
 
 struct ref_storage_be refs_be_files = {
-	.next = NULL,
+	.next = &refs_be_packed,
 	.name = "files",
 	.init = files_ref_store_create,
 	.init_db = files_init_db,
diff --git a/refs/refs-internal.h b/refs/refs-internal.h
index 69f93b0e2ac..41520c945e4 100644
--- a/refs/refs-internal.h
+++ b/refs/refs-internal.h
@@ -521,6 +521,9 @@ struct ref_store;
 				 REF_STORE_ODB | \
 				 REF_STORE_MAIN)
 
+#define REF_STORE_FORMAT_FILES		(1 << 8) /* can use loose ref files */
+#define REF_STORE_FORMAT_PACKED		(1 << 9) /* can use packed-refs file */
+
 /*
  * Initialize the ref_store for the specified gitdir. These functions
  * should call base_ref_store_init() to initialize the shared part of
diff --git a/repository.c b/repository.c
index 5d166b692c8..96533fc76be 100644
--- a/repository.c
+++ b/repository.c
@@ -182,6 +182,8 @@ int repo_init(struct repository *repo,
 	repo->repository_format_partial_clone = format.partial_clone;
 	format.partial_clone = NULL;
 
+	repo->ref_format = format.ref_format;
+
 	if (worktree)
 		repo_set_worktree(repo, worktree);
 
diff --git a/repository.h b/repository.h
index 24316ac944e..5cfde4282c5 100644
--- a/repository.h
+++ b/repository.h
@@ -61,6 +61,11 @@ struct repo_path_cache {
 	char *shallow;
 };
 
+enum ref_format_flags {
+	REF_FORMAT_FILES = (1 << 0),
+	REF_FORMAT_PACKED = (1 << 1),
+};
+
 struct repository {
 	/* Environment */
 	/*
@@ -95,6 +100,7 @@ struct repository {
 	 * the ref object.
 	 */
 	struct ref_store *refs_private;
+	enum ref_format_flags ref_format;
 
 	/*
 	 * Contains path to often used file names.
diff --git a/setup.c b/setup.c
index f5eb50c969a..a5e63479558 100644
--- a/setup.c
+++ b/setup.c
@@ -578,9 +578,14 @@ static enum extension_result handle_extension(const char *var,
 		data->hash_algo = format;
 		return EXTENSION_OK;
 	} else if (!strcmp(ext, "refformat")) {
-		if (strcmp(value, "files") && strcmp(value, "packed"))
+		if (!strcmp(value, "files"))
+			data->ref_format |= REF_FORMAT_FILES;
+		else if (!strcmp(value, "packed"))
+			data->ref_format |= REF_FORMAT_PACKED;
+		else
 			return error(_("invalid value for '%s': '%s'"),
 				     "extensions.refFormat", value);
+		data->ref_format_count++;
 		return EXTENSION_OK;
 	}
 	return EXTENSION_UNKNOWN;
@@ -723,6 +728,11 @@ int read_repository_format(struct repository_format *format, const char *path)
 	git_config_from_file(check_repo_format, path, format);
 	if (format->version == -1)
 		clear_repository_format(format);
+
+	/* Set default ref_format if no extensions.refFormat exists. */
+	if (!format->ref_format_count)
+		format->ref_format = REF_FORMAT_FILES | REF_FORMAT_PACKED;
+
 	return format->version;
 }
 
@@ -1425,6 +1435,9 @@ int discover_git_directory(struct strbuf *commondir,
 		candidate.partial_clone;
 	candidate.partial_clone = NULL;
 
+	/* take ownership of candidate.ref_format */
+	the_repository->ref_format = candidate.ref_format;
+
 	clear_repository_format(&candidate);
 	return 0;
 }
@@ -1561,6 +1574,8 @@ const char *setup_git_directory_gently(int *nongit_ok)
 			the_repository->repository_format_partial_clone =
 				repo_fmt.partial_clone;
 			repo_fmt.partial_clone = NULL;
+
+			the_repository->ref_format = repo_fmt.ref_format;
 		}
 	}
 	/*
@@ -1650,6 +1665,7 @@ void check_repository_format(struct repository_format *fmt)
 	repo_set_hash_algo(the_repository, fmt->hash_algo);
 	the_repository->repository_format_partial_clone =
 		xstrdup_or_null(fmt->partial_clone);
+	the_repository->ref_format = fmt->ref_format;
 	clear_repository_format(&repo_fmt);
 }
 
diff --git a/t/t3212-ref-formats.sh b/t/t3212-ref-formats.sh
index bc554e7c701..8c4e70196a0 100755
--- a/t/t3212-ref-formats.sh
+++ b/t/t3212-ref-formats.sh
@@ -24,4 +24,16 @@ test_expect_success 'invalid extensions.refFormat' '
 	grep "invalid value for '\''extensions.refFormat'\'': '\''bogus'\''" err
 '
 
+test_expect_success 'extensions.refFormat=packed only' '
+	git init only-packed &&
+	(
+		cd only-packed &&
+		git config core.repositoryFormatVersion 1 &&
+		git config extensions.refFormat packed &&
+		test_commit A &&
+		test_path_exists .git/packed-refs &&
+		test_path_is_missing .git/refs/tags/A
+	)
+'
+
 test_done
-- 
gitgitgadget


* [PATCH 06/30] refs: allow loose files without packed-refs
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 05/30] repository: wire ref extensions to ref backends Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 07/30] chunk-format: number of chunks is optional Derrick Stolee via GitGitGadget
                   ` (26 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The extensions.refFormat extension is a multi-valued config that
specifies which ref formats are available to the current repository. By
default, Git assumes the list of "files" and "packed", unless at least
one value is specified explicitly.

With the current values, it is possible for a user to specify only
"files" or only "packed". The only-"packed" option was already ruled as
invalid since Git's current code has too many places that require a
loose reference. This could change in the future.

However, we can now allow the user to specify extensions.refFormat=files
alone, making it impossible to create a packed-refs file (or to read one
that might exist).

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/extensions.txt |  5 +++++
 refs/files-backend.c                |  6 ++++++
 refs/packed-backend.c               |  3 +++
 refs/refs-internal.h                |  5 +++++
 t/t3212-ref-formats.sh              | 20 ++++++++++++++++++++
 5 files changed, 39 insertions(+)

diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
index 18ed1c58126..18071c336d0 100644
--- a/Documentation/config/extensions.txt
+++ b/Documentation/config/extensions.txt
@@ -46,6 +46,11 @@ The following combinations are supported by this version of Git:
 	format. Loose references are preferred, and the `packed-refs` file
 	is updated only when deleting a reference that is stored in the
 	`packed-refs` file or during a `git pack-refs` command.
+
+`files`;;
+	When only this value is present, Git will ignore the `packed-refs`
+	file and refuse to write one during `git pack-refs`. All references
+	will be read from and written to loose reference files.
 --
 
 extensions.worktreeConfig::
diff --git a/refs/files-backend.c b/refs/files-backend.c
index db6c8e434c6..4a18aed6204 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -1198,6 +1198,12 @@ static int files_pack_refs(struct ref_store *ref_store, unsigned int flags)
 	struct strbuf err = STRBUF_INIT;
 	struct ref_transaction *transaction;
 
+	if (!packed_refs_enabled(refs->store_flags)) {
+		warning(_("refusing to create '%s' file because '%s' is not set"),
+			"packed-refs", "extensions.refFormat=packed");
+		return -1;
+	}
+
 	transaction = ref_store_transaction_begin(refs->packed_ref_store, &err);
 	if (!transaction)
 		return -1;
diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index c1c71d183ea..a4371b711b9 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -478,6 +478,9 @@ static int load_contents(struct snapshot *snapshot)
 	size_t size;
 	ssize_t bytes_read;
 
+	if (!packed_refs_enabled(snapshot->refs->store_flags))
+		return 0;
+
 	fd = open(snapshot->refs->path, O_RDONLY);
 	if (fd < 0) {
 		if (errno == ENOENT) {
diff --git a/refs/refs-internal.h b/refs/refs-internal.h
index 41520c945e4..a1900848a87 100644
--- a/refs/refs-internal.h
+++ b/refs/refs-internal.h
@@ -524,6 +524,11 @@ struct ref_store;
 #define REF_STORE_FORMAT_FILES		(1 << 8) /* can use loose ref files */
 #define REF_STORE_FORMAT_PACKED		(1 << 9) /* can use packed-refs file */
 
+static inline int packed_refs_enabled(int flags)
+{
+	return flags & REF_STORE_FORMAT_PACKED;
+}
+
 /*
  * Initialize the ref_store for the specified gitdir. These functions
  * should call base_ref_store_init() to initialize the shared part of
diff --git a/t/t3212-ref-formats.sh b/t/t3212-ref-formats.sh
index 8c4e70196a0..67aa65c116f 100755
--- a/t/t3212-ref-formats.sh
+++ b/t/t3212-ref-formats.sh
@@ -36,4 +36,24 @@ test_expect_success 'extensions.refFormat=packed only' '
 	)
 '
 
+test_expect_success 'extensions.refFormat=files only' '
+	test_commit T &&
+	git pack-refs --all &&
+	git init only-loose &&
+	(
+		cd only-loose &&
+		git config core.repositoryFormatVersion 1 &&
+		git config extensions.refFormat files &&
+		test_commit A &&
+		test_commit B &&
+		test_must_fail git pack-refs 2>err &&
+		grep "refusing to create" err &&
+		test_path_is_missing .git/packed-refs &&
+
+		# Refuse to parse a packed-refs file.
+		cp ../.git/packed-refs .git/packed-refs &&
+		test_must_fail git rev-parse refs/tags/T
+	)
+'
+
 test_done
-- 
gitgitgadget


* [PATCH 07/30] chunk-format: number of chunks is optional
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 06/30] refs: allow loose files without packed-refs Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 08/30] chunk-format: document trailing table of contents Derrick Stolee via GitGitGadget
                   ` (25 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Even though the commit-graph and multi-pack-index file formats specify a
number of chunks in their header information, this is optional. The
table of contents terminates with a null chunk ID, which can be used
instead. The extra value is helpful for some checks, but is ultimately
not necessary for the format.

This will be important in some future formats.
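
As an illustration of why the count is redundant, a reader can recover
it by walking the table of contents until it hits the null chunk ID.
The sketch below is not Git's implementation; it assumes 12-byte rows
(a 4-byte chunk ID followed by an 8-byte offset) with the ID stored in
network byte order, and count_toc_chunks() is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_ROW_SIZE 12 /* 4-byte chunk ID + 8-byte offset */

/* Read a 32-bit big-endian integer from an unaligned buffer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Count the chunks listed in a table of contents by scanning rows
 * until the terminating row with a zero chunk ID. Returns -1 if no
 * terminating row is found within 'len' bytes.
 */
static long count_toc_chunks(const unsigned char *toc, size_t len)
{
	size_t pos;
	long count = 0;

	assert(toc != NULL);
	for (pos = 0; pos + TOC_ROW_SIZE <= len; pos += TOC_ROW_SIZE) {
		if (!read_be32(toc + pos))
			return count; /* null chunk ID ends the table */
		count++;
	}
	return -1;
}
```

A header-provided count would still let a parser cross-check this
result, which matches the "helpful for some checks" remark above.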

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/gitformat-chunk.txt | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/gitformat-chunk.txt b/Documentation/gitformat-chunk.txt
index 57202ede273..c01f5567c4f 100644
--- a/Documentation/gitformat-chunk.txt
+++ b/Documentation/gitformat-chunk.txt
@@ -24,8 +24,9 @@ how they use the chunks to describe structured data.
 
 A chunk-based file format begins with some header information custom to
 that format. That header should include enough information to identify
-the file type, format version, and number of chunks in the file. From this
-information, that file can determine the start of the chunk-based region.
+the file type, format version, and (optionally) the number of chunks in
+the file. From this information, that file can determine the start of the
+chunk-based region.
 
 The chunk-based region starts with a table of contents describing where
 each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
-- 
gitgitgadget


* [PATCH 08/30] chunk-format: document trailing table of contents
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 07/30] chunk-format: number of chunks is optional Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 09/30] chunk-format: store chunk offset during write Derrick Stolee via GitGitGadget
                   ` (24 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

It will be helpful to allow a trailing table of contents when writing
some file types with the chunk-format API. The main reason is that it
allows dynamically computing the chunk sizes while writing the file.
This can use fewer resources than precomputing all chunk sizes in
advance.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/gitformat-chunk.txt | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/gitformat-chunk.txt b/Documentation/gitformat-chunk.txt
index c01f5567c4f..ee3718c4306 100644
--- a/Documentation/gitformat-chunk.txt
+++ b/Documentation/gitformat-chunk.txt
@@ -52,8 +52,27 @@ The final entry in the table of contents must be four zero bytes. This
 confirms that the table of contents is ending and provides the offset for
 the end of the chunk-based data.
 
+The default chunk format assumes the table of contents appears at the
+beginning of the file (after the header information) and the chunks are
+ordered by increasing offset. Alternatively, the chunk format allows a
+table of contents that is placed at the end of the file (before the
+trailing hash) and the offsets are in descending order. In this trailing
+table of contents case, the data in order looks instead like the following
+table:
+
+  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
+  |--------------------|------------------------|
+  | 0x0000             | OFFSET[C+1]            |
+  | ID[C]              | OFFSET[C]              |
+  | ...                | ...                    |
+  | ID[0]              | OFFSET[0]              |
+
+The concrete file format that uses the chunk format will mention that it
+uses a trailing table of contents if it uses it. By default, the table of
+contents is in ascending order before all chunk data.
+
 Note: The chunk-based format expects that the file contains _at least_ a
-trailing hash after `OFFSET[C+1]`.
+trailing hash after either `OFFSET[C+1]` or the trailing table of contents.
 
 Functions for working with chunk-based file formats are declared in
 `chunk-format.h`. Using these methods provide extra checks that assist
-- 
gitgitgadget


* [PATCH 09/30] chunk-format: store chunk offset during write
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (7 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 08/30] chunk-format: document trailing table of contents Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 10/30] chunk-format: allow trailing table of contents Derrick Stolee via GitGitGadget
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

As a preparatory step toward allowing a trailing table of contents, store
the offset of each chunk as we write it. This replaces an existing use of
a local variable, but the stored value will be used in the next change.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 chunk-format.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index 0275b74a895..f1b2c8a8b36 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -13,6 +13,7 @@ struct chunk_info {
 	chunk_write_fn write_fn;
 
 	const void *start;
+	off_t offset;
 };
 
 struct chunkfile {
@@ -78,16 +79,16 @@ int write_chunkfile(struct chunkfile *cf, void *data)
 	hashwrite_be64(cf->f, cur_offset);
 
 	for (i = 0; i < cf->chunks_nr; i++) {
-		off_t start_offset = hashfile_total(cf->f);
+		cf->chunks[i].offset = hashfile_total(cf->f);
 		result = cf->chunks[i].write_fn(cf->f, data);
 
 		if (result)
 			goto cleanup;
 
-		if (hashfile_total(cf->f) - start_offset != cf->chunks[i].size)
+		if (hashfile_total(cf->f) - cf->chunks[i].offset != cf->chunks[i].size)
 			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
 			    cf->chunks[i].size, cf->chunks[i].id,
-			    hashfile_total(cf->f) - start_offset);
+			    hashfile_total(cf->f) - cf->chunks[i].offset);
 	}
 
 cleanup:
-- 
gitgitgadget



* [PATCH 10/30] chunk-format: allow trailing table of contents
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (8 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 09/30] chunk-format: store chunk offset during write Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 11/30] chunk-format: parse " Derrick Stolee via GitGitGadget
                   ` (22 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The existing chunk formats place the table of contents at the beginning
of the file. This is intended to speed up the initial loading of the
file, but it comes at a cost during writes. Each existing format must
compute the size of every chunk in advance, which usually requires
holding the full file contents in memory.

Future file formats may want to use the chunk format API in cases where
the writing stage is critical to performance, so we may want to stream
updates from an existing file and then only write the table of contents
at the end.

Add a new 'flags' parameter to write_chunkfile() that allows this
behavior. When CHUNKFILE_TRAILING_TOC is set, the defensive check that
each chunk is written with its precomputed size is disabled. The table
of contents is then written in reverse order at the end of the
hashfile, so a parser can read the chunk list starting from the end of
the file (minus the hash).

Parsing of a trailing table of contents will come in a later change.
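
As a rough illustration of the layout described above, here is a
minimal standalone sketch. It is not the patch's code: it writes into a
plain memory buffer instead of a hashfile, and the names demo_chunk,
put_be32(), put_be64() and demo_write_trailing_toc() are hypothetical.
It only mirrors the ordering: chunk data first, then the terminating
entry, then the real entries in reverse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 4-byte id + 8-byte offset, the same shape as CHUNK_TOC_ENTRY_SIZE. */
#define TOC_ENTRY_SIZE 12

/* Hypothetical chunk description used only by this sketch. */
struct demo_chunk {
	uint32_t id;
	const char *payload; /* chunk body, given as a C string */
	uint64_t offset;     /* filled in while writing */
};

static size_t put_be32(unsigned char *out, uint32_t v)
{
	int i;
	for (i = 0; i < 4; i++)
		out[i] = (unsigned char)(v >> (24 - 8 * i));
	return 4;
}

static size_t put_be64(unsigned char *out, uint64_t v)
{
	int i;
	for (i = 0; i < 8; i++)
		out[i] = (unsigned char)(v >> (56 - 8 * i));
	return 8;
}

/*
 * Stream every chunk, recording where each one starts, then append
 * the table of contents: the terminating entry first, followed by
 * the chunks in reverse order. Returns the number of bytes written
 * (a real hashfile would append its hash after this).
 */
static size_t demo_write_trailing_toc(unsigned char *buf, size_t cap,
				      struct demo_chunk *chunks, int nr)
{
	size_t pos = 0;
	uint64_t end_of_chunks;
	int i;

	for (i = 0; i < nr; i++) {
		size_t len = strlen(chunks[i].payload);

		assert(pos + len <= cap);
		chunks[i].offset = pos;
		memcpy(buf + pos, chunks[i].payload, len);
		pos += len;
	}

	assert(pos + (size_t)(nr + 1) * TOC_ENTRY_SIZE <= cap);
	end_of_chunks = pos;

	/* The id-0 entry marks the end of the last chunk. */
	pos += put_be32(buf + pos, 0);
	pos += put_be64(buf + pos, end_of_chunks);

	for (i = nr - 1; i >= 0; i--) {
		pos += put_be32(buf + pos, chunks[i].id);
		pos += put_be64(buf + pos, chunks[i].offset);
	}

	return pos;
}
```

With this ordering, the entry for the first chunk ends up closest to
the trailing hash, so a reader walking backward from the hash meets the
chunks in their original order and reaches the terminating entry last.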

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 chunk-format.c | 53 +++++++++++++++++++++++++++++++++++---------------
 chunk-format.h |  9 ++++++++-
 commit-graph.c |  2 +-
 midx.c         |  2 +-
 4 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index f1b2c8a8b36..3f5cc9b5ddf 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -57,26 +57,31 @@ void add_chunk(struct chunkfile *cf,
 	cf->chunks_nr++;
 }
 
-int write_chunkfile(struct chunkfile *cf, void *data)
+int write_chunkfile(struct chunkfile *cf,
+		    enum chunkfile_flags flags,
+		    void *data)
 {
 	int i, result = 0;
-	uint64_t cur_offset = hashfile_total(cf->f);
 
 	trace2_region_enter("chunkfile", "write", the_repository);
 
-	/* Add the table of contents to the current offset */
-	cur_offset += (cf->chunks_nr + 1) * CHUNK_TOC_ENTRY_SIZE;
+	if (!(flags & CHUNKFILE_TRAILING_TOC)) {
+		uint64_t cur_offset = hashfile_total(cf->f);
 
-	for (i = 0; i < cf->chunks_nr; i++) {
-		hashwrite_be32(cf->f, cf->chunks[i].id);
-		hashwrite_be64(cf->f, cur_offset);
+		/* Add the table of contents to the current offset */
+		cur_offset += (cf->chunks_nr + 1) * CHUNK_TOC_ENTRY_SIZE;
 
-		cur_offset += cf->chunks[i].size;
-	}
+		for (i = 0; i < cf->chunks_nr; i++) {
+			hashwrite_be32(cf->f, cf->chunks[i].id);
+			hashwrite_be64(cf->f, cur_offset);
 
-	/* Trailing entry marks the end of the chunks */
-	hashwrite_be32(cf->f, 0);
-	hashwrite_be64(cf->f, cur_offset);
+			cur_offset += cf->chunks[i].size;
+		}
+
+		/* Trailing entry marks the end of the chunks */
+		hashwrite_be32(cf->f, 0);
+		hashwrite_be64(cf->f, cur_offset);
+	}
 
 	for (i = 0; i < cf->chunks_nr; i++) {
 		cf->chunks[i].offset = hashfile_total(cf->f);
@@ -85,10 +90,26 @@ int write_chunkfile(struct chunkfile *cf, void *data)
 		if (result)
 			goto cleanup;
 
-		if (hashfile_total(cf->f) - cf->chunks[i].offset != cf->chunks[i].size)
-			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
-			    cf->chunks[i].size, cf->chunks[i].id,
-			    hashfile_total(cf->f) - cf->chunks[i].offset);
+		if (!(flags & CHUNKFILE_TRAILING_TOC)) {
+			if (hashfile_total(cf->f) - cf->chunks[i].offset != cf->chunks[i].size)
+				BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
+				    cf->chunks[i].size, cf->chunks[i].id,
+				    hashfile_total(cf->f) - cf->chunks[i].offset);
+		}
+
+		cf->chunks[i].size = hashfile_total(cf->f) - cf->chunks[i].offset;
+	}
+
+	if (flags & CHUNKFILE_TRAILING_TOC) {
+		size_t last_chunk_tail = hashfile_total(cf->f);
+		/* First entry marks the end of the chunks */
+		hashwrite_be32(cf->f, 0);
+		hashwrite_be64(cf->f, last_chunk_tail);
+
+		for (i = cf->chunks_nr - 1; i >= 0; i--) {
+			hashwrite_be32(cf->f, cf->chunks[i].id);
+			hashwrite_be64(cf->f, cf->chunks[i].offset);
+		}
 	}
 
 cleanup:
diff --git a/chunk-format.h b/chunk-format.h
index 7885aa08487..39e8967e950 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -31,7 +31,14 @@ void add_chunk(struct chunkfile *cf,
 	       uint32_t id,
 	       size_t size,
 	       chunk_write_fn fn);
-int write_chunkfile(struct chunkfile *cf, void *data);
+
+enum chunkfile_flags {
+	CHUNKFILE_TRAILING_TOC = (1 << 0),
+};
+
+int write_chunkfile(struct chunkfile *cf,
+		    enum chunkfile_flags flags,
+		    void *data);
 
 int read_table_of_contents(struct chunkfile *cf,
 			   const unsigned char *mfile,
diff --git a/commit-graph.c b/commit-graph.c
index a7d87559328..c927b81250d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1932,7 +1932,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			get_num_chunks(cf) * ctx->commits.nr);
 	}
 
-	write_chunkfile(cf, ctx);
+	write_chunkfile(cf, 0, ctx);
 
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
diff --git a/midx.c b/midx.c
index 7cfad04a240..03d947a5d33 100644
--- a/midx.c
+++ b/midx.c
@@ -1510,7 +1510,7 @@ static int write_midx_internal(const char *object_dir,
 	}
 
 	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
-	write_chunkfile(cf, &ctx);
+	write_chunkfile(cf, 0, &ctx);
 
 	finalize_hashfile(f, midx_hash, FSYNC_COMPONENT_PACK_METADATA,
 			  CSUM_FSYNC | CSUM_HASH_IN_STREAM);
-- 
gitgitgadget



* [PATCH 11/30] chunk-format: parse trailing table of contents
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (9 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 10/30] chunk-format: allow trailing table of contents Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 12/30] refs: extract packfile format to new file Derrick Stolee via GitGitGadget
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The new read_trailing_table_of_contents() mimics
read_table_of_contents() except that it reads the table of contents in
reverse from the end of the given hashfile. The file is given as a
memory-mapped region and a size. Automatically calculate the start of
the trailing hash and read the table of contents in reverse from that
position.

The error conditions mirror those in read_table_of_contents(). The one
exception is that chunk_offset cannot be checked against the start of
the table of contents, since we do not know the table's length up
front. This may produce surprising results for some narrow forms of
corruption. However, we still bound offsets by the position of the
table-of-contents entries read so far. At minimum, the computed sizes
can be used to keep parsing within the file itself.
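
The backward walk can be sketched in isolation as follows. This is an
approximation under stated assumptions, not the patch's function:
demo_toc_entry, demo_get_be32(), demo_get_be64() and
demo_read_trailing_toc() are hypothetical names, the hash size is
passed in explicitly rather than taken from the_hash_algo, the
duplicate-ID check is omitted, and the sketch adds a guard against
walking past the start of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 4-byte id + 8-byte offset, the same shape as CHUNK_TOC_ENTRY_SIZE. */
#define TOC_ENTRY_SIZE 12

/* Hypothetical result record used only by this sketch. */
struct demo_toc_entry {
	uint32_t id;
	uint64_t offset;
	uint64_t size;
};

static uint32_t demo_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t demo_get_be64(const unsigned char *p)
{
	return ((uint64_t)demo_get_be32(p) << 32) | demo_get_be32(p + 4);
}

/*
 * Walk the table of contents backward from the trailing hash.
 * Returns the number of chunks found, or -1 if the table is
 * malformed or does not fit in `out`.
 */
static int demo_read_trailing_toc(const unsigned char *file,
				  size_t file_size, size_t hash_size,
				  struct demo_toc_entry *out, int max)
{
	size_t pos;
	int nr = 0;

	if (file_size < hash_size)
		return -1;
	pos = file_size - hash_size;

	for (;;) {
		uint32_t id;
		uint64_t offset;

		/* Extra guard in this sketch: never step before the buffer. */
		if (pos < TOC_ENTRY_SIZE)
			return -1;
		pos -= TOC_ENTRY_SIZE;

		id = demo_get_be32(file + pos);
		offset = demo_get_be64(file + pos + 4);

		/* Each new offset closes the previous chunk. */
		if (nr) {
			if (offset < out[nr - 1].offset || offset > pos)
				return -1;
			out[nr - 1].size = offset - out[nr - 1].offset;
		}

		/* The id-0 entry only supplies the last chunk's size. */
		if (!id)
			break;

		if (nr == max)
			return -1;
		out[nr].id = id;
		out[nr].offset = offset;
		out[nr].size = 0;
		nr++;
	}

	return nr;
}
```

Note that, as in the description above, the offset of the very first
entry read is not validated on its own; it is only compared with the
entries that follow it.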

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 chunk-format.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
 chunk-format.h |  9 +++++++++
 2 files changed, 62 insertions(+)

diff --git a/chunk-format.c b/chunk-format.c
index 3f5cc9b5ddf..e836a121c5c 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -173,6 +173,59 @@ int read_table_of_contents(struct chunkfile *cf,
 	return 0;
 }
 
+int read_trailing_table_of_contents(struct chunkfile *cf,
+				    const unsigned char *mfile,
+				    size_t mfile_size)
+{
+	int i;
+	uint32_t chunk_id;
+	const unsigned char *table_of_contents = mfile + mfile_size - the_hash_algo->rawsz;
+
+	while (1) {
+		uint64_t chunk_offset;
+
+		table_of_contents -= CHUNK_TOC_ENTRY_SIZE;
+
+		chunk_id = get_be32(table_of_contents);
+		chunk_offset = get_be64(table_of_contents + 4);
+
+		/* Calculate the previous chunk size, if it exists. */
+		if (cf->chunks_nr) {
+			off_t previous_offset = cf->chunks[cf->chunks_nr - 1].offset;
+
+			if (chunk_offset < previous_offset ||
+			    chunk_offset > table_of_contents - mfile) {
+				error(_("improper chunk offset(s) %"PRIx64" and %"PRIx64""),
+				previous_offset, chunk_offset);
+				return -1;
+			}
+
+			cf->chunks[cf->chunks_nr - 1].size = chunk_offset - previous_offset;
+		}
+
+		/* Stop at the null chunk. We only need it for the last size. */
+		if (!chunk_id)
+			break;
+
+		for (i = 0; i < cf->chunks_nr; i++) {
+			if (cf->chunks[i].id == chunk_id) {
+				error(_("duplicate chunk ID %"PRIx32" found"),
+					chunk_id);
+				return -1;
+			}
+		}
+
+		ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
+
+		cf->chunks[cf->chunks_nr].id = chunk_id;
+		cf->chunks[cf->chunks_nr].start = mfile + chunk_offset;
+		cf->chunks[cf->chunks_nr].offset = chunk_offset;
+		cf->chunks_nr++;
+	}
+
+	return 0;
+}
+
 static int pair_chunk_fn(const unsigned char *chunk_start,
 			 size_t chunk_size,
 			 void *data)
diff --git a/chunk-format.h b/chunk-format.h
index 39e8967e950..acb8dfbce80 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -46,6 +46,15 @@ int read_table_of_contents(struct chunkfile *cf,
 			   uint64_t toc_offset,
 			   int toc_length);
 
+/**
+ * Read the given chunkfile, but read the table of contents from the
+ * end of the given mfile. The file is expected to be a hashfile with
+ * the_hash_algo->rawsz bytes at the end storing the hash.
+ */
+int read_trailing_table_of_contents(struct chunkfile *cf,
+				    const unsigned char *mfile,
+				    size_t mfile_size);
+
 #define CHUNK_NOT_FOUND (-2)
 
 /*
-- 
gitgitgadget



* [PATCH 12/30] refs: extract packfile format to new file
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (10 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 11/30] chunk-format: parse " Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 13/30] packed-backend: extract add_write_error() Derrick Stolee via GitGitGadget
                   ` (20 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

In preparation for adding a new packed-refs file format, extract all
code from refs/packed-backend.c that involves knowledge of the plaintext
file format. This includes any parsing logic that cares about the
header, plaintext lines of the form "<oid> <ref>" or "^<peeled>", and
the error messages when there is an issue in the file. This also
includes the writing logic that writes the header or the individual
references.

Future changes will refactor further to make more of the writing
process generic, but this is enough code movement for a single change.
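
For readers unfamiliar with the plaintext record shape mentioned above
("<oid> <ref>" optionally followed by "^<peeled>"), here is a small
standalone sketch of splitting one record. It is not the extracted Git
code: demo_record and demo_parse_record() are hypothetical, DEMO_HEXSZ
assumes SHA-1 where Git uses the_hash_algo->hexsz, only a single space
is accepted as the separator, and the hex digits are not validated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption for this sketch: SHA-1 object IDs (40 hex digits). */
#define DEMO_HEXSZ 40

/* Hypothetical view into one record; pointers alias the input buffer. */
struct demo_record {
	const char *oid;     /* DEMO_HEXSZ hex digits, not NUL-terminated */
	const char *refname; /* not NUL-terminated, see refname_len */
	size_t refname_len;
	const char *peeled;  /* NULL when no "^<peeled>" line follows */
};

/*
 * Parse the record starting at `p`. Returns a pointer to the start
 * of the next record, or NULL if the record is malformed.
 */
static const char *demo_parse_record(const char *p, const char *eof,
				     struct demo_record *rec)
{
	const char *name, *eol;

	/* Need at least "<oid> X\n", with a space after the oid. */
	if ((size_t)(eof - p) < DEMO_HEXSZ + 2 || p[DEMO_HEXSZ] != ' ')
		return NULL;

	name = p + DEMO_HEXSZ + 1;
	eol = memchr(name, '\n', (size_t)(eof - name));
	if (!eol)
		return NULL;

	rec->oid = p;
	rec->refname = name;
	rec->refname_len = (size_t)(eol - name);
	rec->peeled = NULL;
	p = eol + 1;

	/* An optional peeled line belongs to the record before it. */
	if (p < eof && *p == '^') {
		if ((size_t)(eof - p) < DEMO_HEXSZ + 2 ||
		    p[DEMO_HEXSZ + 1] != '\n')
			return NULL;
		rec->peeled = p + 1;
		p += DEMO_HEXSZ + 2;
	}

	return p;
}
```

Keeping the peeled line attached to its reference is the reason the
record, rather than the line, is the unit used when sorting and
searching the file.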

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Makefile                |   1 +
 refs/packed-backend.c   | 595 ++--------------------------------------
 refs/packed-backend.h   | 195 +++++++++++++
 refs/packed-format-v1.c | 453 ++++++++++++++++++++++++++++++
 4 files changed, 667 insertions(+), 577 deletions(-)
 create mode 100644 refs/packed-format-v1.c

diff --git a/Makefile b/Makefile
index 4927379184c..3dc887941d4 100644
--- a/Makefile
+++ b/Makefile
@@ -1057,6 +1057,7 @@ LIB_OBJS += refs/debug.o
 LIB_OBJS += refs/files-backend.o
 LIB_OBJS += refs/iterator.o
 LIB_OBJS += refs/packed-backend.o
+LIB_OBJS += refs/packed-format-v1.o
 LIB_OBJS += refs/ref-cache.o
 LIB_OBJS += refspec.o
 LIB_OBJS += remote.o
diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index a4371b711b9..afaf6f53233 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -36,121 +36,6 @@ static enum mmap_strategy mmap_strategy = MMAP_TEMPORARY;
 static enum mmap_strategy mmap_strategy = MMAP_OK;
 #endif
 
-struct packed_ref_store;
-
-/*
- * A `snapshot` represents one snapshot of a `packed-refs` file.
- *
- * Normally, this will be a mmapped view of the contents of the
- * `packed-refs` file at the time the snapshot was created. However,
- * if the `packed-refs` file was not sorted, this might point at heap
- * memory holding the contents of the `packed-refs` file with its
- * records sorted by refname.
- *
- * `snapshot` instances are reference counted (via
- * `acquire_snapshot()` and `release_snapshot()`). This is to prevent
- * an instance from disappearing while an iterator is still iterating
- * over it. Instances are garbage collected when their `referrers`
- * count goes to zero.
- *
- * The most recent `snapshot`, if available, is referenced by the
- * `packed_ref_store`. Its freshness is checked whenever
- * `get_snapshot()` is called; if the existing snapshot is obsolete, a
- * new snapshot is taken.
- */
-struct snapshot {
-	/*
-	 * A back-pointer to the packed_ref_store with which this
-	 * snapshot is associated:
-	 */
-	struct packed_ref_store *refs;
-
-	/* Is the `packed-refs` file currently mmapped? */
-	int mmapped;
-
-	/*
-	 * The contents of the `packed-refs` file:
-	 *
-	 * - buf -- a pointer to the start of the memory
-	 * - start -- a pointer to the first byte of actual references
-	 *   (i.e., after the header line, if one is present)
-	 * - eof -- a pointer just past the end of the reference
-	 *   contents
-	 *
-	 * If the `packed-refs` file was already sorted, `buf` points
-	 * at the mmapped contents of the file. If not, it points at
-	 * heap-allocated memory containing the contents, sorted. If
-	 * there were no contents (e.g., because the file didn't
-	 * exist), `buf`, `start`, and `eof` are all NULL.
-	 */
-	char *buf, *start, *eof;
-
-	/*
-	 * What is the peeled state of the `packed-refs` file that
-	 * this snapshot represents? (This is usually determined from
-	 * the file's header.)
-	 */
-	enum { PEELED_NONE, PEELED_TAGS, PEELED_FULLY } peeled;
-
-	/*
-	 * Count of references to this instance, including the pointer
-	 * from `packed_ref_store::snapshot`, if any. The instance
-	 * will not be freed as long as the reference count is
-	 * nonzero.
-	 */
-	unsigned int referrers;
-
-	/*
-	 * The metadata of the `packed-refs` file from which this
-	 * snapshot was created, used to tell if the file has been
-	 * replaced since we read it.
-	 */
-	struct stat_validity validity;
-};
-
-/*
- * A `ref_store` representing references stored in a `packed-refs`
- * file. It implements the `ref_store` interface, though it has some
- * limitations:
- *
- * - It cannot store symbolic references.
- *
- * - It cannot store reflogs.
- *
- * - It does not support reference renaming (though it could).
- *
- * On the other hand, it can be locked outside of a reference
- * transaction. In that case, it remains locked even after the
- * transaction is done and the new `packed-refs` file is activated.
- */
-struct packed_ref_store {
-	struct ref_store base;
-
-	unsigned int store_flags;
-
-	/* The path of the "packed-refs" file: */
-	char *path;
-
-	/*
-	 * A snapshot of the values read from the `packed-refs` file,
-	 * if it might still be current; otherwise, NULL.
-	 */
-	struct snapshot *snapshot;
-
-	/*
-	 * Lock used for the "packed-refs" file. Note that this (and
-	 * thus the enclosing `packed_ref_store`) must not be freed.
-	 */
-	struct lock_file lock;
-
-	/*
-	 * Temporary file used when rewriting new contents to the
-	 * "packed-refs" file. Note that this (and thus the enclosing
-	 * `packed_ref_store`) must not be freed.
-	 */
-	struct tempfile *tempfile;
-};
-
 /*
  * Increment the reference count of `*snapshot`.
  */
@@ -164,7 +49,7 @@ static void acquire_snapshot(struct snapshot *snapshot)
  * memory and close the file, or free the memory. Then set the buffer
  * pointers to NULL.
  */
-static void clear_snapshot_buffer(struct snapshot *snapshot)
+void clear_snapshot_buffer(struct snapshot *snapshot)
 {
 	if (snapshot->mmapped) {
 		if (munmap(snapshot->buf, snapshot->eof - snapshot->buf))
@@ -245,224 +130,6 @@ static void clear_snapshot(struct packed_ref_store *refs)
 	}
 }
 
-static NORETURN void die_unterminated_line(const char *path,
-					   const char *p, size_t len)
-{
-	if (len < 80)
-		die("unterminated line in %s: %.*s", path, (int)len, p);
-	else
-		die("unterminated line in %s: %.75s...", path, p);
-}
-
-static NORETURN void die_invalid_line(const char *path,
-				      const char *p, size_t len)
-{
-	const char *eol = memchr(p, '\n', len);
-
-	if (!eol)
-		die_unterminated_line(path, p, len);
-	else if (eol - p < 80)
-		die("unexpected line in %s: %.*s", path, (int)(eol - p), p);
-	else
-		die("unexpected line in %s: %.75s...", path, p);
-
-}
-
-struct snapshot_record {
-	const char *start;
-	size_t len;
-};
-
-static int cmp_packed_ref_records(const void *v1, const void *v2)
-{
-	const struct snapshot_record *e1 = v1, *e2 = v2;
-	const char *r1 = e1->start + the_hash_algo->hexsz + 1;
-	const char *r2 = e2->start + the_hash_algo->hexsz + 1;
-
-	while (1) {
-		if (*r1 == '\n')
-			return *r2 == '\n' ? 0 : -1;
-		if (*r1 != *r2) {
-			if (*r2 == '\n')
-				return 1;
-			else
-				return (unsigned char)*r1 < (unsigned char)*r2 ? -1 : +1;
-		}
-		r1++;
-		r2++;
-	}
-}
-
-/*
- * Compare a snapshot record at `rec` to the specified NUL-terminated
- * refname.
- */
-static int cmp_record_to_refname(const char *rec, const char *refname)
-{
-	const char *r1 = rec + the_hash_algo->hexsz + 1;
-	const char *r2 = refname;
-
-	while (1) {
-		if (*r1 == '\n')
-			return *r2 ? -1 : 0;
-		if (!*r2)
-			return 1;
-		if (*r1 != *r2)
-			return (unsigned char)*r1 < (unsigned char)*r2 ? -1 : +1;
-		r1++;
-		r2++;
-	}
-}
-
-/*
- * `snapshot->buf` is not known to be sorted. Check whether it is, and
- * if not, sort it into new memory and munmap/free the old storage.
- */
-static void sort_snapshot(struct snapshot *snapshot)
-{
-	struct snapshot_record *records = NULL;
-	size_t alloc = 0, nr = 0;
-	int sorted = 1;
-	const char *pos, *eof, *eol;
-	size_t len, i;
-	char *new_buffer, *dst;
-
-	pos = snapshot->start;
-	eof = snapshot->eof;
-
-	if (pos == eof)
-		return;
-
-	len = eof - pos;
-
-	/*
-	 * Initialize records based on a crude estimate of the number
-	 * of references in the file (we'll grow it below if needed):
-	 */
-	ALLOC_GROW(records, len / 80 + 20, alloc);
-
-	while (pos < eof) {
-		eol = memchr(pos, '\n', eof - pos);
-		if (!eol)
-			/* The safety check should prevent this. */
-			BUG("unterminated line found in packed-refs");
-		if (eol - pos < the_hash_algo->hexsz + 2)
-			die_invalid_line(snapshot->refs->path,
-					 pos, eof - pos);
-		eol++;
-		if (eol < eof && *eol == '^') {
-			/*
-			 * Keep any peeled line together with its
-			 * reference:
-			 */
-			const char *peeled_start = eol;
-
-			eol = memchr(peeled_start, '\n', eof - peeled_start);
-			if (!eol)
-				/* The safety check should prevent this. */
-				BUG("unterminated peeled line found in packed-refs");
-			eol++;
-		}
-
-		ALLOC_GROW(records, nr + 1, alloc);
-		records[nr].start = pos;
-		records[nr].len = eol - pos;
-		nr++;
-
-		if (sorted &&
-		    nr > 1 &&
-		    cmp_packed_ref_records(&records[nr - 2],
-					   &records[nr - 1]) >= 0)
-			sorted = 0;
-
-		pos = eol;
-	}
-
-	if (sorted)
-		goto cleanup;
-
-	/* We need to sort the memory. First we sort the records array: */
-	QSORT(records, nr, cmp_packed_ref_records);
-
-	/*
-	 * Allocate a new chunk of memory, and copy the old memory to
-	 * the new in the order indicated by `records` (not bothering
-	 * with the header line):
-	 */
-	new_buffer = xmalloc(len);
-	for (dst = new_buffer, i = 0; i < nr; i++) {
-		memcpy(dst, records[i].start, records[i].len);
-		dst += records[i].len;
-	}
-
-	/*
-	 * Now munmap the old buffer and use the sorted buffer in its
-	 * place:
-	 */
-	clear_snapshot_buffer(snapshot);
-	snapshot->buf = snapshot->start = new_buffer;
-	snapshot->eof = new_buffer + len;
-
-cleanup:
-	free(records);
-}
-
-/*
- * Return a pointer to the start of the record that contains the
- * character `*p` (which must be within the buffer). If no other
- * record start is found, return `buf`.
- */
-static const char *find_start_of_record(const char *buf, const char *p)
-{
-	while (p > buf && (p[-1] != '\n' || p[0] == '^'))
-		p--;
-	return p;
-}
-
-/*
- * Return a pointer to the start of the record following the record
- * that contains `*p`. If none is found before `end`, return `end`.
- */
-static const char *find_end_of_record(const char *p, const char *end)
-{
-	while (++p < end && (p[-1] != '\n' || p[0] == '^'))
-		;
-	return p;
-}
-
-/*
- * We want to be able to compare mmapped reference records quickly,
- * without totally parsing them. We can do so because the records are
- * LF-terminated, and the refname should start exactly (GIT_SHA1_HEXSZ
- * + 1) bytes past the beginning of the record.
- *
- * But what if the `packed-refs` file contains garbage? We're willing
- * to tolerate not detecting the problem, as long as we don't produce
- * totally garbled output (we can't afford to check the integrity of
- * the whole file during every Git invocation). But we do want to be
- * sure that we never read past the end of the buffer in memory and
- * perform an illegal memory access.
- *
- * Guarantee that minimum level of safety by verifying that the last
- * record in the file is LF-terminated, and that it has at least
- * (GIT_SHA1_HEXSZ + 1) characters before the LF. Die if either of
- * these checks fails.
- */
-static void verify_buffer_safe(struct snapshot *snapshot)
-{
-	const char *start = snapshot->start;
-	const char *eof = snapshot->eof;
-	const char *last_line;
-
-	if (start == eof)
-		return;
-
-	last_line = find_start_of_record(start, eof - 1);
-	if (*(eof - 1) != '\n' || eof - last_line < the_hash_algo->hexsz + 2)
-		die_invalid_line(snapshot->refs->path,
-				 last_line, eof - last_line);
-}
-
 #define SMALL_FILE_SIZE (32*1024)
 
 /*
@@ -524,67 +191,6 @@ static int load_contents(struct snapshot *snapshot)
 	return 1;
 }
 
-/*
- * Find the place in `snapshot->buf` where the start of the record for
- * `refname` starts. If `mustexist` is true and the reference doesn't
- * exist, then return NULL. If `mustexist` is false and the reference
- * doesn't exist, then return the point where that reference would be
- * inserted, or `snapshot->eof` (which might be NULL) if it would be
- * inserted at the end of the file. In the latter mode, `refname`
- * doesn't have to be a proper reference name; for example, one could
- * search for "refs/replace/" to find the start of any replace
- * references.
- *
- * The record is sought using a binary search, so `snapshot->buf` must
- * be sorted.
- */
-static const char *find_reference_location(struct snapshot *snapshot,
-					   const char *refname, int mustexist)
-{
-	/*
-	 * This is not *quite* a garden-variety binary search, because
-	 * the data we're searching is made up of records, and we
-	 * always need to find the beginning of a record to do a
-	 * comparison. A "record" here is one line for the reference
-	 * itself and zero or one peel lines that start with '^'. Our
-	 * loop invariant is described in the next two comments.
-	 */
-
-	/*
-	 * A pointer to the character at the start of a record whose
-	 * preceding records all have reference names that come
-	 * *before* `refname`.
-	 */
-	const char *lo = snapshot->start;
-
-	/*
-	 * A pointer to a the first character of a record whose
-	 * reference name comes *after* `refname`.
-	 */
-	const char *hi = snapshot->eof;
-
-	while (lo != hi) {
-		const char *mid, *rec;
-		int cmp;
-
-		mid = lo + (hi - lo) / 2;
-		rec = find_start_of_record(lo, mid);
-		cmp = cmp_record_to_refname(rec, refname);
-		if (cmp < 0) {
-			lo = find_end_of_record(mid, hi);
-		} else if (cmp > 0) {
-			hi = rec;
-		} else {
-			return rec;
-		}
-	}
-
-	if (mustexist)
-		return NULL;
-	else
-		return lo;
-}
-
 /*
  * Create a newly-allocated `snapshot` of the `packed-refs` file in
  * its current state and return it. The return value will already have
@@ -630,54 +236,22 @@ static struct snapshot *create_snapshot(struct packed_ref_store *refs)
 	if (!load_contents(snapshot))
 		return snapshot;
 
-	/* If the file has a header line, process it: */
-	if (snapshot->buf < snapshot->eof && *snapshot->buf == '#') {
-		char *tmp, *p, *eol;
-		struct string_list traits = STRING_LIST_INIT_NODUP;
-
-		eol = memchr(snapshot->buf, '\n',
-			     snapshot->eof - snapshot->buf);
-		if (!eol)
-			die_unterminated_line(refs->path,
-					      snapshot->buf,
-					      snapshot->eof - snapshot->buf);
-
-		tmp = xmemdupz(snapshot->buf, eol - snapshot->buf);
-
-		if (!skip_prefix(tmp, "# pack-refs with:", (const char **)&p))
-			die_invalid_line(refs->path,
-					 snapshot->buf,
-					 snapshot->eof - snapshot->buf);
-
-		string_list_split_in_place(&traits, p, ' ', -1);
-
-		if (unsorted_string_list_has_string(&traits, "fully-peeled"))
-			snapshot->peeled = PEELED_FULLY;
-		else if (unsorted_string_list_has_string(&traits, "peeled"))
-			snapshot->peeled = PEELED_TAGS;
-
-		sorted = unsorted_string_list_has_string(&traits, "sorted");
-
-		/* perhaps other traits later as well */
-
-		/* The "+ 1" is for the LF character. */
-		snapshot->start = eol + 1;
-
-		string_list_clear(&traits, 0);
-		free(tmp);
+	if (parse_packed_format_v1_header(refs, snapshot, &sorted)) {
+		clear_snapshot(refs);
+		return NULL;
 	}
 
-	verify_buffer_safe(snapshot);
+	verify_buffer_safe_v1(snapshot);
 
 	if (!sorted) {
-		sort_snapshot(snapshot);
+		sort_snapshot_v1(snapshot);
 
 		/*
 		 * Reordering the records might have moved a short one
 		 * to the end of the buffer, so verify the buffer's
 		 * safety again:
 		 */
-		verify_buffer_safe(snapshot);
+		verify_buffer_safe_v1(snapshot);
 	}
 
 	if (mmap_strategy != MMAP_OK && snapshot->mmapped) {
@@ -735,55 +309,11 @@ static int packed_read_raw_ref(struct ref_store *ref_store, const char *refname,
 	struct packed_ref_store *refs =
 		packed_downcast(ref_store, REF_STORE_READ, "read_raw_ref");
 	struct snapshot *snapshot = get_snapshot(refs);
-	const char *rec;
-
-	*type = 0;
 
-	rec = find_reference_location(snapshot, refname, 1);
-
-	if (!rec) {
-		/* refname is not a packed reference. */
-		*failure_errno = ENOENT;
-		return -1;
-	}
-
-	if (get_oid_hex(rec, oid))
-		die_invalid_line(refs->path, rec, snapshot->eof - rec);
-
-	*type = REF_ISPACKED;
-	return 0;
+	return packed_read_raw_ref_v1(refs, snapshot, refname,
+				      oid, type, failure_errno);
 }
 
-/*
- * This value is set in `base.flags` if the peeled value of the
- * current reference is known. In that case, `peeled` contains the
- * correct peeled value for the reference, which might be `null_oid`
- * if the reference is not a tag or if it is broken.
- */
-#define REF_KNOWS_PEELED 0x40
-
-/*
- * An iterator over a snapshot of a `packed-refs` file.
- */
-struct packed_ref_iterator {
-	struct ref_iterator base;
-
-	struct snapshot *snapshot;
-
-	/* The current position in the snapshot's buffer: */
-	const char *pos;
-
-	/* The end of the part of the buffer that will be iterated over: */
-	const char *eof;
-
-	/* Scratch space for current values: */
-	struct object_id oid, peeled;
-	struct strbuf refname_buf;
-
-	struct repository *repo;
-	unsigned int flags;
-};
-
 /*
  * Move the iterator to the next record in the snapshot, without
  * respect for whether the record is actually required by the current
@@ -793,68 +323,7 @@ struct packed_ref_iterator {
  */
 static int next_record(struct packed_ref_iterator *iter)
 {
-	const char *p = iter->pos, *eol;
-
-	strbuf_reset(&iter->refname_buf);
-
-	if (iter->pos == iter->eof)
-		return ITER_DONE;
-
-	iter->base.flags = REF_ISPACKED;
-
-	if (iter->eof - p < the_hash_algo->hexsz + 2 ||
-	    parse_oid_hex(p, &iter->oid, &p) ||
-	    !isspace(*p++))
-		die_invalid_line(iter->snapshot->refs->path,
-				 iter->pos, iter->eof - iter->pos);
-
-	eol = memchr(p, '\n', iter->eof - p);
-	if (!eol)
-		die_unterminated_line(iter->snapshot->refs->path,
-				      iter->pos, iter->eof - iter->pos);
-
-	strbuf_add(&iter->refname_buf, p, eol - p);
-	iter->base.refname = iter->refname_buf.buf;
-
-	if (check_refname_format(iter->base.refname, REFNAME_ALLOW_ONELEVEL)) {
-		if (!refname_is_safe(iter->base.refname))
-			die("packed refname is dangerous: %s",
-			    iter->base.refname);
-		oidclr(&iter->oid);
-		iter->base.flags |= REF_BAD_NAME | REF_ISBROKEN;
-	}
-	if (iter->snapshot->peeled == PEELED_FULLY ||
-	    (iter->snapshot->peeled == PEELED_TAGS &&
-	     starts_with(iter->base.refname, "refs/tags/")))
-		iter->base.flags |= REF_KNOWS_PEELED;
-
-	iter->pos = eol + 1;
-
-	if (iter->pos < iter->eof && *iter->pos == '^') {
-		p = iter->pos + 1;
-		if (iter->eof - p < the_hash_algo->hexsz + 1 ||
-		    parse_oid_hex(p, &iter->peeled, &p) ||
-		    *p++ != '\n')
-			die_invalid_line(iter->snapshot->refs->path,
-					 iter->pos, iter->eof - iter->pos);
-		iter->pos = p;
-
-		/*
-		 * Regardless of what the file header said, we
-		 * definitely know the value of *this* reference. But
-		 * we suppress it if the reference is broken:
-		 */
-		if ((iter->base.flags & REF_ISBROKEN)) {
-			oidclr(&iter->peeled);
-			iter->base.flags &= ~REF_KNOWS_PEELED;
-		} else {
-			iter->base.flags |= REF_KNOWS_PEELED;
-		}
-	} else {
-		oidclr(&iter->peeled);
-	}
-
-	return ITER_OK;
+	return next_record_v1(iter);
 }
 
 static int packed_ref_iterator_advance(struct ref_iterator *ref_iterator)
@@ -942,7 +411,7 @@ static struct ref_iterator *packed_ref_iterator_begin(
 	snapshot = get_snapshot(refs);
 
 	if (prefix && *prefix)
-		start = find_reference_location(snapshot, prefix, 0);
+		start = find_reference_location_v1(snapshot, prefix, 0);
 	else
 		start = snapshot->start;
 
@@ -972,23 +441,6 @@ static struct ref_iterator *packed_ref_iterator_begin(
 	return ref_iterator;
 }
 
-/*
- * Write an entry to the packed-refs file for the specified refname.
- * If peeled is non-NULL, write it as the entry's peeled value. On
- * error, return a nonzero value and leave errno set at the value left
- * by the failing call to `fprintf()`.
- */
-static int write_packed_entry(FILE *fh, const char *refname,
-			      const struct object_id *oid,
-			      const struct object_id *peeled)
-{
-	if (fprintf(fh, "%s %s\n", oid_to_hex(oid), refname) < 0 ||
-	    (peeled && fprintf(fh, "^%s\n", oid_to_hex(peeled)) < 0))
-		return -1;
-
-	return 0;
-}
-
 int packed_refs_lock(struct ref_store *ref_store, int flags, struct strbuf *err)
 {
 	struct packed_ref_store *refs =
@@ -1070,17 +522,6 @@ int packed_refs_is_locked(struct ref_store *ref_store)
 	return is_lock_file_locked(&refs->lock);
 }
 
-/*
- * The packed-refs header line that we write out. Perhaps other traits
- * will be added later.
- *
- * Note that earlier versions of Git used to parse these traits by
- * looking for " trait " in the line. For this reason, the space after
- * the colon and the trailing space are required.
- */
-static const char PACKED_REFS_HEADER[] =
-	"# pack-refs with: peeled fully-peeled sorted \n";
-
 static int packed_init_db(struct ref_store *ref_store UNUSED,
 			  struct strbuf *err UNUSED)
 {
@@ -1136,7 +577,7 @@ static int write_with_updates(struct packed_ref_store *refs,
 		goto error;
 	}
 
-	if (fprintf(out, "%s", PACKED_REFS_HEADER) < 0)
+	if (write_packed_file_header_v1(out) < 0)
 		goto write_error;
 
 	/*
@@ -1230,9 +671,9 @@ static int write_with_updates(struct packed_ref_store *refs,
 			struct object_id peeled;
 			int peel_error = ref_iterator_peel(iter, &peeled);
 
-			if (write_packed_entry(out, iter->refname,
-					       iter->oid,
-					       peel_error ? NULL : &peeled))
+			if (write_packed_entry_v1(out, iter->refname,
+						  iter->oid,
+						  peel_error ? NULL : &peeled))
 				goto write_error;
 
 			if ((ok = ref_iterator_advance(iter)) != ITER_OK)
@@ -1251,9 +692,9 @@ static int write_with_updates(struct packed_ref_store *refs,
 			int peel_error = peel_object(&update->new_oid,
 						     &peeled);
 
-			if (write_packed_entry(out, update->refname,
-					       &update->new_oid,
-					       peel_error ? NULL : &peeled))
+			if (write_packed_entry_v1(out, update->refname,
+						  &update->new_oid,
+						  peel_error ? NULL : &peeled))
 				goto write_error;
 
 			i++;
diff --git a/refs/packed-backend.h b/refs/packed-backend.h
index 9dd8a344c34..143ed6d4f6c 100644
--- a/refs/packed-backend.h
+++ b/refs/packed-backend.h
@@ -1,6 +1,10 @@
 #ifndef REFS_PACKED_BACKEND_H
 #define REFS_PACKED_BACKEND_H
 
+#include "../cache.h"
+#include "refs-internal.h"
+#include "../lockfile.h"
+
 struct repository;
 struct ref_transaction;
 
@@ -36,4 +40,195 @@ int packed_refs_is_locked(struct ref_store *ref_store);
 int is_packed_transaction_needed(struct ref_store *ref_store,
 				 struct ref_transaction *transaction);
 
+struct packed_ref_store;
+
+/*
+ * A `snapshot` represents one snapshot of a `packed-refs` file.
+ *
+ * Normally, this will be a mmapped view of the contents of the
+ * `packed-refs` file at the time the snapshot was created. However,
+ * if the `packed-refs` file was not sorted, this might point at heap
+ * memory holding the contents of the `packed-refs` file with its
+ * records sorted by refname.
+ *
+ * `snapshot` instances are reference counted (via
+ * `acquire_snapshot()` and `release_snapshot()`). This is to prevent
+ * an instance from disappearing while an iterator is still iterating
+ * over it. Instances are garbage collected when their `referrers`
+ * count goes to zero.
+ *
+ * The most recent `snapshot`, if available, is referenced by the
+ * `packed_ref_store`. Its freshness is checked whenever
+ * `get_snapshot()` is called; if the existing snapshot is obsolete, a
+ * new snapshot is taken.
+ */
+struct snapshot {
+	/*
+	 * A back-pointer to the packed_ref_store with which this
+	 * snapshot is associated:
+	 */
+	struct packed_ref_store *refs;
+
+	/* Is the `packed-refs` file currently mmapped? */
+	int mmapped;
+
+	/*
+	 * The contents of the `packed-refs` file:
+	 *
+	 * - buf -- a pointer to the start of the memory
+	 * - start -- a pointer to the first byte of actual references
+	 *   (i.e., after the header line, if one is present)
+	 * - eof -- a pointer just past the end of the reference
+	 *   contents
+	 *
+	 * If the `packed-refs` file was already sorted, `buf` points
+	 * at the mmapped contents of the file. If not, it points at
+	 * heap-allocated memory containing the contents, sorted. If
+	 * there were no contents (e.g., because the file didn't
+	 * exist), `buf`, `start`, and `eof` are all NULL.
+	 */
+	char *buf, *start, *eof;
+
+	/*
+	 * What is the peeled state of the `packed-refs` file that
+	 * this snapshot represents? (This is usually determined from
+	 * the file's header.)
+	 */
+	enum { PEELED_NONE, PEELED_TAGS, PEELED_FULLY } peeled;
+
+	/*
+	 * Count of references to this instance, including the pointer
+	 * from `packed_ref_store::snapshot`, if any. The instance
+	 * will not be freed as long as the reference count is
+	 * nonzero.
+	 */
+	unsigned int referrers;
+
+	/*
+	 * The metadata of the `packed-refs` file from which this
+	 * snapshot was created, used to tell if the file has been
+	 * replaced since we read it.
+	 */
+	struct stat_validity validity;
+};
+
+/*
+ * If the buffer in `snapshot` is active, then either munmap the
+ * memory and close the file, or free the memory. Then set the buffer
+ * pointers to NULL.
+ */
+void clear_snapshot_buffer(struct snapshot *snapshot);
+
+/*
+ * A `ref_store` representing references stored in a `packed-refs`
+ * file. It implements the `ref_store` interface, though it has some
+ * limitations:
+ *
+ * - It cannot store symbolic references.
+ *
+ * - It cannot store reflogs.
+ *
+ * - It does not support reference renaming (though it could).
+ *
+ * On the other hand, it can be locked outside of a reference
+ * transaction. In that case, it remains locked even after the
+ * transaction is done and the new `packed-refs` file is activated.
+ */
+struct packed_ref_store {
+	struct ref_store base;
+
+	unsigned int store_flags;
+
+	/* The path of the "packed-refs" file: */
+	char *path;
+
+	/*
+	 * A snapshot of the values read from the `packed-refs` file,
+	 * if it might still be current; otherwise, NULL.
+	 */
+	struct snapshot *snapshot;
+
+	/*
+	 * Lock used for the "packed-refs" file. Note that this (and
+	 * thus the enclosing `packed_ref_store`) must not be freed.
+	 */
+	struct lock_file lock;
+
+	/*
+	 * Temporary file used when rewriting new contents to the
+	 * "packed-refs" file. Note that this (and thus the enclosing
+	 * `packed_ref_store`) must not be freed.
+	 */
+	struct tempfile *tempfile;
+};
+
+/*
+ * This value is set in `base.flags` if the peeled value of the
+ * current reference is known. In that case, `peeled` contains the
+ * correct peeled value for the reference, which might be `null_oid`
+ * if the reference is not a tag or if it is broken.
+ */
+#define REF_KNOWS_PEELED 0x40
+
+/*
+ * An iterator over a snapshot of a `packed-refs` file.
+ */
+struct packed_ref_iterator {
+	struct ref_iterator base;
+
+	struct snapshot *snapshot;
+
+	/* The current position in the snapshot's buffer: */
+	const char *pos;
+
+	/* The end of the part of the buffer that will be iterated over: */
+	const char *eof;
+
+	/* Scratch space for current values: */
+	struct object_id oid, peeled;
+	struct strbuf refname_buf;
+
+	struct repository *repo;
+	unsigned int flags;
+};
+
+/**
+ * Parse the buffer at the given snapshot to verify that it is a
+ * packed-refs file in version 1 format. Update the snapshot->peeled
+ * value according to the header information. Update the given
+ * 'sorted' value with whether or not the packed-refs file is sorted.
+ */
+int parse_packed_format_v1_header(struct packed_ref_store *refs,
+				  struct snapshot *snapshot,
+				  int *sorted);
+
+/*
+ * Find the place in `snapshot->buf` where the start of the record for
+ * `refname` starts. If `mustexist` is true and the reference doesn't
+ * exist, then return NULL. If `mustexist` is false and the reference
+ * doesn't exist, then return the point where that reference would be
+ * inserted, or `snapshot->eof` (which might be NULL) if it would be
+ * inserted at the end of the file. In the latter mode, `refname`
+ * doesn't have to be a proper reference name; for example, one could
+ * search for "refs/replace/" to find the start of any replace
+ * references.
+ *
+ * The record is sought using a binary search, so `snapshot->buf` must
+ * be sorted.
+ */
+const char *find_reference_location_v1(struct snapshot *snapshot,
+				       const char *refname, int mustexist);
+
+int packed_read_raw_ref_v1(struct packed_ref_store *refs, struct snapshot *snapshot,
+			   const char *refname, struct object_id *oid,
+			   unsigned int *type, int *failure_errno);
+
+void verify_buffer_safe_v1(struct snapshot *snapshot);
+void sort_snapshot_v1(struct snapshot *snapshot);
+int write_packed_file_header_v1(FILE *out);
+int next_record_v1(struct packed_ref_iterator *iter);
+int write_packed_entry_v1(FILE *fh, const char *refname,
+			  const struct object_id *oid,
+			  const struct object_id *peeled);
+
 #endif /* REFS_PACKED_BACKEND_H */
diff --git a/refs/packed-format-v1.c b/refs/packed-format-v1.c
new file mode 100644
index 00000000000..ef9e6618c89
--- /dev/null
+++ b/refs/packed-format-v1.c
@@ -0,0 +1,453 @@
+#include "../cache.h"
+#include "../config.h"
+#include "../refs.h"
+#include "refs-internal.h"
+#include "packed-backend.h"
+#include "../iterator.h"
+#include "../lockfile.h"
+#include "../chdir-notify.h"
+
+static NORETURN void die_unterminated_line(const char *path,
+					   const char *p, size_t len)
+{
+	if (len < 80)
+		die("unterminated line in %s: %.*s", path, (int)len, p);
+	else
+		die("unterminated line in %s: %.75s...", path, p);
+}
+
+static NORETURN void die_invalid_line(const char *path,
+				      const char *p, size_t len)
+{
+	const char *eol = memchr(p, '\n', len);
+
+	if (!eol)
+		die_unterminated_line(path, p, len);
+	else if (eol - p < 80)
+		die("unexpected line in %s: %.*s", path, (int)(eol - p), p);
+	else
+		die("unexpected line in %s: %.75s...", path, p);
+}
+
+struct snapshot_record {
+	const char *start;
+	size_t len;
+};
+
+static int cmp_packed_ref_records(const void *v1, const void *v2)
+{
+	const struct snapshot_record *e1 = v1, *e2 = v2;
+	const char *r1 = e1->start + the_hash_algo->hexsz + 1;
+	const char *r2 = e2->start + the_hash_algo->hexsz + 1;
+
+	while (1) {
+		if (*r1 == '\n')
+			return *r2 == '\n' ? 0 : -1;
+		if (*r1 != *r2) {
+			if (*r2 == '\n')
+				return 1;
+			else
+				return (unsigned char)*r1 < (unsigned char)*r2 ? -1 : +1;
+		}
+		r1++;
+		r2++;
+	}
+}
+
+/*
+ * Compare a snapshot record at `rec` to the specified NUL-terminated
+ * refname.
+ */
+static int cmp_record_to_refname(const char *rec, const char *refname)
+{
+	const char *r1 = rec + the_hash_algo->hexsz + 1;
+	const char *r2 = refname;
+
+	while (1) {
+		if (*r1 == '\n')
+			return *r2 ? -1 : 0;
+		if (!*r2)
+			return 1;
+		if (*r1 != *r2)
+			return (unsigned char)*r1 < (unsigned char)*r2 ? -1 : +1;
+		r1++;
+		r2++;
+	}
+}
+
+/*
+ * `snapshot->buf` is not known to be sorted. Check whether it is, and
+ * if not, sort it into new memory and munmap/free the old storage.
+ */
+void sort_snapshot_v1(struct snapshot *snapshot)
+{
+	struct snapshot_record *records = NULL;
+	size_t alloc = 0, nr = 0;
+	int sorted = 1;
+	const char *pos, *eof, *eol;
+	size_t len, i;
+	char *new_buffer, *dst;
+
+	pos = snapshot->start;
+	eof = snapshot->eof;
+
+	if (pos == eof)
+		return;
+
+	len = eof - pos;
+
+	/*
+	 * Initialize records based on a crude estimate of the number
+	 * of references in the file (we'll grow it below if needed):
+	 */
+	ALLOC_GROW(records, len / 80 + 20, alloc);
+
+	while (pos < eof) {
+		eol = memchr(pos, '\n', eof - pos);
+		if (!eol)
+			/* The safety check should prevent this. */
+			BUG("unterminated line found in packed-refs");
+		if (eol - pos < the_hash_algo->hexsz + 2)
+			die_invalid_line(snapshot->refs->path,
+					 pos, eof - pos);
+		eol++;
+		if (eol < eof && *eol == '^') {
+			/*
+			 * Keep any peeled line together with its
+			 * reference:
+			 */
+			const char *peeled_start = eol;
+
+			eol = memchr(peeled_start, '\n', eof - peeled_start);
+			if (!eol)
+				/* The safety check should prevent this. */
+				BUG("unterminated peeled line found in packed-refs");
+			eol++;
+		}
+
+		ALLOC_GROW(records, nr + 1, alloc);
+		records[nr].start = pos;
+		records[nr].len = eol - pos;
+		nr++;
+
+		if (sorted &&
+		    nr > 1 &&
+		    cmp_packed_ref_records(&records[nr - 2],
+					   &records[nr - 1]) >= 0)
+			sorted = 0;
+
+		pos = eol;
+	}
+
+	if (sorted)
+		goto cleanup;
+
+	/* We need to sort the memory. First we sort the records array: */
+	QSORT(records, nr, cmp_packed_ref_records);
+
+	/*
+	 * Allocate a new chunk of memory, and copy the old memory to
+	 * the new in the order indicated by `records` (not bothering
+	 * with the header line):
+	 */
+	new_buffer = xmalloc(len);
+	for (dst = new_buffer, i = 0; i < nr; i++) {
+		memcpy(dst, records[i].start, records[i].len);
+		dst += records[i].len;
+	}
+
+	/*
+	 * Now munmap the old buffer and use the sorted buffer in its
+	 * place:
+	 */
+	clear_snapshot_buffer(snapshot);
+	snapshot->buf = snapshot->start = new_buffer;
+	snapshot->eof = new_buffer + len;
+
+cleanup:
+	free(records);
+}
+
+/*
+ * Return a pointer to the start of the record that contains the
+ * character `*p` (which must be within the buffer). If no other
+ * record start is found, return `buf`.
+ */
+static const char *find_start_of_record(const char *buf, const char *p)
+{
+	while (p > buf && (p[-1] != '\n' || p[0] == '^'))
+		p--;
+	return p;
+}
+
+/*
+ * Return a pointer to the start of the record following the record
+ * that contains `*p`. If none is found before `end`, return `end`.
+ */
+static const char *find_end_of_record(const char *p, const char *end)
+{
+	while (++p < end && (p[-1] != '\n' || p[0] == '^'))
+		;
+	return p;
+}
+
+/*
+ * We want to be able to compare mmapped reference records quickly,
+ * without totally parsing them. We can do so because the records are
+ * LF-terminated, and the refname should start exactly (GIT_SHA1_HEXSZ
+ * + 1) bytes past the beginning of the record.
+ *
+ * But what if the `packed-refs` file contains garbage? We're willing
+ * to tolerate not detecting the problem, as long as we don't produce
+ * totally garbled output (we can't afford to check the integrity of
+ * the whole file during every Git invocation). But we do want to be
+ * sure that we never read past the end of the buffer in memory and
+ * perform an illegal memory access.
+ *
+ * Guarantee that minimum level of safety by verifying that the last
+ * record in the file is LF-terminated, and that it has at least
+ * (GIT_SHA1_HEXSZ + 1) characters before the LF. Die if either of
+ * these checks fails.
+ */
+void verify_buffer_safe_v1(struct snapshot *snapshot)
+{
+	const char *start = snapshot->start;
+	const char *eof = snapshot->eof;
+	const char *last_line;
+
+	if (start == eof)
+		return;
+
+	last_line = find_start_of_record(start, eof - 1);
+	if (*(eof - 1) != '\n' || eof - last_line < the_hash_algo->hexsz + 2)
+		die_invalid_line(snapshot->refs->path,
+				 last_line, eof - last_line);
+}
+
+/*
+ * Find the place in `snapshot->buf` where the start of the record for
+ * `refname` starts. If `mustexist` is true and the reference doesn't
+ * exist, then return NULL. If `mustexist` is false and the reference
+ * doesn't exist, then return the point where that reference would be
+ * inserted, or `snapshot->eof` (which might be NULL) if it would be
+ * inserted at the end of the file. In the latter mode, `refname`
+ * doesn't have to be a proper reference name; for example, one could
+ * search for "refs/replace/" to find the start of any replace
+ * references.
+ *
+ * The record is sought using a binary search, so `snapshot->buf` must
+ * be sorted.
+ */
+const char *find_reference_location_v1(struct snapshot *snapshot,
+				       const char *refname, int mustexist)
+{
+	/*
+	 * This is not *quite* a garden-variety binary search, because
+	 * the data we're searching is made up of records, and we
+	 * always need to find the beginning of a record to do a
+	 * comparison. A "record" here is one line for the reference
+	 * itself and zero or one peel lines that start with '^'. Our
+	 * loop invariant is described in the next two comments.
+	 */
+
+	/*
+	 * A pointer to the character at the start of a record whose
+	 * preceding records all have reference names that come
+	 * *before* `refname`.
+	 */
+	const char *lo = snapshot->start;
+
+	/*
+	 * A pointer to the first character of a record whose
+	 * reference name comes *after* `refname`.
+	 */
+	const char *hi = snapshot->eof;
+
+	while (lo != hi) {
+		const char *mid, *rec;
+		int cmp;
+
+		mid = lo + (hi - lo) / 2;
+		rec = find_start_of_record(lo, mid);
+		cmp = cmp_record_to_refname(rec, refname);
+		if (cmp < 0) {
+			lo = find_end_of_record(mid, hi);
+		} else if (cmp > 0) {
+			hi = rec;
+		} else {
+			return rec;
+		}
+	}
+
+	if (mustexist)
+		return NULL;
+	else
+		return lo;
+}
+
+int parse_packed_format_v1_header(struct packed_ref_store *refs,
+				  struct snapshot *snapshot,
+				  int *sorted)
+{
+	*sorted = 0;
+	/* If the file has a header line, process it: */
+	if (snapshot->buf < snapshot->eof && *snapshot->buf == '#') {
+		char *tmp, *p, *eol;
+		struct string_list traits = STRING_LIST_INIT_NODUP;
+
+		eol = memchr(snapshot->buf, '\n',
+			     snapshot->eof - snapshot->buf);
+		if (!eol)
+			die_unterminated_line(refs->path,
+					      snapshot->buf,
+					      snapshot->eof - snapshot->buf);
+
+		tmp = xmemdupz(snapshot->buf, eol - snapshot->buf);
+
+		if (!skip_prefix(tmp, "# pack-refs with:", (const char **)&p))
+			die_invalid_line(refs->path,
+					 snapshot->buf,
+					 snapshot->eof - snapshot->buf);
+
+		string_list_split_in_place(&traits, p, ' ', -1);
+
+		if (unsorted_string_list_has_string(&traits, "fully-peeled"))
+			snapshot->peeled = PEELED_FULLY;
+		else if (unsorted_string_list_has_string(&traits, "peeled"))
+			snapshot->peeled = PEELED_TAGS;
+
+		*sorted = unsorted_string_list_has_string(&traits, "sorted");
+
+		/* perhaps other traits later as well */
+
+		/* The "+ 1" is for the LF character. */
+		snapshot->start = eol + 1;
+
+		string_list_clear(&traits, 0);
+		free(tmp);
+	}
+
+	return 0;
+}
+
+int packed_read_raw_ref_v1(struct packed_ref_store *refs, struct snapshot *snapshot,
+			   const char *refname, struct object_id *oid,
+			   unsigned int *type, int *failure_errno)
+{
+	const char *rec;
+
+	*type = 0;
+
+	rec = find_reference_location_v1(snapshot, refname, 1);
+
+	if (!rec) {
+		/* refname is not a packed reference. */
+		*failure_errno = ENOENT;
+		return -1;
+	}
+
+	if (get_oid_hex(rec, oid))
+		die_invalid_line(refs->path, rec, snapshot->eof - rec);
+
+	*type = REF_ISPACKED;
+	return 0;
+}
+
+int next_record_v1(struct packed_ref_iterator *iter)
+{
+	const char *p = iter->pos, *eol;
+
+	strbuf_reset(&iter->refname_buf);
+
+	if (iter->pos == iter->eof)
+		return ITER_DONE;
+
+	iter->base.flags = REF_ISPACKED;
+
+	if (iter->eof - p < the_hash_algo->hexsz + 2 ||
+	    parse_oid_hex(p, &iter->oid, &p) ||
+	    !isspace(*p++))
+		die_invalid_line(iter->snapshot->refs->path,
+				 iter->pos, iter->eof - iter->pos);
+
+	eol = memchr(p, '\n', iter->eof - p);
+	if (!eol)
+		die_unterminated_line(iter->snapshot->refs->path,
+				      iter->pos, iter->eof - iter->pos);
+
+	strbuf_add(&iter->refname_buf, p, eol - p);
+	iter->base.refname = iter->refname_buf.buf;
+
+	if (check_refname_format(iter->base.refname, REFNAME_ALLOW_ONELEVEL)) {
+		if (!refname_is_safe(iter->base.refname))
+			die("packed refname is dangerous: %s",
+			    iter->base.refname);
+		oidclr(&iter->oid);
+		iter->base.flags |= REF_BAD_NAME | REF_ISBROKEN;
+	}
+	if (iter->snapshot->peeled == PEELED_FULLY ||
+	    (iter->snapshot->peeled == PEELED_TAGS &&
+	     starts_with(iter->base.refname, "refs/tags/")))
+		iter->base.flags |= REF_KNOWS_PEELED;
+
+	iter->pos = eol + 1;
+
+	if (iter->pos < iter->eof && *iter->pos == '^') {
+		p = iter->pos + 1;
+		if (iter->eof - p < the_hash_algo->hexsz + 1 ||
+		    parse_oid_hex(p, &iter->peeled, &p) ||
+		    *p++ != '\n')
+			die_invalid_line(iter->snapshot->refs->path,
+					 iter->pos, iter->eof - iter->pos);
+		iter->pos = p;
+
+		/*
+		 * Regardless of what the file header said, we
+		 * definitely know the value of *this* reference. But
+		 * we suppress it if the reference is broken:
+		 */
+		if ((iter->base.flags & REF_ISBROKEN)) {
+			oidclr(&iter->peeled);
+			iter->base.flags &= ~REF_KNOWS_PEELED;
+		} else {
+			iter->base.flags |= REF_KNOWS_PEELED;
+		}
+	} else {
+		oidclr(&iter->peeled);
+	}
+
+	return ITER_OK;
+}
+
+/*
+ * The packed-refs header line that we write out. Perhaps other traits
+ * will be added later.
+ *
+ * Note that earlier versions of Git used to parse these traits by
+ * looking for " trait " in the line. For this reason, the space after
+ * the colon and the trailing space are required.
+ */
+static const char PACKED_REFS_HEADER[] =
+	"# pack-refs with: peeled fully-peeled sorted \n";
+
+int write_packed_file_header_v1(FILE *out)
+{
+	return fprintf(out, "%s", PACKED_REFS_HEADER);
+}
+
+/*
+ * Write an entry to the packed-refs file for the specified refname.
+ * If peeled is non-NULL, write it as the entry's peeled value. On
+ * error, return a nonzero value and leave errno set at the value left
+ * by the failing call to `fprintf()`.
+ */
+int write_packed_entry_v1(FILE *fh, const char *refname,
+			  const struct object_id *oid,
+			  const struct object_id *peeled)
+{
+	if (fprintf(fh, "%s %s\n", oid_to_hex(oid), refname) < 0 ||
+	    (peeled && fprintf(fh, "^%s\n", oid_to_hex(peeled)) < 0))
+		return -1;
+
+	return 0;
+}
-- 
gitgitgadget
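
For readers following the move, the record-aware binary search in
find_reference_location_v1() can be sketched in isolation. This is only a
sketch under stated assumptions: SKETCH_HEXSZ is a hypothetical 4-character
object-name width (Git uses the_hash_algo->hexsz), the buffer is assumed to
have already passed the safety check, and the sketch_* names and the plain
start/eof arguments are stand-ins for `struct snapshot`.

```c
#include <stddef.h>

/* Hypothetical short object-name width; Git uses the_hash_algo->hexsz. */
#define SKETCH_HEXSZ 4

/* Walk back to the start of the record containing *p. */
static const char *sketch_record_start(const char *buf, const char *p)
{
	while (p > buf && (p[-1] != '\n' || p[0] == '^'))
		p--;
	return p;
}

/* Walk forward to the start of the record after the one containing *p. */
static const char *sketch_record_end(const char *p, const char *end)
{
	while (++p < end && (p[-1] != '\n' || p[0] == '^'))
		;
	return p;
}

/* Compare a record's LF-terminated refname to a NUL-terminated refname. */
static int sketch_cmp_record(const char *rec, const char *refname)
{
	const char *r1 = rec + SKETCH_HEXSZ + 1;
	const char *r2 = refname;

	for (;; r1++, r2++) {
		if (*r1 == '\n')
			return *r2 ? -1 : 0;
		if (!*r2)
			return 1;
		if (*r1 != *r2)
			return (unsigned char)*r1 < (unsigned char)*r2 ? -1 : 1;
	}
}

/*
 * Return the record for `refname`. If it is absent, return NULL when
 * `mustexist` is set, and the insertion point otherwise.
 */
const char *sketch_find(const char *start, const char *eof,
			const char *refname, int mustexist)
{
	const char *lo = start, *hi = eof;

	while (lo != hi) {
		/* `mid` may land mid-record, so snap back to a record start. */
		const char *mid = lo + (hi - lo) / 2;
		const char *rec = sketch_record_start(lo, mid);
		int cmp = sketch_cmp_record(rec, refname);

		if (cmp < 0)
			lo = sketch_record_end(mid, hi);
		else if (cmp > 0)
			hi = rec;
		else
			return rec;
	}
	return mustexist ? NULL : lo;
}
```

Because a '^' peel line is never treated as a record start, a midpoint
that falls inside a peeled value still resolves to the reference line
that owns it.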



* [PATCH 13/30] packed-backend: extract add_write_error()
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (11 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 12/30] refs: extract packfile format to new file Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 14/30] packed-backend: extract iterator/updates merge Derrick Stolee via GitGitGadget
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The write_with_updates() method uses a write_error label to jump to code
that adds an error message before exiting with an error. This jump is
taken both when writing the packed-refs file header fails and when
writing a ref line to the packed-refs file fails.

A future change will abstract the loop that writes the refs out of
write_with_updates(), making the goto an inconvenient pattern. For now,
remove the distinction between "goto write_error" and "goto error" by
adding the message in-line using the new static method
add_write_error(). This is functionally equivalent, but will make the
next step easier.
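
As a rough illustration of the pattern (not the patch itself), the sketch
below records the message in-line and then jumps to a single shared label.
The sketch_err struct is a hypothetical fixed-size stand-in for Git's
strbuf, and sketch_write() is an invented two-line writer.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for Git's strbuf: a fixed-size message buffer. */
struct sketch_err {
	char msg[256];
};

/* Record the write failure in-line, in the spirit of add_write_error(). */
void sketch_add_write_error(struct sketch_err *err, const char *path)
{
	snprintf(err->msg, sizeof(err->msg), "error writing to %s: %s",
		 path, strerror(errno));
}

/* Write a header and one line; every failure funnels into one label. */
int sketch_write(FILE *out, const char *path, const char *line,
		 struct sketch_err *err)
{
	if (fprintf(out, "# header\n") < 0) {
		sketch_add_write_error(err, path);
		goto error;
	}
	if (fprintf(out, "%s\n", line) < 0) {
		sketch_add_write_error(err, path);
		goto error;
	}
	return 0;

error:
	/* Shared cleanup would go here; this sketch has none. */
	return -1;
}
```

With the message added at each call site, the failing code no longer
needs to live in the same function as the label that formats the error,
which is what allows the write loop to move elsewhere.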

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 refs/packed-backend.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index afaf6f53233..ef8060f2e08 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -529,6 +529,12 @@ static int packed_init_db(struct ref_store *ref_store UNUSED,
 	return 0;
 }
 
+static void add_write_error(struct packed_ref_store *refs, struct strbuf *err)
+{
+	strbuf_addf(err, "error writing to %s: %s",
+		    get_tempfile_path(refs->tempfile), strerror(errno));
+}
+
 /*
  * Write the packed refs from the current snapshot to the packed-refs
  * tempfile, incorporating any changes from `updates`. `updates` must
@@ -577,8 +583,10 @@ static int write_with_updates(struct packed_ref_store *refs,
 		goto error;
 	}
 
-	if (write_packed_file_header_v1(out) < 0)
-		goto write_error;
+	if (write_packed_file_header_v1(out) < 0) {
+		add_write_error(refs, err);
+		goto error;
+	}
 
 	/*
 	 * We iterate in parallel through the current list of refs and
@@ -673,8 +681,10 @@ static int write_with_updates(struct packed_ref_store *refs,
 
 			if (write_packed_entry_v1(out, iter->refname,
 						  iter->oid,
-						  peel_error ? NULL : &peeled))
-				goto write_error;
+						  peel_error ? NULL : &peeled)) {
+				add_write_error(refs, err);
+				goto error;
+			}
 
 			if ((ok = ref_iterator_advance(iter)) != ITER_OK)
 				iter = NULL;
@@ -694,8 +704,10 @@ static int write_with_updates(struct packed_ref_store *refs,
 
 			if (write_packed_entry_v1(out, update->refname,
 						  &update->new_oid,
-						  peel_error ? NULL : &peeled))
-				goto write_error;
+						  peel_error ? NULL : &peeled)) {
+				add_write_error(refs, err);
+				goto error;
+			}
 
 			i++;
 		}
@@ -719,10 +731,6 @@ static int write_with_updates(struct packed_ref_store *refs,
 
 	return 0;
 
-write_error:
-	strbuf_addf(err, "error writing to %s: %s",
-		    get_tempfile_path(refs->tempfile), strerror(errno));
-
 error:
 	if (iter)
 		ref_iterator_abort(iter);
-- 
gitgitgadget



* [PATCH 14/30] packed-backend: extract iterator/updates merge
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (12 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 13/30] packed-backend: extract add_write_error() Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 15/30] packed-backend: create abstraction for writing refs Derrick Stolee via GitGitGadget
                   ` (18 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

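Extract the loop that merges the existing ref iterator with the sorted
list of updates out of write_with_updates() and into a new static
method, merge_iterator_and_updates(). The new method takes the output
FILE pointer as a parameter and aborts any live iterator before
returning, while write_with_updates() keeps the tempfile setup, the
header write, and the tempfile cleanup.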
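
The merge that this patch extracts walks the existing refs and the
updates in parallel, both sorted by refname. A minimal sketch of that
idea follows, under loud assumptions: refs are reduced to name/value
strings, a NULL value in an update stands for a deletion, and all
sketch_* names are hypothetical. The real loop also handles peeling and
checks the update flags, none of which is modeled here.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical simplified ref; a NULL value in an update means "delete". */
struct sketch_ref {
	const char *name;
	const char *value;
};

/* Hypothetical callback, invoked once per surviving ref in sorted order. */
typedef int (*sketch_emit_fn)(const struct sketch_ref *ref, void *data);

/*
 * Merge sorted `old` with sorted `updates`. An update with the same name
 * as an old entry replaces it. Return 0 on success, or -1 as soon as
 * `emit` reports a failure.
 */
int sketch_merge(const struct sketch_ref *old, size_t nr_old,
		 const struct sketch_ref *updates, size_t nr_updates,
		 sketch_emit_fn emit, void *data)
{
	size_t i = 0, j = 0;

	while (i < nr_old || j < nr_updates) {
		int cmp;

		if (i == nr_old)
			cmp = 1;	/* only updates remain */
		else if (j == nr_updates)
			cmp = -1;	/* only old entries remain */
		else
			cmp = strcmp(old[i].name, updates[j].name);

		if (cmp < 0) {
			/* No update touches this old entry: keep it. */
			if (emit(&old[i], data))
				return -1;
			i++;
			continue;
		}
		if (!cmp)
			i++;	/* the update supersedes the old entry */
		if (updates[j].value && emit(&updates[j], data))
			return -1;
		j++;
	}
	return 0;
}

/* Example sink: append "name=value;" to a bounded buffer. */
struct sketch_out {
	char buf[256];
	size_t len;
};

int sketch_collect(const struct sketch_ref *ref, void *data)
{
	struct sketch_out *out = data;
	size_t room = sizeof(out->buf) - out->len;
	int n = snprintf(out->buf + out->len, room, "%s=%s;",
			 ref->name, ref->value);

	if (n < 0 || (size_t)n >= room)
		return -1;	/* would not fit */
	out->len += n;
	return 0;
}
```

Since both inputs are sorted, one pass produces sorted output, which is
what lets the result be written out as a "sorted" packed-refs file
without a separate sort step.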

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 refs/packed-backend.c | 117 +++++++++++++++++++++++-------------------
 1 file changed, 64 insertions(+), 53 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index ef8060f2e08..0dff78f02c8 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -535,58 +535,13 @@ static void add_write_error(struct packed_ref_store *refs, struct strbuf *err)
 		    get_tempfile_path(refs->tempfile), strerror(errno));
 }
 
-/*
- * Write the packed refs from the current snapshot to the packed-refs
- * tempfile, incorporating any changes from `updates`. `updates` must
- * be a sorted string list whose keys are the refnames and whose util
- * values are `struct ref_update *`. On error, rollback the tempfile,
- * write an error message to `err`, and return a nonzero value.
- *
- * The packfile must be locked before calling this function and will
- * remain locked when it is done.
- */
-static int write_with_updates(struct packed_ref_store *refs,
-			      struct string_list *updates,
-			      struct strbuf *err)
+static int merge_iterator_and_updates(struct packed_ref_store *refs,
+				      struct string_list *updates,
+				      struct strbuf *err,
+				      FILE *out)
 {
 	struct ref_iterator *iter = NULL;
-	size_t i;
-	int ok;
-	FILE *out;
-	struct strbuf sb = STRBUF_INIT;
-	char *packed_refs_path;
-
-	if (!is_lock_file_locked(&refs->lock))
-		BUG("write_with_updates() called while unlocked");
-
-	/*
-	 * If packed-refs is a symlink, we want to overwrite the
-	 * symlinked-to file, not the symlink itself. Also, put the
-	 * staging file next to it:
-	 */
-	packed_refs_path = get_locked_file_path(&refs->lock);
-	strbuf_addf(&sb, "%s.new", packed_refs_path);
-	free(packed_refs_path);
-	refs->tempfile = create_tempfile(sb.buf);
-	if (!refs->tempfile) {
-		strbuf_addf(err, "unable to create file %s: %s",
-			    sb.buf, strerror(errno));
-		strbuf_release(&sb);
-		return -1;
-	}
-	strbuf_release(&sb);
-
-	out = fdopen_tempfile(refs->tempfile, "w");
-	if (!out) {
-		strbuf_addf(err, "unable to fdopen packed-refs tempfile: %s",
-			    strerror(errno));
-		goto error;
-	}
-
-	if (write_packed_file_header_v1(out) < 0) {
-		add_write_error(refs, err);
-		goto error;
-	}
+	int ok, i;
 
 	/*
 	 * We iterate in parallel through the current list of refs and
@@ -713,6 +668,65 @@ static int write_with_updates(struct packed_ref_store *refs,
 		}
 	}
 
+error:
+	if (iter)
+		ref_iterator_abort(iter);
+	return ok;
+}
+
+/*
+ * Write the packed refs from the current snapshot to the packed-refs
+ * tempfile, incorporating any changes from `updates`. `updates` must
+ * be a sorted string list whose keys are the refnames and whose util
+ * values are `struct ref_update *`. On error, rollback the tempfile,
+ * write an error message to `err`, and return a nonzero value.
+ *
+ * The packfile must be locked before calling this function and will
+ * remain locked when it is done.
+ */
+static int write_with_updates(struct packed_ref_store *refs,
+			      struct string_list *updates,
+			      struct strbuf *err)
+{
+	int ok;
+	FILE *out;
+	struct strbuf sb = STRBUF_INIT;
+	char *packed_refs_path;
+
+	if (!is_lock_file_locked(&refs->lock))
+		BUG("write_with_updates() called while unlocked");
+
+	/*
+	 * If packed-refs is a symlink, we want to overwrite the
+	 * symlinked-to file, not the symlink itself. Also, put the
+	 * staging file next to it:
+	 */
+	packed_refs_path = get_locked_file_path(&refs->lock);
+	strbuf_addf(&sb, "%s.new", packed_refs_path);
+	free(packed_refs_path);
+	refs->tempfile = create_tempfile(sb.buf);
+	if (!refs->tempfile) {
+		strbuf_addf(err, "unable to create file %s: %s",
+			    sb.buf, strerror(errno));
+		strbuf_release(&sb);
+		return -1;
+	}
+	strbuf_release(&sb);
+
+	out = fdopen_tempfile(refs->tempfile, "w");
+	if (!out) {
+		strbuf_addf(err, "unable to fdopen packed-refs tempfile: %s",
+			    strerror(errno));
+		goto error;
+	}
+
+	if (write_packed_file_header_v1(out) < 0) {
+		add_write_error(refs, err);
+		goto error;
+	}
+
+	ok = merge_iterator_and_updates(refs, updates, err, out);
+
 	if (ok != ITER_DONE) {
 		strbuf_addstr(err, "unable to write packed-refs file: "
 			      "error iterating over old contents");
@@ -732,9 +746,6 @@ static int write_with_updates(struct packed_ref_store *refs,
 	return 0;
 
 error:
-	if (iter)
-		ref_iterator_abort(iter);
-
 	delete_tempfile(&refs->tempfile);
 	return -1;
 }
-- 
gitgitgadget



* [PATCH 15/30] packed-backend: create abstraction for writing refs
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (13 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 14/30] packed-backend: extract iterator/updates merge Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 16/30] config: add config values for packed-refs v2 Derrick Stolee via GitGitGadget
                   ` (17 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The packed-refs file is a plaintext file format that starts with a
header line, then each ref is given as one or two lines (two if there is
a peeled value). These lines are written as part of a sequence of
updates which are merged with the existing ref iterator in
merge_iterator_and_updates(). That method is currently tied directly to
write_packed_entry_v1().

When creating a new version of the packed-refs file format, it would be
valuable to reuse this merging logic unchanged. Create a new function
pointer type, write_ref_fn, and use that type in
merge_iterator_and_updates().

Notably, the function pointer type no longer depends on a FILE pointer,
but instead takes an arbitrary "void *write_data" parameter. This
flexibility will be critical in the future, since the planned v2 format
will use the chunk-format API and need a more complicated structure than
the output FILE.
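
To make the shape of that abstraction concrete, here is a small sketch
rather than the code in this patch. It assumes object IDs are passed as
hex strings instead of `struct object_id` pointers, and every sketch_*
name is hypothetical. The argument order mirrors the write_fn() call
sites in the diff below: refname, oid, peeled, write_data.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical simplification of write_ref_fn: object IDs are hex
 * strings, and `write_data` is an opaque pointer owned by the writer.
 */
typedef int (*sketch_write_ref_fn)(const char *refname,
				   const char *oid_hex,
				   const char *peeled_hex,
				   void *write_data);

/* v1-style writer: `write_data` is simply the output FILE. */
int sketch_write_ref_v1(const char *refname, const char *oid_hex,
			const char *peeled_hex, void *write_data)
{
	FILE *fh = write_data;

	if (fprintf(fh, "%s %s\n", oid_hex, refname) < 0 ||
	    (peeled_hex && fprintf(fh, "^%s\n", peeled_hex) < 0))
		return -1;
	return 0;
}

/* A writer with richer state: `write_data` is a context struct. */
struct sketch_stats {
	size_t refs;
	size_t peeled;
};

int sketch_count_ref(const char *refname, const char *oid_hex,
		     const char *peeled_hex, void *write_data)
{
	struct sketch_stats *stats = write_data;

	(void)refname;	/* unused by this writer */
	(void)oid_hex;
	stats->refs++;
	if (peeled_hex)
		stats->peeled++;
	return 0;
}
```

The caller only ever sees the function pointer and the opaque pointer,
so a writer that needs more than a FILE can carry its own structure
without the merge loop changing.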

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 refs/packed-backend.c   | 26 +++++++++++++++-----------
 refs/packed-backend.h   | 16 ++++++++++++++--
 refs/packed-format-v1.c |  7 +++++--
 3 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 0dff78f02c8..7ed9475812c 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -535,10 +535,11 @@ static void add_write_error(struct packed_ref_store *refs, struct strbuf *err)
 		    get_tempfile_path(refs->tempfile), strerror(errno));
 }
 
-static int merge_iterator_and_updates(struct packed_ref_store *refs,
-				      struct string_list *updates,
-				      struct strbuf *err,
-				      FILE *out)
+int merge_iterator_and_updates(struct packed_ref_store *refs,
+			       struct string_list *updates,
+			       struct strbuf *err,
+			       write_ref_fn write_fn,
+			       void *write_data)
 {
 	struct ref_iterator *iter = NULL;
 	int ok, i;
@@ -634,9 +635,10 @@ static int merge_iterator_and_updates(struct packed_ref_store *refs,
 			struct object_id peeled;
 			int peel_error = ref_iterator_peel(iter, &peeled);
 
-			if (write_packed_entry_v1(out, iter->refname,
-						  iter->oid,
-						  peel_error ? NULL : &peeled)) {
+			if (write_fn(iter->refname,
+				     iter->oid,
+				     peel_error ? NULL : &peeled,
+				     write_data)) {
 				add_write_error(refs, err);
 				goto error;
 			}
@@ -657,9 +659,10 @@ static int merge_iterator_and_updates(struct packed_ref_store *refs,
 			int peel_error = peel_object(&update->new_oid,
 						     &peeled);
 
-			if (write_packed_entry_v1(out, update->refname,
-						  &update->new_oid,
-						  peel_error ? NULL : &peeled)) {
+			if (write_fn(update->refname,
+				     &update->new_oid,
+				     peel_error ? NULL : &peeled,
+				     write_data)) {
 				add_write_error(refs, err);
 				goto error;
 			}
@@ -725,7 +728,8 @@ static int write_with_updates(struct packed_ref_store *refs,
 		goto error;
 	}
 
-	ok = merge_iterator_and_updates(refs, updates, err, out);
+	ok = merge_iterator_and_updates(refs, updates, err,
+					write_packed_entry_v1, out);
 
 	if (ok != ITER_DONE) {
 		strbuf_addstr(err, "unable to write packed-refs file: "
diff --git a/refs/packed-backend.h b/refs/packed-backend.h
index 143ed6d4f6c..b6908bb002c 100644
--- a/refs/packed-backend.h
+++ b/refs/packed-backend.h
@@ -192,6 +192,17 @@ struct packed_ref_iterator {
 	unsigned int flags;
 };
 
+typedef int (*write_ref_fn)(const char *refname,
+			    const struct object_id *oid,
+			    const struct object_id *peeled,
+			    void *write_data);
+
+int merge_iterator_and_updates(struct packed_ref_store *refs,
+			       struct string_list *updates,
+			       struct strbuf *err,
+			       write_ref_fn write_fn,
+			       void *write_data);
+
 /**
  * Parse the buffer at the given snapshot to verify that it is a
  * packed-refs file in version 1 format. Update the snapshot->peeled
@@ -227,8 +238,9 @@ void verify_buffer_safe_v1(struct snapshot *snapshot);
 void sort_snapshot_v1(struct snapshot *snapshot);
 int write_packed_file_header_v1(FILE *out);
 int next_record_v1(struct packed_ref_iterator *iter);
-int write_packed_entry_v1(FILE *fh, const char *refname,
+int write_packed_entry_v1(const char *refname,
 			  const struct object_id *oid,
-			  const struct object_id *peeled);
+			  const struct object_id *peeled,
+			  void *write_data);
 
 #endif /* REFS_PACKED_BACKEND_H */
diff --git a/refs/packed-format-v1.c b/refs/packed-format-v1.c
index ef9e6618c89..2d071567c02 100644
--- a/refs/packed-format-v1.c
+++ b/refs/packed-format-v1.c
@@ -441,10 +441,13 @@ int write_packed_file_header_v1(FILE *out)
  * error, return a nonzero value and leave errno set at the value left
  * by the failing call to `fprintf()`.
  */
-int write_packed_entry_v1(FILE *fh, const char *refname,
+int write_packed_entry_v1(const char *refname,
 			  const struct object_id *oid,
-			  const struct object_id *peeled)
+			  const struct object_id *peeled,
+			  void *write_data)
 {
+	FILE *fh = write_data;
+
 	if (fprintf(fh, "%s %s\n", oid_to_hex(oid), refname) < 0 ||
 	    (peeled && fprintf(fh, "^%s\n", oid_to_hex(peeled)) < 0))
 		return -1;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 16/30] config: add config values for packed-refs v2
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (14 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 15/30] packed-backend: create abstraction for writing refs Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 17/30] packed-backend: create shell of v2 writes Derrick Stolee via GitGitGadget
                   ` (16 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

When updating the file format version for something as critical as ref
storage, the file format version must come with an extension change. The
extensions.refFormat config value is a multi-valued config value that
defaults to the pair "files" and "packed".

Add "packed-v2" as a possible value to extensions.refFormat. This
value specifies that the packed-refs file may exist in the version 2
format. (If the "packed" value is absent, then the packed-refs file
must be in the version 2 format, not version 1.)

In order to select version 2 for writing, the user will have two
options. First, the user could remove "packed" and add "packed-v2" to
the extensions.refFormat list. This would imply that version 2 is the
only format available. However, this also means that version 1 files
would be ignored at read time, so this does not allow users to upgrade
repositories with existing packed-refs files.

As the second option, add a new refs.packedRefsVersion config option
that specifies which version to use during writes. Thus, when both
"packed" and "packed-v2" are in the extensions.refFormat list, the user
can upgrade from version 1 to version 2, or downgrade from 2 to 1.
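
The interaction between the two settings can be sketched as a small
standalone function (an illustration, not code from this series). The
FMT_* constants and choose_write_version() are hypothetical names, and
`configured` is assumed to be 0 when refs.packedRefsVersion is unset.
Rejecting a configured version whose extension value is missing is an
assumption based on the documentation text, not something this patch
enforces.

```c
/* Hypothetical flag bits mirroring the extensions.refFormat values. */
#define FMT_FILES     (1 << 0)
#define FMT_PACKED    (1 << 1)
#define FMT_PACKED_V2 (1 << 2)

/*
 * Return the packed-refs version to write, or -1 if no version is
 * permitted. `configured` is the refs.packedRefsVersion value, or 0
 * when the option is unset.
 */
static int choose_write_version(int formats, int configured)
{
	if (!configured) {
		/* Default to version 1 when "packed" is listed. */
		if (formats & FMT_PACKED)
			return 1;
		/* Only "packed-v2" is listed: default to version 2. */
		if (formats & FMT_PACKED_V2)
			return 2;
		return -1;
	}

	/* Assumption: an explicit choice needs the matching extension. */
	if (configured == 1 && (formats & FMT_PACKED))
		return 1;
	if (configured == 2 && (formats & FMT_PACKED_V2))
		return 2;
	return -1;
}
```

With both values listed, flipping refs.packedRefsVersion between 1 and
2 is what drives an upgrade or downgrade on the next write.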

Currently, the implementation does not use refs.packedRefsVersion, as
that is delayed until we have the code to write that file format
version. However, we can add the necessary enum values and flag
constants to communicate the presence of "packed-v2" in the
extensions.refFormat list.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config.txt            |  2 ++
 Documentation/config/extensions.txt | 27 ++++++++++++++++++++++-----
 Documentation/config/refs.txt       | 13 +++++++++++++
 refs.c                              |  4 +++-
 refs/packed-backend.c               | 17 ++++++++++++++++-
 refs/refs-internal.h                |  5 +++--
 repository.h                        |  1 +
 setup.c                             |  2 ++
 t/t3212-ref-formats.sh              | 19 +++++++++++++++++++
 9 files changed, 81 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/config/refs.txt

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e93aef8626..e480f99c3e1 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -493,6 +493,8 @@ include::config/rebase.txt[]
 
 include::config/receive.txt[]
 
+include::config/refs.txt[]
+
 include::config/remote.txt[]
 
 include::config/remotes.txt[]
diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
index 18071c336d0..05abb821e07 100644
--- a/Documentation/config/extensions.txt
+++ b/Documentation/config/extensions.txt
@@ -35,17 +35,34 @@ indicate the existence of different layers:
 	`files`, the `packed` format will only be used to group multiple
 	loose object files upon request via the `git pack-refs` command or
 	via the `pack-refs` maintenance task.
+
+`packed-v2`;;
+	When present, references may be stored as a group in a
+	`packed-refs` file in its version 2 format. This file is in the
+	same position and interacts with loose refs the same as when the
+	`packed` value exists. Both `packed` and `packed-v2` must exist to
+	upgrade an existing `packed-refs` file from version 1 to version 2
+	or to downgrade from version 2 to version 1. When both are
+	present, the `refs.packedRefsVersion` config value indicates which
+	file format version is used during writes, but both versions are
+	understood when reading the file.
 --
 +
 The following combinations are supported by this version of Git:
 +
 --
-`files` and `packed`;;
+`files` and (`packed` and/or `packed-v2`);;
 	This set of values indicates that references are stored both as
-	loose reference files and in the `packed-refs` file in its v1
-	format. Loose references are preferred, and the `packed-refs` file
-	is updated only when deleting a reference that is stored in the
-	`packed-refs` file or during a `git pack-refs` command.
+	loose reference files and in the `packed-refs` file. Loose
+	references are preferred, and the `packed-refs` file is updated
+	only when deleting a reference that is stored in the `packed-refs`
+	file or during a `git pack-refs` command.
++
+The presence of `packed` and `packed-v2` specifies whether the `packed-refs`
+file is allowed to be in its v1 or v2 formats, respectively. When only one
+is present, Git will refuse to read a `packed-refs` file that does not
+match the expected format. When both are present, the `refs.packedRefsVersion`
+config option indicates which file format is used during writes.
 
 `files`;;
 	When only this value is present, Git will ignore the `packed-refs`
diff --git a/Documentation/config/refs.txt b/Documentation/config/refs.txt
new file mode 100644
index 00000000000..b2fdb2923f7
--- /dev/null
+++ b/Documentation/config/refs.txt
@@ -0,0 +1,13 @@
+refs.packedRefsVersion::
+	Specifies the file format version to use when writing a `packed-refs`
+	file. Defaults to `1`.
++
+The only other value currently allowed is `2`, which uses a structured file
+format to result in a smaller `packed-refs` file. In order to write this
+file format version, the repository must also have the `packed-v2` extension
+enabled. The most typical setup will include the
+`core.repositoryFormatVersion=1` config value and the `extensions.refFormat`
+key will have three values: `files`, `packed`, and `packed-v2`.
++
+If `extensions.refFormat` has the value `packed-v2` and not `packed`, then
+`refs.packedRefsVersion` defaults to `2`.
diff --git a/refs.c b/refs.c
index 21441ddb162..bf53d1445f2 100644
--- a/refs.c
+++ b/refs.c
@@ -1987,6 +1987,8 @@ static int add_ref_format_flags(enum ref_format_flags flags, int caps) {
 		caps |= REF_STORE_FORMAT_FILES;
 	if (flags & REF_FORMAT_PACKED)
 		caps |= REF_STORE_FORMAT_PACKED;
+	if (flags & REF_FORMAT_PACKED_V2)
+		caps |= REF_STORE_FORMAT_PACKED_V2;
 
 	return caps;
 }
@@ -2006,7 +2008,7 @@ static struct ref_store *ref_store_init(struct repository *repo,
 	flags = add_ref_format_flags(repo->ref_format, flags);
 
 	if (!(flags & REF_STORE_FORMAT_FILES) &&
-	    (flags & REF_STORE_FORMAT_PACKED))
+	    packed_refs_enabled(flags))
 		be_name = "packed";
 
 	be = find_ref_storage_backend(be_name);
diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 7ed9475812c..655aab939be 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -236,7 +236,13 @@ static struct snapshot *create_snapshot(struct packed_ref_store *refs)
 	if (!load_contents(snapshot))
 		return snapshot;
 
-	if (parse_packed_format_v1_header(refs, snapshot, &sorted)) {
+	/*
+	 * If this is a v1 file format, but we don't have v1 enabled,
+	 * then ignore it the same way we would as if we didn't
+	 * understand it.
+	 */
+	if (parse_packed_format_v1_header(refs, snapshot, &sorted) ||
+	    !(refs->store_flags & REF_STORE_FORMAT_PACKED)) {
 		clear_snapshot(refs);
 		return NULL;
 	}
@@ -310,6 +316,12 @@ static int packed_read_raw_ref(struct ref_store *ref_store, const char *refname,
 		packed_downcast(ref_store, REF_STORE_READ, "read_raw_ref");
 	struct snapshot *snapshot = get_snapshot(refs);
 
+	if (!snapshot) {
+		/* refname is not a packed reference. */
+		*failure_errno = ENOENT;
+		return -1;
+	}
+
 	return packed_read_raw_ref_v1(refs, snapshot, refname,
 				      oid, type, failure_errno);
 }
@@ -410,6 +422,9 @@ static struct ref_iterator *packed_ref_iterator_begin(
 	 */
 	snapshot = get_snapshot(refs);
 
+	if (!snapshot)
+		return empty_ref_iterator_begin();
+
 	if (prefix && *prefix)
 		start = find_reference_location_v1(snapshot, prefix, 0);
 	else
diff --git a/refs/refs-internal.h b/refs/refs-internal.h
index a1900848a87..39b93fce97c 100644
--- a/refs/refs-internal.h
+++ b/refs/refs-internal.h
@@ -522,11 +522,12 @@ struct ref_store;
 				 REF_STORE_MAIN)
 
 #define REF_STORE_FORMAT_FILES		(1 << 8) /* can use loose ref files */
-#define REF_STORE_FORMAT_PACKED		(1 << 9) /* can use packed-refs file */
+#define REF_STORE_FORMAT_PACKED		(1 << 9) /* can use v1 packed-refs file */
+#define REF_STORE_FORMAT_PACKED_V2	(1 << 10) /* can use v2 packed-refs file */
 
 static inline int packed_refs_enabled(int flags)
 {
-	return flags & REF_STORE_FORMAT_PACKED;
+	return flags & (REF_STORE_FORMAT_PACKED | REF_STORE_FORMAT_PACKED_V2);
 }
 
 /*
diff --git a/repository.h b/repository.h
index 5cfde4282c5..ee3a90efc72 100644
--- a/repository.h
+++ b/repository.h
@@ -64,6 +64,7 @@ struct repo_path_cache {
 enum ref_format_flags {
 	REF_FORMAT_FILES = (1 << 0),
 	REF_FORMAT_PACKED = (1 << 1),
+	REF_FORMAT_PACKED_V2 = (1 << 2),
 };
 
 struct repository {
diff --git a/setup.c b/setup.c
index a5e63479558..72bfa289ade 100644
--- a/setup.c
+++ b/setup.c
@@ -582,6 +582,8 @@ static enum extension_result handle_extension(const char *var,
 			data->ref_format |= REF_FORMAT_FILES;
 		else if (!strcmp(value, "packed"))
 			data->ref_format |= REF_FORMAT_PACKED;
+		else if (!strcmp(value, "packed-v2"))
+			data->ref_format |= REF_FORMAT_PACKED_V2;
 		else
 			return error(_("invalid value for '%s': '%s'"),
 				     "extensions.refFormat", value);
diff --git a/t/t3212-ref-formats.sh b/t/t3212-ref-formats.sh
index 67aa65c116f..cd1b399bbb8 100755
--- a/t/t3212-ref-formats.sh
+++ b/t/t3212-ref-formats.sh
@@ -56,4 +56,23 @@ test_expect_success 'extensions.refFormat=files only' '
 	)
 '
 
+test_expect_success 'extensions.refFormat=files,packed-v2' '
+	test_commit Q &&
+	git pack-refs --all &&
+	git init no-packed-v1 &&
+	(
+		cd no-packed-v1 &&
+		git config core.repositoryFormatVersion 1 &&
+		git config extensions.refFormat files &&
+		git config --add extensions.refFormat packed-v2 &&
+		test_commit A &&
+		test_commit B &&
+
+		# Refuse to parse a v1 packed-refs file.
+		cp ../.git/packed-refs .git/packed-refs &&
+		test_must_fail git rev-parse refs/tags/Q &&
+		rm -f .git/packed-refs
+	)
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 17/30] packed-backend: create shell of v2 writes
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (15 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 16/30] config: add config values for packed-refs v2 Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 18/30] packed-refs: write file format version 2 Derrick Stolee via GitGitGadget
                   ` (15 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Makefile                |  1 +
 refs/packed-backend.c   | 75 +++++++++++++++++++++++++++++++++++------
 refs/packed-backend.h   |  7 ++++
 refs/packed-format-v2.c | 38 +++++++++++++++++++++
 4 files changed, 110 insertions(+), 11 deletions(-)
 create mode 100644 refs/packed-format-v2.c

diff --git a/Makefile b/Makefile
index 3dc887941d4..16cd245e0ad 100644
--- a/Makefile
+++ b/Makefile
@@ -1058,6 +1058,7 @@ LIB_OBJS += refs/files-backend.o
 LIB_OBJS += refs/iterator.o
 LIB_OBJS += refs/packed-backend.o
 LIB_OBJS += refs/packed-format-v1.o
+LIB_OBJS += refs/packed-format-v2.o
 LIB_OBJS += refs/ref-cache.o
 LIB_OBJS += refspec.o
 LIB_OBJS += remote.o
diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 655aab939be..09f7b74584f 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -692,6 +692,45 @@ error:
 	return ok;
 }
 
+static int write_with_updates_v1(struct packed_ref_store *refs,
+				 struct string_list *updates,
+				 struct strbuf *err)
+{
+	FILE *out;
+
+	out = fdopen_tempfile(refs->tempfile, "w");
+	if (!out) {
+		strbuf_addf(err, "unable to fdopen packed-refs tempfile: %s",
+			    strerror(errno));
+		goto error;
+	}
+
+	if (write_packed_file_header_v1(out) < 0) {
+		add_write_error(refs, err);
+		goto error;
+	}
+
+	return merge_iterator_and_updates(refs, updates, err,
+					  write_packed_entry_v1, out);
+
+error:
+	return -1;
+}
+
+static int write_with_updates_v2(struct packed_ref_store *refs,
+				 struct string_list *updates,
+				 struct strbuf *err)
+{
+	struct write_packed_refs_v2_context *ctx = create_v2_context(refs, updates, err);
+	int ok = -1;
+
+	if ((ok = write_packed_refs_v2(ctx)) < 0)
+		add_write_error(refs, err);
+
+	free_v2_context(ctx);
+	return ok;
+}
+
 /*
  * Write the packed refs from the current snapshot to the packed-refs
  * tempfile, incorporating any changes from `updates`. `updates` must
@@ -707,9 +746,9 @@ static int write_with_updates(struct packed_ref_store *refs,
 			      struct strbuf *err)
 {
 	int ok;
-	FILE *out;
 	struct strbuf sb = STRBUF_INIT;
 	char *packed_refs_path;
+	int version;
 
 	if (!is_lock_file_locked(&refs->lock))
 		BUG("write_with_updates() called while unlocked");
@@ -731,21 +770,35 @@ static int write_with_updates(struct packed_ref_store *refs,
 	}
 	strbuf_release(&sb);
 
-	out = fdopen_tempfile(refs->tempfile, "w");
-	if (!out) {
-		strbuf_addf(err, "unable to fdopen packed-refs tempfile: %s",
-			    strerror(errno));
-		goto error;
+	if (git_config_get_int("refs.packedrefsversion", &version)) {
+		/*
+		 * Set the default depending on the current extension
+		 * list. Default to version 1 if available, but allow a
+		 * default of 2 if only "packed-v2" exists.
+		 */
+		if (refs->store_flags & REF_STORE_FORMAT_PACKED)
+			version = 1;
+		else if (refs->store_flags & REF_STORE_FORMAT_PACKED_V2)
+			version = 2;
+		else
+			BUG("writing a packed-refs file without an extension");
 	}
 
-	if (write_packed_file_header_v1(out) < 0) {
-		add_write_error(refs, err);
+	switch (version) {
+	case 1:
+		ok = write_with_updates_v1(refs, updates, err);
+		break;
+
+	case 2:
+		ok = write_with_updates_v2(refs, updates, err);
+		break;
+
+	default:
+		strbuf_addf(err, "unknown packed-refs version: %d",
+			    version);
 		goto error;
 	}
 
-	ok = merge_iterator_and_updates(refs, updates, err,
-					write_packed_entry_v1, out);
-
 	if (ok != ITER_DONE) {
 		strbuf_addstr(err, "unable to write packed-refs file: "
 			      "error iterating over old contents");
diff --git a/refs/packed-backend.h b/refs/packed-backend.h
index b6908bb002c..e76f26bfc46 100644
--- a/refs/packed-backend.h
+++ b/refs/packed-backend.h
@@ -243,4 +243,11 @@ int write_packed_entry_v1(const char *refname,
 			  const struct object_id *peeled,
 			  void *write_data);
 
+struct write_packed_refs_v2_context;
+struct write_packed_refs_v2_context *create_v2_context(struct packed_ref_store *refs,
+						       struct string_list *updates,
+						       struct strbuf *err);
+int write_packed_refs_v2(struct write_packed_refs_v2_context *ctx);
+void free_v2_context(struct write_packed_refs_v2_context *ctx);
+
 #endif /* REFS_PACKED_BACKEND_H */
diff --git a/refs/packed-format-v2.c b/refs/packed-format-v2.c
new file mode 100644
index 00000000000..ecf3cc93694
--- /dev/null
+++ b/refs/packed-format-v2.c
@@ -0,0 +1,38 @@
+#include "../cache.h"
+#include "../config.h"
+#include "../refs.h"
+#include "refs-internal.h"
+#include "packed-backend.h"
+#include "../iterator.h"
+#include "../lockfile.h"
+#include "../chdir-notify.h"
+
+struct write_packed_refs_v2_context {
+	struct packed_ref_store *refs;
+	struct string_list *updates;
+	struct strbuf *err;
+};
+
+struct write_packed_refs_v2_context *create_v2_context(struct packed_ref_store *refs,
+						       struct string_list *updates,
+						       struct strbuf *err)
+{
+	struct write_packed_refs_v2_context *ctx;
+	CALLOC_ARRAY(ctx, 1);
+
+	ctx->refs = refs;
+	ctx->updates = updates;
+	ctx->err = err;
+
+	return ctx;
+}
+
+int write_packed_refs_v2(struct write_packed_refs_v2_context *ctx)
+{
+	return 0;
+}
+
+void free_v2_context(struct write_packed_refs_v2_context *ctx)
+{
+	free(ctx);
+}
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 18/30] packed-refs: write file format version 2
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (16 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 17/30] packed-backend: create shell of v2 writes Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 19/30] packed-refs: read file format v2 Derrick Stolee via GitGitGadget
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

TODO: add writing tests.
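
For readers following the diff, the offsets bookkeeping done by
write_packed_entry_v2() can be sketched in isolation. This is an
illustration inferred from the writer only: it assumes a 20-byte raw
hash (RAWSZ), and next_offset(), entry_start() and entry_is_peeled()
are hypothetical helper names. Each stored value is the cumulative end
position of an entry within the refs chunk, with the high bit marking
that the entry carries a peeled object ID.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OFFSET_IS_PEELED (((uint64_t)1) << 63)
#define RAWSZ 20 /* assumed raw hash size (SHA-1) */

/*
 * Compute the offset value for the next entry, given the previous
 * entry's value (0 for the first entry). An entry is the refname with
 * its NUL terminator, the object ID, and optionally the peeled ID.
 */
static uint64_t next_offset(uint64_t prev, const char *refname, int peeled)
{
	uint64_t off = prev & ~OFFSET_IS_PEELED;

	off += strlen(refname) + 1 + RAWSZ;
	if (peeled) {
		off += RAWSZ;
		off |= OFFSET_IS_PEELED;
	}
	return off;
}

/* Entry i starts where entry i - 1 ends; the first starts at 0. */
static uint64_t entry_start(const uint64_t *offsets, size_t i)
{
	return i ? (offsets[i - 1] & ~OFFSET_IS_PEELED) : 0;
}

/* The high bit of offsets[i] says whether entry i has a peeled ID. */
static int entry_is_peeled(const uint64_t *offsets, size_t i)
{
	return !!(offsets[i] & OFFSET_IS_PEELED);
}
```

Masking the flag before adding is what keeps the running total correct
when a peeled entry is followed by another entry.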

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 refs/packed-backend.c   |   3 +-
 refs/packed-format-v2.c | 108 ++++++++++++++++++++++++++++++++++++++++
 t/t3212-ref-formats.sh  |   6 ++-
 3 files changed, 115 insertions(+), 2 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 09f7b74584f..3429e63620a 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -790,7 +790,8 @@ static int write_with_updates(struct packed_ref_store *refs,
 		break;
 
 	case 2:
-		ok = write_with_updates_v2(refs, updates, err);
+		/* Convert the normal error codes to ITER_DONE. */
+		ok = write_with_updates_v2(refs, updates, err) ? -2 : ITER_DONE;
 		break;
 
 	default:
diff --git a/refs/packed-format-v2.c b/refs/packed-format-v2.c
index ecf3cc93694..044cc9f629a 100644
--- a/refs/packed-format-v2.c
+++ b/refs/packed-format-v2.c
@@ -6,11 +6,30 @@
 #include "../iterator.h"
 #include "../lockfile.h"
 #include "../chdir-notify.h"
+#include "../chunk-format.h"
+#include "../csum-file.h"
+
+#define OFFSET_IS_PEELED (((uint64_t)1) << 63)
+
+#define PACKED_REFS_SIGNATURE          0x50524546 /* "PREF" */
+#define CHREFS_CHUNKID_OFFSETS         0x524F4646 /* "ROFF" */
+#define CHREFS_CHUNKID_REFS            0x52454653 /* "REFS" */
 
 struct write_packed_refs_v2_context {
 	struct packed_ref_store *refs;
 	struct string_list *updates;
 	struct strbuf *err;
+
+	struct hashfile *f;
+	struct chunkfile *cf;
+
+	/*
+	 * As we stream the ref names to the refs chunk, store these
+	 * values in memory. The array gains one entry for every ref.
+	 */
+	uint64_t *offsets;
+	size_t nr;
+	size_t offsets_alloc;
 };
 
 struct write_packed_refs_v2_context *create_v2_context(struct packed_ref_store *refs,
@@ -24,15 +43,104 @@ struct write_packed_refs_v2_context *create_v2_context(struct packed_ref_store *
 	ctx->updates = updates;
 	ctx->err = err;
 
+	if (!fdopen_tempfile(refs->tempfile, "w")) {
+		strbuf_addf(err, "unable to fdopen packed-refs tempfile: %s",
+			    strerror(errno));
+		return ctx;
+	}
+
+	ctx->f = hashfd(refs->tempfile->fd, refs->tempfile->filename.buf);
+	ctx->cf = init_chunkfile(ctx->f);
+
 	return ctx;
 }
 
+static int write_packed_entry_v2(const char *refname,
+				 const struct object_id *oid,
+				 const struct object_id *peeled,
+				 void *write_data)
+{
+	struct write_packed_refs_v2_context *ctx = write_data;
+	size_t reflen = strlen(refname) + 1;
+	size_t i = ctx->nr;
+
+	ALLOC_GROW(ctx->offsets, i + 1, ctx->offsets_alloc);
+
+	/* Write entire ref, including null terminator. */
+	hashwrite(ctx->f, refname, reflen);
+	hashwrite(ctx->f, oid->hash, the_hash_algo->rawsz);
+	if (peeled)
+		hashwrite(ctx->f, peeled->hash, the_hash_algo->rawsz);
+
+	if (i)
+		ctx->offsets[i] = (ctx->offsets[i - 1] & (~OFFSET_IS_PEELED));
+	else
+		ctx->offsets[i] = 0;
+	ctx->offsets[i] += reflen + the_hash_algo->rawsz;
+
+	if (peeled) {
+		ctx->offsets[i] += the_hash_algo->rawsz;
+		ctx->offsets[i] |= OFFSET_IS_PEELED;
+	}
+
+	ctx->nr++;
+	return 0;
+}
+
+static int write_refs_chunk_refs(struct hashfile *f,
+				 void *data)
+{
+	struct write_packed_refs_v2_context *ctx = data;
+	int ok;
+
+	trace2_region_enter("refs", "refs-chunk", the_repository);
+	ok = merge_iterator_and_updates(ctx->refs, ctx->updates, ctx->err,
+					write_packed_entry_v2, ctx);
+	trace2_region_leave("refs", "refs-chunk", the_repository);
+
+	return ok != ITER_DONE;
+}
+
+static int write_refs_chunk_offsets(struct hashfile *f,
+				    void *data)
+{
+	struct write_packed_refs_v2_context *ctx = data;
+	size_t i;
+
+	trace2_region_enter("refs", "offsets", the_repository);
+	for (i = 0; i < ctx->nr; i++)
+		hashwrite_be64(f, ctx->offsets[i]);
+
+	trace2_region_leave("refs", "offsets", the_repository);
+	return 0;
+}
+
 int write_packed_refs_v2(struct write_packed_refs_v2_context *ctx)
 {
+	unsigned char file_hash[GIT_MAX_RAWSZ];
+
+	add_chunk(ctx->cf, CHREFS_CHUNKID_REFS, 0, write_refs_chunk_refs);
+	add_chunk(ctx->cf, CHREFS_CHUNKID_OFFSETS, 0, write_refs_chunk_offsets);
+
+	hashwrite_be32(ctx->f, PACKED_REFS_SIGNATURE);
+	hashwrite_be32(ctx->f, 2);
+	hashwrite_be32(ctx->f, the_hash_algo->format_id);
+
+	if (write_chunkfile(ctx->cf, CHUNKFILE_TRAILING_TOC, ctx))
+		goto failure;
+
+	finalize_hashfile(ctx->f, file_hash, FSYNC_COMPONENT_REFERENCE,
+			  CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+
 	return 0;
+
+failure:
+	return -1;
 }
 
 void free_v2_context(struct write_packed_refs_v2_context *ctx)
 {
+	if (ctx->cf)
+		free_chunkfile(ctx->cf);
 	free(ctx);
 }
diff --git a/t/t3212-ref-formats.sh b/t/t3212-ref-formats.sh
index cd1b399bbb8..03c713ac4f6 100755
--- a/t/t3212-ref-formats.sh
+++ b/t/t3212-ref-formats.sh
@@ -71,7 +71,11 @@ test_expect_success 'extensions.refFormat=files,packed-v2' '
 		# Refuse to parse a v1 packed-refs file.
 		cp ../.git/packed-refs .git/packed-refs &&
 		test_must_fail git rev-parse refs/tags/Q &&
-		rm -f .git/packed-refs
+		rm -f .git/packed-refs &&
+
+		# Create a v2 packed-refs file
+		git pack-refs --all &&
+		test_path_exists .git/packed-refs
 	)
 '
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 19/30] packed-refs: read file format v2
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (17 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 18/30] packed-refs: write file format version 2 Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 20/30] packed-refs: read optional prefix chunks Derrick Stolee via GitGitGadget
                   ` (13 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 refs/packed-backend.c   | 129 ++++++++++++++++---------
 refs/packed-backend.h   |  72 ++++++++++++--
 refs/packed-format-v2.c | 209 ++++++++++++++++++++++++++++++++++++++++
 t/t3212-ref-formats.sh  |  17 +++-
 4 files changed, 372 insertions(+), 55 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 3429e63620a..549cce1f84a 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -66,7 +66,7 @@ void clear_snapshot_buffer(struct snapshot *snapshot)
  * Decrease the reference count of `*snapshot`. If it goes to zero,
  * free `*snapshot` and return true; otherwise return false.
  */
-static int release_snapshot(struct snapshot *snapshot)
+int release_snapshot(struct snapshot *snapshot)
 {
 	if (!--snapshot->referrers) {
 		stat_validity_clear(&snapshot->validity);
@@ -142,7 +142,6 @@ static int load_contents(struct snapshot *snapshot)
 {
 	int fd;
 	struct stat st;
-	size_t size;
 	ssize_t bytes_read;
 
 	if (!packed_refs_enabled(snapshot->refs->store_flags))
@@ -168,25 +167,25 @@ static int load_contents(struct snapshot *snapshot)
 
 	if (fstat(fd, &st) < 0)
 		die_errno("couldn't stat %s", snapshot->refs->path);
-	size = xsize_t(st.st_size);
+	snapshot->buflen = xsize_t(st.st_size);
 
-	if (!size) {
+	if (!snapshot->buflen) {
 		close(fd);
 		return 0;
-	} else if (mmap_strategy == MMAP_NONE || size <= SMALL_FILE_SIZE) {
-		snapshot->buf = xmalloc(size);
-		bytes_read = read_in_full(fd, snapshot->buf, size);
-		if (bytes_read < 0 || bytes_read != size)
+	} else if (mmap_strategy == MMAP_NONE || snapshot->buflen <= SMALL_FILE_SIZE) {
+		snapshot->buf = xmalloc(snapshot->buflen);
+		bytes_read = read_in_full(fd, snapshot->buf, snapshot->buflen);
+		if (bytes_read < 0 || bytes_read != snapshot->buflen)
 			die_errno("couldn't read %s", snapshot->refs->path);
 		snapshot->mmapped = 0;
 	} else {
-		snapshot->buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
+		snapshot->buf = xmmap(NULL, snapshot->buflen, PROT_READ, MAP_PRIVATE, fd, 0);
 		snapshot->mmapped = 1;
 	}
 	close(fd);
 
 	snapshot->start = snapshot->buf;
-	snapshot->eof = snapshot->buf + size;
+	snapshot->eof = snapshot->buf + snapshot->buflen;
 
 	return 1;
 }
@@ -232,46 +231,52 @@ static struct snapshot *create_snapshot(struct packed_ref_store *refs)
 	snapshot->refs = refs;
 	acquire_snapshot(snapshot);
 	snapshot->peeled = PEELED_NONE;
+	snapshot->version = 1;
 
 	if (!load_contents(snapshot))
 		return snapshot;
 
-	/*
-	 * If this is a v1 file format, but we don't have v1 enabled,
-	 * then ignore it the same way we would as if we didn't
-	 * understand it.
-	 */
-	if (parse_packed_format_v1_header(refs, snapshot, &sorted) ||
-	    !(refs->store_flags & REF_STORE_FORMAT_PACKED)) {
-		clear_snapshot(refs);
-		return NULL;
-	}
+	if ((refs->store_flags & REF_STORE_FORMAT_PACKED) &&
+	    !detect_packed_format_v2_header(refs, snapshot)) {
+		parse_packed_format_v1_header(refs, snapshot, &sorted);
+		snapshot->version = 1;
+		verify_buffer_safe_v1(snapshot);
 
-	verify_buffer_safe_v1(snapshot);
+		if (!sorted) {
+			sort_snapshot_v1(snapshot);
 
-	if (!sorted) {
-		sort_snapshot_v1(snapshot);
+			/*
+			 * Reordering the records might have moved a short one
+			 * to the end of the buffer, so verify the buffer's
+			 * safety again:
+			 */
+			verify_buffer_safe_v1(snapshot);
+		}
 
-		/*
-		 * Reordering the records might have moved a short one
-		 * to the end of the buffer, so verify the buffer's
-		 * safety again:
-		 */
-		verify_buffer_safe_v1(snapshot);
+		if (mmap_strategy != MMAP_OK && snapshot->mmapped) {
+			/*
+			 * We don't want to leave the file mmapped, so we are
+			 * forced to make a copy now:
+			 */
+			char *buf_copy = xmalloc(snapshot->buflen);
+
+			memcpy(buf_copy, snapshot->start, snapshot->buflen);
+			clear_snapshot_buffer(snapshot);
+			snapshot->buf = snapshot->start = buf_copy;
+			snapshot->eof = buf_copy + snapshot->buflen;
+		}
+
+		return snapshot;
 	}
 
-	if (mmap_strategy != MMAP_OK && snapshot->mmapped) {
+	if (refs->store_flags & REF_STORE_FORMAT_PACKED_V2) {
 		/*
-		 * We don't want to leave the file mmapped, so we are
-		 * forced to make a copy now:
+		 * Assume we are in v2 format mode, now.
+		 *
+		 * fill_snapshot_v2() will die() if parsing fails.
 		 */
-		size_t size = snapshot->eof - snapshot->start;
-		char *buf_copy = xmalloc(size);
-
-		memcpy(buf_copy, snapshot->start, size);
-		clear_snapshot_buffer(snapshot);
-		snapshot->buf = snapshot->start = buf_copy;
-		snapshot->eof = buf_copy + size;
+		fill_snapshot_v2(snapshot);
+		snapshot->version = 2;
 	}
 
 	return snapshot;
@@ -322,8 +327,18 @@ static int packed_read_raw_ref(struct ref_store *ref_store, const char *refname,
 		return -1;
 	}
 
-	return packed_read_raw_ref_v1(refs, snapshot, refname,
-				      oid, type, failure_errno);
+	switch (snapshot->version) {
+	case 1:
+		return packed_read_raw_ref_v1(refs, snapshot, refname,
+					      oid, type, failure_errno);
+
+	case 2:
+		return packed_read_raw_ref_v2(refs, snapshot, refname,
+					      oid, type, failure_errno);
+
+	default:
+		return -1;
+	}
 }
 
 /*
@@ -335,7 +350,16 @@ static int packed_read_raw_ref(struct ref_store *ref_store, const char *refname,
  */
 static int next_record(struct packed_ref_iterator *iter)
 {
-	return next_record_v1(iter);
+	switch (iter->version) {
+	case 1:
+		return next_record_v1(iter);
+
+	case 2:
+		return next_record_v2(iter);
+
+	default:
+		return -1;
+	}
 }
 
 static int packed_ref_iterator_advance(struct ref_iterator *ref_iterator)
@@ -410,6 +434,7 @@ static struct ref_iterator *packed_ref_iterator_begin(
 	struct packed_ref_iterator *iter;
 	struct ref_iterator *ref_iterator;
 	unsigned int required_flags = REF_STORE_READ;
+	size_t v2_row = 0;
 
 	if (!(flags & DO_FOR_EACH_INCLUDE_BROKEN))
 		required_flags |= REF_STORE_ODB;
@@ -422,13 +447,21 @@ static struct ref_iterator *packed_ref_iterator_begin(
 	 */
 	snapshot = get_snapshot(refs);
 
-	if (!snapshot)
+	if (!snapshot || snapshot->version < 0 || snapshot->version > 2)
 		return empty_ref_iterator_begin();
 
-	if (prefix && *prefix)
-		start = find_reference_location_v1(snapshot, prefix, 0);
-	else
-		start = snapshot->start;
+	if (prefix && *prefix) {
+		if (snapshot->version == 1)
+			start = find_reference_location_v1(snapshot, prefix, 0);
+		else
+			start = find_reference_location_v2(snapshot, prefix, 0,
+							   &v2_row);
+	} else {
+		if (snapshot->version == 1)
+			start = snapshot->start;
+		else
+			start = snapshot->refs_chunk;
+	}
 
 	if (start == snapshot->eof)
 		return empty_ref_iterator_begin();
@@ -439,6 +472,8 @@ static struct ref_iterator *packed_ref_iterator_begin(
 
 	iter->snapshot = snapshot;
 	acquire_snapshot(snapshot);
+	iter->version = snapshot->version;
+	iter->row = v2_row;
 
 	iter->pos = start;
 	iter->eof = snapshot->eof;
diff --git a/refs/packed-backend.h b/refs/packed-backend.h
index e76f26bfc46..3a8649857f1 100644
--- a/refs/packed-backend.h
+++ b/refs/packed-backend.h
@@ -72,6 +72,9 @@ struct snapshot {
 	/* Is the `packed-refs` file currently mmapped? */
 	int mmapped;
 
+	/* Which file format version does this file use? */
+	int version;
+
 	/*
 	 * The contents of the `packed-refs` file:
 	 *
@@ -96,6 +99,14 @@ struct snapshot {
 	 */
 	enum { PEELED_NONE, PEELED_TAGS, PEELED_FULLY } peeled;
 
+	/*************************
+	 * packed-refs v2 values *
+	 *************************/
+	size_t nr;
+	size_t buflen;
+	const unsigned char *offset_chunk;
+	const char *refs_chunk;
+
 	/*
 	 * Count of references to this instance, including the pointer
 	 * from `packed_ref_store::snapshot`, if any. The instance
@@ -112,6 +123,8 @@ struct snapshot {
 	struct stat_validity validity;
 };
 
+int release_snapshot(struct snapshot *snapshot);
+
 /*
  * If the buffer in `snapshot` is active, then either munmap the
  * memory and close the file, or free the memory. Then set the buffer
@@ -175,21 +188,30 @@ struct packed_ref_store {
  */
 struct packed_ref_iterator {
 	struct ref_iterator base;
-
 	struct snapshot *snapshot;
+	struct repository *repo;
+	unsigned int flags;
+	int version;
+
+	/* Scratch space for current values: */
+	struct object_id oid, peeled;
+	struct strbuf refname_buf;
 
 	/* The current position in the snapshot's buffer: */
 	const char *pos;
 
+	/***********************************
+	 * packed-refs v1 iterator values. *
+	 ***********************************/
+
 	/* The end of the part of the buffer that will be iterated over: */
 	const char *eof;
 
-	/* Scratch space for current values: */
-	struct object_id oid, peeled;
-	struct strbuf refname_buf;
-
-	struct repository *repo;
-	unsigned int flags;
+	/***********************************
+	 * packed-refs v2 iterator values. *
+	 ***********************************/
+	size_t nr;
+	size_t row;
 };
 
 typedef int (*write_ref_fn)(const char *refname,
@@ -243,6 +265,42 @@ int write_packed_entry_v1(const char *refname,
 			  const struct object_id *peeled,
 			  void *write_data);
 
+/**
+ * Parse the buffer at the given snapshot to verify that it is a
+ * packed-refs file in version 1 format. Update the snapshot->peeled
+ * value according to the header information. Update the given
+ * 'sorted' value with whether or not the packed-refs file is sorted.
+ */
+int parse_packed_format_v1_header(struct packed_ref_store *refs,
+				  struct snapshot *snapshot,
+				  int *sorted);
+
+int detect_packed_format_v2_header(struct packed_ref_store *refs,
+				   struct snapshot *snapshot);
+/*
+ * Find the place in the v2 refs chunk where the record for `refname`
+ * starts. If `mustexist` is true and the reference doesn't exist,
+ * then return NULL. If `mustexist` is false and the reference
+ * doesn't exist, then return the point where that reference would be
+ * inserted, or NULL if it would be inserted after the last record.
+ * In the latter mode, `refname` doesn't have to be a proper
+ * reference name; for example, one could search for "refs/replace/"
+ * to find the start of any replace references. If `pos` is non-NULL,
+ * the row of the match or insertion point is stored in `*pos`.
+ *
+ * The record is sought using a binary search over the offsets chunk,
+ * so the refs chunk must be sorted.
+ */
+const char *find_reference_location_v2(struct snapshot *snapshot,
+				       const char *refname, int mustexist,
+				       size_t *pos);
+
+int packed_read_raw_ref_v2(struct packed_ref_store *refs, struct snapshot *snapshot,
+			   const char *refname, struct object_id *oid,
+			   unsigned int *type, int *failure_errno);
+int next_record_v2(struct packed_ref_iterator *iter);
+void fill_snapshot_v2(struct snapshot *snapshot);
+
 struct write_packed_refs_v2_context;
 struct write_packed_refs_v2_context *create_v2_context(struct packed_ref_store *refs,
 						       struct string_list *updates,
diff --git a/refs/packed-format-v2.c b/refs/packed-format-v2.c
index 044cc9f629a..d75df9545ec 100644
--- a/refs/packed-format-v2.c
+++ b/refs/packed-format-v2.c
@@ -15,6 +15,215 @@
 #define CHREFS_CHUNKID_OFFSETS         0x524F4646 /* "ROFF" */
 #define CHREFS_CHUNKID_REFS            0x52454653 /* "REFS" */
 
+int detect_packed_format_v2_header(struct packed_ref_store *refs,
+				   struct snapshot *snapshot)
+{
+	/*
+	 * packed-refs v1 might not have a header, so instead of
+	 * looking for a v1 header, check for the v2 signature.
+	 */
+	return get_be32(snapshot->buf) == PACKED_REFS_SIGNATURE;
+}
+
+static const char *get_nth_ref(struct snapshot *snapshot,
+			       size_t n)
+{
+	uint64_t offset;
+
+	if (n >= snapshot->nr)
+		BUG("asking for position %"PRIu64" outside of bounds (%"PRIu64")",
+		    (uint64_t)n, (uint64_t)snapshot->nr);
+
+	if (n)
+		offset = get_be64(snapshot->offset_chunk + (n-1) * sizeof(uint64_t))
+				  & ~OFFSET_IS_PEELED;
+	else
+		offset = 0;
+
+	return snapshot->refs_chunk + offset;
+}
+
+/*
+ * Find the place in the v2 refs chunk where the record for `refname`
+ * starts. If `mustexist` is true and the reference doesn't exist,
+ * then return NULL. If `mustexist` is false and the reference
+ * doesn't exist, then return the point where that reference would be
+ * inserted, or NULL if it would be inserted after the last record.
+ * In the latter mode, `refname` doesn't have to be a proper
+ * reference name; for example, one could search for "refs/replace/"
+ * to find the start of any replace references. If `pos` is non-NULL,
+ * the row of the match or insertion point is stored in `*pos`.
+ *
+ * The record is sought using a binary search over the offsets chunk,
+ * so the refs chunk must be sorted.
+ */
+const char *find_reference_location_v2(struct snapshot *snapshot,
+				       const char *refname, int mustexist,
+				       size_t *pos)
+{
+	size_t lo = 0, hi = snapshot->nr;
+
+	while (lo != hi) {
+		const char *rec;
+		int cmp;
+		size_t mid = lo + (hi - lo) / 2;
+
+		rec = get_nth_ref(snapshot, mid);
+		cmp = strcmp(rec, refname);
+		if (cmp < 0) {
+			lo = mid + 1;
+		} else if (cmp > 0) {
+			hi = mid;
+		} else {
+			if (pos)
+				*pos = mid;
+			return rec;
+		}
+	}
+
+	if (mustexist) {
+		return NULL;
+	} else {
+		const char *ret;
+		/*
+		 * We are likely doing a prefix match, so use the current
+		 * 'lo' position as the indicator.
+		 */
+		if (pos)
+			*pos = lo;
+		if (lo >= snapshot->nr)
+			return NULL;
+
+		ret = get_nth_ref(snapshot, lo);
+		return ret;
+	}
+}
+
+int packed_read_raw_ref_v2(struct packed_ref_store *refs, struct snapshot *snapshot,
+			   const char *refname, struct object_id *oid,
+			   unsigned int *type, int *failure_errno)
+{
+	const char *rec;
+
+	*type = 0;
+
+	rec = find_reference_location_v2(snapshot, refname, 1, NULL);
+
+	if (!rec) {
+		/* refname is not a packed reference. */
+		*failure_errno = ENOENT;
+		return -1;
+	}
+
+	hashcpy(oid->hash, (const unsigned char *)rec + strlen(rec) + 1);
+	oid->algo = hash_algo_by_ptr(the_hash_algo);
+
+	*type = REF_ISPACKED;
+	return 0;
+}
+
+static int packed_refs_read_offsets(const unsigned char *chunk_start,
+				     size_t chunk_size, void *data)
+{
+	struct snapshot *snapshot = data;
+
+	snapshot->offset_chunk = chunk_start;
+	snapshot->nr = chunk_size / sizeof(uint64_t);
+	return 0;
+}
+
+void fill_snapshot_v2(struct snapshot *snapshot)
+{
+	uint32_t file_signature, file_version, hash_version;
+	struct chunkfile *cf;
+
+	file_signature = get_be32(snapshot->buf);
+	if (file_signature != PACKED_REFS_SIGNATURE)
+		die(_("%s file signature %X does not match signature %X"),
+		    "packed-ref", file_signature, PACKED_REFS_SIGNATURE);
+
+	file_version = get_be32(snapshot->buf + sizeof(uint32_t));
+	if (file_version != 2)
+		die(_("format version %u does not match expected file version %u"),
+		    file_version, 2);
+
+	hash_version = get_be32(snapshot->buf + 2 * sizeof(uint32_t));
+	if (hash_version != the_hash_algo->format_id)
+		die(_("hash version %X does not match expected hash version %X"),
+		    hash_version, the_hash_algo->format_id);
+
+	cf = init_chunkfile(NULL);
+
+	if (read_trailing_table_of_contents(cf, (const unsigned char *)snapshot->buf, snapshot->buflen)) {
+		release_snapshot(snapshot);
+		snapshot = NULL;
+		goto cleanup;
+	}
+
+	read_chunk(cf, CHREFS_CHUNKID_OFFSETS, packed_refs_read_offsets, snapshot);
+	pair_chunk(cf, CHREFS_CHUNKID_REFS, (const unsigned char**)&snapshot->refs_chunk);
+
+	/* TODO: add error checks for invalid chunk combinations. */
+
+cleanup:
+	free_chunkfile(cf);
+}
+
+/*
+ * Move the iterator to the next record in the snapshot, without
+ * respect for whether the record is actually required by the current
+ * iteration. Adjust the fields in `iter` and return `ITER_OK` or
+ * `ITER_DONE`. This function does not free the iterator in the case
+ * of `ITER_DONE`.
+ */
+int next_record_v2(struct packed_ref_iterator *iter)
+{
+	uint64_t offset;
+	const char *pos = iter->pos;
+	strbuf_reset(&iter->refname_buf);
+
+	if (iter->row == iter->snapshot->nr)
+		return ITER_DONE;
+
+	iter->base.flags = REF_ISPACKED;
+
+	strbuf_addstr(&iter->refname_buf, pos);
+	iter->base.refname = iter->refname_buf.buf;
+	pos += strlen(pos) + 1;
+
+	hashcpy(iter->oid.hash, (const unsigned char *)pos);
+	iter->oid.algo = hash_algo_by_ptr(the_hash_algo);
+	pos += the_hash_algo->rawsz;
+
+	if (check_refname_format(iter->base.refname, REFNAME_ALLOW_ONELEVEL)) {
+		if (!refname_is_safe(iter->base.refname))
+			die("packed refname is dangerous: %s",
+			    iter->base.refname);
+		oidclr(&iter->oid);
+		iter->base.flags |= REF_BAD_NAME | REF_ISBROKEN;
+	}
+
+	/* We always know the peeled value! */
+	iter->base.flags |= REF_KNOWS_PEELED;
+
+	offset = get_be64(iter->snapshot->offset_chunk + sizeof(uint64_t) * iter->row);
+	if (offset & OFFSET_IS_PEELED) {
+		hashcpy(iter->peeled.hash, (const unsigned char *)pos);
+		iter->peeled.algo = hash_algo_by_ptr(the_hash_algo);
+	} else {
+		oidclr(&iter->peeled);
+	}
+
+	/* TODO: somehow all tags are getting OFFSET_IS_PEELED even though
+	 * some are not annotated tags.
+	 */
+	iter->pos = iter->snapshot->refs_chunk + (offset & (~OFFSET_IS_PEELED));
+
+	iter->row++;
+
+	return ITER_OK;
+}
+
 struct write_packed_refs_v2_context {
 	struct packed_ref_store *refs;
 	struct string_list *updates;
diff --git a/t/t3212-ref-formats.sh b/t/t3212-ref-formats.sh
index 03c713ac4f6..571ba518ef1 100755
--- a/t/t3212-ref-formats.sh
+++ b/t/t3212-ref-formats.sh
@@ -73,9 +73,24 @@ test_expect_success 'extensions.refFormat=files,packed-v2' '
 		test_must_fail git rev-parse refs/tags/Q &&
 		rm -f .git/packed-refs &&
 
+		git for-each-ref --format="%(refname) %(objectname)" >expect-all &&
+		git for-each-ref --format="%(refname) %(objectname)" \
+			refs/tags/* >expect-tags &&
+
 		# Create a v2 packed-refs file
 		git pack-refs --all &&
-		test_path_exists .git/packed-refs
+		test_path_exists .git/packed-refs &&
+		for t in A B
+		do
+			test_path_is_missing .git/refs/tags/$t &&
+			git rev-parse refs/tags/$t || return 1
+		done &&
+
+		git for-each-ref --format="%(refname) %(objectname)" >actual-all &&
+		test_cmp expect-all actual-all &&
+		git for-each-ref --format="%(refname) %(objectname)" \
+			refs/tags/* >actual-tags &&
+		test_cmp expect-tags actual-tags
 	)
 '
 
-- 
gitgitgadget



* [PATCH 20/30] packed-refs: read optional prefix chunks
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (18 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 19/30] packed-refs: read file format v2 Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 21/30] packed-refs: write " Derrick Stolee via GitGitGadget
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
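This patch carries no description, so here is a minimal sketch of how the
reader below appears to interpret the prefix offsets chunk, inferred from
get_nth_prefix() and find_reference_location_v2(). Each row holds two
big-endian 32-bit values: the end offset of the prefix string inside the
prefix data chunk (NUL terminator included) and the row in the refs chunk
where that prefix's refs end. The names read_be32(), struct prefix_range and
decode_prefix_row() are hypothetical and exist only for this illustration;
the patch itself uses get_be32().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Git's get_be32(): read a big-endian uint32. */
static inline uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decoded view of one prefix (illustrative type, not part of the patch). */
struct prefix_range {
	uint32_t data_start; /* offset of the prefix string in the data chunk */
	uint32_t data_len;   /* string length, without the NUL terminator */
	uint32_t row_start;  /* first ref row that uses this prefix */
	uint32_t row_end;    /* one past the last ref row that uses it */
};

/*
 * Decode row `n` of a prefix offsets chunk holding `nr` rows. Row n stores
 * only the *end* values, so the start values come from row n - 1 (or are 0
 * for the first row).
 */
struct prefix_range decode_prefix_row(const unsigned char *chunk,
				      size_t nr, size_t n)
{
	const size_t row_size = 2 * sizeof(uint32_t);
	struct prefix_range r;
	uint32_t data_end;

	assert(n < nr);

	r.data_start = n ? read_be32(chunk + row_size * (n - 1)) : 0;
	r.row_start = n ? read_be32(chunk + row_size * (n - 1) +
				    sizeof(uint32_t)) : 0;
	data_end = read_be32(chunk + row_size * n);
	r.row_end = read_be32(chunk + row_size * n + sizeof(uint32_t));

	/* The stored string includes its NUL terminator. */
	r.data_len = data_end - r.data_start - 1;
	return r;
}
```

With prefixes "refs/heads/" (rows 0-1) and "refs/tags/" (rows 2-4), the chunk
would hold the rows (12, 2) and (23, 5), and the binary search for a ref
under "refs/tags/" is then restricted to rows [2, 5).
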
 refs/packed-backend.c   |   2 +
 refs/packed-backend.h   |   9 +++
 refs/packed-format-v2.c | 159 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 170 insertions(+)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 549cce1f84a..ae904de9014 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -475,6 +475,8 @@ static struct ref_iterator *packed_ref_iterator_begin(
 	iter->version = snapshot->version;
 	iter->row = v2_row;
 
+	init_iterator_prefix_info(prefix, iter);
+
 	iter->pos = start;
 	iter->eof = snapshot->eof;
 	strbuf_init(&iter->refname_buf, 0);
diff --git a/refs/packed-backend.h b/refs/packed-backend.h
index 3a8649857f1..1936bb5c76c 100644
--- a/refs/packed-backend.h
+++ b/refs/packed-backend.h
@@ -103,9 +103,12 @@ struct snapshot {
 	 * packed-refs v2 values *
 	 *************************/
 	size_t nr;
+	size_t prefixes_nr;
 	size_t buflen;
 	const unsigned char *offset_chunk;
 	const char *refs_chunk;
+	const unsigned char *prefix_offsets_chunk;
+	const char *prefix_chunk;
 
 	/*
 	 * Count of references to this instance, including the pointer
@@ -212,6 +215,9 @@ struct packed_ref_iterator {
 	 ***********************************/
 	size_t nr;
 	size_t row;
+	size_t prefix_row_end;
+	size_t prefix_i;
+	const char *cur_prefix;
 };
 
 typedef int (*write_ref_fn)(const char *refname,
@@ -308,4 +314,7 @@ struct write_packed_refs_v2_context *create_v2_context(struct packed_ref_store *
 int write_packed_refs_v2(struct write_packed_refs_v2_context *ctx);
 void free_v2_context(struct write_packed_refs_v2_context *ctx);
 
+void init_iterator_prefix_info(const char *prefix,
+			       struct packed_ref_iterator *iter);
+
 #endif /* REFS_PACKED_BACKEND_H */
diff --git a/refs/packed-format-v2.c b/refs/packed-format-v2.c
index d75df9545ec..0ab277f7ad4 100644
--- a/refs/packed-format-v2.c
+++ b/refs/packed-format-v2.c
@@ -14,6 +14,79 @@
 #define PACKED_REFS_SIGNATURE          0x50524546 /* "PREF" */
 #define CHREFS_CHUNKID_OFFSETS         0x524F4646 /* "ROFF" */
 #define CHREFS_CHUNKID_REFS            0x52454653 /* "REFS" */
+#define CHREFS_CHUNKID_PREFIX_DATA     0x50465844 /* "PFXD" */
+#define CHREFS_CHUNKID_PREFIX_OFFSETS  0x5046584F /* "PFXO" */
+
+static const char *get_nth_prefix(struct snapshot *snapshot,
+				  size_t n, size_t *len)
+{
+	uint64_t offset, next_offset;
+
+	if (n >= snapshot->prefixes_nr)
+		BUG("asking for prefix %"PRIu64" outside of bounds (%"PRIu64")",
+		    (uint64_t)n, (uint64_t)snapshot->prefixes_nr);
+
+	if (n)
+		offset = get_be32(snapshot->prefix_offsets_chunk +
+				  2 * sizeof(uint32_t) * (n - 1));
+	else
+		offset = 0;
+
+	if (len) {
+		next_offset = get_be32(snapshot->prefix_offsets_chunk +
+				       2 * sizeof(uint32_t) * n);
+
+		/* Prefix includes null terminator. */
+		*len = next_offset - offset - 1;
+	}
+
+	return snapshot->prefix_chunk + offset;
+}
+
+/*
+ * Find the entry in the prefix chunk that is a prefix of `refname`.
+ * On a match, return a pointer to the NUL-terminated prefix string
+ * and store its row in `*pos`.
+ *
+ * If no stored prefix matches, store in `*pos` the row where such a
+ * prefix would be inserted and return the prefix currently at that
+ * row, or NULL if that row is past the last prefix. Callers must
+ * therefore check the result against `refname` (for example with
+ * starts_with()) before treating it as a match.
+ *
+ * The prefix is sought using a binary search, so the prefix chunk
+ * must be sorted.
+ */
+static const char *find_prefix_location(struct snapshot *snapshot,
+					const char *refname, size_t *pos)
+{
+	size_t lo = 0, hi = snapshot->prefixes_nr;
+
+	while (lo != hi) {
+		const char *rec;
+		int cmp;
+		size_t len;
+		size_t mid = lo + (hi - lo) / 2;
+
+		rec = get_nth_prefix(snapshot, mid, &len);
+		cmp = strncmp(rec, refname, len);
+		if (cmp < 0) {
+			lo = mid + 1;
+		} else if (cmp > 0) {
+			hi = mid;
+		} else {
+			/* we have a prefix match! */
+			*pos = mid;
+			return rec;
+		}
+	}
+
+	*pos = lo;
+	if (lo < snapshot->prefixes_nr)
+		return get_nth_prefix(snapshot, lo, NULL);
+	else
+		return NULL;
+}
 
 int detect_packed_format_v2_header(struct packed_ref_store *refs,
 				   struct snapshot *snapshot)
@@ -63,6 +136,46 @@ const char *find_reference_location_v2(struct snapshot *snapshot,
 {
 	size_t lo = 0, hi = snapshot->nr;
 
+	if (snapshot->prefix_chunk) {
+		size_t prefix_row;
+		const char *prefix;
+		int found = 1;
+
+		prefix = find_prefix_location(snapshot, refname, &prefix_row);
+
+		if (!prefix || !starts_with(refname, prefix)) {
+			if (mustexist)
+				return NULL;
+			found = 0;
+		}
+
+		/* The second 4-byte column of the prefix offsets */
+		if (prefix_row) {
+			/* if prefix_row == 0, then lo = 0, which is already true. */
+			lo = get_be32(snapshot->prefix_offsets_chunk +
+				2 * sizeof(uint32_t) * (prefix_row - 1) + sizeof(uint32_t));
+		}
+
+		if (!found) {
+			const char *ret;
+			/* Terminate early with this lo position as the insertion point. */
+			if (pos)
+				*pos = lo;
+
+			if (lo >= snapshot->nr)
+				return NULL;
+
+			ret = get_nth_ref(snapshot, lo);
+			return ret;
+		}
+
+		hi = get_be32(snapshot->prefix_offsets_chunk +
+			      2 * sizeof(uint32_t) * prefix_row + sizeof(uint32_t));
+
+		if (prefix)
+			refname += strlen(prefix);
+	}
+
 	while (lo != hi) {
 		const char *rec;
 		int cmp;
@@ -132,6 +245,16 @@ static int packed_refs_read_offsets(const unsigned char *chunk_start,
 	return 0;
 }
 
+static int packed_refs_read_prefix_offsets(const unsigned char *chunk_start,
+					    size_t chunk_size, void *data)
+{
+	struct snapshot *snapshot = data;
+
+	snapshot->prefix_offsets_chunk = chunk_start;
+	snapshot->prefixes_nr = chunk_size / sizeof(uint64_t);
+	return 0;
+}
+
 void fill_snapshot_v2(struct snapshot *snapshot)
 {
 	uint32_t file_signature, file_version, hash_version;
@@ -163,6 +286,9 @@ void fill_snapshot_v2(struct snapshot *snapshot)
 	read_chunk(cf, CHREFS_CHUNKID_OFFSETS, packed_refs_read_offsets, snapshot);
 	pair_chunk(cf, CHREFS_CHUNKID_REFS, (const unsigned char**)&snapshot->refs_chunk);
 
+	read_chunk(cf, CHREFS_CHUNKID_PREFIX_OFFSETS, packed_refs_read_prefix_offsets, snapshot);
+	pair_chunk(cf, CHREFS_CHUNKID_PREFIX_DATA, (const unsigned char**)&snapshot->prefix_chunk);
+
 	/* TODO: add error checks for invalid chunk combinations. */
 
 cleanup:
@@ -187,6 +313,8 @@ int next_record_v2(struct packed_ref_iterator *iter)
 
 	iter->base.flags = REF_ISPACKED;
 
+	if (iter->cur_prefix)
+		strbuf_addstr(&iter->refname_buf, iter->cur_prefix);
 	strbuf_addstr(&iter->refname_buf, pos);
 	iter->base.refname = iter->refname_buf.buf;
 	pos += strlen(pos) + 1;
@@ -221,9 +349,40 @@ int next_record_v2(struct packed_ref_iterator *iter)
 
 	iter->row++;
 
+	if (iter->row == iter->prefix_row_end && iter->snapshot->prefix_chunk) {
+		size_t prefix_pos = get_be32(iter->snapshot->prefix_offsets_chunk +
+					     2 * sizeof(uint32_t) * iter->prefix_i);
+		iter->cur_prefix = iter->snapshot->prefix_chunk + prefix_pos;
+		iter->prefix_i++;
+		iter->prefix_row_end = get_be32(iter->snapshot->prefix_offsets_chunk +
+						2 * sizeof(uint32_t) * iter->prefix_i + sizeof(uint32_t));
+	}
+
 	return ITER_OK;
 }
 
+void init_iterator_prefix_info(const char *prefix,
+			       struct packed_ref_iterator *iter)
+{
+	struct snapshot *snapshot = iter->snapshot;
+
+	if (snapshot->version != 2 || !snapshot->prefix_chunk) {
+		iter->prefix_row_end = snapshot->nr;
+		return;
+	}
+
+	if (prefix)
+		iter->cur_prefix = find_prefix_location(snapshot, prefix, &iter->prefix_i);
+	else {
+		iter->cur_prefix = snapshot->prefix_chunk;
+		iter->prefix_i = 0;
+	}
+
+	iter->prefix_row_end = get_be32(snapshot->prefix_offsets_chunk +
+					2 * sizeof(uint32_t) * iter->prefix_i +
+					sizeof(uint32_t));
+}
+
 struct write_packed_refs_v2_context {
 	struct packed_ref_store *refs;
 	struct string_list *updates;
-- 
gitgitgadget



* [PATCH 21/30] packed-refs: write prefix chunks
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (19 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 20/30] packed-refs: read optional prefix chunks Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 22/30] packed-backend: create GIT_TEST_PACKED_REFS_VERSION Derrick Stolee via GitGitGadget
                   ` (11 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Now that these chunks are written, the existing tests exercise reading
these prefixes.

TODO: discuss time and space savings over typical approach.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
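The prefix selection in write_packed_entry_v2() below can be summarized as
"everything up to and including the second slash, else up to and including
the first slash, else the whole refname". A minimal standalone sketch of that
rule follows; ref_prefix_len() is a hypothetical helper name used only here,
and the patch computes the same length inline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return how many leading bytes of `refname` would be split off as its
 * prefix. The slash that ends the prefix is counted as part of it.
 */
size_t ref_prefix_len(const char *refname)
{
	const char *slash, *slashslash = NULL;

	assert(refname);

	slash = strchr(refname, '/');
	if (slash)
		slashslash = strchr(slash + 1, '/');

	/* Prefer the second slash when there is one. */
	if (slashslash)
		slash = slashslash;

	/* No slash at all: the whole refname is the prefix. */
	return slash ? (size_t)(slash + 1 - refname) : strlen(refname);
}
```

For example, "refs/tags/v1.0" yields the 10-byte prefix "refs/tags/", so only
"v1.0" is written to the refs chunk for that row. A ref such as
"refs/heads/topic/x" keeps "topic/x" as its stored suffix, since deeper
directory levels are not part of the prefix.
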
 refs/packed-format-v2.c | 103 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/refs/packed-format-v2.c b/refs/packed-format-v2.c
index 0ab277f7ad4..2cd45a5987a 100644
--- a/refs/packed-format-v2.c
+++ b/refs/packed-format-v2.c
@@ -398,6 +398,18 @@ struct write_packed_refs_v2_context {
 	uint64_t *offsets;
 	size_t nr;
 	size_t offsets_alloc;
+
+	int write_prefixes;
+	const char *cur_prefix;
+	size_t cur_prefix_len;
+
+	char **prefixes;
+	uint32_t *prefix_offsets;
+	uint32_t *prefix_rows;
+	size_t prefix_nr;
+	size_t prefixes_alloc;
+	size_t prefix_offsets_alloc;
+	size_t prefix_rows_alloc;
 };
 
 struct write_packed_refs_v2_context *create_v2_context(struct packed_ref_store *refs,
@@ -434,6 +446,56 @@ static int write_packed_entry_v2(const char *refname,
 
 	ALLOC_GROW(ctx->offsets, i + 1, ctx->offsets_alloc);
 
+	if (ctx->write_prefixes) {
+		if (ctx->cur_prefix && starts_with(refname, ctx->cur_prefix)) {
+			/* skip ahead! */
+			refname += ctx->cur_prefix_len;
+			reflen -= ctx->cur_prefix_len;
+		} else {
+			size_t len;
+			const char *slash, *slashslash = NULL;
+			if (ctx->prefix_nr) {
+				/* close out the old prefix. */
+				ctx->prefix_rows[ctx->prefix_nr - 1] = ctx->nr;
+			}
+
+			/* Find the new prefix. */
+			slash = strchr(refname, '/');
+			if (slash)
+				slashslash = strchr(slash + 1, '/');
+			/* If there are two slashes, use that. */
+			slash = slashslash ? slashslash : slash;
+			/*
+			 * If there is at least one slash, use that,
+			 * and include the slash in the string.
+			 * Otherwise, use the end of the ref.
+			 */
+			slash = slash ? slash + 1 : refname + strlen(refname);
+
+			len = slash - refname;
+			ALLOC_GROW(ctx->prefixes, ctx->prefix_nr + 1, ctx->prefixes_alloc);
+			ALLOC_GROW(ctx->prefix_offsets, ctx->prefix_nr + 1, ctx->prefix_offsets_alloc);
+			ALLOC_GROW(ctx->prefix_rows, ctx->prefix_nr + 1, ctx->prefix_rows_alloc);
+
+			if (ctx->prefix_nr)
+				ctx->prefix_offsets[ctx->prefix_nr] = ctx->prefix_offsets[ctx->prefix_nr - 1] + len + 1;
+			else
+				ctx->prefix_offsets[ctx->prefix_nr] = len + 1;
+
+			ctx->prefixes[ctx->prefix_nr] = xstrndup(refname, len);
+			ctx->cur_prefix = ctx->prefixes[ctx->prefix_nr];
+			ctx->prefix_nr++;
+
+			refname += len;
+			reflen -= len;
+			ctx->cur_prefix_len = len;
+		}
+
+		/* Update the last row continually. */
+		ctx->prefix_rows[ctx->prefix_nr - 1] = i + 1;
+	}
+
+
 	/* Write entire ref, including null terminator. */
 	hashwrite(ctx->f, refname, reflen);
 	hashwrite(ctx->f, oid->hash, the_hash_algo->rawsz);
@@ -483,13 +545,54 @@ static int write_refs_chunk_offsets(struct hashfile *f,
 	return 0;
 }
 
+static int write_refs_chunk_prefix_data(struct hashfile *f,
+					void *data)
+{
+	struct write_packed_refs_v2_context *ctx = data;
+	size_t i;
+
+	trace2_region_enter("refs", "prefix-data", the_repository);
+	for (i = 0; i < ctx->prefix_nr; i++) {
+		size_t len = strlen(ctx->prefixes[i]) + 1;
+		hashwrite(f, ctx->prefixes[i], len);
+
+		/* TODO: assert the prefix lengths match the stored offsets? */
+	}
+
+	trace2_region_leave("refs", "prefix-data", the_repository);
+	return 0;
+}
+
+static int write_refs_chunk_prefix_offsets(struct hashfile *f,
+				    void *data)
+{
+	struct write_packed_refs_v2_context *ctx = data;
+	size_t i;
+
+	trace2_region_enter("refs", "prefix-offsets", the_repository);
+	for (i = 0; i < ctx->prefix_nr; i++) {
+		hashwrite_be32(f, ctx->prefix_offsets[i]);
+		hashwrite_be32(f, ctx->prefix_rows[i]);
+	}
+
+	trace2_region_leave("refs", "prefix-offsets", the_repository);
+	return 0;
+}
+
 int write_packed_refs_v2(struct write_packed_refs_v2_context *ctx)
 {
 	unsigned char file_hash[GIT_MAX_RAWSZ];
 
+	ctx->write_prefixes = git_env_bool("GIT_TEST_WRITE_PACKED_REFS_PREFIXES", 1);
+
 	add_chunk(ctx->cf, CHREFS_CHUNKID_REFS, 0, write_refs_chunk_refs);
 	add_chunk(ctx->cf, CHREFS_CHUNKID_OFFSETS, 0, write_refs_chunk_offsets);
 
+	if (ctx->write_prefixes) {
+		add_chunk(ctx->cf, CHREFS_CHUNKID_PREFIX_DATA, 0, write_refs_chunk_prefix_data);
+		add_chunk(ctx->cf, CHREFS_CHUNKID_PREFIX_OFFSETS, 0, write_refs_chunk_prefix_offsets);
+	}
+
 	hashwrite_be32(ctx->f, PACKED_REFS_SIGNATURE);
 	hashwrite_be32(ctx->f, 2);
 	hashwrite_be32(ctx->f, the_hash_algo->format_id);
-- 
gitgitgadget



* [PATCH 22/30] packed-backend: create GIT_TEST_PACKED_REFS_VERSION
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (20 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 21/30] packed-refs: write " Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 23/30] t1409: test with packed-refs v2 Derrick Stolee via GitGitGadget
                   ` (10 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

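When set, the GIT_TEST_PACKED_REFS_VERSION environment variable provides
a default value for the packed-refs file version on writes, taking
precedence over the refs.packedrefsversion config value. When set to "2",
it also adds "packed-v2" to the default ref formats used when
extensions.refFormat is not configured.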

Not all tests pass with GIT_TEST_PACKED_REFS_VERSION=2 because they care
specifically about the content of the packed-refs file. These tests will
be updated in following changes.

To start, though, disable the GIT_TEST_PACKED_REFS_VERSION environment
variable in t3212-ref-formats.sh, since that script already tests both
versions, including upgrade scenarios.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
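The precedence that the write_with_updates() change below seems to implement
can be sketched as a pure function. This is only an illustration under
assumptions: pick_packed_refs_version() and its parameters are hypothetical,
and `fallback` stands in for the extension-based default that the existing
code computes when neither source provides a value.

```c
#include <assert.h>

/*
 * Choose the packed-refs version to write.
 *
 * env_value:    parsed GIT_TEST_PACKED_REFS_VERSION, 0 when unset or "0"
 * have_config:  nonzero if refs.packedrefsversion is configured
 * config_value: the configured version, only meaningful if have_config
 * fallback:     default derived from the enabled ref formats (assumed
 *               to be computed by the caller)
 */
int pick_packed_refs_version(unsigned long env_value, int have_config,
			     int config_value, int fallback)
{
	assert(fallback == 1 || fallback == 2);

	/* A nonzero test variable wins over everything else. */
	if (env_value)
		return (int)env_value;

	/* Otherwise an explicit config value is used. */
	if (have_config)
		return config_value;

	return fallback;
}
```

Because a value of 0 is treated the same as "unset", exporting
GIT_TEST_PACKED_REFS_VERSION=0 in t3212-ref-formats.sh is enough to make that
script fall through to the config and extension defaults.
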
 refs/packed-backend.c  | 3 ++-
 setup.c                | 5 ++++-
 t/t3212-ref-formats.sh | 3 +++
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index ae904de9014..e84f669c42e 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -807,7 +807,8 @@ static int write_with_updates(struct packed_ref_store *refs,
 	}
 	strbuf_release(&sb);
 
-	if (git_config_get_int("refs.packedrefsversion", &version)) {
+	if (!(version = git_env_ulong("GIT_TEST_PACKED_REFS_VERSION", 0)) &&
+	    git_config_get_int("refs.packedrefsversion", &version)) {
 		/*
 		 * Set the default depending on the current extension
 		 * list. Default to version 1 if available, but allow a
diff --git a/setup.c b/setup.c
index 72bfa289ade..a4525732fe9 100644
--- a/setup.c
+++ b/setup.c
@@ -732,8 +732,11 @@ int read_repository_format(struct repository_format *format, const char *path)
 		clear_repository_format(format);
 
 	/* Set default ref_format if no extensions.refFormat exists. */
-	if (!format->ref_format_count)
+	if (!format->ref_format_count) {
 		format->ref_format = REF_FORMAT_FILES | REF_FORMAT_PACKED;
+		if (git_env_ulong("GIT_TEST_PACKED_REFS_VERSION", 0) == 2)
+			format->ref_format |= REF_FORMAT_PACKED_V2;
+	}
 
 	return format->version;
 }
diff --git a/t/t3212-ref-formats.sh b/t/t3212-ref-formats.sh
index 571ba518ef1..5583f16db41 100755
--- a/t/t3212-ref-formats.sh
+++ b/t/t3212-ref-formats.sh
@@ -2,6 +2,9 @@
 
 test_description='test across ref formats'
 
+GIT_TEST_PACKED_REFS_VERSION=0
+export GIT_TEST_PACKED_REFS_VERSION
+
 . ./test-lib.sh
 
 test_expect_success 'extensions.refFormat requires core.repositoryFormatVersion=1' '
-- 
gitgitgadget



* [PATCH 23/30] t1409: test with packed-refs v2
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (21 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 22/30] packed-backend: create GIT_TEST_PACKED_REFS_VERSION Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 24/30] t5312: allow packed-refs v2 format Derrick Stolee via GitGitGadget
                   ` (9 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

t1409-avoid-packing-refs.sh seeks to test that the packed-refs file is
not modified unnecessarily. One way it does this is by creating a
packed-refs file, then munging its contents and verifying that the
munged data remains after other commands.

For packed-refs v1, it suffices to add a line that is similar to a
comment. For packed-refs v2, we cannot even add to the file without
messing up the trailing table of contents of its chunked format.
However, we can manipulate the last bytes that are within the trailing
hash and use 'tail -c 4' to read them.

This makes t1409 pass with GIT_TEST_PACKED_REFS_VERSION=2.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
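For readers unfamiliar with the trick, here is a minimal standalone sketch of
marking and re-checking the last four bytes of a binary file. The scratch file
name `demo.bin` and the function names are illustrative only and are not part
of the patch.

```shell
#!/bin/sh
# Overwrite the last four bytes of a file in place; the size is unchanged
# because conv=notrunc stops dd from truncating the output file.
mark_tail () {
	file=$1 &&
	size=$(wc -c <"$file") &&
	pos=$((size - 4)) &&
	printf "FAKE" | dd of="$file" bs=1 seek="$pos" conv=notrunc 2>/dev/null
}

# Succeed only if the file still ends with the marker bytes.
tail_is_marked () {
	test "$(tail -c 4 "$1")" = "FAKE"
}

# Demo on a scratch file (hypothetical name).
printf "0123456789abcdef" >demo.bin
mark_tail demo.bin
tail_is_marked demo.bin && echo "marked"
```

Any rewrite of the file replaces those trailing bytes, so the marker
disappears exactly when the file is regenerated.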
 t/t1409-avoid-packing-refs.sh | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/t/t1409-avoid-packing-refs.sh b/t/t1409-avoid-packing-refs.sh
index be12fb63506..dc8d58432c8 100755
--- a/t/t1409-avoid-packing-refs.sh
+++ b/t/t1409-avoid-packing-refs.sh
@@ -8,13 +8,29 @@ test_description='avoid rewriting packed-refs unnecessarily'
 # shouldn't upset readers, and it should be omitted if the file is
 # ever rewritten.
 mark_packed_refs () {
-	sed -e "s/^\(#.*\)/\1 t1409 /" .git/packed-refs >.git/packed-refs.new &&
-	mv .git/packed-refs.new .git/packed-refs
+	if test "$GIT_TEST_PACKED_REFS_VERSION" = "2"
+	then
+		size=$(wc -c < .git/packed-refs) &&
+		pos=$(expr $size - 4) &&
+		printf "FAKE" | dd of=".git/packed-refs" bs=1 seek="$pos" conv=notrunc
+	else
+		sed -e "s/^\(#.*\)/\1 t1409 /" .git/packed-refs >.git/packed-refs.new &&
+		mv .git/packed-refs.new .git/packed-refs
+	fi
 }
 
 # Verify that the packed-refs file is still marked.
 check_packed_refs_marked () {
-	grep -q '^#.* t1409 ' .git/packed-refs
+	if test "$GIT_TEST_PACKED_REFS_VERSION" = "2"
+	then
+		size=$(wc -c < .git/packed-refs) &&
+		pos=$(expr $size - 4) &&
+		tail -c 4 .git/packed-refs >actual &&
+		printf "FAKE" >expect &&
+		test_cmp expect actual
+	else
+		grep -q '^#.* t1409 ' .git/packed-refs
+	fi
 }
 
 test_expect_success 'setup' '
-- 
gitgitgadget



* [PATCH 24/30] t5312: allow packed-refs v2 format
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (22 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 23/30] t1409: test with packed-refs v2 Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:35 ` [PATCH 25/30] t5502: add PACKED_REFS_V1 prerequisite Derrick Stolee via GitGitGadget
                   ` (8 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

One test in t3210 uses 'grep' to detect that a ref is written in the
packed-refs file instead of as a loose ref. This does not work when the
packed-refs file is in v2 format, such as when
GIT_TEST_PACKED_REFS_VERSION=2.

Since the test already checks that the loose ref is missing, it suffices
to check that 'git rev-parse' succeeds.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/t3210-pack-refs.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t3210-pack-refs.sh b/t/t3210-pack-refs.sh
index 577f32dc71f..fe6c97d9087 100755
--- a/t/t3210-pack-refs.sh
+++ b/t/t3210-pack-refs.sh
@@ -159,7 +159,7 @@ test_expect_success 'delete ref while another dangling packed ref' '
 test_expect_success 'pack ref directly below refs/' '
 	git update-ref refs/top HEAD &&
 	git pack-refs --all --prune &&
-	grep refs/top .git/packed-refs &&
+	git rev-parse refs/top &&
 	test_path_is_missing .git/refs/top
 '
 
-- 
gitgitgadget



* [PATCH 25/30] t5502: add PACKED_REFS_V1 prerequisite
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (23 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 24/30] t5312: allow packed-refs v2 format Derrick Stolee via GitGitGadget
@ 2022-11-07 18:35 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:36 ` [PATCH 26/30] t3210: require packed-refs v1 for some tests Derrick Stolee via GitGitGadget
                   ` (7 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:35 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The last test in t5502-quickfetch.sh exploits the packed-refs v1 file
format by appending 1000 lines to the packed-refs file. If the
packed-refs file is in the v2 format, this corrupts the file and makes
it unreadable.

Instead of making the test slower, let's ignore it when
GIT_TEST_PACKED_REFS_VERSION=2. The test is really about 'git fetch',
not the packed-refs format. Create a prerequisite in case we want to use
this technique again in the future.

An alternative would be to write those 1000 refs using a different
mechanism, but let's opt for the simpler case for now.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
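The prerequisite boils down to a single comparison. The sketch below is a
standalone illustration (the function name `packed_refs_v1` is made up here)
that uses a string comparison, so an unset or empty variable counts as v1. A
numeric `-ne` comparison would instead make `test` report an error when given
an empty string.

```shell
#!/bin/sh
# Succeed unless the test suite was asked to write packed-refs v2.
# A string comparison treats an unset or empty variable as "not 2".
packed_refs_v1 () {
	test "${GIT_TEST_PACKED_REFS_VERSION-}" != "2"
}

for value in "" 0 1 2
do
	GIT_TEST_PACKED_REFS_VERSION=$value
	if packed_refs_v1
	then
		echo "'$value': v1-only tests run"
	else
		echo "'$value': v1-only tests skipped"
	fi
done
```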
 t/t5502-quickfetch.sh | 2 +-
 t/test-lib.sh         | 4 ++++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/t/t5502-quickfetch.sh b/t/t5502-quickfetch.sh
index b160f8b7fb7..0c4aadebae6 100755
--- a/t/t5502-quickfetch.sh
+++ b/t/t5502-quickfetch.sh
@@ -122,7 +122,7 @@ test_expect_success 'quickfetch should not copy from alternate' '
 
 '
 
-test_expect_success 'quickfetch should handle ~1000 refs (on Windows)' '
+test_expect_success PACKED_REFS_V1 'quickfetch should handle ~1000 refs (on Windows)' '
 
 	git gc &&
 	head=$(git rev-parse HEAD) &&
diff --git a/t/test-lib.sh b/t/test-lib.sh
index 6db377f68b8..a244cd75c06 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -1954,3 +1954,7 @@ test_lazy_prereq FSMONITOR_DAEMON '
 	git version --build-options >output &&
 	grep "feature: fsmonitor--daemon" output
 '
+
+test_lazy_prereq PACKED_REFS_V1 '
+	test "$GIT_TEST_PACKED_REFS_VERSION" != "2"
+'
\ No newline at end of file
-- 
gitgitgadget



* [PATCH 26/30] t3210: require packed-refs v1 for some tests
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (24 preceding siblings ...)
  2022-11-07 18:35 ` [PATCH 25/30] t5502: add PACKED_REFS_V1 prerequisite Derrick Stolee via GitGitGadget
@ 2022-11-07 18:36 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:36 ` [PATCH 27/30] t*: skip packed-refs v2 over http tests Derrick Stolee via GitGitGadget
                   ` (6 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:36 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Three tests in t3210-pack-refs.sh corrupt a packed-refs file to test
that Git properly discovers and handles those failures. These tests
assume that the file is in the v1 format, so add the PACKED_REFS_V1
prereq to skip these tests when GIT_TEST_PACKED_REFS_VERSION=2.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/t3210-pack-refs.sh | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/t/t3210-pack-refs.sh b/t/t3210-pack-refs.sh
index fe6c97d9087..76251dfe05a 100755
--- a/t/t3210-pack-refs.sh
+++ b/t/t3210-pack-refs.sh
@@ -197,7 +197,7 @@ test_expect_success 'notice d/f conflict with existing ref' '
 	test_must_fail git branch foo/bar/baz/lots/of/extra/components
 '
 
-test_expect_success 'reject packed-refs with unterminated line' '
+test_expect_success PACKED_REFS_V1 'reject packed-refs with unterminated line' '
 	cp .git/packed-refs .git/packed-refs.bak &&
 	test_when_finished "mv .git/packed-refs.bak .git/packed-refs" &&
 	printf "%s" "$HEAD refs/zzzzz" >>.git/packed-refs &&
@@ -206,7 +206,7 @@ test_expect_success 'reject packed-refs with unterminated line' '
 	test_cmp expected_err err
 '
 
-test_expect_success 'reject packed-refs containing junk' '
+test_expect_success PACKED_REFS_V1 'reject packed-refs containing junk' '
 	cp .git/packed-refs .git/packed-refs.bak &&
 	test_when_finished "mv .git/packed-refs.bak .git/packed-refs" &&
 	printf "%s\n" "bogus content" >>.git/packed-refs &&
@@ -215,7 +215,7 @@ test_expect_success 'reject packed-refs containing junk' '
 	test_cmp expected_err err
 '
 
-test_expect_success 'reject packed-refs with a short SHA-1' '
+test_expect_success PACKED_REFS_V1 'reject packed-refs with a short SHA-1' '
 	cp .git/packed-refs .git/packed-refs.bak &&
 	test_when_finished "mv .git/packed-refs.bak .git/packed-refs" &&
 	printf "%.7s %s\n" $HEAD refs/zzzzz >>.git/packed-refs &&
-- 
gitgitgadget



* [PATCH 27/30] t*: skip packed-refs v2 over http tests
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (25 preceding siblings ...)
  2022-11-07 18:36 ` [PATCH 26/30] t3210: require packed-refs v1 for some tests Derrick Stolee via GitGitGadget
@ 2022-11-07 18:36 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:36 ` [PATCH 28/30] ci: run GIT_TEST_PACKED_REFS_VERSION=2 in some builds Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:36 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The GIT_TEST_PACKED_REFS_VERSION=2 environment variable helps us test
the packed-refs file format in its v2 version. This variable makes the
Git process act as if the extensions.refFormat config key has
"packed-v2" in its list. This means that if the environment variable is
removed, the repository is in a bad state. This is sufficient for most
test cases.

However, tests that fetch over HTTP appear to lose this environment
variable when executed through the HTTP server. Since the repositories
are created via Git commands in the tests, the packed-refs files end up
in the v2 format, but the server processes do not understand this and
start serving empty payloads since they do not recognize any refs.

The preferred long-term solution would be to ensure that the GIT_TEST_*
environment variable persists into the HTTP server. However, these tests
are not exercising any particularly tricky parts of the packed-refs file
format. It may not be worth the effort to pass the environment variable,
so instead we override it to 0 (with a comment explaining why) in these
tests.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
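The failure mode can be reproduced without an HTTP server. In this sketch,
`env -i` stands in for a server that starts its children with a scrubbed
environment. That is an assumption made for illustration, not a description
of how the test httpd actually launches git-http-backend.

```shell
#!/bin/sh
GIT_TEST_PACKED_REFS_VERSION=2
export GIT_TEST_PACKED_REFS_VERSION

# Command that prints the value a child process sees, or "unset".
show_version='echo "${GIT_TEST_PACKED_REFS_VERSION:-unset}"'

# A normal child inherits the exported variable.
inherited=$(sh -c "$show_version")

# A child started with an empty environment does not.
scrubbed=$(env -i /bin/sh -c "$show_version")

echo "inherited: $inherited"
echo "scrubbed:  $scrubbed"
```

A process in the second situation would read a v2 packed-refs file while
still believing the repository only uses v1.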
 t/t5539-fetch-http-shallow.sh | 7 +++++++
 t/t5541-http-push-smart.sh    | 7 +++++++
 t/t5542-push-http-shallow.sh  | 7 +++++++
 t/t5551-http-fetch-smart.sh   | 7 +++++++
 t/t5558-clone-bundle-uri.sh   | 7 +++++++
 5 files changed, 35 insertions(+)

diff --git a/t/t5539-fetch-http-shallow.sh b/t/t5539-fetch-http-shallow.sh
index 3ea75d34ca0..5e3b4304367 100755
--- a/t/t5539-fetch-http-shallow.sh
+++ b/t/t5539-fetch-http-shallow.sh
@@ -5,6 +5,13 @@ test_description='fetch/clone from a shallow clone over http'
 GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
 export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
 
+# If GIT_TEST_PACKED_REFS_VERSION=2, then the packed-refs file will
+# be written in v2 format without extensions.refFormat=packed-v2. This
+# causes issues for the HTTP server which does not carry over the
+# environment variable to the server process.
+GIT_TEST_PACKED_REFS_VERSION=0
+export GIT_TEST_PACKED_REFS_VERSION
+
 . ./test-lib.sh
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
diff --git a/t/t5541-http-push-smart.sh b/t/t5541-http-push-smart.sh
index fbad2d5ff5e..495437dd3c7 100755
--- a/t/t5541-http-push-smart.sh
+++ b/t/t5541-http-push-smart.sh
@@ -7,6 +7,13 @@ test_description='test smart pushing over http via http-backend'
 GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
 export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
 
+# If GIT_TEST_PACKED_REFS_VERSION=2, then the packed-refs file will
+# be written in v2 format without extensions.refFormat=packed-v2. This
+# causes issues for the HTTP server which does not carry over the
+# environment variable to the server process.
+GIT_TEST_PACKED_REFS_VERSION=0
+export GIT_TEST_PACKED_REFS_VERSION
+
 . ./test-lib.sh
 
 ROOT_PATH="$PWD"
diff --git a/t/t5542-push-http-shallow.sh b/t/t5542-push-http-shallow.sh
index c2cc83182f9..c47b18b9faa 100755
--- a/t/t5542-push-http-shallow.sh
+++ b/t/t5542-push-http-shallow.sh
@@ -5,6 +5,13 @@ test_description='push from/to a shallow clone over http'
 GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
 export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
 
+# If GIT_TEST_PACKED_REFS_VERSION=2, then the packed-refs file will
+# be written in v2 format without extensions.refFormat=packed-v2. This
+# causes issues for the HTTP server which does not carry over the
+# environment variable to the server process.
+GIT_TEST_PACKED_REFS_VERSION=0
+export GIT_TEST_PACKED_REFS_VERSION
+
 . ./test-lib.sh
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
diff --git a/t/t5551-http-fetch-smart.sh b/t/t5551-http-fetch-smart.sh
index 6a38294a476..61f2e90eabe 100755
--- a/t/t5551-http-fetch-smart.sh
+++ b/t/t5551-http-fetch-smart.sh
@@ -4,6 +4,13 @@ test_description='test smart fetching over http via http-backend'
 GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
 export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
 
+# If GIT_TEST_PACKED_REFS_VERSION=2, then the packed-refs file will
+# be written in v2 format without extensions.refFormat=packed-v2. This
+# causes issues for the HTTP server which does not carry over the
+# environment variable to the server process.
+GIT_TEST_PACKED_REFS_VERSION=0
+export GIT_TEST_PACKED_REFS_VERSION
+
 . ./test-lib.sh
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
diff --git a/t/t5558-clone-bundle-uri.sh b/t/t5558-clone-bundle-uri.sh
index 9155f31fa2c..3e35322155e 100755
--- a/t/t5558-clone-bundle-uri.sh
+++ b/t/t5558-clone-bundle-uri.sh
@@ -2,6 +2,13 @@
 
 test_description='test fetching bundles with --bundle-uri'
 
+# If GIT_TEST_PACKED_REFS_VERSION=2, then the packed-refs file will
+# be written in v2 format without extensions.refFormat=packed-v2. This
+# causes issues for the HTTP server which does not carry over the
+# environment variable to the server process.
+GIT_TEST_PACKED_REFS_VERSION=0
+export GIT_TEST_PACKED_REFS_VERSION
+
 . ./test-lib.sh
 
 test_expect_success 'fail to clone from non-existent file' '
-- 
gitgitgadget



* [PATCH 28/30] ci: run GIT_TEST_PACKED_REFS_VERSION=2 in some builds
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (26 preceding siblings ...)
  2022-11-07 18:36 ` [PATCH 27/30] t*: skip packed-refs v2 over http tests Derrick Stolee via GitGitGadget
@ 2022-11-07 18:36 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:36 ` [PATCH 29/30] p1401: create performance test for ref operations Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:36 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The linux-TEST-vars CI build helps us check that certain opt-in features
are still exercised in at least one environment. The new
GIT_TEST_PACKED_REFS_VERSION environment variable now passes the test
suite when set to "2", so add this to that list of variables.

This provides nearly the same coverage of the v2 format as we had in the
v1 format.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 ci/run-build-and-tests.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index 8ebff425967..e93574ca262 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -30,6 +30,7 @@ linux-TEST-vars)
 	export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=master
 	export GIT_TEST_WRITE_REV_INDEX=1
 	export GIT_TEST_CHECKOUT_WORKERS=2
+	export GIT_TEST_PACKED_REFS_VERSION=2
 	;;
 linux-clang)
 	export GIT_TEST_DEFAULT_HASH=sha1
-- 
gitgitgadget



* [PATCH 29/30] p1401: create performance test for ref operations
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (27 preceding siblings ...)
  2022-11-07 18:36 ` [PATCH 28/30] ci: run GIT_TEST_PACKED_REFS_VERSION=2 in some builds Derrick Stolee via GitGitGadget
@ 2022-11-07 18:36 ` Derrick Stolee via GitGitGadget
  2022-11-07 18:36 ` [PATCH 30/30] refs: skip hashing when writing packed-refs v2 Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:36 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

TBD

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/perf/p1401-ref-operations.sh | 47 ++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)
 create mode 100755 t/perf/p1401-ref-operations.sh

diff --git a/t/perf/p1401-ref-operations.sh b/t/perf/p1401-ref-operations.sh
new file mode 100755
index 00000000000..1c372ba0ee8
--- /dev/null
+++ b/t/perf/p1401-ref-operations.sh
@@ -0,0 +1,47 @@
+#!/bin/sh
+
+test_description="Tests performance of ref operations"
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test_perf 'git pack-refs (v1)' '
+	git commit --allow-empty -m "change one ref" &&
+	git pack-refs --all
+'
+
+test_perf 'git for-each-ref (v1)' '
+	git for-each-ref --format="%(refname)" >/dev/null
+'
+
+test_perf 'git for-each-ref prefix (v1)' '
+	git for-each-ref --format="%(refname)" refs/tags/ >/dev/null
+'
+
+test_expect_success 'configure packed-refs v2' '
+	git config core.repositoryFormatVersion 1 &&
+	git config --add extensions.refFormat files &&
+	git config --add extensions.refFormat packed &&
+	git config --add extensions.refFormat packed-v2 &&
+	git config refs.packedRefsVersion 2 &&
+	git commit --allow-empty -m "change one ref" &&
+	git pack-refs --all &&
+	test_copy_bytes 16 .git/packed-refs | xxd >actual &&
+	grep PREF actual
+'
+
+test_perf 'git pack-refs (v2)' '
+	git commit --allow-empty -m "change one ref" &&
+	git pack-refs --all
+'
+
+test_perf 'git for-each-ref (v2)' '
+	git for-each-ref --format="%(refname)" >/dev/null
+'
+
+test_perf 'git for-each-ref prefix (v2)' '
+	git for-each-ref --format="%(refname)" refs/tags/ >/dev/null
+'
+
+test_done
-- 
gitgitgadget



* [PATCH 30/30] refs: skip hashing when writing packed-refs v2
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (28 preceding siblings ...)
  2022-11-07 18:36 ` [PATCH 29/30] p1401: create performance test for ref operations Derrick Stolee via GitGitGadget
@ 2022-11-07 18:36 ` Derrick Stolee via GitGitGadget
  2022-11-09 15:15 ` [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee
                   ` (2 subsequent siblings)
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-11-07 18:36 UTC (permalink / raw)
  To: git; +Cc: jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The 'skip_hash' option in 'struct hashfile' indicates that we want to
use the hashfile API as a buffered writer, and not use the hash function
to create a trailing hash. We still write a trailing null hash to
indicate that we do not have a checksum at the end. This feature is
enabled for index writes using the 'index.computeHash' config key.

Create a similar (currently hidden) option for the packed-refs v2 file
format: refs.hashPackedRefs. This defaults to false because performance
is compared to the packed-refs v1 file format, which does not have a
checksum anywhere.

This change results in improvements to p1401 when using a repository
with a 42 MB packed-refs file (600,000+ refs).

Test                        HEAD~1            HEAD
--------------------------------------------------------------------
1401.1: git pack-refs (v1)  0.38(0.31+0.52)   0.37(0.28+0.52) -2.6%
1401.5: git pack-refs (v2)  0.39(0.33+0.52)   0.30(0.28+0.46) -23.1%

Note that these tests update a ref and then repack the packed-refs file.
The following benchmarks are from a hyperfine experiment that only ran
the 'git pack-refs --all' command for the two formats, but also compared
the effect when refs.hashPackedRefs=true.

Benchmark 1: v1
  Time (mean ± σ):     163.5 ms ±  18.1 ms    [User: 117.8 ms, System: 38.1 ms]
  Range (min … max):   131.3 ms … 190.4 ms    50 runs

Benchmark 2: v2-no-hash
  Time (mean ± σ):      95.8 ms ±  15.1 ms    [User: 72.5 ms, System: 23.0 ms]
  Range (min … max):    82.9 ms … 131.2 ms    50 runs

Benchmark 3: v2-hashing
  Time (mean ± σ):     100.8 ms ±  16.4 ms    [User: 77.2 ms, System: 23.1 ms]
  Range (min … max):    83.0 ms … 131.1 ms    50 runs

Summary
  'v2-no-hash' ran
    1.05 ± 0.24 times faster than 'v2-hashing'
    1.71 ± 0.33 times faster than 'v1'

This case of repeatedly rewriting the same refs seems to demonstrate
a smaller improvement than the p1401 test. However, the overall
reduction from v1 matches the expected reduction in file size. In my
tests, the 42 MB packed-refs (v1) file was compacted to 28 MB in the v2
format.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
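The percentages and ratios quoted above can be re-derived from the raw
numbers. This is only an arithmetic check, written as a small sketch; the
helper names `pct_change` and `ratio` are made up for the example.

```shell
#!/bin/sh
# Percentage change from an old timing to a new timing.
pct_change () {
	awk -v old="$1" -v new="$2" \
		'BEGIN { printf "%.1f%%\n", (new - old) / old * 100 }'
}

# Ratio of two values, rounded to two decimals.
ratio () {
	awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

pct_change 0.38 0.37	# p1401.1, git pack-refs (v1)
pct_change 0.39 0.30	# p1401.5, git pack-refs (v2)
ratio 163.5 95.8	# v1 mean / v2-no-hash mean
ratio 100.8 95.8	# v2-hashing mean / v2-no-hash mean
ratio 28 42		# v2 file size / v1 file size (MB)
ratio 95.8 163.5	# v2-no-hash mean / v1 mean
```

The last two lines put the file-size ratio (about 0.67) next to the runtime
ratio (about 0.59), which is the comparison the final paragraph refers to.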
 refs/packed-format-v2.c        | 7 +++++++
 t/perf/p1401-ref-operations.sh | 5 +++++
 2 files changed, 12 insertions(+)

diff --git a/refs/packed-format-v2.c b/refs/packed-format-v2.c
index 2cd45a5987a..ada34bf9bf0 100644
--- a/refs/packed-format-v2.c
+++ b/refs/packed-format-v2.c
@@ -417,6 +417,7 @@ struct write_packed_refs_v2_context *create_v2_context(struct packed_ref_store *
 						       struct strbuf *err)
 {
 	struct write_packed_refs_v2_context *ctx;
+	int do_skip_hash;
 	CALLOC_ARRAY(ctx, 1);
 
 	ctx->refs = refs;
@@ -430,6 +431,12 @@ struct write_packed_refs_v2_context *create_v2_context(struct packed_ref_store *
 	}
 
 	ctx->f = hashfd(refs->tempfile->fd, refs->tempfile->filename.buf);
+
+	/* Hashing defaults to off: skip_hash unless explicitly enabled. */
+	if (git_config_get_maybe_bool("refs.hashpackedrefs", &do_skip_hash) ||
+	    !do_skip_hash)
+		ctx->f->skip_hash = 1;
+
 	ctx->cf = init_chunkfile(ctx->f);
 
 	return ctx;
diff --git a/t/perf/p1401-ref-operations.sh b/t/perf/p1401-ref-operations.sh
index 1c372ba0ee8..0b88a2f531a 100755
--- a/t/perf/p1401-ref-operations.sh
+++ b/t/perf/p1401-ref-operations.sh
@@ -36,6 +36,11 @@ test_perf 'git pack-refs (v2)' '
 	git pack-refs --all
 '
 
+test_perf 'git pack-refs (v2;hashing)' '
+	git commit --allow-empty -m "change one ref" &&
+	git -c refs.hashPackedRefs=true pack-refs --all
+'
+
 test_perf 'git for-each-ref (v2)' '
 	git for-each-ref --format="%(refname)" >/dev/null
 '
-- 
gitgitgadget


* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (29 preceding siblings ...)
  2022-11-07 18:36 ` [PATCH 30/30] refs: skip hashing when writing packed-refs v2 Derrick Stolee via GitGitGadget
@ 2022-11-09 15:15 ` Derrick Stolee
  2022-11-11 23:28 ` Elijah Newren
  2022-11-28 18:56 ` Han-Wen Nienhuys
  32 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee @ 2022-11-09 15:15 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: jrnieder, gitster, me, l.s.r, mhagger, hanwen, hanwenn

On 11/7/2022 1:35 PM, Derrick Stolee via GitGitGadget wrote:
> This RFC is quite long, but the length seemed necessary to actually provide
> an end-to-end implementation that demonstrates the packed-refs v2 format
> along with test coverage (via the new GIT_TEST_PACKED_REFS_VERSION
> variable).

Apologies. I had intended to CC a long list of people, but I messed up
the "cc:" lines on my GGG PR. Appropriate people are CC'd. Please add
anyone who I might have missed.

Thanks,
-Stolee


* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (30 preceding siblings ...)
  2022-11-09 15:15 ` [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee
@ 2022-11-11 23:28 ` Elijah Newren
  2022-11-14  0:07   ` Derrick Stolee
  2022-11-28 18:56 ` Han-Wen Nienhuys
  32 siblings, 1 reply; 56+ messages in thread
From: Elijah Newren @ 2022-11-11 23:28 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, jrnieder, Derrick Stolee

On Mon, Nov 7, 2022 at 11:01 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Introduction
> ============
>
> I became interested in our packed-ref format based on the asymmetry between
> ref updates and ref deletions: if we delete a packed ref, then the
> packed-refs file needs to be rewritten. Compared to writing a loose ref,
> this is an O(N) cost instead of O(1).
>
> In this way, I set out with some goals:
>
>  * (Primary) Make packed ref deletions be nearly as fast as loose ref
>    updates.

Performance is always nice.  :-)

>  * (Secondary) Allow using a packed ref format for all refs, dropping loose
>    refs and creating a clear way to snapshot all refs at a given point in
>    time.

Is this secondary goal the actual goal you have, or just the
implementation by which you get the real underlying goal?

To me, it appears that such a capability would solve both (a) D/F
conflict problems (i.e. the ability to simultaneously have a
refs/heads/feature and refs/heads/feature/shiny ref), and (b) case
sensitivity issues in refnames (i.e. inability of some users to work
with both a refs/heads/feature and a refs/heads/FeAtUrE, due to
constraints of their filesystem and the loose storage mechanism).  Are
either of those the goal you are trying to achieve (I think both would
be really nice, more so than the performance goal you have), or is
there another?

> I also had one major non-goal to keep things focused:
>
>  * (Non-goal) Update the reflog format.
>
> After carefully considering several options, it seemed that there are two
> solutions that can solve this effectively:
>
>  1. Wait for reftable to be integrated into Git.
>  2. Update the packed-refs backend to have a stacked version.
>
> The reftable work seems currently dormant. The format is pretty complicated
> and I have a difficult time seeing a way forward for it to be fully
> integrated into Git. Personally, I'd prefer a more incremental approach with
> formats that are built for a basic filesystem. During the process, we can
> create APIs within Git that can benefit other file formats within Git.
>
> Further, there is a simpler model that satisfies my primary goal without the
> complication required for the secondary goal. Suppose we create a stacked
> packed-refs file but only have two layers: the first (base) layer is created
> when git pack-refs collapses the full stack and adds the loose ref updates
> to the packed-refs file; the second (top) layer contains only ref deletions
> (allowing null OIDs to indicate a deleted ref). Then, ref deletions would
> only need to rewrite that top layer, making ref deletions take O(deletions)
> time instead of O(all refs) time. With a reasonable schedule to squash the
> packed-refs stack, this would be a dramatic improvement. (A prototype
> implementation showed that updating a layer of 1,000 deletions takes only
> twice the time as writing a single loose ref.)

Makes sense.  If a ref is re-introduced after deletion, then do you
remove it from the deletion layer and then write the single loose ref?
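To make the two-layer model concrete, here is a toy sketch of how a lookup
could consult a deletions layer before the base layer. Everything in it is
hypothetical: the file names `base-layer` and `top-layer`, the text format of
`<oid> <refname>` lines, and appending (rather than rewriting) the top layer
are simplifications for illustration, not the format proposed in the series.

```shell
#!/bin/sh
NULL_OID=0000000000000000000000000000000000000000

# Print the OID recorded for refname $2 in layer file $1, if any.
layer_lookup () {
	awk -v ref="$2" '$2 == ref { print $1; exit }' "$1"
}

# Resolve $1 against the top (deletions) layer first, then the base layer.
resolve_ref () {
	oid=$(layer_lookup top-layer "$1")
	if test "$oid" = "$NULL_OID"
	then
		return 1	# deleted in the top layer
	fi
	oid=$(layer_lookup base-layer "$1")
	test -n "$oid" && echo "$oid"
}

# Deleting a ref touches only the small top layer, not the base layer.
delete_ref () {
	echo "$NULL_OID $1" >>top-layer
}

printf "%s\n" \
	"1111111111111111111111111111111111111111 refs/heads/main" \
	"2222222222222222222222222222222222222222 refs/heads/topic" >base-layer
: >top-layer

delete_ref refs/heads/topic
resolve_ref refs/heads/main
resolve_ref refs/heads/topic || echo "refs/heads/topic is deleted"
```

In this toy model, re-creating a deleted ref would also require dropping its
entry from the top layer, which is the case the question above asks about.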

> If we want to satisfy the secondary goal of passing all ref updates through
> the packed storage, then more complicated layering would be necessary. The
> point of bringing this up is that we have incremental goals along the way to
> that final state that give us good stopping points to test the benefits of
> each step.

I like the incremental plan.  Your primary goal perhaps benefits
hosting providers the most, while the second appears to me to be an
interesting usability improvement (some of my users might argue it's
even a bugfix) that would affect users with far fewer refs as well.
So, lots of benefits and we get some along the way to the final plan.

> Stacking the packed-refs format introduces several interesting strategy
> points that are complicated to resolve. Before we can do that, we first need
> to establish a way to modify the ref format of a Git repository. Hence, we
> need a new extension for the ref formats.
>
> To simplify the first update to the ref formats, it seemed better to add a
> new file format version to the existing packed-refs file format. This format
> has the exact lock/write/rename mechanics of the current packed-refs format,
> but uses a file format that structures the information in a more compact
> way. It uses the chunk-format API, with some tweaks. This format update is
> useful to the final goal of a stacked packed-refs API, since each layer will
> have faster reads and writes. The main reason to do this first is that it is
> much simpler to understand the value-add (smaller files means faster
> performance).
>
>
> RFC Organization
> ================
>
> This RFC is quite long, but the length seemed necessary to actually provide
> an end-to-end implementation that demonstrates the packed-refs v2 format
> along with test coverage (via the new GIT_TEST_PACKED_REFS_VERSION
> variable).
>
> For convenience, I've broken each section of the full RFC into parts, which
> resembles how I intend to submit the pieces for full review. These parts are
> available as pull requests in my fork, but here is a breakdown:


> Part I: Optionally hash the index
> =================================
>
> [1] https://github.com/derrickstolee/git/pull/23 Packed-refs v2 Part I:
> Optionally hash the index (Patches 1-2)
>
> The chunk-format API uses the hashfile API as a buffered writer, but all
> existing formats that use the chunk-format API also have a trailing hash as
> part of the format. Since the packed-refs file has a critical path involving
> its write speed (deleting a packed ref), it seemed important to allow
> apples-to-apples comparison between the v1 and v2 format by skipping the
> hashing. This is later toggled by a config option.
>
> In this part, the focus is on allowing the hashfile API to ignore updating
> the hash during the buffered writes. We've been using this in microsoft/git
> to optionally speed up index writes, which patch 2 introduces here. The file
> format instead writes a null OID which would look like a corrupt file to an
> older 'git fsck'. Before submitting a full version, I would update 'git
> fsck' to ignore a null OID in all of our file formats that include a
> trailing hash. Since the index is more short-lived than other formats (such
> as pack-files) this trailing hash is less useful. The write time is also
> critical as the performance tests demonstrate.

This feels like a diversion from your goals.  Should it really be up-front?

Reading through the patches, the first patch does appear to be
necessary but isn't well motivated from the cover letter.  The second
patch seems orthogonal to your series, though it is really nice to see
index writes dropping almost to half the time.

> Part II: Create extensions.refFormat
> ====================================
>
> [2] https://github.com/derrickstolee/git/pull/24 Packed-refs v2 Part II:
> create extensions.refFormat (Patches 3-7)
>
> This part is a critical concept that has yet to be defined in the Git
> codebase. We have no way to incrementally modify the ref format. Since refs
> are so critical, we cannot add an optionally-understood layer on top (like
> we did with the multi-pack-index and commit-graph files). The reftable draft
> [6] proposes the same extension name (extensions.refFormat) but focuses
> instead on only a single value. This means that the reftable must be defined
> at git init or git clone time and cannot be upgraded from the files backend.
>
> In this RFC, I propose a different model that allows for more customization
> and incremental updates. The extensions.refFormat config key is multi-valued
> and defaults to the list of files and packed.

This last sentence doesn't parse that well for me.  Perhaps "...and
defaults to a combination of 'files' and 'packed', meaning supporting
both loose refs and packed refs"?
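
To make the multi-valued idea concrete, here is a minimal sketch of how
the values could be folded into a bitmask. The flag names and the
ref_format_from_values() helper are hypothetical and not taken from the
series; only the value strings ('files', 'packed', 'packed-v2') and the
default come from the cover letter. Rejecting unknown values is an
assumption, in line with how unknown extensions are normally treated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag names; the series may use different identifiers. */
enum ref_format_flags {
	REF_FORMAT_FILES     = 1 << 0,	/* loose refs */
	REF_FORMAT_PACKED    = 1 << 1,	/* packed-refs v1 */
	REF_FORMAT_PACKED_V2 = 1 << 2	/* packed-refs v2 */
};

/* Map one configured value to its flag, or return 0 if unrecognized. */
static unsigned ref_format_flag(const char *value)
{
	if (!strcmp(value, "files"))
		return REF_FORMAT_FILES;
	if (!strcmp(value, "packed"))
		return REF_FORMAT_PACKED;
	if (!strcmp(value, "packed-v2"))
		return REF_FORMAT_PACKED_V2;
	return 0;
}

/*
 * Combine all values of the multi-valued key into one bitmask.
 * An empty list yields the default of 'files' plus 'packed'.
 * Returns 0 if any value is unknown (assumed to be an error).
 */
static unsigned ref_format_from_values(const char **values, size_t nr)
{
	unsigned mask = 0;
	size_t i;

	assert(values || !nr);
	if (!nr)
		return REF_FORMAT_FILES | REF_FORMAT_PACKED;
	for (i = 0; i < nr; i++) {
		unsigned flag = ref_format_flag(values[i]);
		if (!flag)
			return 0;
		mask |= flag;
	}
	return mask;
}
```

With this shape, "may Git write a v2 packed-refs file?" becomes a single
test of the REF_FORMAT_PACKED_V2 bit, and combinations that Git does not
want to support can be rejected in one place.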

> In the context of this RFC,
> the intention is to be able to add packed-v2 so the list of all three values
> would allow Git to write and read either file format version (v1 or v2). In
> the larger scheme, the extension could allow restricting to only loose refs
> (just files) or only packed-refs (just packed) or even later when reftable
> is complete, files and reftable could mean that loose refs are the primary
> ref storage, but the reftable format serves as a drop-in replacement for the
> packed-refs file. Not all combinations need to be understood by Git, but
> having them available as an option could be useful for flexibility,
> especially when trying to upgrade existing repositories to new formats.
>
> In the future, beyond the scope of this RFC, it would be good to add a
> stacked value that allows a stack of files in packed-refs format (whose
> version is specified by the packed or packed-v2 values) so we can further
> speed up writes to the packed layer. Depending on how well that works, we
> could focus on speeding up ref deletions or sending all ref writes straight
> to the packed-refs layer. With the option to keep the loose refs storage, we
> have flexibility to explore that space incrementally when we have time to
> get to it.
>
>
> Part III: Allow a trailing table-of-contents in the chunk-format API
> ====================================================================
>
> [3] https://github.com/derrickstolee/git/pull/25 Packed-refs v2 Part III:
> trailing table of contents in chunk-format (Patches 8-17)
>
> In order to optimize the write speed of the packed-refs v2 file format, we
> want to write immediately to the file as we stream existing refs from the
> current refs. The current chunk-format API requires computing the chunk
> lengths in advance, which can slow down the write and take more memory than
> necessary. Using a trailing table of contents solves this problem, and was
> recommended earlier [7]. We just didn't have enough evidence to justify the
> work to update the existing chunk formats. Here, we update the API in
> advance of using it in the packed-refs v2 format.
>
> We could consider updating the commit-graph and multi-pack-index formats to
> use trailing table of contents, but it requires a version bump. That might
> be worth it in the case of the commit-graph where computing the size of the
> changed-path Bloom filters chunk requires a lot of memory at the moment.
> After this chunk-format API update is reviewed and merged, we can pursue
> those directions more closely. We would want to investigate the formats more
> carefully to see if we want to update the chunks themselves as well as some
> header information.

I like how you point out additional benefits the series could provide,
but leave them out.  Perhaps do the same with the optional index
hashing in patch 2?
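
For readers less familiar with the idea, a trailing table of contents
could look roughly like the sketch below. This is an in-memory
illustration only: struct toc_writer and its helpers are hypothetical
and are not the chunk-format API. The row layout (4-byte ID, 8-byte
offset, terminating row with ID 0) is an assumption loosely modeled on
the existing leading table of contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CHUNKS 8

/* Hypothetical in-memory writer; not the chunk-format API itself. */
struct toc_writer {
	unsigned char *buf;		/* destination buffer */
	size_t cap;			/* capacity of buf */
	size_t len;			/* bytes written so far */
	uint32_t ids[MAX_CHUNKS];	/* chunk IDs in write order */
	uint64_t offsets[MAX_CHUNKS];	/* start offset of each chunk */
	size_t nr;			/* chunks started so far */
};

static void put_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

static void put_be64(unsigned char *out, uint64_t v)
{
	int i;

	for (i = 7; i >= 0; i--) {
		out[i] = (unsigned char)(v & 0xff);
		v >>= 8;
	}
}

/* Begin a chunk: only its start offset is recorded, never its length. */
static void start_chunk(struct toc_writer *w, uint32_t id)
{
	assert(w->nr < MAX_CHUNKS);
	w->ids[w->nr] = id;
	w->offsets[w->nr] = w->len;
	w->nr++;
}

/* Stream data into the current chunk as it becomes available. */
static void write_data(struct toc_writer *w, const void *data, size_t n)
{
	assert(w->len + n <= w->cap);
	memcpy(w->buf + w->len, data, n);
	w->len += n;
}

/*
 * Append the table of contents after the last chunk: one (id, offset)
 * row per chunk, then a terminating row (id 0) whose offset marks the
 * end of the chunk data. Returns the offset where the table starts.
 */
static size_t finish_with_toc(struct toc_writer *w)
{
	size_t toc_start = w->len, i;
	unsigned char row[12];

	for (i = 0; i <= w->nr; i++) {
		put_be32(row, i < w->nr ? w->ids[i] : 0);
		put_be64(row + 4, i < w->nr ? w->offsets[i] : toc_start);
		write_data(w, row, sizeof(row));
	}
	return toc_start;
}
```

The point of the layout is that start_chunk() never needs a length, so
refs can be streamed straight to disk. How a reader locates the start of
the table (for example, from a chunk count stored elsewhere) is left out
of this sketch.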

> Part IV: Abstract some parts of the v1 file format
> ==================================================
>
> [4] https://github.com/derrickstolee/git/pull/26 Packed-refs v2 Part IV:
> abstract some parts of the v1 file format (Patches 18-21)
>
> These patches move the part of the refs/packed-backend.c file that deals with
> the specifics of the packed-refs v1 file format into a new file:
> refs/packed-format-v1.c. This also creates an abstraction layer that will
> allow inserting the v2 format more easily.
>
> One thing that doesn't exist currently is a documentation file describing
> the packed-refs file format. I would add that file in this part before
> submitting it for full review. (I also haven't written the file format doc
> for the packed-refs v2 format, either.)

Sounds like another win-win opportunity to get someone from the
community to contribute.  :-)  ...Or maybe that's not the best
strategy, since recent empirical evidence suggests that trick doesn't
work.  Oh well, it was worth a shot.

> Part V: Implement the v2 file format
> ====================================
>
> [5] https://github.com/derrickstolee/git/pull/27 Packed-refs v2 Part V: the
> v2 file format (Patches 22-35)
>
> This is the real meat of the work. Perhaps there are ways to split it
> further, but for now this is what I have ready. The very last patch does a
> complete performance comparison for a repo with many refs.
>
> The format is not yet documented, but is broken up into these pieces:
>
>  1. The refs data chunk stores the same data as the packed-refs file, but
>     each ref is broken down as follows: the ref name (with trailing zero),
>     the OID for the ref in its raw bytes, and (if necessary) the peeled OID
>     for the ref in its raw bytes. The refs are sorted lexicographically.
>
>  2. The ref offsets chunk is a single column of 64-bit offsets into the refs
>     chunk indicating where each ref starts. The most-significant bit of that
>     value indicates whether or not there is a peeled OID.
>
>  3. The prefix data chunk lists a set of ref prefixes (currently writes only
>     allow depth-2 prefixes, such as refs/heads/ and refs/tags/). When
>     present, these prefixes are written in this chunk and not in the refs
>     data chunk. The prefixes are sorted lexicographically.
>
>  4. The prefix offset chunk has two 32-bit integer columns. The first column
>     stores the offset within the prefix data chunk to the start of the
>     prefix string. The second column points to the row position for the
>     first ref that has name greater than this prefix (the 0th prefix is
>     assumed to start at row 0, so we can interpret the prefix range from
>     row[i-1] and row[i]).
>
> Between using raw OIDs and storing the depth-2 prefixes only once, this
> format compresses the file to ~60% of its v1 size. (The format allows not
> writing the prefix chunks, and the prefix chunks are implemented after the
> basics of the ref chunks are complete.)
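
The offset encoding in item 2 and the record layout in item 1 can be
sketched as follows. The macro and function names are hypothetical, and
the byte order of the on-disk value is not addressed here; only the
"most-significant bit marks a peeled OID" rule and the record contents
come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Top bit of a 64-bit ref offset: set when a peeled OID follows the OID. */
#define REF_OFFSET_PEELED (UINT64_C(1) << 63)

/* Pack an offset into the refs data chunk together with the peeled flag. */
static uint64_t encode_ref_offset(uint64_t offset, int has_peeled)
{
	assert(!(offset & REF_OFFSET_PEELED));	/* offset must fit in 63 bits */
	return has_peeled ? (offset | REF_OFFSET_PEELED) : offset;
}

/* Recover the plain offset by masking off the flag bit. */
static uint64_t ref_offset(uint64_t entry)
{
	return entry & ~REF_OFFSET_PEELED;
}

/* Report whether the entry carries a peeled OID. */
static int ref_has_peeled(uint64_t entry)
{
	return (entry & REF_OFFSET_PEELED) != 0;
}

/*
 * Size of one record in the refs data chunk: the ref name, its
 * trailing zero byte, the raw OID, and optionally the raw peeled OID.
 */
static size_t ref_record_size(size_t name_len, size_t rawsz, int has_peeled)
{
	return name_len + 1 + rawsz * (has_peeled ? 2 : 1);
}
```

For an 11-byte name and a 20-byte raw OID, ref_record_size() gives 32
bytes without a peeled OID and 52 bytes with one.
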
>
> The write times are reduced in a similar fraction to the size difference.
> Reads are sped up somewhat, and we have the potential to do a ref count by
> prefix much faster by doing a binary search for the start and end of the
> prefix and then subtracting the row positions instead of scanning the file
> between to count refs.
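
One way to picture the count-by-prefix idea, assuming the lookup goes
through the prefix offset chunk described in item 4: binary search the
sorted prefixes, then subtract neighboring row positions. The arrays and
function names below are hypothetical stand-ins for the on-disk chunks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Binary search the sorted prefix list; return the index or -1. */
static long find_prefix(const char **prefixes, size_t nr, const char *want)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = strcmp(prefixes[mid], want);

		if (!cmp)
			return (long)mid;
		if (cmp < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return -1;
}

/*
 * row_end[i] is the row position of the first ref after prefix i, so
 * prefix i covers rows [row_end[i-1], row_end[i]), with the 0th prefix
 * starting at row 0. The count is a subtraction; no refs are scanned.
 */
static uint32_t count_refs_with_prefix(const char **prefixes,
				       const uint32_t *row_end,
				       size_t nr, const char *want)
{
	long i;

	assert(!nr || (prefixes && row_end));
	i = find_prefix(prefixes, nr, want);
	if (i < 0)
		return 0;
	return row_end[i] - (i ? row_end[i - 1] : 0);
}
```

The cost is logarithmic in the number of prefixes and independent of the
number of refs under the prefix.
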
>
>
> Relationship to Reftable
> ========================
>
> I mentioned earlier that I had considered using reftable as a way to achieve
> the stated goals. With the current state of that work, I'm not confident
> that it is the right approach here.
>
> My main worry is that the reftable is more complicated than we need for a
> typical Git repository that is based on a typical filesystem. This makes
> testing the format very critical, and we seem to not be near reaching that
> approach. The v2 format here is very similar to existing Git file formats
> since it uses the chunk-format API. This means that the amount of code
> custom to just the v2 format is quite small.
>
> As mentioned, the current extension plan [6] only allows reftable or files
> and does not allow for a mix of both. This RFC introduces the possibility
> that both could co-exist. Using that multi-valued approach means that I'm
> able to test the v2 packed-refs file format almost as well as the v1 file
> format within this RFC. (More tests need to be added that are specific to
> this format, but I'm waiting for confirmation that this is an acceptable
> direction.) At the very least, this multi-valued approach could be used as a
> way to allow using the reftable format as a drop-in replacement for the
> packed-refs file, as well as upgrading an existing repo to use reftable.
> That might even help the integration process to allow the reftable format to
> be tested at least by some subset of tests instead of waiting for a full
> test suite update.

Thanks for providing this background; I also like how it potentially
makes it easier to adopt reftable in the future.

> I'm interested to hear from people more involved in the reftable work to see
> the status of that project and how it matches or differs from my
> perspective.

That wouldn't be me, but I appreciate the comparisons to help me
orient where things are.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 02/30] read-cache: add index.computeHash config option
  2022-11-07 18:35 ` [PATCH 02/30] read-cache: add index.computeHash config option Derrick Stolee via GitGitGadget
@ 2022-11-11 23:31   ` Elijah Newren
  2022-11-14 16:30     ` Derrick Stolee
  2022-11-17 16:13   ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 56+ messages in thread
From: Elijah Newren @ 2022-11-11 23:31 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, jrnieder, Derrick Stolee

On Mon, Nov 7, 2022 at 10:48 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <derrickstolee@github.com>
>
> The previous change allowed skipping the hashing portion of the
> hashwrite API, using it instead as a buffered write API. Disabling the
> hashwrite can be particularly helpful when the write operation is in a
> critical path.
>
> One such critical path is the writing of the index. This operation is so
> critical that the sparse index was created specifically to reduce the
> size of the index to make these writes (and reads) faster.
>
> Following a similar approach to one used in the microsoft/git fork [1],
> add a new config option that allows disabling this hashing during the
> index write. The cost is that we can no longer validate the contents for
> corruption-at-rest using the trailing hash.
>
> [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201
>
> While older Git versions will not recognize the null hash as a special
> case, the file format itself is still being met in terms of its
> structure. Using this null hash will still allow Git operations to
> function across older versions.
>
> The one exception is 'git fsck' which checks the hash of the index file.
> Here, we disable this check if the trailing hash is all zeroes. We add a
> warning to the config option that this may cause undesirable behavior
> with older Git versions.
>
> As a quick comparison, I tested 'git update-index --force-write' with
> and without index.computeHash=false on a copy of the Linux kernel
> repository.
>
> Benchmark 1: with hash
>   Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
>   Range (min … max):    34.3 ms …  79.1 ms    82 runs
>
> Benchmark 2: without hash
>   Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
>   Range (min … max):    16.3 ms …  42.0 ms    69 runs
>
> Summary
>   'without hash' ran
>     1.78 ± 0.76 times faster than 'with hash'
>
> These performance benefits are substantial enough to allow users the
> ability to opt-in to this feature, even with the potential confusion
> with older 'git fsck' versions.

This is impressive and interesting...but an improvement unrelated to
this series other than the fact that it builds on some of it.  Perhaps
pull this patch out?

Also, would it make sense to integrate index.computeHash with feature.manyFiles?

>
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  Documentation/config/index.txt |  8 ++++++++
>  read-cache.c                   | 22 +++++++++++++++++++++-
>  t/t1600-index.sh               |  8 ++++++++
>  3 files changed, 37 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/config/index.txt b/Documentation/config/index.txt
> index 75f3a2d1054..709ba72f622 100644
> --- a/Documentation/config/index.txt
> +++ b/Documentation/config/index.txt
> @@ -30,3 +30,11 @@ index.version::
>         Specify the version with which new index files should be
>         initialized.  This does not affect existing repositories.
>         If `feature.manyFiles` is enabled, then the default is 4.
> +
> +index.computeHash::
> +       When enabled, compute the hash of the index file as it is written
> +       and store the hash at the end of the content. This is enabled by
> +       default.
> ++
> +If you disable `index.computeHash`, then older Git clients may report that
> +your index is corrupt during `git fsck`.
> diff --git a/read-cache.c b/read-cache.c
> index 32024029274..f24d96de4d3 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
>         git_hash_ctx c;
>         unsigned char hash[GIT_MAX_RAWSZ];
>         int hdr_version;
> +       int all_zeroes = 1;
> +       unsigned char *start, *end;
>
>         if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
>                 return error(_("bad signature 0x%08x"), hdr->hdr_signature);
> @@ -1827,10 +1829,23 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
>         if (!verify_index_checksum)
>                 return 0;
>
> +       end = (unsigned char *)hdr + size;
> +       start = end - the_hash_algo->rawsz;
> +       while (start < end) {
> +               if (*start != 0) {
> +                       all_zeroes = 0;
> +                       break;
> +               }
> +               start++;
> +       }
> +
> +       if (all_zeroes)
> +               return 0;
> +
>         the_hash_algo->init_fn(&c);
>         the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz);
>         the_hash_algo->final_fn(hash, &c);
> -       if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz))
> +       if (!hasheq(hash, end - the_hash_algo->rawsz))
>                 return error(_("bad index file sha1 signature"));
>         return 0;
>  }
> @@ -2917,9 +2932,14 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
>         int ieot_entries = 1;
>         struct index_entry_offset_table *ieot = NULL;
>         int nr, nr_threads;
> +       int compute_hash;
>
>         f = hashfd(tempfile->fd, tempfile->filename.buf);
>
> +       if (!git_config_get_maybe_bool("index.computehash", &compute_hash) &&
> +           !compute_hash)
> +               f->skip_hash = 1;
> +
>         for (i = removed = extended = 0; i < entries; i++) {
>                 if (cache[i]->ce_flags & CE_REMOVE)
>                         removed++;
> diff --git a/t/t1600-index.sh b/t/t1600-index.sh
> index 010989f90e6..24ab90ca047 100755
> --- a/t/t1600-index.sh
> +++ b/t/t1600-index.sh
> @@ -103,4 +103,12 @@ test_expect_success 'index version config precedence' '
>         test_index_version 0 true 2 2
>  '
>
> +test_expect_success 'index.computeHash config option' '
> +       (
> +               rm -f .git/index &&
> +               git -c index.computeHash=false add a &&
> +               git fsck
> +       )
> +'
> +
>  test_done
> --
> gitgitgadget

Pretty simple change, though.  Very nice.  :-)
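
For anyone skimming the verify_hdr() hunk, the new check boils down to
the standalone sketch below: if the trailing hash bytes are all zero,
treat the hash as "not computed" and skip verification. The function
name trailing_hash_is_null() is hypothetical; the patch does this inline
with the start/end pointers.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if the last rawsz bytes of buf are all zero, meaning the
 * writer skipped hashing and left a null trailer; return 0 otherwise.
 */
static int trailing_hash_is_null(const unsigned char *buf, size_t size,
				 size_t rawsz)
{
	const unsigned char *p, *end;

	assert(size >= rawsz);
	end = buf + size;
	for (p = end - rawsz; p < end; p++)
		if (*p)
			return 0;
	return 1;
}
```

A caller would return early when this reports 1 and only otherwise
recompute the hash over the first size - rawsz bytes and compare it
against the trailer.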


* Re: [PATCH 03/30] extensions: add refFormat extension
  2022-11-07 18:35 ` [PATCH 03/30] extensions: add refFormat extension Derrick Stolee via GitGitGadget
@ 2022-11-11 23:39   ` Elijah Newren
  2022-11-16 14:37     ` Derrick Stolee
  0 siblings, 1 reply; 56+ messages in thread
From: Elijah Newren @ 2022-11-11 23:39 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, jrnieder, Derrick Stolee

On Mon, Nov 7, 2022 at 10:48 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
[...]
> One obvious improvement could be a new file format version for the
> packed-refs file. Its current plaintext-based format is inefficient due
> to storing object IDs as hexadecimal representations instead of in
> their raw format. This extra cost will get worse with SHA-256.

> In addition, binary searches need to guess a position and scan to find
> newlines for a refname entry. A structured binary format could allow for
> more compact representation and faster access.

This doesn't parse very well at all.  The scanning is due to refname
entries being of variable length, and changing hexadecimal
representation of object IDs to binary values isn't going to help
that.

I _think_, after re-scanning your RFC cover letter, that you had other
ideas to allow a binary search in order to read a single ref's value,
and that the juxtaposition of these sentences leads to an
unfortunate assumption that one change is related to both goals,
but something extra here to clarify would help.

> diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
> index bccaec7a963..ce8185adf53 100644
> --- a/Documentation/config/extensions.txt
> +++ b/Documentation/config/extensions.txt
> @@ -7,6 +7,47 @@ Note that this setting should only be set by linkgit:git-init[1] or
>  linkgit:git-clone[1].  Trying to change it after initialization will not
>  work and will produce hard-to-diagnose issues.
>
> +extensions.refFormat::
> +       Specify the reference storage mechanisms used by the repository as a
> +       multi-valued list. The acceptable values are `files` and `packed`.

> +       If not specified, the list of `files` and `packed` is assumed.

This sentence doesn't parse for me.

> +       It
> +       is an error to specify this key unless `core.repositoryFormatVersion`
> +       is 1.

...is at least 1?  Or are we trying to be incompatible with potential
future core.repositoryFormatVersion values?


* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-11 23:28 ` Elijah Newren
@ 2022-11-14  0:07   ` Derrick Stolee
  2022-11-15  2:47     ` Elijah Newren
  2022-11-18 23:31     ` Junio C Hamano
  0 siblings, 2 replies; 56+ messages in thread
From: Derrick Stolee @ 2022-11-14  0:07 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget; +Cc: git, jrnieder

On 11/11/22 6:28 PM, Elijah Newren wrote:
> On Mon, Nov 7, 2022 at 11:01 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> Introduction
>> ============
>>
>> I became interested in our packed-ref format based on the asymmetry between
>> ref updates and ref deletions: if we delete a packed ref, then the
>> packed-refs file needs to be rewritten. Compared to writing a loose ref,
>> this is an O(N) cost instead of O(1).
>>
>> In this way, I set out with some goals:
>>
>>  * (Primary) Make packed ref deletions be nearly as fast as loose ref
>>    updates.
> 
> Performance is always nice.  :-)
> 
>>  * (Secondary) Allow using a packed ref format for all refs, dropping loose
>>    refs and creating a clear way to snapshot all refs at a given point in
>>    time.
> 
> Is this secondary goal the actual goal you have, or just the
> implementation by which you get the real underlying goal?

To me, the primary goal takes precedence. It turns out that the best
way to solve for that goal happens to also make it possible to store
all refs in a packed form, because we can update the packed form
much faster than our current setup. There are alternatives that I
considered (and prototyped) that were more specific to the deletions
case, but they were not actually as fast as the stacked method. Those
alternatives also would never help reach the secondary goal, but I
probably would have considered them anyway if they were faster, if
only for their simplicity.

> To me, it appears that such a capability would solve both (a) D/F
> conflict problems (i.e. the ability to simultaneously have a
> refs/heads/feature and refs/heads/feature/shiny ref), and (b) case
> sensitivity issues in refnames (i.e. inability of some users to work
> with both a refs/heads/feature and a refs/heads/FeAtUrE, due to
> constraints of their filesystem and the loose storage mechanism).  Are
> either of those the goal you are trying to achieve (I think both would
> be really nice, more so than the performance goal you have), or is
> there another?

For a Git host provider, these D/F conflict and case-sensitivity
situations probably would need to stay as restrictions on the
server side for quite some time because we don't want users on
older Git clients to be unable to fetch a repository just because
we updated our ref storage to allow for such possibilities.

The biggest benefit on the server side is actually for consistency
checks. Using a stacked packed-refs (especially with a tip file
that describes all of the layers) allows an atomic way to take a
snapshot of the refs and run a checksum operation on their values.
With loose refs, concurrent updates can modify the checksum during
its computation. This is a super niche reason for this, but it's
nice that the performance-only focus also ends up with a design
that satisfies this goal.

...

>> Further, there is a simpler model that satisfies my primary goal without the
>> complication required for the secondary goal. Suppose we create a stacked
>> packed-refs file but only have two layers: the first (base) layer is created
>> when git pack-refs collapses the full stack and adds the loose ref updates
>> to the packed-refs file; the second (top) layer contains only ref deletions
>> (allowing null OIDs to indicate a deleted ref). Then, ref deletions would
>> only need to rewrite that top layer, making ref deletions take O(deletions)
>> time instead of O(all refs) time. With a reasonable schedule to squash the
>> packed-refs stack, this would be a dramatic improvement. (A prototype
>> implementation showed that updating a layer of 1,000 deletions takes only
>> twice the time as writing a single loose ref.)
> 
> Makes sense.  If a ref is re-introduced after deletion, then do you
> remove it from the deletion layer and then write the single loose ref?

Loose refs always take precedence over the packed layer, so if the loose
ref exists we ignore its status in the packed layer. That allows us to not
update the packed layer unless it is a ref deletion or ref maintenance.
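
As a rough illustration of that precedence in the two-layer model, a
lookup could proceed as in the sketch below. Everything here is
hypothetical (struct names, linear search, and a NULL pointer standing
in for the null OID that marks a deletion); it only shows the order in
which the layers are consulted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct ref_entry {
	const char *name;
	const char *oid;	/* hex OID; NULL stands in for a null OID (deleted) */
};

struct ref_layer {
	const struct ref_entry *entries;
	size_t nr;
};

/* Linear search for brevity; a real layer would be sorted and searched. */
static const struct ref_entry *layer_find(const struct ref_layer *layer,
					  const char *name)
{
	size_t i;

	for (i = 0; i < layer->nr; i++)
		if (!strcmp(layer->entries[i].name, name))
			return &layer->entries[i];
	return NULL;
}

/*
 * Resolve a ref: loose refs win, then the top (deletions) layer, then
 * the base packed layer. Returns the OID, or NULL if the ref is
 * deleted or does not exist.
 */
static const char *resolve_ref(const struct ref_layer *loose,
			       const struct ref_layer *top,
			       const struct ref_layer *base,
			       const char *name)
{
	const struct ref_entry *e;

	assert(loose && top && base && name);
	if ((e = layer_find(loose, name)))
		return e->oid;
	if ((e = layer_find(top, name)))
		return e->oid;	/* NULL here means "deleted" */
	if ((e = layer_find(base, name)))
		return e->oid;
	return NULL;
}
```

In this model, a ref that was deleted and then re-created as a loose ref
resolves to the loose value without the deletion layer being touched.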

>> If we want to satisfy the secondary goal of passing all ref updates through
>> the packed storage, then more complicated layering would be necessary. The
>> point of bringing this up is that we have incremental goals along the way to
>> that final state that give us good stopping points to test the benefits of
>> each step.
> 
> I like the incremental plan.  Your primary goal perhaps benefits
> hosting providers the most, while the second appears to me to be an
> interesting usability improvement (some of my users might argue it's
> even a bugfix) that would affect users with far fewer refs as well.
> So, lots of benefits and we get some along the way to the final plan.

As I mentioned earlier in this reply, we have a ways to go before we
can realize the usability improvement, but we can get started somewhere and
see how things progress.

>> In this part, the focus is on allowing the hashfile API to ignore updating
>> the hash during the buffered writes. We've been using this in microsoft/git
>> to optionally speed up index writes, which patch 2 introduces here. The file
>> format instead writes a null OID which would look like a corrupt file to an
>> older 'git fsck'. Before submitting a full version, I would update 'git
>> fsck' to ignore a null OID in all of our file formats that include a
>> trailing hash. Since the index is more short-lived than other formats (such
>> as pack-files) this trailing hash is less useful. The write time is also
>> critical as the performance tests demonstrate.
> 
> This feels like a diversion from your goals.  Should it really be up-front?
> 
> Reading through the patches, the first patch does appear to be
> necessary but isn't well motivated from the cover letter.  The second
> patch seems orthogonal to your series, though it is really nice to see
> index writes dropping almost to half the time.

This one I put up front only because it is a good candidate for
submitting soon, in parallel to any discussion about the rest of the
RFC.

The last part uses the chunk-format API for the packed-refs v2 format,
but the write speed is critical and hence we need this ability to skip
the hashing. Patch 1 mentions the application in the refs space as a
potential future, but the immediate benefit to index updates can help
lots of users right now.

Even if the community said "this packed-refs v2 is a bad idea" I would
still want to submit these two patches. That's the main reason they are
up front.

(Also: one thing I didn't mention in this cover letter or in the later
patches is that we could enable the hashing on the packed-refs v2 via
a config value, eventually. That would allow users who really care
about validating their file hashes an option to do so. While this
would make packed-refs v2 writes slower than v1 writes, the v1 format
does not have a checksum available, so that might be a valuable option
to those users.)

>> Part II: Create extensions.refFormat
>> ====================================
>>
>> [2] https://github.com/derrickstolee/git/pull/24 Packed-refs v2 Part II:
>> create extensions.refFormat (Patches 3-7)
>>
>> This part is a critical concept that has yet to be defined in the Git
>> codebase. We have no way to incrementally modify the ref format. Since refs
>> are so critical, we cannot add an optionally-understood layer on top (like
>> we did with the multi-pack-index and commit-graph files). The reftable draft
>> [6] proposes the same extension name (extensions.refFormat) but focuses
>> instead on only a single value. This means that the reftable must be defined
>> at git init or git clone time and cannot be upgraded from the files backend.
>>
>> In this RFC, I propose a different model that allows for more customization
>> and incremental updates. The extensions.refFormat config key is multi-valued
>> and defaults to the list of files and packed.
> 
> This last sentence doesn't parse that well for me.  Perhaps "...and
> defaults to a combination of 'files' and 'packed', meaning supporting
> both loose refs and packed refs "?

Sounds good to me. Thanks.

>> Part III: Allow a trailing table-of-contents in the chunk-format API
>> ====================================================================
>>
>> [3] https://github.com/derrickstolee/git/pull/25 Packed-refs v2 Part III:
>> trailing table of contents in chunk-format (Patches 8-17)
>>
>> In order to optimize the write speed of the packed-refs v2 file format, we
>> want to write immediately to the file as we stream existing refs from the
>> current refs. The current chunk-format API requires computing the chunk
>> lengths in advance, which can slow down the write and take more memory than
>> necessary. Using a trailing table of contents solves this problem, and was
>> recommended earlier [7]. We just didn't have enough evidence to justify the
>> work to update the existing chunk formats. Here, we update the API in
>> advance of using it in the packed-refs v2 format.
>>
>> We could consider updating the commit-graph and multi-pack-index formats to
>> use trailing table of contents, but it requires a version bump. That might
>> be worth it in the case of the commit-graph where computing the size of the
>> changed-path Bloom filters chunk requires a lot of memory at the moment.
>> After this chunk-format API update is reviewed and merged, we can pursue
>> those directions more closely. We would want to investigate the formats more
>> carefully to see if we want to update the chunks themselves as well as some
>> header information.
> 
> I like how you point out additional benefits the series could provide,
> but leave them out.  Perhaps do the same with the optional index
> hashing in patch 2?

The index hashing is just _so easy_ that I couldn't bring myself to
leave it out. I didn't include the skip_hash option for these other
formats since the write times are not as critical for these files as
write time is for the index.

Updating these formats to a v2 that uses a trailing table of contents (and
likely other small deviations based on what we've learned since they
were first created) would be an interesting direction to pursue with
care. Absolutely not something to do while blocking the refs work.

>> Part IV: Abstract some parts of the v1 file format
>> ==================================================
>>
>> [4] https://github.com/derrickstolee/git/pull/26 Packed-refs v2 Part IV:
>> abstract some parts of the v1 file format (Patches 18-21)
>>
>> These patches move the part of the refs/packed-backend.c file that deals with
>> the specifics of the packed-refs v1 file format into a new file:
>> refs/packed-format-v1.c. This also creates an abstraction layer that will
>> allow inserting the v2 format more easily.
>>
>> One thing that doesn't exist currently is a documentation file describing
>> the packed-refs file format. I would add that file in this part before
>> submitting it for full review. (I also haven't written the file format doc
>> for the packed-refs v2 format, either.)
> 
> Sounds like another win-win opportunity to get someone from the
> community to contribute.  :-)   ..Or maybe that's not the best
> strategy, since recent empirical evidence suggests that trick doesn't
> work.  Oh well, it was worth a shot.

Hey, if someone beats me to it, I won't complain. They should expect
me to CC them on reviews for the packed-refs v2 format ;)

Thanks,
-Stolee


* Re: [PATCH 02/30] read-cache: add index.computeHash config option
  2022-11-11 23:31   ` Elijah Newren
@ 2022-11-14 16:30     ` Derrick Stolee
  0 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee @ 2022-11-14 16:30 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget; +Cc: git, jrnieder

On 11/11/2022 6:31 PM, Elijah Newren wrote:
> On Mon, Nov 7, 2022 at 10:48 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <derrickstolee@github.com>
>>
>> The previous change allowed skipping the hashing portion of the
>> hashwrite API, using it instead as a buffered write API. Disabling the
>> hashwrite can be particularly helpful when the write operation is in a
>> critical path.
>>
>> One such critical path is the writing of the index. This operation is so
>> critical that the sparse index was created specifically to reduce the
>> size of the index to make these writes (and reads) faster.
>>
>> Following a similar approach to one used in the microsoft/git fork [1],
>> add a new config option that allows disabling this hashing during the
>> index write. The cost is that we can no longer validate the contents for
>> corruption-at-rest using the trailing hash.
>>
>> [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201
>>
>> While older Git versions will not recognize the null hash as a special
>> case, the file format itself is still being met in terms of its
>> structure. Using this null hash will still allow Git operations to
>> function across older versions.
>>
>> The one exception is 'git fsck' which checks the hash of the index file.
>> Here, we disable this check if the trailing hash is all zeroes. We add a
>> warning to the config option that this may cause undesirable behavior
>> with older Git versions.
>>
>> As a quick comparison, I tested 'git update-index --force-write' with
>> and without index.computeHash=false on a copy of the Linux kernel
>> repository.
>>
>> Benchmark 1: with hash
>>   Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
>>   Range (min … max):    34.3 ms …  79.1 ms    82 runs
>>
>> Benchmark 2: without hash
>>   Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
>>   Range (min … max):    16.3 ms …  42.0 ms    69 runs
>>
>> Summary
>>   'without hash' ran
>>     1.78 ± 0.76 times faster than 'with hash'
>>
>> These performance benefits are substantial enough to allow users the
>> ability to opt-in to this feature, even with the potential confusion
>> with older 'git fsck' versions.
> 
> This is impressive and interesting...but an improvement unrelated to
> this series other than the fact that it builds on some of it.  Perhaps
> pull this patch out?

While patch 1 is required for the packed-refs work, this one is an easy
way to take advantage of it. I'll submit these two patches soon on their
own as the rest of the RFC is discussed.

> Also, would it make sense to integrate index.computeHash with feature.manyFiles?

It would make sense to include in feature.manyFiles and Scalar's recommended
config. I expect that it would be good to have the config available in a Git
release before updating those configs to include it. Perhaps that is too
conservative, though.

Thanks,
-Stolee

* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-14  0:07   ` Derrick Stolee
@ 2022-11-15  2:47     ` Elijah Newren
  2022-11-16 14:45       ` Derrick Stolee
  2022-11-18 23:31     ` Junio C Hamano
  1 sibling, 1 reply; 56+ messages in thread
From: Elijah Newren @ 2022-11-15  2:47 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Derrick Stolee via GitGitGadget, git, jrnieder

On Sun, Nov 13, 2022 at 4:07 PM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 11/11/22 6:28 PM, Elijah Newren wrote:
> > On Mon, Nov 7, 2022 at 11:01 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
> >> Introduction
> >> ============
> >>
> >> I became interested in our packed-ref format based on the asymmetry between
> >> ref updates and ref deletions: if we delete a packed ref, then the
> >> packed-refs file needs to be rewritten. Compared to writing a loose ref,
> >> this is an O(N) cost instead of O(1).
> >>
> >> In this way, I set out with some goals:
> >>
> >>  * (Primary) Make packed ref deletions be nearly as fast as loose ref
> >>    updates.
> >
> > Performance is always nice.  :-)
> >
> >>  * (Secondary) Allow using a packed ref format for all refs, dropping loose
> >>    refs and creating a clear way to snapshot all refs at a given point in
> >>    time.
> >
> > Is this secondary goal the actual goal you have, or just the
> > implementation by which you get the real underlying goal?
>
> To me, the primary goal takes precedence. It turns out that the best
> way to solve for that goal happens to also make it possible to store
> all refs in a packed form, because we can update the packed form
> much faster than our current setup. There are alternatives that I
> considered (and prototyped) that were more specific to the deletions
> case, but they were not actually as fast as the stacked method. Those
> alternatives also would never help reach the secondary goal, but I
> probably would have considered them anyway if they were faster, if
> only for their simplicity.

That's orthogonal to my question, though.  For your primary goal, you
stated it in a form where it was obvious what benefit it would provide
to end users.  Your secondary goal, as stated, didn't list any benefit
to end users that I could see (update: reading the rest of your
response it appears I just didn't understand it), so I was trying to
guess at why your secondary goal might be a goal, i.e. what the real
secondary goal was.

> > To me, it appears that such a capability would solve both (a) D/F
> > conflict problems (i.e. the ability to simultaneously have a
> > refs/heads/feature and refs/heads/feature/shiny ref), and (b) case
> > sensitivity issues in refnames (i.e. inability of some users to work
> > with both a refs/heads/feature and a refs/heads/FeAtUrE, due to
> > constraints of their filesystem and the loose storage mechanism).  Are
> > either of those the goal you are trying to achieve (I think both would
> > be really nice, more so than the performance goal you have), or is
> > there another?
>
> For a Git host provider, these D/F conflict and case-sensitivity
> situations probably would need to stay as restrictions on the
> server side for quite some time because we don't want users on
> older Git clients to be unable to fetch a repository just because
> we updated our ref storage to allow for such possibilities.

Okay, but even if not used on the server side, this capability could
still be used on the client side and provide a big benefit to end
users.

But I think there's a minor issue with what you stated; as far as I
can tell, there is no case-sensitivity restriction on the server side
for GitHub currently, and users do currently have problems cloning and
using repositories with branches that differ in case only.  See e.g.
https://github.com/newren/git-filter-repo/issues/48 and the multiple
duplicates which reference that issue.  We've also had issues at
$DAYJOB, though for GHE we added some hooks to deny creating branches
that differ only in case from another branch to avoid the problem.

Also, D/F restrictions on the server do not stop users from having D/F
problems when fetching.  If users forget to use `--prune`, then when a
refs/heads/foo that has already been fetched is deleted and replaced by
a refs/heads/foo/bar, the user gets errors.  This issue actually
caused a bit of a fire-drill for us just recently.

So both kinds of problems already exist, for users with any git client
version (although the former only for users with unfortunate file
systems).  And both problems cause pain.  Both issues are caused by
loose refs, so limiting git storage to packed refs would fix both
issues.

> The biggest benefit on the server side is actually for consistency
> checks. Using a stacked packed-refs (especially with a tip file
> that describes all of the layers) allows an atomic way to take a
> snapshot of the refs and run a checksum operation on their values.
> With loose refs, concurrent updates can modify the checksum during
> its computation. This is a super niche reason for this, but it's
> nice that the performance-only focus also ends up with a design
> that satisfies this goal.

Ah...so this is the reason for your secondary goal?  Re-reading it
looks like you did state this, I just missed it without the longer
explanation.

Anyway, it might be worth calling out in your cover letter that there
are (at least) three benefits to this secondary goal of yours -- the
one you list here, plus the two I list above.

* Re: [PATCH 03/30] extensions: add refFormat extension
  2022-11-11 23:39   ` Elijah Newren
@ 2022-11-16 14:37     ` Derrick Stolee
  0 siblings, 0 replies; 56+ messages in thread
From: Derrick Stolee @ 2022-11-16 14:37 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget; +Cc: git, jrnieder

On 11/11/22 6:39 PM, Elijah Newren wrote:
> On Mon, Nov 7, 2022 at 10:48 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> [...]
>> One obvious improvement could be a new file format version for the
>> packed-refs file. Its current plaintext-based format is inefficient due
>> to storing object IDs as hexadecimal representations instead of in
>> their raw format. This extra cost will get worse with SHA-256.
> 
>> In addition, binary searches need to guess a position and scan to find
>> newlines for a refname entry. A structured binary format could allow for
>> more compact representation and faster access.
> 
> This doesn't parse very well at all.  The scanning is due to refname
> entries being of variable length, and changing hexadecimal
> representation of object IDs to binary values isn't going to help
> that.
> 
> I _think_, after re-scanning your RFC cover letter that you had other
> ideas to allow a binary search in order to read a single ref's value,
> and that the juxtaposing of these sentences together leads to an
> unfortunate assumption that one change is related to both goals,
> but something extra here to clarify would help.

The v2 format has a structured list of offsets that can be used to
navigate directly to the ith ref in the file. Thus, we can use a
more precise form of binary search. Since we have these values, we
do not need to scan for newlines or spaces for the end of the ref
strings. This allows us to use the raw OIDs since we are not using
special characters as string boundaries.

I will work to clarify when I submit this for review.
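
To make the offset idea concrete, here is a minimal sketch in C. It is
not the v2 on-disk format: the struct, the record layout (refname bytes
followed directly by the raw OID) and the names `ref_table` and
`lookup_ref` are all assumptions made for illustration. It only shows
why a table of record offsets lets a binary search jump straight to the
ith entry without scanning for newlines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical in-memory view of a sorted ref chunk (illustrative only).
 * Record i occupies data[offsets[i] .. offsets[i + 1]) and consists of
 * the refname bytes (no terminator) followed by rawsz bytes of raw OID.
 */
struct ref_table {
	const unsigned char *data;
	const uint32_t *offsets;	/* nr + 1 entries; the last marks the end */
	size_t nr;
	size_t rawsz;			/* e.g. 20 for SHA-1, 32 for SHA-256 */
};

/* Compare the refname of record i against name; result is like strcmp(). */
static int cmp_refname(const struct ref_table *t, size_t i,
		       const char *name, size_t name_len)
{
	size_t start = t->offsets[i];
	size_t span = t->offsets[i + 1] - start;
	size_t len, min;
	int c;

	assert(span >= t->rawsz);
	len = span - t->rawsz;
	min = len < name_len ? len : name_len;
	c = memcmp(t->data + start, name, min);
	if (c)
		return c;
	return (len > name_len) - (len < name_len);
}

/* Return a pointer to the raw OID for name, or NULL if it is not present. */
const unsigned char *lookup_ref(const struct ref_table *t, const char *name)
{
	size_t lo = 0, hi = t->nr;
	size_t name_len = strlen(name);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int c = cmp_refname(t, mid, name, name_len);

		if (!c)
			return t->data + t->offsets[mid + 1] - t->rawsz;
		if (c < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
}
```

Because every probe lands exactly on a record boundary, the raw OID
bytes may contain any value, including newline or space, without
confusing the search.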

>> diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
>> index bccaec7a963..ce8185adf53 100644
>> --- a/Documentation/config/extensions.txt
>> +++ b/Documentation/config/extensions.txt
>> @@ -7,6 +7,47 @@ Note that this setting should only be set by linkgit:git-init[1] or
>>  linkgit:git-clone[1].  Trying to change it after initialization will not
>>  work and will produce hard-to-diagnose issues.
>>
>> +extensions.refFormat::
>> +       Specify the reference storage mechanisms used by the repository as a
>> +       multi-valued list. The acceptable values are `files` and `packed`.
> 
>> +       If not specified, the list of `files` and `packed` is assumed.
> 
> This sentence doesn't parse for me.
> 
>> +       It
>> +       is an error to specify this key unless `core.repositoryFormatVersion`
>> +       is 1.
> 
> ...is at least 1?  Or are we trying to be incompatible with potential
> future core.repositoryFormatVersion values?

Specifying exactly 1 is consistent across our extensions documentation.
The intention of the extensions system is that we should never need a
value 2 here. If we do, then we should consider all extensions to be
redesigned from scratch. Perhaps we'd have different defaults, or older
options not possible anymore.

Thanks,
-Stolee

* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-15  2:47     ` Elijah Newren
@ 2022-11-16 14:45       ` Derrick Stolee
  2022-11-17  4:28         ` Elijah Newren
  0 siblings, 1 reply; 56+ messages in thread
From: Derrick Stolee @ 2022-11-16 14:45 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Derrick Stolee via GitGitGadget, git, jrnieder

On 11/14/22 9:47 PM, Elijah Newren wrote:
> On Sun, Nov 13, 2022 at 4:07 PM Derrick Stolee <derrickstolee@github.com> wrote:
>>
>> On 11/11/22 6:28 PM, Elijah Newren wrote:
>>> On Mon, Nov 7, 2022 at 11:01 AM Derrick Stolee via GitGitGadget
>>> <gitgitgadget@gmail.com> wrote:
>>>>
>>>> Introduction
>>>> ============
>>>>
>>>> I became interested in our packed-ref format based on the asymmetry between
>>>> ref updates and ref deletions: if we delete a packed ref, then the
>>>> packed-refs file needs to be rewritten. Compared to writing a loose ref,
>>>> this is an O(N) cost instead of O(1).
>>>>
>>>> In this way, I set out with some goals:
>>>>
>>>>  * (Primary) Make packed ref deletions be nearly as fast as loose ref
>>>>    updates.
>>>
>>> Performance is always nice.  :-)
>>>
>>>>  * (Secondary) Allow using a packed ref format for all refs, dropping loose
>>>>    refs and creating a clear way to snapshot all refs at a given point in
>>>>    time.
>>>
>>> Is this secondary goal the actual goal you have, or just the
>>> implementation by which you get the real underlying goal?
>>
>> To me, the primary goal takes precedence. It turns out that the best
>> way to solve for that goal happens to also make it possible to store
>> all refs in a packed form, because we can update the packed form
>> much faster than our current setup. There are alternatives that I
>> considered (and prototyped) that were more specific to the deletions
>> case, but they were not actually as fast as the stacked method. Those
>> alternatives also would never help reach the secondary goal, but I
>> probably would have considered them anyway if they were faster, if
>> only for their simplicity.
> 
> That's orthogonal to my question, though.  For your primary goal, you
> stated it in a form where it was obvious what benefit it would provide
> to end users.  Your secondary goal, as stated, didn't list any benefit
> to end users that I could see (update: reading the rest of your
> response it appears I just didn't understand it), so I was trying to
> guess at why your secondary goal might be a goal, i.e. what the real
> secondary goal was.

The reason is in the goal "creating a clear way to snapshot all refs
at a given point in time". This is a server-side benefit with no
visible benefit to users, immediately.

The D/F conflicts and case-sensitive parts that could fall from that
are not included in my goals. Part of that is because we would need a
new reflog format to complete that part. Let's take things one step
at a time and handle reflogs after we have ref update performance
handled.

>>> To me, it appears that such a capability would solve both (a) D/F
>>> conflict problems (i.e. the ability to simultaneously have a
>>> refs/heads/feature and refs/heads/feature/shiny ref), and (b) case
>>> sensitivity issues in refnames (i.e. inability of some users to work
>>> with both a refs/heads/feature and a refs/heads/FeAtUrE, due to
>>> constraints of their filesystem and the loose storage mechanism).  Are
>>> either of those the goal you are trying to achieve (I think both would
>>> be really nice, more so than the performance goal you have), or is
>>> there another?
>>
>> For a Git host provider, these D/F conflict and case-sensitivity
>> situations probably would need to stay as restrictions on the
>> server side for quite some time because we don't want users on
>> older Git clients to be unable to fetch a repository just because
>> we updated our ref storage to allow for such possibilities.
> 
> Okay, but even if not used on the server side, this capability could
> still be used on the client side and provide a big benefit to end
> users.
> 
> But I think there's a minor issue with what you stated; as far as I
> can tell, there is no case-sensitivity restriction on the server side
> for GitHub currently, and users do currently have problems cloning and
> using repositories with branches that differ in case only.  See e.g.
> https://github.com/newren/git-filter-repo/issues/48 and the multiple
> duplicates which reference that issue.  We've also had issues at
> $DAYJOB, though for GHE we added some hooks to deny creating branches
> that differ only in case from another branch to avoid the problem.

Yes, you're right here. We could do better in rejecting case-sensitive
matches upon request.

> Also, D/F restrictions on the server do not stop users from having D/F
> problems when fetching.  If users forget to use `--prune`, then when a
> refs/heads/foo that has already been fetched is deleted and replaced by
> a refs/heads/foo/bar, the user gets errors.  This issue actually
> caused a bit of a fire-drill for us just recently.

And similar to the case-sensitive situation, I'm not sure if we have
checks to avoid D/F conflicts if they happen across the loose/packed
boundary. We might just be using the filesystem as a constraint. I'll
need to dig in more here.

(This is all the more reason why this space is already complicated and
will take some time to unwind.)

> So both kinds of problems already exist, for users with any git client
> version (although the former only for users with unfortunate file
> systems).  And both problems cause pain.  Both issues are caused by
> loose refs, so limiting git storage to packed refs would fix both
> issues.
> 
>> The biggest benefit on the server side is actually for consistency
>> checks. Using a stacked packed-refs (especially with a tip file
>> that describes all of the layers) allows an atomic way to take a
>> snapshot of the refs and run a checksum operation on their values.
>> With loose refs, concurrent updates can modify the checksum during
>> its computation. This is a super niche reason for this, but it's
>> nice that the performance-only focus also ends up with a design
>> that satisfies this goal.
> 
> Ah...so this is the reason for your secondary goal?  Re-reading it
> looks like you did state this, I just missed it without the longer
> explanation.
> 
> Anyway, it might be worth calling out in your cover letter that there
> are (at least) three benefits to this secondary goal of yours -- the
> one you list here, plus the two I list above.

I suppose I assumed that the D/F and case conflicts were a "known"
benefit and a huge motivation of the reftable work. Instead of trying
to solve all of the ref problems at once, I wanted to focus on the
subset that I knew could be solved with a simpler solution, leaving
the full solution to later steps. It would help to be explicit about
how this direction helps solve this problem while also being clear
about how it does not solve it completely.

Thanks,
-Stolee

* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-16 14:45       ` Derrick Stolee
@ 2022-11-17  4:28         ` Elijah Newren
  0 siblings, 0 replies; 56+ messages in thread
From: Elijah Newren @ 2022-11-17  4:28 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Derrick Stolee via GitGitGadget, git, jrnieder

On Wed, Nov 16, 2022 at 6:45 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 11/14/22 9:47 PM, Elijah Newren wrote:
> > On Sun, Nov 13, 2022 at 4:07 PM Derrick Stolee <derrickstolee@github.com> wrote:
> >>
> >> On 11/11/22 6:28 PM, Elijah Newren wrote:
> >>> On Mon, Nov 7, 2022 at 11:01 AM Derrick Stolee via GitGitGadget
> >>> <gitgitgadget@gmail.com> wrote:
[...]
> >>>>  * (Secondary) Allow using a packed ref format for all refs, dropping loose
> >>>>    refs and creating a clear way to snapshot all refs at a given point in
> >>>>    time.
[...]
>
> The reason is in the goal "creating a clear way to snapshot all refs
> at a given point in time". This is a server-side benefit with no
> visible benefit to users, immediately.

Yes, sorry, I just missed it.  I didn't understand it and wrongly
assumed it was continuing to talk about the implementation details
rather than the benefit details.  My bad.

Thanks for patiently correcting me.

> The D/F conflicts and case-sensitive parts that could fall from that
> are not included in my goals. Part of that is because we would need a
> new reflog format to complete that part. Let's take things one step
> at a time and handle reflogs after we have ref update performance
> handled.

Ah, right, I can see how reflog would affect both of those problems
now that you highlight it, but it hadn't occurred to me before.

> >> The biggest benefit on the server side is actually for consistency
> >> checks. Using a stacked packed-refs (especially with a tip file
> >> that describes all of the layers) allows an atomic way to take a
> >> snapshot of the refs and run a checksum operation on their values.
> >> With loose refs, concurrent updates can modify the checksum during
> >> its computation. This is a super niche reason for this, but it's
> >> nice that the performance-only focus also ends up with a design
> >> that satisfies this goal.
> >
> > Ah...so this is the reason for your secondary goal?  Re-reading it
> > looks like you did state this, I just missed it without the longer
> > explanation.
> >
> > Anyway, it might be worth calling out in your cover letter that there
> > are (at least) three benefits to this secondary goal of yours -- the
> > one you list here, plus the two I list above.
>
> I suppose I assumed that the D/F and case conflicts were a "known"
> benefit and a huge motivation of the reftable work.

Yes, and I thought you had just found a simpler solution to those
problems that might not provide all the benefits of reftable (e.g.
performance with huge numbers of refs) but did solve those particular
problems.  I've only looked at reftable from the surface from a
distance, and I was unaware previously that reflog also affected these
two problems (though it seems obvious in hindsight).  And I do
remember you calling out that you weren't changing the reflog format
in your cover letter, but I didn't understand the ramifications of
that statement at the time.

> Instead of trying
> to solve all of the ref problems at once, I wanted to focus on the
> subset that I knew could be solved with a simpler solution, leaving
> the full solution to later steps. It would help to be explicit about
> how this direction helps solve this problem while also being clear
> about how it does not solve it completely.

It certainly would have helped me.  :-)

Thanks for explaining all these details.

* Re: [PATCH 02/30] read-cache: add index.computeHash config option
  2022-11-07 18:35 ` [PATCH 02/30] read-cache: add index.computeHash config option Derrick Stolee via GitGitGadget
  2022-11-11 23:31   ` Elijah Newren
@ 2022-11-17 16:13   ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 56+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-11-17 16:13 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, jrnieder, Derrick Stolee


On Mon, Nov 07 2022, Derrick Stolee via GitGitGadget wrote:

> Summary
>   'without hash' ran
>     1.78 ± 0.76 times faster than 'with hash'
>
> These performance benefits are substantial enough to allow users the
> ability to opt-in to this feature, even with the potential confusion
> with older 'git fsck' versions.

The 0.76 part of that is probably just fs caches etc. screwing things
up. I tried it on a ramdisk with CFLAGS=-O3:
	
	$ hyperfine -L v false,true './git -c index.computeHash={v} -C /dev/shm/linux update-index --force-write' -w 1 -r 10
	Benchmark 1: ./git -c index.computeHash=false -C /dev/shm/linux update-index --force-write
	  Time (mean ± σ):      13.3 ms ±   0.3 ms    [User: 7.1 ms, System: 6.1 ms]
	  Range (min … max):    12.7 ms …  13.6 ms    10 runs
	 
	Benchmark 2: ./git -c index.computeHash=true -C /dev/shm/linux update-index --force-write
	  Time (mean ± σ):      34.8 ms ±   0.4 ms    [User: 28.9 ms, System: 5.8 ms]
	  Range (min … max):    34.2 ms …  35.1 ms    10 runs
	 
	Summary
	  './git -c index.computeHash=false -C /dev/shm/linux update-index --force-write' ran
	    2.62 ± 0.07 times faster than './git -c index.computeHash=true -C /dev/shm/linux update-index --force-write'
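
For reference, the "±" on the ratio can be reproduced with a small
sketch, assuming it comes from first-order error propagation for a
quotient (the relative errors of the two means added in quadrature).
That assumption is mine, though it does match the figures in both
summaries. The struct and function names are hypothetical, and a short
Newton iteration stands in for sqrt() so the sketch builds without
linking libm.

```c
#include <assert.h>

struct timing { double mean; double stddev; };
struct ratio { double value; double stddev; };

/* Square root by Newton's method; enough precision for this purpose. */
static double sqrt_newton(double x)
{
	double r = x > 1.0 ? x : 1.0;
	int i;

	if (x <= 0.0)
		return 0.0;
	for (i = 0; i < 60; i++)
		r = 0.5 * (r + x / r);
	return r;
}

/*
 * Speedup of "fast" over "slow" with a propagated uncertainty:
 *   value  = slow.mean / fast.mean
 *   stddev = value * sqrt((slow.sd/slow.mean)^2 + (fast.sd/fast.mean)^2)
 */
struct ratio relative_speed(struct timing slow, struct timing fast)
{
	struct ratio out;
	double rs, rf;

	assert(slow.mean > 0.0 && fast.mean > 0.0);
	rs = slow.stddev / slow.mean;
	rf = fast.stddev / fast.mean;
	out.value = slow.mean / fast.mean;
	out.stddev = out.value * sqrt_newton(rs * rs + rf * rf);
	return out;
}
```

Feeding in the original numbers (46.3 ± 13.8 ms and 26.0 ± 7.9 ms)
gives roughly 1.78 ± 0.76, and the ramdisk numbers above (34.8 ± 0.4 ms
and 13.3 ± 0.3 ms) give roughly 2.62 ± 0.07. So the large "0.76" follows
directly from the roughly 30% run-to-run spread in the first measurement.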

I also see that if I compile with OPENSSL_SHA1=Y, then:
	
	$ hyperfine -L v false,true './git -c index.computeHash={v} -C /dev/shm/linux update-index --force-write' 
	Benchmark 1: ./git -c index.computeHash=false -C /dev/shm/linux update-index --force-write
	  Time (mean ± σ):      14.0 ms ±   1.3 ms    [User: 7.7 ms, System: 6.2 ms]
	  Range (min … max):    13.1 ms …  21.7 ms    206 runs
	 
	  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
	 
	Benchmark 2: ./git -c index.computeHash=true -C /dev/shm/linux update-index --force-write
	  Time (mean ± σ):      21.0 ms ±   1.0 ms    [User: 15.0 ms, System: 6.0 ms]
	  Range (min … max):    20.1 ms …  28.4 ms    138 runs
	 
	  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
	 
	Summary
	  './git -c index.computeHash=false -C /dev/shm/linux update-index --force-write' ran
	    1.50 ± 0.15 times faster than './git -c index.computeHash=true -C /dev/shm/linux update-index --force-write'

Which, FWIW is something worth considering. I.e. when we introduced
sha1dc we did so with the "big hammer" of the existing hashing API,
which is all or nothing, and we pick the hash when we compile git.

But that left a lot of things slower for no good reason, e.g. when we do
this hashing of the trailers. So if we could just compile with two
implementations, and give users the choice of "use the faster hash when
you're not communicating with other git repos" we could make things
faster in some cases, without the potential format interop issues.

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> +index.computeHash::
> +	When enabled, compute the hash of the index file as it is written
> +	and store the hash at the end of the content. This is enabled by
> +	default.
> ++

If we have a boolean option it makes sense to make its name reflect the
opt-in nature. So "index.skipHash". Then just say "If enabled", and skip
the "this is enabled by default" part. Then later this code:

> +	int compute_hash;
> [...]
> +	if (!git_config_get_maybe_bool("index.computehash", &compute_hash) &&
> +	    !compute_hash)
> +		f->skip_hash = 1;

Can just become:

	git_config_get_maybe_bool("index.skipHash", &f->skip_hash);

I.e. git_config_get_maybe_bool() leaves the passed-in dest value alone
if it doesn't have it in the config, and you only use this
"compute_hash" as an inverted version of "skip_hash".
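
A toy model of that pattern, to show why the temporary goes away. This
is not Git's config machinery: `toy_get_bool`, `toy_entry` and
`toy_hashfile` are hypothetical stand-ins, and the lookup is a plain
exact-match scan. The only property it mirrors is the one relied on
above, namely that the destination is left untouched when the key is
absent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_entry { const char *key; const char *value; };

/*
 * Returns 0 and writes *dest when the key is present; returns 1 and
 * leaves *dest alone when it is missing.
 */
static int toy_get_bool(const struct toy_entry *cfg, size_t nr,
			const char *key, int *dest)
{
	size_t i;

	assert(dest);
	for (i = 0; i < nr; i++) {
		if (!strcmp(cfg[i].key, key)) {
			*dest = !strcmp(cfg[i].value, "true");
			return 0;
		}
	}
	return 1;
}

struct toy_hashfile { int skip_hash; };

/* With a positively named option the flag can be filled in directly. */
void toy_configure(struct toy_hashfile *f,
		   const struct toy_entry *cfg, size_t nr)
{
	toy_get_bool(cfg, nr, "index.skiphash", &f->skip_hash);
}
```

If the key is unset, skip_hash keeps whatever default the caller
initialized it to, so no inverted intermediate is needed.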

> +If you disable `index.computeHash`, then older Git clients may report that
> +your index is corrupt during `git fsck`.
> diff --git a/read-cache.c b/read-cache.c
> index 32024029274..f24d96de4d3 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -1817,6 +1817,8 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
>  	git_hash_ctx c;
>  	unsigned char hash[GIT_MAX_RAWSZ];
>  	int hdr_version;
> +	int all_zeroes = 1;
> +	unsigned char *start, *end;
>  
>  	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
>  		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
> @@ -1827,10 +1829,23 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
>  	if (!verify_index_checksum)
>  		return 0;
>  
> +	end = (unsigned char *)hdr + size;
> +	start = end - the_hash_algo->rawsz;
> +	while (start < end) {
> +		if (*start != 0) {
> +			all_zeroes = 0;
> +			break;
> +		}
> +		start++;
> +	}

Didn't you just re-invent oidread()? :)

Just to narrate my way through this. Before we called verify_hdr() we
did:

        hdr = (const struct cache_header *)mmap;
        if (verify_hdr(hdr, mmap_size) < 0)

So, we mmap()'d the index on disk, and "hdr" is the struct version
of this data. We then cast that back to an "unsigned char *" here,
because we're interested in just the raw bytes.

Then we "jump to the end" here, and start iterating over the rawsz at
the end, because we're just reading if we have a null_oid().

Then, right after that verify_hdr() call, the very next thing we'll do is:

	oidread(&istate->oid, (const unsigned char *)hdr + mmap_size - the_hash_algo->rawsz);

So, maybe I'm missing some subtlety still, and some of this is existing
baggage in the pre-image (we used to have the sha1 in the struct, a
*long* time ago).

But isn't this equivalent?:
	
	diff --git a/read-cache.c b/read-cache.c
	index f24d96de4d3..39b5b8419f5 100644
	--- a/read-cache.c
	+++ b/read-cache.c
	@@ -1812,13 +1812,14 @@ int verify_index_checksum;
	 /* Allow fsck to force verification of the cache entry order. */
	 int verify_ce_order;
	 
	-static int verify_hdr(const struct cache_header *hdr, unsigned long size)
	+static int verify_hdr(const char *const mmap, const size_t size,
	+		      const struct cache_header **hdrp, struct object_id *oid)
	 {
	+	const struct cache_header *hdr = (const struct cache_header *)mmap;
	 	git_hash_ctx c;
	 	unsigned char hash[GIT_MAX_RAWSZ];
	 	int hdr_version;
	-	int all_zeroes = 1;
	-	unsigned char *start, *end;
	+	const unsigned char *end = (unsigned char *)mmap + size;
	 
	 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
	 		return error(_("bad signature 0x%08x"), hdr->hdr_signature);
	@@ -1826,20 +1827,12 @@ static int verify_hdr(const struct cache_header *hdr, unsigned long size)
	 	if (hdr_version < INDEX_FORMAT_LB || INDEX_FORMAT_UB < hdr_version)
	 		return error(_("bad index version %d"), hdr_version);
	 
	+	*hdrp = hdr;
	+	oidread(oid, end - the_hash_algo->rawsz);
	+
	 	if (!verify_index_checksum)
	 		return 0;
	-
	-	end = (unsigned char *)hdr + size;
	-	start = end - the_hash_algo->rawsz;
	-	while (start < end) {
	-		if (*start != 0) {
	-			all_zeroes = 0;
	-			break;
	-		}
	-		start++;
	-	}
	-
	-	if (all_zeroes)
	+	if (is_null_oid(oid))
	 		return 0;
	 
	 	the_hash_algo->init_fn(&c);
	@@ -2358,11 +2351,8 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
	 			mmap_os_err());
	 	close(fd);
	 
	-	hdr = (const struct cache_header *)mmap;
	-	if (verify_hdr(hdr, mmap_size) < 0)
	+	if (verify_hdr(mmap, mmap_size, &hdr, &istate->oid) < 0)
	 		goto unmap;
	-
	-	oidread(&istate->oid, (const unsigned char *)hdr + mmap_size - the_hash_algo->rawsz);
	 	istate->version = ntohl(hdr->hdr_version);
	 	istate->cache_nr = ntohl(hdr->hdr_entries);
	 	istate->cache_alloc = alloc_nr(istate->cache_nr);

I.e. we just make the verify function be in charge of populating our
"oid", which we can do early, as we'd error out later in the
function if it doesn't match.

We could avoid the "hdrp" there, but if we're doing the cast it's
probably good for readability to just do it once.
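
Stripped of the Git types, the check under discussion is just "are the
last rawsz bytes of the mapped file all zero?". A standalone sketch,
with a hypothetical function name and plain buffers in place of the
mmap and the_hash_algo:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if the final rawsz bytes of buf[0..size) are all zero, i.e.
 * the trailing checksum was written as a null hash; return 0 otherwise.
 */
int trailing_hash_is_null(const unsigned char *buf, size_t size, size_t rawsz)
{
	const unsigned char *p, *end;

	assert(buf && rawsz <= size);
	end = buf + size;
	for (p = end - rawsz; p < end; p++)
		if (*p)
			return 0;
	return 1;
}
```

In the real code the same answer falls out of reading those bytes into
an object_id and asking is_null_oid(), which is the point of the diff
above.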

> +test_expect_success 'index.computeHash config option' '
> +	(
> +		rm -f .git/index &&
> +		git -c index.computeHash=false add a &&
> +		git fsck
> +	)
> +'

You can skip the subshell here, but for a non-RFC let's leave the test
in a nice state for the next test someone adds, so maybe:

	test_when_finished "rm -rf repo" &&
	git clone . repo &&
	[...]

Lastly, on this again:

> These performance benefits are substantial enough to allow users the
> ability to opt-in to this feature, even with the potential confusion
> with older 'git fsck' versions.

Isn't an unstated major caveat here that it's not just "an older version":
if you set the config to "true" on *your version*, your index doesn't
have a hash, and that persists until you wipe the index?


* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-14  0:07   ` Derrick Stolee
  2022-11-15  2:47     ` Elijah Newren
@ 2022-11-18 23:31     ` Junio C Hamano
  2022-11-19  0:41       ` Elijah Newren
  2022-11-30 15:31       ` Derrick Stolee
  1 sibling, 2 replies; 56+ messages in thread
From: Junio C Hamano @ 2022-11-18 23:31 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren, Derrick Stolee via GitGitGadget, git, jrnieder

Derrick Stolee <derrickstolee@github.com> writes:

> On 11/11/22 6:28 PM, Elijah Newren wrote:
>> On Mon, Nov 7, 2022 at 11:01 AM Derrick Stolee via GitGitGadget
>> <gitgitgadget@gmail.com> wrote:
>>>
>>> Introduction
>>> ============
>>>
>>> I became interested in our packed-ref format based on the asymmetry between
>>> ref updates and ref deletions: if we delete a packed ref, then the
>>> packed-refs file needs to be rewritten. Compared to writing a loose ref,
>>> this is an O(N) cost instead of O(1).
>>>
>>> In this way, I set out with some goals:
>>>
>>>  * (Primary) Make packed ref deletions be nearly as fast as loose ref
>>>    updates.
>> 
>> Performance is always nice.  :-)
>> 
>>>  * (Secondary) Allow using a packed ref format for all refs, dropping loose
>>>    refs and creating a clear way to snapshot all refs at a given point in
>>>    time.
>> 
>> Is this secondary goal the actual goal you have, or just the
>> implementation by which you get the real underlying goal?
>
> To me, the primary goal takes precedence. It turns out that the best
> way to solve for that goal happens to also make it possible to store
> all refs in a packed form, because we can update the packed form
> much faster than our current setup. There are alternatives that I
> considered (and prototyped) that were more specific to the deletions
> case, but they were not actually as fast as the stacked method. Those
> alternatives also would never help reach the secondary goal, but I
> probably would have considered them anyway if they were faster, if
> only for their simplicity.

I have been and am still offline and haven't examined this proposal
in detail, but would it be a better longer-term approach to improve
reftable backend, instead of piling more effort on loose+packed
filesystem based backend?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-18 23:31     ` Junio C Hamano
@ 2022-11-19  0:41       ` Elijah Newren
  2022-11-19  3:00         ` Taylor Blau
  2022-11-30 15:31       ` Derrick Stolee
  1 sibling, 1 reply; 56+ messages in thread
From: Elijah Newren @ 2022-11-19  0:41 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, Derrick Stolee via GitGitGadget, git, jrnieder

On Fri, Nov 18, 2022 at 3:31 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Derrick Stolee <derrickstolee@github.com> writes:
>
> > On 11/11/22 6:28 PM, Elijah Newren wrote:
> >> On Mon, Nov 7, 2022 at 11:01 AM Derrick Stolee via GitGitGadget
> >> <gitgitgadget@gmail.com> wrote:

> I have been and am still offline and haven't examined this proposal
> in detail, but would it be a better longer-term approach to improve
> reftable backend, instead of piling more effort on loose+packed
> filesystem based backend?

Well, Stolee explicitly brought this up multiple times in his cover
letter with various arguments about why he thinks this approach is a
better way to move us on the path towards improved ref handling, and
doesn't see it as excluding the reftable option but just opening us up
to more incremental (and incrementally testable) improvements.  This
question came up early and often in the cover letter; he even ends
with a "Relationship to reftable" section.

But he is clearly open to feedback about whether others agree or
disagree with his thesis.

(I haven't looked much at reftable, so I can't opine on that question,
but Stolee's approach did seem eminently easier to review.  I did have
some questions about his proposal(s) because I didn't quite understand
them, in part due to being unfamiliar with the area.)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-19  0:41       ` Elijah Newren
@ 2022-11-19  3:00         ` Taylor Blau
  0 siblings, 0 replies; 56+ messages in thread
From: Taylor Blau @ 2022-11-19  3:00 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Junio C Hamano, Derrick Stolee, Derrick Stolee via GitGitGadget,
	git, jrnieder

On Fri, Nov 18, 2022 at 04:41:25PM -0800, Elijah Newren wrote:
> (I haven't looked much at reftable, so I can't opine on that question,
> but Stolee's approach did seem eminently easier to review.  I did have
> some questions about his proposal(s) because I didn't quite understand
> them, in part due to being unfamiliar with the area.)

For what it's worth, I'm in the same boat as you are.

That being said, I do find it somewhat sad that we have reftable bits in
Junio's tree that don't appear to be progressing all that much. So as
much as I would like to see us have fewer reference backends, I'd rather
see whatever the "next-gen" backend turns out to be have good support and
momentum, even if that means carrying more code.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
                   ` (31 preceding siblings ...)
  2022-11-11 23:28 ` Elijah Newren
@ 2022-11-28 18:56 ` Han-Wen Nienhuys
  2022-11-30 15:16   ` Derrick Stolee
  32 siblings, 1 reply; 56+ messages in thread
From: Han-Wen Nienhuys @ 2022-11-28 18:56 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, jrnieder, Derrick Stolee, John Cai

On Mon, Nov 7, 2022 at 7:36 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
>
> Introduction
> ============
>
> I became interested in our packed-ref format based on the asymmetry between
> ref updates and ref deletions: if we delete a packed ref, then the
> packed-refs file needs to be rewritten. Compared to writing a loose ref,
> this is an O(N) cost instead of O(1).
>
> In this way, I set out with some goals:
>
>  * (Primary) Make packed ref deletions be nearly as fast as loose ref
>    updates.
>  * (Secondary) Allow using a packed ref format for all refs, dropping loose
>    refs and creating a clear way to snapshot all refs at a given point in
>    time.
>
> I also had one major non-goal to keep things focused:
>
>  * (Non-goal) Update the reflog format.
>
> After carefully considering several options, it seemed that there are two
> solutions that can solve this effectively:
>
>  1. Wait for reftable to be integrated into Git.
>  2. Update the packed-refs backend to have a stacked version.
>
> The reftable work seems currently dormant. The format is pretty complicated
> and I have a difficult time seeing a way forward for it to be fully
> integrated into Git.

The format is somewhat complicated, and I think it would have been
possible to design a block-oriented sorted-table approach that is
simpler, but the JGit implementation has set it in stone. But, to put
this in perspective, the amount of work for getting the format to
read/write correctly has been completely dwarfed by the effort needed
to make the refs API in git represent a true abstraction boundary.
Also, if you're introducing a new format, one might as well try to
optimize it a bit.

Here are some of the hard problems that I encountered:

* Worktrees and the main repository have a separate view of the ref
namespace. This is not explicit in the ref backend API, and there is a
technical limitation that the packed-refs file cannot be in a
worktree. This means that worktrees will always continue to use
loose-ref storage if you only extend the packed-refs backend.

* Symrefs are refs too, but for some reason the packed-refs file
doesn't support them. Does packed-refs v2 support symrefs too?  If you
want to snapshot the state of refs, do you want to snapshot the value
of HEAD too?

* By not changing reflogs, you are making things simpler. (if a
transaction updates the branch that HEAD points to, the reflog for
HEAD has to be updated too. Because reftable updates the reflog
transactionally, this was some extra work)
Then again, I feel the current way that reflogs work is a bit messy,
because directory/file conflicts force reflogs to be deleted at times
that don't make sense from a user-perspective.

* There are a lot of commands that store SHA1s in files under .git/,
and access them as if they are a ref (for example: rebase-apply/ ,
CHERRY_PICK_HEAD etc.).

> In this RFC, I propose a different model that allows for more customization
> and incremental updates. The extensions.refFormat config key is multi-valued
> and defaults to the list of files and packed. In the context of this RFC,
> the intention is to be able to add packed-v2 so the list of all three values
> would allow Git to write and read either file format version (v1 or v2). In
> the larger scheme, the extension could allow restricting to only loose refs
> (just files) or only packed-refs (just packed) or even later when reftable
> is complete, files and reftable could mean that loose refs are the primary
> ref storage, but the reftable format serves as a drop-in replacement for the
> packed-refs file. Not all combinations need to be understood by Git, but

I'm not sure how feasible this is. reftable also holds reflog data. A
setting {files,reftable} would either not work, or necessitate hairy
merging of data to get the reflogs working correctly.

> In order to optimize the write speed of the packed-refs v2 file format, we
> want to write immediately to the file as we stream existing refs from the
> current refs. The current chunk-format API requires computing the chunk
> lengths in advance, which can slow down the write and take more memory than

yes, this sounds sensible. reftable has the secondary indexes trailing the data.

> Between using raw OIDs and storing the depth-2 prefixes only once, this
> format compresses the file to ~60% of its v1 size. (The format allows not
> writing the prefix chunks, and the prefix chunks are implemented after the
> basics of the ref chunks are complete.)
>
> The write times are reduced in a similar fraction to the size difference.
> Reads are sped up somewhat, and we have the potential to do a ref count by

Do you mean 'enumerate refs' ? Why would you want to count refs by prefix?

> I mentioned earlier that I had considered using reftable as a way to achieve
> the stated goals. With the current state of that work, I'm not confident
> that it is the right approach here.
>
> My main worry is that the reftable is more complicated than we need for a
> typical Git repository that is based on a typical filesystem. This makes
> testing the format very critical, and we seem to not be near reaching that
> approach.

I think the base code of reading and writing the reftable format is
quite exhaustively tested in unit tests. You say 'seem', but
do you have anything concrete to say?

> As mentioned, the current extension plan [6] only allows reftable or files
> and does not allow for a mix of both. This RFC introduces the possibility
> that both could co-exist. Using that multi-valued approach means that I'm
> able to test the v2 packed-refs file format almost as well as the v1 file
> format within this RFC. (More tests need to be added that are specific to
> this format, but I'm waiting for confirmation that this is an acceptable
> direction.) At the very least, this multi-valued approach could be used as a
> way to allow using the reftable format as a drop-in replacement for the
> packed-refs file, as well as upgrading an existing repo to use reftable.

The multi-value approach creates more combinations of how
different pieces of code can interact, so I think it actually makes it
more error-prone.
Also,

> That might even help the integration process to allow the reftable format to
> be tested at least by some subset of tests instead of waiting for a full
> test suite update.

I don't understand this comment. In the current state,
https://github.com/git/git/pull/1215 already passes 922 of the 968
test files if you set GIT_TEST_REFTABLE=1.

See https://github.com/git/git/pull/1215#issuecomment-1329579459 for
details. As you can see, for most test files, it's just a few
individual test cases that fail.

> I'm interested to hear from people more involved in the reftable work to see
> the status of that project and how it matches or differs from my
> perspective.

Overall, I found the loose/packed ref code hard to understand and
full of arbitrary limitations (dir/file conflicts, deleting reflogs
when branches are deleted, locking across loose/packed refs etc.).
The way reftable stacks are set up (with both reflog and ref data
including symrefs in the same file) makes it much easier to verify that
it behaves transactionally.

For deleting refs quickly, it seems that you only need to support
$ZEROID in packed-refs and then implement a ref database as a stack of
packed-ref files? If you're going for minimal effort and minimal
disruption wouldn't that be the place to start?
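
To make the suggestion concrete, here is a minimal sketch of such a
stack. It is illustrative Python, not code from Git or from this series:
`resolve`, `delete_ref` and `ZERO_OID` are hypothetical names, and each
layer is modelled as a dict where a real layer would be a sorted
packed-refs file.

```python
ZERO_OID = "0" * 40  # hypothetical tombstone value standing in for $ZEROID


def resolve(stack, refname):
    """Return the OID for refname, or None if it is absent or deleted.

    `stack` is a list of dicts ordered from the base layer to the newest.
    """
    for layer in reversed(stack):  # newest layer wins
        oid = layer.get(refname)
        if oid is None:
            continue  # this layer says nothing about the ref
        return None if oid == ZERO_OID else oid
    return None


def delete_ref(stack, refname):
    """Delete by appending a small layer; the base layer is not rewritten."""
    stack.append({refname: ZERO_OID})


base = {"refs/heads/main": "a" * 40, "refs/heads/topic": "b" * 40}
stack = [base]
delete_ref(stack, "refs/heads/topic")
print(resolve(stack, "refs/heads/topic"))  # deleted, so None
```

The deletion touches only the new top layer, so its cost does not depend
on how many refs the base layer holds.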

You're concerned about the reftable file format (and maybe rightly
so), but if you're changing the file format anyway and you're not
picking reftable, why not create a block-based, indexed format that
can support storing reflog entries at some point in the future too,
rather than build on (the limitations) of packed-refs? Or is
packed-refs v2 backward compatible with v1 (could an old git client
read v2 files? I think not, right?).

The reftable project has gotten into a slump because my work
responsibilities have increased over the last 1.5 years, squeezing down
how much time I have for 'fun' projects. I chatted with John Cai, who
was trying to staff this project out of Gitlab resources. I don't know
where that stands, though.

> The one thing I can say is that if the reftable work had not already begun,
> then this RFC is how I would have approached a new ref format.
>
> I look forward to your feedback!

Hope this helps.


-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-28 18:56 ` Han-Wen Nienhuys
@ 2022-11-30 15:16   ` Derrick Stolee
  2022-11-30 15:38     ` Phillip Wood
                       ` (3 more replies)
  0 siblings, 4 replies; 56+ messages in thread
From: Derrick Stolee @ 2022-11-30 15:16 UTC (permalink / raw)
  To: Han-Wen Nienhuys, Derrick Stolee via GitGitGadget; +Cc: git, jrnieder, John Cai

On 11/28/2022 1:56 PM, Han-Wen Nienhuys wrote:

Han-Wen,

Thanks for taking the time to reply. I was specifically hoping for your
perspective on the ideas here.

> On Mon, Nov 7, 2022 at 7:36 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> After carefully considering several options, it seemed that there are two
>> solutions that can solve this effectively:
>>
>>  1. Wait for reftable to be integrated into Git.
>>  2. Update the packed-refs backend to have a stacked version.
>>
>> The reftable work seems currently dormant. The format is pretty complicated
>> and I have a difficult time seeing a way forward for it to be fully
>> integrated into Git.
>
> The format is somewhat complicated, and I think it would have been
> possible to design a block-oriented sorted-table approach that is
> simpler, but the JGit implementation has set it in stone.

I agree that if we pursue reftable, we should use the format as
agreed upon and implemented in JGit. I do want to say that while I admire
JGit's dedication to being compatible with repositories created by Git, I
don't think the reverse is a goal of the Git project.

> But, to put
> this in perspective, the amount of work for getting the format to
> read/write correctly has been completely dwarfed by the effort needed
> to make the refs API in git represent a true abstraction boundary.
> Also, if you're introducing a new format, one might as well try to
> optimize it a bit.

That's another reason why I was able to make an incremental improvement so
quickly in this RFC: I worked within the existing API, reducing the
overall impact of the change. It's easier to evaluate the performance
difference of packed-refs v2 versus packed-refs v1 because the change is
isolated.

That work to make the Git refs API work with the reftable library is
further ahead than I thought (in your draft PR) but it is also completely
missing from the current Git tree, so that work still needs to be arranged
into a reviewable series before it is available to us. That does seem like
a substantial amount of work, but I might have been overestimating how
much work it will be compared to these changes I am advocating for.

> Here are some of the hard problems that I encountered:

Thanks for including these.

> * Worktrees and the main repository have a separate view of the ref
> namespace. This is not explicit in the ref backend API, and there is a
> technical limitation that the packed-refs file cannot be in a
> worktree. This means that worktrees will always continue to use
> loose-ref storage if you only extend the packed-refs backend.

If I'm understanding it correctly [1], only the special refs (like HEAD or
REBASE_HEAD) are worktree-specific, and all refs under "refs/*" are
repository-scoped. I don't actually think of those special refs as "loose"
refs and thus they should still work under the "only packed-refs" value
for extensions.refFormat. I should definitely cover this in the
documentation, though. Also, [1] probably needs updating because it calls
HEAD a pseudo ref even though it explicitly is not [2].

[1] https://git-scm.com/docs/git-worktree#_refs
[2] https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefpseudorefapseudoref

> * Symrefs are refs too, but for some reason the packed-refs file
> doesn't support them. Does packed-refs v2 support symrefs too?  If you
> want to snapshot the state of refs, do you want to snapshot the value
> of HEAD too?

I forgot that loose refs under .git/refs/ can be symrefs. This definitely
is a limitation that I should mention. Again, pseudorefs like HEAD are not
included and are stored separately, but symrefs within refs/* are not
available in packed-refs (v1 or v2). That should be explicitly called out
in the extensions.refFormat docs.

I imagine that such symrefs are uncommon, and users can make their own
evaluation of whether that use is worth keeping loose refs or not. We can
still have the {files, packed[-v2]} extension value while having a
writing strategy that writes as much as possible into the packed layer.

> * By not changing reflogs, you are making things simpler. (if a
> transaction updates the branch that HEAD points to, the reflog for
> HEAD has to be updated too. Because reftable updates the reflog
> transactionally, this was some extra work)
> Then again, I feel the current way that reflogs work is a bit messy,
> because directory/file conflicts force reflogs to be deleted at times
> that don't make sense from a user-perspective.

I agree that reflogs are messy. I also think that reflogs have different
needs than the ref storage, so separating their needs is valuable.

> * There are a lot of commands that store SHA1s in files under .git/,
> and access them as if they are a ref (for example: rebase-apply/ ,
> CHERRY_PICK_HEAD etc.).

Yes, I think these pseudorefs are stored differently from usual refs, and
hence the {packed[-v2]} extension value would still work, but I'll confirm
this with more testing.

>> In this RFC, I propose a different model that allows for more customization
>> and incremental updates. The extensions.refFormat config key is multi-valued
>> and defaults to the list of files and packed. In the context of this RFC,
>> the intention is to be able to add packed-v2 so the list of all three values
>> would allow Git to write and read either file format version (v1 or v2). In
>> the larger scheme, the extension could allow restricting to only loose refs
>> (just files) or only packed-refs (just packed) or even later when reftable
>> is complete, files and reftable could mean that loose refs are the primary
>> ref storage, but the reftable format serves as a drop-in replacement for the
>> packed-refs file. Not all combinations need to be understood by Git, but
>
> I'm not sure how feasible this is. reftable also holds reflog data. A
> setting {files,reftable} would either not work, or necessitate hairy
> merging of data to get the reflogs working correctly.

In this setup, would it be possible to continue using the "loose reflog"
format while using reftable as the packed layer? I personally think this
combination of formats is critical to upgrading existing repositories
to reftable.

(Note: there is a strategy that doesn't need this approach, but it's a bit
complicated. It would involve rotating all replicas to new repositories
that are configured to use reftable upon creation, getting the refs from
other replicas via fetches. In my opinion, this is prohibitively
expensive.)

>> Between using raw OIDs and storing the depth-2 prefixes only once, this
>> format compresses the file to ~60% of its v1 size. (The format allows not
>> writing the prefix chunks, and the prefix chunks are implemented after the
>> basics of the ref chunks are complete.)
>>
>> The write times are reduced in a similar fraction to the size difference.
>> Reads are sped up somewhat, and we have the potential to do a ref count by
>
> Do you mean 'enumerate refs' ? Why would you want to count refs by prefix?

Generally, I mean these kinds of operations:

* 'git for-each-ref' enumerates all refs within a prefix.

* Serving the ref advertisement enumerates all refs.

* There was a GitHub feature that counted refs and tags, but wanted to
  ignore internal ref prefixes (outside of refs/heads/* or refs/tags/*).
  It turns out that we didn't actually need the full count but an
  existence indicator, but it would be helpful to quickly identify how
  many branches or tags are in a repository at a glance. Packed-refs v1
  requires scanning the whole file while packed-refs v2 does a fixed
  number of binary searches followed by a subtraction of row indexes.
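
The binary-search count in the last item can be sketched as follows.
This is illustrative Python over an in-memory sorted list, not the
packed-refs v2 reader; `count_refs_with_prefix` and
`prefix_upper_bound` are hypothetical names, and the sketch assumes a
non-empty ASCII prefix.

```python
from bisect import bisect_left


def prefix_upper_bound(prefix):
    # Smallest string that sorts after every string starting with `prefix`.
    # Assumes a non-empty ASCII prefix such as "refs/tags/".
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def count_refs_with_prefix(sorted_refnames, prefix):
    # Two binary searches, then a subtraction of row indexes.
    first = bisect_left(sorted_refnames, prefix)
    past_last = bisect_left(sorted_refnames, prefix_upper_bound(prefix))
    return past_last - first


refnames = sorted([
    "refs/heads/main",
    "refs/heads/topic",
    "refs/pull/1/head",
    "refs/tags/v1.0",
    "refs/tags/v1.1",
    "refs/tags/v2.0",
])
print(count_refs_with_prefix(refnames, "refs/heads/"))
print(count_refs_with_prefix(refnames, "refs/tags/"))
```

No refname inside the range is ever read, which is where the saving
over a full scan comes from.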

>> I mentioned earlier that I had considered using reftable as a way to achieve
>> the stated goals. With the current state of that work, I'm not confident
>> that it is the right approach here.
>>
>> My main worry is that the reftable is more complicated than we need for a
>> typical Git repository that is based on a typical filesystem. This makes
>> testing the format very critical, and we seem to not be near reaching that
>> approach.
>
> I think the base code of reading and writing the reftable format is
> quite exhaustively tested in unit tests. You say 'seem', but
> do you have anything concrete to say?

Our test suite is focused on integration tests at the command level. While
unit tests are helpful, I'm not sure if all of the corner cases would be
covered by tests that check Git commands only.

>> As mentioned, the current extension plan [6] only allows reftable or files
>> and does not allow for a mix of both. This RFC introduces the possibility
>> that both could co-exist. Using that multi-valued approach means that I'm
>> able to test the v2 packed-refs file format almost as well as the v1 file
>> format within this RFC. (More tests need to be added that are specific to
>> this format, but I'm waiting for confirmation that this is an acceptable
>> direction.) At the very least, this multi-valued approach could be used as a
>> way to allow using the reftable format as a drop-in replacement for the
>> packed-refs file, as well as upgrading an existing repo to use reftable.
>
> The multi-value approach creates more combinations of how
> different pieces of code can interact, so I think it actually makes it
> more error-prone.

As multiple values are added, it will be important to indicate which
values are not compatible with each other. However, the plan for the
packed-refs improvements does add values that are orthogonal to each other.
It does make testing all combinations more difficult.

Of course, if reftable is truly incompatible with loose refs, then Git can
say that {reftable} is the only set of values that can use reftable, and
make {files, reftable} an incompatible set (which could be understood by
a later version of Git, if those barriers are overcome). However, if we do
not specify the extension as multi-valued from the start, then we cannot
later add this multi-valued option without changing the extension name.
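
A rough sketch of what such a compatibility check could look like is
below. It is illustrative Python, not the parsing code in this series:
`check_ref_formats` is a hypothetical name, and the rule that reftable
must stand alone is only the possibility described above, not a
decided policy.

```python
# Values named in this thread; the set itself is an assumption of the sketch.
KNOWN_FORMATS = {"files", "packed", "packed-v2", "reftable"}


def check_ref_formats(values):
    """Return (ok, reason) for a multi-valued extensions.refFormat setting."""
    formats = set(values)
    unknown = formats - KNOWN_FORMATS
    if unknown:
        return False, "unknown ref format: " + ", ".join(sorted(unknown))
    # Hypothetical rule: reftable may not be mixed with other formats (yet).
    if "reftable" in formats and formats != {"reftable"}:
        return False, "reftable cannot be combined with other formats"
    return True, ""


print(check_ref_formats(["files", "packed", "packed-v2"]))
print(check_ref_formats(["files", "reftable"]))
```

Because the key is multi-valued from the start, relaxing the reftable
rule later would change only this check, not the extension name.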

>> That might even help the integration process to allow the reftable format to
>> be tested at least by some subset of tests instead of waiting for a full
>> test suite update.
>
> I don't understand this comment. In the current state,
> https://github.com/git/git/pull/1215 already passes 922 of the 968
> test files if you set GIT_TEST_REFTABLE=1.
>
> See https://github.com/git/git/pull/1215#issuecomment-1329579459 for
> details. As you can see, for most test files, it's just a few
> individual test cases that fail.

My point is that to get those remaining tests passing requires a
significant update to the test suite. I imagined that the complexity of
that update was the blocker to completing the reftable work.

It seems that my estimation of that complexity was overly high compared to
what you appear to be describing.

>> I'm interested to hear from people more involved in the reftable work to see
>> the status of that project and how it matches or differs from my
>> perspective.
>
> Overall, I found the loose/packed ref code hard to understand and
> full of arbitrary limitations (dir/file conflicts, deleting reflogs
> when branches are deleted, locking across loose/packed refs etc.).
> The way reftable stacks are set up (with both reflog and ref data
> including symrefs in the same file) makes it much easier to verify that
> it behaves transactionally.

I believe you that starting with a new data model makes many of these
things easier to reason about.

> For deleting refs quickly, it seems that you only need to support
> $ZEROID in packed-refs and then implement a ref database as a stack of
> packed-ref files? If you're going for minimal effort and minimal
> disruption wouldn't that be the place to start?

I disagree that jumping straight to stacked packed-refs is minimal effort
or minimal disruption.

Creating the stack approach does require changing the semantics of the
packed-refs format to include $ZEROID, which will modify some meanings in
the iteration code. The use of a stack, as well as how layers are combined
during a ref write or during maintenance, adds fairly involved
complications to the locking semantics.
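
To illustrate the change in iteration semantics, here is a minimal
sketch in Python. It is not code from Git: `iterate_refs`, `compact`
and `ZERO_OID` are hypothetical names, layers are modelled as dicts,
and locking is left out entirely.

```python
ZERO_OID = "0" * 40  # hypothetical tombstone value standing in for $ZEROID


def iterate_refs(stack):
    """Yield (refname, oid) in sorted order; newer layers override older ones.

    Entries whose newest value is the tombstone are hidden from callers.
    """
    merged = {}
    for layer in stack:  # base layer first, so later layers overwrite
        merged.update(layer)
    for name in sorted(merged):
        if merged[name] != ZERO_OID:
            yield name, merged[name]


def compact(stack):
    """Collapse all layers into one, dropping tombstones (maintenance step)."""
    return [dict(iterate_refs(stack))]


stack = [
    {"refs/heads/a": "1" * 40, "refs/heads/b": "2" * 40},
    {"refs/heads/b": ZERO_OID},
    {"refs/heads/c": "3" * 40},
]
print([name for name, _ in iterate_refs(stack)])
```

Every reader has to apply the override and tombstone rules, and
`compact` replaces several files at once, which is the part that needs
care once concurrent writers are involved.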

By contrast, the v2 format is isolated to the on-disk format. None of the
writing or reading semantics are changed in terms of which files to look
at or write in which order. Instead, it's relatively simple to see from
the format exactly how it reduces the file size but otherwise has exactly
the same read/write behavior. In fact, since the refs and OIDs are all
located in the same chunk in a similar order to the v1 file, we can even
deduce that page cache semantics will only improve in the new format.

The reason to start with this step is that the benefits and risks are
clearly understood, which can motivate us to establish the mechanism for
changing the ref format by defining the extension.

> You're concerned about the reftable file format (and maybe rightly
> so), but if you're changing the file format anyway and you're not
> picking reftable, why not create a block-based, indexed format that
> can support storing reflog entries at some point in the future too,
> rather than build on (the limitations) of packed-refs?

My personal feeling is that storing ref tips and storing the history of a
ref are sufficiently different problems that they should have their own
data structures. Even if they could be combined by a common format, I don't
think it is safe to transition every part of every ref operation to a new
format all at once.

Looking at reftable from the perspective of a hosting provider, I'm very
hesitant to recommend transitioning to it because of how it is an "all or
nothing" switch. It does not fit with my expectations for safe deployment
practices.

Yes, packed-refs have some limitations, but those limitations are known
and we are working within them right now. I'd rather make a change to
write smaller versions of the file with the same semantics as a first
step.

> Or is
> packed-refs v2 backward compatible with v1 (could an old git client
> read v2 files? I think not, right?).

No, it is not backward compatible. That's why the extension is needed.

> The reftable project has gotten into a slump because my work
> responsibilities have increased over the last 1.5 years, squeezing down
> how much time I have for 'fun' projects. I chatted with John Cai, who
> was trying to staff this project out of Gitlab resources. I don't know
> where that stands, though.

I'll have my EM reach out to John to see where that stands and how we
can coordinate in this space.

>> The one thing I can say is that if the reftable work had not already begun,
>> then this RFC is how I would have approached a new ref format.
>>
>> I look forward to your feedback!
>
> Hope this helps.

It does help clarify where the reftable project currently stands as well
as some key limitations of the packed-refs format. You've given me a lot
to think about so I'll do some poking around in your branch (and do some
performance tests) to see what I can make of it.

Let me attempt to summarize my understanding, now that you've added
clarity:

* The reftable work needs its refs backend implemented, but your draft PR
  has a prototype of this and some basic test suite integration. There are
  54 test files that have one or more failing tests, and likely these just
  need to be adjusted to not care about loose references.

* The reftable is currently fundamentally different enough that it could
  not be used as a replacement for the packed-refs file underneath loose
  refs (primarily due to its integration with the reflog). Doing so would
  require significant work on top of your prototype.

* This further indicates that moving to reftable is an "all or nothing"
  transition, and even requires starting a repository from scratch with
  reftable enabled. This is a bit of a blocker for a hosting provider to
  transition to the format, and will likely be difficult for clients to
  adopt the feature.

* The plan established by this RFC does _not_ block reftable progress, but
  generally we prefer not having competing formats in Git, so it would be
  better to have only one, unless there is enough of a justification to
  have different formats for different use cases.

I'm going to take the following actions on my end to better understand the
situation:

1. I'll take your draft PR branch and do some performance evaluations on
   the speed of ref updates compared to loose refs and my prototype of a
   two-stack packed-ref where the second layer of the stack is only for
   deleted refs.

2. I'll consult with my peers to determine how expensive it would be to
   roll out reftable via a complete replacement of our hosted
   repositories. I'll also try to discover ways to roll out the feature to
   subsets of the fleet to create a safe deployment strategy.

3. My EM and I will reach out to John Cai to learn about plans to push
   reftable over the finish line.

4. I will split out the "skip_hash" part of this RFC into its own series,
   after adding the necessary details to fsck to understand a null
   trailing hash.
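
For item 4, the check that a verifier would need can be sketched as
below. This is illustrative Python, not Git's fsck or index code:
`write_with_trailer` and `verify_trailer` are hypothetical names, and
the sketch assumes a 20-byte SHA-1 trailer with all-zero bytes meaning
"hash skipped".

```python
import hashlib

HASH_LEN = 20  # SHA-1 digest length; assumes a SHA-1 repository


def write_with_trailer(payload, skip_hash=False):
    # When hashing is skipped, a null (all-zero) trailer is written instead.
    trailer = bytes(HASH_LEN) if skip_hash else hashlib.sha1(payload).digest()
    return payload + trailer


def verify_trailer(data):
    if len(data) < HASH_LEN:
        return False
    payload, trailer = data[:-HASH_LEN], data[-HASH_LEN:]
    if trailer == bytes(HASH_LEN):
        return True  # null trailer: the writer skipped the hash
    return hashlib.sha1(payload).digest() == trailer


hashed = write_with_trailer(b"example payload")
skipped = write_with_trailer(b"example payload", skip_hash=True)
print(verify_trailer(hashed), verify_trailer(skipped))
```

A verifier without the null-trailer branch would report the skipped
file as corrupt, which is the confusion with older 'git fsck' versions.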

Please let me know if I'm missing anything I should be investigating here.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-18 23:31     ` Junio C Hamano
  2022-11-19  0:41       ` Elijah Newren
@ 2022-11-30 15:31       ` Derrick Stolee
  1 sibling, 0 replies; 56+ messages in thread
From: Derrick Stolee @ 2022-11-30 15:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren, Derrick Stolee via GitGitGadget, git, jrnieder,
	Han-Wen Nienhuys, Taylor Blau

On 11/18/2022 6:31 PM, Junio C Hamano wrote:

> I have been and am still offline and haven't examined this proposal
> in detail, but would it be a better longer-term approach to improve
> reftable backend, instead of piling more effort on loose+packed
> filesystem based backend?

If reftable was complete and stable, then I would have carefully examined
it to check that it solves the problems at hand. I interpreted the lack of
progress in the area to be due to significant work required or hard
problems blocking its completion. That appears to be a wrong assumption,
so we are exploring what that will take to get it complete.

I am still wary of it requiring a clean slate and not having any way to
upgrade from an existing repository to one with reftable. I'm going to
reevaluate this to see how expensive it would be to upgrade to reftable
and how we can deploy that change safely.

These upgrade concerns may require us to eventually consider a world where
we can upgrade a repository by replacing the packed-refs file with a
reftable file, then later removing the ability to read or write loose
refs. To do so might benefit from the multi-valued extensions.refFormat
that is proposed in this RFC, even if packed-v2 does not become a
recognized value.

My personal opinion is that if reftable was not already implemented in
JGit and was not already partially contributed to Git, then we would not
choose that format or that "all or nothing" upgrade path. Instead,
incremental improvements on the existing ref formats are easier to
understand and test in parts.

But my opinion is not the most important one. I'll defer to the
community in this. I thought it worthwhile to present an alternative.

Thanks,
-Stolee


* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-30 15:16   ` Derrick Stolee
@ 2022-11-30 15:38     ` Phillip Wood
  2022-11-30 16:37     ` Taylor Blau
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 56+ messages in thread
From: Phillip Wood @ 2022-11-30 15:38 UTC (permalink / raw)
  To: Derrick Stolee, Han-Wen Nienhuys, Derrick Stolee via GitGitGadget
  Cc: git, jrnieder, John Cai

Hi Stolee

On 30/11/2022 15:16, Derrick Stolee wrote:
> On 11/28/2022 1:56 PM, Han-Wen Nienhuys wrote:
>> * Worktrees and the main repository have a separate view of the ref
>> namespace. This is not explicit in the ref backend API, and there is a
>> technical limitation that the packed-refs file cannot be in a
>> worktree. This means that worktrees will always continue to use
>> loose-ref storage if you only extend the packed-refs backend.
> 
> If I'm understanding it correctly [1], only the special refs (like HEAD or
> REBASE_HEAD) are worktree-specific, and all refs under "refs/*" are
> repository-scoped. I don't actually think of those special refs as "loose"
> refs and thus they should still work under the "only packed-refs" value
> for extensions.refFormat. I should definitely cover this in the
> documentation, though. Also, [1] probably needs updating because it calls
> HEAD a pseudo ref even though it explicitly is not [2].
>
> [1] https://git-scm.com/docs/git-worktree#_refs
> [2] https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefpseudorefapseudoref

Unfortunately I think it is a little messier than that (see 
refs.c:is_per_worktree_ref()). refs/bisect/*, refs/rewritten/* and 
refs/worktree/* are all worktree specific.
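
A minimal sketch of the classification described above, written in
Python for illustration only. It is not a port of
refs.c:is_per_worktree_ref(); the prefix list is limited to the three
hierarchies named in this message, and treating every name outside
refs/ as per-worktree is an assumption taken from the quoted
discussion of special refs such as HEAD.

```python
# Hierarchies named in this message as worktree-specific. The real check
# in refs.c may cover more cases than this illustration does.
PER_WORKTREE_PREFIXES = ("refs/bisect/", "refs/rewritten/", "refs/worktree/")


def is_per_worktree(refname: str) -> bool:
    """Return True if refname is worktree-specific under the rules sketched here."""
    if not refname.startswith("refs/"):
        # Assumption: special refs such as HEAD or REBASE_HEAD live outside
        # refs/ and are per-worktree, as the quoted text suggests.
        return True
    # str.startswith accepts a tuple and matches any of its entries.
    return refname.startswith(PER_WORKTREE_PREFIXES)
```

A backend that stores only repository-scoped refs in a packed file
would need a check of this kind to decide which names must stay
outside that file.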

Best Wishes

Phillip

* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-30 15:16   ` Derrick Stolee
  2022-11-30 15:38     ` Phillip Wood
@ 2022-11-30 16:37     ` Taylor Blau
  2022-11-30 18:30     ` Han-Wen Nienhuys
  2022-11-30 22:55     ` Junio C Hamano
  3 siblings, 0 replies; 56+ messages in thread
From: Taylor Blau @ 2022-11-30 16:37 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Han-Wen Nienhuys, Derrick Stolee via GitGitGadget, git, jrnieder,
	John Cai

On Wed, Nov 30, 2022 at 10:16:52AM -0500, Derrick Stolee wrote:
> > Do you mean 'enumerate refs' ? Why would you want to count refs by prefix?
>
> * There was a GitHub feature that counted refs and tags, but wanted to
>   ignore internal ref prefixes (outside of refs/heads/* or refs/tags/*).
>   It turns out that we didn't actually need the full count but an
>   existence indicator, but it would be helpful to quickly identify how
>   many branches or tags are in a repository at a glance. Packed-refs v1
>   requires scanning the whole file while packed-refs v2 does a fixed
>   number of binary searches followed by a subtraction of row indexes.

True. On the surface, it seemed odd to use a function which returns
something like:

    { "refs/heads/*" => NNNN, "refs/tags/*" => MMMM }

only to check whether or not NNNN and MMMM are zero or non-zero.

But there's a little more to the story. That emptiness check does occur
at the beginning of many page loads. But when it responds "non-empty",
we then care about how many branches and tags there actually are.

So calling count_refs() (the name of the internal RPC that powers all of
this) was an optimization written under the assumption that we actually
are going to ask about the exact number of branches/tags very shortly
after querying for emptiness.

It turns out that empirically it's faster to do something like:

    $ git show-ref --[heads|tags] | head -n 1

to check if there are any branches and tags at all[^1], and then a
follow up 'git show-ref --heads | wc -l' to check how many there are.

But it would be nice to do both operations quickly without having
to actually scan all of the entries in each prefix.
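
To illustrate the "fixed number of binary searches followed by a
subtraction of row indexes" idea quoted above, here is a small Python
sketch over a sorted in-memory list. It is only a model: the function
names are hypothetical, the list stands in for the sorted ref chunk of
a packed-refs v2 file, and it assumes ASCII ref names so that Python's
string ordering matches a bytewise sort.

```python
from bisect import bisect_left


def prefix_range(sorted_refs, prefix):
    """Return (lo, hi) such that sorted_refs[lo:hi] all start with prefix."""
    lo = bisect_left(sorted_refs, prefix)
    # Smallest string that sorts after every string starting with prefix:
    # bump the final character by one code point ("refs/heads/" -> "refs/heads0").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect_left(sorted_refs, upper)
    return lo, hi


def count_refs(sorted_refs, prefix):
    """Count refs under prefix with two binary searches and one subtraction."""
    lo, hi = prefix_range(sorted_refs, prefix)
    return hi - lo


def has_refs(sorted_refs, prefix):
    """Emptiness check that reuses the same two searches."""
    return count_refs(sorted_refs, prefix) > 0
```

Both the emptiness check and the exact count cost O(log N) here,
instead of a scan over every entry under the prefix.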

Thanks,
Taylor

[^1]: Some may remember my series in
  https://lore.kernel.org/git/cover.1654552560.git.me@ttaylorr.com/
  which replaced '| head -n 1' with a '--count=1' option. This matches
  what GitHub runs in production where piping one command to another
  from Ruby is unfortunately quite complicated.

* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-30 15:16   ` Derrick Stolee
  2022-11-30 15:38     ` Phillip Wood
  2022-11-30 16:37     ` Taylor Blau
@ 2022-11-30 18:30     ` Han-Wen Nienhuys
  2022-11-30 18:37       ` Sean Allred
  2022-12-01 20:18       ` Derrick Stolee
  2022-11-30 22:55     ` Junio C Hamano
  3 siblings, 2 replies; 56+ messages in thread
From: Han-Wen Nienhuys @ 2022-11-30 18:30 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Derrick Stolee via GitGitGadget, git, jrnieder, John Cai

On Wed, Nov 30, 2022 at 4:16 PM Derrick Stolee <derrickstolee@github.com> wrote:
> > * Symrefs are refs too, but for some reason the packed-refs file
> > doesn't support them. Does packed-refs v2 support symrefs too?  If you
> > want to snapshot the state of refs, do you want to snapshot the value
> > of HEAD too?
>
> I forgot that loose refs under .git/refs/ can be symrefs. This definitely
> is a limitation that I should mention. Again, pseudorefs like HEAD are not
> included and are stored separately, but symrefs within refs/* are not
> available in packed-refs (v1 or v2). That should be explicitly called out
> in the extensions.refFormat docs.
>
> I imagine that such symrefs are uncommon, and users can make their own
> evaluation of whether that use is worth keeping loose refs or not. We can
> still have the {files, packed[-v2]} extension value while having a
> writing strategy that writes as much as possible into the packed layer.

To be honest, I don't understand why symrefs are such a generic
concept; I've only ever seen them used for HEAD.

> > * By not changing reflogs, you are making things simpler. (if a
> > transaction updates the branch that HEAD points to, the reflog for
> > HEAD has to be updated too. Because reftable updates the reflog
> > transactionally, this was some extra work)
> > Then again, I feel the current way that reflogs work is a bit messy,
> > because directory/file conflicts force reflogs to be deleted at times
> > that don't make sense from a user-perspective.
>
> I agree that reflogs are messy. I also think that reflogs have different
> needs than the ref storage, so separating their needs is valuable.

If the reflog records the history of the ref database, then ideally,
an update of a ref should be transactional across the ref database and
the reflog. I think you can never make this work unless you tie the
storage of both together.

I can't judge how many hosting providers really care about this. At
Google, we really care, but we keep the ref database and the reflog
in a global Spanner database. Reftable is only used for per-datacenter
serving. (I discovered some bugs in the JGit reflog code when I ported
it to local filesystem repos, because it was never exercised at
Google.)

> > * There are a lot of commands that store SHA1s in files under .git/,
> > and access them as if they are a ref (for example: rebase-apply/ ,
> > CHERRY_PICK_HEAD etc.).
>
> Yes, I think these pseudorefs are stored differently from usual refs, and
> hence the {packed[-v2]} extension value would still work, but I'll confirm
> this with more testing.

They will work as long as you keep support for loose refs, because
there is no distinction between "an entry in the ref database" and "any
file randomly written into .git/".

> >> In this RFC, I propose a different model that allows for more customization
> >> and incremental updates. The extensions.refFormat config key is multi-valued
> >> and defaults to the list of files and packed. In the context of this RFC,
> >> the intention is to be able to add packed-v2 so the list of all three values
> >> would allow Git to write and read either file format version (v1 or v2). In
> >> the larger scheme, the extension could allow restricting to only loose refs
> >> (just files) or only packed-refs (just packed) or even later when reftable
> >> is complete, files and reftable could mean that loose refs are the primary
> >> ref storage, but the reftable format serves as a drop-in replacement for the
> >> packed-refs file. Not all combinations need to be understood by Git, but
> >
> > I'm not sure how feasible this is. reftable also holds reflog data. A
> > setting {files,reftable} would either not work, or necessitate hairy
> > merging of data to get the reflogs working correctly.
>
> In this setup, would it be possible to continue using the "loose reflog"
> format while using reftable as the packed layer? I personally think this
> combination of formats to be critical to upgrading existing repositories
> to reftable.

I suppose so? If you only store refs and tags (and don't handle
reflogs, symrefs or use the inverse object mapping) then the reftable
file format is just a highly souped-up version of packed-refs.

> (Note: there is a strategy that doesn't need this approach, but it's a bit
> complicated. It would involve rotating all replicas to new repositories
> that are configured to use reftable upon creation, getting the refs from
> other replicas via fetches. In my opinion, this is prohibitively
> expensive.)

I'm not sure I understand the problem. Any deletion of a ref (that is
in packed-refs) today already requires rewriting the entire
packed-refs file ("all or nothing" operation). Whether you write a
packed-refs or reftable is roughly equally expensive.

Are you looking for a way to upgrade a repo while concurrent Git
processes may write updates into the repository during the update? That
may be hard to pull off, because you probably need to rename more than
one file atomically. If you accept momentarily failed writes, you
could do

* rename refs/ to refs.old/ (loose ref writes will fail now)
* collect loose refs under refs.old/ , put into packed-refs
* populate the reftable/ dir
* set refFormat extension.
* rename refs.old/ to refs/ with refs/heads as a file (as described in
the reftable spec.)

See also https://gerrit.googlesource.com/jgit/+/ca166a0c62af2ea87fdedf2728ac19cb59a12601/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/FileRepository.java#734
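
As a rough illustration of the first two steps in that list, the
Python sketch below moves refs/ aside and gathers the loose refs it
finds. The function names are hypothetical. It deliberately ignores an
existing packed-refs file and peeled tag entries, and it reports
symrefs separately because, as noted earlier in this message,
packed-refs cannot store them. Populating reftable/ and setting the
extension are not shown.

```python
import os


def collect_loose_refs(git_dir):
    """Rename refs/ to refs.old/, then read every loose ref under it.

    Returns (direct, symbolic): refname -> object id, and refname -> target.
    """
    refs_dir = os.path.join(git_dir, "refs")
    old_dir = os.path.join(git_dir, "refs.old")
    # After this rename, writers that expect refs/ to exist will fail.
    os.rename(refs_dir, old_dir)

    direct, symbolic = {}, {}
    for root, _dirs, files in os.walk(old_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, old_dir).replace(os.sep, "/")
            refname = "refs/" + rel
            with open(path, encoding="utf-8") as f:
                value = f.read().strip()
            if value.startswith("ref: "):
                # Symbolic ref: keep the target, it cannot go into packed-refs.
                symbolic[refname] = value[len("ref: "):]
            else:
                direct[refname] = value
    return direct, symbolic


def packed_refs_lines(direct):
    """Format direct refs as sorted '<oid> <refname>' lines."""
    return [f"{oid} {name}" for name, oid in sorted(direct.items())]
```

The symbolic map makes the gap visible: anything left in it needs a
home other than packed-refs before loose storage can be dropped.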

> >> I mentioned earlier that I had considered using reftable as a way to achieve
> >> the stated goals. With the current state of that work, I'm not confident
> >> that it is the right approach here.
> >>
> >> My main worry is that the reftable is more complicated than we need for a
> >> typical Git repository that is based on a typical filesystem. This makes
> >> testing the format very critical, and we seem to not be near reaching that
> >> approach.
> >
> > I think the base code of reading and writing the reftable format is
> > exercised quite exhaustively in unit tests. You say 'seem', but
> > do you have anything concrete to say?
>
> Our test suite is focused on integration tests at the command level. While
> unit tests are helpful, I'm not sure if all of the corner cases would be
> covered by tests that check Git commands only.

It's actually easier to test all of the nooks of the format through
unit tests, because you can tweak parameters (e.g. blocksize) that
aren't normally available on the command line.

> >> That might even help the integration process to allow the reftable format to
> >> be tested at least by some subset of tests instead of waiting for a full
> >> test suite update.
> >
> > I don't understand this comment. In the current state,
> > https://github.com/git/git/pull/1215 already passes 922 of the 968
> > test files if you set GIT_TEST_REFTABLE=1.
> >
> > See https://github.com/git/git/pull/1215#issuecomment-1329579459 for
> > details. As you can see, for most test files, it's just a few
> > individual test cases that fail.
>
> My point is that to get those remaining tests passing requires a
> significant update to the test suite. I imagined that the complexity of
> that update was the blocker to completing the reftable work.
>
> It seems that my estimation of that complexity was overly high compared to
> what you appear to be describing.

To be honest, I'm not quite sure how significant the work is: for
things like worktrees, it wasn't that obvious to me how things should
work in the first place. That makes it hard to make estimates. I
thought there might be a month of full-time work left, but these days
I can barely make a couple of hours of time per week to work on
reftable, if at all.

> > For deleting refs quickly, it seems that you only need to support
> > $ZEROID in packed-refs and then implement a ref database as a stack of
> > packed-ref files? If you're going for minimal effort and minimal
> > disruption wouldn't that be the place to start?
>
> I disagree that jumping straight to stacked packed-refs is minimal effort
> or minimal disruption.
>
> Creating the stack approach does require changing the semantics of the
> packed-refs format to include $ZEROID, which will modify some meanings in
> the iteration code. The use of a stack, as well as how layers are combined
> during a ref write or also during maintenance, adds complications to the
> locking semantics that are decently complicated.
>
> By contrast, the v2 format is isolated to the on-disk format. None of the
> writing or reading semantics are changed in terms of which files to look
> at or write in which order. Instead, it's relatively simple to see from
> the format exactly how it reduces the file size but otherwise has exactly
> the same read/write behavior. In fact, since the refs and OIDs are all
> located in the same chunk in a similar order to the v1 file, we can even
> deduce that page cache semantics will only improve in the new format.
>
> The reason to start with this step is that the benefits and risks are
> clearly understood, which can motivate us to establish the mechanism for
> changing the ref format by defining the extension.

I believe that the v2 format is a safe change with performance
improvements, but it's a backward incompatible format change with only
modest payoff. I also don't understand how it will help you do a stack
of tables, which you need for your primary goal (i.e.
transactions/deletions writing only the delta, rather than rewriting
the whole file).

> > You're concerned about the reftable file format (and maybe rightly
> > so), but if you're changing the file format anyway and you're not
> > picking reftable, why not create a block-based, indexed format that
> > can support storing reflog entries at some point in the future too,
> > rather than build on (the limitations) of packed-refs?
>
> My personal feeling is that storing ref tips and storing the history of a
> ref are sufficiently different problems that should have their own data
> structures. Even if they could be combined by a common format, I don't
> think it is safe to transition every part of every ref operation to a new
> format all at once.
>
> Looking at reftable from the perspective of a hosting provider, I'm very
> hesitant to recommend transitioning to it because of how it is an "all or
> nothing" switch. It does not fit with my expectations for safe deployment
> practices.

You'd have to consult with your SRE team on how best to do this, but
here's my $.02. If you are a hosting provider, I assume you have 3 or
5 copies of each repo in different datacenters for
redundancy/availability. You could have one of the datacenters use the
new format for a while, and see if there are any errors or discrepancies
(both in terms of data consistency and latency metrics).

> * The reftable work needs its refs backend implemented, but your draft PR
>   has a prototype of this and some basic test suite integration. There are
>   54 test files that have one or more failing tests, and likely these just
>   need to be adjusted to not care about loose references.
>
> * The reftable is currently fundamentally different enough that it could
>   not be used as a replacement for the packed-refs file underneath loose
>   refs (primarily due to its integration with the reflog). Doing so would
>   require significant work on top of your prototype.

It could, but I don't see the point.

> I'm going to take the following actions on my end to better understand the
> situation:
>
> 1. I'll take your draft PR branch and do some performance evaluations on
>    the speed of ref updates compared to loose refs and my prototype of a
>    two-stack packed-ref where the second layer of the stack is only for
>    deleted refs.

(tangent) - wouldn't that design perform poorly once the number of
deletions gets large? You'd basically have to rewrite the
deleted-packed-refs file all the time.

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--
Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Paul Manicle, Liana Sebastian

* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-30 18:30     ` Han-Wen Nienhuys
@ 2022-11-30 18:37       ` Sean Allred
  2022-12-01 20:18       ` Derrick Stolee
  1 sibling, 0 replies; 56+ messages in thread
From: Sean Allred @ 2022-11-30 18:37 UTC (permalink / raw)
  To: Han-Wen Nienhuys
  Cc: Derrick Stolee, Derrick Stolee via GitGitGadget, git, jrnieder,
	John Cai


Han-Wen Nienhuys <hanwen@google.com> writes:
> To be honest, I don't understand why symrefs are such a generic
> concept; I've only ever seen them used for HEAD.

I've been only lurking in this thread (and loosely following along,
even!) but I do want to call out that I have recently considered perhaps
abusing symrefs to point to normal feature branches. In our workflow, we
have documentation records identified by a numeric ID -- the code
changes corresponding to that documentation (testing instructions, etc.)
use formulaic branch names like `feature/123456`.

It is sometimes beneficial for two or more of these documentation
records to perform their work on the same code branch. There are myriad
reasons for this, some better than others, but I want to avoid getting
mired in whether or not this is a good idea. It does happen and is
sometimes even the best way to do it.

In these scenarios, I've considered having `feature/2` be a symref to
`feature/1` so that both features can always 'know' what to call their
branch for operations like checkout. I've done this on a smaller scale
in the past to great effect.

Nothing is set in stone here for us, but I did want to call this out as
a potential real-world use case.

--
Sean Allred

* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-30 15:16   ` Derrick Stolee
                       ` (2 preceding siblings ...)
  2022-11-30 18:30     ` Han-Wen Nienhuys
@ 2022-11-30 22:55     ` Junio C Hamano
  3 siblings, 0 replies; 56+ messages in thread
From: Junio C Hamano @ 2022-11-30 22:55 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Han-Wen Nienhuys, Derrick Stolee via GitGitGadget, git, jrnieder,
	John Cai

Derrick Stolee <derrickstolee@github.com> writes:

> I do want to say that while I admire
> JGit's dedication to being compatible with repositories created by Git, I
> don't think the reverse is a goal of the Git project.

The world works better if cross-pollination happens both ways,
though.

>> * Symrefs are refs too, but for some reason the packed-refs file
>> doesn't support them. Does packed-refs v2 support symrefs too?  If you
>> want to snapshot the state of refs, do you want to snapshot the value
>> of HEAD too?
>
> I forgot that loose refs under .git/refs/ can be symrefs. This definitely
> is a limitation that I should mention. Again, pseudorefs like HEAD are not
> included and are stored separately, but symrefs within refs/* are not
> available in packed-refs (v1 or v2). That should be explicitly called out
> in the extensions.refFormat docs.

I expect that, in a typical individual-contributor repository, there
are at least two symbolic refs, e.g.

    .git/HEAD
    .git/refs/remotes/origin/HEAD

Having to fall back on the loose ref hierarchy only to be able to
store the latter is a bit of shame---as long as you are revamping
the format, the design should allow us to eventually migrate all
refs to the new format without having to do the "check there, and if
there isn't then check this other place", which is what the current
loose + packed combination does, I would think.

* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-11-30 18:30     ` Han-Wen Nienhuys
  2022-11-30 18:37       ` Sean Allred
@ 2022-12-01 20:18       ` Derrick Stolee
  2022-12-02 16:46         ` Han-Wen Nienhuys
  1 sibling, 1 reply; 56+ messages in thread
From: Derrick Stolee @ 2022-12-01 20:18 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: Derrick Stolee via GitGitGadget, git, jrnieder, John Cai

On 11/30/2022 1:30 PM, Han-Wen Nienhuys wrote:
> On Wed, Nov 30, 2022 at 4:16 PM Derrick Stolee <derrickstolee@github.com> wrote:
>> (Note: there is a strategy that doesn't need this approach, but it's a bit
>> complicated. It would involve rotating all replicas to new repositories
>> that are configured to use reftable upon creation, getting the refs from
>> other replicas via fetches. In my opinion, this is prohibitively
>> expensive.)
> 
> I'm not sure I understand the problem. Any deletion of a ref (that is
> in packed-refs) today already requires rewriting the entire
> packed-refs file ("all or nothing" operation). Whether you write a
> packed-refs or reftable is roughly equally expensive.
> 
> Are you looking for a way to upgrade a repo while concurrent Git
> processes may write updates into the repository during the update? That
> may be hard to pull off, because you probably need to rename more than
> one file atomically. If you accept momentarily failed writes, you
> could do
> 
> * rename refs/ to refs.old/ (loose ref writes will fail now)
> * collect loose refs under refs.old/ , put into packed-refs
> * populate the reftable/ dir
> * set refFormat extension.
> * rename refs.old/ to refs/ with refs/heads as a file (as described in
> the reftable spec.)
>
> See also https://gerrit.googlesource.com/jgit/+/ca166a0c62af2ea87fdedf2728ac19cb59a12601/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/FileRepository.java#734

Yes, I would ideally like for the repository to "upgrade" its ref
storage mechanism during routine maintenance in a non-blocking way
while other writes and reads continue as normal.

After discussing it a bit internally, we _could_ avoid the "rotate
the replicas" solution if there was a "git upgrade-ref-format"
command that could switch from one to another, but it would still
involve pulling that replica out of the rotation and then having
it catch up to the other replicas after that is complete. If I'm
reading your draft correctly, that is not currently available in
your work, but we could add it after the fact.

Requiring pulling replicas out of rotation is still a bit heavy-
handed for my liking, but it's much less expensive than moving
all of the Git data.

>> The reason to start with this step is that the benefits and risks are
>> clearly understood, which can motivate us to establish the mechanism for
>> changing the ref format by defining the extension.
> 
> I believe that the v2 format is a safe change with performance
> improvements, but it's a backward incompatible format change with only
> modest payoff. I also don't understand how it will help you do a stack
> of tables, which you need for your primary goal (i.e.
> transactions/deletions writing only the delta, rather than rewriting
> the whole file).

The v2 format doesn't help me on its own, but it has other benefits
in terms of size and speed, as well as the "ref count" functionality.

The important thing is that the definition of extensions.refFormat
that I'm proposing in this RFC establishes a way to make incremental
progress on the ref format, allowing the stacked format to come in
later with less friction.
 
>> * The reftable is currently fundamentally different enough that it could
>>   not be used as a replacement for the packed-refs file underneath loose
>>   refs (primarily due to its integration with the reflog). Doing so would
>>   require significant work on top of your prototype.
> 
> It could, but I don't see the point.

My point is that we can upgrade repositories by replacing packed-refs
with reftable during routine maintenance instead of the heavier
approaches discussed earlier.

* Step 1: replace packed-refs with reftable.
* Step 2: stop writing loose refs, only update reftable (but still read loose refs).
* Step 3: collapse all loose refs into reftable, stop reading or writing loose refs.
 
>> I'm going to take the following actions on my end to better understand the
>> situation:
>>
>> 1. I'll take your draft PR branch and do some performance evaluations on
>>    the speed of ref updates compared to loose refs and my prototype of a
>>    two-stack packed-ref where the second layer of the stack is only for
>>    deleted refs.
> 
> (tangent) - wouldn't that design perform poorly once the number of
> deletions gets large? You'd basically have to rewrite the
> deleted-packed-refs file all the time.
 
We have regular maintenance that is triggered by pushes that rewrites
the packed-refs file frequently, anyway. The maintenance currently is
blocked on the amount of time spent repacking object data, so a large
number of ref updates can come in during this process. (That maintenance
step would collapse the deleted-refs layer into the base layer.)

I've tested a simple version of this stack that shows that rewriting the
file with 1,000 deletions is still within 2x the cost of updating a loose
ref, so it solves the immediate problem using a much simpler stack model,
at least in the most-common case where ref deletions are less frequent
than other updates. Even if the size outgrew the 2x cost limit, the
deleted file is still going to be much smaller than the base packed-refs
file, which is currently rewritten for every deletion, so it is still an
improvement.

The more complicated stack model would be required to funnel all ref
updates into that structure and away from loose refs.
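
The two-layer model described above can be pictured with a small
in-memory Python sketch. This is an illustration under assumptions,
not the prototype: the class and method names are hypothetical, the
dictionaries stand in for on-disk files, and the lookup order (loose
first, then the deleted layer, then the base layer) is an assumption
about how the layers would combine.

```python
class TwoLayerPackedRefs:
    """Toy model: loose refs over a deleted-refs layer over a base layer."""

    def __init__(self, base):
        self.base = dict(base)  # large layer, rewritten only by maintenance
        self.deleted = set()    # small layer, rewritten on each packed delete
        self.loose = {}         # loose refs, one cheap write per update

    def update(self, name, oid):
        # Updates stay as cheap loose-ref writes.
        self.loose[name] = oid

    def delete(self, name):
        self.loose.pop(name, None)
        if name in self.base:
            # Record the deletion in the small layer instead of rewriting base.
            self.deleted.add(name)

    def resolve(self, name):
        if name in self.loose:
            return self.loose[name]
        if name in self.deleted:
            return None
        return self.base.get(name)

    def collapse_deleted(self):
        # Maintenance step: fold the deleted layer into the base layer.
        for name in self.deleted:
            self.base.pop(name, None)
        self.deleted.clear()
```

In this model a delete touches only the small deleted set, which
matches the claim that its cost depends on the number of pending
deletions rather than on the size of the base layer.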

Thanks,
-Stolee

* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-12-01 20:18       ` Derrick Stolee
@ 2022-12-02 16:46         ` Han-Wen Nienhuys
  2022-12-02 18:24           ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 56+ messages in thread
From: Han-Wen Nienhuys @ 2022-12-02 16:46 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Derrick Stolee via GitGitGadget, git, jrnieder, John Cai

On Thu, Dec 1, 2022 at 9:19 PM Derrick Stolee <derrickstolee@github.com> wrote:
> >> The reason to start with this step is that the benefits and risks are
> >> clearly understood, which can motivate us to establish the mechanism for
> >> changing the ref format by defining the extension.
> >
> > I believe that the v2 format is a safe change with performance
> > improvements, but it's a backward incompatible format change with only
> > modest payoff. I also don't understand how it will help you do a stack
> > of tables, which you need for your primary goal (i.e.
> > transactions/deletions writing only the delta, rather than rewriting
> > the whole file).
>
> The v2 format doesn't help me on its own, but it has other benefits
> in terms of size and speed, as well as the "ref count" functionality.
>
> The important thing is that the definition of extensions.refFormat
> that I'm proposing in this RFC establishes a way to make incremental
> progress on the ref format, allowing the stacked format to come in
> later with less friction.

I guess you want to move the read/write stack under the loose storage
(packed backend), and introduce a transitional mode (read loose/packed +
write packed only)?

Before you embark on this incremental route, I think it would be best
to think through carefully how an online upgrade would work in detail
(I think it's currently not specified?) If ultimately it's not
feasible to do incrementally, then the added complexity of the
incremental approach will be for naught.

The incremental mode would only be of interest to hosting providers.
It will only be used transitionally. It is inherently going to be
complex, because it has to consider both storage modes at the same
time, and because it is transitional, it will get less real life
testing. At the same time, the ref database is comparatively small, so
the availability blip caused by converting the storage offline is going
to be small. So, the incremental approach is rather expensive for a
comparatively small benefit.

I also thought a bit about how you could make the transition seamless,
but I can't see a good way: you have to coordinate between tables.list
(the list of reftables active or whatever file signals the presence of
a stack) and files under refs/heads/. I don't know how to do
transactions across multiple files without cooperative locking.

If you assume you can use filesystem locks, then you could do
something simpler: if a git repository is marked 'transitional', git
processes take an FS read lock on .git/ .  The process that converts
the storage can take an exclusive (write) lock on .git/, so it knows
nobody will interfere. I think this only works if the repo is on local
disk rather than NFS, though.
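
The locking idea above can be sketched in Python with advisory
flock() locks on the directory itself. This is only an illustration:
repo_lock is a hypothetical helper, Git does not take such locks
today, and the sketch assumes a POSIX system with a local filesystem
where flock() on a directory descriptor is supported.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def repo_lock(git_dir, exclusive=False, blocking=True):
    """Hold a shared (reader) or exclusive (converter) lock on git_dir."""
    fd = os.open(git_dir, os.O_RDONLY)
    try:
        op = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
        if not blocking:
            # Raise BlockingIOError instead of waiting for the lock.
            op |= fcntl.LOCK_NB
        fcntl.flock(fd, op)
        yield fd
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)
```

Ordinary Git processes would wrap their ref access in
repo_lock(git_dir), while the conversion would run inside
repo_lock(git_dir, exclusive=True) and so wait until no reader holds
the directory.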

> * Step 1: replace packed-refs with reftable.
> * Step 2: stop writing loose refs, only update reftable (but still read loose refs).

Does that work? A long running process might not notice the switch in
step 2, so it could still write a ref as loose, while another process
racing might write a different value to the same ref through reftable.
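
To make the race concrete, here is a toy model. It is a sketch under
stated assumptions: the two dictionaries stand in for loose refs and
the reftable, reads are assumed to let a loose ref shadow the table,
and each process is assumed to cache the write mode at startup. The
classes Repo and Process are hypothetical.

```python
class Repo:
    """Toy ref store: a loose layer that shadows a table layer on reads."""

    def __init__(self):
        self.loose = {}
        self.table = {}
        self.write_loose = True  # step 1: loose writes are still allowed

    def read(self, ref):
        # Assumption: loose refs are still read in step 2 and take precedence.
        if ref in self.loose:
            return self.loose[ref]
        return self.table.get(ref)


class Process:
    """A process that reads the write mode once, when it starts."""

    def __init__(self, repo):
        self.repo = repo
        self.write_loose = repo.write_loose  # cached, never refreshed

    def update(self, ref, value):
        target = self.repo.loose if self.write_loose else self.repo.table
        target[ref] = value


repo = Repo()
long_running = Process(repo)   # started before the switch
repo.write_loose = False       # step 2: stop writing loose refs
fresh = Process(repo)          # started after the switch

long_running.update("refs/heads/main", "aaa")  # still lands in the loose layer
fresh.update("refs/heads/main", "bbb")         # lands in the table

# The later write to the table is shadowed by the stale loose ref.
result = repo.read("refs/heads/main")
```

In this model the two layers disagree about the same ref, and the
reader sees the value written by the process that missed the switch.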

PS. I'll be away from work until Jan 9th.
-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--
Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Paul Manicle, Liana Sebastian


* Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
  2022-12-02 16:46         ` Han-Wen Nienhuys
@ 2022-12-02 18:24           ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 56+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-02 18:24 UTC (permalink / raw)
  To: Han-Wen Nienhuys
  Cc: Derrick Stolee, Derrick Stolee via GitGitGadget, git, jrnieder,
	John Cai


On Fri, Dec 02 2022, Han-Wen Nienhuys wrote:

> On Thu, Dec 1, 2022 at 9:19 PM Derrick Stolee <derrickstolee@github.com> wrote:
>> >> The reason to start with this step is that the benefits and risks are
>> >> clearly understood, which can motivate us to establish the mechanism for
>> >> changing the ref format by defining the extension.
>> >
>> > I believe that the v2 format is a safe change with performance
>> > improvements, but it's a backward incompatible format change with only
>> > modest payoff. I also don't understand how it will help you do a stack
>> > of tables,
>> > which you need for your primary goal (ie. transactions/deletions
>> > writing only the delta, rather than rewriting the whole file?).
>>
>> The v2 format doesn't help me on its own, but it has other benefits
>> in terms of size and speed, as well as the "ref count" functionality.
>>
>> The important thing is that the definition of extensions.refFormat
>> that I'm proposing in this RFC establishes a way to make incremental
>> progress on the ref format, allowing the stacked format to come in
>> later with less friction.
>
> I guess you want to move the read/write stack under the loose storage
> (packed backend), and introduce (read loose/packed + write packed
> only) mode that is transitional?
>
> Before you embark on this incremental route, I think it would be best
> to think through carefully how an online upgrade would work in detail
> (I think it's currently not specified?) If ultimately it's not
> feasible to do incrementally, then the added complexity of the
> incremental approach will be for naught.
>
> The incremental mode would only be of interest to hosting providers.
> It will only be used transitionally. It is inherently going to be
> complex, because it has to consider both storage modes at the same
> time, and because it is transitional, it will get less real life
> testing. At the same time, the ref database is comparatively small, so
> the availability blip caused by converting the storage offline is
> going to be small. So, the incremental approach is rather expensive
> for a comparatively small benefit.
>
> I also thought a bit about how you could make the transition seamless,
> but I can't see a good way: you have to coordinate between tables.list
> (the list of reftables active or whatever file signals the presence of
> a stack) and files under refs/heads/. I don't know how to do
> transactions across multiple files without cooperative locking.

A multi-backend transaction would be hard to do at the best of times,
but we'd also presumably run into the issue that not all ref operations
currently use the transaction mechanism (e.g. branch copying/moving). So
if one backend or the other fails there, all bets are off as far as
getting back to a consistent state.

Perhaps a more doable & interesting approach would be to have a "slave"
backend that would follow along, i.e. we'd replay all operations from
"master" to "slave" (as with DB replication, just within a single
repository).

We might get out of sync, but as the "master" is always the source of
truth, presumably we could run a one-off re-export of the refspace to
get back up to date, and hopefully not get out of sync again.

Then once we're ready, we could flip the switch indicating what becomes
the canonical backend.

For reftable the FS layout under .git/* is incompatible, so we'd also
need to support writing to some alternate directory to make such a thing
work...
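
A minimal sketch of that follow-along idea, with plain dictionaries
standing in for the two backends. Everything here is hypothetical
(the class MirroredRefs, the replay_fails flag used to simulate a
failed replay); it only illustrates replay, one-off resync, and the
final switch.

```python
class MirroredRefs:
    """Primary backend is the source of truth; the follower replays its operations."""

    def __init__(self):
        self.primary = {}
        self.follower = {}

    def update(self, ref, value, replay_fails=False):
        self.primary[ref] = value
        if not replay_fails:  # simulate a replay that did not reach the follower
            self.follower[ref] = value

    def delete(self, ref, replay_fails=False):
        self.primary.pop(ref, None)
        if not replay_fails:
            self.follower.pop(ref, None)

    def in_sync(self):
        return self.primary == self.follower

    def resync(self):
        # One-off re-export of the whole refspace from the source of truth.
        self.follower = dict(self.primary)

    def flip(self):
        # Only switch the canonical backend once both sides agree.
        if not self.in_sync():
            raise RuntimeError("follower is out of sync; resync before flipping")
        self.primary, self.follower = self.follower, self.primary


refs = MirroredRefs()
refs.update("refs/heads/main", "aaa")
refs.update("refs/heads/topic", "bbb", replay_fails=True)  # follower misses this one
drifted = not refs.in_sync()
refs.resync()
refs.flip()
```

The sketch ignores the on-disk layout problem mentioned above; a real
follower would have to live in an alternate directory.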


Thread overview: 56+ messages
2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 01/30] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 02/30] read-cache: add index.computeHash config option Derrick Stolee via GitGitGadget
2022-11-11 23:31   ` Elijah Newren
2022-11-14 16:30     ` Derrick Stolee
2022-11-17 16:13   ` Ævar Arnfjörð Bjarmason
2022-11-07 18:35 ` [PATCH 03/30] extensions: add refFormat extension Derrick Stolee via GitGitGadget
2022-11-11 23:39   ` Elijah Newren
2022-11-16 14:37     ` Derrick Stolee
2022-11-07 18:35 ` [PATCH 04/30] config: fix multi-level bulleted list Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 05/30] repository: wire ref extensions to ref backends Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 06/30] refs: allow loose files without packed-refs Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 07/30] chunk-format: number of chunks is optional Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 08/30] chunk-format: document trailing table of contents Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 09/30] chunk-format: store chunk offset during write Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 10/30] chunk-format: allow trailing table of contents Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 11/30] chunk-format: parse " Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 12/30] refs: extract packfile format to new file Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 13/30] packed-backend: extract add_write_error() Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 14/30] packed-backend: extract iterator/updates merge Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 15/30] packed-backend: create abstraction for writing refs Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 16/30] config: add config values for packed-refs v2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 17/30] packed-backend: create shell of v2 writes Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 18/30] packed-refs: write file format version 2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 19/30] packed-refs: read file format v2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 20/30] packed-refs: read optional prefix chunks Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 21/30] packed-refs: write " Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 22/30] packed-backend: create GIT_TEST_PACKED_REFS_VERSION Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 23/30] t1409: test with packed-refs v2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 24/30] t5312: allow packed-refs v2 format Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 25/30] t5502: add PACKED_REFS_V1 prerequisite Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 26/30] t3210: require packed-refs v1 for some tests Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 27/30] t*: skip packed-refs v2 over http tests Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 28/30] ci: run GIT_TEST_PACKED_REFS_VERSION=2 in some builds Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 29/30] p1401: create performance test for ref operations Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 30/30] refs: skip hashing when writing packed-refs v2 Derrick Stolee via GitGitGadget
2022-11-09 15:15 ` [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee
2022-11-11 23:28 ` Elijah Newren
2022-11-14  0:07   ` Derrick Stolee
2022-11-15  2:47     ` Elijah Newren
2022-11-16 14:45       ` Derrick Stolee
2022-11-17  4:28         ` Elijah Newren
2022-11-18 23:31     ` Junio C Hamano
2022-11-19  0:41       ` Elijah Newren
2022-11-19  3:00         ` Taylor Blau
2022-11-30 15:31       ` Derrick Stolee
2022-11-28 18:56 ` Han-Wen Nienhuys
2022-11-30 15:16   ` Derrick Stolee
2022-11-30 15:38     ` Phillip Wood
2022-11-30 16:37     ` Taylor Blau
2022-11-30 18:30     ` Han-Wen Nienhuys
2022-11-30 18:37       ` Sean Allred
2022-12-01 20:18       ` Derrick Stolee
2022-12-02 16:46         ` Han-Wen Nienhuys
2022-12-02 18:24           ` Ævar Arnfjörð Bjarmason
2022-11-30 22:55     ` Junio C Hamano