git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Han-Wen Nienhuys <hanwen@google.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, jrnieder@gmail.com,
	Derrick Stolee <derrickstolee@github.com>,
	John Cai <johncai86@gmail.com>
Subject: Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
Date: Mon, 28 Nov 2022 19:56:55 +0100	[thread overview]
Message-ID: <CAFQ2z_MZd150kQNTcxaDRVvALpZcCUbRj_81pt-VBY8DRaoRNw@mail.gmail.com> (raw)
In-Reply-To: <pull.1408.git.1667846164.gitgitgadget@gmail.com>

On Mon, Nov 7, 2022 at 7:36 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
>
> Introduction
> ============
>
> I became interested in our packed-ref format based on the asymmetry between
> ref updates and ref deletions: if we delete a packed ref, then the
> packed-refs file needs to be rewritten. Compared to writing a loose ref,
> this is an O(N) cost instead of O(1).
>
> In this way, I set out with some goals:
>
>  * (Primary) Make packed ref deletions be nearly as fast as loose ref
>    updates.
>  * (Secondary) Allow using a packed ref format for all refs, dropping loose
>    refs and creating a clear way to snapshot all refs at a given point in
>    time.
>
> I also had one major non-goal to keep things focused:
>
>  * (Non-goal) Update the reflog format.
>
> After carefully considering several options, it seemed that there are two
> solutions that can solve this effectively:
>
>  1. Wait for reftable to be integrated into Git.
>  2. Update the packed-refs backend to have a stacked version.
>
> The reftable work seems currently dormant. The format is pretty complicated
> and I have a difficult time seeing a way forward for it to be fully
> integrated into Git.

The format is somewhat complicated, and I think it would have been
possible to design a block-oriented sorted-table approach that is
simpler, but the JGit implementation has set it in stone. But, to put
this in perspective, the amount of work for getting the format to
read/write correctly has been completely dwarfed by the effort needed
to make the refs API in git represent a true abstraction boundary.
Also, if you're introducing a new format, one might as well try to
optimize it a bit.

Here are some of the hard problems that I encountered

* Worktrees and the main repository have a separate view of the ref
namespace. This is not explicit in the ref backend API, and there is a
technical limitation that the packed-refs file cannot be in a
worktree. This means that worktrees will always continue to use
loose-ref storage if you only extend the packed-refs backend.

* Symrefs are refs too, but for some reason the packed-refs file
doesn't support them. Does packed-refs v2 support symrefs too?  If you
want to snapshot the state of refs, do you want to snapshot the value
of HEAD too?

* By not changing reflogs, you are making things simpler. (if a
transaction updates the branch that HEAD points to, the reflog for
HEAD has to be updated too. Because reftable updates the reflog
transactionally, this was some extra work)
Then again, I feel the current way that reflogs work are a bit messy,
because directory/file conflicts force reflogs to be deleted at times
that don't make sense from a user-perspective.

* There are a lot of commands that store SHA1s in files under .git/,
and access them as if they are a ref (for example: rebase-apply/ ,
CHERRY_PICK_HEAD etc.).

> In this RFC, I propose a different model that allows for more customization
> and incremental updates. The extensions.refFormat config key is multi-valued
> and defaults to the list of files and packed. In the context of this RFC,
> the intention is to be able to add packed-v2 so the list of all three values
> would allow Git to write and read either file format version (v1 or v2). In
> the larger scheme, the extension could allow restricting to only loose refs
> (just files) or only packed-refs (just packed) or even later when reftable
> is complete, files and reftable could mean that loose refs are the primary
> ref storage, but the reftable format serves as a drop-in replacement for the
> packed-refs file. Not all combinations need to be understood by Git, but

I'm not sure how feasible this is. reftable also holds reflog data. A
setting {files,reftable} would either not work, or necessitate hairy
merging of data to get the reflogs working correctly.

> In order to optimize the write speed of the packed-refs v2 file format, we
> want to write immediately to the file as we stream existing refs from the
> current refs. The current chunk-format API requires computing the chunk
> lengths in advance, which can slow down the write and take more memory than

yes, this sounds sensible. reftable has the secondary indexes trailing the data.

> Between using raw OIDs and storing the depth-2 prefixes only once, this
> format compresses the file to ~60% of its v1 size. (The format allows not
> writing the prefix chunks, and the prefix chunks are implemented after the
> basics of the ref chunks are complete.)
>
> The write times are reduced in a similar fraction to the size difference.
> Reads are sped up somewhat, and we have the potential to do a ref count by

Do you mean 'enumerate refs' ? Why would you want to count refs by prefix?

> I mentioned earlier that I had considered using reftable as a way to achieve
> the stated goals. With the current state of that work, I'm not confident
> that it is the right approach here.
>
> My main worry is that the reftable is more complicated than we need for a
> typical Git repository that is based on a typical filesystem. This makes
> testing the format very critical, and we seem to not be near reaching that
> approach.

I think the base code of reading and writing the reftable format is
exercised quite exhaustively tested in unit tests. You say 'seem', but
do you have anything concrete to say?

> As mentioned, the current extension plan [6] only allows reftable or files
> and does not allow for a mix of both. This RFC introduces the possibility
> that both could co-exist. Using that multi-valued approach means that I'm
> able to test the v2 packed-refs file format almost as well as the v1 file
> format within this RFC. (More tests need to be added that are specific to
> this format, but I'm waiting for confirmation that this is an acceptable
> direction.) At the very least, this multi-valued approach could be used as a
> way to allow using the reftable format as a drop-in replacement for the
> packed-refs file, as well as upgrading an existing repo to use reftable.

The multi-value approach creates more combinations of code of how
different pieces of code can interact, so I think it actually makes it
more error-prone.
Also,

> That might even help the integration process to allow the reftable format to
> be tested at least by some subset of tests instead of waiting for a full
> test suite update.

I don't understand this comment. In the current state,
https://github.com/git/git/pull/1215 already passes 922 of the 968
test files if you set GIT_TEST_REFTABLE=1.

See https://github.com/git/git/pull/1215#issuecomment-1329579459 for
details. As you can see, for most test files, it's just a few
individual test cases that fail.

> I'm interested to hear from people more involved in the reftable work to see
> the status of that project and how it matches or differs from my
> perspective.

Overall, I found that the loose/packed ref code hard to understand and
full of arbitrary limitations (dir/file conflicts, deleting reflogs
when branches are deleted, locking across loose/packed refs etc.).
The way reftable stacks are setup (with both reflog and ref data
including symrefs in the same file) make it much easier to verify that
it behaves transactionally.

For deleting refs quickly, it seems that you only need to support
$ZEROID in packed-refs and then implement a ref database as a stack of
packed-ref files? If you're going for minimal effort and minimal
disruption wouldn't that be the place to start?

You're concerned about the reftable file format (and maybe rightly
so), but if you're changing the file format anyway and you're not
picking reftable, why not create a block-based, indexed format that
can support storing reflog entries at some point in the future too,
rather than build on (the limitations) of packed-refs? Or is
packed-refs v2 backward compatible with v1 (could an old git client
read v2 files? I think not, right?).

The reftable project has gotten into a slump because my work
responsibilities have increased over the last 1.5 year squeezing down
how much time I have for 'fun' projects. I chatted with John Cai, who
was trying to staff this project out of Gitlab resources. I don't know
where that stands, though.

> The one thing I can say is that if the reftable work had not already begun,
> then this is RFC is how I would have approached a new ref format.
>
> I look forward to your feedback!

Hope this helps.


-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--

Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich

Registergericht und -nummer: Hamburg, HRB 86891

Sitz der Gesellschaft: Hamburg

Geschäftsführer: Paul Manicle, Liana Sebastian

  parent reply	other threads:[~2022-11-28 18:57 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 01/30] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 02/30] read-cache: add index.computeHash config option Derrick Stolee via GitGitGadget
2022-11-11 23:31   ` Elijah Newren
2022-11-14 16:30     ` Derrick Stolee
2022-11-17 16:13   ` Ævar Arnfjörð Bjarmason
2022-11-07 18:35 ` [PATCH 03/30] extensions: add refFormat extension Derrick Stolee via GitGitGadget
2022-11-11 23:39   ` Elijah Newren
2022-11-16 14:37     ` Derrick Stolee
2022-11-07 18:35 ` [PATCH 04/30] config: fix multi-level bulleted list Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 05/30] repository: wire ref extensions to ref backends Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 06/30] refs: allow loose files without packed-refs Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 07/30] chunk-format: number of chunks is optional Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 08/30] chunk-format: document trailing table of contents Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 09/30] chunk-format: store chunk offset during write Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 10/30] chunk-format: allow trailing table of contents Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 11/30] chunk-format: parse " Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 12/30] refs: extract packfile format to new file Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 13/30] packed-backend: extract add_write_error() Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 14/30] packed-backend: extract iterator/updates merge Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 15/30] packed-backend: create abstraction for writing refs Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 16/30] config: add config values for packed-refs v2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 17/30] packed-backend: create shell of v2 writes Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 18/30] packed-refs: write file format version 2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 19/30] packed-refs: read file format v2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 20/30] packed-refs: read optional prefix chunks Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 21/30] packed-refs: write " Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 22/30] packed-backend: create GIT_TEST_PACKED_REFS_VERSION Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 23/30] t1409: test with packed-refs v2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 24/30] t5312: allow packed-refs v2 format Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 25/30] t5502: add PACKED_REFS_V1 prerequisite Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 26/30] t3210: require packed-refs v1 for some tests Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 27/30] t*: skip packed-refs v2 over http tests Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 28/30] ci: run GIT_TEST_PACKED_REFS_VERSION=2 in some builds Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 29/30] p1401: create performance test for ref operations Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 30/30] refs: skip hashing when writing packed-refs v2 Derrick Stolee via GitGitGadget
2022-11-09 15:15 ` [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee
2022-11-11 23:28 ` Elijah Newren
2022-11-14  0:07   ` Derrick Stolee
2022-11-15  2:47     ` Elijah Newren
2022-11-16 14:45       ` Derrick Stolee
2022-11-17  4:28         ` Elijah Newren
2022-11-18 23:31     ` Junio C Hamano
2022-11-19  0:41       ` Elijah Newren
2022-11-19  3:00         ` Taylor Blau
2022-11-30 15:31       ` Derrick Stolee
2022-11-28 18:56 ` Han-Wen Nienhuys [this message]
2022-11-30 15:16   ` Derrick Stolee
2022-11-30 15:38     ` Phillip Wood
2022-11-30 16:37     ` Taylor Blau
2022-11-30 18:30     ` Han-Wen Nienhuys
2022-11-30 18:37       ` Sean Allred
2022-12-01 20:18       ` Derrick Stolee
2022-12-02 16:46         ` Han-Wen Nienhuys
2022-12-02 18:24           ` Ævar Arnfjörð Bjarmason
2022-11-30 22:55     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAFQ2z_MZd150kQNTcxaDRVvALpZcCUbRj_81pt-VBY8DRaoRNw@mail.gmail.com \
    --to=hanwen@google.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=johncai86@gmail.com \
    --cc=jrnieder@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).