git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <derrickstolee@github.com>
To: Han-Wen Nienhuys <hanwen@google.com>
Cc: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, jrnieder@gmail.com,
	John Cai <johncai86@gmail.com>
Subject: Re: [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format
Date: Thu, 1 Dec 2022 15:18:57 -0500	[thread overview]
Message-ID: <f5370fec-d517-eaa9-8e16-82fa20ac8532@github.com> (raw)
In-Reply-To: <CAFQ2z_MLwUoaSTG04LJYHgJH-QYJEuZ9bQcTsV8mXwxBbz7Egg@mail.gmail.com>

On 11/30/2022 1:30 PM, Han-Wen Nienhuys wrote:
> On Wed, Nov 30, 2022 at 4:16 PM Derrick Stolee <derrickstolee@github.com> wrote:
>> (Note: there is a strategy that doesn't need this approach, but it's a bit
>> complicated. It would involve rotating all replicas to new repositories
>> that are configured to use reftable upon creation, getting the refs from
>> other replicas via fetches. In my opinion, this is prohibitively
>> expensive.)
> 
> I'm not sure I understand the problem. Any deletion of a ref (that is
> in packed-refs) today already requires rewriting the entire
> packed-refs file ("all or nothing" operation). Whether you write a
> packed-refs or reftable is roughly equally expensive.
> 
> Are you looking for a way to upgrade a repo, while concurrent git
> process may write updates into the repository during the update? That
> may be hard to pull off, because you probably need to rename more than
> one file atomically. If you accept momentarily failed writes, you
> could do
> 
> * rename refs/ to refs.old/ (loose ref writes will fail now)
> * collect loose refs under refs.old/ , put into packed-refs
> * populate the reftable/ dir
> * set refFormat extension.
> * rename refs.old/ to refs/ with a refs/heads a file (as described in
> the reftable spec.)
>
> See also https://gerrit.googlesource.com/jgit/+/ca166a0c62af2ea87fdedf2728ac19cb59a12601/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/FileRepository.java#734

Yes, I would ideally like for the repository to "upgrade" its ref
storage mechanism during routine maintenance in a non-blocking way
while other writes and reads continue as normal.

After discussing it a bit internally, we _could_ avoid the "rotate
the replicas" solution if there was a "git upgrade-ref-format"
command that could switch from one to another, but it would still
involve pulling that replica out of the rotation and then having
it catch up to the other replicas after that is complete. If I'm
reading your draft correctly, that is not currently available in
your work, but we could add it after the fact.

Requiring pulling replicas out of rotation is still a bit heavy-
handed for my liking, but it's much less expensive than moving
all of the Git data.

>> The reason to start with this step is that the benefits and risks are
>> clearly understood, which can motivate us to establish the mechanism for
>> changing the ref format by defining the extension.
> 
> I believe that the v2 format is a safe change with performance
> improvements, but it's a backward incompatible format change with only
> modest payoff. I also don't understand how it will help you do a stack
> of tables,
> which you need for your primary goal (ie. transactions/deletions
> writing only the delta, rather than rewriting the whole file?).

The v2 format doesn't help me on its own, but it has other benefits
in terms of size and speed, as well as the "ref count" functionality.

The important thing is that the definition of extensions.refFormat
that I'm proposing in this RFC establishes a way to make incremental
progress on the ref format, allowing the stacked format to come in
later with less friction.
 
>> * The reftable is currently fundamentally different enough that it could
>>   not be used as a replacement for the packed-refs file underneath loose
>>   refs (primarily due to its integration with the reflog). Doing so would
>>   require significant work on top of your prototype.
> 
> It could, but I don't see the point.

My point is that we can upgrade repositories by replacing packed-refs
with reftable during routine maintenance instead of the heavier
approaches discussed earlier.

* Step 1: replace packed-refs with reftable.
* Step 2: stop writing loose refs, only update reftable (but still read loose refs).
* Step 3: collapse all loose refs into reftable, stop reading or writing loose refs.
 
>> I'm going to take the following actions on my end to better understand the
>> situation:
>>
>> 1. I'll take your draft PR branch and do some performance evaluations on
>>    the speed of ref updates compared to loose refs and my prototype of a
>>    two-stack packed-ref where the second layer of the stack is only for
>>    deleted refs.
> 
> (tangent) - wouldn't that design perform poorly once the number of
> deletions gets large? You'd basically have to rewrite the
> deleted-packed-refs file all the time.
 
We have regular maintenance that is triggered by pushes that rewrites
the packed-refs file frequently, anyway. The maintenance currently is
blocked on the amount of time spent repacking object data, so a large
number of ref updates can come in during this process. (That maintenance
step would collapse the deleted-refs layer into the base layer.)

I've tested a simple version of this stack that shows that rewriting the
file with 1,000 deletions is still within 2x the cost of updating a loose
ref, so it solves the immediate problem using a much simpler stack model,
at least in the most-common case where ref deletions are less frequent
than other updates. Even if the size outgrew the 2x cost limit, the
deleted file is still going to be much smaller than the base packed-refs
file, which is currently rewritten for every deletion, so it is still an
improvement.

The more complicated stack model would be required to funnel all ref
updates into that structure and away from loose refs.

Thanks,
-Stolee

  parent reply	other threads:[~2022-12-01 20:19 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-07 18:35 [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 01/30] hashfile: allow skipping the hash function Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 02/30] read-cache: add index.computeHash config option Derrick Stolee via GitGitGadget
2022-11-11 23:31   ` Elijah Newren
2022-11-14 16:30     ` Derrick Stolee
2022-11-17 16:13   ` Ævar Arnfjörð Bjarmason
2022-11-07 18:35 ` [PATCH 03/30] extensions: add refFormat extension Derrick Stolee via GitGitGadget
2022-11-11 23:39   ` Elijah Newren
2022-11-16 14:37     ` Derrick Stolee
2022-11-07 18:35 ` [PATCH 04/30] config: fix multi-level bulleted list Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 05/30] repository: wire ref extensions to ref backends Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 06/30] refs: allow loose files without packed-refs Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 07/30] chunk-format: number of chunks is optional Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 08/30] chunk-format: document trailing table of contents Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 09/30] chunk-format: store chunk offset during write Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 10/30] chunk-format: allow trailing table of contents Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 11/30] chunk-format: parse " Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 12/30] refs: extract packfile format to new file Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 13/30] packed-backend: extract add_write_error() Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 14/30] packed-backend: extract iterator/updates merge Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 15/30] packed-backend: create abstraction for writing refs Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 16/30] config: add config values for packed-refs v2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 17/30] packed-backend: create shell of v2 writes Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 18/30] packed-refs: write file format version 2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 19/30] packed-refs: read file format v2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 20/30] packed-refs: read optional prefix chunks Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 21/30] packed-refs: write " Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 22/30] packed-backend: create GIT_TEST_PACKED_REFS_VERSION Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 23/30] t1409: test with packed-refs v2 Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 24/30] t5312: allow packed-refs v2 format Derrick Stolee via GitGitGadget
2022-11-07 18:35 ` [PATCH 25/30] t5502: add PACKED_REFS_V1 prerequisite Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 26/30] t3210: require packed-refs v1 for some tests Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 27/30] t*: skip packed-refs v2 over http tests Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 28/30] ci: run GIT_TEST_PACKED_REFS_VERSION=2 in some builds Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 29/30] p1401: create performance test for ref operations Derrick Stolee via GitGitGadget
2022-11-07 18:36 ` [PATCH 30/30] refs: skip hashing when writing packed-refs v2 Derrick Stolee via GitGitGadget
2022-11-09 15:15 ` [PATCH 00/30] [RFC] extensions.refFormat and packed-refs v2 file format Derrick Stolee
2022-11-11 23:28 ` Elijah Newren
2022-11-14  0:07   ` Derrick Stolee
2022-11-15  2:47     ` Elijah Newren
2022-11-16 14:45       ` Derrick Stolee
2022-11-17  4:28         ` Elijah Newren
2022-11-18 23:31     ` Junio C Hamano
2022-11-19  0:41       ` Elijah Newren
2022-11-19  3:00         ` Taylor Blau
2022-11-30 15:31       ` Derrick Stolee
2022-11-28 18:56 ` Han-Wen Nienhuys
2022-11-30 15:16   ` Derrick Stolee
2022-11-30 15:38     ` Phillip Wood
2022-11-30 16:37     ` Taylor Blau
2022-11-30 18:30     ` Han-Wen Nienhuys
2022-11-30 18:37       ` Sean Allred
2022-12-01 20:18       ` Derrick Stolee [this message]
2022-12-02 16:46         ` Han-Wen Nienhuys
2022-12-02 18:24           ` Ævar Arnfjörð Bjarmason
2022-11-30 22:55     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f5370fec-d517-eaa9-8e16-82fa20ac8532@github.com \
    --to=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=hanwen@google.com \
    --cc=johncai86@gmail.com \
    --cc=jrnieder@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).