From: Derrick Stolee <stolee@gmail.com>
To: Junio C Hamano <gitster@pobox.com>,
Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, Johannes.Schindelin@gmx.de,
sandals@crustytoothpaste.net, steadmon@google.com,
jrnieder@gmail.com, peff@peff.net, congdanhqx@gmail.com,
phillip.wood123@gmail.com, emilyshaffer@google.com,
sluongng@gmail.com, jonathantanmy@google.com,
Derrick Stolee <derrickstolee@github.com>,
Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH v2 10/18] maintenance: add incremental-repack task
Date: Fri, 24 Jul 2020 11:03:26 -0400 [thread overview]
Message-ID: <54bfc8a9-e3bf-4199-9c58-8ed6ad50fb36@gmail.com> (raw)
In-Reply-To: <xmqqeep1q4d0.fsf@gitster.c.googlers.com>
On 7/23/2020 6:00 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> 1. 'git multi-pack-index write' creates a multi-pack-index file if
>> one did not exist, and otherwise will update the multi-pack-index
>> with any new pack-files that appeared since the last write. This
>> is particularly relevant with the background fetch job.
>>
>> When the multi-pack-index sees two copies of the same object, it
>> stores the offset data into the newer pack-file. This means that
>> some old pack-files could become "unreferenced" which I will use
>> to mean "a pack-file that is in the pack-file list of the
>> multi-pack-index but none of the objects in the multi-pack-index
>> reference a location inside that pack-file."
>
> An obvious alternative is to favor the copy in the older pack,
> right? Is the expectation that over time, most of the objects that
> are relevant would reappear in newer packs, so that eventually by
> favoring the copies in the newer packs, we can retire and remove the
> old pack, keeping only the newer ones?
>
> But would that assumption hold? The old packs hold objects that are
> necessary for the older parts of the history, so unless you are
> cauterizing away the old history, these objects in the older packs
> are likely to stay with us longer than those used by the newer parts
> of the history, some of which may not even have been pushed out yet
> and can be rebased away?
If we created a new pack-file containing an already-packed object,
then shouldn't we assume that the new pack-file does a _better_ job
of compressing that object? Or at least, doesn't make it worse?
For example, if we use 'git multi-pack-index repack --batch-size=0',
then this creates a new pack-file containing every previously-packed
object. This new pack-file should have better delta compression than
the previous setup across multiple pack-files. We want this new
pack-file to be used, not the old ones.
This "pick the newest pack" strategy is also what allows us to safely
use the 'expire' option to drop old pack-files. If we always keep the
old copies, then when we try to switch the new pack-files, we cannot
delete the old packs safely because a concurrent Git process could
be trying to reference it. One race is as follows:
* Process A opens the multi-pack-index referring to old pack P. It
doesn't open the pack-file as it hasn't tried to parse objects yet.
* Process B is 'git multi-pack-index expire'. It sees that pack P can
be dropped because all objects appear in newer pack-files. It deletes
P.
* Process A tries to read from P, and this fails. A must reprepare its
representation of the pack-files.
This is the less disruptive race since A can recover with a small cost
to its performance. The worse race (on Windows) is this:
* Process A loads the multi-pack-index and tries to parse an object by
loading "old" pack P.
* Process B tries to delete P. However, on Windows the handle to P by
A prevents the deletion.
At this point, there could be two resolutions. The first is to have
the 'expire' fail because we can't delete A. This means we might never
delete A in a busy repository. The second is that the 'expire' command
continues and drops A from the list in the multi-pack-index. However,
now all Git processes add A to the packed_git list because it isn't
referenced by the multi-pack-index.
In summary: the decision to pick the newer copies is a fundamental
part of how the write->expire->repack loop was designed.
>> 2. 'git multi-pack-index expire' deletes any unreferenced pack-files
>> and updaes the multi-pack-index to drop those pack-files from the
>> list. This is safe to do as concurrent Git processes will see the
>> multi-pack-index and not open those packs when looking for object
>> contents. (Similar to the 'loose-objects' job, there are some Git
>> commands that open pack-files regardless of the multi-pack-index,
>> but they are rarely used. Further, a user that self-selects to
>> use background operations would likely refrain from using those
>> commands.)
>
> OK.
>
>> 3. 'git multi-pack-index repack --bacth-size=<size>' collects a set
>> of pack-files that are listed in the multi-pack-index and creates
>> a new pack-file containing the objects whose offsets are listed
>> by the multi-pack-index to be in those objects. The set of pack-
>> files is selected greedily by sorting the pack-files by modified
>> time and adding a pack-file to the set if its "expected size" is
>> smaller than the batch size until the total expected size of the
>> selected pack-files is at least the batch size. The "expected
>> size" is calculated by taking the size of the pack-file divided
>> by the number of objects in the pack-file and multiplied by the
>> number of objects from the multi-pack-index with offset in that
>> pack-file. The expected size approximats how much data from that
>
> approximates.
>
>> pack-file will contribute to the resulting pack-file size. The
>> intention is that the resulting pack-file will be close in size
>> to the provided batch size.
>
>> +static int maintenance_task_incremental_repack(void)
>> +{
>> + if (multi_pack_index_write()) {
>> + error(_("failed to write multi-pack-index"));
>> + return 1;
>> + }
>> +
>> + if (multi_pack_index_verify()) {
>> + warning(_("multi-pack-index verify failed after initial write"));
>> + return rewrite_multi_pack_index();
>> + }
>> +
>> + if (multi_pack_index_expire()) {
>> + error(_("multi-pack-index expire failed"));
>> + return 1;
>> + }
>> +
>> + if (multi_pack_index_verify()) {
>> + warning(_("multi-pack-index verify failed after expire"));
>> + return rewrite_multi_pack_index();
>> + }
>> + if (multi_pack_index_repack()) {
>> + error(_("multi-pack-index repack failed"));
>> + return 1;
>> + }
>
> Hmph, I wonder if these warning should come from each helper
> functions that are static to this function anyway.
>
> It also makes it easier to reason about this function by eliminating
> the need for having a different pattern only for the verify helper.
> Instead, verify could call rewrite internally when it notices a
> breakage. I.e.
>
> if (multi_pack_index_write())
> return 1;
> if (multi_pack_index_verify("after initial write"))
> return 1;
> if (multi_pack_index_exire())
> return 1;
> ...
This is a cleaner model. I'll work on that.
> Also, it feels odd, compared to our internal API convention, that
> positive non-zero is used as an error here.
...
> Exactly the same comment as 08/18 about natural/inherent ordering
> applies here as well.
I'll leave these to be resolved in the earlier messages.
Thanks,
-Stolee
next prev parent reply other threads:[~2020-07-24 15:03 UTC|newest]
Thread overview: 164+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 01/21] gc: use the_repository less often Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 02/21] gc: use repository in too_many_loose_objects() Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 03/21] gc: use repo config Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 04/21] gc: drop the_repository in log location Derrick Stolee via GitGitGadget
2020-07-09 2:22 ` Jonathan Tan
2020-07-09 11:13 ` Derrick Stolee
2020-07-07 14:21 ` [PATCH 05/21] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 06/21] maintenance: add --quiet option Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 07/21] maintenance: replace run_auto_gc() Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 08/21] maintenance: initialize task array and hashmap Derrick Stolee via GitGitGadget
2020-07-09 2:25 ` Jonathan Tan
2020-07-09 13:15 ` Derrick Stolee
2020-07-09 13:51 ` Junio C Hamano
2020-07-07 14:21 ` [PATCH 09/21] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
2020-07-09 2:29 ` Jonathan Tan
2020-07-09 11:14 ` Derrick Stolee
2020-07-09 22:52 ` Jeff King
2020-07-09 23:41 ` Derrick Stolee
2020-07-07 14:21 ` [PATCH 10/21] maintenance: add --task option Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 11/21] maintenance: take a lock on the objects directory Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 12/21] maintenance: add fetch task Derrick Stolee via GitGitGadget
2020-07-09 2:35 ` Jonathan Tan
2020-07-07 14:21 ` [PATCH 13/21] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 14/21] maintenance: add pack-files task Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 15/21] maintenance: auto-size pack-files batch Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 16/21] maintenance: create maintenance.<task>.enabled config Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 17/21] maintenance: use pointers to check --auto Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 18/21] maintenance: add auto condition for commit-graph task Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 19/21] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 20/21] maintenance: add pack-files auto condition Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 21/21] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-07-08 23:57 ` [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Emily Shaffer
2020-07-09 11:21 ` Derrick Stolee
2020-07-09 12:43 ` Derrick Stolee
2020-07-09 23:16 ` Jeff King
2020-07-09 23:45 ` Derrick Stolee
2020-07-10 18:46 ` Emily Shaffer
2020-07-10 19:30 ` Son Luong Ngoc
2020-07-09 14:05 ` Junio C Hamano
2020-07-09 15:54 ` Derrick Stolee
2020-07-09 16:26 ` Junio C Hamano
2020-07-09 16:56 ` Derrick Stolee
2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 01/18] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
2020-07-25 1:26 ` Taylor Blau
2020-07-25 1:47 ` Đoàn Trần Công Danh
2020-07-29 22:19 ` Jonathan Nieder
2020-07-30 13:12 ` Derrick Stolee
2020-07-31 0:30 ` Jonathan Nieder
2020-08-03 17:37 ` Derrick Stolee
2020-08-03 17:46 ` Jonathan Nieder
2020-08-03 22:46 ` Taylor Blau
2020-08-03 23:01 ` Jonathan Nieder
2020-08-03 23:08 ` Taylor Blau
2020-08-03 23:17 ` Jonathan Nieder
2020-08-04 0:07 ` Junio C Hamano
2020-08-04 13:32 ` Derrick Stolee
2020-08-04 14:42 ` Jonathan Nieder
2020-08-04 16:32 ` Derrick Stolee
2020-08-04 17:02 ` Jonathan Nieder
2020-08-04 17:51 ` Derrick Stolee
2020-08-05 15:02 ` Derrick Stolee
2020-07-31 16:40 ` Jonathan Nieder
2020-08-03 23:52 ` Jonathan Nieder
2020-07-23 17:56 ` [PATCH v2 02/18] maintenance: add --quiet option Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 03/18] maintenance: replace run_auto_gc() Derrick Stolee via GitGitGadget
2020-07-23 20:21 ` Junio C Hamano
2020-07-25 1:33 ` Taylor Blau
2020-07-30 13:29 ` Derrick Stolee
2020-07-30 13:31 ` Derrick Stolee
2020-07-30 19:00 ` Eric Sunshine
2020-07-30 20:21 ` Derrick Stolee
2020-07-23 17:56 ` [PATCH v2 04/18] maintenance: initialize task array Derrick Stolee via GitGitGadget
2020-07-23 19:57 ` Junio C Hamano
2020-07-24 12:23 ` Derrick Stolee
2020-07-24 12:51 ` Derrick Stolee
2020-07-24 19:39 ` Junio C Hamano
2020-07-25 1:46 ` Taylor Blau
2020-07-29 22:19 ` Emily Shaffer
2020-07-23 17:56 ` [PATCH v2 05/18] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
2020-07-23 20:22 ` Junio C Hamano
2020-07-24 13:09 ` Derrick Stolee
2020-07-24 19:47 ` Junio C Hamano
2020-07-25 1:52 ` Taylor Blau
2020-07-30 13:59 ` Derrick Stolee
2020-07-29 0:22 ` Jeff King
2020-07-23 17:56 ` [PATCH v2 06/18] maintenance: add --task option Derrick Stolee via GitGitGadget
2020-07-23 20:21 ` Junio C Hamano
2020-07-23 22:18 ` Junio C Hamano
2020-07-24 13:36 ` Derrick Stolee
2020-07-24 19:50 ` Junio C Hamano
2020-07-23 17:56 ` [PATCH v2 07/18] maintenance: take a lock on the objects directory Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 08/18] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-07-23 20:53 ` Junio C Hamano
2020-07-24 14:25 ` Derrick Stolee
2020-07-24 20:47 ` Junio C Hamano
2020-07-25 1:37 ` Đoàn Trần Công Danh
2020-07-25 1:48 ` Junio C Hamano
2020-07-27 14:07 ` Derrick Stolee
2020-07-27 16:13 ` Junio C Hamano
2020-07-27 18:27 ` Derrick Stolee
2020-07-28 16:37 ` [PATCH v2] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano
2020-07-29 9:12 ` Phillip Wood
2020-07-29 9:17 ` Phillip Wood
2020-07-30 15:17 ` Derrick Stolee
2020-07-23 17:56 ` [PATCH v2 09/18] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-07-23 20:59 ` Junio C Hamano
2020-07-24 14:50 ` Derrick Stolee
2020-07-24 19:57 ` Junio C Hamano
2020-07-29 22:21 ` Emily Shaffer
2020-07-30 15:38 ` Derrick Stolee
2020-07-23 17:56 ` [PATCH v2 10/18] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
2020-07-23 22:00 ` Junio C Hamano
2020-07-24 15:03 ` Derrick Stolee [this message]
2020-07-29 22:22 ` Emily Shaffer
2020-07-23 17:56 ` [PATCH v2 11/18] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-07-23 22:15 ` Junio C Hamano
2020-07-23 23:09 ` Eric Sunshine
2020-07-23 23:24 ` Junio C Hamano
2020-07-24 16:09 ` Derrick Stolee
2020-07-24 19:51 ` Derrick Stolee
2020-07-24 20:17 ` Junio C Hamano
2020-07-29 22:23 ` Emily Shaffer
2020-07-30 16:57 ` Derrick Stolee
2020-07-30 19:02 ` Derrick Stolee
2020-07-30 19:24 ` Chris Torek
2020-08-05 12:37 ` Đoàn Trần Công Danh
2020-08-06 13:54 ` Derrick Stolee
2020-07-23 17:56 ` [PATCH v2 12/18] maintenance: create maintenance.<task>.enabled config Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 13/18] maintenance: use pointers to check --auto Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 14/18] maintenance: add auto condition for commit-graph task Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 15/18] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 16/18] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 17/18] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 18/18] maintenance: add trace2 regions for task execution Derrick Stolee via GitGitGadget
2020-07-29 22:03 ` [PATCH v2 00/18] Maintenance builtin, allowing 'gc --auto' customization Emily Shaffer
2020-07-30 22:24 ` [PATCH v3 00/20] " Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 01/20] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 02/20] maintenance: add --quiet option Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 03/20] maintenance: replace run_auto_gc() Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 04/20] maintenance: initialize task array Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 05/20] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 06/20] maintenance: add --task option Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 07/20] maintenance: take a lock on the objects directory Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 08/20] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 09/20] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 10/20] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 11/20] midx: enable core.multiPackIndex by default Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 12/20] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 13/20] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-07-30 23:36 ` Chris Torek
2020-08-03 17:43 ` Derrick Stolee
2020-07-30 22:24 ` [PATCH v3 14/20] maintenance: create maintenance.<task>.enabled config Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 15/20] maintenance: use pointers to check --auto Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 16/20] maintenance: add auto condition for commit-graph task Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 17/20] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 18/20] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 19/20] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 20/20] maintenance: add trace2 regions for task execution Derrick Stolee via GitGitGadget
2020-07-30 23:06 ` [PATCH v3 00/20] Maintenance builtin, allowing 'gc --auto' customization Junio C Hamano
2020-07-30 23:31 ` Junio C Hamano
2020-07-31 2:58 ` Junio C Hamano
2020-08-06 17:58 ` Derrick Stolee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=54bfc8a9-e3bf-4199-9c58-8ed6ad50fb36@gmail.com \
--to=stolee@gmail.com \
--cc=Johannes.Schindelin@gmx.de \
--cc=congdanhqx@gmail.com \
--cc=derrickstolee@github.com \
--cc=dstolee@microsoft.com \
--cc=emilyshaffer@google.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=gitster@pobox.com \
--cc=jonathantanmy@google.com \
--cc=jrnieder@gmail.com \
--cc=peff@peff.net \
--cc=phillip.wood123@gmail.com \
--cc=sandals@crustytoothpaste.net \
--cc=sluongng@gmail.com \
--cc=steadmon@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).