From: Derrick Stolee <stolee@gmail.com>
To: Junio C Hamano <gitster@pobox.com>,
Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, Johannes.Schindelin@gmx.de,
sandals@crustytoothpaste.net, steadmon@google.com,
jrnieder@gmail.com, peff@peff.net, congdanhqx@gmail.com,
phillip.wood123@gmail.com, emilyshaffer@google.com,
sluongng@gmail.com, jonathantanmy@google.com,
Derrick Stolee <derrickstolee@github.com>,
Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH v2 09/18] maintenance: add loose-objects task
Date: Fri, 24 Jul 2020 10:50:12 -0400 [thread overview]
Message-ID: <b861244c-af0d-581f-b1c8-6c6cb667eed8@gmail.com> (raw)
In-Reply-To: <xmqqr1t2osm3.fsf@gitster.c.googlers.com>
On 7/23/2020 4:59 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> Create a 'loose-objects' task for the 'git maintenance run' command.
>> This helps clean up loose objects without disrupting concurrent Git
>> commands using the following sequence of events:
>>
>> 1. Run 'git prune-packed' to delete any loose objects that exist
>> in a pack-file. Concurrent commands will prefer the packed
>> version of the object to the loose version. (Of course, there
>> are exceptions for commands that specifically care about the
>> location of an object. These are rare for a user to run on
>> purpose, and we hope a user that has selected background
>> maintenance will not be trying to do foreground maintenance.)
>
> OK. That would make sense.
>
>> 2. Run 'git pack-objects' on a batch of loose objects. These
>> objects are grouped by scanning the loose object directories in
>> lexicographic order until listing all loose objects -or-
>> reaching 50,000 objects. This is more than enough if the loose
>> objects are created only by a user doing normal development.
>
> I haven't seen this in action, but my gut feeling is that this would
> result in horrible locality and deltification in the resulting
> packfile. The order you feed the objects to pack-objects and the
> path hint you attach to each object matters quite a lot.
>
> I do agree that it would be useful to have a task to deal with only
> loose objects without touching existing packfiles. I just am not
> sure if 2. is a worthwhile thing to do. A poorly constructed pack
> will also contaminate later packfiles made without "-f" option to
> "git repack".
There are several factors going on here:
* In a partial clone, it is likely that we get loose objects only
due to a command like "git log -p" that downloads blobs
one-by-one. In such a case, this step coming in later and picking
up those blobs _will_ find good deltas because they are present
in the same batch.
* (I know this case isn't important to core Git, but please indulge
me) In a VFS for Git repo, the loose objects correspond to blobs
that were faulted in by a virtual filesystem read. In this case,
the blobs are usually from a single commit in history, so good
deltas between the blobs don't actually exist!
* My experience indicates that the packs created by the
loose-objects task are rather small (when created daily). This
means that they get selected by the incremental-repack task to
repack into a new pack-file where deltas are recomputed with modest
success. As mentioned in that task, we saw a significant compression
factor using that step for users of the Windows OS repo, mostly due
to recomputing tree deltas.
* Some amount of "extra" space is expected with this incremental
repacking scheme. The most space-efficient thing to do is a full
repack along with a tree walk that detects the paths used for each
blob, allowing better hints for delta compression. However, that
operation is very _time_ consuming. The trade-off here is something
I should make more explicit. In my experience, disk space is cheap
but CPU time is expensive. Most repositories could probably do a
daily repack without being a disruption to the user. These steps
enable maintenance for repositories where a full repack is too
disruptive.
I hope this adds some context. I would love if someone who knows more
about delta compression could challenge my assumptions. Sharing that
expertise can help create better maintenance strategies. Junio's
initial concern here is a good first step in that direction.
Thanks,
-Stolee
next prev parent reply other threads:[~2020-07-24 14:50 UTC|newest]
Thread overview: 164+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 01/21] gc: use the_repository less often Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 02/21] gc: use repository in too_many_loose_objects() Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 03/21] gc: use repo config Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 04/21] gc: drop the_repository in log location Derrick Stolee via GitGitGadget
2020-07-09 2:22 ` Jonathan Tan
2020-07-09 11:13 ` Derrick Stolee
2020-07-07 14:21 ` [PATCH 05/21] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 06/21] maintenance: add --quiet option Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 07/21] maintenance: replace run_auto_gc() Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 08/21] maintenance: initialize task array and hashmap Derrick Stolee via GitGitGadget
2020-07-09 2:25 ` Jonathan Tan
2020-07-09 13:15 ` Derrick Stolee
2020-07-09 13:51 ` Junio C Hamano
2020-07-07 14:21 ` [PATCH 09/21] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
2020-07-09 2:29 ` Jonathan Tan
2020-07-09 11:14 ` Derrick Stolee
2020-07-09 22:52 ` Jeff King
2020-07-09 23:41 ` Derrick Stolee
2020-07-07 14:21 ` [PATCH 10/21] maintenance: add --task option Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 11/21] maintenance: take a lock on the objects directory Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 12/21] maintenance: add fetch task Derrick Stolee via GitGitGadget
2020-07-09 2:35 ` Jonathan Tan
2020-07-07 14:21 ` [PATCH 13/21] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 14/21] maintenance: add pack-files task Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 15/21] maintenance: auto-size pack-files batch Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 16/21] maintenance: create maintenance.<task>.enabled config Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 17/21] maintenance: use pointers to check --auto Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 18/21] maintenance: add auto condition for commit-graph task Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 19/21] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 20/21] maintenance: add pack-files auto condition Derrick Stolee via GitGitGadget
2020-07-07 14:21 ` [PATCH 21/21] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-07-08 23:57 ` [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Emily Shaffer
2020-07-09 11:21 ` Derrick Stolee
2020-07-09 12:43 ` Derrick Stolee
2020-07-09 23:16 ` Jeff King
2020-07-09 23:45 ` Derrick Stolee
2020-07-10 18:46 ` Emily Shaffer
2020-07-10 19:30 ` Son Luong Ngoc
2020-07-09 14:05 ` Junio C Hamano
2020-07-09 15:54 ` Derrick Stolee
2020-07-09 16:26 ` Junio C Hamano
2020-07-09 16:56 ` Derrick Stolee
2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 01/18] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
2020-07-25 1:26 ` Taylor Blau
2020-07-25 1:47 ` Đoàn Trần Công Danh
2020-07-29 22:19 ` Jonathan Nieder
2020-07-30 13:12 ` Derrick Stolee
2020-07-31 0:30 ` Jonathan Nieder
2020-08-03 17:37 ` Derrick Stolee
2020-08-03 17:46 ` Jonathan Nieder
2020-08-03 22:46 ` Taylor Blau
2020-08-03 23:01 ` Jonathan Nieder
2020-08-03 23:08 ` Taylor Blau
2020-08-03 23:17 ` Jonathan Nieder
2020-08-04 0:07 ` Junio C Hamano
2020-08-04 13:32 ` Derrick Stolee
2020-08-04 14:42 ` Jonathan Nieder
2020-08-04 16:32 ` Derrick Stolee
2020-08-04 17:02 ` Jonathan Nieder
2020-08-04 17:51 ` Derrick Stolee
2020-08-05 15:02 ` Derrick Stolee
2020-07-31 16:40 ` Jonathan Nieder
2020-08-03 23:52 ` Jonathan Nieder
2020-07-23 17:56 ` [PATCH v2 02/18] maintenance: add --quiet option Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 03/18] maintenance: replace run_auto_gc() Derrick Stolee via GitGitGadget
2020-07-23 20:21 ` Junio C Hamano
2020-07-25 1:33 ` Taylor Blau
2020-07-30 13:29 ` Derrick Stolee
2020-07-30 13:31 ` Derrick Stolee
2020-07-30 19:00 ` Eric Sunshine
2020-07-30 20:21 ` Derrick Stolee
2020-07-23 17:56 ` [PATCH v2 04/18] maintenance: initialize task array Derrick Stolee via GitGitGadget
2020-07-23 19:57 ` Junio C Hamano
2020-07-24 12:23 ` Derrick Stolee
2020-07-24 12:51 ` Derrick Stolee
2020-07-24 19:39 ` Junio C Hamano
2020-07-25 1:46 ` Taylor Blau
2020-07-29 22:19 ` Emily Shaffer
2020-07-23 17:56 ` [PATCH v2 05/18] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
2020-07-23 20:22 ` Junio C Hamano
2020-07-24 13:09 ` Derrick Stolee
2020-07-24 19:47 ` Junio C Hamano
2020-07-25 1:52 ` Taylor Blau
2020-07-30 13:59 ` Derrick Stolee
2020-07-29 0:22 ` Jeff King
2020-07-23 17:56 ` [PATCH v2 06/18] maintenance: add --task option Derrick Stolee via GitGitGadget
2020-07-23 20:21 ` Junio C Hamano
2020-07-23 22:18 ` Junio C Hamano
2020-07-24 13:36 ` Derrick Stolee
2020-07-24 19:50 ` Junio C Hamano
2020-07-23 17:56 ` [PATCH v2 07/18] maintenance: take a lock on the objects directory Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 08/18] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-07-23 20:53 ` Junio C Hamano
2020-07-24 14:25 ` Derrick Stolee
2020-07-24 20:47 ` Junio C Hamano
2020-07-25 1:37 ` Đoàn Trần Công Danh
2020-07-25 1:48 ` Junio C Hamano
2020-07-27 14:07 ` Derrick Stolee
2020-07-27 16:13 ` Junio C Hamano
2020-07-27 18:27 ` Derrick Stolee
2020-07-28 16:37 ` [PATCH v2] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano
2020-07-29 9:12 ` Phillip Wood
2020-07-29 9:17 ` Phillip Wood
2020-07-30 15:17 ` Derrick Stolee
2020-07-23 17:56 ` [PATCH v2 09/18] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-07-23 20:59 ` Junio C Hamano
2020-07-24 14:50 ` Derrick Stolee [this message]
2020-07-24 19:57 ` Junio C Hamano
2020-07-29 22:21 ` Emily Shaffer
2020-07-30 15:38 ` Derrick Stolee
2020-07-23 17:56 ` [PATCH v2 10/18] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
2020-07-23 22:00 ` Junio C Hamano
2020-07-24 15:03 ` Derrick Stolee
2020-07-29 22:22 ` Emily Shaffer
2020-07-23 17:56 ` [PATCH v2 11/18] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-07-23 22:15 ` Junio C Hamano
2020-07-23 23:09 ` Eric Sunshine
2020-07-23 23:24 ` Junio C Hamano
2020-07-24 16:09 ` Derrick Stolee
2020-07-24 19:51 ` Derrick Stolee
2020-07-24 20:17 ` Junio C Hamano
2020-07-29 22:23 ` Emily Shaffer
2020-07-30 16:57 ` Derrick Stolee
2020-07-30 19:02 ` Derrick Stolee
2020-07-30 19:24 ` Chris Torek
2020-08-05 12:37 ` Đoàn Trần Công Danh
2020-08-06 13:54 ` Derrick Stolee
2020-07-23 17:56 ` [PATCH v2 12/18] maintenance: create maintenance.<task>.enabled config Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 13/18] maintenance: use pointers to check --auto Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 14/18] maintenance: add auto condition for commit-graph task Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 15/18] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 16/18] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 17/18] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-07-23 17:56 ` [PATCH v2 18/18] maintenance: add trace2 regions for task execution Derrick Stolee via GitGitGadget
2020-07-29 22:03 ` [PATCH v2 00/18] Maintenance builtin, allowing 'gc --auto' customization Emily Shaffer
2020-07-30 22:24 ` [PATCH v3 00/20] " Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 01/20] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 02/20] maintenance: add --quiet option Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 03/20] maintenance: replace run_auto_gc() Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 04/20] maintenance: initialize task array Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 05/20] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 06/20] maintenance: add --task option Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 07/20] maintenance: take a lock on the objects directory Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 08/20] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 09/20] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 10/20] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 11/20] midx: enable core.multiPackIndex by default Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 12/20] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 13/20] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-07-30 23:36 ` Chris Torek
2020-08-03 17:43 ` Derrick Stolee
2020-07-30 22:24 ` [PATCH v3 14/20] maintenance: create maintenance.<task>.enabled config Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 15/20] maintenance: use pointers to check --auto Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 16/20] maintenance: add auto condition for commit-graph task Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 17/20] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 18/20] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 19/20] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-07-30 22:24 ` [PATCH v3 20/20] maintenance: add trace2 regions for task execution Derrick Stolee via GitGitGadget
2020-07-30 23:06 ` [PATCH v3 00/20] Maintenance builtin, allowing 'gc --auto' customization Junio C Hamano
2020-07-30 23:31 ` Junio C Hamano
2020-07-31 2:58 ` Junio C Hamano
2020-08-06 17:58 ` Derrick Stolee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=b861244c-af0d-581f-b1c8-6c6cb667eed8@gmail.com \
--to=stolee@gmail.com \
--cc=Johannes.Schindelin@gmx.de \
--cc=congdanhqx@gmail.com \
--cc=derrickstolee@github.com \
--cc=dstolee@microsoft.com \
--cc=emilyshaffer@google.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=gitster@pobox.com \
--cc=jonathantanmy@google.com \
--cc=jrnieder@gmail.com \
--cc=peff@peff.net \
--cc=phillip.wood123@gmail.com \
--cc=sandals@crustytoothpaste.net \
--cc=sluongng@gmail.com \
--cc=steadmon@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).