git@vger.kernel.org mailing list mirror (one of many)
From: Derrick Stolee <stolee@gmail.com>
To: Johannes Berg <johannes@sipsolutions.net>, git@vger.kernel.org
Subject: Re: [PATCH] pack-format: correct multi-pack-index description
Date: Mon, 10 Feb 2020 10:02:01 -0500
Message-ID: <08dbc3be-34a7-fb8d-e0bd-56a79ab5b65a@gmail.com>
In-Reply-To: <c077a2100038edf2b0c486c0d364bd00f3921074.camel@sipsolutions.net>

On 2/10/2020 9:50 AM, Johannes Berg wrote:
> On Mon, 2020-02-10 at 09:46 -0500, Derrick Stolee wrote:
> 
>> Part of my initial plan was to have this incremental file format.
>> The commit-graph uses a very similar mechanism. The difference may
>> be that you likely allow multiple .midx files found by scanning the
>> pack directory, 
> 
> Right, just scan and use any midx that exist, then compare the packs in
> there against all the packs found, and then remove any packs that
> actually *are* in an midx from the search list. That leaves you with all
> information, but optimised by midx where possible.
> 
>> but I would expect something like the
>> "commit-graph-chain" file that provides an ordered list of the
>> incremental files. This can be important for deciding when to merge
>> layers or delete old files, and would be critical to the possibility
>> of converting reachability bitmaps to rely on a stable object order
>> stored in the multi-pack-index instead of pack-order.
> 
> Right, if we delete then we have to also remove any midx covering the
> deleted pack, that's pretty rare in bup as a backup tool though.
> 
>> The reason the multi-pack-index has not become incremental is that
>> VFS for Git no longer needs to write it very often. We write the
>> entire multi-pack-index during a background job that triggers once
>> per day. If we needed to write it more frequently, then the incremental
>> format would be more important to us.
> 
> So, wait, what if a new pack is created? Does it just get used in
> addition to the multi-pack-index, if it's not covered by it, like I
> described above?
> 
> If so, I guess it wouldn't actually really matter here. I was afraid
> (but didn't check yet) that git would always use only the single multi-
> pack-index file, and not also search additional packs, so that it always
> has to be maintained in "perfect order" ...

Git loads the multi-pack-index file, which includes a sorted list of
the packs it covers. It then scans the "pack" directory for pack-indexes
and checks if they are covered by the multi-pack-index. If not, then
Git will add them to the packed_git struct and use them as normal.
The hope is that this list of "uncovered" packs is small compared to
the data covered by the multi-pack-index.

This allows Git to continue functioning after an action like "git fetch"
that adds a new pack but may not want to rewrite the multi-pack-index.
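
As a rough illustration of that lookup (a sketch only, not Git's actual C
code): the function name split_packs and the covered_names argument are
hypothetical. covered_names stands in for the pack list stored in the
multi-pack-index, which this sketch does not parse, and the file names in
the example are made up.

```python
import os
import tempfile


def split_packs(pack_dir, covered_names):
    """Return (covered, uncovered) pack-index file names found in pack_dir.

    covered_names is a hypothetical stand-in for the pack list stored in
    the multi-pack-index; this sketch does not read that file.
    """
    covered, uncovered = [], []
    for name in sorted(os.listdir(pack_dir)):
        if not name.endswith(".idx"):
            continue  # only pack-indexes are considered in the scan
        if name in covered_names:
            covered.append(name)    # objects are found via the multi-pack-index
        else:
            uncovered.append(name)  # would be loaded and used as a normal pack
    return covered, uncovered


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as pack_dir:
        # Made-up file names; pack-ccc plays the pack added by a later fetch.
        for name in ("pack-aaa.idx", "pack-bbb.idx", "pack-ccc.idx", "pack-aaa.pack"):
            open(os.path.join(pack_dir, name), "w").close()
        covered, uncovered = split_packs(pack_dir, {"pack-aaa.idx", "pack-bbb.idx"})
        print("covered by the multi-pack-index:", covered)
        print("used as ordinary packs:", uncovered)
```

The point of the split is that the "uncovered" list stays short as long as
the multi-pack-index is rewritten every so often.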

Our background maintenance essentially runs these commands:

 1. git multi-pack-index write
 2. git multi-pack-index expire
 3. git multi-pack-index repack

Step 1 ensures all packs are pulled into the multi-pack-index. Step 2
deletes any pack-files whose objects are contained in newer pack-files.
Step 3 creates a new pack-file containing all objects from a set of
small pack-files (using the --batch-size=X option). This process helps
incrementally reduce the size and number of packs. That may be helpful
for your backup tool, too.
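
For a tool that wants to script the same three steps, a minimal sketch
could look like the following. The function names maintenance_commands and
run_maintenance are hypothetical, and the batch size is left to the caller
since the right value depends on the repository; this is not how our
background job is actually implemented.

```python
import subprocess


def maintenance_commands(batch_size):
    """Build the three command lines, in the order listed above."""
    base = ["git", "multi-pack-index"]
    return [
        base + ["write"],                                  # step 1
        base + ["expire"],                                 # step 2
        base + ["repack", f"--batch-size={batch_size}"],   # step 3
    ]


def run_maintenance(repo_dir, batch_size):
    """Run each step inside repo_dir, stopping at the first failure."""
    for argv in maintenance_commands(batch_size):
        subprocess.run(argv, cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_maintenance() to execute them.
    for argv in maintenance_commands(batch_size=1024 * 1024 * 1024):
        print(" ".join(argv))
```

Keeping the command list separate from the code that runs it makes it easy
to log or dry-run the steps before pointing them at a real repository.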

Perhaps after an incremental multi-pack-index is added, then Git could
(optionally) have a mode that only checks the multi-pack-index to
avoid scanning the pack directory. It would require inserting a
multi-pack-index write into the index-pack logic so that every new
pack is covered as soon as it is indexed.

I'm not sure if that mode would be helpful, since the pack directory
scan is typically done once per command and is relatively fast.

>> That said: if someone wanted to contribute an incremental format,
>> then I would be happy to review it!
> 
> I might still get motivated to do so :-)

YOU CAN DO IT! (Did that help?)

-Stolee

Thread overview: 8+ messages
2020-02-07 22:16 [PATCH] pack-format: correct multi-pack-index description Johannes Berg
2020-02-10 14:18 ` Derrick Stolee
2020-02-10 14:22   ` Johannes Berg
2020-02-10 14:46     ` Derrick Stolee
2020-02-10 14:50       ` Johannes Berg
2020-02-10 15:02         ` Derrick Stolee [this message]
2020-02-10 15:06           ` Johannes Berg
2020-02-10 17:02   ` Junio C Hamano
