git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Stefan Beller <sbeller@google.com>, gitgitgadget@gmail.com
Cc: git <git@vger.kernel.org>, "Jeff King" <peff@peff.net>,
	"Jonathan Nieder" <jrnieder@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Derrick Stolee" <dstolee@microsoft.com>
Subject: Re: [PATCH 5/5] midx: implement midx_repack()
Date: Tue, 11 Dec 2018 08:00:15 -0500	[thread overview]
Message-ID: <113b5ed8-9b25-650b-7f35-6f35bc69075b@gmail.com> (raw)
In-Reply-To: <CAGZ79kbPcy2U9XJA+Je0zRxFsQJGA9u8nfYZe_s75V8c97+dNw@mail.gmail.com>

On 12/10/2018 9:32 PM, Stefan Beller wrote:
> On Mon, Dec 10, 2018 at 10:06 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> To repack using a multi-pack-index, first sort all pack-files by
>> their modified time. Second, walk those pack-files from oldest
>> to newest, adding the packs to a list if they are smaller than the
>> given pack-size. Finally, collect the objects from the multi-pack-
>> index that are in those packs and send them to 'git pack-objects'.
> Makes sense.
>
> With this operation we only coalesce some packfiles into a new
> pack file. So to perform the "complete" repack this command
> has to be run repeatedly until there is at most one packfile
> left that is smaller than batch size.

Well, the batch size essentially means "If a pack-file is larger than 
<size>, then leave it be. I'm happy with packs that large." This assumes 
that the reason the pack is that large is because it was already 
combined with other packs or contains a lot of objects.

>
> Imagine the following scenario:
>
>    There are 5 packfiles A, B, C, D, E,
>    created last Monday thru Friday (A is oldest, E youngest).
>    The sizes are [A=4, B=6, C=5, D=5, E=4]
>
>    You'd issue a repack with batch size=10, such that
>    A and B would be repacked into F, which is
>    created today, size is less or equal than 10.
>
>    You issue another repack tomorrow, which then would
>    coalesce C and D to G, which is
>    dated tomorrow, size is less or equal to 10 as well.
>
>    You issue a third repack, which then takes E
>    (as it is the oldest) and would probably find F as the
>    next oldest (assuming it is less than 10), to repack
>    into H.
>
>    H is then compromised of A, B and E, and G is C+D.
>
> In a way these repacks, always picking up the oldest,
> sound like you "roll forward" objects into new packs.
> As the new packs are newest (we have no packs from
> the future), we'd cycle through different packs to look at
> for packing on each repacking.
>
> It is however more likely that content is more similar
> on a temporal basis. (e.g. I am boldly claiming that
> [ABC, DE] would take less space than [ABE, CD]
> as produced above).
>
> (The obvious solution to this hypothetical would be
> to backdate the resulting pack to the youngest pack
> that is input to the new pack, but I dislike fudging with
> the time a file is created/touched, so let's not go there)

This raises a good point about what happens when we "roll over" into the 
"repacked" packs.

I'm not claiming that this is an optimal way to save space, but is a way 
to incrementally collect small packs into slightly larger packs, all 
while not interrupting concurrent Git commands. Reducing pack count 
improves data locality, which is my goal here. In our environment, we do 
see reduced space as a benefit, even if it is not optimal.

>
> Would the object count make sense as input instead of
> the pack date?
>
>
>> While first designing a 'git multi-pack-index repack' operation, I
>> started by collecting the batches based on the size of the objects
>> instead of the size of the pack-files. This allows repacking a
>> large pack-file that has very few referencd objects. However, this
> referenced
>
>> came at a significant cost of parsing pack-files instead of simply
>> reading the multi-pack-index and getting the file information for
>> the pack-files. This object-size idea could be a direction for
>> future expansion in this area.
> Ah, that also explains why the above idea is toast.
>
> Would it make sense to extend or annotate the midx file
> to give hints at which packs are easy to combine?
>
> I guess such an "annotation worker" could run in a separate
> thread / pool with the lowest priority as this seems like a
> decent fallback for the lack of any better information how
> to pick the packfiles.

One idea I had earlier (and is in 
Documentation/technical/multi-pack-index.txt) is to have the midx track 
metadata about pack-files. We could avoid this "rollover" problem by 
tracking which packs were repacked using this mechanism. This could 
create a "pack generation" value, and we could collect a batch of packs 
that have the same generation. This does seem a bit overcomplicated for 
the potential benefit, and could waste better use of that metadata 
concept. For instance, we could use the metadata to track the 
information given by ".keep" and ".promisor" files.

Thanks,

-Stolee


  reply	other threads:[~2018-12-11 13:00 UTC|newest]

Thread overview: 92+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-10 18:06 [PATCH 0/5] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
2018-12-10 18:06 ` [PATCH 1/5] multi-pack-index: prepare for 'expire' verb Derrick Stolee via GitGitGadget
2018-12-11  1:35   ` Stefan Beller
2018-12-11  1:59     ` SZEDER Gábor
2018-12-11 12:32       ` Derrick Stolee
2018-12-10 18:06 ` [PATCH 2/5] midx: refactor permutation logic Derrick Stolee via GitGitGadget
2018-12-10 18:06 ` [PATCH 3/5] multi-pack-index: implement 'expire' verb Derrick Stolee via GitGitGadget
2018-12-10 18:06 ` [PATCH 4/5] multi-pack-index: prepare 'repack' verb Derrick Stolee via GitGitGadget
2018-12-11  1:54   ` Stefan Beller
2018-12-11 12:45     ` Derrick Stolee
2018-12-10 18:06 ` [PATCH 5/5] midx: implement midx_repack() Derrick Stolee via GitGitGadget
2018-12-11  2:32   ` Stefan Beller
2018-12-11 13:00     ` Derrick Stolee [this message]
2018-12-12  7:40   ` Junio C Hamano
2018-12-13  4:23     ` Junio C Hamano
2018-12-21 16:28 ` [PATCH v2 0/7] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
2018-12-21 16:28   ` [PATCH v2 1/7] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
2018-12-21 16:28   ` [PATCH v2 2/7] Docs: rearrange subcommands for multi-pack-index Derrick Stolee via GitGitGadget
2018-12-21 16:28   ` [PATCH v2 3/7] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee via GitGitGadget
2018-12-21 16:28   ` [PATCH v2 4/7] midx: refactor permutation logic Derrick Stolee via GitGitGadget
2018-12-21 16:28   ` [PATCH v2 5/7] multi-pack-index: implement 'expire' verb Derrick Stolee via GitGitGadget
2018-12-21 16:28   ` [PATCH v2 7/7] midx: implement midx_repack() Derrick Stolee via GitGitGadget
2018-12-21 16:28   ` [PATCH v2 6/7] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
2019-01-09 15:21   ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee via GitGitGadget
2019-01-09 15:21     ` [PATCH v3 1/9] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
2019-01-09 15:21     ` [PATCH v3 2/9] Docs: rearrange subcommands for multi-pack-index Derrick Stolee via GitGitGadget
2019-01-09 15:21     ` [PATCH v3 4/9] midx: simplify computation of pack name lengths Derrick Stolee via GitGitGadget
2019-01-09 15:21     ` [PATCH v3 3/9] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee via GitGitGadget
2019-01-09 15:21     ` [PATCH v3 5/9] midx: refactor permutation logic and pack sorting Derrick Stolee via GitGitGadget
2019-01-23 21:00       ` Jonathan Tan
2019-01-24 17:34         ` Derrick Stolee
2019-01-24 19:17           ` Derrick Stolee
2019-01-09 15:21     ` [PATCH v3 6/9] multi-pack-index: implement 'expire' verb Derrick Stolee via GitGitGadget
2019-01-09 15:54       ` SZEDER Gábor
2019-01-10 18:05         ` Junio C Hamano
2019-01-23 22:13       ` Jonathan Tan
2019-01-24 17:36         ` Derrick Stolee
2019-01-09 15:21     ` [PATCH v3 7/9] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
2019-01-09 15:56       ` SZEDER Gábor
2019-01-23 22:38       ` Jonathan Tan
2019-01-24 19:36         ` Derrick Stolee
2019-01-24 21:38           ` Jonathan Tan
2019-01-09 15:21     ` [PATCH v3 8/9] midx: implement midx_repack() Derrick Stolee via GitGitGadget
2019-01-23 22:33       ` Jonathan Tan
2019-01-09 15:21     ` [PATCH v3 9/9] multi-pack-index: test expire while adding packs Derrick Stolee via GitGitGadget
2019-01-17 15:27     ` [PATCH v3 0/9] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee
2019-01-23 22:44     ` Jonathan Tan
2019-01-24 21:51     ` [PATCH v4 00/10] " Derrick Stolee via GitGitGadget
2019-01-24 21:51       ` [PATCH v4 01/10] repack: refactor pack deletion for future use Derrick Stolee via GitGitGadget
2019-01-24 21:51       ` [PATCH v4 02/10] Docs: rearrange subcommands for multi-pack-index Derrick Stolee via GitGitGadget
2019-01-24 21:51       ` [PATCH v4 03/10] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee via GitGitGadget
2019-01-24 21:51       ` [PATCH v4 04/10] midx: simplify computation of pack name lengths Derrick Stolee via GitGitGadget
2019-01-24 21:51       ` [PATCH v4 05/10] midx: refactor permutation logic and pack sorting Derrick Stolee via GitGitGadget
2019-01-24 21:51       ` [PATCH v4 07/10] multi-pack-index: prepare 'repack' subcommand Derrick Stolee via GitGitGadget
2019-01-25 23:24         ` Josh Steadmon
2019-01-24 21:51       ` [PATCH v4 06/10] multi-pack-index: implement 'expire' subcommand Derrick Stolee via GitGitGadget
2019-01-24 21:52       ` [PATCH v4 08/10] midx: implement midx_repack() Derrick Stolee via GitGitGadget
2019-01-26 17:10         ` Derrick Stolee
2019-01-27 22:50           ` Junio C Hamano
2019-01-24 21:52       ` [PATCH v4 09/10] multi-pack-index: test expire while adding packs Derrick Stolee via GitGitGadget
2019-01-24 21:52       ` [PATCH v4 10/10] midx: add test that 'expire' respects .keep files Derrick Stolee via GitGitGadget
2019-01-24 22:14       ` [PATCH v4 00/10] Create 'expire' and 'repack' verbs for git-multi-pack-index Jonathan Tan
2019-01-25 23:49       ` Josh Steadmon
2019-04-24 15:14       ` [PATCH v5 00/11] " Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 01/11] repack: refactor pack deletion for future use Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 02/11] Docs: rearrange subcommands for multi-pack-index Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 03/11] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 04/11] midx: simplify computation of pack name lengths Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 05/11] midx: refactor permutation logic and pack sorting Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 06/11] multi-pack-index: implement 'expire' subcommand Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 07/11] multi-pack-index: prepare 'repack' subcommand Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 08/11] midx: implement midx_repack() Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 09/11] multi-pack-index: test expire while adding packs Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 10/11] midx: add test that 'expire' respects .keep files Derrick Stolee
2019-04-24 15:14         ` [PATCH v5 11/11] t5319-multi-pack-index.sh: test batch size zero Derrick Stolee
2019-04-25  5:38         ` [PATCH v5 00/11] Create 'expire' and 'repack' verbs for git-multi-pack-index Junio C Hamano
2019-04-25 11:06           ` Derrick Stolee
2019-05-14 18:47         ` [PATCH v6 " Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 01/11] repack: refactor pack deletion for future use Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 02/11] Docs: rearrange subcommands for multi-pack-index Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 03/11] multi-pack-index: prepare for 'expire' subcommand Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 04/11] midx: simplify computation of pack name lengths Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 05/11] midx: refactor permutation logic and pack sorting Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 06/11] multi-pack-index: implement 'expire' subcommand Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 07/11] multi-pack-index: prepare 'repack' subcommand Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 08/11] midx: implement midx_repack() Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 09/11] multi-pack-index: test expire while adding packs Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 10/11] midx: add test that 'expire' respects .keep files Derrick Stolee
2019-05-14 18:47           ` [PATCH v6 11/11] t5319-multi-pack-index.sh: test batch size zero Derrick Stolee
2019-06-10 14:15           ` [PATCH v6 00/11] Create 'expire' and 'repack' verbs for git-multi-pack-index Derrick Stolee
2019-06-10 17:31             ` Junio C Hamano
2019-06-10 17:57               ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=113b5ed8-9b25-650b-7f35-6f35bc69075b@gmail.com \
    --to=stolee@gmail.com \
    --cc=avarab@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jrnieder@gmail.com \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).