git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] multi-pack-index: repack batches below --batch-size
@ 2020-08-11 15:30 Derrick Stolee via GitGitGadget
  2020-08-11 15:50 ` Taylor Blau
  0 siblings, 1 reply; 3+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-08-11 15:30 UTC (permalink / raw)
  To: git; +Cc: sluongng, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The --batch-size=<size> option of 'git multi-pack-index repack' is
intended to limit the amount of work done by the repack. In the case of
a large repository, this command should repack a number of small
pack-files but leave the large pack-files alone. Most often, the
repository has one large pack-file from a 'git clone' operation and
number of smaller pack-files from incremental 'git fetch' operations.

The issue with '--batch-size' is that it also _prevents_ the repack from
happening if the expected size of the resulting pack-file is too small.
This was intended as a way to avoid frequent churn of small pack-files,
but it has mostly caused confusion when a repository is of "medium"
size. That is, not enormous like the Windows OS repository, but also not
so small that this incremental repack isn't valuable.

The solution presented here is to collect pack-files for repack if their
expected size is smaller than the batch-size parameter until either the
total expected size exceeds the batch-size or all pack-files are
considered. If there are at least two pack-files, then these are
combined to a new pack-file whose size should not be too much larger
than the batch-size.

This new strategy should succeed in keeping the number of pack-files
small in these "medium" size repositories. The concern about churn is
likely not interesting, as the real control over that is the frequency
in which the repack command is run.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
    multi-pack-index: repack batches below --batch-size
    
    As reported [1], the 'git multi-pack-index repack' command has some
    unexpected behavior due to the nature of "expected size" for un-thinned
    fetch packs and the fact that the batch size requires the total size to
    be at least as large as that batch-size. By removing this minimum size
    restriction, we will repack more frequently and prevent this "many
    pack-file" problems.
    
    [1] 
    https://lore.kernel.org/git/6FA8F54A-C92D-497B-895F-AC6E8287AACD@gmail.com/

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-698%2Fderrickstolee%2Fmidx-repack-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-698/derrickstolee/midx-repack-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/698

 Documentation/git-multi-pack-index.txt | 11 ++++++-----
 midx.c                                 |  2 +-
 t/t5319-multi-pack-index.sh            | 18 ++++++++++++++++++
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 0c6619493c..eb0caa0439 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -51,11 +51,12 @@ repack::
 	multi-pack-index, then divide by the total number of objects in
 	the pack and multiply by the pack size. We select packs with
 	expected size below the batch size until the set of packs have
-	total expected size at least the batch size. If the total size
-	does not reach the batch size, then do nothing. If a new pack-
-	file is created, rewrite the multi-pack-index to reference the
-	new pack-file. A later run of 'git multi-pack-index expire' will
-	delete the pack-files that were part of this batch.
+	total expected size at least the batch size, or all pack-files
+	are considered. If only one pack-file is selected, then do
+	nothing. If a new pack-file is created, rewrite the
+	multi-pack-index to reference the new pack-file. A later run of
+	'git multi-pack-index expire' will delete the pack-files that
+	were part of this batch.
 +
 If `repack.packKeptObjects` is `false`, then any pack-files with an
 associated `.keep` file will not be selected for the batch to repack.
diff --git a/midx.c b/midx.c
index 6d1584ca51..38690b46c9 100644
--- a/midx.c
+++ b/midx.c
@@ -1371,7 +1371,7 @@ static int fill_included_packs_batch(struct repository *r,
 
 	free(pack_info);
 
-	if (total_size < batch_size || packs_to_repack < 2)
+	if (packs_to_repack < 2)
 		return 1;
 
 	return 0;
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 7214cab36c..b05190f500 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -643,6 +643,7 @@ test_expect_success 'expire respects .keep files' '
 '
 
 test_expect_success 'repack --batch-size=0 repacks everything' '
+	cp -r dup dup2 &&
 	(
 		cd dup &&
 		rm .git/objects/pack/*.keep &&
@@ -662,4 +663,21 @@ test_expect_success 'repack --batch-size=0 repacks everything' '
 	)
 '
 
+test_expect_success 'repack --batch-size=<large> repacks everything' '
+	(
+		cd dup2 &&
+		rm .git/objects/pack/*.keep &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 2 idx-list &&
+		git multi-pack-index repack --batch-size=2000000 &&
+		ls .git/objects/pack/*idx >idx-list &&
+		test_line_count = 3 idx-list &&
+		test-tool read-midx .git/objects | grep idx >midx-list &&
+		test_line_count = 3 midx-list &&
+		git multi-pack-index expire &&
+		ls -al .git/objects/pack/*idx >idx-list &&
+		test_line_count = 1 idx-list
+	)
+'
+
 test_done

base-commit: 47ae905ffb98cc4d4fd90083da6bc8dab55d9ecc
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] multi-pack-index: repack batches below --batch-size
  2020-08-11 15:30 [PATCH] multi-pack-index: repack batches below --batch-size Derrick Stolee via GitGitGadget
@ 2020-08-11 15:50 ` Taylor Blau
  2020-08-11 21:06   ` Junio C Hamano
  0 siblings, 1 reply; 3+ messages in thread
From: Taylor Blau @ 2020-08-11 15:50 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, sluongng, Derrick Stolee, Derrick Stolee

On Tue, Aug 11, 2020 at 03:30:18PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The --batch-size=<size> option of 'git multi-pack-index repack' is
> intended to limit the amount of work done by the repack. In the case of
> a large repository, this command should repack a number of small
> pack-files but leave the large pack-files alone. Most often, the
> repository has one large pack-file from a 'git clone' operation and
> number of smaller pack-files from incremental 'git fetch' operations.
>
> The issue with '--batch-size' is that it also _prevents_ the repack from
> happening if the expected size of the resulting pack-file is too small.
> This was intended as a way to avoid frequent churn of small pack-files,
> but it has mostly caused confusion when a repository is of "medium"
> size. That is, not enormous like the Windows OS repository, but also not
> so small that this incremental repack isn't valuable.
>
> The solution presented here is to collect pack-files for repack if their
> expected size is smaller than the batch-size parameter until either the
> total expected size exceeds the batch-size or all pack-files are
> considered. If there are at least two pack-files, then these are
> combined to a new pack-file whose size should not be too much larger
> than the batch-size.
>
> This new strategy should succeed in keeping the number of pack-files
> small in these "medium" size repositories. The concern about churn is
> likely not interesting, as the real control over that is the frequency
> in which the repack command is run.

OK. This '--batch-size' parameter is a little magical to me, but the
behavior you are describing seems sane. I think that your assessment of
the down-side is reasonable, too.

Looks fine to me.

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>     multi-pack-index: repack batches below --batch-size
>
>     As reported [1], the 'git multi-pack-index repack' command has some
>     unexpected behavior due to the nature of "expected size" for un-thinned
>     fetch packs and the fact that the batch size requires the total size to
>     be at least as large as that batch-size. By removing this minimum size
>     restriction, we will repack more frequently and prevent this "many
>     pack-file" problems.
>
>     [1]
>     https://lore.kernel.org/git/6FA8F54A-C92D-497B-895F-AC6E8287AACD@gmail.com/
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-698%2Fderrickstolee%2Fmidx-repack-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-698/derrickstolee/midx-repack-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/698
>
>  Documentation/git-multi-pack-index.txt | 11 ++++++-----
>  midx.c                                 |  2 +-
>  t/t5319-multi-pack-index.sh            | 18 ++++++++++++++++++
>  3 files changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> index 0c6619493c..eb0caa0439 100644
> --- a/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -51,11 +51,12 @@ repack::
>  	multi-pack-index, then divide by the total number of objects in
>  	the pack and multiply by the pack size. We select packs with
>  	expected size below the batch size until the set of packs have
> -	total expected size at least the batch size. If the total size
> -	does not reach the batch size, then do nothing. If a new pack-
> -	file is created, rewrite the multi-pack-index to reference the
> -	new pack-file. A later run of 'git multi-pack-index expire' will
> -	delete the pack-files that were part of this batch.
> +	total expected size at least the batch size, or all pack-files
> +	are considered. If only one pack-file is selected, then do
> +	nothing. If a new pack-file is created, rewrite the
> +	multi-pack-index to reference the new pack-file. A later run of
> +	'git multi-pack-index expire' will delete the pack-files that
> +	were part of this batch.
>  +
>  If `repack.packKeptObjects` is `false`, then any pack-files with an
>  associated `.keep` file will not be selected for the batch to repack.
> diff --git a/midx.c b/midx.c
> index 6d1584ca51..38690b46c9 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -1371,7 +1371,7 @@ static int fill_included_packs_batch(struct repository *r,
>
>  	free(pack_info);
>
> -	if (total_size < batch_size || packs_to_repack < 2)
> +	if (packs_to_repack < 2)
>  		return 1;
>
>  	return 0;
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index 7214cab36c..b05190f500 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -643,6 +643,7 @@ test_expect_success 'expire respects .keep files' '
>  '
>
>  test_expect_success 'repack --batch-size=0 repacks everything' '
> +	cp -r dup dup2 &&
>  	(
>  		cd dup &&
>  		rm .git/objects/pack/*.keep &&
> @@ -662,4 +663,21 @@ test_expect_success 'repack --batch-size=0 repacks everything' '
>  	)
>  '
>
> +test_expect_success 'repack --batch-size=<large> repacks everything' '
> +	(
> +		cd dup2 &&
> +		rm .git/objects/pack/*.keep &&
> +		ls .git/objects/pack/*idx >idx-list &&
> +		test_line_count = 2 idx-list &&
> +		git multi-pack-index repack --batch-size=2000000 &&
> +		ls .git/objects/pack/*idx >idx-list &&
> +		test_line_count = 3 idx-list &&
> +		test-tool read-midx .git/objects | grep idx >midx-list &&
> +		test_line_count = 3 midx-list &&
> +		git multi-pack-index expire &&
> +		ls -al .git/objects/pack/*idx >idx-list &&
> +		test_line_count = 1 idx-list
> +	)
> +'
> +
>  test_done
>
> base-commit: 47ae905ffb98cc4d4fd90083da6bc8dab55d9ecc
> --
> gitgitgadget
Thanks,
Taylor

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] multi-pack-index: repack batches below --batch-size
  2020-08-11 15:50 ` Taylor Blau
@ 2020-08-11 21:06   ` Junio C Hamano
  0 siblings, 0 replies; 3+ messages in thread
From: Junio C Hamano @ 2020-08-11 21:06 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Derrick Stolee via GitGitGadget, git, sluongng, Derrick Stolee,
	Derrick Stolee

Taylor Blau <me@ttaylorr.com> writes:

> OK. This '--batch-size' parameter is a little magical to me, but the
> behavior you are describing seems sane. I think that your assessment of
> the down-side is reasonable, too.
>
> Looks fine to me.
>
>   Reviewed-by: Taylor Blau <me@ttaylorr.com>
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Thanks, both.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-08-11 21:06 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-11 15:30 [PATCH] multi-pack-index: repack batches below --batch-size Derrick Stolee via GitGitGadget
2020-08-11 15:50 ` Taylor Blau
2020-08-11 21:06   ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).