* [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs
@ 2025-02-27 18:29 Taylor Blau
2025-02-27 18:29 ` [PATCH 1/2] builtin/repack.c: simplify cruft pack aggregation Taylor Blau
` (6 more replies)
0 siblings, 7 replies; 54+ messages in thread
From: Taylor Blau @ 2025-02-27 18:29 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren
This short series contains a fix for a bug I noticed while rolling out
multi-cruft packs (via 'git repack --max-cruft-size') within GitHub's
infrastructure.
The series is structured as follows:
- The first patch simplifies how 'repack' aggregates cruft packs
together when their size is below the '--max-cruft-size' or
'--max-pack-size' threshold. This simplification changes behavior
slightly, but not in a meaningful way. It occurred to me while
writing the second patch.
- The second patch describes and fixes the main bug. The gist here is
that objects which are (a) unreachable, (b) exist in a cruft pack
being retained, and (c) were freshened to have a more recent mtime
than any existing cruft copy are unable to be freshened.
The fix pursued in the second patch changes the rules around when we
want to retain an object via builtin/pack-objects.c::want_found_object()
when at least one cruft pack will survive the repack.
Previously the rule was to discard any object which appears in any
surviving pack, regardless of mtime. The rule now is to only discard an
object if it appears in either (a) a non-cruft pack which will survive
the repack, or (b) a cruft pack whose mtime for that object is older
than the one we are trying to pack.
I think that this is the right behavior, but admittedly putting this
series together hurt my brain trying to think through all of the cases.
I'm fairly confident in the testing here as I remember it being fairly
exhaustive of all interesting cases. But I'd appreciate a sanity check
from others that they too are convinced this is the right approach.
Thanks in advance for your review!
Taylor Blau (2):
builtin/repack.c: simplify cruft pack aggregation
builtin/pack-objects.c: freshen objects from existing cruft packs
builtin/pack-objects.c | 118 ++++++++++++++++++++++++++++++++++------
builtin/repack.c | 38 +------------
packfile.c | 3 +-
packfile.h | 2 +
t/t7704-repack-cruft.sh | 106 ++++++++++++++++++++++--------------
5 files changed, 171 insertions(+), 96 deletions(-)
base-commit: 08bdfd453584e489d5a551aecbdcb77584e1b958
--
2.49.0.rc0.2.gc0c926adde2
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 1/2] builtin/repack.c: simplify cruft pack aggregation
2025-02-27 18:29 [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs Taylor Blau
@ 2025-02-27 18:29 ` Taylor Blau
2025-02-27 19:23 ` Elijah Newren
2025-02-28 7:52 ` Patrick Steinhardt
2025-02-27 18:29 ` [PATCH 2/2] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
` (5 subsequent siblings)
6 siblings, 2 replies; 54+ messages in thread
From: Taylor Blau @ 2025-02-27 18:29 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren
In 37dc6d8104 (builtin/repack.c: implement support for
`--max-cruft-size`, 2023-10-02), 'git repack' built on support for
multiple cruft packs in Git by instructing 'git pack-objects --cruft'
how to aggregate smaller cruft packs up to the provided threshold.
The implementation in 37dc6d8104 worked something like the following
pseudo-code:
total_size = 0;
for (p in cruft packs) {
if (p->pack_size + total_size < max_size) {
total_size += p->pack_size;
collapse(p)
} else {
retain(p);
}
}
The original idea behind this approach was that smaller cruft packs
would get combined together until the sum of their sizes was no larger
than the given max pack size.
There is a much simpler way to achieve this, however, which is to simply
combine *all* cruft packs which are smaller than the threshold,
regardless of what their sum is. With '--max-pack-size', 'pack-objects'
will split out the resulting pack into individual pack(s) if necessary
to ensure that the written pack(s) are each no larger than the provided
threshold.
This yields a slight behavior change, which is reflected in the removed
test. Previous to this change, we would aggregate smaller cruft packs
first, whereas now we will opportunistically combine as many cruft packs
as possible. As as result, that test is no longer relevant, and can be
deleted.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/repack.c | 38 ++-----------------------------------
t/t7704-repack-cruft.sh | 42 -----------------------------------------
2 files changed, 2 insertions(+), 78 deletions(-)
diff --git a/builtin/repack.c b/builtin/repack.c
index 75e3752353a..4d83d40f39f 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -1022,29 +1022,13 @@ static int write_filtered_pack(const struct pack_objects_args *args,
return finish_pack_objects_cmd(&cmd, names, local);
}
-static int existing_cruft_pack_cmp(const void *va, const void *vb)
-{
- struct packed_git *a = *(struct packed_git **)va;
- struct packed_git *b = *(struct packed_git **)vb;
-
- if (a->pack_size < b->pack_size)
- return -1;
- if (a->pack_size > b->pack_size)
- return 1;
- return 0;
-}
-
static void collapse_small_cruft_packs(FILE *in, size_t max_size,
struct existing_packs *existing)
{
- struct packed_git **existing_cruft, *p;
+ struct packed_git *p;
struct strbuf buf = STRBUF_INIT;
- size_t total_size = 0;
- size_t existing_cruft_nr = 0;
size_t i;
- ALLOC_ARRAY(existing_cruft, existing->cruft_packs.nr);
-
for (p = get_all_packs(the_repository); p; p = p->next) {
if (!(p->is_cruft && p->pack_local))
continue;
@@ -1056,24 +1040,7 @@ static void collapse_small_cruft_packs(FILE *in, size_t max_size,
if (!string_list_has_string(&existing->cruft_packs, buf.buf))
continue;
- if (existing_cruft_nr >= existing->cruft_packs.nr)
- BUG("too many cruft packs (found %"PRIuMAX", but knew "
- "of %"PRIuMAX")",
- (uintmax_t)existing_cruft_nr + 1,
- (uintmax_t)existing->cruft_packs.nr);
- existing_cruft[existing_cruft_nr++] = p;
- }
-
- QSORT(existing_cruft, existing_cruft_nr, existing_cruft_pack_cmp);
-
- for (i = 0; i < existing_cruft_nr; i++) {
- size_t proposed;
-
- p = existing_cruft[i];
- proposed = st_add(total_size, p->pack_size);
-
- if (proposed <= max_size) {
- total_size = proposed;
+ if (p->pack_size < max_size) {
fprintf(in, "-%s\n", pack_basename(p));
} else {
retain_cruft_pack(existing, p);
@@ -1086,7 +1053,6 @@ static void collapse_small_cruft_packs(FILE *in, size_t max_size,
existing->non_kept_packs.items[i].string);
strbuf_release(&buf);
- free(existing_cruft);
}
static int write_cruft_pack(const struct pack_objects_args *args,
diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index 959e6e26488..5a76b541ddd 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -194,48 +194,6 @@ test_expect_success '--max-cruft-size combines existing packs when below thresho
)
'
-test_expect_success '--max-cruft-size combines smaller packs first' '
- git init max-cruft-size-consume-small &&
- (
- cd max-cruft-size-consume-small &&
-
- test_commit base &&
- git repack -ad &&
-
- cruft_foo="$(generate_cruft_pack foo 524288)" && # 0.5 MiB
- cruft_bar="$(generate_cruft_pack bar 524288)" && # 0.5 MiB
- cruft_baz="$(generate_cruft_pack baz 1048576)" && # 1.0 MiB
- cruft_quux="$(generate_cruft_pack quux 1572864)" && # 1.5 MiB
-
- test-tool pack-mtimes "$(basename $cruft_foo)" >expect.raw &&
- test-tool pack-mtimes "$(basename $cruft_bar)" >>expect.raw &&
- sort expect.raw >expect.objects &&
-
- # repacking with `--max-cruft-size=2M` should combine
- # both 0.5 MiB packs together, instead of, say, one of
- # the 0.5 MiB packs with the 1.0 MiB pack
- ls $packdir/pack-*.mtimes | sort >cruft.before &&
- git repack -d --cruft --max-cruft-size=2M &&
- ls $packdir/pack-*.mtimes | sort >cruft.after &&
-
- comm -13 cruft.before cruft.after >cruft.new &&
- comm -23 cruft.before cruft.after >cruft.removed &&
-
- test_line_count = 1 cruft.new &&
- test_line_count = 2 cruft.removed &&
-
- # the two smaller packs should be rolled up first
- printf "%s\n" $cruft_foo $cruft_bar | sort >expect.removed &&
- test_cmp expect.removed cruft.removed &&
-
- # ...and contain the set of objects rolled up
- test-tool pack-mtimes "$(basename $(cat cruft.new))" >actual.raw &&
- sort actual.raw >actual.objects &&
-
- test_cmp expect.objects actual.objects
- )
-'
-
test_expect_success 'setup --max-cruft-size with freshened objects' '
git init max-cruft-size-freshen &&
(
--
2.49.0.rc0.2.gc0c926adde2
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH 2/2] builtin/pack-objects.c: freshen objects from existing cruft packs
2025-02-27 18:29 [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs Taylor Blau
2025-02-27 18:29 ` [PATCH 1/2] builtin/repack.c: simplify cruft pack aggregation Taylor Blau
@ 2025-02-27 18:29 ` Taylor Blau
2025-02-27 19:26 ` Elijah Newren
2025-02-27 19:28 ` [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs Elijah Newren
` (4 subsequent siblings)
6 siblings, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2025-02-27 18:29 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren
Once an object is written into a cruft pack, we can only freshen it by
writing a new loose or packed copy of that object with a more recent
mtime.
Prior to 61568efa95 (builtin/pack-objects.c: support `--max-pack-size`
with `--cruft`, 2023-08-28), we typically had at most one cruft pack in
a repository at any given time. So freshening unreachable objects was
straightforward when already rewriting the cruft pack (and its *.mtimes
file).
But 61568efa95 changes things: 'pack-objects' now supports writing
multiple cruft packs when invoked with `--cruft` and the
`--max-pack-size` flag. Cruft packs are rewritten until they reach some
size threshold, at which point they are considered "frozen", and will
only be modified in a pruning GC, or if the threshold itself is
adjusted.
However, this process breaks down when we attempt to freshen an object
packed in an earlier cruft pack that is larger than the threshold and
thus will survive the repack.
When this is the case, it is impossible to freshen objects in cruft
pack(s) which are larger than the threshold. This is because we avoid
writing them in the new cruft pack entirely, for a couple of reasons.
1. When enumerating packed objects via 'add_objects_in_unpacked_packs()'
we pass the SKIP_IN_CORE_KEPT_PACKS, which is used to avoid looping
over the packs we're going to retain (which are marked as kept
in-core by 'read_cruft_objects()').
This means that we will avoid enumerating additional packed copies
of objects found in any cruft packs which are larger than the given
size threshold. Thus there is no opportunity to call
'create_object_entry()' whatsoever.
2. We likewise will discard the loose copy (if one exists) of any
unreachable object packed in a cruft pack that is larger than the
threshold. Here our call path is 'add_unreachable_loose_objects()',
which uses the 'add_loose_object()' callback.
That function will eventually land us in 'want_object_in_pack()'
(via 'add_cruft_object_entry()'), and we'll discard the object as it
appears in one of the packs which we marked as kept in-core.
This means in effect that it is impossible to freshen an unreachable
object once it appears in a cruft pack larger than the given threshold.
Instead, we should pack an additional copy of an unreachable object we
want to freshen even if it appears in a cruft pack, provided that the
cruft copy has an mtime which is before the mtime of the copy we are
trying to pack/freshen. This is sub-optimal in the sense that it
requires keeping an additional copy of unreachable objects upon
freshening, but we don't have a better alternative without the ability
to make in-place modifications to existing *.mtimes files.
In order to implement this, we have to adjust the behavior of
'want_found_object()'. When 'pack-objects' is told that we're *not*
going to retain any cruft packs (i.e. the set of packs marked as kept
in-core does not contain a cruft pack), the behavior is unchanged.
But when there *is* at least one cruft pack that we're holding onto, it
is no longer sufficient to reject a copy of an object found in that
cruft pack for that reason alone. In this case, we only want to reject a
candidate object when copies of that object either:
- exists in a non-cruft pack that we are retaining, regardless of that
pack's mtime, or
- exists in a cruft pack with an mtime more recent than the copy we are
debating whether or not to pack, in which case freshening would be
redundant.
To do this, keep track of whether or not we have any cruft packs in our
in-core kept list with a new 'ignore_packed_keep_in_core_has_cruft'
flag. When we end up in this new special case, we replace a call to
'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only
reject objects when we have a copy in an existing cruft pack with a more
recent mtime (in which case "freshening" would be redundant).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 118 ++++++++++++++++++++++++++++++++++------
packfile.c | 3 +-
packfile.h | 2 +
t/t7704-repack-cruft.sh | 64 ++++++++++++++++++++++
4 files changed, 169 insertions(+), 18 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 58a9b161262..79e1e6fb52b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -206,6 +206,7 @@ static int have_non_local_packs;
static int incremental;
static int ignore_packed_keep_on_disk;
static int ignore_packed_keep_in_core;
+static int ignore_packed_keep_in_core_has_cruft;
static int allow_ofs_delta;
static struct pack_idx_option pack_idx_opts;
static const char *base_name;
@@ -1502,8 +1503,60 @@ static int have_duplicate_entry(const struct object_id *oid,
return 1;
}
+static int want_cruft_object_mtime(struct repository *r,
+ const struct object_id *oid,
+ unsigned flags, uint32_t mtime)
+{
+ struct packed_git **cache;
+
+ for (cache = kept_pack_cache(r, flags); *cache; cache++) {
+ struct packed_git *p = *cache;
+ off_t ofs;
+ uint32_t candidate_mtime;
+
+ ofs = find_pack_entry_one(oid, p);
+ if (!ofs)
+ continue;
+
+ /*
+ * We have a copy of the object 'oid' in a non-cruft
+ * pack. We can avoid packing an additional copy
+ * regardless of what the existing copy's mtime is since
+ * it is outside of a cruft pack.
+ */
+ if (!p->is_cruft)
+ return 0;
+
+ /*
+ * If we have a copy of the object 'oid' in a cruft
+ * pack, then either read the cruft pack's mtime for
+ * that object, or, if that can't be loaded, assume the
+ * pack's mtime itself.
+ */
+ if (!load_pack_mtimes(p)) {
+ uint32_t pos;
+ if (offset_to_pack_pos(p, ofs, &pos) < 0)
+ continue;
+ candidate_mtime = nth_packed_mtime(p, pos);
+ } else {
+ candidate_mtime = p->mtime;
+ }
+
+ /*
+ * We have a surviving copy of the object in a cruft
+ * pack whose mtime is greater than or equal to the one
+ * we are considering. We can thus avoid packing an
+ * additional copy of that object.
+ */
+ if (mtime <= candidate_mtime)
+ return 0;
+ }
+
+ return -1;
+}
+
static int want_found_object(const struct object_id *oid, int exclude,
- struct packed_git *p)
+ struct packed_git *p, uint32_t mtime)
{
if (exclude)
return 1;
@@ -1553,12 +1606,29 @@ static int want_found_object(const struct object_id *oid, int exclude,
if (ignore_packed_keep_in_core)
flags |= IN_CORE_KEEP_PACKS;
- if (ignore_packed_keep_on_disk && p->pack_keep)
- return 0;
- if (ignore_packed_keep_in_core && p->pack_keep_in_core)
- return 0;
- if (has_object_kept_pack(p->repo, oid, flags))
- return 0;
+ /*
+ * If the object is in a pack that we want to ignore, *and* we
+ * don't have any cruft packs that are being retained, we can
+ * abort quickly.
+ */
+ if (!ignore_packed_keep_in_core_has_cruft) {
+ if (ignore_packed_keep_on_disk && p->pack_keep)
+ return 0;
+ if (ignore_packed_keep_in_core && p->pack_keep_in_core)
+ return 0;
+ if (has_object_kept_pack(p->repo, oid, flags))
+ return 0;
+ } else {
+ /*
+ * But if there is at least one cruft pack which
+ * is being kept, we only want to include the
+ * provided object if it has a strictly greater
+ * mtime than any existing cruft copy.
+ */
+ if (!want_cruft_object_mtime(p->repo, oid, flags,
+ mtime))
+ return 0;
+ }
}
/*
@@ -1577,7 +1647,8 @@ static int want_object_in_pack_one(struct packed_git *p,
const struct object_id *oid,
int exclude,
struct packed_git **found_pack,
- off_t *found_offset)
+ off_t *found_offset,
+ uint32_t found_mtime)
{
off_t offset;
@@ -1593,7 +1664,7 @@ static int want_object_in_pack_one(struct packed_git *p,
*found_offset = offset;
*found_pack = p;
}
- return want_found_object(oid, exclude, p);
+ return want_found_object(oid, exclude, p, found_mtime);
}
return -1;
}
@@ -1607,10 +1678,11 @@ static int want_object_in_pack_one(struct packed_git *p,
* function finds if there is any pack that has the object and returns the pack
* and its offset in these variables.
*/
-static int want_object_in_pack(const struct object_id *oid,
- int exclude,
- struct packed_git **found_pack,
- off_t *found_offset)
+static int want_object_in_pack_mtime(const struct object_id *oid,
+ int exclude,
+ struct packed_git **found_pack,
+ off_t *found_offset,
+ uint32_t found_mtime)
{
int want;
struct list_head *pos;
@@ -1625,7 +1697,8 @@ static int want_object_in_pack(const struct object_id *oid,
* are present we will determine the answer right now.
*/
if (*found_pack) {
- want = want_found_object(oid, exclude, *found_pack);
+ want = want_found_object(oid, exclude, *found_pack,
+ found_mtime);
if (want != -1)
return want;
@@ -1636,7 +1709,7 @@ static int want_object_in_pack(const struct object_id *oid,
for (m = get_multi_pack_index(the_repository); m; m = m->next) {
struct pack_entry e;
if (fill_midx_entry(the_repository, oid, &e, m)) {
- want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset);
+ want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset, found_mtime);
if (want != -1)
return want;
}
@@ -1644,7 +1717,7 @@ static int want_object_in_pack(const struct object_id *oid,
list_for_each(pos, get_packed_git_mru(the_repository)) {
struct packed_git *p = list_entry(pos, struct packed_git, mru);
- want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset);
+ want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset, found_mtime);
if (!exclude && want > 0)
list_move(&p->mru,
get_packed_git_mru(the_repository));
@@ -1674,6 +1747,15 @@ static int want_object_in_pack(const struct object_id *oid,
return 1;
}
+static inline int want_object_in_pack(const struct object_id *oid,
+ int exclude,
+ struct packed_git **found_pack,
+ off_t *found_offset)
+{
+ return want_object_in_pack_mtime(oid, exclude, found_pack, found_offset,
+ 0);
+}
+
static struct object_entry *create_object_entry(const struct object_id *oid,
enum object_type type,
uint32_t hash,
@@ -3606,7 +3688,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
entry->no_try_delta = no_try_delta(name);
}
} else {
- if (!want_object_in_pack(oid, 0, &pack, &offset))
+ if (!want_object_in_pack_mtime(oid, 0, &pack, &offset, mtime))
return;
if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
/*
@@ -3680,6 +3762,8 @@ static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
struct packed_git *p = item->util;
if (!p)
die(_("could not find pack '%s'"), item->string);
+ if (p->is_cruft && keep)
+ ignore_packed_keep_in_core_has_cruft = 1;
p->pack_keep_in_core = keep;
}
}
diff --git a/packfile.c b/packfile.c
index 2d80d80cb38..9d09f8bc726 100644
--- a/packfile.c
+++ b/packfile.c
@@ -24,6 +24,7 @@
#include "commit-graph.h"
#include "pack-revindex.h"
#include "promisor-remote.h"
+#include "pack-mtimes.h"
char *odb_pack_name(struct repository *r, struct strbuf *buf,
const unsigned char *hash, const char *ext)
@@ -2107,7 +2108,7 @@ static void maybe_invalidate_kept_pack_cache(struct repository *r,
r->objects->kept_pack_cache.flags = 0;
}
-static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
+struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
{
maybe_invalidate_kept_pack_cache(r, flags);
diff --git a/packfile.h b/packfile.h
index 00ada7a938f..25097213d06 100644
--- a/packfile.h
+++ b/packfile.h
@@ -197,6 +197,8 @@ int has_object_pack(struct repository *r, const struct object_id *oid);
int has_object_kept_pack(struct repository *r, const struct object_id *oid,
unsigned flags);
+struct packed_git **kept_pack_cache(struct repository *r, unsigned flags);
+
/*
* Return 1 if an object in a promisor packfile is or refers to the given
* object, 0 otherwise.
diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index 5a76b541ddd..2b20d66333c 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -262,6 +262,70 @@ test_expect_success '--max-cruft-size with freshened objects (packed)' '
)
'
+test_expect_success '--max-cruft-size with freshened objects (previously cruft)' '
+ git init max-cruft-size-threshold &&
+ (
+ cd max-cruft-size-threshold &&
+
+ test_commit base &&
+ foo="$(generate_random_blob foo $((2*1024*1024)))" &&
+ bar="$(generate_random_blob bar $((2*1024*1024)))" &&
+ baz="$(generate_random_blob baz $((2*1024*1024)))" &&
+
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$foo")" >foo.old &&
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$bar")" >bar.old &&
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$baz")" >baz.old &&
+
+ git repack --cruft -d &&
+
+ # Make a packed copy of object $foo with a more recent
+ # mtime.
+ foo="$(generate_random_blob foo $((2*1024*1024)))" &&
+ foo_pack="$(echo "$foo" | git pack-objects $packdir/pack)" &&
+ test-tool chmtime --get -100 \
+ "$packdir/pack-$foo_pack.pack" >foo.new &&
+ git prune-packed &&
+
+ # Make a loose copy of object $bar with a more recent
+ # mtime.
+ bar="$(generate_random_blob bar $((2*1024*1024)))" &&
+ test-tool chmtime --get -100 \
+ "$objdir/$(test_oid_to_path "$bar")" >bar.new &&
+
+ # Make a new cruft object $quux to ensure we do not
+ # generate an identical pack to the existing cruft
+ # pack.
+ quux="$(generate_random_blob quux $((1024)))" &&
+ test-tool chmtime --get -100 \
+ "$objdir/$(test_oid_to_path "$quux")" >quux.new &&
+
+ git repack --cruft --max-cruft-size=3M -d &&
+
+ for p in $packdir/pack-*.mtimes
+ do
+ test-tool pack-mtimes "$(basename "$p")" || return 1
+ done >actual.raw &&
+ sort actual.raw >actual &&
+
+ # Among the set of all cruft packs, we should see both
+ # mtimes for object $foo and $bar, as well as the
+ # single new copy of $baz.
+ sort >expect <<-EOF &&
+ $foo $(cat foo.old)
+ $foo $(cat foo.new)
+ $bar $(cat bar.old)
+ $bar $(cat bar.new)
+ $baz $(cat baz.old)
+ $quux $(cat quux.new)
+ EOF
+
+ test_cmp expect actual
+ )
+'
+
test_expect_success '--max-cruft-size with pruning' '
git init max-cruft-size-prune &&
(
--
2.49.0.rc0.2.gc0c926adde2
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH 1/2] builtin/repack.c: simplify cruft pack aggregation
2025-02-27 18:29 ` [PATCH 1/2] builtin/repack.c: simplify cruft pack aggregation Taylor Blau
@ 2025-02-27 19:23 ` Elijah Newren
2025-02-27 22:53 ` Taylor Blau
2025-02-28 7:52 ` Patrick Steinhardt
1 sibling, 1 reply; 54+ messages in thread
From: Elijah Newren @ 2025-02-27 19:23 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Junio C Hamano, Jeff King
On Thu, Feb 27, 2025 at 10:29 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> In 37dc6d8104 (builtin/repack.c: implement support for
> `--max-cruft-size`, 2023-10-02), 'git repack' built on support for
> multiple cruft packs in Git by instructing 'git pack-objects --cruft'
> how to aggregate smaller cruft packs up to the provided threshold.
>
> The implementation in 37dc6d8104 worked something like the following
> pseudo-code:
>
> total_size = 0;
>
> for (p in cruft packs) {
> if (p->pack_size + total_size < max_size) {
> total_size += p->pack_size;
> collapse(p)
> } else {
> retain(p);
> }
> }
>
> The original idea behind this approach was that smaller cruft packs
> would get combined together until the sum of their sizes was no larger
> than the given max pack size.
>
> There is a much simpler way to achieve this, however, which is to simply
> combine *all* cruft packs which are smaller than the threshold,
> regardless of what their sum is. With '--max-pack-size', 'pack-objects'
> will split out the resulting pack into individual pack(s) if necessary
> to ensure that the written pack(s) are each no larger than the provided
> threshold.
That doesn't really "achieve this" though, unless the antecedent of
"this" isn't what was described in the previous paragraph but
something elsewhere. I suspect your actual meaning was something
along the lines of "There is a much simpler way to combine cruft
packs, however, which..." ?
> This yields a slight behavior change, which is reflected in the removed
> test. Previous to this change, we would aggregate smaller cruft packs
> first, whereas now we will opportunistically combine as many cruft packs
> as possible. As as result, that test is no longer relevant, and can be
> deleted.
I like the idea, since it sounds like it should be simpler...
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> builtin/repack.c | 38 ++-----------------------------------
> t/t7704-repack-cruft.sh | 42 -----------------------------------------
> 2 files changed, 2 insertions(+), 78 deletions(-)
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 75e3752353a..4d83d40f39f 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -1022,29 +1022,13 @@ static int write_filtered_pack(const struct pack_objects_args *args,
> return finish_pack_objects_cmd(&cmd, names, local);
> }
>
> -static int existing_cruft_pack_cmp(const void *va, const void *vb)
> -{
> - struct packed_git *a = *(struct packed_git **)va;
> - struct packed_git *b = *(struct packed_git **)vb;
> -
> - if (a->pack_size < b->pack_size)
> - return -1;
> - if (a->pack_size > b->pack_size)
> - return 1;
> - return 0;
> -}
> -
> static void collapse_small_cruft_packs(FILE *in, size_t max_size,
> struct existing_packs *existing)
> {
> - struct packed_git **existing_cruft, *p;
> + struct packed_git *p;
> struct strbuf buf = STRBUF_INIT;
> - size_t total_size = 0;
> - size_t existing_cruft_nr = 0;
> size_t i;
>
> - ALLOC_ARRAY(existing_cruft, existing->cruft_packs.nr);
> -
> for (p = get_all_packs(the_repository); p; p = p->next) {
> if (!(p->is_cruft && p->pack_local))
> continue;
> @@ -1056,24 +1040,7 @@ static void collapse_small_cruft_packs(FILE *in, size_t max_size,
> if (!string_list_has_string(&existing->cruft_packs, buf.buf))
> continue;
>
> - if (existing_cruft_nr >= existing->cruft_packs.nr)
> - BUG("too many cruft packs (found %"PRIuMAX", but knew "
> - "of %"PRIuMAX")",
> - (uintmax_t)existing_cruft_nr + 1,
> - (uintmax_t)existing->cruft_packs.nr);
> - existing_cruft[existing_cruft_nr++] = p;
> - }
> -
> - QSORT(existing_cruft, existing_cruft_nr, existing_cruft_pack_cmp);
> -
> - for (i = 0; i < existing_cruft_nr; i++) {
> - size_t proposed;
> -
> - p = existing_cruft[i];
> - proposed = st_add(total_size, p->pack_size);
> -
> - if (proposed <= max_size) {
> - total_size = proposed;
> + if (p->pack_size < max_size) {
Look at all that deleted code. Always nice to see a simplification in
action. :-)
> fprintf(in, "-%s\n", pack_basename(p));
> } else {
> retain_cruft_pack(existing, p);
> @@ -1086,7 +1053,6 @@ static void collapse_small_cruft_packs(FILE *in, size_t max_size,
> existing->non_kept_packs.items[i].string);
>
> strbuf_release(&buf);
> - free(existing_cruft);
> }
>
> static int write_cruft_pack(const struct pack_objects_args *args,
> diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
> index 959e6e26488..5a76b541ddd 100755
> --- a/t/t7704-repack-cruft.sh
> +++ b/t/t7704-repack-cruft.sh
> @@ -194,48 +194,6 @@ test_expect_success '--max-cruft-size combines existing packs when below thresho
> )
> '
>
> -test_expect_success '--max-cruft-size combines smaller packs first' '
> - git init max-cruft-size-consume-small &&
> - (
> - cd max-cruft-size-consume-small &&
> -
> - test_commit base &&
> - git repack -ad &&
> -
> - cruft_foo="$(generate_cruft_pack foo 524288)" && # 0.5 MiB
> - cruft_bar="$(generate_cruft_pack bar 524288)" && # 0.5 MiB
> - cruft_baz="$(generate_cruft_pack baz 1048576)" && # 1.0 MiB
> - cruft_quux="$(generate_cruft_pack quux 1572864)" && # 1.5 MiB
> -
> - test-tool pack-mtimes "$(basename $cruft_foo)" >expect.raw &&
> - test-tool pack-mtimes "$(basename $cruft_bar)" >>expect.raw &&
> - sort expect.raw >expect.objects &&
> -
> - # repacking with `--max-cruft-size=2M` should combine
> - # both 0.5 MiB packs together, instead of, say, one of
> - # the 0.5 MiB packs with the 1.0 MiB pack
> - ls $packdir/pack-*.mtimes | sort >cruft.before &&
> - git repack -d --cruft --max-cruft-size=2M &&
> - ls $packdir/pack-*.mtimes | sort >cruft.after &&
> -
> - comm -13 cruft.before cruft.after >cruft.new &&
> - comm -23 cruft.before cruft.after >cruft.removed &&
> -
> - test_line_count = 1 cruft.new &&
> - test_line_count = 2 cruft.removed &&
> -
> - # the two smaller packs should be rolled up first
> - printf "%s\n" $cruft_foo $cruft_bar | sort >expect.removed &&
> - test_cmp expect.removed cruft.removed &&
> -
> - # ...and contain the set of objects rolled up
> - test-tool pack-mtimes "$(basename $(cat cruft.new))" >actual.raw &&
> - sort actual.raw >actual.objects &&
> -
> - test_cmp expect.objects actual.objects
> - )
> -'
> -
> test_expect_success 'setup --max-cruft-size with freshened objects' '
> git init max-cruft-size-freshen &&
> (
> --
> 2.49.0.rc0.2.gc0c926adde2
Looks good to me, other than the misleading/ambiguous wording in that
one paragraph of the commit message.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] builtin/pack-objects.c: freshen objects from existing cruft packs
2025-02-27 18:29 ` [PATCH 2/2] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
@ 2025-02-27 19:26 ` Elijah Newren
2025-02-27 23:03 ` Taylor Blau
0 siblings, 1 reply; 54+ messages in thread
From: Elijah Newren @ 2025-02-27 19:26 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Junio C Hamano, Jeff King
On Thu, Feb 27, 2025 at 10:29 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> Once an object is written into a cruft pack, we can only freshen it by
> writing a new loose or packed copy of that object with a more recent
> mtime.
Thanks for adding this. (Side note for others reading: I reviewed an
earlier round before it was posted to the list).
> Prior to 61568efa95 (builtin/pack-objects.c: support `--max-pack-size`
> with `--cruft`, 2023-08-28), we typically had at most one cruft pack in
> a repository at any given time. So freshening unreachable objects was
> straightforward when already rewriting the cruft pack (and its *.mtimes
> file).
>
> But 61568efa95 changes things: 'pack-objects' now supports writing
> multiple cruft packs when invoked with `--cruft` and the
> `--max-pack-size` flag. Cruft packs are rewritten until they reach some
> size threshold, at which point they are considered "frozen", and will
> only be modified in a pruning GC, or if the threshold itself is
> adjusted.
>
> However, this process breaks down when we attempt to freshen an object
> packed in an earlier cruft pack that is larger than the threshold and
> thus will survive the repack.
...packed in an earlier cruft pack, and that cruft pack is larger than
the threshold...
(Otherwise, it's unclear whether you are talking about the object or
the cruft pack it is in being larger than the threshold.)
> When this is the case, it is impossible to freshen objects in cruft
> pack(s) which are larger than the threshold. This is because we avoid
> writing them in the new cruft pack entirely, for a couple of reasons.
...freshen objects in cruft packs when those cruft packs are larger
than the threshold...
Again, just to clarify what thing is "larger".
Also, this paragraph while fine on its own is slightly unclear whether
you are discussing pre-patch or post-patch state, which when reading
the next two items causes some double takes. Perhaps just spell it
out slightly clearer here that for the next two enumerated items you
are discussing the existing state previous to your changes?
> 1. When enumerating packed objects via 'add_objects_in_unpacked_packs()'
> we pass the SKIP_IN_CORE_KEPT_PACKS, which is used to avoid looping
> over the packs we're going to retain (which are marked as kept
> in-core by 'read_cruft_objects()').
>
> This means that we will avoid enumerating additional packed copies
> of objects found in any cruft packs which are larger than the given
> size threshold. Thus there is no opportunity to call
> 'create_object_entry()' whatsoever.
>
> 2. We likewise will discard the loose copy (if one exists) of any
> unreachable object packed in a cruft pack that is larger than the
> threshold. Here our call path is 'add_unreachable_loose_objects()',
> which uses the 'add_loose_object()' callback.
>
> That function will eventually land us in 'want_object_in_pack()'
> (via 'add_cruft_object_entry()'), and we'll discard the object as it
> appears in one of the packs which we marked as kept in-core.
>
> This means in effect that it is impossible to freshen an unreachable
> object once it appears in a cruft pack larger than the given threshold.
>
> Instead,
...it does become clear here that the above was about the previous
state rather than the new state after the patches, but that's a while
to leave the possible confusion dangling.
> we should pack an additional copy of an unreachable object we
> want to freshen even if it appears in a cruft pack, provided that the
> cruft copy has an mtime which is before the mtime of the copy we are
> trying to pack/freshen. This is sub-optimal in the sense that it
> requires keeping an additional copy of unreachable objects upon
> freshening, but we don't have a better alternative without the ability
> to make in-place modifications to existing *.mtimes files.
>
> In order to implement this, we have to adjust the behavior of
> 'want_found_object()'. When 'pack-objects' is told that we're *not*
> going to retain any cruft packs (i.e. the set of packs marked as kept
> in-core does not contain a cruft pack), the behavior is unchanged.
>
> But when there *is* at least one cruft pack that we're holding onto, it
> is no longer sufficient to reject a copy of an object found in that
> cruft pack for that reason alone. In this case, we only want to reject a
> candidate object when copies of that object either:
>
> - exists in a non-cruft pack that we are retaining, regardless of that
> pack's mtime, or
>
> - exists in a cruft pack with an mtime more recent than the copy we are
> debating whether or not to pack, in which case freshening would be
> redundant.
s/more recent than/at least as recent as/ ?
>
> To do this, keep track of whether or not we have any cruft packs in our
> in-core kept list with a new 'ignore_packed_keep_in_core_has_cruft'
> flag. When we end up in this new special case, we replace a call to
> 'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only
> reject objects when we have a copy in an existing cruft pack with a more
> recent mtime (in which case "freshening" would be redundant).
Again, s/a more recent/at least as recent/ ?
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> builtin/pack-objects.c | 118 ++++++++++++++++++++++++++++++++++------
> packfile.c | 3 +-
> packfile.h | 2 +
> t/t7704-repack-cruft.sh | 64 ++++++++++++++++++++++
> 4 files changed, 169 insertions(+), 18 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 58a9b161262..79e1e6fb52b 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -206,6 +206,7 @@ static int have_non_local_packs;
> static int incremental;
> static int ignore_packed_keep_on_disk;
> static int ignore_packed_keep_in_core;
> +static int ignore_packed_keep_in_core_has_cruft;
> static int allow_ofs_delta;
> static struct pack_idx_option pack_idx_opts;
> static const char *base_name;
> @@ -1502,8 +1503,60 @@ static int have_duplicate_entry(const struct object_id *oid,
> return 1;
> }
>
> +static int want_cruft_object_mtime(struct repository *r,
> + const struct object_id *oid,
> + unsigned flags, uint32_t mtime)
> +{
> + struct packed_git **cache;
> +
> + for (cache = kept_pack_cache(r, flags); *cache; cache++) {
> + struct packed_git *p = *cache;
> + off_t ofs;
> + uint32_t candidate_mtime;
> +
> + ofs = find_pack_entry_one(oid, p);
> + if (!ofs)
> + continue;
> +
> + /*
> + * We have a copy of the object 'oid' in a non-cruft
> + * pack. We can avoid packing an additional copy
> + * regardless of what the existing copy's mtime is since
> + * it is outside of a cruft pack.
> + */
> + if (!p->is_cruft)
> + return 0;
> +
> + /*
> + * If we have a copy of the object 'oid' in a cruft
> + * pack, then either read the cruft pack's mtime for
> + * that object, or, if that can't be loaded, assume the
> + * pack's mtime itself.
> + */
> + if (!load_pack_mtimes(p)) {
> + uint32_t pos;
> + if (offset_to_pack_pos(p, ofs, &pos) < 0)
> + continue;
> + candidate_mtime = nth_packed_mtime(p, pos);
> + } else {
> + candidate_mtime = p->mtime;
> + }
> +
> + /*
> + * We have a surviving copy of the object in a cruft
> + * pack whose mtime is greater than or equal to the one
> + * we are considering. We can thus avoid packing an
> + * additional copy of that object.
> + */
> + if (mtime <= candidate_mtime)
> + return 0;
Here is where you have the logic for avoiding packing an additional
copy when the mtimes are equal, leading to my above wording
suggestions for the commit message.
> + }
> +
> + return -1;
> +}
Good comments making it clear what is going on in this function.
> +
> static int want_found_object(const struct object_id *oid, int exclude,
> - struct packed_git *p)
> + struct packed_git *p, uint32_t mtime)
> {
> if (exclude)
> return 1;
> @@ -1553,12 +1606,29 @@ static int want_found_object(const struct object_id *oid, int exclude,
> if (ignore_packed_keep_in_core)
> flags |= IN_CORE_KEEP_PACKS;
>
> - if (ignore_packed_keep_on_disk && p->pack_keep)
> - return 0;
> - if (ignore_packed_keep_in_core && p->pack_keep_in_core)
> - return 0;
> - if (has_object_kept_pack(p->repo, oid, flags))
> - return 0;
> + /*
> + * If the object is in a pack that we want to ignore, *and* we
> + * don't have any cruft packs that are being retained, we can
> + * abort quickly.
> + */
> + if (!ignore_packed_keep_in_core_has_cruft) {
> + if (ignore_packed_keep_on_disk && p->pack_keep)
> + return 0;
> + if (ignore_packed_keep_in_core && p->pack_keep_in_core)
> + return 0;
> + if (has_object_kept_pack(p->repo, oid, flags))
> + return 0;
> + } else {
> + /*
> + * But if there is at least one cruft pack which
> + * is being kept, we only want to include the
> + * provided object if it has a strictly greater
> + * mtime than any existing cruft copy.
> + */
> + if (!want_cruft_object_mtime(p->repo, oid, flags,
> + mtime))
> + return 0;
> + }
> }
Here's the core high-level logic behind this change.
>
> /*
> @@ -1577,7 +1647,8 @@ static int want_object_in_pack_one(struct packed_git *p,
> const struct object_id *oid,
> int exclude,
> struct packed_git **found_pack,
> - off_t *found_offset)
> + off_t *found_offset,
> + uint32_t found_mtime)
> {
> off_t offset;
>
> @@ -1593,7 +1664,7 @@ static int want_object_in_pack_one(struct packed_git *p,
> *found_offset = offset;
> *found_pack = p;
> }
> - return want_found_object(oid, exclude, p);
> + return want_found_object(oid, exclude, p, found_mtime);
> }
> return -1;
> }
> @@ -1607,10 +1678,11 @@ static int want_object_in_pack_one(struct packed_git *p,
> * function finds if there is any pack that has the object and returns the pack
> * and its offset in these variables.
> */
> -static int want_object_in_pack(const struct object_id *oid,
> - int exclude,
> - struct packed_git **found_pack,
> - off_t *found_offset)
> +static int want_object_in_pack_mtime(const struct object_id *oid,
> + int exclude,
> + struct packed_git **found_pack,
> + off_t *found_offset,
> + uint32_t found_mtime)
> {
> int want;
> struct list_head *pos;
> @@ -1625,7 +1697,8 @@ static int want_object_in_pack(const struct object_id *oid,
> * are present we will determine the answer right now.
> */
> if (*found_pack) {
> - want = want_found_object(oid, exclude, *found_pack);
> + want = want_found_object(oid, exclude, *found_pack,
> + found_mtime);
> if (want != -1)
> return want;
>
> @@ -1636,7 +1709,7 @@ static int want_object_in_pack(const struct object_id *oid,
> for (m = get_multi_pack_index(the_repository); m; m = m->next) {
> struct pack_entry e;
> if (fill_midx_entry(the_repository, oid, &e, m)) {
> - want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset);
> + want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset, found_mtime);
> if (want != -1)
> return want;
> }
> @@ -1644,7 +1717,7 @@ static int want_object_in_pack(const struct object_id *oid,
>
> list_for_each(pos, get_packed_git_mru(the_repository)) {
> struct packed_git *p = list_entry(pos, struct packed_git, mru);
> - want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset);
> + want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset, found_mtime);
> if (!exclude && want > 0)
> list_move(&p->mru,
> get_packed_git_mru(the_repository));
> @@ -1674,6 +1747,15 @@ static int want_object_in_pack(const struct object_id *oid,
> return 1;
> }
>
> +static inline int want_object_in_pack(const struct object_id *oid,
> + int exclude,
> + struct packed_git **found_pack,
> + off_t *found_offset)
> +{
> + return want_object_in_pack_mtime(oid, exclude, found_pack, found_offset,
> + 0);
> +}
> +
> static struct object_entry *create_object_entry(const struct object_id *oid,
> enum object_type type,
> uint32_t hash,
> @@ -3606,7 +3688,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
> entry->no_try_delta = no_try_delta(name);
> }
> } else {
> - if (!want_object_in_pack(oid, 0, &pack, &offset))
> + if (!want_object_in_pack_mtime(oid, 0, &pack, &offset, mtime))
Stuff since my last comment was plumbing the extra parameter through
the right places; makes sense.
> return;
> if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
> /*
> @@ -3680,6 +3762,8 @@ static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
> struct packed_git *p = item->util;
> if (!p)
> die(_("could not find pack '%s'"), item->string);
> + if (p->is_cruft && keep)
> + ignore_packed_keep_in_core_has_cruft = 1;
And here you set the necessary flag for enabling it all.
> p->pack_keep_in_core = keep;
> }
> }
> diff --git a/packfile.c b/packfile.c
> index 2d80d80cb38..9d09f8bc726 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -24,6 +24,7 @@
> #include "commit-graph.h"
> #include "pack-revindex.h"
> #include "promisor-remote.h"
> +#include "pack-mtimes.h"
>
> char *odb_pack_name(struct repository *r, struct strbuf *buf,
> const unsigned char *hash, const char *ext)
> @@ -2107,7 +2108,7 @@ static void maybe_invalidate_kept_pack_cache(struct repository *r,
> r->objects->kept_pack_cache.flags = 0;
> }
>
> -static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
> +struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
> {
> maybe_invalidate_kept_pack_cache(r, flags);
>
> diff --git a/packfile.h b/packfile.h
> index 00ada7a938f..25097213d06 100644
> --- a/packfile.h
> +++ b/packfile.h
> @@ -197,6 +197,8 @@ int has_object_pack(struct repository *r, const struct object_id *oid);
> int has_object_kept_pack(struct repository *r, const struct object_id *oid,
> unsigned flags);
>
> +struct packed_git **kept_pack_cache(struct repository *r, unsigned flags);
> +
> /*
> * Return 1 if an object in a promisor packfile is or refers to the given
> * object, 0 otherwise.
You need access to one more function to do your work; makes sense.
> diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
> index 5a76b541ddd..2b20d66333c 100755
> --- a/t/t7704-repack-cruft.sh
> +++ b/t/t7704-repack-cruft.sh
> @@ -262,6 +262,70 @@ test_expect_success '--max-cruft-size with freshened objects (packed)' '
> )
> '
>
> +test_expect_success '--max-cruft-size with freshened objects (previously cruft)' '
> + git init max-cruft-size-threshold &&
> + (
> + cd max-cruft-size-threshold &&
> +
> + test_commit base &&
> + foo="$(generate_random_blob foo $((2*1024*1024)))" &&
> + bar="$(generate_random_blob bar $((2*1024*1024)))" &&
> + baz="$(generate_random_blob baz $((2*1024*1024)))" &&
> +
> + test-tool chmtime --get -100000 \
> + "$objdir/$(test_oid_to_path "$foo")" >foo.old &&
> + test-tool chmtime --get -100000 \
> + "$objdir/$(test_oid_to_path "$bar")" >bar.old &&
> + test-tool chmtime --get -100000 \
> + "$objdir/$(test_oid_to_path "$baz")" >baz.old &&
> +
> + git repack --cruft -d &&
> +
> + # Make a packed copy of object $foo with a more recent
> + # mtime.
s/$foo/foo/ ?
> + foo="$(generate_random_blob foo $((2*1024*1024)))" &&
I thought this was creating a completely different foo, which would
defeat the point of the test. It might be worth adding a comment that
because generate_random_blob uses a very simplistic and repeatable
random character generator with the first argument as the seed, that
this will regenerate the same loose object as above for foo.
> + foo_pack="$(echo "$foo" | git pack-objects $packdir/pack)" &&
> + test-tool chmtime --get -100 \
> + "$packdir/pack-$foo_pack.pack" >foo.new &&
> + git prune-packed &&
> +
> + # Make a loose copy of object $bar with a more recent
> + # mtime.
s/$bar/bar/ ?
> + bar="$(generate_random_blob bar $((2*1024*1024)))" &&
Likewise, this isn't a different bar than above because it uses the
same seed, "bar" that the previous one used.
> + test-tool chmtime --get -100 \
> + "$objdir/$(test_oid_to_path "$bar")" >bar.new &&
> +
> + # Make a new cruft object $quux to ensure we do not
> + # generate an identical pack to the existing cruft
> + # pack.
> + quux="$(generate_random_blob quux $((1024)))" &&
> + test-tool chmtime --get -100 \
> + "$objdir/$(test_oid_to_path "$quux")" >quux.new &&
> +
> + git repack --cruft --max-cruft-size=3M -d &&
> +
> + for p in $packdir/pack-*.mtimes
> + do
> + test-tool pack-mtimes "$(basename "$p")" || return 1
> + done >actual.raw &&
> + sort actual.raw >actual &&
> +
> + # Among the set of all cruft packs, we should see both
> + # mtimes for object $foo and $bar, as well as the
> + # single new copy of $baz.
> + sort >expect <<-EOF &&
> + $foo $(cat foo.old)
> + $foo $(cat foo.new)
> + $bar $(cat bar.old)
> + $bar $(cat bar.new)
> + $baz $(cat baz.old)
> + $quux $(cat quux.new)
> + EOF
Makes sense, but only once you understand that the second foo and bar
objects really were the same objects as the first foo and bar objects.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs
2025-02-27 18:29 [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs Taylor Blau
2025-02-27 18:29 ` [PATCH 1/2] builtin/repack.c: simplify cruft pack aggregation Taylor Blau
2025-02-27 18:29 ` [PATCH 2/2] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
@ 2025-02-27 19:28 ` Elijah Newren
2025-02-27 23:05 ` Taylor Blau
2025-03-04 21:35 ` [PATCH v2 " Taylor Blau
` (3 subsequent siblings)
6 siblings, 1 reply; 54+ messages in thread
From: Elijah Newren @ 2025-02-27 19:28 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Junio C Hamano, Jeff King
On Thu, Feb 27, 2025 at 10:29 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> This short series contains a fix for a bug I noticed while rolling out
> multi-cruft packs (via 'git repack --max-cruft-size') within GitHub's
> infrastructure.
>
> The series is structured as follows:
>
> - The first patch simplifies how 'repack' aggregates cruft packs
> together when their size is below the '--max-cruft-size' or
> '--max-pack-size' threshold. This simplification changes behavior
> slightly, but not in a meaningful way. It occurred to me while
> writing the second patch.
>
> - The second patch describes and fixes the main bug. The gist here is
> that objects which are (a) unreachable, (b) exist in a cruft pack
> being retained, and (c) were freshened to have a more recent mtime
> than any existing cruft copy are unable to be freshened.
>
> The fix pursued in the second patch changes the rules around when we
> want to retain an object via builtin/pack-objects.c::want_found_object()
> when at least one cruft pack will survive the repack.
>
> Previously the rule was to discard any object which appears in any
> surviving pack, regardless of mtime. The rule now is to only discard an
> object if it appears in either (a) a non-cruft pack which will survive
> the repack, or (b) a cruft pack whose mtime for that object is older
> than the one we are trying to pack.
I think in (b) you got the meaning reversed, and instead mean s/older
than/at least as new as/ ?
> I think that this is the right behavior, but admittedly putting this
> series together hurt my brain trying to think through all of the cases.
> I'm fairly confident in the testing here as I remember it being fairly
> exhaustive of all interesting cases. But I'd appreciate a sanity check
> from others that they too are convinced this is the right approach.
>
> Thanks in advance for your review!
>
> Taylor Blau (2):
> builtin/repack.c: simplify cruft pack aggregation
> builtin/pack-objects.c: freshen objects from existing cruft packs
>
> builtin/pack-objects.c | 118 ++++++++++++++++++++++++++++++++++------
> builtin/repack.c | 38 +------------
> packfile.c | 3 +-
> packfile.h | 2 +
> t/t7704-repack-cruft.sh | 106 ++++++++++++++++++++++--------------
> 5 files changed, 171 insertions(+), 96 deletions(-)
Code changes look good to me, but I had some wording suggestions in a
few places for commit messages and comments. (Sorry for missing some
of those in my preliminary review before you sent this series to the
list.)
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 1/2] builtin/repack.c: simplify cruft pack aggregation
2025-02-27 19:23 ` Elijah Newren
@ 2025-02-27 22:53 ` Taylor Blau
0 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-02-27 22:53 UTC (permalink / raw)
To: Elijah Newren; +Cc: git, Junio C Hamano, Jeff King
On Thu, Feb 27, 2025 at 11:23:02AM -0800, Elijah Newren wrote:
> > The original idea behind this approach was that smaller cruft packs
> > would get combined together until the sum of their sizes was no larger
> > than the given max pack size.
> >
> > There is a much simpler way to achieve this, however, which is to simply
> > combine *all* cruft packs which are smaller than the threshold,
> > regardless of what their sum is. With '--max-pack-size', 'pack-objects'
> > will split out the resulting pack into individual pack(s) if necessary
> > to ensure that the written pack(s) are each no larger than the provided
> > threshold.
>
> That doesn't really "achieve this" though, unless the antecedent of
> "this" isn't what was described in the previous paragraph but
> something elsewhere. I suspect your actual meaning was something
> along the lines of "There is a much simpler way to combine cruft
> packs, however, which..." ?
Great suggestion, thanks. I swapped out "achieve this" for your
recommendation.
> > This yields a slight behavior change, which is reflected in the removed
> > test. Previous to this change, we would aggregate smaller cruft packs
> > first, whereas now we will opportunistically combine as many cruft packs
> > as possible. As as result, that test is no longer relevant, and can be
> > deleted.
>
> I like the idea, since it sounds like it should be simpler...
Heh. I don't know why I wrote it the way it was originally. I wrote the
second patch in this series first, and when I was trying to explain how
multi-cruft pack aggregation works I paused and then wrote what is now
the first patch.
Hindsight is 20/20, I suppose ;-).
Thanks,
Taylor
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 2/2] builtin/pack-objects.c: freshen objects from existing cruft packs
2025-02-27 19:26 ` Elijah Newren
@ 2025-02-27 23:03 ` Taylor Blau
0 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-02-27 23:03 UTC (permalink / raw)
To: Elijah Newren; +Cc: git, Junio C Hamano, Jeff King
On Thu, Feb 27, 2025 at 11:26:32AM -0800, Elijah Newren wrote:
> > However, this process breaks down when we attempt to freshen an object
> > packed in an earlier cruft pack that is larger than the threshold and
> > thus will survive the repack.
>
> ...packed in an earlier cruft pack, and that cruft pack is larger than
> the threshold...
>
> (Otherwise, it's unclear whether you are talking about the object or
> the cruft pack it is in being larger than the threshold.)
Good suggestion, thanks!
> > When this is the case, it is impossible to freshen objects in cruft
> > pack(s) which are larger than the threshold. This is because we avoid
> > writing them in the new cruft pack entirely, for a couple of reasons.
>
> ...freshen objects in cruft packs when those cruft packs are larger
> than the threshold...
>
> Again, just to clarify what thing is "larger".
Likewise, this makes sense as well, and I applied it in my local copy.
> Also, this paragraph while fine on its own is slightly unclear whether
> you are discussing pre-patch or post-patch state, which when reading
> the next two items causes some double takes. Perhaps just spell it
> out slightly clearer here that for the next two enumerated items you
> are discussing the existing state previous to your changes?
I adjusted the paragraph before this one to make it a little clearer.
Instead of saying "However, [...]", I rewrote it as "Prior to this
patch, however, [...]".
> > - exists in a non-cruft pack that we are retaining, regardless of that
> > pack's mtime, or
> >
> > - exists in a cruft pack with an mtime more recent than the copy we are
> > debating whether or not to pack, in which case freshening would be
> > redundant.
>
> s/more recent than/at least as recent as/ ?
Thanks for the careful read, and yes, the comparison here is a >= rather
than a strict >, and that difference is worth being precise about.
> >
> > To do this, keep track of whether or not we have any cruft packs in our
> > in-core kept list with a new 'ignore_packed_keep_in_core_has_cruft'
> > flag. When we end up in this new special case, we replace a call to
> > 'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only
> > reject objects when we have a copy in an existing cruft pack with a more
> > recent mtime (in which case "freshening" would be redundant).
>
> Again, s/a more recent/at least as recent/ ?
I like this suggestion, but I think the wording ends up a little awkward
if applied as-is. I turned this sentence into:
[...], and only reject objects when we have a copy in an existing
cruft pack with at least as recent an mtime as our candidate (in which
case "freshening" would be redundant).
Let me know what you think!
> > +test_expect_success '--max-cruft-size with freshened objects (previously cruft)' '
> > + git init max-cruft-size-threshold &&
> > + (
> > + cd max-cruft-size-threshold &&
> > +
> > + test_commit base &&
> > + foo="$(generate_random_blob foo $((2*1024*1024)))" &&
> > + bar="$(generate_random_blob bar $((2*1024*1024)))" &&
> > + baz="$(generate_random_blob baz $((2*1024*1024)))" &&
> > +
> > + test-tool chmtime --get -100000 \
> > + "$objdir/$(test_oid_to_path "$foo")" >foo.old &&
> > + test-tool chmtime --get -100000 \
> > + "$objdir/$(test_oid_to_path "$bar")" >bar.old &&
> > + test-tool chmtime --get -100000 \
> > + "$objdir/$(test_oid_to_path "$baz")" >baz.old &&
> > +
> > + git repack --cruft -d &&
> > +
> > + # Make a packed copy of object $foo with a more recent
> > + # mtime.
>
> s/$foo/foo/ ?
Eh. $foo holds the OID of that blob, so "foo" on its own doesn't really
mean anything (even though the implicit meaning is clear from context).
I think changing it is fine (leaving it alone is equally fine in my
mind, but I don't feel strongly about it).
> > + foo="$(generate_random_blob foo $((2*1024*1024)))" &&
>
> I thought this was creating a completely different foo, which would
> defeat the point of the test. It might be worth adding a comment that
> because generate_random_blob uses a very simplistic and repeatable
> random character generator with the first argument as the seed, that
> this will regenerate the same loose object as above for foo.
I think the part of the comment which reads "packed copy of" makes it
clear-ish that we're creating an identical copy, but it doesn't hurt to
be more explicit here.
Thanks for the careful read!
Thanks,
Taylor
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs
2025-02-27 19:28 ` [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs Elijah Newren
@ 2025-02-27 23:05 ` Taylor Blau
0 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-02-27 23:05 UTC (permalink / raw)
To: Elijah Newren; +Cc: git, Junio C Hamano, Jeff King
On Thu, Feb 27, 2025 at 11:28:18AM -0800, Elijah Newren wrote:
> > Previously the rule was to discard any object which appears in any
> > surviving pack, regardless of mtime. The rule now is to only discard an
> > object if it appears in either (a) a non-cruft pack which will survive
> > the repack, or (b) a cruft pack whose mtime for that object is older
> > than the one we are trying to pack.
>
> I think in (b) you got the meaning reversed, and instead mean s/older
> than/at least as new as/ ?
Oops, you are definitely right, that is a clear typo on my part.
> Code changes look good to me, but I had some wording suggestions in a
> few places for commit messages and comments. (Sorry for missing some
> of those in my preliminary review before you sent this series to the
> list.)
No problem. I had only shared them before sending them here for a sanity
check on the implementation, so I was figuring that I'd get some
comments here on both the patch message and contents.
I adjusted the wording of both patches slightly to reflect your review.
I'll hold off on sending another round for a day or so in case any other
reviewers want to chime in. Thanks again for the careful review!
Thanks,
Taylor
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 1/2] builtin/repack.c: simplify cruft pack aggregation
2025-02-27 18:29 ` [PATCH 1/2] builtin/repack.c: simplify cruft pack aggregation Taylor Blau
2025-02-27 19:23 ` Elijah Newren
@ 2025-02-28 7:52 ` Patrick Steinhardt
2025-03-04 21:52 ` Elijah Newren
2025-03-05 0:09 ` Taylor Blau
1 sibling, 2 replies; 54+ messages in thread
From: Patrick Steinhardt @ 2025-02-28 7:52 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Junio C Hamano, Jeff King, Elijah Newren
On Thu, Feb 27, 2025 at 01:29:28PM -0500, Taylor Blau wrote:
> In 37dc6d8104 (builtin/repack.c: implement support for
> `--max-cruft-size`, 2023-10-02), 'git repack' built on support for
> multiple cruft packs in Git by instructing 'git pack-objects --cruft'
> how to aggregate smaller cruft packs up to the provided threshold.
>
> The implementation in 37dc6d8104 worked something like the following
> pseudo-code:
>
> total_size = 0;
>
> for (p in cruft packs) {
> if (p->pack_size + total_size < max_size) {
> total_size += p->pack_size;
> collapse(p)
> } else {
> retain(p);
> }
> }
>
> The original idea behind this approach was that smaller cruft packs
> would get combined together until the sum of their sizes was no larger
> than the given max pack size.
>
> There is a much simpler way to achieve this, however, which is to simply
> combine *all* cruft packs which are smaller than the threshold,
> regardless of what their sum is. With '--max-pack-size', 'pack-objects'
> will split out the resulting pack into individual pack(s) if necessary
> to ensure that the written pack(s) are each no larger than the provided
> threshold.
Hm. So the result would be a new set of packfiles where each of them is
smaller than the threshold, right? Wouldn't that mean that the next time
we'll again do the same thing and try to combine the new set of cruft
packs into one, and basically never arrive at a state where we don't
touch the cruft packs anymore?
Patrick
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v2 0/2] pack-objects: freshen objects with multi-cruft packs
2025-02-27 18:29 [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs Taylor Blau
` (2 preceding siblings ...)
2025-02-27 19:28 ` [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs Elijah Newren
@ 2025-03-04 21:35 ` Taylor Blau
2025-03-04 21:35 ` [PATCH v2 1/2] builtin/repack.c: simplify cruft pack aggregation Taylor Blau
` (2 more replies)
2025-03-05 0:15 ` [PATCH v3 0/1] " Taylor Blau
` (2 subsequent siblings)
6 siblings, 3 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-04 21:35 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren
This is a minor reroll of my series to fix a bug in freshening objects
with multiple cruft packs.
Since last time the only updates to the series are textual changes to
the commit message, so everything functionally should be working as it
was before.
As usual, there is a range-diff showing as much below, and the original
cover letter is as follows:
---
This short series contains a fix for a bug I noticed while rolling out
multi-cruft packs (via 'git repack --max-cruft-size') within GitHub's
infrastructure.
The series is structured as follows:
- The first patch simplifies how 'repack' aggregates cruft packs
together when their size is below the '--max-cruft-size' or
'--max-pack-size' threshold. This simplification changes behavior
slightly, but not in a meaningful way. It occurred to me while
writing the second patch.
- The second patch describes and fixes the main bug. The gist here is
that objects which are (a) unreachable, (b) exist in a cruft pack
being retained, and (c) were freshened to have a more recent mtime
than any existing cruft copy are unable to be freshened.
The fix pursued in the second patch changes the rules around when we
want to retain an object via builtin/pack-objects.c::want_found_object()
when at least one cruft pack will survive the repack.
Previously the rule was to discard any object which appears in any
surviving pack, regardless of mtime. The rule now is to only discard an
object if it appears in either (a) a non-cruft pack which will survive
the repack, or (b) a cruft pack whose mtime for that object is older
than the one we are trying to pack.
I think that this is the right behavior, but admittedly putting this
series together hurt my brain trying to think through all of the cases.
I'm fairly confident in the testing here as I remember it being fairly
exhaustive of all interesting cases. But I'd appreciate a sanity check
from others that they too are convinced this is the right approach.
Thanks in advance for your review!
Taylor Blau (2):
builtin/repack.c: simplify cruft pack aggregation
builtin/pack-objects.c: freshen objects from existing cruft packs
builtin/pack-objects.c | 118 ++++++++++++++++++++++++++++++++++------
builtin/repack.c | 38 +------------
packfile.c | 3 +-
packfile.h | 2 +
t/t7704-repack-cruft.sh | 105 +++++++++++++++++++++--------------
5 files changed, 170 insertions(+), 96 deletions(-)
Range-diff against v1:
1: 8564f982597 ! 1: 63ea9d4d00e builtin/repack.c: simplify cruft pack aggregation
@@ Commit message
would get combined together until the sum of their sizes was no larger
than the given max pack size.
- There is a much simpler way to achieve this, however, which is to simply
- combine *all* cruft packs which are smaller than the threshold,
+ There is a much simpler way to combine cruft packs, however, which is to
+ simply combine *all* cruft packs which are smaller than the threshold,
regardless of what their sum is. With '--max-pack-size', 'pack-objects'
will split out the resulting pack into individual pack(s) if necessary
to ensure that the written pack(s) are each no larger than the provided
2: c0c926adde2 ! 2: 7ba9054701b builtin/pack-objects.c: freshen objects from existing cruft packs
@@ Commit message
only be modified in a pruning GC, or if the threshold itself is
adjusted.
- However, this process breaks down when we attempt to freshen an object
- packed in an earlier cruft pack that is larger than the threshold and
- thus will survive the repack.
+ Prior to this patch, however, this process breaks down when we attempt
+ to freshen an object packed in an earlier cruft pack, and that cruft
+ pack is larger than the threshold and thus will survive the repack.
When this is the case, it is impossible to freshen objects in cruft
- pack(s) which are larger than the threshold. This is because we avoid
- writing them in the new cruft pack entirely, for a couple of reasons.
+ pack(s) when those cruft packs are larger than the threshold. This is
+ because we would avoid writing them in the new cruft pack entirely, for
+ a couple of reasons.
1. When enumerating packed objects via 'add_objects_in_unpacked_packs()'
we pass the SKIP_IN_CORE_KEPT_PACKS, which is used to avoid looping
@@ Commit message
- exists in a non-cruft pack that we are retaining, regardless of that
pack's mtime, or
- - exists in a cruft pack with an mtime more recent than the copy we are
- debating whether or not to pack, in which case freshening would be
- redundant.
+ - exists in a cruft pack with an mtime at least as recent as the copy
+ we are debating whether or not to pack, in which case freshening
+ would be redundant.
To do this, keep track of whether or not we have any cruft packs in our
in-core kept list with a new 'ignore_packed_keep_in_core_has_cruft'
flag. When we end up in this new special case, we replace a call to
- 'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only
- reject objects when we have a copy in an existing cruft pack with a more
- recent mtime (in which case "freshening" would be redundant).
+ 'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only reject
+ objects when we have a copy in an existing cruft pack with at least as
+ recent an mtime as our candidate (in which case "freshening" would be
+ redundant).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
@@ t/t7704-repack-cruft.sh: test_expect_success '--max-cruft-size with freshened ob
+
+ git repack --cruft -d &&
+
-+ # Make a packed copy of object $foo with a more recent
-+ # mtime.
++ # Make an identical copy of foo stored in a pack with a more
++ # recent mtime.
+ foo="$(generate_random_blob foo $((2*1024*1024)))" &&
+ foo_pack="$(echo "$foo" | git pack-objects $packdir/pack)" &&
+ test-tool chmtime --get -100 \
+ "$packdir/pack-$foo_pack.pack" >foo.new &&
+ git prune-packed &&
+
-+ # Make a loose copy of object $bar with a more recent
-+ # mtime.
++ # Make a loose copy of bar, also with a more recent mtime.
+ bar="$(generate_random_blob bar $((2*1024*1024)))" &&
+ test-tool chmtime --get -100 \
+ "$objdir/$(test_oid_to_path "$bar")" >bar.new &&
base-commit: 08bdfd453584e489d5a551aecbdcb77584e1b958
--
2.49.0.rc0.57.gdb91954e186.dirty
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v2 1/2] builtin/repack.c: simplify cruft pack aggregation
2025-03-04 21:35 ` [PATCH v2 " Taylor Blau
@ 2025-03-04 21:35 ` Taylor Blau
2025-03-04 21:35 ` [PATCH v2 2/2] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
2025-03-04 22:55 ` [PATCH v2 0/2] pack-objects: freshen objects with multi-cruft packs Elijah Newren
2 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-04 21:35 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren
In 37dc6d8104 (builtin/repack.c: implement support for
`--max-cruft-size`, 2023-10-02), 'git repack' built on support for
multiple cruft packs in Git by instructing 'git pack-objects --cruft'
how to aggregate smaller cruft packs up to the provided threshold.
The implementation in 37dc6d8104 worked something like the following
pseudo-code:
total_size = 0;
for (p in cruft packs) {
if (p->pack_size + total_size < max_size) {
total_size += p->pack_size;
collapse(p)
} else {
retain(p);
}
}
The original idea behind this approach was that smaller cruft packs
would get combined together until the sum of their sizes was no larger
than the given max pack size.
There is a much simpler way to combine cruft packs, however, which is to
simply combine *all* cruft packs which are smaller than the threshold,
regardless of what their sum is. With '--max-pack-size', 'pack-objects'
will split out the resulting pack into individual pack(s) if necessary
to ensure that the written pack(s) are each no larger than the provided
threshold.
This yields a slight behavior change, which is reflected in the removed
test. Previous to this change, we would aggregate smaller cruft packs
first, whereas now we will opportunistically combine as many cruft packs
as possible. As as result, that test is no longer relevant, and can be
deleted.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/repack.c | 38 ++-----------------------------------
t/t7704-repack-cruft.sh | 42 -----------------------------------------
2 files changed, 2 insertions(+), 78 deletions(-)
diff --git a/builtin/repack.c b/builtin/repack.c
index 75e3752353a..4d83d40f39f 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -1022,29 +1022,13 @@ static int write_filtered_pack(const struct pack_objects_args *args,
return finish_pack_objects_cmd(&cmd, names, local);
}
-static int existing_cruft_pack_cmp(const void *va, const void *vb)
-{
- struct packed_git *a = *(struct packed_git **)va;
- struct packed_git *b = *(struct packed_git **)vb;
-
- if (a->pack_size < b->pack_size)
- return -1;
- if (a->pack_size > b->pack_size)
- return 1;
- return 0;
-}
-
static void collapse_small_cruft_packs(FILE *in, size_t max_size,
struct existing_packs *existing)
{
- struct packed_git **existing_cruft, *p;
+ struct packed_git *p;
struct strbuf buf = STRBUF_INIT;
- size_t total_size = 0;
- size_t existing_cruft_nr = 0;
size_t i;
- ALLOC_ARRAY(existing_cruft, existing->cruft_packs.nr);
-
for (p = get_all_packs(the_repository); p; p = p->next) {
if (!(p->is_cruft && p->pack_local))
continue;
@@ -1056,24 +1040,7 @@ static void collapse_small_cruft_packs(FILE *in, size_t max_size,
if (!string_list_has_string(&existing->cruft_packs, buf.buf))
continue;
- if (existing_cruft_nr >= existing->cruft_packs.nr)
- BUG("too many cruft packs (found %"PRIuMAX", but knew "
- "of %"PRIuMAX")",
- (uintmax_t)existing_cruft_nr + 1,
- (uintmax_t)existing->cruft_packs.nr);
- existing_cruft[existing_cruft_nr++] = p;
- }
-
- QSORT(existing_cruft, existing_cruft_nr, existing_cruft_pack_cmp);
-
- for (i = 0; i < existing_cruft_nr; i++) {
- size_t proposed;
-
- p = existing_cruft[i];
- proposed = st_add(total_size, p->pack_size);
-
- if (proposed <= max_size) {
- total_size = proposed;
+ if (p->pack_size < max_size) {
fprintf(in, "-%s\n", pack_basename(p));
} else {
retain_cruft_pack(existing, p);
@@ -1086,7 +1053,6 @@ static void collapse_small_cruft_packs(FILE *in, size_t max_size,
existing->non_kept_packs.items[i].string);
strbuf_release(&buf);
- free(existing_cruft);
}
static int write_cruft_pack(const struct pack_objects_args *args,
diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index 959e6e26488..5a76b541ddd 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -194,48 +194,6 @@ test_expect_success '--max-cruft-size combines existing packs when below thresho
)
'
-test_expect_success '--max-cruft-size combines smaller packs first' '
- git init max-cruft-size-consume-small &&
- (
- cd max-cruft-size-consume-small &&
-
- test_commit base &&
- git repack -ad &&
-
- cruft_foo="$(generate_cruft_pack foo 524288)" && # 0.5 MiB
- cruft_bar="$(generate_cruft_pack bar 524288)" && # 0.5 MiB
- cruft_baz="$(generate_cruft_pack baz 1048576)" && # 1.0 MiB
- cruft_quux="$(generate_cruft_pack quux 1572864)" && # 1.5 MiB
-
- test-tool pack-mtimes "$(basename $cruft_foo)" >expect.raw &&
- test-tool pack-mtimes "$(basename $cruft_bar)" >>expect.raw &&
- sort expect.raw >expect.objects &&
-
- # repacking with `--max-cruft-size=2M` should combine
- # both 0.5 MiB packs together, instead of, say, one of
- # the 0.5 MiB packs with the 1.0 MiB pack
- ls $packdir/pack-*.mtimes | sort >cruft.before &&
- git repack -d --cruft --max-cruft-size=2M &&
- ls $packdir/pack-*.mtimes | sort >cruft.after &&
-
- comm -13 cruft.before cruft.after >cruft.new &&
- comm -23 cruft.before cruft.after >cruft.removed &&
-
- test_line_count = 1 cruft.new &&
- test_line_count = 2 cruft.removed &&
-
- # the two smaller packs should be rolled up first
- printf "%s\n" $cruft_foo $cruft_bar | sort >expect.removed &&
- test_cmp expect.removed cruft.removed &&
-
- # ...and contain the set of objects rolled up
- test-tool pack-mtimes "$(basename $(cat cruft.new))" >actual.raw &&
- sort actual.raw >actual.objects &&
-
- test_cmp expect.objects actual.objects
- )
-'
-
test_expect_success 'setup --max-cruft-size with freshened objects' '
git init max-cruft-size-freshen &&
(
--
2.49.0.rc0.57.gdb91954e186.dirty
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH v2 2/2] builtin/pack-objects.c: freshen objects from existing cruft packs
2025-03-04 21:35 ` [PATCH v2 " Taylor Blau
2025-03-04 21:35 ` [PATCH v2 1/2] builtin/repack.c: simplify cruft pack aggregation Taylor Blau
@ 2025-03-04 21:35 ` Taylor Blau
2025-03-04 22:55 ` [PATCH v2 0/2] pack-objects: freshen objects with multi-cruft packs Elijah Newren
2 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-04 21:35 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren
Once an object is written into a cruft pack, we can only freshen it by
writing a new loose or packed copy of that object with a more recent
mtime.
Prior to 61568efa95 (builtin/pack-objects.c: support `--max-pack-size`
with `--cruft`, 2023-08-28), we typically had at most one cruft pack in
a repository at any given time. So freshening unreachable objects was
straightforward when already rewriting the cruft pack (and its *.mtimes
file).
But 61568efa95 changes things: 'pack-objects' now supports writing
multiple cruft packs when invoked with `--cruft` and the
`--max-pack-size` flag. Cruft packs are rewritten until they reach some
size threshold, at which point they are considered "frozen", and will
only be modified in a pruning GC, or if the threshold itself is
adjusted.
Prior to this patch, however, this process breaks down when we attempt
to freshen an object packed in an earlier cruft pack, and that cruft
pack is larger than the threshold and thus will survive the repack.
When this is the case, it is impossible to freshen objects in cruft
pack(s) when those cruft packs are larger than the threshold. This is
because we would avoid writing them in the new cruft pack entirely, for
a couple of reasons.
1. When enumerating packed objects via 'add_objects_in_unpacked_packs()'
we pass the SKIP_IN_CORE_KEPT_PACKS, which is used to avoid looping
over the packs we're going to retain (which are marked as kept
in-core by 'read_cruft_objects()').
This means that we will avoid enumerating additional packed copies
of objects found in any cruft packs which are larger than the given
size threshold. Thus there is no opportunity to call
'create_object_entry()' whatsoever.
2. We likewise will discard the loose copy (if one exists) of any
unreachable object packed in a cruft pack that is larger than the
threshold. Here our call path is 'add_unreachable_loose_objects()',
which uses the 'add_loose_object()' callback.
That function will eventually land us in 'want_object_in_pack()'
(via 'add_cruft_object_entry()'), and we'll discard the object as it
appears in one of the packs which we marked as kept in-core.
This means in effect that it is impossible to freshen an unreachable
object once it appears in a cruft pack larger than the given threshold.
Instead, we should pack an additional copy of an unreachable object we
want to freshen even if it appears in a cruft pack, provided that the
cruft copy has an mtime which is before the mtime of the copy we are
trying to pack/freshen. This is sub-optimal in the sense that it
requires keeping an additional copy of unreachable objects upon
freshening, but we don't have a better alternative without the ability
to make in-place modifications to existing *.mtimes files.
In order to implement this, we have to adjust the behavior of
'want_found_object()'. When 'pack-objects' is told that we're *not*
going to retain any cruft packs (i.e. the set of packs marked as kept
in-core does not contain a cruft pack), the behavior is unchanged.
But when there *is* at least one cruft pack that we're holding onto, it
is no longer sufficient to reject a copy of an object found in that
cruft pack for that reason alone. In this case, we only want to reject a
candidate object when copies of that object either:
- exists in a non-cruft pack that we are retaining, regardless of that
pack's mtime, or
- exists in a cruft pack with an mtime at least as recent as the copy
we are debating whether or not to pack, in which case freshening
would be redundant.
To do this, keep track of whether or not we have any cruft packs in our
in-core kept list with a new 'ignore_packed_keep_in_core_has_cruft'
flag. When we end up in this new special case, we replace a call to
'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only reject
objects when we have a copy in an existing cruft pack with at least as
recent an mtime as our candidate (in which case "freshening" would be
redundant).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 118 ++++++++++++++++++++++++++++++++++------
packfile.c | 3 +-
packfile.h | 2 +
t/t7704-repack-cruft.sh | 63 +++++++++++++++++++++
4 files changed, 168 insertions(+), 18 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 58a9b161262..79e1e6fb52b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -206,6 +206,7 @@ static int have_non_local_packs;
static int incremental;
static int ignore_packed_keep_on_disk;
static int ignore_packed_keep_in_core;
+static int ignore_packed_keep_in_core_has_cruft;
static int allow_ofs_delta;
static struct pack_idx_option pack_idx_opts;
static const char *base_name;
@@ -1502,8 +1503,60 @@ static int have_duplicate_entry(const struct object_id *oid,
return 1;
}
+static int want_cruft_object_mtime(struct repository *r,
+ const struct object_id *oid,
+ unsigned flags, uint32_t mtime)
+{
+ struct packed_git **cache;
+
+ for (cache = kept_pack_cache(r, flags); *cache; cache++) {
+ struct packed_git *p = *cache;
+ off_t ofs;
+ uint32_t candidate_mtime;
+
+ ofs = find_pack_entry_one(oid, p);
+ if (!ofs)
+ continue;
+
+ /*
+ * We have a copy of the object 'oid' in a non-cruft
+ * pack. We can avoid packing an additional copy
+ * regardless of what the existing copy's mtime is since
+ * it is outside of a cruft pack.
+ */
+ if (!p->is_cruft)
+ return 0;
+
+ /*
+ * If we have a copy of the object 'oid' in a cruft
+ * pack, then either read the cruft pack's mtime for
+ * that object, or, if that can't be loaded, assume the
+ * pack's mtime itself.
+ */
+ if (!load_pack_mtimes(p)) {
+ uint32_t pos;
+ if (offset_to_pack_pos(p, ofs, &pos) < 0)
+ continue;
+ candidate_mtime = nth_packed_mtime(p, pos);
+ } else {
+ candidate_mtime = p->mtime;
+ }
+
+ /*
+ * We have a surviving copy of the object in a cruft
+ * pack whose mtime is greater than or equal to the one
+ * we are considering. We can thus avoid packing an
+ * additional copy of that object.
+ */
+ if (mtime <= candidate_mtime)
+ return 0;
+ }
+
+ return -1;
+}
+
static int want_found_object(const struct object_id *oid, int exclude,
- struct packed_git *p)
+ struct packed_git *p, uint32_t mtime)
{
if (exclude)
return 1;
@@ -1553,12 +1606,29 @@ static int want_found_object(const struct object_id *oid, int exclude,
if (ignore_packed_keep_in_core)
flags |= IN_CORE_KEEP_PACKS;
- if (ignore_packed_keep_on_disk && p->pack_keep)
- return 0;
- if (ignore_packed_keep_in_core && p->pack_keep_in_core)
- return 0;
- if (has_object_kept_pack(p->repo, oid, flags))
- return 0;
+ /*
+ * If the object is in a pack that we want to ignore, *and* we
+ * don't have any cruft packs that are being retained, we can
+ * abort quickly.
+ */
+ if (!ignore_packed_keep_in_core_has_cruft) {
+ if (ignore_packed_keep_on_disk && p->pack_keep)
+ return 0;
+ if (ignore_packed_keep_in_core && p->pack_keep_in_core)
+ return 0;
+ if (has_object_kept_pack(p->repo, oid, flags))
+ return 0;
+ } else {
+ /*
+ * But if there is at least one cruft pack which
+ * is being kept, we only want to include the
+ * provided object if it has a strictly greater
+ * mtime than any existing cruft copy.
+ */
+ if (!want_cruft_object_mtime(p->repo, oid, flags,
+ mtime))
+ return 0;
+ }
}
/*
@@ -1577,7 +1647,8 @@ static int want_object_in_pack_one(struct packed_git *p,
const struct object_id *oid,
int exclude,
struct packed_git **found_pack,
- off_t *found_offset)
+ off_t *found_offset,
+ uint32_t found_mtime)
{
off_t offset;
@@ -1593,7 +1664,7 @@ static int want_object_in_pack_one(struct packed_git *p,
*found_offset = offset;
*found_pack = p;
}
- return want_found_object(oid, exclude, p);
+ return want_found_object(oid, exclude, p, found_mtime);
}
return -1;
}
@@ -1607,10 +1678,11 @@ static int want_object_in_pack_one(struct packed_git *p,
* function finds if there is any pack that has the object and returns the pack
* and its offset in these variables.
*/
-static int want_object_in_pack(const struct object_id *oid,
- int exclude,
- struct packed_git **found_pack,
- off_t *found_offset)
+static int want_object_in_pack_mtime(const struct object_id *oid,
+ int exclude,
+ struct packed_git **found_pack,
+ off_t *found_offset,
+ uint32_t found_mtime)
{
int want;
struct list_head *pos;
@@ -1625,7 +1697,8 @@ static int want_object_in_pack(const struct object_id *oid,
* are present we will determine the answer right now.
*/
if (*found_pack) {
- want = want_found_object(oid, exclude, *found_pack);
+ want = want_found_object(oid, exclude, *found_pack,
+ found_mtime);
if (want != -1)
return want;
@@ -1636,7 +1709,7 @@ static int want_object_in_pack(const struct object_id *oid,
for (m = get_multi_pack_index(the_repository); m; m = m->next) {
struct pack_entry e;
if (fill_midx_entry(the_repository, oid, &e, m)) {
- want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset);
+ want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset, found_mtime);
if (want != -1)
return want;
}
@@ -1644,7 +1717,7 @@ static int want_object_in_pack(const struct object_id *oid,
list_for_each(pos, get_packed_git_mru(the_repository)) {
struct packed_git *p = list_entry(pos, struct packed_git, mru);
- want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset);
+ want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset, found_mtime);
if (!exclude && want > 0)
list_move(&p->mru,
get_packed_git_mru(the_repository));
@@ -1674,6 +1747,15 @@ static int want_object_in_pack(const struct object_id *oid,
return 1;
}
+static inline int want_object_in_pack(const struct object_id *oid,
+ int exclude,
+ struct packed_git **found_pack,
+ off_t *found_offset)
+{
+ return want_object_in_pack_mtime(oid, exclude, found_pack, found_offset,
+ 0);
+}
+
static struct object_entry *create_object_entry(const struct object_id *oid,
enum object_type type,
uint32_t hash,
@@ -3606,7 +3688,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
entry->no_try_delta = no_try_delta(name);
}
} else {
- if (!want_object_in_pack(oid, 0, &pack, &offset))
+ if (!want_object_in_pack_mtime(oid, 0, &pack, &offset, mtime))
return;
if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
/*
@@ -3680,6 +3762,8 @@ static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
struct packed_git *p = item->util;
if (!p)
die(_("could not find pack '%s'"), item->string);
+ if (p->is_cruft && keep)
+ ignore_packed_keep_in_core_has_cruft = 1;
p->pack_keep_in_core = keep;
}
}
diff --git a/packfile.c b/packfile.c
index 2d80d80cb38..9d09f8bc726 100644
--- a/packfile.c
+++ b/packfile.c
@@ -24,6 +24,7 @@
#include "commit-graph.h"
#include "pack-revindex.h"
#include "promisor-remote.h"
+#include "pack-mtimes.h"
char *odb_pack_name(struct repository *r, struct strbuf *buf,
const unsigned char *hash, const char *ext)
@@ -2107,7 +2108,7 @@ static void maybe_invalidate_kept_pack_cache(struct repository *r,
r->objects->kept_pack_cache.flags = 0;
}
-static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
+struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
{
maybe_invalidate_kept_pack_cache(r, flags);
diff --git a/packfile.h b/packfile.h
index 00ada7a938f..25097213d06 100644
--- a/packfile.h
+++ b/packfile.h
@@ -197,6 +197,8 @@ int has_object_pack(struct repository *r, const struct object_id *oid);
int has_object_kept_pack(struct repository *r, const struct object_id *oid,
unsigned flags);
+struct packed_git **kept_pack_cache(struct repository *r, unsigned flags);
+
/*
* Return 1 if an object in a promisor packfile is or refers to the given
* object, 0 otherwise.
diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index 5a76b541ddd..183cb239447 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -262,6 +262,69 @@ test_expect_success '--max-cruft-size with freshened objects (packed)' '
)
'
+test_expect_success '--max-cruft-size with freshened objects (previously cruft)' '
+ git init max-cruft-size-threshold &&
+ (
+ cd max-cruft-size-threshold &&
+
+ test_commit base &&
+ foo="$(generate_random_blob foo $((2*1024*1024)))" &&
+ bar="$(generate_random_blob bar $((2*1024*1024)))" &&
+ baz="$(generate_random_blob baz $((2*1024*1024)))" &&
+
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$foo")" >foo.old &&
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$bar")" >bar.old &&
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$baz")" >baz.old &&
+
+ git repack --cruft -d &&
+
+ # Make an identical copy of foo stored in a pack with a more
+ # recent mtime.
+ foo="$(generate_random_blob foo $((2*1024*1024)))" &&
+ foo_pack="$(echo "$foo" | git pack-objects $packdir/pack)" &&
+ test-tool chmtime --get -100 \
+ "$packdir/pack-$foo_pack.pack" >foo.new &&
+ git prune-packed &&
+
+ # Make a loose copy of bar, also with a more recent mtime.
+ bar="$(generate_random_blob bar $((2*1024*1024)))" &&
+ test-tool chmtime --get -100 \
+ "$objdir/$(test_oid_to_path "$bar")" >bar.new &&
+
+ # Make a new cruft object $quux to ensure we do not
+ # generate an identical pack to the existing cruft
+ # pack.
+ quux="$(generate_random_blob quux $((1024)))" &&
+ test-tool chmtime --get -100 \
+ "$objdir/$(test_oid_to_path "$quux")" >quux.new &&
+
+ git repack --cruft --max-cruft-size=3M -d &&
+
+ for p in $packdir/pack-*.mtimes
+ do
+ test-tool pack-mtimes "$(basename "$p")" || return 1
+ done >actual.raw &&
+ sort actual.raw >actual &&
+
+ # Among the set of all cruft packs, we should see both
+ # mtimes for object $foo and $bar, as well as the
+ # single new copy of $baz.
+ sort >expect <<-EOF &&
+ $foo $(cat foo.old)
+ $foo $(cat foo.new)
+ $bar $(cat bar.old)
+ $bar $(cat bar.new)
+ $baz $(cat baz.old)
+ $quux $(cat quux.new)
+ EOF
+
+ test_cmp expect actual
+ )
+'
+
test_expect_success '--max-cruft-size with pruning' '
git init max-cruft-size-prune &&
(
--
2.49.0.rc0.57.gdb91954e186.dirty
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH 1/2] builtin/repack.c: simplify cruft pack aggregation
2025-02-28 7:52 ` Patrick Steinhardt
@ 2025-03-04 21:52 ` Elijah Newren
2025-03-05 2:04 ` Junio C Hamano
2025-03-05 0:09 ` Taylor Blau
1 sibling, 1 reply; 54+ messages in thread
From: Elijah Newren @ 2025-03-04 21:52 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: Taylor Blau, git, Junio C Hamano, Jeff King
On Thu, Feb 27, 2025 at 11:52 PM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Thu, Feb 27, 2025 at 01:29:28PM -0500, Taylor Blau wrote:
> > In 37dc6d8104 (builtin/repack.c: implement support for
> > `--max-cruft-size`, 2023-10-02), 'git repack' built on support for
> > multiple cruft packs in Git by instructing 'git pack-objects --cruft'
> > how to aggregate smaller cruft packs up to the provided threshold.
> >
> > The implementation in 37dc6d8104 worked something like the following
> > pseudo-code:
> >
> > total_size = 0;
> >
> > for (p in cruft packs) {
> > if (p->pack_size + total_size < max_size) {
> > total_size += p->pack_size;
> > collapse(p)
> > } else {
> > retain(p);
> > }
> > }
> >
> > The original idea behind this approach was that smaller cruft packs
> > would get combined together until the sum of their sizes was no larger
> > than the given max pack size.
> >
> > There is a much simpler way to achieve this, however, which is to simply
> > combine *all* cruft packs which are smaller than the threshold,
> > regardless of what their sum is. With '--max-pack-size', 'pack-objects'
> > will split out the resulting pack into individual pack(s) if necessary
> > to ensure that the written pack(s) are each no larger than the provided
> > threshold.
>
> Hm. So the result would be a new set of packfiles where each of them is
> smaller than the threshold, right?
Are you assuming there's only one threshold, or that --max-pack-size
== --max-cruft-size?
I read this assuming --max-pack-size >> --max-cruft-size, so the odds
that the N packs smaller than --max-cruft-size add up to more than
--max-pack-size is small -- but even if it does happen, it just
results in the cruft packs being split out into a couple packs.
> Wouldn't that mean that the next time
> we'll again do the same thing and try to combine the new set of cruft
> packs into one, and basically never arrive at a state where we don't
> touch the cruft packs anymore?
This would be a risk if we allow --max-cruft-size to approach or be
equal to --max-pack-size. (And if --max-pack-size is less than
--max-cruft-size, then we'll perversely split into even more cruft
packs rather than combining as intended.)
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 0/2] pack-objects: freshen objects with multi-cruft packs
2025-03-04 21:35 ` [PATCH v2 " Taylor Blau
2025-03-04 21:35 ` [PATCH v2 1/2] builtin/repack.c: simplify cruft pack aggregation Taylor Blau
2025-03-04 21:35 ` [PATCH v2 2/2] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
@ 2025-03-04 22:55 ` Elijah Newren
2025-03-05 0:06 ` Taylor Blau
2 siblings, 1 reply; 54+ messages in thread
From: Elijah Newren @ 2025-03-04 22:55 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Junio C Hamano, Jeff King
On Tue, Mar 4, 2025 at 1:35 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> This is a minor reroll of my series to fix a bug in freshening objects
> with multiple cruft packs.
>
> Since last time the only updates to the series are textual changes to
> the commit message, so everything functionally should be working as it
> was before.
>
> As usual, there is a range-diff showing as much below, and the original
> cover letter is as follows:
>
> ---
>
> This short series contains a fix for a bug I noticed while rolling out
> multi-cruft packs (via 'git repack --max-cruft-size') within GitHub's
> infrastructure.
>
> The series is structured as follows:
>
> - The first patch simplifies how 'repack' aggregates cruft packs
> together when their size is below the '--max-cruft-size' or
> '--max-pack-size' threshold. This simplification changes behavior
> slightly, but not in a meaningful way. It occurred to me while
> writing the second patch.
>
> - The second patch describes and fixes the main bug. The gist here is
> that objects which are (a) unreachable, (b) exist in a cruft pack
> being retained, and (c) were freshened to have a more recent mtime
> than any existing cruft copy are unable to be freshened.
>
> The fix pursued in the second patch changes the rules around when we
> want to retain an object via builtin/pack-objects.c::want_found_object()
> when at least one cruft pack will survive the repack.
>
> Previously the rule was to discard any object which appears in any
> surviving pack, regardless of mtime. The rule now is to only discard an
> object if it appears in either (a) a non-cruft pack which will survive
> the repack, or (b) a cruft pack whose mtime for that object is older
> than the one we are trying to pack.
>
> I think that this is the right behavior, but admittedly putting this
> series together hurt my brain trying to think through all of the cases.
> I'm fairly confident in the testing here as I remember it being fairly
> exhaustive of all interesting cases. But I'd appreciate a sanity check
> from others that they too are convinced this is the right approach.
>
> Thanks in advance for your review!
>
> Taylor Blau (2):
> builtin/repack.c: simplify cruft pack aggregation
> builtin/pack-objects.c: freshen objects from existing cruft packs
>
> builtin/pack-objects.c | 118 ++++++++++++++++++++++++++++++++++------
> builtin/repack.c | 38 +------------
> packfile.c | 3 +-
> packfile.h | 2 +
> t/t7704-repack-cruft.sh | 105 +++++++++++++++++++++--------------
> 5 files changed, 170 insertions(+), 96 deletions(-)
>
> Range-diff against v1:
> 1: 8564f982597 ! 1: 63ea9d4d00e builtin/repack.c: simplify cruft pack aggregation
> @@ Commit message
> would get combined together until the sum of their sizes was no larger
> than the given max pack size.
>
> - There is a much simpler way to achieve this, however, which is to simply
> - combine *all* cruft packs which are smaller than the threshold,
> + There is a much simpler way to combine cruft packs, however, which is to
> + simply combine *all* cruft packs which are smaller than the threshold,
> regardless of what their sum is. With '--max-pack-size', 'pack-objects'
> will split out the resulting pack into individual pack(s) if necessary
> to ensure that the written pack(s) are each no larger than the provided
> 2: c0c926adde2 ! 2: 7ba9054701b builtin/pack-objects.c: freshen objects from existing cruft packs
> @@ Commit message
> only be modified in a pruning GC, or if the threshold itself is
> adjusted.
>
> - However, this process breaks down when we attempt to freshen an object
> - packed in an earlier cruft pack that is larger than the threshold and
> - thus will survive the repack.
> + Prior to this patch, however, this process breaks down when we attempt
> + to freshen an object packed in an earlier cruft pack, and that cruft
> + pack is larger than the threshold and thus will survive the repack.
>
> When this is the case, it is impossible to freshen objects in cruft
> - pack(s) which are larger than the threshold. This is because we avoid
> - writing them in the new cruft pack entirely, for a couple of reasons.
> + pack(s) when those cruft packs are larger than the threshold. This is
> + because we would avoid writing them in the new cruft pack entirely, for
> + a couple of reasons.
>
> 1. When enumerating packed objects via 'add_objects_in_unpacked_packs()'
> we pass the SKIP_IN_CORE_KEPT_PACKS, which is used to avoid looping
> @@ Commit message
> - exists in a non-cruft pack that we are retaining, regardless of that
> pack's mtime, or
>
> - - exists in a cruft pack with an mtime more recent than the copy we are
> - debating whether or not to pack, in which case freshening would be
> - redundant.
> + - exists in a cruft pack with an mtime at least as recent as the copy
> + we are debating whether or not to pack, in which case freshening
> + would be redundant.
>
> To do this, keep track of whether or not we have any cruft packs in our
> in-core kept list with a new 'ignore_packed_keep_in_core_has_cruft'
> flag. When we end up in this new special case, we replace a call to
> - 'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only
> - reject objects when we have a copy in an existing cruft pack with a more
> - recent mtime (in which case "freshening" would be redundant).
> + 'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only reject
> + objects when we have a copy in an existing cruft pack with at least as
> + recent an mtime as our candidate (in which case "freshening" would be
> + redundant).
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
>
> @@ t/t7704-repack-cruft.sh: test_expect_success '--max-cruft-size with freshened ob
> +
> + git repack --cruft -d &&
> +
> -+ # Make a packed copy of object $foo with a more recent
> -+ # mtime.
> ++ # Make an identical copy of foo stored in a pack with a more
> ++ # recent mtime.
> + foo="$(generate_random_blob foo $((2*1024*1024)))" &&
> + foo_pack="$(echo "$foo" | git pack-objects $packdir/pack)" &&
> + test-tool chmtime --get -100 \
> + "$packdir/pack-$foo_pack.pack" >foo.new &&
> + git prune-packed &&
> +
> -+ # Make a loose copy of object $bar with a more recent
> -+ # mtime.
> ++ # Make a loose copy of bar, also with a more recent mtime.
> + bar="$(generate_random_blob bar $((2*1024*1024)))" &&
> + test-tool chmtime --get -100 \
> + "$objdir/$(test_oid_to_path "$bar")" >bar.new &&
>
> base-commit: 08bdfd453584e489d5a551aecbdcb77584e1b958
> --
> 2.49.0.rc0.57.gdb91954e186.dirty
v2 looks good to me; though I'm curious if some wording improvement in
the commit message might help in distinguishing between
--max-cruft-size and --max-pack-size...and whether we want to provide
any checks on the relative sizes of the two.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 0/2] pack-objects: freshen objects with multi-cruft packs
2025-03-04 22:55 ` [PATCH v2 0/2] pack-objects: freshen objects with multi-cruft packs Elijah Newren
@ 2025-03-05 0:06 ` Taylor Blau
2025-03-05 0:13 ` Taylor Blau
0 siblings, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2025-03-05 0:06 UTC (permalink / raw)
To: Elijah Newren; +Cc: git, Junio C Hamano, Jeff King
On Tue, Mar 04, 2025 at 02:55:00PM -0800, Elijah Newren wrote:
> v2 looks good to me; though I'm curious if some wording improvement in
> the commit message might help in distinguishing between
> --max-cruft-size and --max-pack-size...and whether we want to provide
> any checks on the relative sizes of the two.
Thanks for the review! :-)
--max-cruft-size and --max-pack-size are the same thing from
pack-objects' perspective; the two flags exist at the repack level in
case you want to set a different maximum pack size for cruft- and
non-cruft packs.
Both end up as a --max-pack-size value when repack invokes pack-objects
for the cruft and non-cruft case.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 1/2] builtin/repack.c: simplify cruft pack aggregation
2025-02-28 7:52 ` Patrick Steinhardt
2025-03-04 21:52 ` Elijah Newren
@ 2025-03-05 0:09 ` Taylor Blau
1 sibling, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-05 0:09 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: git, Junio C Hamano, Jeff King, Elijah Newren
On Fri, Feb 28, 2025 at 08:52:01AM +0100, Patrick Steinhardt wrote:
> On Thu, Feb 27, 2025 at 01:29:28PM -0500, Taylor Blau wrote:
> > In 37dc6d8104 (builtin/repack.c: implement support for
> > `--max-cruft-size`, 2023-10-02), 'git repack' built on support for
> > multiple cruft packs in Git by instructing 'git pack-objects --cruft'
> > how to aggregate smaller cruft packs up to the provided threshold.
> >
> > The implementation in 37dc6d8104 worked something like the following
> > pseudo-code:
> >
> > total_size = 0;
> >
> > for (p in cruft packs) {
> > if (p->pack_size + total_size < max_size) {
> > total_size += p->pack_size;
> > collapse(p)
> > } else {
> > retain(p);
> > }
> > }
> >
> > The original idea behind this approach was that smaller cruft packs
> > would get combined together until the sum of their sizes was no larger
> > than the given max pack size.
> >
> > There is a much simpler way to achieve this, however, which is to simply
> > combine *all* cruft packs which are smaller than the threshold,
> > regardless of what their sum is. With '--max-pack-size', 'pack-objects'
> > will split out the resulting pack into individual pack(s) if necessary
> > to ensure that the written pack(s) are each no larger than the provided
> > threshold.
>
> Hm. So the result would be a new set of packfiles where each of them is
> smaller than the threshold, right? Wouldn't that mean that the next time
> we'll again do the same thing and try to combine the new set of cruft
> packs into one, and basically never arrive at a state where we don't
> touch the cruft packs anymore?
Ugh. You are most definitely right, now I remember why I wrote the
original series the way I did ;-).
Of course, I read your message after sending a new round. Let's abort
that v2 and I'll send a v3 which just contains the second patch. Thanks
for catching.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v2 0/2] pack-objects: freshen objects with multi-cruft packs
2025-03-05 0:06 ` Taylor Blau
@ 2025-03-05 0:13 ` Taylor Blau
0 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-05 0:13 UTC (permalink / raw)
To: Elijah Newren; +Cc: git, Junio C Hamano, Jeff King
On Tue, Mar 04, 2025 at 07:06:10PM -0500, Taylor Blau wrote:
> --max-cruft-size and --max-pack-size are the same thing from
> pack-objects' perspective; the two flags exist at the repack level in
> case you want to set a different maximum pack size for cruft- and
> non-cruft packs.
>
> Both end up as a --max-pack-size value when repack invokes pack-objects
> for the cruft and non-cruft case.
Of course, I should have read Patrick's and your discussion earlier in
the thread before sending a new round. The first patch should
*definitely* not be queued.
I'll send a clean v3 that Junio can apply if he likes.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v3 0/1] pack-objects: freshen objects with multi-cruft packs
2025-02-27 18:29 [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs Taylor Blau
` (3 preceding siblings ...)
2025-03-04 21:35 ` [PATCH v2 " Taylor Blau
@ 2025-03-05 0:15 ` Taylor Blau
2025-03-05 0:15 ` [PATCH v3 1/1] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
2025-03-06 10:31 ` [PATCH v3 0/1] pack-objects: freshen objects with multi-cruft packs Patrick Steinhardt
2025-03-11 0:21 ` [PATCH v4 0/6] " Taylor Blau
2025-03-13 18:09 ` [PATCH v5] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
6 siblings, 2 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-05 0:15 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren, Patrick Steinhardt
This is a(nother) reroll of my series to fix a bug in freshening objects
with multiple cruft packs.
The only update since last time is that I dropped the first patch, after
Patrick astutely pointed out a flaw with the approach pursued there.
That flaw is why I wrote '--max-cruft-size' in what appeared to be a
convoluted fashion, but I couldn't remember my line of thinking at the
time.
As usual, there is a range-diff showing as much below. Thanks again,
Patrick, for catching what would be a very frustrating issue to deal
with later on ;-).
Taylor Blau (1):
builtin/pack-objects.c: freshen objects from existing cruft packs
builtin/pack-objects.c | 118 ++++++++++++++++++++++++++++++++++------
packfile.c | 3 +-
packfile.h | 2 +
t/t7704-repack-cruft.sh | 63 +++++++++++++++++++++
4 files changed, 168 insertions(+), 18 deletions(-)
Range-diff against v2:
1: 63ea9d4d00e < -: ----------- builtin/repack.c: simplify cruft pack aggregation
2: 7ba9054701b = 1: 6e93471f9a8 builtin/pack-objects.c: freshen objects from existing cruft packs
base-commit: 08bdfd453584e489d5a551aecbdcb77584e1b958
--
2.49.0.rc0.1.g6e93471f9a8
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v3 1/1] builtin/pack-objects.c: freshen objects from existing cruft packs
2025-03-05 0:15 ` [PATCH v3 0/1] " Taylor Blau
@ 2025-03-05 0:15 ` Taylor Blau
2025-03-06 10:31 ` Patrick Steinhardt
2025-03-06 10:31 ` [PATCH v3 0/1] pack-objects: freshen objects with multi-cruft packs Patrick Steinhardt
1 sibling, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2025-03-05 0:15 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren, Patrick Steinhardt
Once an object is written into a cruft pack, we can only freshen it by
writing a new loose or packed copy of that object with a more recent
mtime.
Prior to 61568efa95 (builtin/pack-objects.c: support `--max-pack-size`
with `--cruft`, 2023-08-28), we typically had at most one cruft pack in
a repository at any given time. So freshening unreachable objects was
straightforward when already rewriting the cruft pack (and its *.mtimes
file).
But 61568efa95 changes things: 'pack-objects' now supports writing
multiple cruft packs when invoked with `--cruft` and the
`--max-pack-size` flag. Cruft packs are rewritten until they reach some
size threshold, at which point they are considered "frozen", and will
only be modified in a pruning GC, or if the threshold itself is
adjusted.
Prior to this patch, however, this process breaks down when we attempt
to freshen an object packed in an earlier cruft pack, and that cruft
pack is larger than the threshold and thus will survive the repack.
When this is the case, it is impossible to freshen objects in cruft
pack(s) when those cruft packs are larger than the threshold. This is
because we would avoid writing them in the new cruft pack entirely, for
a couple of reasons.
1. When enumerating packed objects via 'add_objects_in_unpacked_packs()'
we pass the SKIP_IN_CORE_KEPT_PACKS, which is used to avoid looping
over the packs we're going to retain (which are marked as kept
in-core by 'read_cruft_objects()').
This means that we will avoid enumerating additional packed copies
of objects found in any cruft packs which are larger than the given
size threshold. Thus there is no opportunity to call
'create_object_entry()' whatsoever.
2. We likewise will discard the loose copy (if one exists) of any
unreachable object packed in a cruft pack that is larger than the
threshold. Here our call path is 'add_unreachable_loose_objects()',
which uses the 'add_loose_object()' callback.
That function will eventually land us in 'want_object_in_pack()'
(via 'add_cruft_object_entry()'), and we'll discard the object as it
appears in one of the packs which we marked as kept in-core.
This means in effect that it is impossible to freshen an unreachable
object once it appears in a cruft pack larger than the given threshold.
Instead, we should pack an additional copy of an unreachable object we
want to freshen even if it appears in a cruft pack, provided that the
cruft copy has an mtime which is before the mtime of the copy we are
trying to pack/freshen. This is sub-optimal in the sense that it
requires keeping an additional copy of unreachable objects upon
freshening, but we don't have a better alternative without the ability
to make in-place modifications to existing *.mtimes files.
In order to implement this, we have to adjust the behavior of
'want_found_object()'. When 'pack-objects' is told that we're *not*
going to retain any cruft packs (i.e. the set of packs marked as kept
in-core does not contain a cruft pack), the behavior is unchanged.
But when there *is* at least one cruft pack that we're holding onto, it
is no longer sufficient to reject a copy of an object found in that
cruft pack for that reason alone. In this case, we only want to reject a
candidate object when copies of that object either:
- exists in a non-cruft pack that we are retaining, regardless of that
pack's mtime, or
- exists in a cruft pack with an mtime at least as recent as the copy
we are debating whether or not to pack, in which case freshening
would be redundant.
To do this, keep track of whether or not we have any cruft packs in our
in-core kept list with a new 'ignore_packed_keep_in_core_has_cruft'
flag. When we end up in this new special case, we replace a call to
'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only reject
objects when we have a copy in an existing cruft pack with at least as
recent an mtime as our candidate (in which case "freshening" would be
redundant).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 118 ++++++++++++++++++++++++++++++++++------
packfile.c | 3 +-
packfile.h | 2 +
t/t7704-repack-cruft.sh | 63 +++++++++++++++++++++
4 files changed, 168 insertions(+), 18 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 58a9b161262..79e1e6fb52b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -206,6 +206,7 @@ static int have_non_local_packs;
static int incremental;
static int ignore_packed_keep_on_disk;
static int ignore_packed_keep_in_core;
+static int ignore_packed_keep_in_core_has_cruft;
static int allow_ofs_delta;
static struct pack_idx_option pack_idx_opts;
static const char *base_name;
@@ -1502,8 +1503,60 @@ static int have_duplicate_entry(const struct object_id *oid,
return 1;
}
+static int want_cruft_object_mtime(struct repository *r,
+ const struct object_id *oid,
+ unsigned flags, uint32_t mtime)
+{
+ struct packed_git **cache;
+
+ for (cache = kept_pack_cache(r, flags); *cache; cache++) {
+ struct packed_git *p = *cache;
+ off_t ofs;
+ uint32_t candidate_mtime;
+
+ ofs = find_pack_entry_one(oid, p);
+ if (!ofs)
+ continue;
+
+ /*
+ * We have a copy of the object 'oid' in a non-cruft
+ * pack. We can avoid packing an additional copy
+ * regardless of what the existing copy's mtime is since
+ * it is outside of a cruft pack.
+ */
+ if (!p->is_cruft)
+ return 0;
+
+ /*
+ * If we have a copy of the object 'oid' in a cruft
+ * pack, then either read the cruft pack's mtime for
+ * that object, or, if that can't be loaded, assume the
+ * pack's mtime itself.
+ */
+ if (!load_pack_mtimes(p)) {
+ uint32_t pos;
+ if (offset_to_pack_pos(p, ofs, &pos) < 0)
+ continue;
+ candidate_mtime = nth_packed_mtime(p, pos);
+ } else {
+ candidate_mtime = p->mtime;
+ }
+
+ /*
+ * We have a surviving copy of the object in a cruft
+ * pack whose mtime is greater than or equal to the one
+ * we are considering. We can thus avoid packing an
+ * additional copy of that object.
+ */
+ if (mtime <= candidate_mtime)
+ return 0;
+ }
+
+ return -1;
+}
+
static int want_found_object(const struct object_id *oid, int exclude,
- struct packed_git *p)
+ struct packed_git *p, uint32_t mtime)
{
if (exclude)
return 1;
@@ -1553,12 +1606,29 @@ static int want_found_object(const struct object_id *oid, int exclude,
if (ignore_packed_keep_in_core)
flags |= IN_CORE_KEEP_PACKS;
- if (ignore_packed_keep_on_disk && p->pack_keep)
- return 0;
- if (ignore_packed_keep_in_core && p->pack_keep_in_core)
- return 0;
- if (has_object_kept_pack(p->repo, oid, flags))
- return 0;
+ /*
+ * If the object is in a pack that we want to ignore, *and* we
+ * don't have any cruft packs that are being retained, we can
+ * abort quickly.
+ */
+ if (!ignore_packed_keep_in_core_has_cruft) {
+ if (ignore_packed_keep_on_disk && p->pack_keep)
+ return 0;
+ if (ignore_packed_keep_in_core && p->pack_keep_in_core)
+ return 0;
+ if (has_object_kept_pack(p->repo, oid, flags))
+ return 0;
+ } else {
+ /*
+ * But if there is at least one cruft pack which
+ * is being kept, we only want to include the
+ * provided object if it has a strictly greater
+ * mtime than any existing cruft copy.
+ */
+ if (!want_cruft_object_mtime(p->repo, oid, flags,
+ mtime))
+ return 0;
+ }
}
/*
@@ -1577,7 +1647,8 @@ static int want_object_in_pack_one(struct packed_git *p,
const struct object_id *oid,
int exclude,
struct packed_git **found_pack,
- off_t *found_offset)
+ off_t *found_offset,
+ uint32_t found_mtime)
{
off_t offset;
@@ -1593,7 +1664,7 @@ static int want_object_in_pack_one(struct packed_git *p,
*found_offset = offset;
*found_pack = p;
}
- return want_found_object(oid, exclude, p);
+ return want_found_object(oid, exclude, p, found_mtime);
}
return -1;
}
@@ -1607,10 +1678,11 @@ static int want_object_in_pack_one(struct packed_git *p,
* function finds if there is any pack that has the object and returns the pack
* and its offset in these variables.
*/
-static int want_object_in_pack(const struct object_id *oid,
- int exclude,
- struct packed_git **found_pack,
- off_t *found_offset)
+static int want_object_in_pack_mtime(const struct object_id *oid,
+ int exclude,
+ struct packed_git **found_pack,
+ off_t *found_offset,
+ uint32_t found_mtime)
{
int want;
struct list_head *pos;
@@ -1625,7 +1697,8 @@ static int want_object_in_pack(const struct object_id *oid,
* are present we will determine the answer right now.
*/
if (*found_pack) {
- want = want_found_object(oid, exclude, *found_pack);
+ want = want_found_object(oid, exclude, *found_pack,
+ found_mtime);
if (want != -1)
return want;
@@ -1636,7 +1709,7 @@ static int want_object_in_pack(const struct object_id *oid,
for (m = get_multi_pack_index(the_repository); m; m = m->next) {
struct pack_entry e;
if (fill_midx_entry(the_repository, oid, &e, m)) {
- want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset);
+ want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset, found_mtime);
if (want != -1)
return want;
}
@@ -1644,7 +1717,7 @@ static int want_object_in_pack(const struct object_id *oid,
list_for_each(pos, get_packed_git_mru(the_repository)) {
struct packed_git *p = list_entry(pos, struct packed_git, mru);
- want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset);
+ want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset, found_mtime);
if (!exclude && want > 0)
list_move(&p->mru,
get_packed_git_mru(the_repository));
@@ -1674,6 +1747,15 @@ static int want_object_in_pack(const struct object_id *oid,
return 1;
}
+static inline int want_object_in_pack(const struct object_id *oid,
+ int exclude,
+ struct packed_git **found_pack,
+ off_t *found_offset)
+{
+ return want_object_in_pack_mtime(oid, exclude, found_pack, found_offset,
+ 0);
+}
+
static struct object_entry *create_object_entry(const struct object_id *oid,
enum object_type type,
uint32_t hash,
@@ -3606,7 +3688,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
entry->no_try_delta = no_try_delta(name);
}
} else {
- if (!want_object_in_pack(oid, 0, &pack, &offset))
+ if (!want_object_in_pack_mtime(oid, 0, &pack, &offset, mtime))
return;
if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
/*
@@ -3680,6 +3762,8 @@ static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
struct packed_git *p = item->util;
if (!p)
die(_("could not find pack '%s'"), item->string);
+ if (p->is_cruft && keep)
+ ignore_packed_keep_in_core_has_cruft = 1;
p->pack_keep_in_core = keep;
}
}
diff --git a/packfile.c b/packfile.c
index 2d80d80cb38..9d09f8bc726 100644
--- a/packfile.c
+++ b/packfile.c
@@ -24,6 +24,7 @@
#include "commit-graph.h"
#include "pack-revindex.h"
#include "promisor-remote.h"
+#include "pack-mtimes.h"
char *odb_pack_name(struct repository *r, struct strbuf *buf,
const unsigned char *hash, const char *ext)
@@ -2107,7 +2108,7 @@ static void maybe_invalidate_kept_pack_cache(struct repository *r,
r->objects->kept_pack_cache.flags = 0;
}
-static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
+struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
{
maybe_invalidate_kept_pack_cache(r, flags);
diff --git a/packfile.h b/packfile.h
index 00ada7a938f..25097213d06 100644
--- a/packfile.h
+++ b/packfile.h
@@ -197,6 +197,8 @@ int has_object_pack(struct repository *r, const struct object_id *oid);
int has_object_kept_pack(struct repository *r, const struct object_id *oid,
unsigned flags);
+struct packed_git **kept_pack_cache(struct repository *r, unsigned flags);
+
/*
* Return 1 if an object in a promisor packfile is or refers to the given
* object, 0 otherwise.
diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index 959e6e26488..f427150de5b 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -304,6 +304,69 @@ test_expect_success '--max-cruft-size with freshened objects (packed)' '
)
'
+test_expect_success '--max-cruft-size with freshened objects (previously cruft)' '
+ git init max-cruft-size-threshold &&
+ (
+ cd max-cruft-size-threshold &&
+
+ test_commit base &&
+ foo="$(generate_random_blob foo $((2*1024*1024)))" &&
+ bar="$(generate_random_blob bar $((2*1024*1024)))" &&
+ baz="$(generate_random_blob baz $((2*1024*1024)))" &&
+
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$foo")" >foo.old &&
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$bar")" >bar.old &&
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$baz")" >baz.old &&
+
+ git repack --cruft -d &&
+
+ # Make an identical copy of foo stored in a pack with a more
+ # recent mtime.
+ foo="$(generate_random_blob foo $((2*1024*1024)))" &&
+ foo_pack="$(echo "$foo" | git pack-objects $packdir/pack)" &&
+ test-tool chmtime --get -100 \
+ "$packdir/pack-$foo_pack.pack" >foo.new &&
+ git prune-packed &&
+
+ # Make a loose copy of bar, also with a more recent mtime.
+ bar="$(generate_random_blob bar $((2*1024*1024)))" &&
+ test-tool chmtime --get -100 \
+ "$objdir/$(test_oid_to_path "$bar")" >bar.new &&
+
+ # Make a new cruft object $quux to ensure we do not
+ # generate an identical pack to the existing cruft
+ # pack.
+ quux="$(generate_random_blob quux $((1024)))" &&
+ test-tool chmtime --get -100 \
+ "$objdir/$(test_oid_to_path "$quux")" >quux.new &&
+
+ git repack --cruft --max-cruft-size=3M -d &&
+
+ for p in $packdir/pack-*.mtimes
+ do
+ test-tool pack-mtimes "$(basename "$p")" || return 1
+ done >actual.raw &&
+ sort actual.raw >actual &&
+
+ # Among the set of all cruft packs, we should see both
+ # mtimes for object $foo and $bar, as well as the
+ # single new copy of $baz.
+ sort >expect <<-EOF &&
+ $foo $(cat foo.old)
+ $foo $(cat foo.new)
+ $bar $(cat bar.old)
+ $bar $(cat bar.new)
+ $baz $(cat baz.old)
+ $quux $(cat quux.new)
+ EOF
+
+ test_cmp expect actual
+ )
+'
+
test_expect_success '--max-cruft-size with pruning' '
git init max-cruft-size-prune &&
(
--
2.49.0.rc0.1.g6e93471f9a8
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH 1/2] builtin/repack.c: simplify cruft pack aggregation
2025-03-04 21:52 ` Elijah Newren
@ 2025-03-05 2:04 ` Junio C Hamano
0 siblings, 0 replies; 54+ messages in thread
From: Junio C Hamano @ 2025-03-05 2:04 UTC (permalink / raw)
To: Elijah Newren; +Cc: Patrick Steinhardt, Taylor Blau, git, Jeff King
Elijah Newren <newren@gmail.com> writes:
>> Hm. So the result would be a new set of packfiles where each of them is
>> smaller than the threshold, right?
>
> Are you assuming there's only one threshold, or that --max-pack-size
> == --max-cruft-size?
>
> I read this assuming --max-pack-size >> --max-cruft-size, so the odds
> that the N packs smaller than --max-cruft-size add up to more than
> --max-pack-size is small -- but even if it does happen, it just
> results in the cruft packs being split out into a couple packs.
Interesting observation ;-)
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v3 1/1] builtin/pack-objects.c: freshen objects from existing cruft packs
2025-03-05 0:15 ` [PATCH v3 1/1] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
@ 2025-03-06 10:31 ` Patrick Steinhardt
2025-03-13 17:32 ` Taylor Blau
0 siblings, 1 reply; 54+ messages in thread
From: Patrick Steinhardt @ 2025-03-06 10:31 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Junio C Hamano, Jeff King, Elijah Newren
On Tue, Mar 04, 2025 at 07:15:18PM -0500, Taylor Blau wrote:
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 58a9b161262..79e1e6fb52b 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1502,8 +1503,60 @@ static int have_duplicate_entry(const struct object_id *oid,
> return 1;
> }
>
> +static int want_cruft_object_mtime(struct repository *r,
> + const struct object_id *oid,
> + unsigned flags, uint32_t mtime)
> +{
> + struct packed_git **cache;
> +
> + for (cache = kept_pack_cache(r, flags); *cache; cache++) {
> + struct packed_git *p = *cache;
> + off_t ofs;
> + uint32_t candidate_mtime;
> +
> + ofs = find_pack_entry_one(oid, p);
> + if (!ofs)
> + continue;
> +
> + /*
> + * We have a copy of the object 'oid' in a non-cruft
> + * pack. We can avoid packing an additional copy
> + * regardless of what the existing copy's mtime is since
> + * it is outside of a cruft pack.
> + */
> + if (!p->is_cruft)
> + return 0;
> +
> + /*
> + * If we have a copy of the object 'oid' in a cruft
> + * pack, then either read the cruft pack's mtime for
> + * that object, or, if that can't be loaded, assume the
> + * pack's mtime itself.
> + */
> + if (!load_pack_mtimes(p)) {
> + uint32_t pos;
> + if (offset_to_pack_pos(p, ofs, &pos) < 0)
> + continue;
> + candidate_mtime = nth_packed_mtime(p, pos);
> + } else {
> + candidate_mtime = p->mtime;
> + }
> +
> + /*
> + * We have a surviving copy of the object in a cruft
> + * pack whose mtime is greater than or equal to the one
> + * we are considering. We can thus avoid packing an
> + * additional copy of that object.
> + */
> + if (mtime <= candidate_mtime)
> + return 0;
> + }
> +
> + return -1;
> +}
> +
Minor nit: it is a bit unusual that a negative value, which typically
indicates an error, is used as a boolean value here to indicate that we
don't want to have the object.
> diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
> index 959e6e26488..f427150de5b 100755
> --- a/t/t7704-repack-cruft.sh
> +++ b/t/t7704-repack-cruft.sh
> @@ -304,6 +304,69 @@ test_expect_success '--max-cruft-size with freshened objects (packed)' '
> )
> '
>
> +test_expect_success '--max-cruft-size with freshened objects (previously cruft)' '
> + git init max-cruft-size-threshold &&
Let's also delete the repository via `test_when_finished`.
Patrick
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v3 0/1] pack-objects: freshen objects with multi-cruft packs
2025-03-05 0:15 ` [PATCH v3 0/1] " Taylor Blau
2025-03-05 0:15 ` [PATCH v3 1/1] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
@ 2025-03-06 10:31 ` Patrick Steinhardt
1 sibling, 0 replies; 54+ messages in thread
From: Patrick Steinhardt @ 2025-03-06 10:31 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Junio C Hamano, Jeff King, Elijah Newren
On Tue, Mar 04, 2025 at 07:15:14PM -0500, Taylor Blau wrote:
> This is a(nother) reroll of my series to fix a bug in freshening objects
> with multiple cruft packs.
>
> The only update since last time is that I dropped the first patch, after
> Patrick astutely pointed out a flaw with the approach pursued there.
> That flaw is why I wrote '--max-cruft-size' in what appeared to be a
> convoluted fashion, but I couldn't remember my line of thinking at the
> time.
>
> As usual, there is a range-diff showing as much below. Thanks again,
> Patrick, for catching what would be a very frustrating issue to deal
> with later on ;-).
Would it maybe make sense to add a commit that explains _why_ the
existing algorithm looks as it does so that the next person won't be
triggered to remove the code again?
Patrick
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs
2025-02-27 18:29 [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs Taylor Blau
` (4 preceding siblings ...)
2025-03-05 0:15 ` [PATCH v3 0/1] " Taylor Blau
@ 2025-03-11 0:21 ` Taylor Blau
2025-03-11 0:21 ` [PATCH v4 1/6] t/t5329-pack-objects-cruft.sh: evict 'repack'-related tests Taylor Blau
` (6 more replies)
2025-03-13 18:09 ` [PATCH v5] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
6 siblings, 7 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-11 0:21 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren, Patrick Steinhardt
Here is a slightly larger reroll of my series to fix object freshening
when using multi-cruft packs that I have been meaning to send for a
couple of days.
I realized after sending the last round that not only was the first
commit from v1 flawed (for the reasons Patrick identified) but that
there is currently no way to grow a new cruft pack past the configured
limit.
Independent of this series suppose for example that we have two 100 MiB
packs, and the threshold is 200 MiB. We should able to in theory combine
those packs together. But we can't! The largest pack we'll make is
199MiB (and change), since builtin/pack-objects.c::write_one() will
refuse to write any object which would bust the limit given by
--max-pack-size.
This series resurrects the first patch from v1 after introducing a
behavior change for 'git pack-objects --cruft --max-pack-size'. When
given with '--cruft', '--max-pack-size' now allows pack-objects to grow
a pack *just* past the given limit by at most one object. This allows
packs to grow past their threshold and age out of the active generation
of cruft packs so they are no longer repacked with each 'git repack
--cruft'.
We're way too late into the -rc cycle for this to land in the
forthcoming release, but I wanted to get this off of my workstation
anyway to allow folks to review it as they have time.
As usual, there is a range-diff since last time. Thanks in advance for
your review!
Taylor Blau (6):
t/t5329-pack-objects-cruft.sh: evict 'repack'-related tests
t7704-repack-cruft.sh: consolidate `write_blob()`
t/lib-cruft.sh: extract some cruft-related helpers
pack-objects: generate cruft packs at most one object over threshold
builtin/repack.c: simplify cruft pack aggregation
builtin/pack-objects.c: freshen objects from existing cruft packs
Documentation/config/pack.adoc | 4 +
Documentation/git-pack-objects.adoc | 4 +
builtin/pack-objects.c | 150 +++++++++--
builtin/repack.c | 38 +--
packfile.c | 3 +-
packfile.h | 2 +
t/lib-cruft.sh | 23 ++
t/t5329-pack-objects-cruft.sh | 317 +++++-------------------
t/t7704-repack-cruft.sh | 372 +++++++++++++++++++++++-----
9 files changed, 544 insertions(+), 369 deletions(-)
create mode 100644 t/lib-cruft.sh
Range-diff against v3:
-: ---------- > 1: 390c3a6d85 t/t5329-pack-objects-cruft.sh: evict 'repack'-related tests
-: ---------- > 2: e7ebe6c460 t7704-repack-cruft.sh: consolidate `write_blob()`
-: ---------- > 3: aa7588f817 t/lib-cruft.sh: extract some cruft-related helpers
-: ---------- > 4: f2ca92245a pack-objects: generate cruft packs at most one object over threshold
-: ---------- > 5: 12ddea7603 builtin/repack.c: simplify cruft pack aggregation
1: c80188164e = 6: d44a124c81 builtin/pack-objects.c: freshen objects from existing cruft packs
base-commit: 87a0bdbf0f72b7561f3cd50636eee33dcb7dbcc3
--
2.49.0.rc2.6.g9a1eecd400
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v4 1/6] t/t5329-pack-objects-cruft.sh: evict 'repack'-related tests
2025-03-11 0:21 ` [PATCH v4 0/6] " Taylor Blau
@ 2025-03-11 0:21 ` Taylor Blau
2025-03-11 0:21 ` [PATCH v4 2/6] t7704-repack-cruft.sh: consolidate `write_blob()` Taylor Blau
` (5 subsequent siblings)
6 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-11 0:21 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren, Patrick Steinhardt
The cruft pack feature has two primary test scripts which exercise
various parts of it, which are:
- t5329-pack-objects-cruft.sh
- t7704-repack-cruft.sh
The former is designed to test low-level pack generation mechanics at
the 'git pack-objects --cruft'-level, which is plumbing. The latter, on
the other hand, is designed to test the user-facing behavior through
'git repack --cruft', which is porcelain (under the "ancillary
manipulators" sub-section).
At some point a handful of tests which should have been added to the
latter script were instead written to the former. This isn't a huge
deal, but rectifying it is straightforward. Move a handful of
'repack'-related tests out of t5329 and into their rightful home in
t7704.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
t/t5329-pack-objects-cruft.sh | 250 ----------------------------------
t/t7704-repack-cruft.sh | 250 ++++++++++++++++++++++++++++++++++
2 files changed, 250 insertions(+), 250 deletions(-)
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index b71a0aef40..60dac8312d 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -360,43 +360,6 @@ test_expect_success 'expired objects are pruned' '
)
'
-test_expect_success 'repack --cruft generates a cruft pack' '
- git init repo &&
- test_when_finished "rm -fr repo" &&
- (
- cd repo &&
-
- test_commit reachable &&
- git branch -M main &&
- git checkout --orphan other &&
- test_commit unreachable &&
-
- git checkout main &&
- git branch -D other &&
- git tag -d unreachable &&
- # objects are not cruft if they are contained in the reflogs
- git reflog expire --all --expire=all &&
-
- git rev-list --objects --all --no-object-names >reachable.raw &&
- git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
- sort <reachable.raw >reachable &&
- comm -13 reachable objects >unreachable &&
-
- git repack --cruft -d &&
-
- cruft=$(basename $(ls $packdir/pack-*.mtimes) .mtimes) &&
- pack=$(basename $(ls $packdir/pack-*.pack | grep -v $cruft) .pack) &&
-
- git show-index <$packdir/$pack.idx >actual.raw &&
- cut -f2 -d" " actual.raw | sort >actual &&
- test_cmp reachable actual &&
-
- git show-index <$packdir/$cruft.idx >actual.raw &&
- cut -f2 -d" " actual.raw | sort >actual &&
- test_cmp unreachable actual
- )
-'
-
test_expect_success 'loose objects mtimes upsert others' '
git init repo &&
test_when_finished "rm -fr repo" &&
@@ -470,219 +433,6 @@ test_expect_success 'expiring cruft objects with git gc' '
)
'
-test_expect_success 'cruft packs are not included in geometric repack' '
- git init repo &&
- test_when_finished "rm -fr repo" &&
- (
- cd repo &&
-
- test_commit reachable &&
- git repack -Ad &&
- git branch -M main &&
-
- git checkout --orphan other &&
- test_commit cruft &&
- git repack -d &&
-
- git checkout main &&
- git branch -D other &&
- git tag -d cruft &&
- git reflog expire --all --expire=all &&
-
- git repack --cruft &&
-
- find $packdir -type f | sort >before &&
- git repack --geometric=2 -d &&
- find $packdir -type f | sort >after &&
-
- test_cmp before after
- )
-'
-
-test_expect_success 'repack --geometric collects once-cruft objects' '
- git init repo &&
- test_when_finished "rm -fr repo" &&
- (
- cd repo &&
-
- test_commit reachable &&
- git repack -Ad &&
- git branch -M main &&
-
- git checkout --orphan other &&
- git rm -rf . &&
- test_commit --no-tag cruft &&
- cruft="$(git rev-parse HEAD)" &&
-
- git checkout main &&
- git branch -D other &&
- git reflog expire --all --expire=all &&
-
- # Pack the objects created in the previous step into a cruft
- # pack. Intentionally leave loose copies of those objects
- # around so we can pick them up in a subsequent --geometric
- # reapack.
- git repack --cruft &&
-
- # Now make those objects reachable, and ensure that they are
- # packed into the new pack created via a --geometric repack.
- git update-ref refs/heads/other $cruft &&
-
- # Without this object, the set of unpacked objects is exactly
- # the set of objects already in the cruft pack. Tweak that set
- # to ensure we do not overwrite the cruft pack entirely.
- test_commit reachable2 &&
-
- find $packdir -name "pack-*.idx" | sort >before &&
- git repack --geometric=2 -d &&
- find $packdir -name "pack-*.idx" | sort >after &&
-
- {
- git rev-list --objects --no-object-names $cruft &&
- git rev-list --objects --no-object-names reachable..reachable2
- } >want.raw &&
- sort want.raw >want &&
-
- pack=$(comm -13 before after) &&
- git show-index <$pack >objects.raw &&
-
- cut -d" " -f2 objects.raw | sort >got &&
-
- test_cmp want got
- )
-'
-
-test_expect_success 'cruft repack with no reachable objects' '
- git init repo &&
- test_when_finished "rm -fr repo" &&
- (
- cd repo &&
-
- test_commit base &&
- git repack -ad &&
-
- base="$(git rev-parse base)" &&
-
- git for-each-ref --format="delete %(refname)" >in &&
- git update-ref --stdin <in &&
- git reflog expire --all --expire=all &&
- rm -fr .git/index &&
-
- git repack --cruft -d &&
-
- git cat-file -t $base
- )
-'
-
-write_blob () {
- test-tool genrandom "$@" >in &&
- git hash-object -w -t blob in
-}
-
-find_pack () {
- for idx in $(ls $packdir/pack-*.idx)
- do
- git show-index <$idx >out &&
- if grep -q "$1" out
- then
- echo $idx
- fi || return 1
- done
-}
-
-test_expect_success 'cruft repack with --max-pack-size' '
- git init max-pack-size &&
- (
- cd max-pack-size &&
- test_commit base &&
-
- # two cruft objects which exceed the maximum pack size
- foo=$(write_blob foo 1048576) &&
- bar=$(write_blob bar 1048576) &&
- test-tool chmtime --get -1000 \
- "$objdir/$(test_oid_to_path $foo)" >foo.mtime &&
- test-tool chmtime --get -2000 \
- "$objdir/$(test_oid_to_path $bar)" >bar.mtime &&
- git repack --cruft --max-pack-size=1M &&
- find $packdir -name "*.mtimes" >cruft &&
- test_line_count = 2 cruft &&
-
- foo_mtimes="$(basename $(find_pack $foo) .idx).mtimes" &&
- bar_mtimes="$(basename $(find_pack $bar) .idx).mtimes" &&
- test-tool pack-mtimes $foo_mtimes >foo.actual &&
- test-tool pack-mtimes $bar_mtimes >bar.actual &&
-
- echo "$foo $(cat foo.mtime)" >foo.expect &&
- echo "$bar $(cat bar.mtime)" >bar.expect &&
-
- test_cmp foo.expect foo.actual &&
- test_cmp bar.expect bar.actual &&
- test "$foo_mtimes" != "$bar_mtimes"
- )
-'
-
-test_expect_success 'cruft repack with pack.packSizeLimit' '
- (
- cd max-pack-size &&
- # repack everything back together to remove the existing cruft
- # pack (but to keep its objects)
- git repack -adk &&
- git -c pack.packSizeLimit=1M repack --cruft &&
- # ensure the same post condition is met when --max-pack-size
- # would otherwise be inferred from the configuration
- find $packdir -name "*.mtimes" >cruft &&
- test_line_count = 2 cruft &&
- for pack in $(cat cruft)
- do
- test-tool pack-mtimes "$(basename $pack)" >objects &&
- test_line_count = 1 objects || return 1
- done
- )
-'
-
-test_expect_success 'cruft repack respects repack.cruftWindow' '
- git init repo &&
- test_when_finished "rm -fr repo" &&
- (
- cd repo &&
-
- test_commit base &&
-
- GIT_TRACE2_EVENT=$(pwd)/event.trace \
- git -c pack.window=1 -c repack.cruftWindow=2 repack \
- --cruft --window=3 &&
-
- grep "pack-objects.*--window=2.*--cruft" event.trace
- )
-'
-
-test_expect_success 'cruft repack respects --window by default' '
- git init repo &&
- test_when_finished "rm -fr repo" &&
- (
- cd repo &&
-
- test_commit base &&
-
- GIT_TRACE2_EVENT=$(pwd)/event.trace \
- git -c pack.window=2 repack --cruft --window=3 &&
-
- grep "pack-objects.*--window=3.*--cruft" event.trace
- )
-'
-
-test_expect_success 'cruft repack respects --quiet' '
- git init repo &&
- test_when_finished "rm -fr repo" &&
- (
- cd repo &&
-
- test_commit base &&
- GIT_PROGRESS_DELAY=0 git repack --cruft --quiet 2>err &&
- test_must_be_empty err
- )
-'
-
test_expect_success 'cruft --local drops unreachable objects' '
git init alternate &&
git init repo &&
diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index 959e6e2648..aa5d8913ae 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -411,4 +411,254 @@ test_expect_success 'reachable packs are preferred over cruft ones' '
)
'
+test_expect_success 'repack --cruft generates a cruft pack' '
+ git init repo &&
+ test_when_finished "rm -fr repo" &&
+ (
+ cd repo &&
+
+ test_commit reachable &&
+ git branch -M main &&
+ git checkout --orphan other &&
+ test_commit unreachable &&
+
+ git checkout main &&
+ git branch -D other &&
+ git tag -d unreachable &&
+ # objects are not cruft if they are contained in the reflogs
+ git reflog expire --all --expire=all &&
+
+ git rev-list --objects --all --no-object-names >reachable.raw &&
+ git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+ sort <reachable.raw >reachable &&
+ comm -13 reachable objects >unreachable &&
+
+ git repack --cruft -d &&
+
+ cruft=$(basename $(ls $packdir/pack-*.mtimes) .mtimes) &&
+ pack=$(basename $(ls $packdir/pack-*.pack | grep -v $cruft) .pack) &&
+
+ git show-index <$packdir/$pack.idx >actual.raw &&
+ cut -f2 -d" " actual.raw | sort >actual &&
+ test_cmp reachable actual &&
+
+ git show-index <$packdir/$cruft.idx >actual.raw &&
+ cut -f2 -d" " actual.raw | sort >actual &&
+ test_cmp unreachable actual
+ )
+'
+
+test_expect_success 'cruft packs are not included in geometric repack' '
+ git init repo &&
+ test_when_finished "rm -fr repo" &&
+ (
+ cd repo &&
+
+ test_commit reachable &&
+ git repack -Ad &&
+ git branch -M main &&
+
+ git checkout --orphan other &&
+ test_commit cruft &&
+ git repack -d &&
+
+ git checkout main &&
+ git branch -D other &&
+ git tag -d cruft &&
+ git reflog expire --all --expire=all &&
+
+ git repack --cruft &&
+
+ find $packdir -type f | sort >before &&
+ git repack --geometric=2 -d &&
+ find $packdir -type f | sort >after &&
+
+ test_cmp before after
+ )
+'
+
+test_expect_success 'repack --geometric collects once-cruft objects' '
+ git init repo &&
+ test_when_finished "rm -fr repo" &&
+ (
+ cd repo &&
+
+ test_commit reachable &&
+ git repack -Ad &&
+ git branch -M main &&
+
+ git checkout --orphan other &&
+ git rm -rf . &&
+ test_commit --no-tag cruft &&
+ cruft="$(git rev-parse HEAD)" &&
+
+ git checkout main &&
+ git branch -D other &&
+ git reflog expire --all --expire=all &&
+
+ # Pack the objects created in the previous step into a cruft
+ # pack. Intentionally leave loose copies of those objects
+ # around so we can pick them up in a subsequent --geometric
+ # reapack.
+ git repack --cruft &&
+
+ # Now make those objects reachable, and ensure that they are
+ # packed into the new pack created via a --geometric repack.
+ git update-ref refs/heads/other $cruft &&
+
+ # Without this object, the set of unpacked objects is exactly
+ # the set of objects already in the cruft pack. Tweak that set
+ # to ensure we do not overwrite the cruft pack entirely.
+ test_commit reachable2 &&
+
+ find $packdir -name "pack-*.idx" | sort >before &&
+ git repack --geometric=2 -d &&
+ find $packdir -name "pack-*.idx" | sort >after &&
+
+ {
+ git rev-list --objects --no-object-names $cruft &&
+ git rev-list --objects --no-object-names reachable..reachable2
+ } >want.raw &&
+ sort want.raw >want &&
+
+ pack=$(comm -13 before after) &&
+ git show-index <$pack >objects.raw &&
+
+ cut -d" " -f2 objects.raw | sort >got &&
+
+ test_cmp want got
+ )
+'
+
+test_expect_success 'cruft repack with no reachable objects' '
+ git init repo &&
+ test_when_finished "rm -fr repo" &&
+ (
+ cd repo &&
+
+ test_commit base &&
+ git repack -ad &&
+
+ base="$(git rev-parse base)" &&
+
+ git for-each-ref --format="delete %(refname)" >in &&
+ git update-ref --stdin <in &&
+ git reflog expire --all --expire=all &&
+ rm -fr .git/index &&
+
+ git repack --cruft -d &&
+
+ git cat-file -t $base
+ )
+'
+
+write_blob () {
+ test-tool genrandom "$@" >in &&
+ git hash-object -w -t blob in
+}
+
+find_pack () {
+ for idx in $(ls $packdir/pack-*.idx)
+ do
+ git show-index <$idx >out &&
+ if grep -q "$1" out
+ then
+ echo $idx
+ fi || return 1
+ done
+}
+
+test_expect_success 'cruft repack with --max-pack-size' '
+ git init max-pack-size &&
+ (
+ cd max-pack-size &&
+ test_commit base &&
+
+ # two cruft objects which exceed the maximum pack size
+ foo=$(write_blob foo 1048576) &&
+ bar=$(write_blob bar 1048576) &&
+ test-tool chmtime --get -1000 \
+ "$objdir/$(test_oid_to_path $foo)" >foo.mtime &&
+ test-tool chmtime --get -2000 \
+ "$objdir/$(test_oid_to_path $bar)" >bar.mtime &&
+ git repack --cruft --max-pack-size=1M &&
+ find $packdir -name "*.mtimes" >cruft &&
+ test_line_count = 2 cruft &&
+
+ foo_mtimes="$(basename $(find_pack $foo) .idx).mtimes" &&
+ bar_mtimes="$(basename $(find_pack $bar) .idx).mtimes" &&
+ test-tool pack-mtimes $foo_mtimes >foo.actual &&
+ test-tool pack-mtimes $bar_mtimes >bar.actual &&
+
+ echo "$foo $(cat foo.mtime)" >foo.expect &&
+ echo "$bar $(cat bar.mtime)" >bar.expect &&
+
+ test_cmp foo.expect foo.actual &&
+ test_cmp bar.expect bar.actual &&
+ test "$foo_mtimes" != "$bar_mtimes"
+ )
+'
+
+test_expect_success 'cruft repack with pack.packSizeLimit' '
+ (
+ cd max-pack-size &&
+ # repack everything back together to remove the existing cruft
+ # pack (but to keep its objects)
+ git repack -adk &&
+ git -c pack.packSizeLimit=1M repack --cruft &&
+ # ensure the same post condition is met when --max-pack-size
+ # would otherwise be inferred from the configuration
+ find $packdir -name "*.mtimes" >cruft &&
+ test_line_count = 2 cruft &&
+ for pack in $(cat cruft)
+ do
+ test-tool pack-mtimes "$(basename $pack)" >objects &&
+ test_line_count = 1 objects || return 1
+ done
+ )
+'
+
+test_expect_success 'cruft repack respects repack.cruftWindow' '
+ git init repo &&
+ test_when_finished "rm -fr repo" &&
+ (
+ cd repo &&
+
+ test_commit base &&
+
+ GIT_TRACE2_EVENT=$(pwd)/event.trace \
+ git -c pack.window=1 -c repack.cruftWindow=2 repack \
+ --cruft --window=3 &&
+
+ grep "pack-objects.*--window=2.*--cruft" event.trace
+ )
+'
+
+test_expect_success 'cruft repack respects --window by default' '
+ git init repo &&
+ test_when_finished "rm -fr repo" &&
+ (
+ cd repo &&
+
+ test_commit base &&
+
+ GIT_TRACE2_EVENT=$(pwd)/event.trace \
+ git -c pack.window=2 repack --cruft --window=3 &&
+
+ grep "pack-objects.*--window=3.*--cruft" event.trace
+ )
+'
+
+test_expect_success 'cruft repack respects --quiet' '
+ git init repo &&
+ test_when_finished "rm -fr repo" &&
+ (
+ cd repo &&
+
+ test_commit base &&
+ GIT_PROGRESS_DELAY=0 git repack --cruft --quiet 2>err &&
+ test_must_be_empty err
+ )
+'
+
test_done
--
2.49.0.rc2.6.g9a1eecd400
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH v4 2/6] t7704-repack-cruft.sh: consolidate `write_blob()`
2025-03-11 0:21 ` [PATCH v4 0/6] " Taylor Blau
2025-03-11 0:21 ` [PATCH v4 1/6] t/t5329-pack-objects-cruft.sh: evict 'repack'-related tests Taylor Blau
@ 2025-03-11 0:21 ` Taylor Blau
2025-03-11 0:21 ` [PATCH v4 3/6] t/lib-cruft.sh: extract some cruft-related helpers Taylor Blau
` (4 subsequent siblings)
6 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-11 0:21 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren, Patrick Steinhardt
The last commit moved a handful of tests from a different script into
t7704, including one that relies on generating random blobs.
Incidentally, the original home of this test defined its own helper
"write_blob" for doing so, which is identical in function to our
"generate_random_blob" (and is slightly inferior to the latter, which
cleans up after itself).
Rewrite the test that uses "write_blob" to no longer do so and then
remove the function.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
t/t7704-repack-cruft.sh | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index aa5d8913ae..5ce2648a29 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -552,11 +552,6 @@ test_expect_success 'cruft repack with no reachable objects' '
)
'
-write_blob () {
- test-tool genrandom "$@" >in &&
- git hash-object -w -t blob in
-}
-
find_pack () {
for idx in $(ls $packdir/pack-*.idx)
do
@@ -575,8 +570,8 @@ test_expect_success 'cruft repack with --max-pack-size' '
test_commit base &&
# two cruft objects which exceed the maximum pack size
- foo=$(write_blob foo 1048576) &&
- bar=$(write_blob bar 1048576) &&
+ foo=$(generate_random_blob foo 1048576) &&
+ bar=$(generate_random_blob bar 1048576) &&
test-tool chmtime --get -1000 \
"$objdir/$(test_oid_to_path $foo)" >foo.mtime &&
test-tool chmtime --get -2000 \
--
2.49.0.rc2.6.g9a1eecd400
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH v4 3/6] t/lib-cruft.sh: extract some cruft-related helpers
2025-03-11 0:21 ` [PATCH v4 0/6] " Taylor Blau
2025-03-11 0:21 ` [PATCH v4 1/6] t/t5329-pack-objects-cruft.sh: evict 'repack'-related tests Taylor Blau
2025-03-11 0:21 ` [PATCH v4 2/6] t7704-repack-cruft.sh: consolidate `write_blob()` Taylor Blau
@ 2025-03-11 0:21 ` Taylor Blau
2025-03-11 0:21 ` [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold Taylor Blau
` (3 subsequent siblings)
6 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-11 0:21 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren, Patrick Steinhardt
The test script in t7704-repack-cruft.sh uses a handful of helpers to
generate and pack random blobs, but a forthcoming test in t5329 will
want to make use of generate_random_blob().
Instead of rewriting that function, move it (and the related helpers) to
its own reusable library to avoid duplicating the function.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
t/lib-cruft.sh | 23 +++++++++++++++++++++++
t/t7704-repack-cruft.sh | 22 +---------------------
2 files changed, 24 insertions(+), 21 deletions(-)
create mode 100644 t/lib-cruft.sh
diff --git a/t/lib-cruft.sh b/t/lib-cruft.sh
new file mode 100644
index 0000000000..9fdc0b01b0
--- /dev/null
+++ b/t/lib-cruft.sh
@@ -0,0 +1,23 @@
+# generate_random_blob <seed-string> [<size>]
+generate_random_blob () {
+ test-tool genrandom "$@" >blob &&
+ git hash-object -w -t blob blob &&
+ rm blob
+}
+
+# pack_random_blob <seed-string> [<size>]
+pack_random_blob () {
+ generate_random_blob "$@" &&
+ git repack -d -q >/dev/null
+}
+
+# generate_cruft_pack <seed-string> [<size>]
+generate_cruft_pack () {
+ pack_random_blob "$@" >/dev/null &&
+
+ ls $packdir/pack-*.pack | xargs -n 1 basename >in &&
+ pack="$(git pack-objects --cruft $packdir/pack <in)" &&
+ git prune-packed &&
+
+ echo "$packdir/pack-$pack.mtimes"
+}
diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index 5ce2648a29..88c6ce2913 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -3,6 +3,7 @@
test_description='git repack works correctly'
. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-cruft.sh
objdir=.git/objects
packdir=$objdir/pack
@@ -128,27 +129,6 @@ test_expect_success '--expire-to stores pruned objects (5.minutes.ago)' '
)
'
-generate_random_blob() {
- test-tool genrandom "$@" >blob &&
- git hash-object -w -t blob blob &&
- rm blob
-}
-
-pack_random_blob () {
- generate_random_blob "$@" &&
- git repack -d -q >/dev/null
-}
-
-generate_cruft_pack () {
- pack_random_blob "$@" >/dev/null &&
-
- ls $packdir/pack-*.pack | xargs -n 1 basename >in &&
- pack="$(git pack-objects --cruft $packdir/pack <in)" &&
- git prune-packed &&
-
- echo "$packdir/pack-$pack.mtimes"
-}
-
test_expect_success '--max-cruft-size creates new packs when above threshold' '
git init max-cruft-size-large &&
(
--
2.49.0.rc2.6.g9a1eecd400
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold
2025-03-11 0:21 ` [PATCH v4 0/6] " Taylor Blau
` (2 preceding siblings ...)
2025-03-11 0:21 ` [PATCH v4 3/6] t/lib-cruft.sh: extract some cruft-related helpers Taylor Blau
@ 2025-03-11 0:21 ` Taylor Blau
2025-03-11 21:59 ` Junio C Hamano
2025-03-11 0:21 ` [PATCH v4 5/6] builtin/repack.c: simplify cruft pack aggregation Taylor Blau
` (2 subsequent siblings)
6 siblings, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2025-03-11 0:21 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren, Patrick Steinhardt
When generating multiple cruft packs with 'git repack --max-cruft-size',
we use 'git pack-objects --cruft --max-pack-size' (with many other
elided options), filling in the '--max-pack-size' value with whatever
was provided via the '--max-cruft-size' flag.
This causes us to generate a pack that is smaller than the specified
threshold. This poses a problem since we will never be able to generate
a cruft pack that crosses the threshold. In effect, this means that we
will try and repack its contents over and over again.
Instead, change the meaning of '--max-pack-size' in pack-objects when
combined with '--cruft'. When put together, '--max-pack-size' allows the
pack to grow larger than the specified threshold, but only by one
additional object.
This allows cruft packs to become just a little bit larger than the
threshold, allowing cruft packs to accumulate past the set threshold and
avoid being repacked in the future until a pruning GC takes place.
Noticed-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
Documentation/config/pack.adoc | 4 ++
Documentation/git-pack-objects.adoc | 4 ++
builtin/pack-objects.c | 32 +++++++++++++-
t/t5329-pack-objects-cruft.sh | 67 +++++++++++++++++++++++++++++
4 files changed, 105 insertions(+), 2 deletions(-)
diff --git a/Documentation/config/pack.adoc b/Documentation/config/pack.adoc
index da527377fa..0a90931b93 100644
--- a/Documentation/config/pack.adoc
+++ b/Documentation/config/pack.adoc
@@ -119,6 +119,10 @@ sizes (e.g., removable media that cannot store the whole repository),
you are likely better off creating a single large packfile and splitting
it using a generic multi-volume archive tool (e.g., Unix `split`).
+
+When generating cruft packs with `git pack-objects`, this option has a
+slightly different interpretation than above; see the documentation for
+`--max-pack-size` option in linkgit:git-pack-objects[1].
++
The minimum size allowed is limited to 1 MiB. The default is unlimited.
Common unit suffixes of 'k', 'm', or 'g' are supported.
diff --git a/Documentation/git-pack-objects.adoc b/Documentation/git-pack-objects.adoc
index 7f69ae4855..aee467c496 100644
--- a/Documentation/git-pack-objects.adoc
+++ b/Documentation/git-pack-objects.adoc
@@ -161,6 +161,10 @@ depth is 4095.
`pack.packSizeLimit` is set. Note that this option may result in
a larger and slower repository; see the discussion in
`pack.packSizeLimit`.
++
+When used with `--cruft`, the output packfile(s) may be as large or
+larger than the configured maximum size. The pack will exceed the
+specified maximum by no more than one object.
--honor-pack-keep::
This flag causes an object already in a local pack that
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 58a9b16126..f701b4c9ec 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -692,11 +692,39 @@ static off_t write_object(struct hashfile *f,
off_t len;
int usable_delta, to_reuse;
+ if (cruft && pack_size_limit && pack_size_limit <= write_offset) {
+ /*
+ * When writing a cruft pack with a limited size,
+ * perform the --max-pack-size check *before* writing
+ * the object.
+ *
+ * When we have not yet reached the size limit, this
+ * combined with the fact that we act as if there is no
+ * limit when writing objects via write_object() allows
+ * us to grow one object *past* the specified limit.
+ *
+ * This is important for generating cruft packs with a
+ * --max-pack-size so we can generate packs that are
+ * just over the threshold to avoid repacking them in
+ * the future.
+ */
+ return 0;
+ }
+
if (!pack_to_stdout)
crc32_begin(f);
- /* apply size limit if limited packsize and not first object */
- if (!pack_size_limit || !nr_written)
+ /*
+ * Apply size limit when one is provided, with the following
+ * exceptions:
+ *
+ * - We are writing the first object.
+ *
+ * - We are writing a cruft pack with a size limit. The check
+ * above covers this case while letting the pack grow at most
+ * one object beyond the limit.
+ */
+ if (!pack_size_limit || !nr_written || cruft)
limit = 0;
else if (pack_size_limit <= write_offset)
/*
diff --git a/t/t5329-pack-objects-cruft.sh b/t/t5329-pack-objects-cruft.sh
index 60dac8312d..9cbc21a65d 100755
--- a/t/t5329-pack-objects-cruft.sh
+++ b/t/t5329-pack-objects-cruft.sh
@@ -3,6 +3,7 @@
test_description='cruft pack related pack-objects tests'
. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-cruft.sh
objdir=.git/objects
packdir=$objdir/pack
@@ -695,4 +696,70 @@ test_expect_success 'additional cruft blobs via gc.recentObjectsHook' '
)
'
+test_expect_success 'cruft pack generation beyond --max-pack-size' '
+ test_when_finished "rm -fr repo" &&
+ git init repo &&
+ (
+ cd repo &&
+
+ # Disable pack compression to ensure the pack size is
+ # predictable.
+ git config pack.compression 0 &&
+
+ sz=524288 && # 0.5 MiB
+ foo="$(generate_random_blob foo $sz)" &&
+ bar="$(generate_random_blob bar $sz)" &&
+ baz="$(generate_random_blob baz $sz)" &&
+ quux="$(generate_random_blob quux $sz)" &&
+
+ printf "%s\n" "$foo" "$bar" >A.objects &&
+ printf "%s\n" "$baz" "$quux" >B.objects &&
+
+ A="$(git pack-objects $packdir/pack <A.objects)" &&
+ B="$(git pack-objects $packdir/pack <B.objects)" &&
+
+ git prune-packed &&
+
+ sz=1572864 && # 1.5 MiB
+ printf -- "-%s\n" "pack-$A.pack" "pack-$B.pack" >C.in &&
+ git pack-objects --cruft --max-pack-size=$sz $packdir/pack \
+ <C.in >C.out &&
+
+ test_line_count = 2 C.out &&
+ C_large="$(head -n 1 C.out)" &&
+ C_small="$(tail -n 1 C.out)" &&
+
+ # Swap $C_large and $C_small if necessary.
+ if test "$(test_file_size $packdir/pack-$C_large.idx)" -lt \
+ "$(test_file_size $packdir/pack-$C_small.idx)"
+ then
+ tmp="$C_large" &&
+ C_large="$C_small" &&
+ C_small="$tmp"
+ fi &&
+
+ # Ensure the large pack is no smaller than the threshold
+ # such that it does not get repacked in subsequent runs
+ # with the same --max-pack-size setting.
+ test $(test_file_size $packdir/pack-$C_large.pack) -ge $sz &&
+
+ {
+ git show-index <"$packdir/pack-$C_large.idx" &&
+ git show-index <"$packdir/pack-$C_small.idx"
+ } >actual.raw &&
+ printf "%s\n" "$foo" "$bar" "$baz" "$quux" >expect.raw &&
+
+ sort <expect.raw >expect &&
+ cut -d " " -f 2 actual.raw | sort >actual &&
+
+ # Ensure that all of the objects are present in the two
+ # cruft packs we just generated.
+ #
+ # Note that the contents of "actual" are not
+ # de-duplicated. This is intentional to ensure we avoid
+ # packing the same object twice (once in each pack).
+ test_cmp expect actual
+ )
+'
+
test_done
--
2.49.0.rc2.6.g9a1eecd400
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH v4 5/6] builtin/repack.c: simplify cruft pack aggregation
2025-03-11 0:21 ` [PATCH v4 0/6] " Taylor Blau
` (3 preceding siblings ...)
2025-03-11 0:21 ` [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold Taylor Blau
@ 2025-03-11 0:21 ` Taylor Blau
2025-03-11 0:21 ` [PATCH v4 6/6] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
2025-03-11 20:13 ` [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs Junio C Hamano
6 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-11 0:21 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren, Patrick Steinhardt
In 37dc6d8104 (builtin/repack.c: implement support for
`--max-cruft-size`, 2023-10-02), 'git repack' built on support for
multiple cruft packs in Git by instructing 'git pack-objects --cruft'
how to aggregate smaller cruft packs up to the provided threshold.
The implementation in 37dc6d8104 worked something like the following
pseudo-code:
total_size = 0;
for (p in cruft packs) {
if (p->pack_size + total_size < max_size) {
total_size += p->pack_size;
collapse(p)
} else {
retain(p);
}
}
The original idea behind this approach was that smaller cruft packs
would get combined together until the sum of their sizes was no larger
than the given max pack size.
There is a much simpler way to achieve this, however, which is to simply
combine *all* cruft packs which are smaller than the threshold,
regardless of what their sum is. With '--max-pack-size', 'pack-objects'
will split out the resulting pack into individual pack(s) if necessary
to ensure that the written pack(s) are each at most one object larger
than the provided threshold.
This yields a slight behavior change, which is reflected in the removed
test. Previous to this change, we would aggregate smaller cruft packs
first, whereas now we will opportunistically combine as many cruft packs
as possible. As as result, that test is no longer relevant, and can be
deleted.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/repack.c | 38 ++-----------------------------------
t/t7704-repack-cruft.sh | 42 -----------------------------------------
2 files changed, 2 insertions(+), 78 deletions(-)
diff --git a/builtin/repack.c b/builtin/repack.c
index 75e3752353..4d83d40f39 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -1022,29 +1022,13 @@ static int write_filtered_pack(const struct pack_objects_args *args,
return finish_pack_objects_cmd(&cmd, names, local);
}
-static int existing_cruft_pack_cmp(const void *va, const void *vb)
-{
- struct packed_git *a = *(struct packed_git **)va;
- struct packed_git *b = *(struct packed_git **)vb;
-
- if (a->pack_size < b->pack_size)
- return -1;
- if (a->pack_size > b->pack_size)
- return 1;
- return 0;
-}
-
static void collapse_small_cruft_packs(FILE *in, size_t max_size,
struct existing_packs *existing)
{
- struct packed_git **existing_cruft, *p;
+ struct packed_git *p;
struct strbuf buf = STRBUF_INIT;
- size_t total_size = 0;
- size_t existing_cruft_nr = 0;
size_t i;
- ALLOC_ARRAY(existing_cruft, existing->cruft_packs.nr);
-
for (p = get_all_packs(the_repository); p; p = p->next) {
if (!(p->is_cruft && p->pack_local))
continue;
@@ -1056,24 +1040,7 @@ static void collapse_small_cruft_packs(FILE *in, size_t max_size,
if (!string_list_has_string(&existing->cruft_packs, buf.buf))
continue;
- if (existing_cruft_nr >= existing->cruft_packs.nr)
- BUG("too many cruft packs (found %"PRIuMAX", but knew "
- "of %"PRIuMAX")",
- (uintmax_t)existing_cruft_nr + 1,
- (uintmax_t)existing->cruft_packs.nr);
- existing_cruft[existing_cruft_nr++] = p;
- }
-
- QSORT(existing_cruft, existing_cruft_nr, existing_cruft_pack_cmp);
-
- for (i = 0; i < existing_cruft_nr; i++) {
- size_t proposed;
-
- p = existing_cruft[i];
- proposed = st_add(total_size, p->pack_size);
-
- if (proposed <= max_size) {
- total_size = proposed;
+ if (p->pack_size < max_size) {
fprintf(in, "-%s\n", pack_basename(p));
} else {
retain_cruft_pack(existing, p);
@@ -1086,7 +1053,6 @@ static void collapse_small_cruft_packs(FILE *in, size_t max_size,
existing->non_kept_packs.items[i].string);
strbuf_release(&buf);
- free(existing_cruft);
}
static int write_cruft_pack(const struct pack_objects_args *args,
diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index 88c6ce2913..fb52bb36a2 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -174,48 +174,6 @@ test_expect_success '--max-cruft-size combines existing packs when below thresho
)
'
-test_expect_success '--max-cruft-size combines smaller packs first' '
- git init max-cruft-size-consume-small &&
- (
- cd max-cruft-size-consume-small &&
-
- test_commit base &&
- git repack -ad &&
-
- cruft_foo="$(generate_cruft_pack foo 524288)" && # 0.5 MiB
- cruft_bar="$(generate_cruft_pack bar 524288)" && # 0.5 MiB
- cruft_baz="$(generate_cruft_pack baz 1048576)" && # 1.0 MiB
- cruft_quux="$(generate_cruft_pack quux 1572864)" && # 1.5 MiB
-
- test-tool pack-mtimes "$(basename $cruft_foo)" >expect.raw &&
- test-tool pack-mtimes "$(basename $cruft_bar)" >>expect.raw &&
- sort expect.raw >expect.objects &&
-
- # repacking with `--max-cruft-size=2M` should combine
- # both 0.5 MiB packs together, instead of, say, one of
- # the 0.5 MiB packs with the 1.0 MiB pack
- ls $packdir/pack-*.mtimes | sort >cruft.before &&
- git repack -d --cruft --max-cruft-size=2M &&
- ls $packdir/pack-*.mtimes | sort >cruft.after &&
-
- comm -13 cruft.before cruft.after >cruft.new &&
- comm -23 cruft.before cruft.after >cruft.removed &&
-
- test_line_count = 1 cruft.new &&
- test_line_count = 2 cruft.removed &&
-
- # the two smaller packs should be rolled up first
- printf "%s\n" $cruft_foo $cruft_bar | sort >expect.removed &&
- test_cmp expect.removed cruft.removed &&
-
- # ...and contain the set of objects rolled up
- test-tool pack-mtimes "$(basename $(cat cruft.new))" >actual.raw &&
- sort actual.raw >actual.objects &&
-
- test_cmp expect.objects actual.objects
- )
-'
-
test_expect_success 'setup --max-cruft-size with freshened objects' '
git init max-cruft-size-freshen &&
(
--
2.49.0.rc2.6.g9a1eecd400
^ permalink raw reply related [flat|nested] 54+ messages in thread
* [PATCH v4 6/6] builtin/pack-objects.c: freshen objects from existing cruft packs
2025-03-11 0:21 ` [PATCH v4 0/6] " Taylor Blau
` (4 preceding siblings ...)
2025-03-11 0:21 ` [PATCH v4 5/6] builtin/repack.c: simplify cruft pack aggregation Taylor Blau
@ 2025-03-11 0:21 ` Taylor Blau
2025-03-11 20:13 ` [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs Junio C Hamano
6 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-11 0:21 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren, Patrick Steinhardt
Once an object is written into a cruft pack, we can only freshen it by
writing a new loose or packed copy of that object with a more recent
mtime.
Prior to 61568efa95 (builtin/pack-objects.c: support `--max-pack-size`
with `--cruft`, 2023-08-28), we typically had at most one cruft pack in
a repository at any given time. So freshening unreachable objects was
straightforward when already rewriting the cruft pack (and its *.mtimes
file).
But 61568efa95 changes things: 'pack-objects' now supports writing
multiple cruft packs when invoked with `--cruft` and the
`--max-pack-size` flag. Cruft packs are rewritten until they reach some
size threshold, at which point they are considered "frozen", and will
only be modified in a pruning GC, or if the threshold itself is
adjusted.
Prior to this patch, however, this process breaks down when we attempt
to freshen an object packed in an earlier cruft pack, and that cruft
pack is larger than the threshold and thus will survive the repack.
When this is the case, it is impossible to freshen objects in cruft
pack(s) when those cruft packs are larger than the threshold. This is
because we would avoid writing them in the new cruft pack entirely, for
a couple of reasons.
1. When enumerating packed objects via 'add_objects_in_unpacked_packs()'
we pass the SKIP_IN_CORE_KEPT_PACKS, which is used to avoid looping
over the packs we're going to retain (which are marked as kept
in-core by 'read_cruft_objects()').
This means that we will avoid enumerating additional packed copies
of objects found in any cruft packs which are larger than the given
size threshold. Thus there is no opportunity to call
'create_object_entry()' whatsoever.
2. We likewise will discard the loose copy (if one exists) of any
unreachable object packed in a cruft pack that is larger than the
threshold. Here our call path is 'add_unreachable_loose_objects()',
which uses the 'add_loose_object()' callback.
That function will eventually land us in 'want_object_in_pack()'
(via 'add_cruft_object_entry()'), and we'll discard the object as it
appears in one of the packs which we marked as kept in-core.
This means in effect that it is impossible to freshen an unreachable
object once it appears in a cruft pack larger than the given threshold.
Instead, we should pack an additional copy of an unreachable object we
want to freshen even if it appears in a cruft pack, provided that the
cruft copy has an mtime which is before the mtime of the copy we are
trying to pack/freshen. This is sub-optimal in the sense that it
requires keeping an additional copy of unreachable objects upon
freshening, but we don't have a better alternative without the ability
to make in-place modifications to existing *.mtimes files.
In order to implement this, we have to adjust the behavior of
'want_found_object()'. When 'pack-objects' is told that we're *not*
going to retain any cruft packs (i.e. the set of packs marked as kept
in-core does not contain a cruft pack), the behavior is unchanged.
But when there *is* at least one cruft pack that we're holding onto, it
is no longer sufficient to reject a copy of an object found in that
cruft pack for that reason alone. In this case, we only want to reject a
candidate object when copies of that object either:
- exists in a non-cruft pack that we are retaining, regardless of that
pack's mtime, or
- exists in a cruft pack with an mtime at least as recent as the copy
we are debating whether or not to pack, in which case freshening
would be redundant.
To do this, keep track of whether or not we have any cruft packs in our
in-core kept list with a new 'ignore_packed_keep_in_core_has_cruft'
flag. When we end up in this new special case, we replace a call to
'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only reject
objects when we have a copy in an existing cruft pack with at least as
recent an mtime as our candidate (in which case "freshening" would be
redundant).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 118 ++++++++++++++++++++++++++++++++++------
packfile.c | 3 +-
packfile.h | 2 +
t/t7704-repack-cruft.sh | 63 +++++++++++++++++++++
4 files changed, 168 insertions(+), 18 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index f701b4c9ec..1a3e7dd3d3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -206,6 +206,7 @@ static int have_non_local_packs;
static int incremental;
static int ignore_packed_keep_on_disk;
static int ignore_packed_keep_in_core;
+static int ignore_packed_keep_in_core_has_cruft;
static int allow_ofs_delta;
static struct pack_idx_option pack_idx_opts;
static const char *base_name;
@@ -1530,8 +1531,60 @@ static int have_duplicate_entry(const struct object_id *oid,
return 1;
}
+static int want_cruft_object_mtime(struct repository *r,
+ const struct object_id *oid,
+ unsigned flags, uint32_t mtime)
+{
+ struct packed_git **cache;
+
+ for (cache = kept_pack_cache(r, flags); *cache; cache++) {
+ struct packed_git *p = *cache;
+ off_t ofs;
+ uint32_t candidate_mtime;
+
+ ofs = find_pack_entry_one(oid, p);
+ if (!ofs)
+ continue;
+
+ /*
+ * We have a copy of the object 'oid' in a non-cruft
+ * pack. We can avoid packing an additional copy
+ * regardless of what the existing copy's mtime is since
+ * it is outside of a cruft pack.
+ */
+ if (!p->is_cruft)
+ return 0;
+
+ /*
+ * If we have a copy of the object 'oid' in a cruft
+ * pack, then either read the cruft pack's mtime for
+ * that object, or, if that can't be loaded, assume the
+ * pack's mtime itself.
+ */
+ if (!load_pack_mtimes(p)) {
+ uint32_t pos;
+ if (offset_to_pack_pos(p, ofs, &pos) < 0)
+ continue;
+ candidate_mtime = nth_packed_mtime(p, pos);
+ } else {
+ candidate_mtime = p->mtime;
+ }
+
+ /*
+ * We have a surviving copy of the object in a cruft
+ * pack whose mtime is greater than or equal to the one
+ * we are considering. We can thus avoid packing an
+ * additional copy of that object.
+ */
+ if (mtime <= candidate_mtime)
+ return 0;
+ }
+
+ return -1;
+}
+
static int want_found_object(const struct object_id *oid, int exclude,
- struct packed_git *p)
+ struct packed_git *p, uint32_t mtime)
{
if (exclude)
return 1;
@@ -1581,12 +1634,29 @@ static int want_found_object(const struct object_id *oid, int exclude,
if (ignore_packed_keep_in_core)
flags |= IN_CORE_KEEP_PACKS;
- if (ignore_packed_keep_on_disk && p->pack_keep)
- return 0;
- if (ignore_packed_keep_in_core && p->pack_keep_in_core)
- return 0;
- if (has_object_kept_pack(p->repo, oid, flags))
- return 0;
+ /*
+ * If the object is in a pack that we want to ignore, *and* we
+ * don't have any cruft packs that are being retained, we can
+ * abort quickly.
+ */
+ if (!ignore_packed_keep_in_core_has_cruft) {
+ if (ignore_packed_keep_on_disk && p->pack_keep)
+ return 0;
+ if (ignore_packed_keep_in_core && p->pack_keep_in_core)
+ return 0;
+ if (has_object_kept_pack(p->repo, oid, flags))
+ return 0;
+ } else {
+ /*
+ * But if there is at least one cruft pack which
+ * is being kept, we only want to include the
+ * provided object if it has a strictly greater
+ * mtime than any existing cruft copy.
+ */
+ if (!want_cruft_object_mtime(p->repo, oid, flags,
+ mtime))
+ return 0;
+ }
}
/*
@@ -1605,7 +1675,8 @@ static int want_object_in_pack_one(struct packed_git *p,
const struct object_id *oid,
int exclude,
struct packed_git **found_pack,
- off_t *found_offset)
+ off_t *found_offset,
+ uint32_t found_mtime)
{
off_t offset;
@@ -1621,7 +1692,7 @@ static int want_object_in_pack_one(struct packed_git *p,
*found_offset = offset;
*found_pack = p;
}
- return want_found_object(oid, exclude, p);
+ return want_found_object(oid, exclude, p, found_mtime);
}
return -1;
}
@@ -1635,10 +1706,11 @@ static int want_object_in_pack_one(struct packed_git *p,
* function finds if there is any pack that has the object and returns the pack
* and its offset in these variables.
*/
-static int want_object_in_pack(const struct object_id *oid,
- int exclude,
- struct packed_git **found_pack,
- off_t *found_offset)
+static int want_object_in_pack_mtime(const struct object_id *oid,
+ int exclude,
+ struct packed_git **found_pack,
+ off_t *found_offset,
+ uint32_t found_mtime)
{
int want;
struct list_head *pos;
@@ -1653,7 +1725,8 @@ static int want_object_in_pack(const struct object_id *oid,
* are present we will determine the answer right now.
*/
if (*found_pack) {
- want = want_found_object(oid, exclude, *found_pack);
+ want = want_found_object(oid, exclude, *found_pack,
+ found_mtime);
if (want != -1)
return want;
@@ -1664,7 +1737,7 @@ static int want_object_in_pack(const struct object_id *oid,
for (m = get_multi_pack_index(the_repository); m; m = m->next) {
struct pack_entry e;
if (fill_midx_entry(the_repository, oid, &e, m)) {
- want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset);
+ want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset, found_mtime);
if (want != -1)
return want;
}
@@ -1672,7 +1745,7 @@ static int want_object_in_pack(const struct object_id *oid,
list_for_each(pos, get_packed_git_mru(the_repository)) {
struct packed_git *p = list_entry(pos, struct packed_git, mru);
- want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset);
+ want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset, found_mtime);
if (!exclude && want > 0)
list_move(&p->mru,
get_packed_git_mru(the_repository));
@@ -1702,6 +1775,15 @@ static int want_object_in_pack(const struct object_id *oid,
return 1;
}
+static inline int want_object_in_pack(const struct object_id *oid,
+ int exclude,
+ struct packed_git **found_pack,
+ off_t *found_offset)
+{
+ return want_object_in_pack_mtime(oid, exclude, found_pack, found_offset,
+ 0);
+}
+
static struct object_entry *create_object_entry(const struct object_id *oid,
enum object_type type,
uint32_t hash,
@@ -3634,7 +3716,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
entry->no_try_delta = no_try_delta(name);
}
} else {
- if (!want_object_in_pack(oid, 0, &pack, &offset))
+ if (!want_object_in_pack_mtime(oid, 0, &pack, &offset, mtime))
return;
if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
/*
@@ -3708,6 +3790,8 @@ static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
struct packed_git *p = item->util;
if (!p)
die(_("could not find pack '%s'"), item->string);
+ if (p->is_cruft && keep)
+ ignore_packed_keep_in_core_has_cruft = 1;
p->pack_keep_in_core = keep;
}
}
diff --git a/packfile.c b/packfile.c
index 2d80d80cb3..9d09f8bc72 100644
--- a/packfile.c
+++ b/packfile.c
@@ -24,6 +24,7 @@
#include "commit-graph.h"
#include "pack-revindex.h"
#include "promisor-remote.h"
+#include "pack-mtimes.h"
char *odb_pack_name(struct repository *r, struct strbuf *buf,
const unsigned char *hash, const char *ext)
@@ -2107,7 +2108,7 @@ static void maybe_invalidate_kept_pack_cache(struct repository *r,
r->objects->kept_pack_cache.flags = 0;
}
-static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
+struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
{
maybe_invalidate_kept_pack_cache(r, flags);
diff --git a/packfile.h b/packfile.h
index 00ada7a938..25097213d0 100644
--- a/packfile.h
+++ b/packfile.h
@@ -197,6 +197,8 @@ int has_object_pack(struct repository *r, const struct object_id *oid);
int has_object_kept_pack(struct repository *r, const struct object_id *oid,
unsigned flags);
+struct packed_git **kept_pack_cache(struct repository *r, unsigned flags);
+
/*
* Return 1 if an object in a promisor packfile is or refers to the given
* object, 0 otherwise.
diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index fb52bb36a2..3082e65817 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -242,6 +242,69 @@ test_expect_success '--max-cruft-size with freshened objects (packed)' '
)
'
+test_expect_success '--max-cruft-size with freshened objects (previously cruft)' '
+ git init max-cruft-size-threshold &&
+ (
+ cd max-cruft-size-threshold &&
+
+ test_commit base &&
+ foo="$(generate_random_blob foo $((2*1024*1024)))" &&
+ bar="$(generate_random_blob bar $((2*1024*1024)))" &&
+ baz="$(generate_random_blob baz $((2*1024*1024)))" &&
+
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$foo")" >foo.old &&
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$bar")" >bar.old &&
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$baz")" >baz.old &&
+
+ git repack --cruft -d &&
+
+ # Make an identical copy of foo stored in a pack with a more
+ # recent mtime.
+ foo="$(generate_random_blob foo $((2*1024*1024)))" &&
+ foo_pack="$(echo "$foo" | git pack-objects $packdir/pack)" &&
+ test-tool chmtime --get -100 \
+ "$packdir/pack-$foo_pack.pack" >foo.new &&
+ git prune-packed &&
+
+ # Make a loose copy of bar, also with a more recent mtime.
+ bar="$(generate_random_blob bar $((2*1024*1024)))" &&
+ test-tool chmtime --get -100 \
+ "$objdir/$(test_oid_to_path "$bar")" >bar.new &&
+
+ # Make a new cruft object $quux to ensure we do not
+ # generate an identical pack to the existing cruft
+ # pack.
+ quux="$(generate_random_blob quux $((1024)))" &&
+ test-tool chmtime --get -100 \
+ "$objdir/$(test_oid_to_path "$quux")" >quux.new &&
+
+ git repack --cruft --max-cruft-size=3M -d &&
+
+ for p in $packdir/pack-*.mtimes
+ do
+ test-tool pack-mtimes "$(basename "$p")" || return 1
+ done >actual.raw &&
+ sort actual.raw >actual &&
+
+ # Among the set of all cruft packs, we should see both
+ # mtimes for object $foo and $bar, as well as the
+ # single new copy of $baz.
+ sort >expect <<-EOF &&
+ $foo $(cat foo.old)
+ $foo $(cat foo.new)
+ $bar $(cat bar.old)
+ $bar $(cat bar.new)
+ $baz $(cat baz.old)
+ $quux $(cat quux.new)
+ EOF
+
+ test_cmp expect actual
+ )
+'
+
test_expect_success '--max-cruft-size with pruning' '
git init max-cruft-size-prune &&
(
--
2.49.0.rc2.6.g9a1eecd400
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs
2025-03-11 0:21 ` [PATCH v4 0/6] " Taylor Blau
` (5 preceding siblings ...)
2025-03-11 0:21 ` [PATCH v4 6/6] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
@ 2025-03-11 20:13 ` Junio C Hamano
2025-03-12 15:33 ` Taylor Blau
6 siblings, 1 reply; 54+ messages in thread
From: Junio C Hamano @ 2025-03-11 20:13 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt
Taylor Blau <me@ttaylorr.com> writes:
> Here is a slightly larger reroll of my series to fix object freshening
> when using multi-cruft packs that I have been meaning to send for a
> couple of days.
>
> I realized after sending the last round that not only was the first
> commit from v1 flawed (for the reasons Patrick identified) but that
> there is currently no way to grow a new cruft pack past the configured
> limit.
>
> Independent of this series suppose for example that we have two 100 MiB
> packs, and the threshold is 200 MiB. We should able to in theory combine
> those packs together. But we can't! The largest pack we'll make is
> 199MiB (and change), since builtin/pack-objects.c::write_one() will
> refuse to write any object which would bust the limit given by
> --max-pack-size.
I am not sure why that is a problem. If we have many loose objects
and the threshold is set at 200, wouldn't we also give up at 199
plus a change when packing these loose objects into a pack? If the
last object that makes us bust the threshold is unusually large, say
50, we may give up at 150 plus a bit, unless we go back to the queue
and pick smaller objects among the remaining ones to fill the
remaining 50 minus a bit, and because we do not do that to enforce
max-pack-size, I am not sure how "give up just before the threshold"
is too bad and needs to be replaced with "give up just after".
Or is the problem that the threshold is applied differently based on
where the objects come from? E.g., packing many loose objects would
stop just after, but repacking from cruft would stop just before, or
something? If the problem is that we are inconsistent, then I would
understand that it may be good to make things consistent.
> This series resurrects the first patch from v1 after introducing a
> behavior change for 'git pack-objects --cruft --max-pack-size'. When
> given with '--cruft', '--max-pack-size' now allows pack-objects to grow
> a pack *just* past the given limit by at most one object.
And what happens when the last object appended is very large, like
70? Would we end up with 270 when the threshold says 200?
I still am not getting what you are trying to explain in the above
two paragraphs, but in general, "give up just before" would be a
better choice than "give up just after", exactly because the threshold
we are letting the user to give is the maximum.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold
2025-03-11 0:21 ` [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold Taylor Blau
@ 2025-03-11 21:59 ` Junio C Hamano
2025-03-12 15:22 ` Taylor Blau
0 siblings, 1 reply; 54+ messages in thread
From: Junio C Hamano @ 2025-03-11 21:59 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt
Taylor Blau <me@ttaylorr.com> writes:
> When generating multiple cruft packs with 'git repack --max-cruft-size',
> we use 'git pack-objects --cruft --max-pack-size' (with many other
> elided options), filling in the '--max-pack-size' value with whatever
> was provided via the '--max-cruft-size' flag.
>
> This causes us to generate a pack that is smaller than the specified
> threshold. This poses a problem since we will never be able to generate
> a cruft pack that crosses the threshold.
So far I see absolutely *NO* problem described in the above. The
user said "I want to chop them into 200MB pieces but do not exceed
the threshold" and the system honored that wish.
> In effect, this means that we
> will try and repack its contents over and over again.
The end effect however may be problematic, but isn't it due to the
way when to repack is determined? You see 199MB piece of cruft pack
plus some other cruft data. You have generated no new cruft and no
existing cruft expired out, but you do not know these facts until
you try to repack. Because 200MB is the limit, you include the
199MB one as part of the ones to be recombined into the new cruft
pack because 199MB is smaller than 200MB and you do not know that
the reason why it is 199MB is because the earlier repack operation
found all remaining cruft material to be larger than 1MB; if there
were a 0.5MB cruft, it may have made it closer to 200MB.
So would it be feasible to remember how 199MB cruft pack is lying in
the object store (i.e. earlier we packed as much as possible), and
add a logic that says "if there is nothing to expire out of this
one, do not attempt to repack---this is fine as-is"?
> Instead, change the meaning of '--max-pack-size' in pack-objects when
> combined with '--cruft'. When put together, '--max-pack-size' allows the
> pack to grow larger than the specified threshold, but only by one
> additional object.
I do not think that would work well. You have no control over the
size of that one additional object---it may weigh more than 100MB,
combining your 199MB cruft pack with something else to make it ~300MB
cruft. In other words, "just a little bit larger" sounds like a
wishful thinking handwaving.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold
2025-03-11 21:59 ` Junio C Hamano
@ 2025-03-12 15:22 ` Taylor Blau
2025-03-12 18:26 ` Junio C Hamano
0 siblings, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2025-03-12 15:22 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt
On Tue, Mar 11, 2025 at 02:59:10PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > When generating multiple cruft packs with 'git repack --max-cruft-size',
> > we use 'git pack-objects --cruft --max-pack-size' (with many other
> > elided options), filling in the '--max-pack-size' value with whatever
> > was provided via the '--max-cruft-size' flag.
> >
> > This causes us to generate a pack that is smaller than the specified
> > threshold. This poses a problem since we will never be able to generate
> > a cruft pack that crosses the threshold.
>
> So far I see absolutely *NO* problem described in the above. The
> user said "I want to chop them into 200MB pieces but do not exceed
> the threshold" and the system honored that wish.
>
> > In effect, this means that we
> > will try and repack its contents over and over again.
>
> The end effect however may be problematic, but isn't it due to the
> way when to repack is determined? You see 199MB piece of cruft pack
> plus some other cruft data. You have generated no new cruft and no
> existing cruft expired out, but you do not know these facts until
> you try to repack. Because 200MB is the limit, you include the
> 199MB one as part of the ones to be recombined into the new cruft
> pack because 199MB is smaller than 200MB and you do not know that
> the reason why it is 199MB is because the earlier repack operation
> found all remaining cruft material to be larger than 1MB; if there
> were a 0.5MB cruft, it may have made it closer to 200MB.
>
> So would it be feasible to remember how 199MB cruft pack is lying in
> the object store (i.e. earlier we packed as much as possible), and
> add a logic that says "if there is nothing to expire out of this
> one, do not attempt to repack---this is fine as-is"?
I had a similar thought when first thinking about multi-cruft packs, but
the line of thinking is somewhat flawed. When we do a pruning GC, the
vast majority of objects should be expired out of the repository,
leaving only the recent ones that have mtime newer than the cutoff. So
the majority of packs in this case should all be removed, and the small
amount of cruft data remaining can be repacked into a small number of
packs relatively quickly.
> > Instead, change the meaning of '--max-pack-size' in pack-objects when
> > combined with '--cruft'. When put together, '--max-pack-size' allows the
> > pack to grow larger than the specified threshold, but only by one
> > additional object.
>
> I do not think that would work well. You have no control over the
> size of that one additional object---it may weigh more than 100MB,
> combining your 199MB cruft pack with something else to make it ~300MB
> cruft. In other words, "just a little bit larger" sounds like a
> wishful thinking handwaving.
I think that it is somewhat of a handwave, but I would note that our
current rules around --max-pack-size are not quite as strict as I
originally thought. If you have a single object that is 100MB and your
pack limit is 50MB, then pack-objects will generate a 100MB pack today
containing just that object. So I don't think that our --max-pack-size
rules are quite that strict.
Here is the case that I am worried about:
Suppose you have a 100MB cruft limit, and there are two cruft packs in
the repository: one that is 99MB and another that is 1MB in size. Let's
suppose further that if you combine these two packs, the resulting pack
would be exactly 100MB in size.
Today, repack will say, "I have two packs that sum together to be the
value of --max-cruft-size", and mark them both to be removed (and
replaced with the combined pack generated by pack-objects). But if the
combined pack is exactly 100MB, then pack-objects will break the pack
into two just before the 100MB limit, and we'll end up with the same two
packs we started with.
Ideally we would combine those packs into one that is at most one
object's size larger than the threshold, and the steady state would be
to avoid repacking it further. But in current Git we will keep repacking
the two together, only to generate the same two packs we started with
forever.
So I think a reasonable stop-gap here is to let pack-objects generate
cruft packs with a --max-pack-size that are allowed to grow *just*
beyond the threshold by at most one object. Yes, that object can be
large, and so it's possible that you could end up with a pack that is
significantly larger in size than the threshold, if the one-extra-object
is itself large.
But the point of --max-pack-size in conjunction with --cruft is not in
the original spirit of --max-pack-size, which was to work around
filesystems that don't do well with large files. Instead, the utility
here is to bound the amount of repacking we have to do when generating
cruft packs in repositories that have many unreachable objects.
In other words, if my --max-cruft-size is 1G, and I have 20GB of cruft
data, I am less concerned about generating a pack that is 1.1G in size
than I am about repeatedly repacking the same 20GB over and over again
each time I want to add a single unreachable object.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs
2025-03-11 20:13 ` [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs Junio C Hamano
@ 2025-03-12 15:33 ` Taylor Blau
2025-03-12 18:28 ` Junio C Hamano
0 siblings, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2025-03-12 15:33 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt
On Tue, Mar 11, 2025 at 01:13:20PM -0700, Junio C Hamano wrote:
> > This series resurrects the first patch from v1 after introducing a
> > behavior change for 'git pack-objects --cruft --max-pack-size'. When
> > given with '--cruft', '--max-pack-size' now allows pack-objects to grow
> > a pack *just* past the given limit by at most one object.
>
> And what happens when the last object appended is very large, like
> 70? Would we end up with 270 when the threshold says 200?
>
> I still am not getting what you are trying to explain in the above
> two paragraphs, but in general, "give up just before" would be a
> better choice than "give up just after", exactly because the threshold
> we are letting the user to give is the maximum.
I think this is similar to the discussion earlier in the thread, but let
me know if there is something here I'm missing.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold
2025-03-12 15:22 ` Taylor Blau
@ 2025-03-12 18:26 ` Junio C Hamano
2025-03-12 19:02 ` Taylor Blau
0 siblings, 1 reply; 54+ messages in thread
From: Junio C Hamano @ 2025-03-12 18:26 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt
Taylor Blau <me@ttaylorr.com> writes:
>> So would it be feasible to remember how 199MB cruft pack is lying in
>> the object store (i.e. earlier we packed as much as possible), and
>> add a logic that says "if there is nothing to expire out of this
>> one, do not attempt to repack---this is fine as-is"?
> .... So
> the majority of packs in this case should all be removed, and the small
> amount of cruft data remaining can be repacked into a small number of
> packs relatively quickly.
Given the above ...
> Suppose you have a 100MB cruft limit, and there are two cruft packs in
> the repository: one that is 99MB and another that is 1MB in size. Let's
> suppose further that if you combine these two packs, the resulting pack
> would be exactly 100MB in size.
>
> Today, repack will say, "I have two packs that sum together to be the
> value of --max-cruft-size", and mark them both to be removed (and
> replaced with the combined pack generated by pack-objects).
... yes, this logic to reach the above decision is exactly what I
said is broken. Is there no way to fix that?
> But if the
> combined pack is exactly 100MB, then pack-objects will break the pack
> into two just before the 100MB limit, and we'll end up with the same two
> packs we started with.
If "the majority of packs should all be removed and remainder combined"
you stated earlier is true, then this case falls in a tiny minority
that we do not have to worry about, doesn't it?
> Ideally we would combine those packs into one that is at most one
> object's size larger than the threshold, and the steady state would be
> to avoid repacking it further.
I do not see why you can call that "Ideally". Ideally, we would
combine those packs to create a pack (or two) without busting the
threshold, *and* avoid needress repacking. Busting the given limit
should not be part of the definition of "ideal" solution.
> But in current Git we will keep repacking
> the two together, only to generate the same two packs we started with
> forever.
Yes. That is because the logic that decides these packs need to be
broken and recombined is flawed. Maybe it does not have sufficient
information to decide that it is no use to attempt combining them,
in which case leaving some more info to help the later invocation of
repack to tell that it would be useless to attempt combining these
packs when you do the initial repack would help, which was what I
suggested. You've thought about the issue much longer than I did,
and would be able to come up with better ideas.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs
2025-03-12 15:33 ` Taylor Blau
@ 2025-03-12 18:28 ` Junio C Hamano
2025-03-12 19:04 ` Taylor Blau
0 siblings, 1 reply; 54+ messages in thread
From: Junio C Hamano @ 2025-03-12 18:28 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt
Taylor Blau <me@ttaylorr.com> writes:
> On Tue, Mar 11, 2025 at 01:13:20PM -0700, Junio C Hamano wrote:
>> > This series resurrects the first patch from v1 after introducing a
>> > behavior change for 'git pack-objects --cruft --max-pack-size'. When
>> > given with '--cruft', '--max-pack-size' now allows pack-objects to grow
>> > a pack *just* past the given limit by at most one object.
>>
>> And what happens when the last object appended is very large, like
>> 70? Would we end up with 270 when the threshold says 200?
>>
>> I still am not getting what you are trying to explain in the above
>> two paragraphs, but in general, "give up just before" would be a
>> better choice than "give up just after", exactly because the threshold
>> we are letting the user to give is the maximum.
>
> I think this is similar to the discussion earlier in the thread, but let
> me know if there is something here I'm missing.
I think the only thing you are missing is that max specified is the
ceiling, and "you can bust it, hoping by a little but you do not
know how huge the error is" is unacceptable.
Thanks.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold
2025-03-12 18:26 ` Junio C Hamano
@ 2025-03-12 19:02 ` Taylor Blau
2025-03-12 19:13 ` Elijah Newren
0 siblings, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2025-03-12 19:02 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt
On Wed, Mar 12, 2025 at 11:26:53AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> >> So would it be feasible to remember how 199MB cruft pack is lying in
> >> the object store (i.e. earlier we packed as much as possible), and
> >> add a logic that says "if there is nothing to expire out of this
> >> one, do not attempt to repack---this is fine as-is"?
> > .... So
> > the majority of packs in this case should all be removed, and the small
> > amount of cruft data remaining can be repacked into a small number of
> > packs relatively quickly.
>
> Given the above ...
>
> > Suppose you have a 100MB cruft limit, and there are two cruft packs in
> > the repository: one that is 99MB and another that is 1MB in size. Let's
> > suppose further that if you combine these two packs, the resulting pack
> > would be exactly 100MB in size.
> >
> > Today, repack will say, "I have two packs that sum together to be the
> > value of --max-cruft-size", and mark them both to be removed (and
> > replaced with the combined pack generated by pack-objects).
>
> ... yes, this logic to reach the above decision is exactly what I
> said is broken. Is there no way to fix that?
>
> > But if the
> > combined pack is exactly 100MB, then pack-objects will break the pack
> > into two just before the 100MB limit, and we'll end up with the same two
> > packs we started with.
>
> If "the majority of packs should all be removed and remainder combined"
> you stated earlier is true, then this case falls in a tiny minority
> that we do not have to worry about, doesn't it?
Yeah, it is a niche case. But the more I think about it the more I see
it your way. I apologize for all of the back-and-forth here, this is
quite a tricky problem to think through (at least for me), so I
appreciate your patience.
The original implementation in repack was designed to aggregate smaller
cruft packs together first until the combined size exceeds the
threshold. So the above would all be true if no new unreachable objects
were ever added to the repository, but if another 1MB cruft pack
appears, then we would:
- See the first 1MB pack, and decide we can repack it as it's under
the 100MB threshold.
- See the second 1MB pack, and repack it for the similar reasons (this
time because 1+1<100, not 1<100).
- See the 100MB pack, and refuse to repack it because the combined
size of 102MB would be over the threshold.
So I think it's reasonable that if we keep the current behavior of
repacking the smaller ones first that this case is niche enough for me
to feel OK not worrying about it too much.
And, yes, breaking --max-pack-size when given with --cruft is ugly.
> > But in current Git we will keep repacking
> > the two together, only to generate the same two packs we started with
> > forever.
>
> Yes. That is because the logic that decides these packs need to be
> broken and recombined is flawed. Maybe it does not have sufficient
> information to decide that it is no use to attempt combining them,
> in which case leaving some more info to help the later invocation of
> repack to tell that it would be useless to attempt combining these
> packs when you do the initial repack would help, which was what I
> suggested. You've thought about the issue much longer than I did,
> and would be able to come up with better ideas.
I think in the short term I came up with a worse idea than you would
have ;-).
Probably there is a way to improve this niche case as described above,
but I think the solution space is probably complicated enough that given
how narrow of a case it is that it's not worth introducing that much
complexity.
So I think at this point we should consider v3 to be the "good" version,
and discard this v4. I can re-send v3 as v5 if you'd like, or I can
avoid it if it would be redundant. Either is fine, just LMK.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs
2025-03-12 18:28 ` Junio C Hamano
@ 2025-03-12 19:04 ` Taylor Blau
2025-03-12 19:46 ` Junio C Hamano
2025-03-13 6:29 ` Jeff King
0 siblings, 2 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-12 19:04 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt
On Wed, Mar 12, 2025 at 11:28:17AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > On Tue, Mar 11, 2025 at 01:13:20PM -0700, Junio C Hamano wrote:
> >> > This series resurrects the first patch from v1 after introducing a
> >> > behavior change for 'git pack-objects --cruft --max-pack-size'. When
> >> > given with '--cruft', '--max-pack-size' now allows pack-objects to grow
> >> > a pack *just* past the given limit by at most one object.
> >>
> >> And what happens when the last object appended is very large, like
> >> 70? Would we end up with 270 when the threshold says 200?
> >>
> >> I still am not getting what you are trying to explain in the above
> >> two paragraphs, but in general, "give up just before" would be a
> >> better choice than "give up just after", exactly because the threshold
> >> we are letting the user to give is the maximum.
> >
> > I think this is similar to the discussion earlier in the thread, but let
> > me know if there is something here I'm missing.
>
> I think the only thing you are missing is that max specified is the
> ceiling, and "you can bust it, hoping by a little but you do not
> know how huge the error is" is unacceptable.
I agree that in the general case it is unacceptable. I think I might see
it slightly different than you, since for cruft packs the idea is to
bound the working set of what you're repacking using the size of the
resulting packs as a proxy for the former.
Maybe we should call the option something else that makes the cruft pack
use-case clearer. But in the other thread I came around to the idea that
this case is too niche to address completely, so I think we can discard
this round as a terrible idea.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold
2025-03-12 19:02 ` Taylor Blau
@ 2025-03-12 19:13 ` Elijah Newren
2025-03-12 19:33 ` Taylor Blau
2025-03-12 20:43 ` Junio C Hamano
0 siblings, 2 replies; 54+ messages in thread
From: Elijah Newren @ 2025-03-12 19:13 UTC (permalink / raw)
To: Taylor Blau; +Cc: Junio C Hamano, git, Jeff King, Patrick Steinhardt
On Wed, Mar 12, 2025 at 12:02 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Wed, Mar 12, 2025 at 11:26:53AM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > >> So would it be feasible to remember how 199MB cruft pack is lying in
> > >> the object store (i.e. earlier we packed as much as possible), and
> > >> add a logic that says "if there is nothing to expire out of this
> > >> one, do not attempt to repack---this is fine as-is"?
> > > .... So
> > > the majority of packs in this case should all be removed, and the small
> > > amount of cruft data remaining can be repacked into a small number of
> > > packs relatively quickly.
> >
> > Given the above ...
> >
> > > Suppose you have a 100MB cruft limit, and there are two cruft packs in
> > > the repository: one that is 99MB and another that is 1MB in size. Let's
> > > suppose further that if you combine these two packs, the resulting pack
> > > would be exactly 100MB in size.
> > >
> > > Today, repack will say, "I have two packs that sum together to be the
> > > value of --max-cruft-size", and mark them both to be removed (and
> > > replaced with the combined pack generated by pack-objects).
> >
> > ... yes, this logic to reach the above decision is exactly what I
> > said is broken. Is there no way to fix that?
> >
> > > But if the
> > > combined pack is exactly 100MB, then pack-objects will break the pack
> > > into two just before the 100MB limit, and we'll end up with the same two
> > > packs we started with.
> >
> > If "the majority of packs should all be removed and remainder combined"
> > you stated earlier is true, then this case falls in a tiny minority
> > that we do not have to worry about, doesn't it?
>
> Yeah, it is a niche case. But the more I think about it the more I see
> it your way. I apologize for all of the back-and-forth here, this is
> quite a tricky problem to think through (at least for me), so I
> appreciate your patience.
>
> The original implementation in repack was designed to aggregate smaller
> cruft packs together first until the combined size exceeds the
> threshold. So the above would all be true if no new unreachable objects
> were ever added to the repository, but if another 1MB cruft pack
> appears, then we would:
>
> - See the first 1MB pack, and decide we can repack it as it's under
> the 100MB threshold.
>
> - See the second 1MB pack, and repack it for the similar reasons (this
> time because 1+1<100, not 1<100).
>
> - See the 100MB pack, and refuse to repack it because the combined
> size of 102MB would be over the threshold.
>
> So I think it's reasonable that if we keep the current behavior of
> repacking the smaller ones first that this case is niche enough for me
> to feel OK not worrying about it too much.
>
> And, yes, breaking --max-pack-size when given with --cruft is ugly.
>
> > > But in current Git we will keep repacking
> > > the two together, only to generate the same two packs we started with
> > > forever.
> >
> > Yes. That is because the logic that decides these packs need to be
> > broken and recombined is flawed. Maybe it does not have sufficient
> > information to decide that it is no use to attempt combining them,
> > in which case leaving some more info to help the later invocation of
> > repack to tell that it would be useless to attempt combining these
> > packs when you do the initial repack would help, which was what I
> > suggested. You've thought about the issue much longer than I did,
> > and would be able to come up with better ideas.
>
> I think in the short term I came up with a worse idea than you would
> have ;-).
>
> Probably there is a way to improve this niche case as described above,
> but I think the solution space is probably complicated enough that given
> how narrow of a case it is that it's not worth introducing that much
> complexity.
Would it make sense to break the assumption that --max-cruft-size ==
--max-pack-size and perhaps rename the former? I think the problem is
that the two imply different things (one is a minimum, the other a
maximum), and thus really should be different values. E.g.
--combine-cruft-below-size that is set to e.g. half of
--max-pack-size, and then you can continue combining cruft packs
together until they do go above the cruft threshold, while avoiding
actually exceeding the pack size threshold?
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold
2025-03-12 19:13 ` Elijah Newren
@ 2025-03-12 19:33 ` Taylor Blau
2025-03-12 20:43 ` Junio C Hamano
1 sibling, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-12 19:33 UTC (permalink / raw)
To: Elijah Newren; +Cc: Junio C Hamano, git, Jeff King, Patrick Steinhardt
On Wed, Mar 12, 2025 at 12:13:10PM -0700, Elijah Newren wrote:
> > > > But in current Git we will keep repacking
> > > > the two together, only to generate the same two packs we started with
> > > > forever.
> > >
> > > Yes. That is because the logic that decides these packs need to be
> > > broken and recombined is flawed. Maybe it does not have sufficient
> > > information to decide that it is no use to attempt combining them,
> > > in which case leaving some more info to help the later invocation of
> > > repack to tell that it would be useless to attempt combining these
> > > packs when you do the initial repack would help, which was what I
> > > suggested. You've thought about the issue much longer than I did,
> > > and would be able to come up with better ideas.
> >
> > I think in the short term I came up with a worse idea than you would
> > have ;-).
> >
> > Probably there is a way to improve this niche case as described above,
> > but I think the solution space is probably complicated enough that given
> > how narrow of a case it is that it's not worth introducing that much
> > complexity.
>
> Would it make sense to break the assumption that --max-cruft-size ==
> --max-pack-size and perhaps rename the former? I think the problem is
> that the two imply different things (one is a minimum, the other a
> maximum), and thus really should be different values. E.g.
> --combine-cruft-below-size that is set to e.g. half of
> --max-pack-size, and then you can continue combining cruft packs
> together until they do go above the cruft threshold, while avoiding
> actually exceeding the pack size threshold?
We could, though alternatively I think leaving the behavior as is
presented in v3 is equally OK.
We'll never make a pack that is as large or larger than the given
--max-cruft-size, but because repack tries to aggregate smaller packs
together first, it's unlikely that we would ever repeatedly repack the
larger ones making them effectively "frozen", which is the goal.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs
2025-03-12 19:04 ` Taylor Blau
@ 2025-03-12 19:46 ` Junio C Hamano
2025-03-12 19:52 ` Taylor Blau
2025-03-13 6:29 ` Jeff King
1 sibling, 1 reply; 54+ messages in thread
From: Junio C Hamano @ 2025-03-12 19:46 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt
Taylor Blau <me@ttaylorr.com> writes:
> Maybe we should call the option something else that makes the cruft pack
> use-case clearer. But in the other thread I came around to the idea that
> this case is too niche to address completely, so I think we can discard
> this round as a terrible idea.
Yeah, I do not think "an earlier attempt wouldn't fill the bucket
fully, so a later logic may try to fill the bucket with the same
material in vain, failing the same way" is a problem that is limited
to cruft pack use case. If you are repacking real data into 200MB
chunks every week, the 198MB pack created last week that is lying in
the object store that houses the oldest part of the history is
unlikely to be improved by repacking it (since the attempt failed to
fill the remaining 2MB last week, and the oldest part of the history
hasn't changed) but it is entirely possible that with the pieces of
information left by the current set of tools, the machinery may not
realize that fact.
Allowing the resulting pack to deliberately exceed 200MB ceiling is
one possible way (I still do not think it is the best way among
possible ways) to mark the pack "frozen-do not further touch".
Now, if the essense of the problem is when to mark a pack that we
stopped appending to due to max-size limit as "frozen and do not
touch" to avoid this problem, how such a marking is done, and who
uses it to avoid repeated work that is known (to us) to be futile,
there should be better ways to do such a marking. We have .keep
mechanism which is ugly but proven to work and we know how to deal
with, for example.
I wonder if our pack header or trailer is extensible enough so that
we can record "this pack was closed due to max-size limit of 200MB
and appending even one more object in the queue would have busted
the limit".
No matter how such a marking is done, later repack that sees a pack
marked as such, when it is limited to 200MB, should be able to tell
that and act a bit more intelligently than what it currently does.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs
2025-03-12 19:46 ` Junio C Hamano
@ 2025-03-12 19:52 ` Taylor Blau
2025-03-13 17:17 ` Junio C Hamano
0 siblings, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2025-03-12 19:52 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt
On Wed, Mar 12, 2025 at 12:46:56PM -0700, Junio C Hamano wrote:
> Allowing the resulting pack to deliberately exceed 200MB ceiling is
> one possible way (I still do not think it is the best way among
> possible ways) to mark the pack "frozen-do not further touch".
>
> Now, if the essense of the problem is when to mark a pack that we
> stopped appending to due to max-size limit as "frozen and do not
> touch" to avoid this problem, how such a marking is done, and who
> uses it to avoid repeated work that is known (to us) to be futile,
> there should be better ways to do such a marking. We have .keep
> mechanism which is ugly but proven to work and we know how to deal
> with, for example.
Yeah... I think a .keep-like mechanism would be sufficient here to mark
packs which are close to the --max-cruft-size as frozen if we truly
wanted to avoid this problem.
But if we have, say, packs that are 198MB, 1MB, 2MB, and 3MB in size,
that we're going to accumulate in order of increasing size, so we would
almost never repack the 198MB pack.
The .keep mechanism by itself wouldn't quite do the trick, though,
because we'd have to distinguish between "marked as .kept by the user
and so we absolutely must not delete it" and "marked as frozen by us,
but we could delete it if we needed to (e.g., when pruning)".
So that would maybe suggest introducing a .frozen mechanism, but that is
bolting further ugliness onto the .keep mechanism, which I think is the
wrong direction to be going in.
> I wonder if our pack header or trailer is extensible enough so that
> we can record "this pack was closed due to max-size limit of 200MB
> and appending even one more object in the queue would have busted
> the limit".
>
> No matter how such a marking is done, later repack that sees a pack
> marked as such, when it is limited to 200MB, should be able to tell
> that and act a bit more intelligently than what it currently does.
Perhaps in the longer term, but I think for the reasons above that the
existing behavior (plus the new patch from v3, which we should still
queue) is sufficient.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold
2025-03-12 19:13 ` Elijah Newren
2025-03-12 19:33 ` Taylor Blau
@ 2025-03-12 20:43 ` Junio C Hamano
2025-03-12 20:49 ` Elijah Newren
1 sibling, 1 reply; 54+ messages in thread
From: Junio C Hamano @ 2025-03-12 20:43 UTC (permalink / raw)
To: Elijah Newren; +Cc: Taylor Blau, git, Jeff King, Patrick Steinhardt
Elijah Newren <newren@gmail.com> writes:
> Would it make sense to break the assumption that --max-cruft-size ==
> --max-pack-size and perhaps rename the former? I think the problem is
> that the two imply different things (one is a minimum, the other a
> maximum), and thus really should be different values. E.g.
> --combine-cruft-below-size that is set to e.g. half of
> --max-pack-size, and then you can continue combining cruft packs
> together until they do go above the cruft threshold, while avoiding
> actually exceeding the pack size threshold?
With below-size and max-size set to say 180 and 200 respectively, an
attempt to combine the crufts may end up filling a cruft pack to 170
but the smallest of the remaining cruft may weigh 40, which means
including it would cause the max-size to be exceeded. In such a
scenario, there may not be a solution to satisfy given constraints,
i.e. go above the below-size without stay below the max-size.
So I am not sure if the approach would really solve much.
Other than that a separate names, especially losing "max" from the
threshold that really does not mean "max", would solve the confusion
that comes from naming, that is.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold
2025-03-12 20:43 ` Junio C Hamano
@ 2025-03-12 20:49 ` Elijah Newren
2025-03-13 12:16 ` Junio C Hamano
0 siblings, 1 reply; 54+ messages in thread
From: Elijah Newren @ 2025-03-12 20:49 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Taylor Blau, git, Jeff King, Patrick Steinhardt
On Wed, Mar 12, 2025 at 1:43 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > Would it make sense to break the assumption that --max-cruft-size ==
> > --max-pack-size and perhaps rename the former? I think the problem is
> > that the two imply different things (one is a minimum, the other a
> > maximum), and thus really should be different values. E.g.
> > --combine-cruft-below-size that is set to e.g. half of
> > --max-pack-size, and then you can continue combining cruft packs
> > together until they do go above the cruft threshold, while avoiding
> > actually exceeding the pack size threshold?
>
> With below-size and max-size set to say 180 and 200 respectively, an
> attempt to combine the crufts may end up filling a cruft pack to 170
> but the smallest of the remaining cruft may weigh 40, which means
> including it would cause the max-size to be exceeded. In such a
> scenario, there may not be a solution to satisfy given constraints,
> i.e. go above the below-size without stay below the max-size.
>
> So I am not sure if the approach would really solve much.
>
> Other than that a separate names, especially losing "max" from the
> threshold that really does not mean "max", would solve the confusion
> that comes from naming, that is.
--max-pack-size is a constraint. --combine-cruft-below-size is not.
Think particularly of the case where the user doesn't even have any
cruft packs yet and has only accumulated a little bit of cruft. That
option is merely a guide post to say that if it's smaller than that
size, then feel free to keep trying to add to it (so long as it
doesn't violate constraints such as --max-pack-size).
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs
2025-03-12 19:04 ` Taylor Blau
2025-03-12 19:46 ` Junio C Hamano
@ 2025-03-13 6:29 ` Jeff King
2025-03-13 15:12 ` Junio C Hamano
1 sibling, 1 reply; 54+ messages in thread
From: Jeff King @ 2025-03-13 6:29 UTC (permalink / raw)
To: Taylor Blau; +Cc: Junio C Hamano, git, Elijah Newren, Patrick Steinhardt
On Wed, Mar 12, 2025 at 03:04:58PM -0400, Taylor Blau wrote:
> > > I think this is similar to the discussion earlier in the thread, but let
> > > me know if there is something here I'm missing.
> >
> > I think the only thing you are missing is that max specified is the
> > ceiling, and "you can bust it, hoping by a little but you do not
> > know how huge the error is" is unacceptable.
>
> I agree that in the general case it is unacceptable. I think I might see
> it slightly different than you, since for cruft packs the idea is to
> bound the working set of what you're repacking using the size of the
> resulting packs as a proxy for the former.
>
> Maybe we should call the option something else that makes the cruft pack
> use-case clearer. But in the other thread I came around to the idea that
> this case is too niche to address completely, so I think we can discard
> this round as a terrible idea.
Having read through the thread, I think the naming is really the issue.
Elijah's earlier response was the most clear to me: the new feature is
not a max size at all, but a --combine-cruft-below-size. Our goal is not
to have small packs, but to avoid rewriting bytes over and over again
(which would happen if we simply rolled all cruft packs into one on each
repack). But also to avoid having a zillion cruft packs (which would
happen if we never rolled them up).
So I think all of the discussion about "would we bust the max limit" is
mostly a red herring.
But also, I think there is no need to tell pack-objects to limit the
size of the resulting cruft pack at all. Let's say you have three cruft
packs with sizes 30MB, 20MB, and 10MB. I want to roll up so that each
cruft pack is at least 40MB. We could say (and this is what I think
your series does, based on our earlier off-list discussions, but please
correct me if things have changed):
1. All of those are candidates for rolling up, because they're below
our threshold.
2. We'll feed the packs along with the "max" size to pack-objects,
which will then roll it all up into a 40MB pack, plus a 20MB pack
left over. We'll have written all of the bytes once, but on the
next repack we'd only rewrite 20MB (plus whatever new cruft comes
along).
But do we actually care about eventually having a series of 40MB packs?
Or do we care about having some cutoff so that we don't rewrite those
first 40MB on subsequent repacks?
If the latter, then for step 2, what if we don't feed a max size? We'd
end up with one 60MB pack (again, having written all of the bytes once).
And on the next repack we'd leave it be (since it's over the threshold).
We'll start forming new packs, which will eventually aggregate to 40MB
(or possibly larger).
If I understand the main purpose of the series, it is that we must
rescue objects out of cruft packs if they became fresher (by having
loose copies made). But that is orthogonal to max pack sizes, isn't it?
We just need for pack-objects to be fed those objects (which should be
happening already) and decide _not_ to omit them based on their presence
in the kept cruft packs (based on the mtime in those cruft packs, of
course). Which looks like what your want_cruft_object_mtime() is doing.
-Peff
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold
2025-03-12 20:49 ` Elijah Newren
@ 2025-03-13 12:16 ` Junio C Hamano
2025-03-13 16:23 ` Elijah Newren
0 siblings, 1 reply; 54+ messages in thread
From: Junio C Hamano @ 2025-03-13 12:16 UTC (permalink / raw)
To: Elijah Newren; +Cc: Taylor Blau, git, Jeff King, Patrick Steinhardt
Elijah Newren <newren@gmail.com> writes:
>> With below-size and max-size set to say 180 and 200 respectively, an
>> attempt to combine the crufts may end up filling a cruft pack to 170
>> but the smallest of the remaining cruft may weigh 40, which means
>> including it would cause the max-size to be exceeded. In such a
>> scenario, there may not be a solution to satisfy given constraints,
>> i.e. go above the below-size without stay below the max-size.
>>
>> So I am not sure if the approach would really solve much.
>>
>> Other than that a separate names, especially losing "max" from the
>> threshold that really does not mean "max", would solve the confusion
>> that comes from naming, that is.
>
> --max-pack-size is a constraint. --combine-cruft-below-size is not.
> Think particularly of the case where the user doesn't even have any
> cruft packs yet and has only accumulated a little bit of cruft. That
> option is merely a guide post to say that if it's smaller than that
> size, then feel free to keep trying to add to it (so long as it
> doesn't violate constraints such as --max-pack-size).
That is correct and it is why I said the suggestion solves the name
confusion. But think about the sample situation, before and after
such a repack with two thresholds. You had below- and max-size set
to 180 and 200 respectively, and a cruft pack of size 170, and you
failed to grow that cruft pack beyond 180 because the next available
cruft weighed 40. Then you'll repeat the exercise again, find 170
that is smaller than the below- threshold, try to cram more and
would fail. Isn't that what Taylor's series wanted to prevent from
happening, and isn't the two-threshod approach supposed to be a way
to improve on it?
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs
2025-03-13 6:29 ` Jeff King
@ 2025-03-13 15:12 ` Junio C Hamano
0 siblings, 0 replies; 54+ messages in thread
From: Junio C Hamano @ 2025-03-13 15:12 UTC (permalink / raw)
To: Jeff King; +Cc: Taylor Blau, git, Elijah Newren, Patrick Steinhardt
Jeff King <peff@peff.net> writes:
> But do we actually care about eventually having a series of 40MB packs?
> Or do we care about having some cutoff so that we don't rewrite those
> first 40MB on subsequent repacks?
>
> If the latter, then for step 2, what if we don't feed a max size? We'd
> end up with one 60MB pack (again, having written all of the bytes once).
> And on the next repack we'd leave it be (since it's over the threshold).
> We'll start forming new packs, which will eventually aggregate to 40MB
> (or possibly larger).
>
> If I understand the main purpose of the series, it is that we must
> rescue objects out of cruft packs if they became fresher (by having
> loose copies made). But that is orthogonal to max pack sizes, isn't it?
> We just need for pack-objects to be fed those objects (which should be
> happening already) and decide _not_ to omit them based on their presence
> in the kept cruft packs (based on the mtime in those cruft packs, of
> course). Which looks like what your want_cruft_object_mtime() is doing.
Yeah, no packsize limit on the output side, but making sure that the
decision to roll up existing cruft packs is made sensibly, is what
the above gives us, which I life a lot. The one-before and -after
confusion came exactly because we somehow tried to have threshold on
the output side.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold
2025-03-13 12:16 ` Junio C Hamano
@ 2025-03-13 16:23 ` Elijah Newren
2025-03-13 17:06 ` Junio C Hamano
0 siblings, 1 reply; 54+ messages in thread
From: Elijah Newren @ 2025-03-13 16:23 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Taylor Blau, git, Jeff King, Patrick Steinhardt
On Thu, Mar 13, 2025 at 5:16 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> >> With below-size and max-size set to say 180 and 200 respectively, an
> >> attempt to combine the crufts may end up filling a cruft pack to 170
> >> but the smallest of the remaining cruft may weigh 40, which means
> >> including it would cause the max-size to be exceeded. In such a
> >> scenario, there may not be a solution to satisfy given constraints,
> >> i.e. go above the below-size without stay below the max-size.
> >>
> >> So I am not sure if the approach would really solve much.
> >>
> >> Other than that a separate names, especially losing "max" from the
> >> threshold that really does not mean "max", would solve the confusion
> >> that comes from naming, that is.
> >
> > --max-pack-size is a constraint. --combine-cruft-below-size is not.
> > Think particularly of the case where the user doesn't even have any
> > cruft packs yet and has only accumulated a little bit of cruft. That
> > option is merely a guide post to say that if it's smaller than that
> > size, then feel free to keep trying to add to it (so long as it
> > doesn't violate constraints such as --max-pack-size).
>
> That is correct and it is why I said the suggestion solves the name
> confusion. But think about the sample situation, before and after
> such a repack with two thresholds. You had below- and max-size set
> to 180 and 200 respectively, and a cruft pack of size 170, and you
> failed to grow that cruft pack beyond 180 because the next available
> cruft weighed 40. Then you'll repeat the exercise again, find 170
> that is smaller than the below- threshold, try to cram more and
> would fail. Isn't that what Taylor's series wanted to prevent from
> happening, and isn't the two-threshod approach supposed to be a way
> to improve on it?
I don't think "always combine" is necessary for improvement. Perhaps,
in your example, this round of repacking can't combine things. But
the next time we want to repack cruft objects and there is anything
new that (individually or collectively) weighs between 10 and 30, we
can add it and get something over the lower threshold and then ignore
the resulting cruft pack it in the future.
In contrast, the single threshold either has to violate the maximum
constraint, or always reconsider everything. The two threshold system
allows progress to be made (so long as it doesn't just look at the
first biggest object and fail every time), but particularly if you set
the thresholds too close to each other or you just have really large
cruft, then you _sometimes_ might not make progress.
Personally, I think I'd set the --combine-cruft-below-size to half of
--max-pack-size, because that guarantees that any two existing cruft
packs being considered for combining can be, and the resulting
combined cruft pack if big enough can then be ignored in the future.
In other words, this scheme would allow you to always make progress.
Am I missing something?
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold
2025-03-13 16:23 ` Elijah Newren
@ 2025-03-13 17:06 ` Junio C Hamano
0 siblings, 0 replies; 54+ messages in thread
From: Junio C Hamano @ 2025-03-13 17:06 UTC (permalink / raw)
To: Elijah Newren; +Cc: Taylor Blau, git, Jeff King, Patrick Steinhardt
Elijah Newren <newren@gmail.com> writes:
> Personally, I think I'd set the --combine-cruft-below-size to half of
> --max-pack-size, because that guarantees that any two existing cruft
> packs being considered for combining can be, and the resulting
> combined cruft pack if big enough can then be ignored in the future.
Yup, combine-below can be set to arbitrarily a small value, and ...
> In other words, this scheme would allow you to always make progress.
... I would even think that setting it to too high a value may be a
misconfiguration. As you outlined above, a half of the max-size
would probably be the practical maximum for the combine-below value
to avoid the issue and to always make progress.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs
2025-03-12 19:52 ` Taylor Blau
@ 2025-03-13 17:17 ` Junio C Hamano
2025-03-13 17:35 ` Taylor Blau
0 siblings, 1 reply; 54+ messages in thread
From: Junio C Hamano @ 2025-03-13 17:17 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt
Taylor Blau <me@ttaylorr.com> writes:
> Perhaps in the longer term, but I think for the reasons above that the
> existing behavior (plus the new patch from v3, which we should still
> queue) is sufficient.
The older iteration has a few loose ends <Z8l5hxNjEOALl_g-@pks.im>
we would want to tie. Elijah's "decision to combine should be made
at half or below the max size to always make a progress" would also
makes sense.
Thanks.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v3 1/1] builtin/pack-objects.c: freshen objects from existing cruft packs
2025-03-06 10:31 ` Patrick Steinhardt
@ 2025-03-13 17:32 ` Taylor Blau
0 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-13 17:32 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: git, Junio C Hamano, Jeff King, Elijah Newren
On Thu, Mar 06, 2025 at 11:31:35AM +0100, Patrick Steinhardt wrote:
> Minor nit: it is a bit unusual that a negative value, which typically
> indicates an error, is used as a boolean value here to indicate that we
> don't want to have the object.
This matches the convention of the other "do we want this object?"
functions, where returning -1 indicates that we don't yet know whether
or not we want the object, and should continue looking elsewhere.
Returning -1 from 'want_cruft_object_mtime()' doesn't mean that we can
necessarily return -1 immediately from its caller in
'want_found_object()' because we might pick it up from further down in
that function or in one of its callers.
So I think the return values of that function are consistent. Because
that function never says "yes, I want this object" and only "no" or
"maybe", we could return 0/1 indicating "no" or "maybe". But that feels
like a bug waiting to happen if someone later on mistakes "1" (which in
this hypothetical would mean "maybe") as "yes".
> > diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
> > index 959e6e26488..f427150de5b 100755
> > --- a/t/t7704-repack-cruft.sh
> > +++ b/t/t7704-repack-cruft.sh
> > @@ -304,6 +304,69 @@ test_expect_success '--max-cruft-size with freshened objects (packed)' '
> > )
> > '
> >
> > +test_expect_success '--max-cruft-size with freshened objects (previously cruft)' '
> > + git init max-cruft-size-threshold &&
>
> Let's also delete the repository via `test_when_finished`.
Eh. The point of naming these uniquely is that we don't have to remember
to clean them up afterwords, but I can do so if you want.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs
2025-03-13 17:17 ` Junio C Hamano
@ 2025-03-13 17:35 ` Taylor Blau
0 siblings, 0 replies; 54+ messages in thread
From: Taylor Blau @ 2025-03-13 17:35 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Jeff King, Elijah Newren, Patrick Steinhardt
On Thu, Mar 13, 2025 at 10:17:57AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Perhaps in the longer term, but I think for the reasons above that the
> > existing behavior (plus the new patch from v3, which we should still
> > queue) is sufficient.
>
> The older iteration has a few loose ends <Z8l5hxNjEOALl_g-@pks.im>
> we would want to tie. Elijah's "decision to combine should be made
> at half or below the max size to always make a progress" would also
> makes sense.
Let's split these two out. This series (or single patch, as it will
shortly become in v5) is fixing a legitimate bug in freshening
unreachable objects when they appear in multiple cruft packs.
That will allow us a little more time to think through the more
complicated issues while fixing the more straightforward ones quickly.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH v5] builtin/pack-objects.c: freshen objects from existing cruft packs
2025-02-27 18:29 [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs Taylor Blau
` (5 preceding siblings ...)
2025-03-11 0:21 ` [PATCH v4 0/6] " Taylor Blau
@ 2025-03-13 18:09 ` Taylor Blau
2025-03-13 18:41 ` Junio C Hamano
6 siblings, 1 reply; 54+ messages in thread
From: Taylor Blau @ 2025-03-13 18:09 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Jeff King, Elijah Newren
Once an object is written into a cruft pack, we can only freshen it by
writing a new loose or packed copy of that object with a more recent
mtime.
Prior to 61568efa95 (builtin/pack-objects.c: support `--max-pack-size`
with `--cruft`, 2023-08-28), we typically had at most one cruft pack in
a repository at any given time. So freshening unreachable objects was
straightforward when already rewriting the cruft pack (and its *.mtimes
file).
But 61568efa95 changes things: 'pack-objects' now supports writing
multiple cruft packs when invoked with `--cruft` and the
`--max-pack-size` flag. Cruft packs are rewritten until they reach some
size threshold, at which point they are considered "frozen", and will
only be modified in a pruning GC, or if the threshold itself is
adjusted.
Prior to this patch, however, this process breaks down when we attempt
to freshen an object packed in an earlier cruft pack, and that cruft
pack is larger than the threshold and thus will survive the repack.
When this is the case, it is impossible to freshen objects in cruft
pack(s) when those cruft packs are larger than the threshold. This is
because we would avoid writing them in the new cruft pack entirely, for
a couple of reasons.
1. When enumerating packed objects via 'add_objects_in_unpacked_packs()'
we pass the SKIP_IN_CORE_KEPT_PACKS, which is used to avoid looping
over the packs we're going to retain (which are marked as kept
in-core by 'read_cruft_objects()').
This means that we will avoid enumerating additional packed copies
of objects found in any cruft packs which are larger than the given
size threshold. Thus there is no opportunity to call
'create_object_entry()' whatsoever.
2. We likewise will discard the loose copy (if one exists) of any
unreachable object packed in a cruft pack that is larger than the
threshold. Here our call path is 'add_unreachable_loose_objects()',
which uses the 'add_loose_object()' callback.
That function will eventually land us in 'want_object_in_pack()'
(via 'add_cruft_object_entry()'), and we'll discard the object as it
appears in one of the packs which we marked as kept in-core.
This means in effect that it is impossible to freshen an unreachable
object once it appears in a cruft pack larger than the given threshold.
Instead, we should pack an additional copy of an unreachable object we
want to freshen even if it appears in a cruft pack, provided that the
cruft copy has an mtime which is before the mtime of the copy we are
trying to pack/freshen. This is sub-optimal in the sense that it
requires keeping an additional copy of unreachable objects upon
freshening, but we don't have a better alternative without the ability
to make in-place modifications to existing *.mtimes files.
In order to implement this, we have to adjust the behavior of
'want_found_object()'. When 'pack-objects' is told that we're *not*
going to retain any cruft packs (i.e. the set of packs marked as kept
in-core does not contain a cruft pack), the behavior is unchanged.
But when there *is* at least one cruft pack that we're holding onto, it
is no longer sufficient to reject a copy of an object found in that
cruft pack for that reason alone. In this case, we only want to reject a
candidate object when copies of that object either:
- exists in a non-cruft pack that we are retaining, regardless of that
pack's mtime, or
- exists in a cruft pack with an mtime at least as recent as the copy
we are debating whether or not to pack, in which case freshening
would be redundant.
To do this, keep track of whether or not we have any cruft packs in our
in-core kept list with a new 'ignore_packed_keep_in_core_has_cruft'
flag. When we end up in this new special case, we replace a call to
'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only reject
objects when we have a copy in an existing cruft pack with at least as
recent an mtime as our candidate (in which case "freshening" would be
redundant).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
builtin/pack-objects.c | 118 ++++++++++++++++++++++++++++++++++------
packfile.c | 3 +-
packfile.h | 2 +
t/t7704-repack-cruft.sh | 66 ++++++++++++++++++++++
4 files changed, 171 insertions(+), 18 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 58a9b16126..79e1e6fb52 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -206,6 +206,7 @@ static int have_non_local_packs;
static int incremental;
static int ignore_packed_keep_on_disk;
static int ignore_packed_keep_in_core;
+static int ignore_packed_keep_in_core_has_cruft;
static int allow_ofs_delta;
static struct pack_idx_option pack_idx_opts;
static const char *base_name;
@@ -1502,8 +1503,60 @@ static int have_duplicate_entry(const struct object_id *oid,
return 1;
}
+static int want_cruft_object_mtime(struct repository *r,
+ const struct object_id *oid,
+ unsigned flags, uint32_t mtime)
+{
+ struct packed_git **cache;
+
+ for (cache = kept_pack_cache(r, flags); *cache; cache++) {
+ struct packed_git *p = *cache;
+ off_t ofs;
+ uint32_t candidate_mtime;
+
+ ofs = find_pack_entry_one(oid, p);
+ if (!ofs)
+ continue;
+
+ /*
+ * We have a copy of the object 'oid' in a non-cruft
+ * pack. We can avoid packing an additional copy
+ * regardless of what the existing copy's mtime is since
+ * it is outside of a cruft pack.
+ */
+ if (!p->is_cruft)
+ return 0;
+
+ /*
+ * If we have a copy of the object 'oid' in a cruft
+ * pack, then either read the cruft pack's mtime for
+ * that object, or, if that can't be loaded, assume the
+ * pack's mtime itself.
+ */
+ if (!load_pack_mtimes(p)) {
+ uint32_t pos;
+ if (offset_to_pack_pos(p, ofs, &pos) < 0)
+ continue;
+ candidate_mtime = nth_packed_mtime(p, pos);
+ } else {
+ candidate_mtime = p->mtime;
+ }
+
+ /*
+ * We have a surviving copy of the object in a cruft
+ * pack whose mtime is greater than or equal to the one
+ * we are considering. We can thus avoid packing an
+ * additional copy of that object.
+ */
+ if (mtime <= candidate_mtime)
+ return 0;
+ }
+
+ return -1;
+}
+
static int want_found_object(const struct object_id *oid, int exclude,
- struct packed_git *p)
+ struct packed_git *p, uint32_t mtime)
{
if (exclude)
return 1;
@@ -1553,12 +1606,29 @@ static int want_found_object(const struct object_id *oid, int exclude,
if (ignore_packed_keep_in_core)
flags |= IN_CORE_KEEP_PACKS;
- if (ignore_packed_keep_on_disk && p->pack_keep)
- return 0;
- if (ignore_packed_keep_in_core && p->pack_keep_in_core)
- return 0;
- if (has_object_kept_pack(p->repo, oid, flags))
- return 0;
+ /*
+ * If the object is in a pack that we want to ignore, *and* we
+ * don't have any cruft packs that are being retained, we can
+ * abort quickly.
+ */
+ if (!ignore_packed_keep_in_core_has_cruft) {
+ if (ignore_packed_keep_on_disk && p->pack_keep)
+ return 0;
+ if (ignore_packed_keep_in_core && p->pack_keep_in_core)
+ return 0;
+ if (has_object_kept_pack(p->repo, oid, flags))
+ return 0;
+ } else {
+ /*
+ * But if there is at least one cruft pack which
+ * is being kept, we only want to include the
+ * provided object if it has a strictly greater
+ * mtime than any existing cruft copy.
+ */
+ if (!want_cruft_object_mtime(p->repo, oid, flags,
+ mtime))
+ return 0;
+ }
}
/*
@@ -1577,7 +1647,8 @@ static int want_object_in_pack_one(struct packed_git *p,
const struct object_id *oid,
int exclude,
struct packed_git **found_pack,
- off_t *found_offset)
+ off_t *found_offset,
+ uint32_t found_mtime)
{
off_t offset;
@@ -1593,7 +1664,7 @@ static int want_object_in_pack_one(struct packed_git *p,
*found_offset = offset;
*found_pack = p;
}
- return want_found_object(oid, exclude, p);
+ return want_found_object(oid, exclude, p, found_mtime);
}
return -1;
}
@@ -1607,10 +1678,11 @@ static int want_object_in_pack_one(struct packed_git *p,
* function finds if there is any pack that has the object and returns the pack
* and its offset in these variables.
*/
-static int want_object_in_pack(const struct object_id *oid,
- int exclude,
- struct packed_git **found_pack,
- off_t *found_offset)
+static int want_object_in_pack_mtime(const struct object_id *oid,
+ int exclude,
+ struct packed_git **found_pack,
+ off_t *found_offset,
+ uint32_t found_mtime)
{
int want;
struct list_head *pos;
@@ -1625,7 +1697,8 @@ static int want_object_in_pack(const struct object_id *oid,
* are present we will determine the answer right now.
*/
if (*found_pack) {
- want = want_found_object(oid, exclude, *found_pack);
+ want = want_found_object(oid, exclude, *found_pack,
+ found_mtime);
if (want != -1)
return want;
@@ -1636,7 +1709,7 @@ static int want_object_in_pack(const struct object_id *oid,
for (m = get_multi_pack_index(the_repository); m; m = m->next) {
struct pack_entry e;
if (fill_midx_entry(the_repository, oid, &e, m)) {
- want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset);
+ want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset, found_mtime);
if (want != -1)
return want;
}
@@ -1644,7 +1717,7 @@ static int want_object_in_pack(const struct object_id *oid,
list_for_each(pos, get_packed_git_mru(the_repository)) {
struct packed_git *p = list_entry(pos, struct packed_git, mru);
- want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset);
+ want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset, found_mtime);
if (!exclude && want > 0)
list_move(&p->mru,
get_packed_git_mru(the_repository));
@@ -1674,6 +1747,15 @@ static int want_object_in_pack(const struct object_id *oid,
return 1;
}
+static inline int want_object_in_pack(const struct object_id *oid,
+ int exclude,
+ struct packed_git **found_pack,
+ off_t *found_offset)
+{
+ return want_object_in_pack_mtime(oid, exclude, found_pack, found_offset,
+ 0);
+}
+
static struct object_entry *create_object_entry(const struct object_id *oid,
enum object_type type,
uint32_t hash,
@@ -3606,7 +3688,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
entry->no_try_delta = no_try_delta(name);
}
} else {
- if (!want_object_in_pack(oid, 0, &pack, &offset))
+ if (!want_object_in_pack_mtime(oid, 0, &pack, &offset, mtime))
return;
if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
/*
@@ -3680,6 +3762,8 @@ static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
struct packed_git *p = item->util;
if (!p)
die(_("could not find pack '%s'"), item->string);
+ if (p->is_cruft && keep)
+ ignore_packed_keep_in_core_has_cruft = 1;
p->pack_keep_in_core = keep;
}
}
diff --git a/packfile.c b/packfile.c
index 2d80d80cb3..9d09f8bc72 100644
--- a/packfile.c
+++ b/packfile.c
@@ -24,6 +24,7 @@
#include "commit-graph.h"
#include "pack-revindex.h"
#include "promisor-remote.h"
+#include "pack-mtimes.h"
char *odb_pack_name(struct repository *r, struct strbuf *buf,
const unsigned char *hash, const char *ext)
@@ -2107,7 +2108,7 @@ static void maybe_invalidate_kept_pack_cache(struct repository *r,
r->objects->kept_pack_cache.flags = 0;
}
-static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
+struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
{
maybe_invalidate_kept_pack_cache(r, flags);
diff --git a/packfile.h b/packfile.h
index 00ada7a938..25097213d0 100644
--- a/packfile.h
+++ b/packfile.h
@@ -197,6 +197,8 @@ int has_object_pack(struct repository *r, const struct object_id *oid);
int has_object_kept_pack(struct repository *r, const struct object_id *oid,
unsigned flags);
+struct packed_git **kept_pack_cache(struct repository *r, unsigned flags);
+
/*
* Return 1 if an object in a promisor packfile is or refers to the given
* object, 0 otherwise.
diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index 959e6e2648..43d2947d28 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -304,6 +304,72 @@ test_expect_success '--max-cruft-size with freshened objects (packed)' '
)
'
+test_expect_success '--max-cruft-size with freshened objects (previously cruft)' '
+ repo="max-cruft-size-threshold" &&
+
+ test_when_finished "rm -fr $repo" &&
+ git init "$repo" &&
+ (
+ cd "$repo" &&
+
+ test_commit base &&
+ foo="$(generate_random_blob foo $((2*1024*1024)))" &&
+ bar="$(generate_random_blob bar $((2*1024*1024)))" &&
+ baz="$(generate_random_blob baz $((2*1024*1024)))" &&
+
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$foo")" >foo.old &&
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$bar")" >bar.old &&
+ test-tool chmtime --get -100000 \
+ "$objdir/$(test_oid_to_path "$baz")" >baz.old &&
+
+ git repack --cruft -d &&
+
+ # Make an identical copy of foo stored in a pack with a more
+ # recent mtime.
+ foo="$(generate_random_blob foo $((2*1024*1024)))" &&
+ foo_pack="$(echo "$foo" | git pack-objects $packdir/pack)" &&
+ test-tool chmtime --get -100 \
+ "$packdir/pack-$foo_pack.pack" >foo.new &&
+ git prune-packed &&
+
+ # Make a loose copy of bar, also with a more recent mtime.
+ bar="$(generate_random_blob bar $((2*1024*1024)))" &&
+ test-tool chmtime --get -100 \
+ "$objdir/$(test_oid_to_path "$bar")" >bar.new &&
+
+ # Make a new cruft object $quux to ensure we do not
+ # generate an identical pack to the existing cruft
+ # pack.
+ quux="$(generate_random_blob quux $((1024)))" &&
+ test-tool chmtime --get -100 \
+ "$objdir/$(test_oid_to_path "$quux")" >quux.new &&
+
+ git repack --cruft --max-cruft-size=3M -d &&
+
+ for p in $packdir/pack-*.mtimes
+ do
+ test-tool pack-mtimes "$(basename "$p")" || return 1
+ done >actual.raw &&
+ sort actual.raw >actual &&
+
+ # Among the set of all cruft packs, we should see both
+ # mtimes for object $foo and $bar, as well as the
+ # single new copy of $baz.
+ sort >expect <<-EOF &&
+ $foo $(cat foo.old)
+ $foo $(cat foo.new)
+ $bar $(cat bar.old)
+ $bar $(cat bar.new)
+ $baz $(cat baz.old)
+ $quux $(cat quux.new)
+ EOF
+
+ test_cmp expect actual
+ )
+'
+
test_expect_success '--max-cruft-size with pruning' '
git init max-cruft-size-prune &&
(
Range-diff against v4:
1: 390c3a6d85 < -: ---------- t/t5329-pack-objects-cruft.sh: evict 'repack'-related tests
2: e7ebe6c460 < -: ---------- t7704-repack-cruft.sh: consolidate `write_blob()`
3: aa7588f817 < -: ---------- t/lib-cruft.sh: extract some cruft-related helpers
4: f2ca92245a < -: ---------- pack-objects: generate cruft packs at most one object over threshold
5: 12ddea7603 < -: ---------- builtin/repack.c: simplify cruft pack aggregation
6: d44a124c81 ! 1: 1563552bbd builtin/pack-objects.c: freshen objects from existing cruft packs
@@ t/t7704-repack-cruft.sh: test_expect_success '--max-cruft-size with freshened ob
'
+test_expect_success '--max-cruft-size with freshened objects (previously cruft)' '
-+ git init max-cruft-size-threshold &&
++ repo="max-cruft-size-threshold" &&
++
++ test_when_finished "rm -fr $repo" &&
++ git init "$repo" &&
+ (
-+ cd max-cruft-size-threshold &&
++ cd "$repo" &&
+
+ test_commit base &&
+ foo="$(generate_random_blob foo $((2*1024*1024)))" &&
base-commit: 87a0bdbf0f72b7561f3cd50636eee33dcb7dbcc3
--
2.49.0.rc2.1.g1563552bbd
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH v5] builtin/pack-objects.c: freshen objects from existing cruft packs
2025-03-13 18:09 ` [PATCH v5] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
@ 2025-03-13 18:41 ` Junio C Hamano
0 siblings, 0 replies; 54+ messages in thread
From: Junio C Hamano @ 2025-03-13 18:41 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Jeff King, Elijah Newren
Taylor Blau <me@ttaylorr.com> writes:
> Once an object is written into a cruft pack, we can only freshen it by
> writing a new loose or packed copy of that object with a more recent
> mtime.
>
> Prior to 61568efa95 (builtin/pack-objects.c: support `--max-pack-size`
> with `--cruft`, 2023-08-28), we typically had at most one cruft pack in
> a repository at any given time. So freshening unreachable objects was
> straightforward when already rewriting the cruft pack (and its *.mtimes
> file).
>
> But 61568efa95 changes things: 'pack-objects' now supports writing
> multiple cruft packs when invoked with `--cruft` and the
> `--max-pack-size` flag. Cruft packs are rewritten until they reach some
> size threshold, at which point they are considered "frozen", and will
> only be modified in a pruning GC, or if the threshold itself is
> adjusted.
>
> Prior to this patch, however, this process breaks down when we attempt
> to freshen an object packed in an earlier cruft pack, and that cruft
> pack is larger than the threshold and thus will survive the repack.
>
> When this is the case, it is impossible to freshen objects in cruft
> pack(s) when those cruft packs are larger than the threshold. This is
> because we would avoid writing them in the new cruft pack entirely, for
> a couple of reasons.
>
> 1. When enumerating packed objects via 'add_objects_in_unpacked_packs()'
> we pass the SKIP_IN_CORE_KEPT_PACKS, which is used to avoid looping
> over the packs we're going to retain (which are marked as kept
> in-core by 'read_cruft_objects()').
>
> This means that we will avoid enumerating additional packed copies
> of objects found in any cruft packs which are larger than the given
> size threshold. Thus there is no opportunity to call
> 'create_object_entry()' whatsoever.
>
> 2. We likewise will discard the loose copy (if one exists) of any
> unreachable object packed in a cruft pack that is larger than the
> threshold. Here our call path is 'add_unreachable_loose_objects()',
> which uses the 'add_loose_object()' callback.
>
> That function will eventually land us in 'want_object_in_pack()'
> (via 'add_cruft_object_entry()'), and we'll discard the object as it
> appears in one of the packs which we marked as kept in-core.
>
> This means in effect that it is impossible to freshen an unreachable
> object once it appears in a cruft pack larger than the given threshold.
>
> Instead, we should pack an additional copy of an unreachable object we
> want to freshen even if it appears in a cruft pack, provided that the
> cruft copy has an mtime which is before the mtime of the copy we are
> trying to pack/freshen. This is sub-optimal in the sense that it
> requires keeping an additional copy of unreachable objects upon
> freshening, but we don't have a better alternative without the ability
> to make in-place modifications to existing *.mtimes files.
>
> In order to implement this, we have to adjust the behavior of
> 'want_found_object()'. When 'pack-objects' is told that we're *not*
> going to retain any cruft packs (i.e. the set of packs marked as kept
> in-core does not contain a cruft pack), the behavior is unchanged.
>
> But when there *is* at least one cruft pack that we're holding onto, it
> is no longer sufficient to reject a copy of an object found in that
> cruft pack for that reason alone. In this case, we only want to reject a
> candidate object when copies of that object either:
>
> - exists in a non-cruft pack that we are retaining, regardless of that
> pack's mtime, or
>
> - exists in a cruft pack with an mtime at least as recent as the copy
> we are debating whether or not to pack, in which case freshening
> would be redundant.
>
> To do this, keep track of whether or not we have any cruft packs in our
> in-core kept list with a new 'ignore_packed_keep_in_core_has_cruft'
> flag. When we end up in this new special case, we replace a call to
> 'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only reject
> objects when we have a copy in an existing cruft pack with at least as
> recent an mtime as our candidate (in which case "freshening" would be
> redundant).
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> builtin/pack-objects.c | 118 ++++++++++++++++++++++++++++++++++------
> packfile.c | 3 +-
> packfile.h | 2 +
> t/t7704-repack-cruft.sh | 66 ++++++++++++++++++++++
> 4 files changed, 171 insertions(+), 18 deletions(-)
Thanks. Will queue.
^ permalink raw reply [flat|nested] 54+ messages in thread
end of thread, other threads:[~2025-03-13 18:41 UTC | newest]
Thread overview: 54+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-27 18:29 [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs Taylor Blau
2025-02-27 18:29 ` [PATCH 1/2] builtin/repack.c: simplify cruft pack aggregation Taylor Blau
2025-02-27 19:23 ` Elijah Newren
2025-02-27 22:53 ` Taylor Blau
2025-02-28 7:52 ` Patrick Steinhardt
2025-03-04 21:52 ` Elijah Newren
2025-03-05 2:04 ` Junio C Hamano
2025-03-05 0:09 ` Taylor Blau
2025-02-27 18:29 ` [PATCH 2/2] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
2025-02-27 19:26 ` Elijah Newren
2025-02-27 23:03 ` Taylor Blau
2025-02-27 19:28 ` [PATCH 0/2] pack-objects: freshen objects with multi-cruft packs Elijah Newren
2025-02-27 23:05 ` Taylor Blau
2025-03-04 21:35 ` [PATCH v2 " Taylor Blau
2025-03-04 21:35 ` [PATCH v2 1/2] builtin/repack.c: simplify cruft pack aggregation Taylor Blau
2025-03-04 21:35 ` [PATCH v2 2/2] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
2025-03-04 22:55 ` [PATCH v2 0/2] pack-objects: freshen objects with multi-cruft packs Elijah Newren
2025-03-05 0:06 ` Taylor Blau
2025-03-05 0:13 ` Taylor Blau
2025-03-05 0:15 ` [PATCH v3 0/1] " Taylor Blau
2025-03-05 0:15 ` [PATCH v3 1/1] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
2025-03-06 10:31 ` Patrick Steinhardt
2025-03-13 17:32 ` Taylor Blau
2025-03-06 10:31 ` [PATCH v3 0/1] pack-objects: freshen objects with multi-cruft packs Patrick Steinhardt
2025-03-11 0:21 ` [PATCH v4 0/6] " Taylor Blau
2025-03-11 0:21 ` [PATCH v4 1/6] t/t5329-pack-objects-cruft.sh: evict 'repack'-related tests Taylor Blau
2025-03-11 0:21 ` [PATCH v4 2/6] t7704-repack-cruft.sh: consolidate `write_blob()` Taylor Blau
2025-03-11 0:21 ` [PATCH v4 3/6] t/lib-cruft.sh: extract some cruft-related helpers Taylor Blau
2025-03-11 0:21 ` [PATCH v4 4/6] pack-objects: generate cruft packs at most one object over threshold Taylor Blau
2025-03-11 21:59 ` Junio C Hamano
2025-03-12 15:22 ` Taylor Blau
2025-03-12 18:26 ` Junio C Hamano
2025-03-12 19:02 ` Taylor Blau
2025-03-12 19:13 ` Elijah Newren
2025-03-12 19:33 ` Taylor Blau
2025-03-12 20:43 ` Junio C Hamano
2025-03-12 20:49 ` Elijah Newren
2025-03-13 12:16 ` Junio C Hamano
2025-03-13 16:23 ` Elijah Newren
2025-03-13 17:06 ` Junio C Hamano
2025-03-11 0:21 ` [PATCH v4 5/6] builtin/repack.c: simplify cruft pack aggregation Taylor Blau
2025-03-11 0:21 ` [PATCH v4 6/6] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
2025-03-11 20:13 ` [PATCH v4 0/6] pack-objects: freshen objects with multi-cruft packs Junio C Hamano
2025-03-12 15:33 ` Taylor Blau
2025-03-12 18:28 ` Junio C Hamano
2025-03-12 19:04 ` Taylor Blau
2025-03-12 19:46 ` Junio C Hamano
2025-03-12 19:52 ` Taylor Blau
2025-03-13 17:17 ` Junio C Hamano
2025-03-13 17:35 ` Taylor Blau
2025-03-13 6:29 ` Jeff King
2025-03-13 15:12 ` Junio C Hamano
2025-03-13 18:09 ` [PATCH v5] builtin/pack-objects.c: freshen objects from existing cruft packs Taylor Blau
2025-03-13 18:41 ` Junio C Hamano
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).