On Fri, Mar 29, 2024 at 04:16:48AM +0000, Justin Tobler via GitGitGadget wrote:
> From: Justin Tobler
>
> To reduce the number of on-disk reftables, compaction is performed.
> Contiguous tables with the same binary log value of size are grouped
> into segments. The segment that has both the lowest binary log value
> and contains more than one table is set as the starting point when
> identifying the compaction segment.
>
> Since segments containing a single table are not initially considered
> for compaction, if the table appended to the list does not match the
> previous table log value, no compaction occurs for the new table. It
> is therefore possible for the table list to grow without bound. This
> can be demonstrated by repeating the following sequence:
>
>   git branch -f foo
>   git branch -d foo
>
> Each operation results in a new table being written with no compaction
> occurring until a separate operation produces a table matching the
> previous table log value.
>
> Instead, to avoid unbounded growth of the table list, the compaction
> strategy is updated to ensure tables follow a geometric sequence after
> each operation by individually evaluating each table in reverse index
> order. This strategy results in a much simpler and more robust
> algorithm compared to the previous one while also maintaining a
> minimal ordered set of tables on-disk.
>
> When creating 10 thousand references, the new strategy has no
> performance impact:
>
> Benchmark 1: update-ref: create refs sequentially (revision = HEAD~)
>   Time (mean ± σ):     26.516 s ±  0.047 s    [User: 17.864 s, System: 8.491 s]
>   Range (min … max):   26.447 s … 26.569 s    10 runs
>
> Benchmark 2: update-ref: create refs sequentially (revision = HEAD)
>   Time (mean ± σ):     26.417 s ±  0.028 s    [User: 17.738 s, System: 8.500 s]
>   Range (min … max):   26.366 s … 26.444 s    10 runs
>
> Summary
>   update-ref: create refs sequentially (revision = HEAD) ran
>     1.00 ± 0.00 times faster than update-ref: create refs sequentially (revision = HEAD~)
>
> Some tests in `t0610-reftable-basics.sh` assert the on-disk state of
> tables and are therefore updated to specify the correct new table
> count. Since compaction is more aggressive in ensuring tables maintain
> a geometric sequence, the expected table count is reduced in these
> tests. In `reftable/stack_test.c`, tests related to
> `sizes_to_segments()` are removed because the function is no longer
> needed. Also, the `test_suggest_compaction_segment()` test is updated
> to better showcase and reflect the new geometric compaction behavior.
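The new invariant is pleasantly simple to state: with tables ordered
oldest first, the stack is fine as long as every table is at least
twice the size of the table that follows it. Roughly something like
this (illustration only, such a helper does not exist in the patch,
and the name `is_geometric` is made up here):

	static int is_geometric(const uint64_t *sizes, size_t n)
	{
		size_t i;

		/*
		 * Factor-2 geometric sequence: each table must be at
		 * least twice as large as the next, newer one.
		 */
		for (i = 0; i + 1 < n; i++)
			if (sizes[i] < sizes[i + 1] * 2)
				return 0;
		return 1;
	}

Auto-compaction then only needs to find a segment of the newest tables
whose merge restores this property, which is what the rewritten
suggest_compaction_segment() below computes.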
>
> Signed-off-by: Justin Tobler
> ---
>  reftable/stack.c           | 120 ++++++++++++++++++-------------------
>  reftable/stack.h           |   3 -
>  reftable/stack_test.c      |  67 +++++----------------
>  t/t0610-reftable-basics.sh |  45 +++++++++-----
>  4 files changed, 103 insertions(+), 132 deletions(-)
>
> diff --git a/reftable/stack.c b/reftable/stack.c
> index 07262beaaf7..e7b9a1de5a4 100644
> --- a/reftable/stack.c
> +++ b/reftable/stack.c
> @@ -1202,75 +1202,73 @@ static int segment_size(struct segment *s)
>  	return s->end - s->start;
>  }
>
> -int fastlog2(uint64_t sz)
> -{
> -	int l = 0;
> -	if (sz == 0)
> -		return 0;
> -	for (; sz; sz /= 2) {
> -		l++;
> -	}
> -	return l - 1;
> -}
> -
> -struct segment *sizes_to_segments(size_t *seglen, uint64_t *sizes, size_t n)
> -{
> -	struct segment *segs = reftable_calloc(n, sizeof(*segs));
> -	struct segment cur = { 0 };
> -	size_t next = 0, i;
> -
> -	if (n == 0) {
> -		*seglen = 0;
> -		return segs;
> -	}
> -	for (i = 0; i < n; i++) {
> -		int log = fastlog2(sizes[i]);
> -		if (cur.log != log && cur.bytes > 0) {
> -			struct segment fresh = {
> -				.start = i,
> -			};
> -
> -			segs[next++] = cur;
> -			cur = fresh;
> -		}
> -
> -		cur.log = log;
> -		cur.end = i + 1;
> -		cur.bytes += sizes[i];
> -	}
> -	segs[next++] = cur;
> -	*seglen = next;
> -	return segs;
> -}
> -
>  struct segment suggest_compaction_segment(uint64_t *sizes, size_t n)
>  {
> -	struct segment min_seg = {
> -		.log = 64,
> -	};
> -	struct segment *segs;
> -	size_t seglen = 0, i;
> -
> -	segs = sizes_to_segments(&seglen, sizes, n);
> -	for (i = 0; i < seglen; i++) {
> -		if (segment_size(&segs[i]) == 1)
> -			continue;
> +	struct segment seg = { 0 };
> +	uint64_t bytes;
> +	size_t i;
>
> -		if (segs[i].log < min_seg.log)
> -			min_seg = segs[i];
> -	}
> +	/*
> +	 * If there are no tables or only a single one then we don't have to
> +	 * compact anything. The sequence is geometric by definition already.
> +	 */
> +	if (n <= 1)
> +		return seg;
>
> -	while (min_seg.start > 0) {
> -		size_t prev = min_seg.start - 1;
> -		if (fastlog2(min_seg.bytes) < fastlog2(sizes[prev]))
> +	/*
> +	 * Find the ending table of the compaction segment needed to restore the
> +	 * geometric sequence.
> +	 *
> +	 * To do so, we iterate backwards starting from the most recent table
> +	 * until a valid segment end is found. If the preceding table is smaller
> +	 * than the current table multiplied by the geometric factor (2), the
> +	 * current table is set as the compaction segment end.
> +	 *
> +	 * Tables after the ending point are not added to the byte count because
> +	 * they are already valid members of the geometric sequence. Due to the
> +	 * properties of a geometric sequence, it is not possible for the sum of
> +	 * these tables to exceed the value of the ending point table.
> +	 *
> +	 * Example table size sequence requiring no compaction:
> +	 *	64, 32, 16, 8, 4, 2, 1
> +	 *
> +	 * Example compaction segment end set to table with size 3:
> +	 *	64, 32, 16, 8, 4, 3, 1
> +	 */
> +	for (i = n - 1; i > 0; i--) {
> +		if (sizes[i - 1] < sizes[i] * 2) {
> +			seg.end = i + 1;
> +			bytes = sizes[i];
>  			break;
> +		}
> +	}
>
> -	min_seg.start = prev;
> -	min_seg.bytes += sizes[prev];
> +	/*
> +	 * Find the starting table of the compaction segment by iterating
> +	 * through the remaining tables and keeping track of the accumulated
> +	 * size of all tables seen from the segment end table. The previous
> +	 * table is compared to the accumulated size because the tables from the
> +	 * segment end are merged backwards recursively.
> +	 *
> +	 * Note that we keep iterating even after we have found the first
> +	 * starting point. This is because there may be tables in the stack
> +	 * preceding that first starting point which violate the geometric
> +	 * sequence.
> +	 *
> +	 * Example compaction segment start set to table with size 32:
> +	 *	128, 32, 16, 8, 4, 3, 1
> +	 */
> +	for (; i > 0; i--) {
> +		uint64_t curr = bytes;
> +		bytes += sizes[i - 1];
> +
> +		if (sizes[i - 1] < curr * 2) {
> +			seg.start = i - 1;
> +			seg.bytes = bytes;
> +		}
>  	}
>
> -	reftable_free(segs);
> -	return min_seg;
> +	return seg;
>  }
>
>  static uint64_t *stack_table_sizes_for_compaction(struct reftable_stack *st)
> diff --git a/reftable/stack.h b/reftable/stack.h
> index d919455669e..656f896cc28 100644
> --- a/reftable/stack.h
> +++ b/reftable/stack.h
> @@ -33,12 +33,9 @@ int read_lines(const char *filename, char ***lines);
>
>  struct segment {
>  	size_t start, end;
> -	int log;
>  	uint64_t bytes;
>  };
>
> -int fastlog2(uint64_t sz);
> -struct segment *sizes_to_segments(size_t *seglen, uint64_t *sizes, size_t n);
>  struct segment suggest_compaction_segment(uint64_t *sizes, size_t n);
>
>  #endif
> diff --git a/reftable/stack_test.c b/reftable/stack_test.c
> index 7336757cf53..21541742fe5 100644
> --- a/reftable/stack_test.c
> +++ b/reftable/stack_test.c
> @@ -717,59 +717,13 @@ static void test_reftable_stack_hash_id(void)
>  	clear_dir(dir);
>  }
>
> -static void test_log2(void)
> -{
> -	EXPECT(1 == fastlog2(3));
> -	EXPECT(2 == fastlog2(4));
> -	EXPECT(2 == fastlog2(5));
> -}
> -
> -static void test_sizes_to_segments(void)
> -{
> -	uint64_t sizes[] = { 2, 3, 4, 5, 7, 9 };
> -	/* .................0  1  2  3  4  5 */
> -
> -	size_t seglen = 0;
> -	struct segment *segs =
> -		sizes_to_segments(&seglen, sizes, ARRAY_SIZE(sizes));
> -	EXPECT(segs[2].log == 3);
> -	EXPECT(segs[2].start == 5);
> -	EXPECT(segs[2].end == 6);
> -
> -	EXPECT(segs[1].log == 2);
> -	EXPECT(segs[1].start == 2);
> -	EXPECT(segs[1].end == 5);
> -	reftable_free(segs);
> -}
> -
> -static void test_sizes_to_segments_empty(void)
> -{
> -	size_t seglen = 0;
> -	struct segment *segs = sizes_to_segments(&seglen, NULL, 0);
> -	EXPECT(seglen == 0);
> -	reftable_free(segs);
> -}
> -
> -static void test_sizes_to_segments_all_equal(void)
> -{
> -	uint64_t sizes[] = { 5, 5 };
> -	size_t seglen = 0;
> -	struct segment *segs =
> -		sizes_to_segments(&seglen, sizes, ARRAY_SIZE(sizes));
> -	EXPECT(seglen == 1);
> -	EXPECT(segs[0].start == 0);
> -	EXPECT(segs[0].end == 2);
> -	reftable_free(segs);
> -}
> -
>  static void test_suggest_compaction_segment(void)
>  {
> -	uint64_t sizes[] = { 128, 64, 17, 16, 9, 9, 9, 16, 16 };
> -	/* .................0    1   2   3  4  5  6 */
> +	uint64_t sizes[] = { 512, 64, 17, 16, 9, 9, 9, 16, 2, 16 };
>  	struct segment min =
>  		suggest_compaction_segment(sizes, ARRAY_SIZE(sizes));
> -	EXPECT(min.start == 2);
> -	EXPECT(min.end == 7);
> +	EXPECT(min.start == 1);
> +	EXPECT(min.end == 10);
>  }
>
>  static void test_suggest_compaction_segment_nothing(void)
> @@ -880,6 +834,17 @@ static void test_empty_add(void)
>  	reftable_stack_destroy(st2);
>  }
>
> +static int fastlog2(uint64_t sz)
> +{
> +	int l = 0;
> +	if (sz == 0)
> +		return 0;
> +	for (; sz; sz /= 2) {
> +		l++;
> +	}

Nit: we could drop the curly braces while at it.
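I.e., since the loop body is a single statement, something like this
(untested, just spelling out the suggestion):

	for (; sz; sz /= 2)
		l++;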
> +	return l - 1;
> +}
> +
>  static void test_reftable_stack_auto_compaction(void)
>  {
>  	struct reftable_write_options cfg = { 0 };
> @@ -1068,7 +1033,6 @@ static void test_reftable_stack_compaction_concurrent_clean(void)
>  int stack_test_main(int argc, const char *argv[])
>  {
>  	RUN_TEST(test_empty_add);
> -	RUN_TEST(test_log2);
>  	RUN_TEST(test_names_equal);
>  	RUN_TEST(test_parse_names);
>  	RUN_TEST(test_read_file);
> @@ -1088,9 +1052,6 @@ int stack_test_main(int argc, const char *argv[])
>  	RUN_TEST(test_reftable_stack_update_index_check);
>  	RUN_TEST(test_reftable_stack_uptodate);
>  	RUN_TEST(test_reftable_stack_validate_refname);
> -	RUN_TEST(test_sizes_to_segments);
> -	RUN_TEST(test_sizes_to_segments_all_equal);
> -	RUN_TEST(test_sizes_to_segments_empty);
>  	RUN_TEST(test_suggest_compaction_segment);
>  	RUN_TEST(test_suggest_compaction_segment_nothing);
>  	return 0;
> diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
> index 434044078ed..b95626549e7 100755
> --- a/t/t0610-reftable-basics.sh
> +++ b/t/t0610-reftable-basics.sh
> @@ -293,7 +293,7 @@ test_expect_success 'ref transaction: writes cause auto-compaction' '
>  	test_line_count = 1 repo/.git/reftable/tables.list &&
>
>  	test_commit -C repo --no-tag A &&
> -	test_line_count = 2 repo/.git/reftable/tables.list &&
> +	test_line_count = 1 repo/.git/reftable/tables.list &&
>
>  	test_commit -C repo --no-tag B &&
>  	test_line_count = 1 repo/.git/reftable/tables.list
> @@ -308,12 +308,24 @@ test_expect_success 'ref transaction: environment variable disables auto-compact
>  	do
>  		GIT_TEST_REFTABLE_NO_AUTOCOMPACTION=true git -C repo update-ref branch-$i HEAD || return 1
>  	done &&
> -	test_line_count = 23 repo/.git/reftable/tables.list &&
> +	test_line_count = 22 repo/.git/reftable/tables.list &&
>
>  	git -C repo update-ref foo HEAD &&
>  	test_line_count = 1 repo/.git/reftable/tables.list
>  '
>
> +test_expect_success 'ref transaction: alternating table sizes are compacted' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	test_commit -C repo A &&
> +	for i in $(test_seq 20)

Nit: we could reduce the number of iterations here so that we don't have
to spawn 40 Git commands. Using something like 10 or even 5 iterations
should be sufficient to demonstrate the problem, no?

Patrick