git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Jeff King <peff@peff.net>
Cc: Elijah Newren <newren@gmail.com>,
	Junio C Hamano <gitster@pobox.com>,
	git@vger.kernel.org, "Eric W. Biederman" <ebiederm@gmail.com>
Subject: Re: [PATCH 7/7] builtin/merge-tree.c: implement support for `--write-pack`
Date: Sun, 8 Oct 2023 21:37:42 -0400	[thread overview]
Message-ID: <ZSNZZrWyCqRH+0Bd@nand.local> (raw)
In-Reply-To: <20231008173329.GA1557002@coredump.intra.peff.net>

On Sun, Oct 08, 2023 at 01:33:29PM -0400, Jeff King wrote:
> On Sun, Oct 08, 2023 at 12:04:04PM -0400, Taylor Blau wrote:
>
> > > I was interested in the same question as Junio, but from a different
> > > angle.  fast-import documentation points out that the packs it creates
> > > are suboptimal with poorer delta choices.  Are the packs created by
> > > bulk-checkin prone to the same issues?  When I was thinking in terms
> > > of having "git merge" use fast-import for pack creation instead of
> > > writing loose objects (an idea I never investigated very far), I was
> > > wondering if I'd need to mark those packs as "less optimal" and do
> > > something to make sure they were more likely to be repacked.
> > >
> > > I believe geometric repacking didn't exist back when I was thinking
> > > about this, and perhaps geometric repacking automatically handles
> > > things nicely for us.  Does it, or are we risking retaining
> > > sub-optimal deltas from the bulk-checkin code?
> > >
> > > (I've never really cracked open the pack code, so I have absolutely no
> > > idea; I'm just curious.)
> >
> > Yes, the bulk-checkin mechanism suffers from an even worse problem which
> > is the pack it creates will contain no deltas whatsoever. The contents
> > of the pack are just getting written as-is, so there's no fancy
> > delta-ficiation going on.
>
> I wonder how big a deal this would be in practice for merges.
> pack-objects will look for deltas between any two candidates objects,
> but in practice I think most deltas are between objects from multiple
> commits (across the "time" dimension, if you will) rather than within a
> single tree (the "space" dimension). And a merge operation is generally
> creating a single new tree (recursive merging may create intermediate
> states which would delta, but we don't actually need to keep those
> intermediate ones. I won't be surprised if we do, though).
>
> We should be able to test that theory by looking at existing deltas.
> Here's a script which builds an index of blobs and trees to the commits
> that introduce them:
>
>   git rev-list HEAD |
>   git diff-tree --stdin -r -m -t --raw |
>   perl -lne '
>     if (/^[0-9a-f]/) {
>       $commit = $_;
>     } elsif (/^:\S+ \S+ \S+ (\S+)/) {
>       $h{$1} = $commit;
>     }
>     END { print "$_ $h{$_}" for keys(%h) }
>   ' >commits.db
>
> And then we can see which deltas come from the same commit:
>
>   git cat-file --batch-all-objects --batch-check='%(objectname) %(deltabase)' |
>   perl -alne '
>     BEGIN {
>       open(my $fh, "<", "commits.db");
>       %commit = map { chomp; split } <$fh>;
>     }
>     next if $F[1] =~ /0{40}/; # not a delta
>     next unless defined $commit{$F[0]}; # not in index
>     print $commit{$F[0]} eq $commit{$F[1]} ? "inner" : "outer", " ", $_;
>   '
>
> In git.git, I see 460 "inner" deltas, and 193,081 "outer" ones. The
> inner ones are mostly small single-file test vectors, which makes sense.
> It's possible to have a merge result that does conflict resolution in
> two such files (that would then delta), but it seems like a fairly
> unlikely case. Numbers for linux.git are similar.
>
> So it might just not be a big issue at all for this use case.

Very interesting, thanks for running (and documenting!) this experiment.
I'm mostly with you that it probably doesn't make a huge difference in
practice here.

One thing that I'm not entirely clear on is how we'd treat objects that
could be good delta candidates for each other between two packs. For
instance, if I write a tree corresponding to the merge between two
branches, it's likely that the resulting tree would be a good delta
candidate against either of the trees at the tips of those two refs.

But we won't pack those trees (the ones at the tips of the refs) in the
same pack as the tree containing their merge. If we later on tried to
repack, would we evaluate the tip trees as possible delta candidates
against the merged tree? Or would we look at the merged tree, realize it
isn't delta'd with anything, and then not attempt to find any
candidates?

> > I think Michael Haggerty (?) suggested to me off-list that it might be
> > interesting to have a flag that we could mark packs with bad/no deltas
> > as such so that we don't implicitly trust their contents as having high
> > quality deltas.
>
> I was going to suggest the same thing. ;) Unfortunately it's a bit
> tricky to do as we have no room in the file format for an optional flag.
> You'd have to add a ".mediocre-delta" file or something.

Yeah, I figured that we'd add a new ".baddeltas" file or something. (As
an aside, we probably should have an optional flags section in the .pack
format, since we seem to have a lot of optional pack extensions: .rev,
.bitmap, .keep, .promisor, etc.)

> But here's another approach. I recall discussing a while back the idea
> that we should not necessarily trust the quality of deltas in packs that
> are pushed (and I think Thomas Gummerer even did some experiments inside
> GitHub with those, though I don't remember the results). And one way
> around that is during geometric repacking to consider the biggest/oldest
> pack as "preferred", reuse its deltas, but always compute from scratch
> with the others (neither reusing on-disk deltas, nor skipping
> try_delta() when two objects come from the same pack).
>
> That same strategy would work here (and for incremental fast-import
> packs, though of course not if your fast-import pack is the "big" one
> after you do a from-scratch import).

Yeah, definitely. I think that that code is still living in GitHub's
fork, but inactive since we haven't set any of the relevant
configuration in GitHub's production environment.

> Possibly it exacerbates the "no deltas" issue from above (though it
> would depend on the command).  The bigger question to me is one of
> checkpointing. When do we finish off the pack with a .idx and make it
> available to other readers? We could do it at program exit, but I
> suspect there are some commands that really want to make objects
> available sooner (e.g., as soon as "git add" writes an index, we'd want
> those objects to already be available). Probably every program that
> writes objects would need to be annotated with a checkpoint call (which
> would be a noop in loose object mode).
>
> So maybe it's a dumb direction. I dunno.

I wouldn't say it's a dumb direction ;-). But I'd be hesitant pursuing
it without solving the "no deltas" question from earlier.

Thanks,
Taylor

  reply	other threads:[~2023-10-09  1:37 UTC|newest]

Thread overview: 89+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-06 22:01 [PATCH 0/7] merge-ort: implement support for packing objects together Taylor Blau
2023-10-06 22:01 ` [PATCH 1/7] bulk-checkin: factor out `format_object_header_hash()` Taylor Blau
2023-10-06 22:01 ` [PATCH 2/7] bulk-checkin: factor out `prepare_checkpoint()` Taylor Blau
2023-10-06 22:01 ` [PATCH 3/7] bulk-checkin: factor out `truncate_checkpoint()` Taylor Blau
2023-10-06 22:01 ` [PATCH 4/7] bulk-checkin: factor our `finalize_checkpoint()` Taylor Blau
2023-10-06 22:02 ` [PATCH 5/7] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
2023-10-06 22:02 ` [PATCH 6/7] bulk-checkin: introduce `index_tree_bulk_checkin_incore()` Taylor Blau
2023-10-07  3:07   ` Eric Biederman
2023-10-09  1:31     ` Taylor Blau
2023-10-06 22:02 ` [PATCH 7/7] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
2023-10-06 22:35   ` Junio C Hamano
2023-10-06 23:02     ` Taylor Blau
2023-10-08  7:02       ` Elijah Newren
2023-10-08 16:04         ` Taylor Blau
2023-10-08 17:33           ` Jeff King
2023-10-09  1:37             ` Taylor Blau [this message]
2023-10-09 20:21               ` Jeff King
2023-10-09 17:24             ` Junio C Hamano
2023-10-09 10:54       ` Patrick Steinhardt
2023-10-09 16:08         ` Taylor Blau
2023-10-10  6:36           ` Patrick Steinhardt
2023-10-17 16:31 ` [PATCH v2 0/7] merge-ort: implement support for packing objects together Taylor Blau
2023-10-17 16:31   ` [PATCH v2 1/7] bulk-checkin: factor out `format_object_header_hash()` Taylor Blau
2023-10-17 16:31   ` [PATCH v2 2/7] bulk-checkin: factor out `prepare_checkpoint()` Taylor Blau
2023-10-17 16:31   ` [PATCH v2 3/7] bulk-checkin: factor out `truncate_checkpoint()` Taylor Blau
2023-10-17 16:31   ` [PATCH v2 4/7] bulk-checkin: factor our `finalize_checkpoint()` Taylor Blau
2023-10-17 16:31   ` [PATCH v2 5/7] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
2023-10-18  2:18     ` Junio C Hamano
2023-10-18 16:34       ` Taylor Blau
2023-10-17 16:31   ` [PATCH v2 6/7] bulk-checkin: introduce `index_tree_bulk_checkin_incore()` Taylor Blau
2023-10-17 16:31   ` [PATCH v2 7/7] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
2023-10-18 17:07 ` [PATCH v3 00/10] merge-ort: implement support for packing objects together Taylor Blau
2023-10-18 17:07   ` [PATCH v3 01/10] bulk-checkin: factor out `format_object_header_hash()` Taylor Blau
2023-10-18 17:07   ` [PATCH v3 02/10] bulk-checkin: factor out `prepare_checkpoint()` Taylor Blau
2023-10-18 17:07   ` [PATCH v3 03/10] bulk-checkin: factor out `truncate_checkpoint()` Taylor Blau
2023-10-18 17:07   ` [PATCH v3 04/10] bulk-checkin: factor out `finalize_checkpoint()` Taylor Blau
2023-10-18 17:08   ` [PATCH v3 05/10] bulk-checkin: extract abstract `bulk_checkin_source` Taylor Blau
2023-10-18 23:10     ` Junio C Hamano
2023-10-19 15:19       ` Taylor Blau
2023-10-19 17:55         ` Junio C Hamano
2023-10-18 17:08   ` [PATCH v3 06/10] bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source` Taylor Blau
2023-10-18 17:08   ` [PATCH v3 07/10] bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types Taylor Blau
2023-10-18 17:08   ` [PATCH v3 08/10] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
2023-10-18 23:18     ` Junio C Hamano
2023-10-19 15:30       ` Taylor Blau
2023-10-18 17:08   ` [PATCH v3 09/10] bulk-checkin: introduce `index_tree_bulk_checkin_incore()` Taylor Blau
2023-10-18 17:08   ` [PATCH v3 10/10] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
2023-10-18 18:32 ` [PATCH v4 00/17] bloom: changed-path Bloom filters v2 (& sundries) Taylor Blau
2023-10-18 18:32   ` [PATCH v4 01/17] t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()` Taylor Blau
2023-10-18 18:32   ` [PATCH v4 02/17] revision.c: consult Bloom filters for root commits Taylor Blau
2023-10-18 18:32   ` [PATCH v4 03/17] commit-graph: ensure Bloom filters are read with consistent settings Taylor Blau
2023-10-18 18:32   ` [PATCH v4 04/17] gitformat-commit-graph: describe version 2 of BDAT Taylor Blau
2023-10-18 18:32   ` [PATCH v4 05/17] t/helper/test-read-graph.c: extract `dump_graph_info()` Taylor Blau
2023-10-18 18:32   ` [PATCH v4 06/17] bloom.h: make `load_bloom_filter_from_graph()` public Taylor Blau
2023-10-18 18:32   ` [PATCH v4 07/17] t/helper/test-read-graph: implement `bloom-filters` mode Taylor Blau
2023-10-18 18:32   ` [PATCH v4 08/17] t4216: test changed path filters with high bit paths Taylor Blau
2023-10-18 18:32   ` [PATCH v4 09/17] repo-settings: introduce commitgraph.changedPathsVersion Taylor Blau
2023-10-18 18:32   ` [PATCH v4 10/17] commit-graph: new filter ver. that fixes murmur3 Taylor Blau
2023-10-18 18:33   ` [PATCH v4 11/17] bloom: annotate filters with hash version Taylor Blau
2023-10-18 18:33   ` [PATCH v4 12/17] bloom: prepare to discard incompatible Bloom filters Taylor Blau
2023-10-18 18:33   ` [PATCH v4 13/17] commit-graph.c: unconditionally load " Taylor Blau
2023-10-18 18:33   ` [PATCH v4 14/17] commit-graph: drop unnecessary `graph_read_bloom_data_context` Taylor Blau
2023-10-18 18:33   ` [PATCH v4 15/17] object.h: fix mis-aligned flag bits table Taylor Blau
2023-10-18 18:33   ` [PATCH v4 16/17] commit-graph: reuse existing Bloom filters where possible Taylor Blau
2023-10-18 18:33   ` [PATCH v4 17/17] bloom: introduce `deinit_bloom_filters()` Taylor Blau
2023-10-18 23:26   ` [PATCH v4 00/17] bloom: changed-path Bloom filters v2 (& sundries) Junio C Hamano
2023-10-20 17:27     ` Taylor Blau
2023-10-23 20:22       ` SZEDER Gábor
2023-10-30 20:24         ` Taylor Blau
2024-01-16 22:08   ` [PATCH v5 " Taylor Blau
2024-01-16 22:09     ` [PATCH v5 01/17] t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()` Taylor Blau
2024-01-16 22:09     ` [PATCH v5 02/17] revision.c: consult Bloom filters for root commits Taylor Blau
2024-01-16 22:09     ` [PATCH v5 03/17] commit-graph: ensure Bloom filters are read with consistent settings Taylor Blau
2024-01-16 22:09     ` [PATCH v5 04/17] gitformat-commit-graph: describe version 2 of BDAT Taylor Blau
2024-01-16 22:09     ` [PATCH v5 05/17] t/helper/test-read-graph.c: extract `dump_graph_info()` Taylor Blau
2024-01-16 22:09     ` [PATCH v5 06/17] bloom.h: make `load_bloom_filter_from_graph()` public Taylor Blau
2024-01-16 22:09     ` [PATCH v5 07/17] t/helper/test-read-graph: implement `bloom-filters` mode Taylor Blau
2024-01-16 22:09     ` [PATCH v5 08/17] t4216: test changed path filters with high bit paths Taylor Blau
2024-01-16 22:09     ` [PATCH v5 09/17] repo-settings: introduce commitgraph.changedPathsVersion Taylor Blau
2024-01-29 21:26       ` SZEDER Gábor
2024-01-29 23:58         ` Taylor Blau
2024-01-16 22:09     ` [PATCH v5 10/17] commit-graph: new Bloom filter version that fixes murmur3 Taylor Blau
2024-01-16 22:09     ` [PATCH v5 11/17] bloom: annotate filters with hash version Taylor Blau
2024-01-16 22:09     ` [PATCH v5 12/17] bloom: prepare to discard incompatible Bloom filters Taylor Blau
2024-01-16 22:09     ` [PATCH v5 13/17] commit-graph.c: unconditionally load " Taylor Blau
2024-01-16 22:09     ` [PATCH v5 14/17] commit-graph: drop unnecessary `graph_read_bloom_data_context` Taylor Blau
2024-01-16 22:09     ` [PATCH v5 15/17] object.h: fix mis-aligned flag bits table Taylor Blau
2024-01-16 22:09     ` [PATCH v5 16/17] commit-graph: reuse existing Bloom filters where possible Taylor Blau
2024-01-16 22:09     ` [PATCH v5 17/17] bloom: introduce `deinit_bloom_filters()` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZSNZZrWyCqRH+0Bd@nand.local \
    --to=me@ttaylorr.com \
    --cc=ebiederm@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).