git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Taylor Blau <me@ttaylorr.com>
Cc: Elijah Newren <newren@gmail.com>,
	Junio C Hamano <gitster@pobox.com>,
	git@vger.kernel.org, "Eric W. Biederman" <ebiederm@gmail.com>
Subject: Re: [PATCH 7/7] builtin/merge-tree.c: implement support for `--write-pack`
Date: Sun, 8 Oct 2023 13:33:29 -0400	[thread overview]
Message-ID: <20231008173329.GA1557002@coredump.intra.peff.net> (raw)
In-Reply-To: <ZSLS9G1lHruig48a@nand.local>

On Sun, Oct 08, 2023 at 12:04:04PM -0400, Taylor Blau wrote:

> > I was interested in the same question as Junio, but from a different
> > angle.  fast-import documentation points out that the packs it creates
> > are suboptimal with poorer delta choices.  Are the packs created by
> > bulk-checkin prone to the same issues?  When I was thinking in terms
> > of having "git merge" use fast-import for pack creation instead of
> > writing loose objects (an idea I never investigated very far), I was
> > wondering if I'd need to mark those packs as "less optimal" and do
> > something to make sure they were more likely to be repacked.
> >
> > I believe geometric repacking didn't exist back when I was thinking
> > about this, and perhaps geometric repacking automatically handles
> > things nicely for us.  Does it, or are we risking retaining
> > sub-optimal deltas from the bulk-checkin code?
> >
> > (I've never really cracked open the pack code, so I have absolutely no
> > idea; I'm just curious.)
> 
> Yes, the bulk-checkin mechanism suffers from an even worse problem which
> is the pack it creates will contain no deltas whatsoever. The contents
> of the pack are just getting written as-is, so there's no fancy
> delta-ficiation going on.

I wonder how big a deal this would be in practice for merges.
pack-objects will look for deltas between any two candidates objects,
but in practice I think most deltas are between objects from multiple
commits (across the "time" dimension, if you will) rather than within a
single tree (the "space" dimension). And a merge operation is generally
creating a single new tree (recursive merging may create intermediate
states which would delta, but we don't actually need to keep those
intermediate ones. I won't be surprised if we do, though).

We should be able to test that theory by looking at existing deltas.
Here's a script which builds an index of blobs and trees to the commits
that introduce them:

  git rev-list HEAD |
  git diff-tree --stdin -r -m -t --raw |
  perl -lne '
    if (/^[0-9a-f]/) {
      $commit = $_;
    } elsif (/^:\S+ \S+ \S+ (\S+)/) {
      $h{$1} = $commit;
    }
    END { print "$_ $h{$_}" for keys(%h) }
  ' >commits.db

And then we can see which deltas come from the same commit:

  git cat-file --batch-all-objects --batch-check='%(objectname) %(deltabase)' |
  perl -alne '
    BEGIN {
      open(my $fh, "<", "commits.db");
      %commit = map { chomp; split } <$fh>;
    }
    next if $F[1] =~ /0{40}/; # not a delta
    next unless defined $commit{$F[0]}; # not in index
    print $commit{$F[0]} eq $commit{$F[1]} ? "inner" : "outer", " ", $_;
  '

In git.git, I see 460 "inner" deltas, and 193,081 "outer" ones. The
inner ones are mostly small single-file test vectors, which makes sense.
It's possible to have a merge result that does conflict resolution in
two such files (that would then delta), but it seems like a fairly
unlikely case. Numbers for linux.git are similar.

So it might just not be a big issue at all for this use case.

> I think Michael Haggerty (?) suggested to me off-list that it might be
> interesting to have a flag that we could mark packs with bad/no deltas
> as such so that we don't implicitly trust their contents as having high
> quality deltas.

I was going to suggest the same thing. ;) Unfortunately it's a bit
tricky to do as we have no room in the file format for an optional flag.
You'd have to add a ".mediocre-delta" file or something.

But here's another approach. I recall discussing a while back the idea
that we should not necessarily trust the quality of deltas in packs that
are pushed (and I think Thomas Gummerer even did some experiments inside
GitHub with those, though I don't remember the results). And one way
around that is during geometric repacking to consider the biggest/oldest
pack as "preferred", reuse its deltas, but always compute from scratch
with the others (neither reusing on-disk deltas, nor skipping
try_delta() when two objects come from the same pack).

That same strategy would work here (and for incremental fast-import
packs, though of course not if your fast-import pack is the "big" one
after you do a from-scratch import).

> > A couple of the comments earlier in the series suggested this was
> > about streaming blobs to a pack in the bulk checkin code.  Are tree
> > and commit objects also put in the pack, or will those continue to be
> > written loosely?
> 
> This covers both blobs and trees, since IIUC that's all we'd need to
> implement support for merge-tree to be able to write any objects it
> creates into a pack. AFAIK merge-tree never generates any commit
> objects. But teaching 'merge' to perform the same bulk-checkin trick
> would just require us implementing index_bulk_commit_checkin_in_core()
> or similar, which is straightforward to do on top of the existing code.

This is a bit of a devil's advocate question, but: would it make sense
to implement this as a general of Git's object-writing code, and not tie
it to a specific command? That is, what if a user could do:

  git --write-to-pack merge ...

but also:

  git --write-to-pack add ...

and the object-writing code would just write to a pack instead of
writing loose objects. That lets the caller decide when it is or is not
a good idea to use this mode. And if making loose objects gives bad
performance for merges, wouldn't the same be true of other operations
which generate many objects?

Possibly it exacerbates the "no deltas" issue from above (though it
would depend on the command).  The bigger question to me is one of
checkpointing. When do we finish off the pack with a .idx and make it
available to other readers? We could do it at program exit, but I
suspect there are some commands that really want to make objects
available sooner (e.g., as soon as "git add" writes an index, we'd want
those objects to already be available). Probably every program that
writes objects would need to be annotated with a checkpoint call (which
would be a noop in loose object mode).

So maybe it's a dumb direction. I dunno.

-Peff

  reply	other threads:[~2023-10-08 17:33 UTC|newest]

Thread overview: 89+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-06 22:01 [PATCH 0/7] merge-ort: implement support for packing objects together Taylor Blau
2023-10-06 22:01 ` [PATCH 1/7] bulk-checkin: factor out `format_object_header_hash()` Taylor Blau
2023-10-06 22:01 ` [PATCH 2/7] bulk-checkin: factor out `prepare_checkpoint()` Taylor Blau
2023-10-06 22:01 ` [PATCH 3/7] bulk-checkin: factor out `truncate_checkpoint()` Taylor Blau
2023-10-06 22:01 ` [PATCH 4/7] bulk-checkin: factor our `finalize_checkpoint()` Taylor Blau
2023-10-06 22:02 ` [PATCH 5/7] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
2023-10-06 22:02 ` [PATCH 6/7] bulk-checkin: introduce `index_tree_bulk_checkin_incore()` Taylor Blau
2023-10-07  3:07   ` Eric Biederman
2023-10-09  1:31     ` Taylor Blau
2023-10-06 22:02 ` [PATCH 7/7] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
2023-10-06 22:35   ` Junio C Hamano
2023-10-06 23:02     ` Taylor Blau
2023-10-08  7:02       ` Elijah Newren
2023-10-08 16:04         ` Taylor Blau
2023-10-08 17:33           ` Jeff King [this message]
2023-10-09  1:37             ` Taylor Blau
2023-10-09 20:21               ` Jeff King
2023-10-09 17:24             ` Junio C Hamano
2023-10-09 10:54       ` Patrick Steinhardt
2023-10-09 16:08         ` Taylor Blau
2023-10-10  6:36           ` Patrick Steinhardt
2023-10-17 16:31 ` [PATCH v2 0/7] merge-ort: implement support for packing objects together Taylor Blau
2023-10-17 16:31   ` [PATCH v2 1/7] bulk-checkin: factor out `format_object_header_hash()` Taylor Blau
2023-10-17 16:31   ` [PATCH v2 2/7] bulk-checkin: factor out `prepare_checkpoint()` Taylor Blau
2023-10-17 16:31   ` [PATCH v2 3/7] bulk-checkin: factor out `truncate_checkpoint()` Taylor Blau
2023-10-17 16:31   ` [PATCH v2 4/7] bulk-checkin: factor our `finalize_checkpoint()` Taylor Blau
2023-10-17 16:31   ` [PATCH v2 5/7] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
2023-10-18  2:18     ` Junio C Hamano
2023-10-18 16:34       ` Taylor Blau
2023-10-17 16:31   ` [PATCH v2 6/7] bulk-checkin: introduce `index_tree_bulk_checkin_incore()` Taylor Blau
2023-10-17 16:31   ` [PATCH v2 7/7] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
2023-10-18 17:07 ` [PATCH v3 00/10] merge-ort: implement support for packing objects together Taylor Blau
2023-10-18 17:07   ` [PATCH v3 01/10] bulk-checkin: factor out `format_object_header_hash()` Taylor Blau
2023-10-18 17:07   ` [PATCH v3 02/10] bulk-checkin: factor out `prepare_checkpoint()` Taylor Blau
2023-10-18 17:07   ` [PATCH v3 03/10] bulk-checkin: factor out `truncate_checkpoint()` Taylor Blau
2023-10-18 17:07   ` [PATCH v3 04/10] bulk-checkin: factor out `finalize_checkpoint()` Taylor Blau
2023-10-18 17:08   ` [PATCH v3 05/10] bulk-checkin: extract abstract `bulk_checkin_source` Taylor Blau
2023-10-18 23:10     ` Junio C Hamano
2023-10-19 15:19       ` Taylor Blau
2023-10-19 17:55         ` Junio C Hamano
2023-10-18 17:08   ` [PATCH v3 06/10] bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source` Taylor Blau
2023-10-18 17:08   ` [PATCH v3 07/10] bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types Taylor Blau
2023-10-18 17:08   ` [PATCH v3 08/10] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
2023-10-18 23:18     ` Junio C Hamano
2023-10-19 15:30       ` Taylor Blau
2023-10-18 17:08   ` [PATCH v3 09/10] bulk-checkin: introduce `index_tree_bulk_checkin_incore()` Taylor Blau
2023-10-18 17:08   ` [PATCH v3 10/10] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
2023-10-18 18:32 ` [PATCH v4 00/17] bloom: changed-path Bloom filters v2 (& sundries) Taylor Blau
2023-10-18 18:32   ` [PATCH v4 01/17] t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()` Taylor Blau
2023-10-18 18:32   ` [PATCH v4 02/17] revision.c: consult Bloom filters for root commits Taylor Blau
2023-10-18 18:32   ` [PATCH v4 03/17] commit-graph: ensure Bloom filters are read with consistent settings Taylor Blau
2023-10-18 18:32   ` [PATCH v4 04/17] gitformat-commit-graph: describe version 2 of BDAT Taylor Blau
2023-10-18 18:32   ` [PATCH v4 05/17] t/helper/test-read-graph.c: extract `dump_graph_info()` Taylor Blau
2023-10-18 18:32   ` [PATCH v4 06/17] bloom.h: make `load_bloom_filter_from_graph()` public Taylor Blau
2023-10-18 18:32   ` [PATCH v4 07/17] t/helper/test-read-graph: implement `bloom-filters` mode Taylor Blau
2023-10-18 18:32   ` [PATCH v4 08/17] t4216: test changed path filters with high bit paths Taylor Blau
2023-10-18 18:32   ` [PATCH v4 09/17] repo-settings: introduce commitgraph.changedPathsVersion Taylor Blau
2023-10-18 18:32   ` [PATCH v4 10/17] commit-graph: new filter ver. that fixes murmur3 Taylor Blau
2023-10-18 18:33   ` [PATCH v4 11/17] bloom: annotate filters with hash version Taylor Blau
2023-10-18 18:33   ` [PATCH v4 12/17] bloom: prepare to discard incompatible Bloom filters Taylor Blau
2023-10-18 18:33   ` [PATCH v4 13/17] commit-graph.c: unconditionally load " Taylor Blau
2023-10-18 18:33   ` [PATCH v4 14/17] commit-graph: drop unnecessary `graph_read_bloom_data_context` Taylor Blau
2023-10-18 18:33   ` [PATCH v4 15/17] object.h: fix mis-aligned flag bits table Taylor Blau
2023-10-18 18:33   ` [PATCH v4 16/17] commit-graph: reuse existing Bloom filters where possible Taylor Blau
2023-10-18 18:33   ` [PATCH v4 17/17] bloom: introduce `deinit_bloom_filters()` Taylor Blau
2023-10-18 23:26   ` [PATCH v4 00/17] bloom: changed-path Bloom filters v2 (& sundries) Junio C Hamano
2023-10-20 17:27     ` Taylor Blau
2023-10-23 20:22       ` SZEDER Gábor
2023-10-30 20:24         ` Taylor Blau
2024-01-16 22:08   ` [PATCH v5 " Taylor Blau
2024-01-16 22:09     ` [PATCH v5 01/17] t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()` Taylor Blau
2024-01-16 22:09     ` [PATCH v5 02/17] revision.c: consult Bloom filters for root commits Taylor Blau
2024-01-16 22:09     ` [PATCH v5 03/17] commit-graph: ensure Bloom filters are read with consistent settings Taylor Blau
2024-01-16 22:09     ` [PATCH v5 04/17] gitformat-commit-graph: describe version 2 of BDAT Taylor Blau
2024-01-16 22:09     ` [PATCH v5 05/17] t/helper/test-read-graph.c: extract `dump_graph_info()` Taylor Blau
2024-01-16 22:09     ` [PATCH v5 06/17] bloom.h: make `load_bloom_filter_from_graph()` public Taylor Blau
2024-01-16 22:09     ` [PATCH v5 07/17] t/helper/test-read-graph: implement `bloom-filters` mode Taylor Blau
2024-01-16 22:09     ` [PATCH v5 08/17] t4216: test changed path filters with high bit paths Taylor Blau
2024-01-16 22:09     ` [PATCH v5 09/17] repo-settings: introduce commitgraph.changedPathsVersion Taylor Blau
2024-01-29 21:26       ` SZEDER Gábor
2024-01-29 23:58         ` Taylor Blau
2024-01-16 22:09     ` [PATCH v5 10/17] commit-graph: new Bloom filter version that fixes murmur3 Taylor Blau
2024-01-16 22:09     ` [PATCH v5 11/17] bloom: annotate filters with hash version Taylor Blau
2024-01-16 22:09     ` [PATCH v5 12/17] bloom: prepare to discard incompatible Bloom filters Taylor Blau
2024-01-16 22:09     ` [PATCH v5 13/17] commit-graph.c: unconditionally load " Taylor Blau
2024-01-16 22:09     ` [PATCH v5 14/17] commit-graph: drop unnecessary `graph_read_bloom_data_context` Taylor Blau
2024-01-16 22:09     ` [PATCH v5 15/17] object.h: fix mis-aligned flag bits table Taylor Blau
2024-01-16 22:09     ` [PATCH v5 16/17] commit-graph: reuse existing Bloom filters where possible Taylor Blau
2024-01-16 22:09     ` [PATCH v5 17/17] bloom: introduce `deinit_bloom_filters()` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231008173329.GA1557002@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=ebiederm@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).