git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: Johan Herland <johan@herland.net>
Cc: "brian m. carlson" <sandals@crustytoothpaste.net>,
	Git mailing list <git@vger.kernel.org>,
	Eric Sunshine <sunshine@sunshineco.com>
Subject: Re: [PATCH v2 03/22] t3305: annotate with SHA1 prerequisite
Date: Mon, 27 Jan 2020 13:22:10 +0100 (CET)	[thread overview]
Message-ID: <nycvar.QRO.7.76.6.2001271321290.46@tvgsbejvaqbjf.bet> (raw)
In-Reply-To: <CALKQrgc9EwjJL=u8eWbSbXOL9Hu8G0LD6cxTfaa50Nfx2iYDrg@mail.gmail.com>

Hi Johan,

On Mon, 27 Jan 2020, Johan Herland wrote:

> On Sun, Jan 26, 2020 at 10:29 PM Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> > On Sun, 26 Jan 2020, Johan Herland wrote:
> > > On Sun, Jan 26, 2020 at 12:16 PM Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > > > On Sat, 25 Jan 2020, brian m. carlson wrote:
> > > > > This test relies on a roughly equal distribution of hashes for notes in
> > > > > order to ensure that fanouts are compressed.  If there are subtrees with
> > > > > only one item left after removing notes, they'll end up still with one
> > > > > level of fanout, causing the test to fail.
> > > >
> > > > That is _almost_ correct: The heuristic wants to see one bucket that has
> > > > a note in it. Or something like that.
> > > >
> > > > See 73f77b909f8 (Notes API: for_each_note(): Traverse the entire notes
> > > > tree with a callback, 2010-02-13) for details. (Cc:ing Johan.)
> > >
> > > Something like that, yeah... Re-reading this code, I believe we stop
> > > the fanout at the current level when we can find one or more notes
> > > that do not share the high-nibble of their path with another note.
> > >
> > > Here we're at the top level, so this corresponds to looking at the
> > > very first hex character (0-9a-f) of the path (oid of annotated
> > > object), and if there are at least two such objects for each hex
> > > character, we will use a fanout of 1, otherwise, we collapse the
> > > fanout to 0.
> >
> > That makes sense, but when I looked at the failed test, there seems
> > something else at play, at least in addition to what you said: for
> > _adding_ notes, your description is 100% accurate, but when deleting
> > notes, we are apparently not collapsing `PTR_TYPE_SUBTREE` nodes "quickly"
> > enough. Let me show you the part of `git ls-tree -r refs/notes/commits`
> > that starts with the hex digits 7, 8 and 9:
> >
> > -- snip --
> > 100644 blob 22939636f79a53181ea58f04f2136bd745976edd0018465e60a8df89816a9c2b 72/67b1c94ab5ddf005fd4b0e50b6a7816a62e7ed0459cec6aa1a00577b2111ba
> > 100644 blob 2e3f3dad7043b2d02d09ca12b299acbe05d781357d3a56a2d012fdf787409459 76/62d4b264094ca479be86ef7aa66daae63e45afa633a3892dc787b13ecff495
> > 100644 blob 22e7f0c5af7315d692f8a107a43ba0784e1ab00a20ea803fab1acad1319e5f79 7b/8c0a9c86815a94da7ef90b356b1f98d6a4099af3fdc3d8625a8fa793b63821
> > 100644 blob b3ce67c9d5507dc00d95d8fb2000c1d5b70908ad1d2c034e5833f57b7bb85511 7f/0af9cf9259cd6e67c0af3324ca443dd3d56694fbdc94d28e300a768d3d0e6e
> > 100644 blob fafb40e32e87a2c481df6ceb37804d80c995faec3d7772c071b129fd47c2ba8a 8a/f1ad99a5e559f5835007f2bcb1b07de0e8c7434e7fbaa676a2edbd796a7f60
> > 100644 blob b811e8da7c7acc83ef025753504ff6ed2d1eb8d2bb832d6b7487a7786d67aa53 93/f364fafaee6a178d8e8939f15d5b260f71940f663c3731396ee43082fd6551
> > 100644 blob 0b1915ccb6f64cac64f2297893bce1ba7408660f79182302faaada71ee8c3c1c 96/b562b168f94e5c6e6a1e40ec4b817aa168c99769b1d9a46f7e048e93897fb4
> > -- snap --
> >
> > This command was run in the worktree after running an unmodified t3305
> > with `GIT_TEST_DEFAULT_HASH=sha256` in brian's `transition-stage-4` branch
> > (well, I removed the `SHA1` prereq again, that's the modification I made).
> >
> > As you can see, there _is_ a fanout of 1, but there is only a single note
> > for a commit whose ID starts with `8`.
>
> Ah, yes, I glossed over the subtrees in my analysis, blindly assuming
> that they would collapse more "quickly" than they do...
>
> It seems the code values not unpacking subtrees "unnecessarily", and
> this causes them to remain for longer than perhaps ideal.
>
> > Debugging a little, it seems that the `PTR_TYPE_SUBTREE` node for `8` was
> > not collapsed, even if there was only one item left.
> >
> > So I formed a hypothesis that the subtrees are only removed when the
> > _last_ item is removed for any given leading nybble, but that turned out
> > to also be incorrect. It is a bit more tricky than that. This is the
> > smallest set to which I was able to reduce the notes without reducing the
> > fanout to 0:
> >
> > -- snip --
> > 100644 blob cc2ee5e00d8bc3805d704b67439dc8175bcf9497603288e6de2d4d8b3fc7be9f    05/b6e3d1394d020129f71fd8e41cf7ea8cbc58ae0f1332abd5da0c74ea194b71
> > 100644 blob 1c6afe76a1bcd0103c36ab9707a2ca9e68974b6a6bbaae564c0509c43b4392bd    1e/311d64dc3ad5491964bcded60fee15b19d5b9c916a7e62a4f0746fa4e81fa6
> > 100644 blob cf97d105e3970ef1cf9b12ac092be80abcd496c593bc8ca5550d059c3967630a    28/79e092d524b7ae9a42026ab2886094cce8ffc63175f8b3fd5de84faef10df3
> > 100644 blob 78ef788b804dd0e5415c386b4a29668f61033483c35438f0471dfd7c4bfa093b    36/fe8fe67a2e9d0203c665d6e08ca454833ec32a97369769a7138d3938c0000b
> > 100644 blob f46845c2d7e3272319aa5c18e359fdbc37731c88a945bb9632b8c4321983c75a    4a/a271a09d848f99d3fb978e5c156baa812a0fa1c30a88c885831641630be01a
> > 100644 blob ecf5a4a178ca4b51cea457abe7c935761ca15a1d817f83c2da6816ede84db779    51/eaee3ca1a8698cb0aaa4de2d3f339985570da68b28e84af752e1cfd25f5197
> > 100644 blob 2f4e9d6a4a1d050f8cb932ba545e53b48d9b669f6e00dc4a962d88ee9d92482b    62/dd63d43070c3ca7e3a6cdfa4ed970256a00a06e88d10fcb0532acd51419e0f
> > 100644 blob 22939636f79a53181ea58f04f2136bd745976edd0018465e60a8df89816a9c2b    72/67b1c94ab5ddf005fd4b0e50b6a7816a62e7ed0459cec6aa1a00577b2111ba
> > 100644 blob fafb40e32e87a2c481df6ceb37804d80c995faec3d7772c071b129fd47c2ba8a    8a/f1ad99a5e559f5835007f2bcb1b07de0e8c7434e7fbaa676a2edbd796a7f60
> > 100644 blob b811e8da7c7acc83ef025753504ff6ed2d1eb8d2bb832d6b7487a7786d67aa53    93/f364fafaee6a178d8e8939f15d5b260f71940f663c3731396ee43082fd6551
> > 100644 blob 589656c26944c471b5ac65739f8c7b96663a9f827a3d27beb49e39e3707b7294    a2/0e8f30856061125d479779755ef3238a7b561f9336e0143c437daac7d93f4c
> > 100644 blob 7eaa9350d5c6bd6fd0fc4071c5d6a266949046c67987a9e1f665ba34d95f419d    a2/80d13a05aa68cc5ef948a8b69067807457fd37c00ce4a234fa4a0c0753ef4e
> > 100644 blob 7a8378dd60d6024db645757eac7271a80b101a2df230ebd1a57ff52ff2d32e36    b9/2a3a7dc6db290d93f79150bfb31447f3e550cfe4a63c5ddbeac18fec755e86
> > 100644 blob 86fff6007d9911249bee803a6630e182677355a5574e637d1a8301e219c5da86    c1/52ffd73d1c5e7c121c7c247682f1ee971f6f09101c96a84486d18be41d0dd0
> > 100644 blob cd887698e1da81a76ae1caf0eaec19d60830a13ead152ec4700be511ceb8ee33    d0/3f742b8b95f68478946d7fe7495da9462801fadeaaa06c11bf54dbc46610f5
> > 100644 blob 19f7bdfee9687311dbe1195e0a64954b677ae68e6d734fd5fb76ea4ad4f93782    ea/356ae2d38123b46639db98df14953f4c7cdd91738779174ec67876ce9487e3
> > 100644 blob 948c0cba23ec0405c622c9dea8ed8dd7b3fa043c86b5e5a8b4de0d1c6a0e67b9    f7/802d4c716fed3c76fe58c86ac7c3ae3e19b8c0d3ea97c9f90f5939fe5a78d8
> > 100644 blob 28690ad489e29e3607c82e1f626ec24d7f831555c802108bfe9b993fbd794a7e    f7/da03e811b7d9071ee19dabdc721e0f863e28c92dfa3257474282396a73bb44
> > -- snap --
> >
> > You will note that there are two entries that start with `a2`, and two
> > entries that start with `f7`. If I remove any of those, the corresponding
> > subtree will be collapsed, and the fanout will be reduced to 0.
>
> I'm looking at note_tree_remove() and note_tree_consolidate() to
> explain this behavior. AFAICS, at the point where you remove, say, the
> a20e8f entry, the note tree structure should consist of a single
> int_node containing 16 subtree entries. note_tree_remove() will search
> for the a20e8f entry which will cause the a2 subtree to be unpacked
> and replaced with an int_node (representing 'a') referencing another
> int_node (representing '2') containing two leaf_nodes. One of the leaf
> nodes will be removed, and note_tree_consolidate() will then replace
> the 2 int_nodes with the last remaining leaf_node. At this point the
> root int_node should now contain 15 subtrees (not yet unpacked) and 1
> "regular" leaf_node.
>
> We then move on to write_notes_tree() to write out the resulting tree.
> This ends up calling determine_fanout() (via for_each_note_helper()),
> which will look at the root int_node, find the one "regular" leaf_node
> there, and use that to return fanout = 0. (At which point the other
> subtrees will be unpacked and the entire tree is "flattened".)
>
> I now wonder why this did not happen before we got down to 17 notes.
> Let's assume (as is most probable) that the previous notes removed
> only shared _one_ leading nibble (not two). The root int_node would
> have an int_node for the first-nibble which would contain two
> subtrees, one for each second-nibble. One of these would be unpacked
> and promptly removed, and then note_tree_consolidate() would be left
> with the int_node for the first-nibble containing a single subtree
> entry. AFAICS note_tree_consolidate() does _not_ collapse this
> int_node (and move the subtree into the root node), although I can't
> immediately see why it could/should not do this. Even if it did,
> though, that would not be enough to trigger a lower fanout, as
> determine_fanout() does not distinguish between int_nodes and subtree
> nodes.
>
> However, I suspect there may be room for further improvement here: If
> we find ourselves consolidating an int_node whose _only_ non-NULL
> entry is a subtree, it is a fairly safe assumption that the subtree
> itself is probably close to empty as well, and we can probably bear
> the cost of more eagerly unpacking it, as we are then more likely to
> trigger a lower fanout when writing out the notes tree. At that point
> going from fanout 1 to fanout 0 should certainly happen at a less
> ridiculous total note count than 17...

Thank you for digging in even further!

> > But it is only happenstance with SHA-256 that there are these entries that
> > agree not only in the first, but also in the second leading nybble.
>
> True.
>
> > Therefore...
> >
> > > Hence we need an absolute minimum of 32 notes (and some rotten luck)
> > > to get a fanout of 1. As the number of notes increase, the probably of
> > > fanning out increases, passing 50% at ~79 notes, and reaching ~100%
> > > somewhere north of 150 notes.
> >
> > ... I would register that we need an absolute minimum of 16 notes (and
> > some rather crafty craft) to get a fanout of 1.
> >
> > In that light, I think that I would prefer to retract my patch that
> > "only" reduces the remaining number of notes to 20: it should reduce them
> > to 15 or less. So why not reduce it to 10 (because it is only one changed
> > digit).
>
> Agreed. The thing we're looking for in the test (and what is certainly
> more important) is that we _do_ consolidate the fanout when the note
> count decreases. The details around exactly when that happens is more
> of a performance tuning issue, and not something that should break the
> test.
>
> > > > > The test happens to pass with SHA-1, but doesn't necessarily with other
> > > > > hash algorithms, so annotate it with the SHA1 prerequisite.
> > > >
> > > > I would rather see this tested, still, and reducing the number of notes
> > > > that are retained from 50 to 20 before testing that the fanout has been
> > > > reduced to 0 seems to do the trick. Therefore, I would love to submit this
> > > > for squashing:
> > >
> > > Yes, it seems that for SHA1 and the (deterministic) objects used in
> > > the test, we got away with 50 notes, but that is not the case for
> > > other hash algorithms. Lowering the number to 20 definitely results a
> > > fanout of 0, as should any other number below 32.
> > >
> > > +1 to Dscho's squash.
> > >
> > > ...Johan
> >
> > Thank you so much for the analysis. To be honest, I did not quite
> > understand all the details of the comment added in 73f77b989f8 when I
> > wrote the patch I suggested, so I basically just picked that number "20"
> > out of thin air.
> >
> > Together with your insights, I would like to propose this commit message
> > for the squashed commits (I left in the hunk that removes the `SHA1`
> > prerequisite, but of course that won't be part of the final commit):
> >
> > -- snip --
> > t3305: make fanout test more robust (needed for SHA-256)
> >
> > To make things more performant, notes are stored in a "fanout": when
> > there are enough commit notes, they are no longer stored as verbatim
> > commit IDs at the top-level tree of the notes ref, but instead the tree
> > is deepened much like the loose object cache: subtrees are introduced
> > whose names are the two hex digits they "chomp off" the commit IDs.
> >
> > The test case 'deleting most notes triggers fanout consolidation' wants
> > to verify that the fanout level is reduced automatically when enough
> > notes have been deleted.
> >
> > However, that test case expected that reduction to level 0 (i.e. _no_
> > fanout subtrees) to happen after reducing the originally-added 300 notes
> > to 50, which _happened_ to work with SHA-1-based commit IDs, but it is
> > no longer works with SHA-256-based ones.
> >
> > The reason: The heuristic for the fanout looks at the number of entries
> > for leading nybbles (read: hex digits) of the commit IDs. If there are
> > more than a single annotated commit for all of the 16 hex digits, the
> > fanout is incremented. It is a bit more tricky when reducing the number
> > of notes: the fanout is reduced reliably only if there are less notes
> > than hex digits (i.e. less than 15 notes) for a given prefix.
> >
> > For good measure, let's reduce the number of notes to 10 in the test
> > case 'deleting most notes with git-notes' so that the test case
> > 'deleting most notes triggers fanout consolidation' is guaranteed to
> > succeed with _any_ hash algorithm.
>
> Great commit message.
>
> > Original-patch-by: brian m. carlson <sandals@crustytoothpaste.net>
> > Helped-by: Johan Herland <johan@herland.net>
> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> Signed-off-by: Johan Herland <johan@herland.net>

Excellent!

> Have fun!
> ...Johan

You, too. Thank you!
Dscho

>
> --
> Johan Herland, <johan@herland.net>
> www.herland.net
>

  reply	other threads:[~2020-01-27 12:22 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-25 23:00 [PATCH v2 00/23] SHA-256 test fixes, part 8 brian m. carlson
2020-01-25 23:00 ` [PATCH v2 01/22] t/lib-pack: support SHA-256 brian m. carlson
2020-01-25 23:00 ` [PATCH v2 02/22] t3206: make hash size independent brian m. carlson
2020-01-25 23:00 ` [PATCH v2 03/22] t3305: annotate with SHA1 prerequisite brian m. carlson
2020-01-26 11:15   ` Johannes Schindelin
2020-01-26 18:18     ` Johan Herland
2020-01-26 21:28       ` Johannes Schindelin
2020-01-26 21:50         ` brian m. carlson
2020-01-26 23:57         ` Johan Herland
2020-01-27 12:22           ` Johannes Schindelin [this message]
2020-01-26 19:44     ` brian m. carlson
2020-01-25 23:00 ` [PATCH v2 04/22] t3308: make test work with SHA-256 brian m. carlson
2020-01-25 23:00 ` [PATCH v2 05/22] t3309: " brian m. carlson
2020-01-25 23:00 ` [PATCH v2 06/22] t3310: " brian m. carlson
2020-01-25 23:00 ` [PATCH v2 07/22] t3311: " brian m. carlson
2020-01-25 23:00 ` [PATCH v2 08/22] t4013: make test hash independent brian m. carlson
2020-01-26 22:08   ` Johannes Schindelin
2020-01-26 22:26     ` brian m. carlson
2020-01-25 23:00 ` [PATCH v2 09/22] t4060: make test work with SHA-256 brian m. carlson
2020-01-25 23:00 ` [PATCH v2 10/22] t4211: make test hash independent brian m. carlson
2020-01-26 22:13   ` Johannes Schindelin
2020-01-25 23:00 ` [PATCH v2 11/22] t5302: make hash size independent brian m. carlson
2020-01-26 22:23   ` Johannes Schindelin
2020-01-26 22:33     ` brian m. carlson
2020-01-25 23:00 ` [PATCH v2 12/22] t5309: make test hash independent brian m. carlson
2020-01-25 23:00 ` [PATCH v2 13/22] t5313: " brian m. carlson
2020-01-28 18:15   ` Junio C Hamano
2020-01-25 23:00 ` [PATCH v2 14/22] t5321: " brian m. carlson
2020-01-25 23:00 ` [PATCH v2 15/22] t5515: " brian m. carlson
2020-01-28 18:28   ` Junio C Hamano
2020-01-29  4:04     ` brian m. carlson
2020-01-25 23:00 ` [PATCH v2 16/22] t5318: update for SHA-256 brian m. carlson
2020-01-25 23:00 ` [PATCH v2 17/22] t5607: make hash size independent brian m. carlson
2020-01-25 23:00 ` [PATCH v2 17/23] t5616: use correct filter syntax brian m. carlson
2020-01-28 19:06   ` Junio C Hamano
2020-01-29  3:53     ` brian m. carlson
2020-01-25 23:00 ` [PATCH v2 18/23] t5607: make hash size independent brian m. carlson
2020-01-25 23:00 ` [PATCH v2 18/22] t5703: make test work with SHA-256 brian m. carlson
2020-01-28 19:09   ` Junio C Hamano
2020-01-29  3:46     ` brian m. carlson
2020-01-25 23:00 ` [PATCH v2 19/23] " brian m. carlson
2020-01-25 23:00 ` [PATCH v2 19/22] t5703: switch tests to use test_oid brian m. carlson
2020-01-25 23:00 ` [PATCH v2 20/23] " brian m. carlson
2020-01-25 23:00 ` [PATCH v2 20/22] t6000: abstract away SHA-1-specific constants brian m. carlson
2020-01-25 23:00 ` [PATCH v2 21/23] " brian m. carlson
2020-01-25 23:00 ` [PATCH v2 21/22] t6006: make hash size independent brian m. carlson
2020-01-25 23:00 ` [PATCH v2 22/23] " brian m. carlson
2020-01-25 23:00 ` [PATCH v2 22/22] t6024: update for SHA-256 brian m. carlson
2020-01-25 23:00 ` [PATCH v2 23/23] " brian m. carlson
2020-01-26 10:25 ` [PATCH v2 00/23] SHA-256 test fixes, part 8 Johannes Schindelin
2020-01-26 19:42   ` brian m. carlson
2020-01-26 21:30     ` Johannes Schindelin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=nycvar.QRO.7.76.6.2001271321290.46@tvgsbejvaqbjf.bet \
    --to=johannes.schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    --cc=johan@herland.net \
    --cc=sandals@crustytoothpaste.net \
    --cc=sunshine@sunshineco.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).