From: Johannes Schindelin <Johannes.Schindelin@gmx.de> To: "SZEDER Gábor" <firstname.lastname@example.org> Cc: email@example.com, Junio C Hamano <firstname.lastname@example.org>, Garima Singh <email@example.com>, Derrick Stolee <firstname.lastname@example.org>, Jakub Narebski <email@example.com>, Jeff King <firstname.lastname@example.org>, Taylor Blau <email@example.com> Subject: Re: [PoC PATCH 00/34] An alternative modified path Bloom filters implementation Date: Fri, 29 May 2020 00:09:44 +0200 (CEST) [thread overview] Message-ID: <nycvar.QRO.firstname.lastname@example.org> (raw) In-Reply-To: <email@example.com> [-- Attachment #1: Type: text/plain, Size: 4505 bytes --] Hi Gábor, On Fri, 29 May 2020, SZEDER Gábor wrote: > Sigh... but better late than never, right? Well, incremental patches on top of what is _already_ in Git are _even_ better. > I experimented quite a bit with modified path Bloom filters a year and > more ago, and got quite far... but my disappointment in the > inadequacy of all double hashing schemes, the arrival of split > commit-graphs, and, well, life in general has put the whole thing on > the back burner, and I haven't touched it for a couple of releases. > > Now I finally managed to take a closer look at the current changed > paths Bloom filters implementation, and saw that it has some of the > same issues that I had stumbled upon and that it missed some > optimization opportunities. Unfortunately, fixing those issues and > performing those optimizations do require a thorough format change. Thank you for this background information. Your claim that there are opportunities to optimize, as well as your claim that it requires a thorough format change, strike me as best backed up with evidence by porting your optimizations on top of what is already in `master` and which has been reviewed _already_ (as opposed to your new patch series, which essentially threatens to throw away all the time people spent on reviewing Garima's patches). > So here is my proof of concept version, in all its incompleteness, > with the following benefits: > > - Better understanding of the problem it tries to optimize. > - Better understanding of the issues with many small Bloom filters. > - Better hashing scheme (though it should be better still). > - Orders of magnitude lower average false positive rate. > - Consequently, faster pathspec-limited revision walks. > - Faster processing of the tree-diff output and lower memory usage > while computing Bloom filters (from scratch...). > - Optional support for storing Bloom filters for all parents of > merge commits. > - Deduplicates Bloom filters. > - Supports multiple pathspecs right from the start. > - Supports some wildcards in pathspecs. > - Handles as many commits as the commit-graph format can. > - It has the right name :) The diff machinery and all its frontends > report "modified" paths with the letter 'M', not "changed". > - More cleanups, more bugfixes. > - Consistent output with and without modified path Bloom filters for > over 80k random paths in 16 repositories, even with submodules in > them. Well, at least on my machine, if nowhere else... It strikes me as an obvious idea to make all those improvements in an incremental manner. > Alas, the drawbacks are significant: > > - No tests whatsoever. > - Computes all modified path Bloom filters from scratch when > writing, no matter what. > - Doesn't work with split commit-graphs. > - Basically if anything works besides 'git commit-graph write > --reachable' it's a miracle. > - Not a single test. You said that already in the first bullet point. But yes, I get it, there are no tests. > - Many BUG()s, which should rather be graceful errors... though I > have to admit that at this point they are indeed bugs. > - Many TODOs, both in commit messages and code, some incomplete > commit messages, crappy subject lines, even missing signoffs. > - Some ridiculously long variable, function, macro and config > variable names. > - It's based on v2.25.0 (no technical reason, but that's the version > I used to run the baseline benchmarks the last time, which takes > days...) > - I'm pretty sure that there are more... > - Oh, did I mention that there are no tests? Ah, your patch series has no tests. Please do understand that it can be perceived as quite frustrating that an alternative is presented _this_ late, when already such a lot of effort has gone into not only iterating several times on the patch series but also into reviewing all of them. I don't really think that I want to spend the time to review this alternative (not that I am an expert in the space) because it would imply that I help invalidating the effort that went into the current implementation. All this is to say: I appreciate your efforts to improve the path-based Bloom filters, at the same time I wish that those efforts would happen in a collaborative manner instead of simply dismissing other people's work. Ciao, Johannes
next prev parent reply other threads:[~2020-05-29 14:45 UTC|newest] Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-05-29 8:50 SZEDER Gábor 2020-05-28 22:09 ` Johannes Schindelin [this message] 2020-05-29 8:50 ` [PATCH 01/34] tree-walk.c: don't match submodule entries for 'submod/anything' SZEDER Gábor 2020-05-29 8:50 ` [PATCH 02/34] commit-graph: fix parsing the Chunk Lookup table SZEDER Gábor 2020-05-29 8:50 ` [PATCH 03/34] commit-graph-format.txt: all multi-byte numbers are in network byte order SZEDER Gábor 2020-05-29 8:50 ` [PATCH 04/34] commit-slab: add a function to deep free entries on the slab SZEDER Gábor 2020-06-04 16:43 ` Derrick Stolee 2020-06-27 15:56 ` SZEDER Gábor 2020-06-29 11:30 ` Derrick Stolee 2020-05-29 8:50 ` [PATCH 05/34] diff.h: drop diff_tree_oid() & friends' return value SZEDER Gábor 2020-05-29 8:50 ` [PATCH 06/34] commit-graph: clean up #includes SZEDER Gábor 2020-05-29 8:50 ` [PATCH 07/34] commit-graph: simplify parse_commit_graph() #1 SZEDER Gábor 2020-05-29 8:50 ` [PATCH 08/34] commit-graph: simplify parse_commit_graph() #2 SZEDER Gábor 2020-05-29 8:50 ` [PATCH 09/34] commit-graph: simplify write_commit_graph_file() #1 SZEDER Gábor 2020-05-29 8:50 ` [PATCH 10/34] commit-graph: simplify write_commit_graph_file() #2 SZEDER Gábor 2020-05-29 8:50 ` [PATCH 11/34] commit-graph: allocate the 'struct chunk_info' array dinamically SZEDER Gábor 2020-05-29 8:50 ` [PATCH 12/34] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor 2020-05-29 8:50 ` [PATCH 13/34] commit-graph: simplify write_commit_graph_file() #3 SZEDER Gábor 2020-05-29 8:50 ` [PATCH 14/34] commit-graph: check chunk sizes after writing SZEDER Gábor 2020-05-29 8:50 ` [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks SZEDER Gábor 2020-06-02 17:59 ` SZEDER Gábor 2020-05-29 8:50 ` [PATCH 16/34] Add a generic and minimal Bloom filter implementation SZEDER Gábor 2020-05-29 8:50 ` [PATCH 17/34] Import a streaming-capable Murmur3 hash function implementation SZEDER Gábor 2020-05-29 8:50 ` [PATCH 18/34] commit-graph: write "empty" Modified Path Bloom Filter Index chunk SZEDER Gábor 2020-05-29 8:50 ` [PATCH 19/34] commit-graph: add commit slab for modified path Bloom filters SZEDER Gábor 2020-05-29 8:50 ` [PATCH 20/34] commit-graph: fill the Modified Path Bloom Filter Index chunk SZEDER Gábor 2020-05-29 8:50 ` [PATCH 21/34] commit-graph: load and use " SZEDER Gábor 2020-05-29 8:50 ` [PATCH 22/34] commit-graph: write the Modified Path Bloom Filters chunk SZEDER Gábor 2020-05-29 8:50 ` [PATCH 23/34] commit-graph: load and use " SZEDER Gábor 2020-05-29 8:50 ` [PATCH 24/34] commit-graph: check all leading directories in modified path Bloom filters SZEDER Gábor 2020-05-29 8:50 ` [PATCH 25/34] commit-graph: check embedded modified path Bloom filters with a mask SZEDER Gábor 2020-05-29 8:50 ` [PATCH 26/34] commit-graph: deduplicate modified path Bloom filters SZEDER Gábor 2020-05-29 8:50 ` [PATCH 27/34] commit-graph: load modified path Bloom filters for merge commits SZEDER Gábor 2020-05-29 8:50 ` [PATCH 28/34] commit-graph: write Modified Path Bloom Filter Merge Index chunk SZEDER Gábor 2020-05-29 8:50 ` [PATCH 29/34] commit-graph: extract init and free write_commit_graph_context SZEDER Gábor 2020-05-29 8:50 ` [PATCH 30/34] commit-graph: move write_commit_graph_reachable below write_commit_graph SZEDER Gábor 2020-05-29 8:50 ` [PATCH 31/34] t7007-show: make the first test compatible with the next patch SZEDER Gábor 2020-05-29 8:50 ` [PATCH 32/34] PoC commit-graph: use revision walk machinery for '--reachable' SZEDER Gábor 2020-05-29 8:50 ` [PATCH 33/34] commit-graph: write modified path Bloom filters in "history order" SZEDER Gábor 2020-05-29 8:50 ` [PATCH 34/34] commit-graph: use modified path Bloom filters with wildcards, if possible SZEDER Gábor 2020-05-29 13:59 ` [PoC PATCH 00/34] An alternative modified path Bloom filters implementation Derrick Stolee 2020-06-01 23:25 ` Taylor Blau 2020-06-02 17:08 ` Junio C Hamano
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style List information: http://vger.kernel.org/majordomo-info.html * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=nycvar.QRO.firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --subject='Re: [PoC PATCH 00/34] An alternative modified path Bloom filters implementation' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
Code repositories for project(s) associated with this inbox: https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).