git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: "René Scharfe" <l.s.r@web.de>,
	"SZEDER Gábor via GitGitGadget" <gitgitgadget@gmail.com>,
	git@vger.kernel.org, me@ttaylorr.com,
	"Derrick Stolee" <dstolee@microsoft.com>
Subject: Re: [PATCH v2 10/11] commit-graph: check all leading directories in changed path Bloom filters
Date: Fri, 26 Jun 2020 08:34:36 +0200	[thread overview]
Message-ID: <20200626063436.GA11341@szeder.dev> (raw)
In-Reply-To: <0fe96c75-2946-8160-2ced-3d9781dca8c0@gmail.com>

On Thu, Jun 25, 2020 at 11:05:04AM -0400, Derrick Stolee wrote:

> >> +	while (p > path) {
> >> +		if (is_dir_sep(*p))
> >> +			fill_bloom_key(path, p - path,
> >> +				       &revs->bloom_keys[path_component_nr++],
> >> +				       revs->bloom_filter_settings);
> >> +		p--;
> >> +	}
> > 
> > This walks the directory hierarchy upwards and adds bloom filters for
> > shorter and shorter paths, ("deepest first").  Good.
> > 
> > And it supports all directory separators.  On Windows that would be
> > slash (/) and backslash (\).  I assume paths are normalized to use
> > only slashes when bloom filters are written, correct?  Then the lookup
> > side needs to normalize a given path to only use slashes as well,
> > otherwise paths with backslashes cannot be found.  This part seems to
> > be missing.
> 
> Yes, that's a good point. We _require_ the paths to be normalized
> here to be Unix-style paths or else the Bloom filter keys are
> incorrect. Thankfully, they are.

Unfortunately, they aren't always...

Path normalization is done in normalize_path_copy_len(), whose
description says, among other things:

   * Performs the following normalizations on src, storing the result in dst:
   * - Ensures that components are separated by '/' (Windows only)

and the code indeed does:

        if (is_dir_sep(c)) {
                *dst++ = '/';

Now, while parsing pathspecs this function is called via:

  parse_pathspec()
    init_pathspec_item()
      prefix_path_gently()
        normalize_path_copy_len()

Unfortunately, init_pathspec_item() has this chain of conditions:

        /* Create match string which will be used for pathspec matching */
        if (pathspec_prefix >= 0) {
                match = xstrdup(copyfrom);
                prefixlen = pathspec_prefix;
        } else if (magic & PATHSPEC_FROMTOP) {
                match = xstrdup(copyfrom);
                prefixlen = 0;
        } else {
                match = prefix_path_gently(prefix, prefixlen,
                                           &prefixlen, copyfrom);
                if (!match) {
                        const char *hint_path = get_git_work_tree();
                        if (!hint_path)
                                hint_path = get_git_dir();
                        die(_("%s: '%s' is outside repository at '%s'"), elt,
                            copyfrom, absolute_path(hint_path));
                }
        }

which means that it doesn't always calls prefix_path_gently(), which,
in turn, means that 'pathspec_item->match' might remain un-normalized
in case of some unusual pathspecs.

The first condition is supposed to handle the case when one Git
process passes pathspecs to another, and is supposed to be "internal
use only"; see 233c3e6c59 (parse_pathspec: preserve prefix length via
PATHSPEC_PREFIX_ORIGIN, 2013-07-14), I haven't even tried to grok what
that might entail.

The second condition handles pathspecs explicitly relative to the root
of the work tree, i.e. ':/path'.  Adding a printf() to show the
original path and the resulting 'pathspec_item->match' does confirm
that no normalization is performed:

  expecting success of 9999.1 'test': 
          mkdir -p dir &&
          >dir/file &&
          git add ":/dir/file" &&
          git add ":(top)dir/file" &&
          test_might_fail git add ":/dir//file" &&
          git add ":(top)dir//file"
  
  orig:  ':/dir/file'
  match: 'dir/file'
  orig:  ':(top)dir/file'
  match: 'dir/file'
  orig:  ':/dir//file'
  match: 'dir//file'
  fatal: oops in prep_exclude
  orig:  ':(top)dir//file'
  match: 'dir//file'
  fatal: oops in prep_exclude
  not ok 1 - test

This is, of course, bad for Bloom filters, because the repeated
slashes are hashed as well and commits will be omitted from the output
of pathspec-limited revision walks, but apparently it also affects
other parts of Git.

And the else branch handles the rest, which, I believe, is by far the
most common case.

> Let's make that clear in-code by
> using '/' instead of is_dir_sep().
> 
> Thanks,
> -Stolee

  reply	other threads:[~2020-06-26  6:34 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 1/8] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-19 12:58     ` Derrick Stolee
2020-06-15 20:14 ` [PATCH 2/8] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-15 20:14 ` [PATCH 3/8] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-15 20:14 ` [PATCH 4/8] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-15 20:14 ` [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters SZEDER Gábor via GitGitGadget
2020-06-18 20:31   ` René Scharfe
2020-06-19  9:14     ` René Scharfe
2020-06-19 17:17   ` Taylor Blau
2020-06-19 17:19     ` Taylor Blau
2020-06-23 13:47     ` Derrick Stolee
2020-06-15 20:14 ` [PATCH 6/8] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 7/8] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 8/8] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-17 21:21 ` [PATCH 0/8] More commit-graph/Bloom filter improvements Junio C Hamano
2020-06-18  1:46   ` Derrick Stolee
2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 01/11] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 02/11] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 03/11] bloom: get_bloom_filter() cleanups Derrick Stolee via GitGitGadget
2020-06-25  7:24     ` René Scharfe
2020-06-23 17:47   ` [PATCH v2 04/11] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 05/11] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-23 17:47   ` [PATCH v2 06/11] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 14:59       ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 07/11] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 15:02       ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 08/11] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 09/11] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 10/11] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 15:05       ` Derrick Stolee
2020-06-26  6:34         ` SZEDER Gábor [this message]
2020-06-26 14:42           ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 11/11] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
2020-06-24 23:11   ` [PATCH v2 00/11] More commit-graph/Bloom filter improvements Junio C Hamano
2020-06-24 23:32     ` Derrick Stolee
2020-06-25  0:38       ` Junio C Hamano
2020-06-25 13:38         ` Derrick Stolee
2020-06-25 16:34           ` Junio C Hamano
2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
2020-06-27 16:33       ` SZEDER Gábor
2020-06-29 13:02         ` Derrick Stolee
2020-06-26 12:30     ` [PATCH v3 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-10-15 13:21         ` SZEDER Gábor
2020-10-15 21:41           ` Taylor Blau
2020-10-16  2:18             ` Derrick Stolee
2020-10-16  3:18               ` Taylor Blau
2020-10-16 13:52                 ` Derrick Stolee
2020-07-01 13:27       ` [PATCH v4 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200626063436.GA11341@szeder.dev \
    --to=szeder.dev@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=l.s.r@web.de \
    --cc=me@ttaylorr.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).