git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [bug] git log --invert-grep --grep=[sufficiently complicated regex] prints nothing
@ 2022-11-23 20:17 Zack Weinberg
  2022-11-24 10:31 ` Phillip Wood
  0 siblings, 1 reply; 5+ messages in thread
From: Zack Weinberg @ 2022-11-23 20:17 UTC (permalink / raw)
  To: git

I’m attempting to have my blog, which is a static site generated from
a bunch of Markdown files stored in git, automatically pull the
the most recent modification date for each page out of the git history.
The idea is to use ‘git log --follow --pretty=tformat:'%ct' <file>‘ on
each file and then use the oldest reported timestamp as the creation
date and the newest reported timestamp as the last modification date.

But there’s a catch: there are commits I want to ignore in this
calculation, such as mechanical changes applied to the entire site.
And this brings me to the bug: --invert-grep doesn’t work correctly
when the --grep regex is sufficiently complicated.

Here's the complete set of commits that modified an example file:

$ git log --follow --pretty=tformat:'%ct %s' \
  src/posts/uncat/unearthed-arcana-music-division.md
1668545053 Begin restoring the site structure.
1668545051 Reorganize directory tree prior to setting up Metalsmith
1668545032 Mechanically convert Pandoc to standard YAML metadata delimiters.
1610735533 Mechanically convert to properly delimited YAML metadata.
1417101173 Correct slug for Uncategorized.
1417050416 The Great Dead and Moved Link Cleanup of 2014.
1416938128 Use category_meta plugin to fix category slugs.
1416763607 Initial import of content and Pelican skeleton.

And here's an application of --grep that prints only the commits I _don't_ want:

$ git log --follow -E --pretty=tformat:'%ct %s' \
  --grep='^(Mechanically convert|Begin restoring the site structure|Reorganize directory tree)' \
 src/posts/uncat/unearthed-arcana-music-division.md
1668545053 Begin restoring the site structure.
1668545051 Reorganize directory tree prior to setting up Metalsmith
1668545032 Mechanically convert Pandoc to standard YAML metadata delimiters.
1610735533 Mechanically convert to properly delimited YAML metadata.

Theoretically, adding --invert-grep to that should make it print the
commits I do want, but instead it prints nothing at all:

$ git log --follow -E --pretty=tformat:'%ct %s' --invert-grep \
  --grep='^(Mechanically convert|Begin restoring the site structure|Reorganize directory tree)' \
 src/posts/uncat/unearthed-arcana-music-division.md
[no output]

For clarity, I expected that to print

1417101173 Correct slug for Uncategorized.
1417050416 The Great Dead and Moved Link Cleanup of 2014.
1416938128 Use category_meta plugin to fix category slugs.
1416763607 Initial import of content and Pelican skeleton.

The problem seems to be related to the complexity of the regex, e.g.
(all examples with -E --invert-grep)

--grep='^(Mechanically convert|Begin restoring)'             # works correctly
--grep='^(Mechanically convert|Begin restoring|Reorganize)'  # prints nothing
--grep='^(Mechanically convert|Begin restoring|Correct slug)'# prints incorrect subset

Incidentally, if there is a better way to query the first and last
commit (with filtering) that touched a particular file — or even
a way to do the query for many files at once — please tell me about
it.

Thanks for your attention,
zw

[System Info]
git version:
git version 2.37.2
cpu: x86_64
no commit associated with this build
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
uname: Linux 6.0.0-4-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.0.8-1 (2022-11-11) x86_64
compiler info: gnuc: 12.1
libc info: glibc: 2.36
$SHELL (typically, interactive shell): /bin/bash


[Enabled Hooks]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bug] git log --invert-grep --grep=[sufficiently complicated regex] prints nothing
  2022-11-23 20:17 [bug] git log --invert-grep --grep=[sufficiently complicated regex] prints nothing Zack Weinberg
@ 2022-11-24 10:31 ` Phillip Wood
  2022-11-24 13:36   ` Zack Weinberg
  0 siblings, 1 reply; 5+ messages in thread
From: Phillip Wood @ 2022-11-24 10:31 UTC (permalink / raw)
  To: Zack Weinberg, git

Hi Zack

On 23/11/2022 20:17, Zack Weinberg wrote:
> I’m attempting to have my blog, which is a static site generated from
> a bunch of Markdown files stored in git, automatically pull the
> the most recent modification date for each page out of the git history.
> The idea is to use ‘git log --follow --pretty=tformat:'%ct' <file>‘ on
> each file and then use the oldest reported timestamp as the creation
> date and the newest reported timestamp as the last modification date.
> 
> But there’s a catch: there are commits I want to ignore in this
> calculation, such as mechanical changes applied to the entire site.
> And this brings me to the bug: --invert-grep doesn’t work correctly
> when the --grep regex is sufficiently complicated.
> 
> Here's the complete set of commits that modified an example file:
> 
> $ git log --follow --pretty=tformat:'%ct %s' \
>    src/posts/uncat/unearthed-arcana-music-division.md
> 1668545053 Begin restoring the site structure.
> 1668545051 Reorganize directory tree prior to setting up Metalsmith
> 1668545032 Mechanically convert Pandoc to standard YAML metadata delimiters.
> 1610735533 Mechanically convert to properly delimited YAML metadata.
> 1417101173 Correct slug for Uncategorized.
> 1417050416 The Great Dead and Moved Link Cleanup of 2014.
> 1416938128 Use category_meta plugin to fix category slugs.
> 1416763607 Initial import of content and Pelican skeleton.
> 
> And here's an application of --grep that prints only the commits I _don't_ want:
> 
> $ git log --follow -E --pretty=tformat:'%ct %s' \
>    --grep='^(Mechanically convert|Begin restoring the site structure|Reorganize directory tree)' \
>   src/posts/uncat/unearthed-arcana-music-division.md
> 1668545053 Begin restoring the site structure.
> 1668545051 Reorganize directory tree prior to setting up Metalsmith
> 1668545032 Mechanically convert Pandoc to standard YAML metadata delimiters.
> 1610735533 Mechanically convert to properly delimited YAML metadata.
> 
> Theoretically, adding --invert-grep to that should make it print the
> commits I do want, but instead it prints nothing at all:

I think the problem is that you are excluding the commit that renames 
the file and that stops --follow from following the rename. See below 
for a simple reproduction using git's test suite. I'm afraid I'm not 
familiar with the --follow code so I've no idea how to fix this.

Best Wishes

Phillip

---- >8 ----
#!/bin/sh

test_description='Demonstrate `git log --follow --grep` failure'

. ./test-lib.sh

test_expect_failure '--grep breaks --follow' '
	test_commit first file-a "$(seq 1 100)" &&
	git rm file-a &&
	test_commit second file-b "$(seq 2 101)" &&
	test_commit third file-b "$(seq 3 102)" &&
	git log --follow --grep "^first" file-b >actual &&
	git log first >expect &&
	test_cmp expect actual
'

test_expect_success '--grep breaks --follow' '
	git log --follow -E --grep "^(first|second)" file-b >actual &&
	git log second >expect &&
	test_cmp expect actual
'

test_done

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bug] git log --invert-grep --grep=[sufficiently complicated regex] prints nothing
  2022-11-24 10:31 ` Phillip Wood
@ 2022-11-24 13:36   ` Zack Weinberg
  2022-11-24 15:53     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 5+ messages in thread
From: Zack Weinberg @ 2022-11-24 13:36 UTC (permalink / raw)
  To: phillip.wood, git

On Thu, Nov 24, 2022, at 5:31 AM, Phillip Wood wrote:
> I think the problem is that you are excluding the commit that renames 
> the file and that stops --follow from following the rename. See below 
> for a simple reproduction using git's test suite.

Yes, I can confirm.  Both of the "incorrect output" examples I gave
involved excluding a commit that renames the file.  If I don't do that
there is no problem with an `a|b|c` regex.

> I'm afraid I'm not 
> familiar with the --follow code so I've no idea how to fix this.

I'm honestly unsure what the right behavior _should_ be, now.  I
expected --grep to be applied very late in the process, after the
set of commits touching the specified file had already been computed
(including all of its previous names, because of --follow) but the
documentation is ambiguous.

zw

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bug] git log --invert-grep --grep=[sufficiently complicated regex] prints nothing
  2022-11-24 13:36   ` Zack Weinberg
@ 2022-11-24 15:53     ` Ævar Arnfjörð Bjarmason
  2022-11-24 16:35       ` Jeff King
  0 siblings, 1 reply; 5+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-11-24 15:53 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: phillip.wood, git


On Thu, Nov 24 2022, Zack Weinberg wrote:

> On Thu, Nov 24, 2022, at 5:31 AM, Phillip Wood wrote:
> [...]
>> I'm afraid I'm not 
>> familiar with the --follow code so I've no idea how to fix this.
>
> I'm honestly unsure what the right behavior _should_ be, now.  I
> expected --grep to be applied very late in the process, after the
> set of commits touching the specified file had already been computed
> (including all of its previous names, because of --follow) but the
> documentation is ambiguous.

This doesn't help with your case, but I remember there being some
similar confusions and/or expectations mismatches reported in the
patch. E.g. "--since" here:
https://lore.kernel.org/git/220401.86pmm1nmvh.gmgdl@evledraar.gmail.com/

I couldn't find a reference quickly, but I seem to recall (but perhaps
I'm imagining it) that we had a report/discussion semi-recently about:

	git log --reverse --follow -- path

Which has a similar edge case, e.g. try on git.git:

	git log --reverse --follow -- object-name.c

That's also "correct" if you look at it from the POV of how we implement
it, i.e. we'll traverse, and the rename to object-name.c is the first
time we encounter the name from the POV of the walking logic.

Hrm, but shouldn't we show all commits *after* the rename then? Anyway,
I haven't thought about it this time around, just wanted to provide some
rabbit-hole references in case you're interested.

In terms of optimization it's *very* useful that we take these
shortcuts, but as your (and some of these) examples show it can yield
the wrong or unexpected result in some cases, and in those cases we
usually have no non-brute-force way of getting the "right" (or
"desired") result other (brute force being: parse "git log -p"
yourself).

So it would be nice in general if we had some ability to say what
filters apply at what stage in the walk, but I suspect that would
require a rather large UX overhaul...

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bug] git log --invert-grep --grep=[sufficiently complicated regex] prints nothing
  2022-11-24 15:53     ` Ævar Arnfjörð Bjarmason
@ 2022-11-24 16:35       ` Jeff King
  0 siblings, 0 replies; 5+ messages in thread
From: Jeff King @ 2022-11-24 16:35 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Zack Weinberg, phillip.wood, git

On Thu, Nov 24, 2022 at 04:53:28PM +0100, Ævar Arnfjörð Bjarmason wrote:

> This doesn't help with your case, but I remember there being some
> similar confusions and/or expectations mismatches reported in the
> patch. E.g. "--since" here:
> https://lore.kernel.org/git/220401.86pmm1nmvh.gmgdl@evledraar.gmail.com/
> 
> I couldn't find a reference quickly, but I seem to recall (but perhaps
> I'm imagining it) that we had a report/discussion semi-recently about:
> 
> 	git log --reverse --follow -- path
> 
> Which has a similar edge case, e.g. try on git.git:
> 
> 	git log --reverse --follow -- object-name.c
> 
> That's also "correct" if you look at it from the POV of how we implement
> it, i.e. we'll traverse, and the rename to object-name.c is the first
> time we encounter the name from the POV of the walking logic.

I think all of this goes back to Linus's original "--follow is a hack I
bolted on" implementation. It probably should happen up-front as part of
the history simplification. I think it's a combination of nobody wanting
to do the work to extract that, and that it may produce less "streaming"
output, as we have to do a lot more work before producing the first line
of output.

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-11-24 16:35 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-23 20:17 [bug] git log --invert-grep --grep=[sufficiently complicated regex] prints nothing Zack Weinberg
2022-11-24 10:31 ` Phillip Wood
2022-11-24 13:36   ` Zack Weinberg
2022-11-24 15:53     ` Ævar Arnfjörð Bjarmason
2022-11-24 16:35       ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).