git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Git Bug Report: out of memory using git tag
@ 2022-10-28 22:29 Martin Englund
  2022-11-01 12:22 ` Jeff King
  0 siblings, 1 reply; 15+ messages in thread
From: Martin Englund @ 2022-10-28 22:29 UTC (permalink / raw)
  To: git

Thank you for filling out a Git bug report!
Please answer the following questions to help us understand your issue.

What did you do before the bug happened? (Steps to reproduce your issue)
I created a signed tag (git tag -s) using a ssh-agent key and then ran
git tag -l --format '%(contents:body)' v0.6.1

What did you expect to happen? (Expected behavior)
I get the output

What happened instead? (Actual behavior)
fatal: Out of memory, malloc failed (tried to allocate
18446744073709551266 bytes)

What's different between what you expected and what actually happened?
git tries to allocate an unreasonable amount of memory

Anything else you want to add:
I don't have 18,000 PB of memory

Please review the rest of the bug report below.
You can delete any lines you don't wish to share.


[System Info]
git version:
git version 2.38.1.280.g63bba4fdd8
cpu: x86_64
built from commit: 63bba4fdd86d80ef061c449daa97a981a9be0792
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
uname: Linux 5.15.0-1021-aws #25~20.04.1-Ubuntu SMP Thu Sep 22
13:59:08 UTC 2022 x86_64
compiler info: gnuc: 9.4
libc info: glibc: 2.31
$SHELL (typically, interactive shell): /bin/bash

Cheers,
/Martin
-- 
Martin Englund, martin@englund.nu / GnuPG key: FE91E717
http://blog.englund.nu/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Git Bug Report: out of memory using git tag
  2022-10-28 22:29 Git Bug Report: out of memory using git tag Martin Englund
@ 2022-11-01 12:22 ` Jeff King
  2022-11-02  0:41   ` Philippe Blain
  2022-11-02  0:42   ` Philippe Blain
  0 siblings, 2 replies; 15+ messages in thread
From: Jeff King @ 2022-11-01 12:22 UTC (permalink / raw)
  To: Martin Englund; +Cc: Philippe Blain, git

On Fri, Oct 28, 2022 at 03:29:33PM -0700, Martin Englund wrote:

> What did you do before the bug happened? (Steps to reproduce your issue)
> I created a signed tag (git tag -s) using a ssh-agent key and then ran
> git tag -l --format '%(contents:body)' v0.6.1
> 
> What did you expect to happen? (Expected behavior)
> I get the output
> 
> What happened instead? (Actual behavior)
> fatal: Out of memory, malloc failed (tried to allocate
> 18446744073709551266 bytes)

Thanks for the report. This looks like pointer or size_t arithmetic that
has gone negative. Here's a minimal reproduction:

  {
    echo subject
    echo "-----BEGIN PGP SIGNATURE-----"
  } | git tag -F - foo
  git tag -l --format='%(contents:body)' foo

The issue isn't unique to pgp signatures; the problem is in the parsing
done by ref-filter's find_subpos(), so any signature type exhibits the
problem. At the end of that function we do:

      *nonsiglen = sigstart - buf;

but "buf" has moved beyond "sigstart". Presumably because it uses
strstr() to look for end-of-line in buf. Since there isn't one before
the signature begins, we go to the end of the signature.

The bug bisects to 9f75ce3d8f (ref-filter: handle CRLF at end-of-line
more gracefully, 2020-10-29). Before then, I think our loop was careful
about moving past the start of the signature. Author cc'd.

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Git Bug Report: out of memory using git tag
  2022-11-01 12:22 ` Jeff King
@ 2022-11-02  0:41   ` Philippe Blain
  2022-11-02  7:39     ` Jeff King
  2022-11-02  0:42   ` Philippe Blain
  1 sibling, 1 reply; 15+ messages in thread
From: Philippe Blain @ 2022-11-02  0:41 UTC (permalink / raw)
  To: Jeff King, Martin Englund; +Cc: git

Hi,

Le 2022-11-01 à 08:22, Jeff King a écrit :
> On Fri, Oct 28, 2022 at 03:29:33PM -0700, Martin Englund wrote:
> 
>> What did you do before the bug happened? (Steps to reproduce your issue)
>> I created a signed tag (git tag -s) using a ssh-agent key and then ran
>> git tag -l --format '%(contents:body)' v0.6.1
>>
>> What did you expect to happen? (Expected behavior)
>> I get the output
>>
>> What happened instead? (Actual behavior)
>> fatal: Out of memory, malloc failed (tried to allocate
>> 18446744073709551266 bytes)
> 
> Thanks for the report. This looks like pointer or size_t arithmetic that
> has gone negative. Here's a minimal reproduction:
> 
>   {
>     echo subject
>     echo "-----BEGIN PGP SIGNATURE-----"
>   } | git tag -F - foo
>   git tag -l --format='%(contents:body)' foo
> 
> The issue isn't unique to pgp signatures; the problem is in the parsing
> done by ref-filter's find_subpos(), so any signature type exhibits the
> problem. At the end of that function we do:
> 
>       *nonsiglen = sigstart - buf;
> 
> but "buf" has moved beyond "sigstart". Presumably because it uses
> strstr() to look for end-of-line in buf. Since there isn't one before
> the signature begins, we go to the end of the signature.
> 
> The bug bisects to 9f75ce3d8f (ref-filter: handle CRLF at end-of-line
> more gracefully, 2020-10-29). Before then, I think our loop was careful
> about moving past the start of the signature. Author cc'd.
> 

Thanks for letting me know, and for the simple reproducer, that's very useful.
I'll try to find time to fix that this week, but can't promise anything.

Phil.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Git Bug Report: out of memory using git tag
  2022-11-01 12:22 ` Jeff King
  2022-11-02  0:41   ` Philippe Blain
@ 2022-11-02  0:42   ` Philippe Blain
  1 sibling, 0 replies; 15+ messages in thread
From: Philippe Blain @ 2022-11-02  0:42 UTC (permalink / raw)
  To: Jeff King, Martin Englund; +Cc: git

Hi,

Le 2022-11-01 à 08:22, Jeff King a écrit :
> On Fri, Oct 28, 2022 at 03:29:33PM -0700, Martin Englund wrote:
> 
>> What did you do before the bug happened? (Steps to reproduce your issue)
>> I created a signed tag (git tag -s) using a ssh-agent key and then ran
>> git tag -l --format '%(contents:body)' v0.6.1
>>
>> What did you expect to happen? (Expected behavior)
>> I get the output
>>
>> What happened instead? (Actual behavior)
>> fatal: Out of memory, malloc failed (tried to allocate
>> 18446744073709551266 bytes)
> 
> Thanks for the report. This looks like pointer or size_t arithmetic that
> has gone negative. Here's a minimal reproduction:
> 
>   {
>     echo subject
>     echo "-----BEGIN PGP SIGNATURE-----"
>   } | git tag -F - foo
>   git tag -l --format='%(contents:body)' foo
> 
> The issue isn't unique to pgp signatures; the problem is in the parsing
> done by ref-filter's find_subpos(), so any signature type exhibits the
> problem. At the end of that function we do:
> 
>       *nonsiglen = sigstart - buf;
> 
> but "buf" has moved beyond "sigstart". Presumably because it uses
> strstr() to look for end-of-line in buf. Since there isn't one before
> the signature begins, we go to the end of the signature.
> 
> The bug bisects to 9f75ce3d8f (ref-filter: handle CRLF at end-of-line
> more gracefully, 2020-10-29). Before then, I think our loop was careful
> about moving past the start of the signature. Author cc'd.
> 

Thanks for letting me know, and for the simple reproducer, that's very useful.
I'll try to find time to fix that this week, but can't promise anything.

Phil.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Git Bug Report: out of memory using git tag
  2022-11-02  0:41   ` Philippe Blain
@ 2022-11-02  7:39     ` Jeff King
  2022-11-02  7:42       ` [PATCH 1/2] ref-filter: fix parsing of signatures without blank lines Jeff King
                         ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Jeff King @ 2022-11-02  7:39 UTC (permalink / raw)
  To: Philippe Blain; +Cc: Martin Englund, git

On Tue, Nov 01, 2022 at 08:41:55PM -0400, Philippe Blain wrote:

> > The issue isn't unique to pgp signatures; the problem is in the parsing
> > done by ref-filter's find_subpos(), so any signature type exhibits the
> > problem. At the end of that function we do:
> > 
> >       *nonsiglen = sigstart - buf;
> > 
> > but "buf" has moved beyond "sigstart". Presumably because it uses
> > strstr() to look for end-of-line in buf. Since there isn't one before
> > the signature begins, we go to the end of the signature.
> > 
> > The bug bisects to 9f75ce3d8f (ref-filter: handle CRLF at end-of-line
> > more gracefully, 2020-10-29). Before then, I think our loop was careful
> > about moving past the start of the signature. Author cc'd.
> > 
> 
> Thanks for letting me know, and for the simple reproducer, that's very useful.
> I'll try to find time to fix that this week, but can't promise anything.

After sleeping on it, I think I fully understand what's going on. There
are actually _two_ bugs, but they are closely related. ;)

Here are patches which fix them both. I may be setting a new record for
the ratio of commit message lines to changed code. But it took me a
while to figure out what was going on, so I wanted to explain it fully.

  [1/2]: ref-filter: fix parsing of signatures without blank lines
  [2/2]: ref-filter: fix parsing of signatures with CRLF and no body

 ref-filter.c            |  8 ++++----
 t/t6300-for-each-ref.sh | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 4 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] ref-filter: fix parsing of signatures without blank lines
  2022-11-02  7:39     ` Jeff King
@ 2022-11-02  7:42       ` Jeff King
  2022-11-02  7:44       ` [PATCH 2/2] ref-filter: fix parsing of signatures with CRLF and no body Jeff King
                         ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2022-11-02  7:42 UTC (permalink / raw)
  To: Philippe Blain; +Cc: Martin Englund, git

When ref-filter is asked to show %(content:subject), etc, we end up in
find_subpos() to parse out the three major parts: the subject, the body,
and the signature (if any).

When searching for the blank line between the subject and body, if we
don't find anything, we try to treat the whole message as the subject,
with no body. But our idea of "the whole message" needs to take into
account the signature, too. Since 9f75ce3d8f (ref-filter: handle CRLF at
end-of-line more gracefully, 2020-10-29), the code instead goes all the
way to the end of the buffer, which produces confusing output.

Here's an example. If we have a tag message like this:

  this is the subject
  -----BEGIN SSH SIGNATURE-----
  ...some stuff...
  -----END SSH SIGNATURE-----

then the current parser will put the start of the body at the end of the
whole buffer. This produces two buggy outcomes:

  - since the subject length is computed as (body - subject), showing
    %(contents:subject) will print both the subject and the signature,
    rather than just the single line

  - since the body length is computed as (sig - body), and the body now
    starts _after_ the signature, we end up with a negative length!
    Fortunately we never access out-of-bounds memory, because the
    negative length is fed to xmemdupz(), which casts it to a size_t,
    and xmalloc() bails trying to allocate an absurdly large value.

    In theory it would be possible for somebody making a malicious tag
    to wrap it around to a more reasonable value, but it would require a
    tag on the order of 2^63 bytes. And even if they did, all they get
    is an out of bounds string read. So the security implications are
    probably not interesting.

We can fix both by correctly putting the start of the body at the same
index as the start of the signature (effectively making the body empty).

Note that this is a real issue with signatures generated with gpg.format
set to "ssh", which would look like the example above. In the new tests
here I use a hard-coded tag message, for a few reasons:

  - regardless of what the ssh-signing code produces now or in the
    future, we should be testing this particular case

  - skipping the actual signature makes the tests simpler to write (and
    allows them to run on more systems)

  - t6300 has helpers for working with gpg signatures; for the purposes
    of this bug, "BEGIN PGP" is just as good a demonstration, and this
    simplifies the tests

Curiously, the same issue doesn't happen with real gpg signatures (and
there are even existing tests in t6300 with cover this). Those have a
blank line between the header and the content, like:

  this is the subject
  -----BEGIN PGP SIGNATURE-----

  ...some stuff...
  -----END PGP SIGNATURE-----

Because we search for the subject/body separator line with a strstr(),
we find the blank line in the signature, even though it's outside of
what we'd consider the body. But that puts us unto a separate code path,
which realizes that we're now in the signature and adjusts the line back
to "sigstart". So this patch is basically just making the "no line found
at all" case match that. And note that "sigstart" is always defined (if
there is no signature, it points to the end of the buffer as you'd
expect).

Reported-by: Martin Englund <martin@englund.nu>
Signed-off-by: Jeff King <peff@peff.net>
---
 ref-filter.c            |  2 +-
 t/t6300-for-each-ref.sh | 15 +++++++++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/ref-filter.c b/ref-filter.c
index 914908fac5..6c2148c01e 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1380,7 +1380,7 @@ static void find_subpos(const char *buf,
 	/* check if message uses CRLF */
 	} else if (! (eol = strstr(*sub, "\r\n\r\n"))) {
 		/* treat whole message as subject */
-		eol = strrchr(*sub, '\0');
+		eol = sigstart;
 	}
 	buf = eol;
 	*sublen = buf - *sub;
diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index dcaab7265f..d7e70027e6 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -1406,4 +1406,19 @@ test_expect_success 'for-each-ref reports broken tags' '
 		refs/tags/broken-tag-*
 '
 
+test_expect_success 'set up tag with signature and no blank lines' '
+	git tag -F - fake-sig-no-blanks <<-\EOF
+	this is the subject
+	-----BEGIN PGP SIGNATURE-----
+	not a real signature, but we just care about the
+	subject/body parsing. It is important here that
+	there are no blank lines in the signature.
+	-----END PGP SIGNATURE-----
+	EOF
+'
+
+test_atom refs/tags/fake-sig-no-blanks contents:subject 'this is the subject'
+test_atom refs/tags/fake-sig-no-blanks contents:body ''
+test_atom refs/tags/fake-sig-no-blanks contents:signature "$sig"
+
 test_done
-- 
2.38.1.677.g4206adeb26


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] ref-filter: fix parsing of signatures with CRLF and no body
  2022-11-02  7:39     ` Jeff King
  2022-11-02  7:42       ` [PATCH 1/2] ref-filter: fix parsing of signatures without blank lines Jeff King
@ 2022-11-02  7:44       ` Jeff King
  2022-11-02  8:14       ` Git Bug Report: out of memory using git tag Elijah Newren
                         ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2022-11-02  7:44 UTC (permalink / raw)
  To: Philippe Blain; +Cc: Martin Englund, git

This commit fixes a bug when parsing tags that have CRLF line endings, a
signature, and no body, like this (the "^M" are marking the CRs):

  this is the subject^M
  -----BEGIN PGP SIGNATURE-----^M
  ^M
  ...some stuff...^M
  -----END PGP SIGNATURE-----^M

When trying to find the start of the body, we look for a blank line
separating the subject and body. In this case, there isn't one. But we
search for it using strstr(), which will find the blank line in the
signature.

In the non-CRLF code path, we check whether the line we found is past
the start of the signature, and if so, put the body pointer at the start
of the signature (effectively making the body empty). But the CRLF code
path doesn't catch the same case, and we end up with the body pointer in
the middle of the signature field. This has two visible problems:

  - printing %(contents:subject) will show part of the signature, too,
    since the subject length is computed as (body - subject)

  - the length of the body is (sig - body), which makes it negative.
    Asking for %(contents:body) causes us to cast this to a very large
    size_t when we feed it to xmemdupz(), which then complains about
    trying to allocate too much memory.

These are essentially the same bugs fixed in the previous commit, except
that they happen when there is a CRLF blank line in the signature,
rather than no blank line at all. Both are caused by the refactoring in
9f75ce3d8f (ref-filter: handle CRLF at end-of-line more gracefully,
2020-10-29).

We can fix this by doing the same "sigstart" check that we do in the
non-CRLF case. And rather than repeat ourselves, we can just use
short-circuiting OR to collapse both cases into a single conditional.
I.e., rather than:

  if (strstr("\n\n"))
    ...found blank, check if it's in signature...
  else if (strstr("\r\n\r\n"))
    ...found blank, check if it's in signature...
  else
    ...no blank line found...

we can collapse this to:

  if (strstr("\n\n")) ||
      strstr("\r\n\r\n")))
    ...found blank, check if it's in signature...
  else
    ...no blank line found...

The tests show the problem and the fix. Though it wasn't broken, I
included contents:signature here to make sure it still behaves as
expected, but note the shell hackery needed to make it work. A
less-clever option would be to skip using test_atom and just "append_cr
>expected" ourselves.

Signed-off-by: Jeff King <peff@peff.net>
---
 ref-filter.c            |  6 +++---
 t/t6300-for-each-ref.sh | 25 +++++++++++++++++++++++++
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index 6c2148c01e..9dc2cd1451 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1375,10 +1375,10 @@ static void find_subpos(const char *buf,
 	/* subject is first non-empty line */
 	*sub = buf;
 	/* subject goes to first empty line before signature begins */
-	if ((eol = strstr(*sub, "\n\n"))) {
+	if ((eol = strstr(*sub, "\n\n")) ||
+	    (eol = strstr(*sub, "\r\n\r\n"))) {
 		eol = eol < sigstart ? eol : sigstart;
-	/* check if message uses CRLF */
-	} else if (! (eol = strstr(*sub, "\r\n\r\n"))) {
+	} else {
 		/* treat whole message as subject */
 		eol = sigstart;
 	}
diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index d7e70027e6..fa38b87441 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -1421,4 +1421,29 @@ test_atom refs/tags/fake-sig-no-blanks contents:subject 'this is the subject'
 test_atom refs/tags/fake-sig-no-blanks contents:body ''
 test_atom refs/tags/fake-sig-no-blanks contents:signature "$sig"
 
+test_expect_success 'set up tag with CRLF signature' '
+	append_cr <<-\EOF |
+	this is the subject
+	-----BEGIN PGP SIGNATURE-----
+
+	not a real signature, but we just care about
+	the subject/body parsing. It is important here
+	that there is a blank line separating this
+	from the signature header.
+	-----END PGP SIGNATURE-----
+	EOF
+	git tag -F - --cleanup=verbatim fake-sig-crlf
+'
+
+test_atom refs/tags/fake-sig-crlf contents:subject 'this is the subject'
+test_atom refs/tags/fake-sig-crlf contents:body ''
+
+# CRLF is retained in the signature, so we have to pass our expected value
+# through append_cr. But test_atom requires a shell string, which means command
+# substitution, and the shell will strip trailing newlines from the output of
+# the substitution. Hack around it by adding and then removing a dummy line.
+sig_crlf="$(printf "%s" "$sig" | append_cr; echo dummy)"
+sig_crlf=${sig_crlf%dummy}
+test_atom refs/tags/fake-sig-crlf contents:signature "$sig_crlf"
+
 test_done
-- 
2.38.1.677.g4206adeb26

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Git Bug Report: out of memory using git tag
  2022-11-02  7:39     ` Jeff King
  2022-11-02  7:42       ` [PATCH 1/2] ref-filter: fix parsing of signatures without blank lines Jeff King
  2022-11-02  7:44       ` [PATCH 2/2] ref-filter: fix parsing of signatures with CRLF and no body Jeff King
@ 2022-11-02  8:14       ` Elijah Newren
  2022-11-02  9:13         ` gigantic commit messages, was " Jeff King
  2022-11-02  8:24       ` Eric Sunshine
                         ` (2 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Elijah Newren @ 2022-11-02  8:14 UTC (permalink / raw)
  To: Jeff King; +Cc: Philippe Blain, Martin Englund, git

On Wed, Nov 2, 2022 at 12:51 AM Jeff King <peff@peff.net> wrote:
>
> Here are patches which fix them both. I may be setting a new record for
> the ratio of commit message lines to changed code

It looks like the first patch is 72 lines of commit message for a
one-line fix, and the second patch is 61 lines of commit message for a
two line fix.

I don't know what the record ratio is, but it's at least 96[1], so
clearly you'll need to figure out how to pad your first commit message
with at least another 25 lines before this series can be accepted.
;-)

[1] https://lore.kernel.org/git/7vaac73265.fsf@alter.siamese.dyndns.org/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Git Bug Report: out of memory using git tag
  2022-11-02  7:39     ` Jeff King
                         ` (2 preceding siblings ...)
  2022-11-02  8:14       ` Git Bug Report: out of memory using git tag Elijah Newren
@ 2022-11-02  8:24       ` Eric Sunshine
  2022-11-02 12:13       ` Philippe Blain
  2022-11-03  0:42       ` Taylor Blau
  5 siblings, 0 replies; 15+ messages in thread
From: Eric Sunshine @ 2022-11-02  8:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Philippe Blain, Martin Englund, git

On Wed, Nov 2, 2022 at 3:44 AM Jeff King <peff@peff.net> wrote:
> After sleeping on it, I think I fully understand what's going on. There
> are actually _two_ bugs, but they are closely related. ;)
>
> Here are patches which fix them both. I may be setting a new record for
> the ratio of commit message lines to changed code. But it took me a
> while to figure out what was going on, so I wanted to explain it fully.

Well explained. For someone who has (probably) never looked at that
code (me), the explanations made perfect sense.

(Oh, and I didn't even have to report any s///.)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* gigantic commit messages, was Re: Git Bug Report: out of memory using git tag
  2022-11-02  8:14       ` Git Bug Report: out of memory using git tag Elijah Newren
@ 2022-11-02  9:13         ` Jeff King
  2022-11-02 14:26           ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2022-11-02  9:13 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Philippe Blain, Martin Englund, git

On Wed, Nov 02, 2022 at 01:14:59AM -0700, Elijah Newren wrote:

> On Wed, Nov 2, 2022 at 12:51 AM Jeff King <peff@peff.net> wrote:
> >
> > Here are patches which fix them both. I may be setting a new record for
> > the ratio of commit message lines to changed code
> 
> It looks like the first patch is 72 lines of commit message for a
> one-line fix, and the second patch is 61 lines of commit message for a
> two line fix.
> 
> I don't know what the record ratio is, but it's at least 96[1], so
> clearly you'll need to figure out how to pad your first commit message
> with at least another 25 lines before this series can be accepted.
> ;-)

Well, if we want to start digging things up... ;)

Try this:

  git log --no-merges --no-renames --format='%H %B' -z --numstat '*.c' |
  perl -0ne '
    chomp;
    if (s/^([0-9a-f]{40}) //) {
      if (defined $commit && $diff) {
        my $ratio = $body / $diff;
        print "$ratio $body $diff $commit\n";
      }
      $commit = $1;
      $body = () = /\n/g;
      $diff = 0;
    } elsif (/^\s*(\d+)\t/) {
      # this counts only added lines, under the assumption that
      # small commits generally remove/add in proportion. Of course
      # ones that _only_ remove lines have infinite ratios.
      $diff += $1;
    } else {
      die "confusing record: $_\n";
    }
  ' |
  sort -rn |
  head

which shows there are a few in the 100's. Pipe through:

  awk '{print $4}' |
  git log --stdin --no-walk=unsorted --stat

for a nicer view. I'm rejecting the top one on the grounds that it's
mostly cut-and-paste output, and also that #2 is mine. ;)

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Git Bug Report: out of memory using git tag
  2022-11-02  7:39     ` Jeff King
                         ` (3 preceding siblings ...)
  2022-11-02  8:24       ` Eric Sunshine
@ 2022-11-02 12:13       ` Philippe Blain
  2022-11-03  4:32         ` Jeff King
  2022-11-03  0:42       ` Taylor Blau
  5 siblings, 1 reply; 15+ messages in thread
From: Philippe Blain @ 2022-11-02 12:13 UTC (permalink / raw)
  To: Jeff King; +Cc: Martin Englund, git

Hi Peff,

Le 2022-11-02 à 03:39, Jeff King a écrit :
> 
> After sleeping on it, I think I fully understand what's going on. There
> are actually _two_ bugs, but they are closely related. ;)
> 
> Here are patches which fix them both. I may be setting a new record for
> the ratio of commit message lines to changed code. But it took me a
> while to figure out what was going on, so I wanted to explain it fully.
> 
>   [1/2]: ref-filter: fix parsing of signatures without blank lines
>   [2/2]: ref-filter: fix parsing of signatures with CRLF and no body

I read both patches and I concur with Eric, very well explained. I'm sorry
for letting that split through at the time; thanks a lot for the fixes.
I'm sure it took you less time that it would have taken me!

The new code is even clearer with this additional 'else if' removed.

One thing I think that is not mentioned in your commit messages is that 1/2
would also apply to a CRLF-using tag with no blank lines, i.e.

  this is the subject^M
  -----BEGIN PGP SIGNATURE-----^M
  ...some stuff...^M
  -----END PGP SIGNATURE-----^M

Parsing this tag correctly is fixed by 1/2, right?

Anyway thanks again for the fixes,

Philippe.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: gigantic commit messages, was Re: Git Bug Report: out of memory using git tag
  2022-11-02  9:13         ` gigantic commit messages, was " Jeff King
@ 2022-11-02 14:26           ` Ævar Arnfjörð Bjarmason
  2022-11-02 15:43             ` Elijah Newren
  0 siblings, 1 reply; 15+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-11-02 14:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Elijah Newren, Philippe Blain, Martin Englund, git


On Wed, Nov 02 2022, Jeff King wrote:

> On Wed, Nov 02, 2022 at 01:14:59AM -0700, Elijah Newren wrote:
>
>> On Wed, Nov 2, 2022 at 12:51 AM Jeff King <peff@peff.net> wrote:
>> >
>> > Here are patches which fix them both. I may be setting a new record for
>> > the ratio of commit message lines to changed code
>> 
>> It looks like the first patch is 72 lines of commit message for a
>> one-line fix, and the second patch is 61 lines of commit message for a
>> two line fix.
>> 
>> I don't know what the record ratio is, but it's at least 96[1], so
>> clearly you'll need to figure out how to pad your first commit message
>> with at least another 25 lines before this series can be accepted.
>> ;-)
>
> Well, if we want to start digging things up... ;)
>
> Try this:
>
>   git log --no-merges --no-renames --format='%H %B' -z --numstat '*.c' |
>   perl -0ne '
>     chomp;
>     if (s/^([0-9a-f]{40}) //) {
>       if (defined $commit && $diff) {
>         my $ratio = $body / $diff;
>         print "$ratio $body $diff $commit\n";
>       }
>       $commit = $1;
>       $body = () = /\n/g;
>       $diff = 0;
>     } elsif (/^\s*(\d+)\t/) {
>       # this counts only added lines, under the assumption that
>       # small commits generally remove/add in proportion. Of course
>       # ones that _only_ remove lines have infinite ratios.
>       $diff += $1;
>     } else {
>       die "confusing record: $_\n";
>     }
>   ' |
>   sort -rn |
>   head
>
> which shows there are a few in the 100's. Pipe through:
>
>   awk '{print $4}' |
>   git log --stdin --no-walk=unsorted --stat
>
> for a nicer view. I'm rejecting the top one on the grounds that it's
> mostly cut-and-paste output, and also that #2 is mine. ;)

I think that '*.c' is cheating, if anything I should be getting more
points when you remove that, as I've been over explaining
adding/removing a compiler flag or something. At least your #2 is tricky
C code :)

I haven't bothered to do this, but I think if you --word-diff
--word-diff-regex=. and parse the resulting diff you'd get "better"
results.

Or, for better & similar (but not the same): compute the levenshtein
distance of the pre- and post-image, and compute edit distance to commit
message length.

I haven't done that, but just from eyeballing it I think [1] beats your
[2] by that criteria. Per:
	
	$ perl -MText::Levenshtein=distance -wE 'say distance @ARGV' int unsigned
	6
	$ perl -MText::Levenshtein=distance -wE 'say distance @ARGV' "" _lf
	3

It should get 2x the score v.s. yours, but yours is <2x the
words/characters.

(Edit: But see [4] below)

There's also e.g. my [3] that's fairly high in the running per your
"only added lines". But I think it shows the perils of doing that,
i.e. in general I don't see why you'd omit deletions, that commit
message is certainly spending most of its time talking about why the
deletion of the code at hand is OK.

Once you count deletions it'll get *way* down the list, as it's 11
deleted lines, 1 added.

Hrm, I take some of the above back, I think [4] might be the winner.
That's just an edit distance of 1, so it's around 2x the commit message
length of yours if we adjust for your score of 6. (~2.5 by
characters)[5].

1. 356c4732950 (credential: treat CR/LF as line endings in the
   credential protocol, 2020-10-03)
2. aec0bba106d (config: work around gcc-10 -Wstringop-overflow warning,
   2020-08-04)
3. f97fe358576 (pickaxe -G: don't special-case create/delete,
   2021-04-12)
4. c58bebd4c67 (ci: update Cirrus-CI image to FreeBSD 12.3, 2022-05-25)
5. All measured with "git show --no-notes --no-patch <commit> | wc",
   because I was lazy.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: gigantic commit messages, was Re: Git Bug Report: out of memory using git tag
  2022-11-02 14:26           ` Ævar Arnfjörð Bjarmason
@ 2022-11-02 15:43             ` Elijah Newren
  0 siblings, 0 replies; 15+ messages in thread
From: Elijah Newren @ 2022-11-02 15:43 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, Philippe Blain, Martin Englund, git

On Wed, Nov 2, 2022 at 7:43 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> On Wed, Nov 02 2022, Jeff King wrote:
>
> > On Wed, Nov 02, 2022 at 01:14:59AM -0700, Elijah Newren wrote:
> >
> >> On Wed, Nov 2, 2022 at 12:51 AM Jeff King <peff@peff.net> wrote:
> >> >
> >> > Here are patches which fix them both. I may be setting a new record for
> >> > the ratio of commit message lines to changed code
> >>
> >> It looks like the first patch is 72 lines of commit message for a
> >> one-line fix, and the second patch is 61 lines of commit message for a
> >> two line fix.
> >>
> >> I don't know what the record ratio is, but it's at least 96[1], so
> >> clearly you'll need to figure out how to pad your first commit message
> >> with at least another 25 lines before this series can be accepted.
> >> ;-)
> >
> > Well, if we want to start digging things up... ;)
> >
> > Try this:
> >
> >   git log --no-merges --no-renames --format='%H %B' -z --numstat '*.c' |
> >   perl -0ne '
> >     chomp;
> >     if (s/^([0-9a-f]{40}) //) {
> >       if (defined $commit && $diff) {
> >         my $ratio = $body / $diff;
> >         print "$ratio $body $diff $commit\n";
> >       }
> >       $commit = $1;
> >       $body = () = /\n/g;
> >       $diff = 0;
> >     } elsif (/^\s*(\d+)\t/) {
> >       # this counts only added lines, under the assumption that
> >       # small commits generally remove/add in proportion. Of course
> >       # ones that _only_ remove lines have infinite ratios.
> >       $diff += $1;
> >     } else {
> >       die "confusing record: $_\n";
> >     }
> >   ' |
> >   sort -rn |
> >   head
> >
> > which shows there are a few in the 100's. Pipe through:
> >
> >   awk '{print $4}' |
> >   git log --stdin --no-walk=unsorted --stat
> >
> > for a nicer view. I'm rejecting the top one on the grounds that it's
> > mostly cut-and-paste output, and also that #2 is mine. ;)
>
> I think that '*.c' is cheating, if anything I should be getting more
> points when you remove that, as I've been over explaining
> adding/removing a compiler flag or something. At least your #2 is tricky
> C code :)
>
> I haven't bothered to do this, but I think if you --word-diff
> --word-diff-regex=. and parse the resulting diff you'd get "better"
> results.
>
> Or, for better & similar (but not the same): compute the levenshtein
> distance of the pre- and post-image, and compute edit distance to commit
> message length.
>
> I haven't done that, but just from eyeballing it I think [1] beats your
> [2] by that criteria. Per:
>
>         $ perl -MText::Levenshtein=distance -wE 'say distance @ARGV' int unsigned
>         6
>         $ perl -MText::Levenshtein=distance -wE 'say distance @ARGV' "" _lf
>         3
>
> It should get 2x the score v.s. yours, but yours is <2x the
> words/characters.
>
> (Edit: But see [4] below)
>
> There's also e.g. my [3] that's fairly high in the running per your
> "only added lines". But I think it shows the perils of doing that,
> i.e. in general I don't see why you'd omit deletions, that commit
> message is certainly spending most of its time talking about why the
> deletion of the code at hand is OK.
>
> Once you count deletions it'll get *way* down the list, as it's 11
> deleted lines, 1 added.
>
> Hrm, I take some of the above back, I think [4] might be the winner.
> That's just an edit distance of 1, so it's around 2x the commit message
> length of yours if we adjust for your score of 6. (~2.5 by
> characters)[5].
>
> 1. 356c4732950 (credential: treat CR/LF as line endings in the
>    credential protocol, 2020-10-03)
> 2. aec0bba106d (config: work around gcc-10 -Wstringop-overflow warning,
>    2020-08-04)
> 3. f97fe358576 (pickaxe -G: don't special-case create/delete,
>    2021-04-12)
> 4. c58bebd4c67 (ci: update Cirrus-CI image to FreeBSD 12.3, 2022-05-25)
> 5. All measured with "git show --no-notes --no-patch <commit> | wc",
>    because I was lazy.

Hehe, my offhand joke started a contest over the whimsical question of
who's the most long-winded.  I think my work here is done.  :-)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Git Bug Report: out of memory using git tag
  2022-11-02  7:39     ` Jeff King
                         ` (4 preceding siblings ...)
  2022-11-02 12:13       ` Philippe Blain
@ 2022-11-03  0:42       ` Taylor Blau
  5 siblings, 0 replies; 15+ messages in thread
From: Taylor Blau @ 2022-11-03  0:42 UTC (permalink / raw)
  To: Jeff King; +Cc: Philippe Blain, Martin Englund, git

On Wed, Nov 02, 2022 at 03:39:20AM -0400, Jeff King wrote:
> After sleeping on it, I think I fully understand what's going on. There
> are actually _two_ bugs, but they are closely related. ;)
>
> Here are patches which fix them both. I may be setting a new record for
> the ratio of commit message lines to changed code. But it took me a
> while to figure out what was going on, so I wanted to explain it fully.
>
>   [1/2]: ref-filter: fix parsing of signatures without blank lines
>   [2/2]: ref-filter: fix parsing of signatures with CRLF and no body

Indeed. I agree with others below in the thread that this is very well
explained, and the fix(es) both look sensible to me.

Will queue, thanks.


Thanks,
Taylor

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Git Bug Report: out of memory using git tag
  2022-11-02 12:13       ` Philippe Blain
@ 2022-11-03  4:32         ` Jeff King
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2022-11-03  4:32 UTC (permalink / raw)
  To: Philippe Blain; +Cc: Martin Englund, git

On Wed, Nov 02, 2022 at 08:13:27AM -0400, Philippe Blain wrote:

> >   [1/2]: ref-filter: fix parsing of signatures without blank lines
> >   [2/2]: ref-filter: fix parsing of signatures with CRLF and no body
> [...]
> One thing I think that is not mentioned in your commit messages is that 1/2
> would also apply to a CRLF-using tag with no blank lines, i.e.
> 
>   this is the subject^M
>   -----BEGIN PGP SIGNATURE-----^M
>   ...some stuff...^M
>   -----END PGP SIGNATURE-----^M
> 
> Parsing this tag correctly is fixed by 1/2, right?

Right. Because there's no blank line to find, neither the "\n\n" nor the
"\r\n\r\n" strstr matches, so it is essentially the same case. The fact
that the other non-blank lines are CRLF doesn't matter for this part of
the code. :)

> Anyway thanks again for the fixes,

No problem!

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-11-03  4:33 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-28 22:29 Git Bug Report: out of memory using git tag Martin Englund
2022-11-01 12:22 ` Jeff King
2022-11-02  0:41   ` Philippe Blain
2022-11-02  7:39     ` Jeff King
2022-11-02  7:42       ` [PATCH 1/2] ref-filter: fix parsing of signatures without blank lines Jeff King
2022-11-02  7:44       ` [PATCH 2/2] ref-filter: fix parsing of signatures with CRLF and no body Jeff King
2022-11-02  8:14       ` Git Bug Report: out of memory using git tag Elijah Newren
2022-11-02  9:13         ` gigantic commit messages, was " Jeff King
2022-11-02 14:26           ` Ævar Arnfjörð Bjarmason
2022-11-02 15:43             ` Elijah Newren
2022-11-02  8:24       ` Eric Sunshine
2022-11-02 12:13       ` Philippe Blain
2022-11-03  4:32         ` Jeff King
2022-11-03  0:42       ` Taylor Blau
2022-11-02  0:42   ` Philippe Blain

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).