git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] mailinfo: support Unicode scissors
@ 2019-03-31 22:01 Andrei Rybak
  2019-03-31 23:09 ` SZEDER Gábor
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Andrei Rybak @ 2019-03-31 22:01 UTC (permalink / raw)
  To: git; +Cc: Andrei Rybak

'git am --scissors' allows cutting a patch from an email at a scissors
line.  Such a line should contain perforation, i.e. hyphens, and a
scissors symbol.  Only ASCII graphics scissors '8<' '>8' '%<' '>%' are
recognized by 'git am --scissors' command at the moment.

Unicode character 'BLACK SCISSORS' (U+2702) has been a part of Unicode
since version 1.0.0 [1].  Since then 'BLACK SCISSORS' also became part
of character set Emoji 1.0, published in 2015 [2].  With its adoption as
an emoji, availability of this character on keyboards has increased.

Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am
--scissors' to be able to cut at Unicode perforation lines in emails.
Note, that Unicode character '✂' is three bytes in UTF-8 encoding.

1. https://www.unicode.org/versions/Unicode1.0.0/CodeCharts1.pdf
   https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt
2. https://unicode.org/Public/emoji/1.0/emoji-data.txt

Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com>
---

This applies on top of ar/t4150-remove-cruft merged into next in 
commit a0106a8d5c, which also edited the test setup in t4150.

 mailinfo.c    |  7 +++++++
 t/t4150-am.sh | 26 +++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/mailinfo.c b/mailinfo.c
index b395adbdf2..4ef6cdee85 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
 			c++;
 			continue;
 		}
+		if (!memcmp(c, "✂", 3)) {
+			in_perforation = 1;
+			perforation += 3;
+			scissors += 3;
+			c++;
+			continue;
+		}
 		in_perforation = 0;
 	}
 
diff --git a/t/t4150-am.sh b/t/t4150-am.sh
index 3f7f750cc8..3ea8e8a2cf 100755
--- a/t/t4150-am.sh
+++ b/t/t4150-am.sh
@@ -77,12 +77,20 @@ test_expect_success 'setup: messages' '
 
 	printf "Subject: " >subject-prefix &&
 
-	cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF
+	cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF &&
 	This line should not be included in the commit message with --scissors enabled.
 
 	 - - >8 - - remove everything above this line - - >8 - -
 
 	EOF
+
+	cat - subject-prefix msg-without-scissors-line >msg-with-unicode-scissors <<-\EOF
+	Lines above unicode scissors line should not be included in the commit
+	message with --scissors enabled.
+
+	- - - ✂ - - - ✂ - - -
+
+	EOF
 '
 
 test_expect_success setup '
@@ -161,6 +169,12 @@ test_expect_success setup '
 	git format-patch --stdout expected-for-no-scissors^ >patch-with-scissors-line.eml &&
 	git reset --hard HEAD^ &&
 
+	echo file >file &&
+	git add file &&
+	git commit -F msg-with-unicode-scissors &&
+	git format-patch --stdout HEAD^ >patch-with-unicode-scissors.eml &&
+	git reset --hard HEAD^ &&
+
 	sed -n -e "3,\$p" msg >file &&
 	git add file &&
 	test_tick &&
@@ -421,6 +435,16 @@ test_expect_success 'am --scissors cuts the message at the scissors line' '
 	test_cmp_rev expected-for-scissors HEAD
 '
 
+test_expect_success 'am --scissors cuts the message at the unicode scissors line' '
+	rm -fr .git/rebase-apply &&
+	git reset --hard &&
+	git checkout second &&
+	git am --scissors patch-with-unicode-scissors.eml &&
+	test_path_is_missing .git/rebase-apply &&
+	git diff --exit-code expected-for-scissors &&
+	test_cmp_rev expected-for-scissors HEAD
+'
+
 test_expect_success 'am --no-scissors overrides mailinfo.scissors' '
 	rm -fr .git/rebase-apply &&
 	git reset --hard &&
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] mailinfo: support Unicode scissors
  2019-03-31 22:01 [PATCH] mailinfo: support Unicode scissors Andrei Rybak
@ 2019-03-31 23:09 ` SZEDER Gábor
  2019-04-01  9:07   ` Junio C Hamano
  2019-04-01 10:11   ` Jeff King
  2019-04-01  9:27 ` Duy Nguyen
  2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak
  2 siblings, 2 replies; 16+ messages in thread
From: SZEDER Gábor @ 2019-03-31 23:09 UTC (permalink / raw)
  To: Andrei Rybak; +Cc: git

On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote:
> diff --git a/mailinfo.c b/mailinfo.c
> index b395adbdf2..4ef6cdee85 100644
> --- a/mailinfo.c
> +++ b/mailinfo.c
> @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
>  			c++;
>  			continue;
>  		}
> +		if (!memcmp(c, "✂", 3)) {

This character is tiny.  Please add a comment that it's supposed to be
a Unicode scissors character.

Should we worry about this memcmp() potentially reading past the end
of the string when 'c' points to the last character?

> +			in_perforation = 1;
> +			perforation += 3;
> +			scissors += 3;
> +			c++;

Here you should jump past the three byte long Unicode character, so
this should be c += 2.

> +			continue;
> +		}
>  		in_perforation = 0;
>  	}
>  

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mailinfo: support Unicode scissors
  2019-03-31 23:09 ` SZEDER Gábor
@ 2019-04-01  9:07   ` Junio C Hamano
  2019-04-01 10:11   ` Jeff King
  1 sibling, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2019-04-01  9:07 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Andrei Rybak, git

SZEDER Gábor <szeder.dev@gmail.com> writes:

> On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote:
>> diff --git a/mailinfo.c b/mailinfo.c
>> index b395adbdf2..4ef6cdee85 100644
>> --- a/mailinfo.c
>> +++ b/mailinfo.c
>> @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
>>  			c++;
>>  			continue;
>>  		}
>> +		if (!memcmp(c, "✂", 3)) {
>
> This character is tiny.  Please add a comment that it's supposed to be
> a Unicode scissors character.
>
> Should we worry about this memcmp() potentially reading past the end
> of the string when 'c' points to the last character?

Quite honestly, I'd rather document what "scissors" line looks like
exactly and make sure no readers would mistake that we'd accept any
Unicode character whose name has substring "scissors" in it.

Ah, wait, we already do.  It is very clear that scissors are either
">8" (for right handers) or "8<" (for lefties) and nothing else.

Unless you are sure that you are (and more importantly, can stay to
be) exhaustive, adding allowed representations for a thing will
force users to learn more non-essential things ("we allow only 8<
and >8" vs "we allow only these 7, even though we are aware that
there are at least 14 more that we do not allow"---the end-user
needs to remember which 7 are allowed) and does not help users.

Taking only "black scissors" U+2702 but not all of U+2700 - U+2704
will be a cause for unnecessary end-user complaints "why do you take
this but not that one?"  Then the next noise would be "why is '-'
the only perforation and not U+2014 Em Dash or U+2013 En Dash?"

Let's try not to be cute in non-essential things like how a pair of
scissors ought to be spelled.  If "8<" had worked well for us for
the past 10 years, we should just stick to it.





^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mailinfo: support Unicode scissors
  2019-03-31 22:01 [PATCH] mailinfo: support Unicode scissors Andrei Rybak
  2019-03-31 23:09 ` SZEDER Gábor
@ 2019-04-01  9:27 ` Duy Nguyen
  2019-04-01 21:54   ` Andrei Rybak
  2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak
  2 siblings, 1 reply; 16+ messages in thread
From: Duy Nguyen @ 2019-04-01  9:27 UTC (permalink / raw)
  To: Andrei Rybak; +Cc: Git Mailing List

On Mon, Apr 1, 2019 at 5:03 AM Andrei Rybak <rybak.a.v@gmail.com> wrote:
>
> 'git am --scissors' allows cutting a patch from an email at a scissors
> line.  Such a line should contain perforation, i.e. hyphens, and a
> scissors symbol.  Only ASCII graphics scissors '8<' '>8' '%<' '>%' are
> recognized by 'git am --scissors' command at the moment.
>
> Unicode character 'BLACK SCISSORS' (U+2702) has been a part of Unicode
> since version 1.0.0 [1].  Since then 'BLACK SCISSORS' also became part
> of character set Emoji 1.0, published in 2015 [2].  With its adoption as
> an emoji, availability of this character on keyboards has increased.
>
> Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am
> --scissors' to be able to cut at Unicode perforation lines in emails.
> Note, that Unicode character '✂' is three bytes in UTF-8 encoding.

On top of what was already said in this thread. For some reason (bad
font?) these scissors are drawn cutting _down_ for me instead of left
or right. It looks a bit strange.
-- 
Duy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mailinfo: support Unicode scissors
  2019-03-31 23:09 ` SZEDER Gábor
  2019-04-01  9:07   ` Junio C Hamano
@ 2019-04-01 10:11   ` Jeff King
  1 sibling, 0 replies; 16+ messages in thread
From: Jeff King @ 2019-04-01 10:11 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Andrei Rybak, git

On Mon, Apr 01, 2019 at 01:09:47AM +0200, SZEDER Gábor wrote:

> On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote:
> > diff --git a/mailinfo.c b/mailinfo.c
> > index b395adbdf2..4ef6cdee85 100644
> > --- a/mailinfo.c
> > +++ b/mailinfo.c
> > @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
> >  			c++;
> >  			continue;
> >  		}
> > +		if (!memcmp(c, "✂", 3)) {
> 
> This character is tiny.  Please add a comment that it's supposed to be
> a Unicode scissors character.

I think it might also be the first raw UTF-8 character in our source,
which is otherwise ASCII. Usually we'd spell out the binary (with a
comment).

I think I agree with Junio's response, tough, that this is probably not
a road we want to go down, unless this micro-format is being actively
used in the wild (I have no idea, but I have never seen it).

> Should we worry about this memcmp() potentially reading past the end
> of the string when 'c' points to the last character?

I also wondered if the existing memcmps for ">8", etc, would have this
problem. They don't, but it's somewhat subtle. They are only 2
characters long, and the outer loop guarantees we have at least 1
character. So at most we will look at the NUL. But obviously a 3-byte
sequence like this may invoke undefined behavior, and the existing
memcmps encourage anybody adding code to do it wrong.

I wonder if it's worth re-writing it like:

diff --git a/mailinfo.c b/mailinfo.c
index b395adbdf2..46b1b2a4a8 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -693,8 +693,8 @@ static int is_scissors_line(const char *line)
 			perforation++;
 			continue;
 		}
-		if ((!memcmp(c, ">8", 2) || !memcmp(c, "8<", 2) ||
-		     !memcmp(c, ">%", 2) || !memcmp(c, "%<", 2))) {
+		if ((starts_with(c, ">8") || starts_with(c, "8<") ||
+		     starts_with(c, ">%") || starts_with(c, "%<"))) {
 			in_perforation = 1;
 			perforation += 2;
 			scissors += 2;

-Peff

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 1/2] mailinfo: use starts_with() for clarity
  2019-03-31 22:01 [PATCH] mailinfo: support Unicode scissors Andrei Rybak
  2019-03-31 23:09 ` SZEDER Gábor
  2019-04-01  9:27 ` Duy Nguyen
@ 2019-04-01 21:53 ` Andrei Rybak
  2019-04-01 21:53   ` [PATCH v2 2/2] mailinfo: support Unicode scissors Andrei Rybak
                     ` (2 more replies)
  2 siblings, 3 replies; 16+ messages in thread
From: Andrei Rybak @ 2019-04-01 21:53 UTC (permalink / raw)
  To: git; +Cc: SZEDER Gábor, Junio C Hamano, Jeff King, Duy Nguyen

Existing checks using memcmp(3) never read past the end of the line,
because all substrings we are interested in are two characters long, and
the outer loop guarantees we have at least one character. So at most we
will look at the NUL.

However, this is too subtle and may lead to bugs in code which copies
this behavior without realizing substring length requirement.  So use
starts_with() instead, which will stop at NUL regardless of the length
of the prefix. Remove extra pair of parentheses while we are here.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com>
---

On Mon, Apr 01, 2019 at 06:11:57 -0400, Jeff King wrote:
> I wonder if it's worth re-writing it like:

Turned Peff's suggestion into a patch.

 mailinfo.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mailinfo.c b/mailinfo.c
index b395adbdf2..f4aaa89788 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -693,8 +693,8 @@ static int is_scissors_line(const char *line)
 			perforation++;
 			continue;
 		}
-		if ((!memcmp(c, ">8", 2) || !memcmp(c, "8<", 2) ||
-		     !memcmp(c, ">%", 2) || !memcmp(c, "%<", 2))) {
+		if (starts_with(c, ">8") || starts_with(c, "8<") ||
+		    starts_with(c, ">%") || starts_with(c, "%<")) {
 			in_perforation = 1;
 			perforation += 2;
 			scissors += 2;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 2/2] mailinfo: support Unicode scissors
  2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak
@ 2019-04-01 21:53   ` Andrei Rybak
  2019-04-02 14:36     ` Jeff King
  2019-04-02 14:28   ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Jeff King
  2021-06-08 20:48   ` [PATCH] mailinfo: use starts_with() when checking scissors Andrei Rybak
  2 siblings, 1 reply; 16+ messages in thread
From: Andrei Rybak @ 2019-04-01 21:53 UTC (permalink / raw)
  To: git; +Cc: SZEDER Gábor, Junio C Hamano, Jeff King, Duy Nguyen

Thank you all for review.  Below is the second version of original patch,
addressing comments by Gábor and Peff.

While preparing v2 I found out that U+2702 was already suggested on the list
eight months before cutting at perforation lines was implemented:
https://public-inbox.org/git/200901181656.37813.markus.heidelberg@web.de/T/#m3856d2e5c5f3e1900210b74bf2be8851b92d2271

---- >8 ----
Subject: [PATCH v2 2/2] mailinfo: support Unicode scissors
Date: Mon, 1 Apr 2019 00:00:00 +0000

'git am --scissors' allows cutting a patch from an email at a scissors
line.  Such a line should contain perforation, i.e. hyphens, and a
scissors symbol.  Only ASCII graphics scissors '8<' '>8' '%<' '>%' are
recognized by 'git am --scissors' command at the moment.

Unicode character 'BLACK SCISSORS' (U+2702) has been a part of Unicode
since version 1.0.0 [1].  Since then 'BLACK SCISSORS' also became part
of character set Emoji 1.0, published in 2015 [2].  With its adoption as
an emoji, availability of this character on keyboards has increased.

Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am
--scissors' to be able to cut at Unicode perforation lines in emails.
Note, that Unicode character '✂' is three bytes in UTF-8 encoding and is
spelled out using hexadecimal escape sequence.

1. https://www.unicode.org/versions/Unicode1.0.0/CodeCharts1.pdf
   https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt
2. https://unicode.org/Public/emoji/1.0/emoji-data.txt

Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com>
---
 mailinfo.c    |  7 +++++++
 t/t4150-am.sh | 26 +++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/mailinfo.c b/mailinfo.c
index f4aaa89788..804b07cd8a 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
 			c++;
 			continue;
 		}
+		if (starts_with(c, "\xE2\x9C\x82" /* U-2702 ✂ in UTF-8 */)) {
+			in_perforation = 1;
+			perforation += 3;
+			scissors += 3;
+			c += 2;
+			continue;
+		}
 		in_perforation = 0;
 	}
 
diff --git a/t/t4150-am.sh b/t/t4150-am.sh
index 3f7f750cc8..3ea8e8a2cf 100755
--- a/t/t4150-am.sh
+++ b/t/t4150-am.sh
@@ -77,12 +77,20 @@ test_expect_success 'setup: messages' '
 
 	printf "Subject: " >subject-prefix &&
 
-	cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF
+	cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF &&
 	This line should not be included in the commit message with --scissors enabled.
 
 	 - - >8 - - remove everything above this line - - >8 - -
 
 	EOF
+
+	cat - subject-prefix msg-without-scissors-line >msg-with-unicode-scissors <<-\EOF
+	Lines above unicode scissors line should not be included in the commit
+	message with --scissors enabled.
+
+	- - - ✂ - - - ✂ - - -
+
+	EOF
 '
 
 test_expect_success setup '
@@ -161,6 +169,12 @@ test_expect_success setup '
 	git format-patch --stdout expected-for-no-scissors^ >patch-with-scissors-line.eml &&
 	git reset --hard HEAD^ &&
 
+	echo file >file &&
+	git add file &&
+	git commit -F msg-with-unicode-scissors &&
+	git format-patch --stdout HEAD^ >patch-with-unicode-scissors.eml &&
+	git reset --hard HEAD^ &&
+
 	sed -n -e "3,\$p" msg >file &&
 	git add file &&
 	test_tick &&
@@ -421,6 +435,16 @@ test_expect_success 'am --scissors cuts the message at the scissors line' '
 	test_cmp_rev expected-for-scissors HEAD
 '
 
+test_expect_success 'am --scissors cuts the message at the unicode scissors line' '
+	rm -fr .git/rebase-apply &&
+	git reset --hard &&
+	git checkout second &&
+	git am --scissors patch-with-unicode-scissors.eml &&
+	test_path_is_missing .git/rebase-apply &&
+	git diff --exit-code expected-for-scissors &&
+	test_cmp_rev expected-for-scissors HEAD
+'
+
 test_expect_success 'am --no-scissors overrides mailinfo.scissors' '
 	rm -fr .git/rebase-apply &&
 	git reset --hard &&
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] mailinfo: support Unicode scissors
  2019-04-01  9:27 ` Duy Nguyen
@ 2019-04-01 21:54   ` Andrei Rybak
  0 siblings, 0 replies; 16+ messages in thread
From: Andrei Rybak @ 2019-04-01 21:54 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List

On Mon, 1 Apr 2019 16:27:02 +0700, Duy Nguyen wrote:
> On Mon, Apr 1, 2019 at 5:03 AM Andrei Rybak <rybak.a.v@gmail.com> wrote:
>> Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am
>> --scissors' to be able to cut at Unicode perforation lines in emails.
>> Note, that Unicode character '✂' is three bytes in UTF-8 encoding.
> 
> On top of what was already said in this thread. For some reason (bad
> font?) these scissors are drawn cutting _down_ for me instead of left
> or right. It looks a bit strange.

This might be an indication that the font used is rendering the
symbol as an emoji. Most scissors in emoji fonts are vertical:
https://emojipedia.org/black-scissors/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/2] mailinfo: use starts_with() for clarity
  2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak
  2019-04-01 21:53   ` [PATCH v2 2/2] mailinfo: support Unicode scissors Andrei Rybak
@ 2019-04-02 14:28   ` Jeff King
  2021-06-08 20:48   ` [PATCH] mailinfo: use starts_with() when checking scissors Andrei Rybak
  2 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2019-04-02 14:28 UTC (permalink / raw)
  To: Andrei Rybak; +Cc: git, SZEDER Gábor, Junio C Hamano, Duy Nguyen

On Mon, Apr 01, 2019 at 11:53:33PM +0200, Andrei Rybak wrote:

> Existing checks using memcmp(3) never read past the end of the line,
> because all substrings we are interested in are two characters long, and
> the outer loop guarantees we have at least one character. So at most we
> will look at the NUL.
> 
> However, this is too subtle and may lead to bugs in code which copies
> this behavior without realizing substring length requirement.  So use
> starts_with() instead, which will stop at NUL regardless of the length
> of the prefix. Remove extra pair of parentheses while we are here.
> 
> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com>
> ---
> 
> On Mon, Apr 01, 2019 at 06:11:57 -0400, Jeff King wrote:
> > I wonder if it's worth re-writing it like:
> 
> Turned Peff's suggestion into a patch.

Thanks. I think this may be worth doing regardless of what happens with
patch 2.

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/2] mailinfo: support Unicode scissors
  2019-04-01 21:53   ` [PATCH v2 2/2] mailinfo: support Unicode scissors Andrei Rybak
@ 2019-04-02 14:36     ` Jeff King
  2019-04-03  6:47       ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff King @ 2019-04-02 14:36 UTC (permalink / raw)
  To: Andrei Rybak; +Cc: git, SZEDER Gábor, Junio C Hamano, Duy Nguyen

On Mon, Apr 01, 2019 at 11:53:34PM +0200, Andrei Rybak wrote:

> diff --git a/mailinfo.c b/mailinfo.c
> index f4aaa89788..804b07cd8a 100644
> --- a/mailinfo.c
> +++ b/mailinfo.c
> @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
>  			c++;
>  			continue;
>  		}
> +		if (starts_with(c, "\xE2\x9C\x82" /* U-2702 ✂ in UTF-8 */)) {
> +			in_perforation = 1;
> +			perforation += 3;
> +			scissors += 3;
> +			c += 2;
> +			continue;
> +		}

It might be worth using skip_prefix() instead of starts_with() to
compute the size automatically. E.g.:

  if (skip_prefix(c, "\xE2\x9C\x82", &end)) {
	size_t len = end - c; /* no magic number needed! */
  }

In fact, I think you could then combine this with the previous
conditional and get:

  if (skip_prefix(c, ">8", &end) ||
      skip_prefix(c, "8<", &end) ||
      skip_prefix(c, ">%", &end) ||
      skip_prefix(c, "%<", &end) ||
      /* U-2702 in UTF-8 */
      skip_prefix(c, "\xE2\x9C\x82", &end)) {
          in_perforation = 1;
	  perforation += end - c;
	  scissors += end - c;
	  c = end - 1; /* minus one to account for loop increment */
  }

(Though I'm still on the fence regarding the whole idea, so do not take
this as an endorsement ;) ).

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/2] mailinfo: support Unicode scissors
  2019-04-02 14:36     ` Jeff King
@ 2019-04-03  6:47       ` Junio C Hamano
  0 siblings, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2019-04-03  6:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Andrei Rybak, git, SZEDER Gábor, Duy Nguyen

Jeff King <peff@peff.net> writes:

> In fact, I think you could then combine this with the previous
> conditional and get:
>
>   if (skip_prefix(c, ">8", &end) ||
>       skip_prefix(c, "8<", &end) ||
>       skip_prefix(c, ">%", &end) ||
>       skip_prefix(c, "%<", &end) ||
>       /* U-2702 in UTF-8 */
>       skip_prefix(c, "\xE2\x9C\x82", &end)) {
>           in_perforation = 1;
> 	  perforation += end - c;
> 	  scissors += end - c;
> 	  c = end - 1; /* minus one to account for loop increment */
>   }
>
> (Though I'm still on the fence regarding the whole idea, so do not take
> this as an endorsement ;) ).

I do not think we want to add more, but use of skip_prefix does
sound sensible.  I was very tempted to suggest

	static const char *scissors[] = {
		">8", "8<", ">%", "%<",
                NULL,
	};
        const char **s;

	for (s = scissors; *s; s++)
		if (skip_prefix, c, *s, &end) {
			in_perforation = 1;
			...
			break;
		}
	}
        if (!s)
		... we are not looking at any of the scissors[] ...

but that would encourage adding more random entries to the array,
which we would want to avoid in order to help reduce the cognirive
load of end-users.

In hindsight, addition of an undocumented '%' was already a mistake.
I wonder how widely it is in use (yes, I am tempted to deprecate and
remove these two to match the code to the docs).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH] mailinfo: use starts_with() when checking scissors
  2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak
  2019-04-01 21:53   ` [PATCH v2 2/2] mailinfo: support Unicode scissors Andrei Rybak
  2019-04-02 14:28   ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Jeff King
@ 2021-06-08 20:48   ` Andrei Rybak
  2021-06-08 21:57     ` Jeff King
  2 siblings, 1 reply; 16+ messages in thread
From: Andrei Rybak @ 2021-06-08 20:48 UTC (permalink / raw)
  To: git; +Cc: Jeff King

Existing checks for scissors characters using memcmp(3) never read past
the end of the line, because all substrings we are interested in are two
characters long, and the outer loop guarantees we have at least one
character.  So at most we will look at the NUL.

However, this is too subtle and may lead to bugs in code which copies
this behavior without realizing substring length requirement.  So use
starts_with() instead, which will stop at NUL regardless of the length
of the prefix.  Remove extra pair of parentheses while we are here.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com>
---
 mailinfo.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

This patch was originally part of:
https://public-inbox.org/git/20190401215334.18678-1-rybak.a.v@gmail.com/
I've finally gotten around to sending it on its own.

diff --git a/mailinfo.c b/mailinfo.c
index ccc6beb27e..8013e5c566 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -705,8 +705,8 @@ static int is_scissors_line(const char *line)
 			perforation++;
 			continue;
 		}
-		if ((!memcmp(c, ">8", 2) || !memcmp(c, "8<", 2) ||
-		     !memcmp(c, ">%", 2) || !memcmp(c, "%<", 2))) {
+		if (starts_with(c, ">8") || starts_with(c, "8<") ||
+		    starts_with(c, ">%") || starts_with(c, "%<")) {
 			in_perforation = 1;
 			perforation += 2;
 			scissors += 2;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] mailinfo: use starts_with() when checking scissors
  2021-06-08 20:48   ` [PATCH] mailinfo: use starts_with() when checking scissors Andrei Rybak
@ 2021-06-08 21:57     ` Jeff King
  2021-06-09  2:22       ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff King @ 2021-06-08 21:57 UTC (permalink / raw)
  To: Andrei Rybak; +Cc: git

On Tue, Jun 08, 2021 at 10:48:41PM +0200, Andrei Rybak wrote:

> Existing checks for scissors characters using memcmp(3) never read past
> the end of the line, because all substrings we are interested in are two
> characters long, and the outer loop guarantees we have at least one
> character.  So at most we will look at the NUL.
> 
> However, this is too subtle and may lead to bugs in code which copies
> this behavior without realizing substring length requirement.  So use
> starts_with() instead, which will stop at NUL regardless of the length
> of the prefix.  Remove extra pair of parentheses while we are here.
> 
> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com>
> ---
>  mailinfo.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> This patch was originally part of:
> https://public-inbox.org/git/20190401215334.18678-1-rybak.a.v@gmail.com/
> I've finally gotten around to sending it on its own.

Well, after seeing the date I don't feel too bad for not remembering it. :)

I think this is an improvement in safety and readability, and worth
applying. I'd also be fine going further and using skip_prefix(), which
would let us drop the magic-number "2", though:

  - as Junio said in that earlier thread, we hope people aren't
    encouraged to add more versions of scissors here anyway).

  - I am not 100% sure that the increment of "c" in the conditional
    is not also relying on that "2", as well (i.e., it is incrementing
    one, and then the for-loop increments one). There's a non-trivial
    risk of introducing off-by-one errors or even more subtlety here. :)

So I'm OK leaving that. Thanks for resurrecting this cleanup.

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mailinfo: use starts_with() when checking scissors
  2021-06-08 21:57     ` Jeff King
@ 2021-06-09  2:22       ` Junio C Hamano
  2021-06-09  3:59         ` Jeff King
  0 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2021-06-09  2:22 UTC (permalink / raw)
  To: Jeff King; +Cc: Andrei Rybak, git

Jeff King <peff@peff.net> writes:

> I think this is an improvement in safety and readability, and worth
> applying. I'd also be fine going further and using skip_prefix(), which
> would let us drop the magic-number "2", though:
>
>   - as Junio said in that earlier thread, we hope people aren't
>     encouraged to add more versions of scissors here anyway).
>
>   - I am not 100% sure that the increment of "c" in the conditional
>     is not also relying on that "2", as well (i.e., it is incrementing
>     one, and then the for-loop increments one). There's a non-trivial
>     risk of introducing off-by-one errors or even more subtlety here. :)
>
> So I'm OK leaving that. Thanks for resurrecting this cleanup.

You are correct to point out that the increment of "c" relies on 2.
We increment by one here, and let the loop control to increment
another, to skip over these two bytes.

The posted patch _might_ make it easier to read, but I do not think
it improves safety at all.  At the point of doing memcmp(), we know
that *c is not NUL and because we are dealing with NUL-terminated
string, we know we can check c[1] (otherwise we wouldn't even be
able to see if c is pointing at the end of string), so we check c[0]
and c[1] against four variants of two-byte scissors patterns.  The
current code uses memcmp() of 2 bytes, which is perfectly safe under
the condition, and starts_with() would also be equally safe.

If we were to teach new scissors sequence that is longer than two
bytes, then starts_with() would start becoming safer, but that will
not happen, so...



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mailinfo: use starts_with() when checking scissors
  2021-06-09  2:22       ` Junio C Hamano
@ 2021-06-09  3:59         ` Jeff King
  2021-06-09  5:11           ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff King @ 2021-06-09  3:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Andrei Rybak, git

On Wed, Jun 09, 2021 at 11:22:10AM +0900, Junio C Hamano wrote:

> The posted patch _might_ make it easier to read, but I do not think
> it improves safety at all.  At the point of doing memcmp(), we know
> that *c is not NUL and because we are dealing with NUL-terminated
> string, we know we can check c[1] (otherwise we wouldn't even be
> able to see if c is pointing at the end of string), so we check c[0]
> and c[1] against four variants of two-byte scissors patterns.  The
> current code uses memcmp() of 2 bytes, which is perfectly safe under
> the condition, and starts_with() would also be equally safe.
> 
> If we were to teach new scissors sequence that is longer than two
> bytes, then starts_with() would start becoming safer, but that will
> not happen, so...

Right, I agree the current code is safe. The main value would be trying
to make the correctness of the code more apparent. Perhaps a comment
would be better there, like:

diff --git a/mailinfo.c b/mailinfo.c
index ccc6beb27e..25b606db28 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -705,12 +705,17 @@ static int is_scissors_line(const char *line)
 			perforation++;
 			continue;
 		}
+		/*
+		 * passing 2 bytes to memcmp is safe here; we have at least
+		 * one non-NUL character from the loop condition, plus a
+		 * terminating NUL
+		 */
 		if ((!memcmp(c, ">8", 2) || !memcmp(c, "8<", 2) ||
 		     !memcmp(c, ">%", 2) || !memcmp(c, "%<", 2))) {
 			in_perforation = 1;
 			perforation += 2;
 			scissors += 2;
-			c++;
+			c++; /* only 1 here to account for loop increment */
 			continue;
 		}
 		in_perforation = 0;

Though since this is code we'd not plan to modify, and presumably
anybody touching it would have to fully grok the loop anyway, it might
not be that important.

I dunno. I offer it as an alternative (and am happy to add a commit
message). But I'm fine with leaving it as-is, too.

-Peff

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] mailinfo: use starts_with() when checking scissors
  2021-06-09  3:59         ` Jeff King
@ 2021-06-09  5:11           ` Junio C Hamano
  0 siblings, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2021-06-09  5:11 UTC (permalink / raw)
  To: Jeff King; +Cc: Andrei Rybak, git

Jeff King <peff@peff.net> writes:

> I dunno. I offer it as an alternative (and am happy to add a commit
> message). But I'm fine with leaving it as-is, too.

I am fine with Andrei's version as posted, which I've queued.

Thanks, both.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-06-09  5:11 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-31 22:01 [PATCH] mailinfo: support Unicode scissors Andrei Rybak
2019-03-31 23:09 ` SZEDER Gábor
2019-04-01  9:07   ` Junio C Hamano
2019-04-01 10:11   ` Jeff King
2019-04-01  9:27 ` Duy Nguyen
2019-04-01 21:54   ` Andrei Rybak
2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak
2019-04-01 21:53   ` [PATCH v2 2/2] mailinfo: support Unicode scissors Andrei Rybak
2019-04-02 14:36     ` Jeff King
2019-04-03  6:47       ` Junio C Hamano
2019-04-02 14:28   ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Jeff King
2021-06-08 20:48   ` [PATCH] mailinfo: use starts_with() when checking scissors Andrei Rybak
2021-06-08 21:57     ` Jeff King
2021-06-09  2:22       ` Junio C Hamano
2021-06-09  3:59         ` Jeff King
2021-06-09  5:11           ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).