git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH] mailinfo: support Unicode scissors
@ 2019-03-31 22:01 Andrei Rybak
  2019-03-31 23:09 ` SZEDER Gábor
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Andrei Rybak @ 2019-03-31 22:01 UTC (permalink / raw)
  To: git; +Cc: Andrei Rybak

'git am --scissors' allows cutting a patch from an email at a scissors
line.  Such a line should contain perforation, i.e. hyphens, and a
scissors symbol.  Only ASCII graphics scissors '8<' '>8' '%<' '>%' are
recognized by 'git am --scissors' command at the moment.

Unicode character 'BLACK SCISSORS' (U+2702) has been a part of Unicode
since version 1.0.0 [1].  Since then 'BLACK SCISSORS' also became part
of character set Emoji 1.0, published in 2015 [2].  With its adoption as
an emoji, availability of this character on keyboards has increased.

Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am
--scissors' to be able to cut at Unicode perforation lines in emails.
Note, that Unicode character '✂' is three bytes in UTF-8 encoding.

1. https://www.unicode.org/versions/Unicode1.0.0/CodeCharts1.pdf
   https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt
2. https://unicode.org/Public/emoji/1.0/emoji-data.txt

Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com>
---

This applies on top of ar/t4150-remove-cruft merged into next in 
commit a0106a8d5c, which also edited the test setup in t4150.

 mailinfo.c    |  7 +++++++
 t/t4150-am.sh | 26 +++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/mailinfo.c b/mailinfo.c
index b395adbdf2..4ef6cdee85 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
 			c++;
 			continue;
 		}
+		if (!memcmp(c, "✂", 3)) {
+			in_perforation = 1;
+			perforation += 3;
+			scissors += 3;
+			c++;
+			continue;
+		}
 		in_perforation = 0;
 	}
 
diff --git a/t/t4150-am.sh b/t/t4150-am.sh
index 3f7f750cc8..3ea8e8a2cf 100755
--- a/t/t4150-am.sh
+++ b/t/t4150-am.sh
@@ -77,12 +77,20 @@ test_expect_success 'setup: messages' '
 
 	printf "Subject: " >subject-prefix &&
 
-	cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF
+	cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF &&
 	This line should not be included in the commit message with --scissors enabled.
 
 	 - - >8 - - remove everything above this line - - >8 - -
 
 	EOF
+
+	cat - subject-prefix msg-without-scissors-line >msg-with-unicode-scissors <<-\EOF
+	Lines above unicode scissors line should not be included in the commit
+	message with --scissors enabled.
+
+	- - - ✂ - - - ✂ - - -
+
+	EOF
 '
 
 test_expect_success setup '
@@ -161,6 +169,12 @@ test_expect_success setup '
 	git format-patch --stdout expected-for-no-scissors^ >patch-with-scissors-line.eml &&
 	git reset --hard HEAD^ &&
 
+	echo file >file &&
+	git add file &&
+	git commit -F msg-with-unicode-scissors &&
+	git format-patch --stdout HEAD^ >patch-with-unicode-scissors.eml &&
+	git reset --hard HEAD^ &&
+
 	sed -n -e "3,\$p" msg >file &&
 	git add file &&
 	test_tick &&
@@ -421,6 +435,16 @@ test_expect_success 'am --scissors cuts the message at the scissors line' '
 	test_cmp_rev expected-for-scissors HEAD
 '
 
+test_expect_success 'am --scissors cuts the message at the unicode scissors line' '
+	rm -fr .git/rebase-apply &&
+	git reset --hard &&
+	git checkout second &&
+	git am --scissors patch-with-unicode-scissors.eml &&
+	test_path_is_missing .git/rebase-apply &&
+	git diff --exit-code expected-for-scissors &&
+	test_cmp_rev expected-for-scissors HEAD
+'
+
 test_expect_success 'am --no-scissors overrides mailinfo.scissors' '
 	rm -fr .git/rebase-apply &&
 	git reset --hard &&
-- 
2.21.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mailinfo: support Unicode scissors
  2019-03-31 22:01 [PATCH] mailinfo: support Unicode scissors Andrei Rybak
@ 2019-03-31 23:09 ` SZEDER Gábor
  2019-04-01  9:07   ` Junio C Hamano
  2019-04-01 10:11   ` Jeff King
  2019-04-01  9:27 ` Duy Nguyen
  2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak
  2 siblings, 2 replies; 11+ messages in thread
From: SZEDER Gábor @ 2019-03-31 23:09 UTC (permalink / raw)
  To: Andrei Rybak; +Cc: git

On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote:
> diff --git a/mailinfo.c b/mailinfo.c
> index b395adbdf2..4ef6cdee85 100644
> --- a/mailinfo.c
> +++ b/mailinfo.c
> @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
>  			c++;
>  			continue;
>  		}
> +		if (!memcmp(c, "✂", 3)) {

This character is tiny.  Please add a comment that it's supposed to be
a Unicode scissors character.

Should we worry about this memcmp() potentially reading past the end
of the string when 'c' points to the last character?

> +			in_perforation = 1;
> +			perforation += 3;
> +			scissors += 3;
> +			c++;

Here you should jump past the three byte long Unicode character, so
this should be c += 2.

> +			continue;
> +		}
>  		in_perforation = 0;
>  	}
>  

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mailinfo: support Unicode scissors
  2019-03-31 23:09 ` SZEDER Gábor
@ 2019-04-01  9:07   ` Junio C Hamano
  2019-04-01 10:11   ` Jeff King
  1 sibling, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2019-04-01  9:07 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Andrei Rybak, git

SZEDER Gábor <szeder.dev@gmail.com> writes:

> On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote:
>> diff --git a/mailinfo.c b/mailinfo.c
>> index b395adbdf2..4ef6cdee85 100644
>> --- a/mailinfo.c
>> +++ b/mailinfo.c
>> @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
>>  			c++;
>>  			continue;
>>  		}
>> +		if (!memcmp(c, "✂", 3)) {
>
> This character is tiny.  Please add a comment that it's supposed to be
> a Unicode scissors character.
>
> Should we worry about this memcmp() potentially reading past the end
> of the string when 'c' points to the last character?

Quite honestly, I'd rather document what "scissors" line looks like
exactly and make sure no readers would mistake that we'd accept any
Unicode character whose name has substring "scissors" in it.

Ah, wait, we already do.  It is very clear that scissors are either
">8" (for right handers) or "8<" (for lefties) and nothing else.

Unless you are sure that you are (and more importantly, can stay to
be) exhaustive, adding allowed representations for a thing will
force users to learn more non-essential things ("we allow only 8<
and >8" vs "we allow only these 7, even though we are aware that
there are at least 14 more that we do not allow"---the end-user
needs to remember which 7 are allowed) and does not help users.

Taking only "black scissors" U+2702 but not all of U+2700 - U+2704
will be a cause for unnecessary end-user complaints "why do you take
this but not that one?"  Then the next noise would be "why is '-'
the only perforation and not U+2014 Em Dash or U+2013 En Dash?"

Let's try not to be cute in non-essential things like how a pair of
scissors ought to be spelled.  If "8<" had worked well for us for
the past 10 years, we should just stick to it.





^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mailinfo: support Unicode scissors
  2019-03-31 22:01 [PATCH] mailinfo: support Unicode scissors Andrei Rybak
  2019-03-31 23:09 ` SZEDER Gábor
@ 2019-04-01  9:27 ` Duy Nguyen
  2019-04-01 21:54   ` Andrei Rybak
  2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak
  2 siblings, 1 reply; 11+ messages in thread
From: Duy Nguyen @ 2019-04-01  9:27 UTC (permalink / raw)
  To: Andrei Rybak; +Cc: Git Mailing List

On Mon, Apr 1, 2019 at 5:03 AM Andrei Rybak <rybak.a.v@gmail.com> wrote:
>
> 'git am --scissors' allows cutting a patch from an email at a scissors
> line.  Such a line should contain perforation, i.e. hyphens, and a
> scissors symbol.  Only ASCII graphics scissors '8<' '>8' '%<' '>%' are
> recognized by 'git am --scissors' command at the moment.
>
> Unicode character 'BLACK SCISSORS' (U+2702) has been a part of Unicode
> since version 1.0.0 [1].  Since then 'BLACK SCISSORS' also became part
> of character set Emoji 1.0, published in 2015 [2].  With its adoption as
> an emoji, availability of this character on keyboards has increased.
>
> Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am
> --scissors' to be able to cut at Unicode perforation lines in emails.
> Note, that Unicode character '✂' is three bytes in UTF-8 encoding.

On top of what was already said in this thread. For some reason (bad
font?) these scissors are drawn cutting _down_ for me instead of left
or right. It looks a bit strange.
-- 
Duy

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mailinfo: support Unicode scissors
  2019-03-31 23:09 ` SZEDER Gábor
  2019-04-01  9:07   ` Junio C Hamano
@ 2019-04-01 10:11   ` Jeff King
  1 sibling, 0 replies; 11+ messages in thread
From: Jeff King @ 2019-04-01 10:11 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Andrei Rybak, git

On Mon, Apr 01, 2019 at 01:09:47AM +0200, SZEDER Gábor wrote:

> On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote:
> > diff --git a/mailinfo.c b/mailinfo.c
> > index b395adbdf2..4ef6cdee85 100644
> > --- a/mailinfo.c
> > +++ b/mailinfo.c
> > @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
> >  			c++;
> >  			continue;
> >  		}
> > +		if (!memcmp(c, "✂", 3)) {
> 
> This character is tiny.  Please add a comment that it's supposed to be
> a Unicode scissors character.

I think it might also be the first raw UTF-8 character in our source,
which is otherwise ASCII. Usually we'd spell out the binary (with a
comment).

I think I agree with Junio's response, tough, that this is probably not
a road we want to go down, unless this micro-format is being actively
used in the wild (I have no idea, but I have never seen it).

> Should we worry about this memcmp() potentially reading past the end
> of the string when 'c' points to the last character?

I also wondered if the existing memcmps for ">8", etc, would have this
problem. They don't, but it's somewhat subtle. They are only 2
characters long, and the outer loop guarantees we have at least 1
character. So at most we will look at the NUL. But obviously a 3-byte
sequence like this may invoke undefined behavior, and the existing
memcmps encourage anybody adding code to do it wrong.

I wonder if it's worth re-writing it like:

diff --git a/mailinfo.c b/mailinfo.c
index b395adbdf2..46b1b2a4a8 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -693,8 +693,8 @@ static int is_scissors_line(const char *line)
 			perforation++;
 			continue;
 		}
-		if ((!memcmp(c, ">8", 2) || !memcmp(c, "8<", 2) ||
-		     !memcmp(c, ">%", 2) || !memcmp(c, "%<", 2))) {
+		if ((starts_with(c, ">8") || starts_with(c, "8<") ||
+		     starts_with(c, ">%") || starts_with(c, "%<"))) {
 			in_perforation = 1;
 			perforation += 2;
 			scissors += 2;

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/2] mailinfo: use starts_with() for clarity
  2019-03-31 22:01 [PATCH] mailinfo: support Unicode scissors Andrei Rybak
  2019-03-31 23:09 ` SZEDER Gábor
  2019-04-01  9:27 ` Duy Nguyen
@ 2019-04-01 21:53 ` Andrei Rybak
  2019-04-01 21:53   ` [PATCH v2 2/2] mailinfo: support Unicode scissors Andrei Rybak
  2019-04-02 14:28   ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Jeff King
  2 siblings, 2 replies; 11+ messages in thread
From: Andrei Rybak @ 2019-04-01 21:53 UTC (permalink / raw)
  To: git; +Cc: SZEDER Gábor, Junio C Hamano, Jeff King, Duy Nguyen

Existing checks using memcmp(3) never read past the end of the line,
because all substrings we are interested in are two characters long, and
the outer loop guarantees we have at least one character. So at most we
will look at the NUL.

However, this is too subtle and may lead to bugs in code which copies
this behavior without realizing substring length requirement.  So use
starts_with() instead, which will stop at NUL regardless of the length
of the prefix. Remove extra pair of parentheses while we are here.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com>
---

On Mon, Apr 01, 2019 at 06:11:57 -0400, Jeff King wrote:
> I wonder if it's worth re-writing it like:

Turned Peff's suggestion into a patch.

 mailinfo.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mailinfo.c b/mailinfo.c
index b395adbdf2..f4aaa89788 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -693,8 +693,8 @@ static int is_scissors_line(const char *line)
 			perforation++;
 			continue;
 		}
-		if ((!memcmp(c, ">8", 2) || !memcmp(c, "8<", 2) ||
-		     !memcmp(c, ">%", 2) || !memcmp(c, "%<", 2))) {
+		if (starts_with(c, ">8") || starts_with(c, "8<") ||
+		    starts_with(c, ">%") || starts_with(c, "%<")) {
 			in_perforation = 1;
 			perforation += 2;
 			scissors += 2;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 2/2] mailinfo: support Unicode scissors
  2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak
@ 2019-04-01 21:53   ` Andrei Rybak
  2019-04-02 14:36     ` Jeff King
  2019-04-02 14:28   ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Jeff King
  1 sibling, 1 reply; 11+ messages in thread
From: Andrei Rybak @ 2019-04-01 21:53 UTC (permalink / raw)
  To: git; +Cc: SZEDER Gábor, Junio C Hamano, Jeff King, Duy Nguyen

Thank you all for review.  Below is the second version of original patch,
addressing comments by Gábor and Peff.

While preparing v2 I found out that U+2702 was already suggested on the list
eight months before cutting at perforation lines was implemented:
https://public-inbox.org/git/200901181656.37813.markus.heidelberg@web.de/T/#m3856d2e5c5f3e1900210b74bf2be8851b92d2271

---- >8 ----
Subject: [PATCH v2 2/2] mailinfo: support Unicode scissors
Date: Mon, 1 Apr 2019 00:00:00 +0000

'git am --scissors' allows cutting a patch from an email at a scissors
line.  Such a line should contain perforation, i.e. hyphens, and a
scissors symbol.  Only ASCII graphics scissors '8<' '>8' '%<' '>%' are
recognized by 'git am --scissors' command at the moment.

Unicode character 'BLACK SCISSORS' (U+2702) has been a part of Unicode
since version 1.0.0 [1].  Since then 'BLACK SCISSORS' also became part
of character set Emoji 1.0, published in 2015 [2].  With its adoption as
an emoji, availability of this character on keyboards has increased.

Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am
--scissors' to be able to cut at Unicode perforation lines in emails.
Note, that Unicode character '✂' is three bytes in UTF-8 encoding and is
spelled out using hexadecimal escape sequence.

1. https://www.unicode.org/versions/Unicode1.0.0/CodeCharts1.pdf
   https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt
2. https://unicode.org/Public/emoji/1.0/emoji-data.txt

Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com>
---
 mailinfo.c    |  7 +++++++
 t/t4150-am.sh | 26 +++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/mailinfo.c b/mailinfo.c
index f4aaa89788..804b07cd8a 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
 			c++;
 			continue;
 		}
+		if (starts_with(c, "\xE2\x9C\x82" /* U-2702 ✂ in UTF-8 */)) {
+			in_perforation = 1;
+			perforation += 3;
+			scissors += 3;
+			c += 2;
+			continue;
+		}
 		in_perforation = 0;
 	}
 
diff --git a/t/t4150-am.sh b/t/t4150-am.sh
index 3f7f750cc8..3ea8e8a2cf 100755
--- a/t/t4150-am.sh
+++ b/t/t4150-am.sh
@@ -77,12 +77,20 @@ test_expect_success 'setup: messages' '
 
 	printf "Subject: " >subject-prefix &&
 
-	cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF
+	cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF &&
 	This line should not be included in the commit message with --scissors enabled.
 
 	 - - >8 - - remove everything above this line - - >8 - -
 
 	EOF
+
+	cat - subject-prefix msg-without-scissors-line >msg-with-unicode-scissors <<-\EOF
+	Lines above unicode scissors line should not be included in the commit
+	message with --scissors enabled.
+
+	- - - ✂ - - - ✂ - - -
+
+	EOF
 '
 
 test_expect_success setup '
@@ -161,6 +169,12 @@ test_expect_success setup '
 	git format-patch --stdout expected-for-no-scissors^ >patch-with-scissors-line.eml &&
 	git reset --hard HEAD^ &&
 
+	echo file >file &&
+	git add file &&
+	git commit -F msg-with-unicode-scissors &&
+	git format-patch --stdout HEAD^ >patch-with-unicode-scissors.eml &&
+	git reset --hard HEAD^ &&
+
 	sed -n -e "3,\$p" msg >file &&
 	git add file &&
 	test_tick &&
@@ -421,6 +435,16 @@ test_expect_success 'am --scissors cuts the message at the scissors line' '
 	test_cmp_rev expected-for-scissors HEAD
 '
 
+test_expect_success 'am --scissors cuts the message at the unicode scissors line' '
+	rm -fr .git/rebase-apply &&
+	git reset --hard &&
+	git checkout second &&
+	git am --scissors patch-with-unicode-scissors.eml &&
+	test_path_is_missing .git/rebase-apply &&
+	git diff --exit-code expected-for-scissors &&
+	test_cmp_rev expected-for-scissors HEAD
+'
+
 test_expect_success 'am --no-scissors overrides mailinfo.scissors' '
 	rm -fr .git/rebase-apply &&
 	git reset --hard &&
-- 
2.21.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mailinfo: support Unicode scissors
  2019-04-01  9:27 ` Duy Nguyen
@ 2019-04-01 21:54   ` Andrei Rybak
  0 siblings, 0 replies; 11+ messages in thread
From: Andrei Rybak @ 2019-04-01 21:54 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List

On Mon, 1 Apr 2019 16:27:02 +0700, Duy Nguyen wrote:
> On Mon, Apr 1, 2019 at 5:03 AM Andrei Rybak <rybak.a.v@gmail.com> wrote:
>> Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am
>> --scissors' to be able to cut at Unicode perforation lines in emails.
>> Note, that Unicode character '✂' is three bytes in UTF-8 encoding.
> 
> On top of what was already said in this thread. For some reason (bad
> font?) these scissors are drawn cutting _down_ for me instead of left
> or right. It looks a bit strange.

This might be an indication that the font used is rendering the
symbol as an emoji. Most scissors in emoji fonts are vertical:
https://emojipedia.org/black-scissors/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] mailinfo: use starts_with() for clarity
  2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak
  2019-04-01 21:53   ` [PATCH v2 2/2] mailinfo: support Unicode scissors Andrei Rybak
@ 2019-04-02 14:28   ` Jeff King
  1 sibling, 0 replies; 11+ messages in thread
From: Jeff King @ 2019-04-02 14:28 UTC (permalink / raw)
  To: Andrei Rybak; +Cc: git, SZEDER Gábor, Junio C Hamano, Duy Nguyen

On Mon, Apr 01, 2019 at 11:53:33PM +0200, Andrei Rybak wrote:

> Existing checks using memcmp(3) never read past the end of the line,
> because all substrings we are interested in are two characters long, and
> the outer loop guarantees we have at least one character. So at most we
> will look at the NUL.
> 
> However, this is too subtle and may lead to bugs in code which copies
> this behavior without realizing substring length requirement.  So use
> starts_with() instead, which will stop at NUL regardless of the length
> of the prefix. Remove extra pair of parentheses while we are here.
> 
> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com>
> ---
> 
> On Mon, Apr 01, 2019 at 06:11:57 -0400, Jeff King wrote:
> > I wonder if it's worth re-writing it like:
> 
> Turned Peff's suggestion into a patch.

Thanks. I think this may be worth doing regardless of what happens with
patch 2.

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] mailinfo: support Unicode scissors
  2019-04-01 21:53   ` [PATCH v2 2/2] mailinfo: support Unicode scissors Andrei Rybak
@ 2019-04-02 14:36     ` Jeff King
  2019-04-03  6:47       ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2019-04-02 14:36 UTC (permalink / raw)
  To: Andrei Rybak; +Cc: git, SZEDER Gábor, Junio C Hamano, Duy Nguyen

On Mon, Apr 01, 2019 at 11:53:34PM +0200, Andrei Rybak wrote:

> diff --git a/mailinfo.c b/mailinfo.c
> index f4aaa89788..804b07cd8a 100644
> --- a/mailinfo.c
> +++ b/mailinfo.c
> @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
>  			c++;
>  			continue;
>  		}
> +		if (starts_with(c, "\xE2\x9C\x82" /* U-2702 ✂ in UTF-8 */)) {
> +			in_perforation = 1;
> +			perforation += 3;
> +			scissors += 3;
> +			c += 2;
> +			continue;
> +		}

It might be worth using skip_prefix() instead of starts_with() to
compute the size automatically. E.g.:

  if (skip_prefix(c, "\xE2\x9C\x82", &end)) {
	size_t len = end - c; /* no magic number needed! */
  }

In fact, I think you could then combine this with the previous
conditional and get:

  if (skip_prefix(c, ">8", &end) ||
      skip_prefix(c, "8<", &end) ||
      skip_prefix(c, ">%", &end) ||
      skip_prefix(c, "%<", &end) ||
      /* U-2702 in UTF-8 */
      skip_prefix(c, "\xE2\x9C\x82", &end)) {
          in_perforation = 1;
	  perforation += end - c;
	  scissors += end - c;
	  c = end - 1; /* minus one to account for loop increment */
  }

(Though I'm still on the fence regarding the whole idea, so do not take
this as an endorsement ;) ).

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] mailinfo: support Unicode scissors
  2019-04-02 14:36     ` Jeff King
@ 2019-04-03  6:47       ` Junio C Hamano
  0 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2019-04-03  6:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Andrei Rybak, git, SZEDER Gábor, Duy Nguyen

Jeff King <peff@peff.net> writes:

> In fact, I think you could then combine this with the previous
> conditional and get:
>
>   if (skip_prefix(c, ">8", &end) ||
>       skip_prefix(c, "8<", &end) ||
>       skip_prefix(c, ">%", &end) ||
>       skip_prefix(c, "%<", &end) ||
>       /* U-2702 in UTF-8 */
>       skip_prefix(c, "\xE2\x9C\x82", &end)) {
>           in_perforation = 1;
> 	  perforation += end - c;
> 	  scissors += end - c;
> 	  c = end - 1; /* minus one to account for loop increment */
>   }
>
> (Though I'm still on the fence regarding the whole idea, so do not take
> this as an endorsement ;) ).

I do not think we want to add more, but use of skip_prefix does
sound sensible.  I was very tempted to suggest

	static const char *scissors[] = {
		">8", "8<", ">%", "%<",
                NULL,
	};
        const char **s;

	for (s = scissors; *s; s++)
		if (skip_prefix, c, *s, &end) {
			in_perforation = 1;
			...
			break;
		}
	}
        if (!s)
		... we are not looking at any of the scissors[] ...

but that would encourage adding more random entries to the array,
which we would want to avoid in order to help reduce the cognirive
load of end-users.

In hindsight, addition of an undocumented '%' was already a mistake.
I wonder how widely it is in use (yes, I am tempted to deprecate and
remove these two to match the code to the docs).

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-04-03  6:47 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-31 22:01 [PATCH] mailinfo: support Unicode scissors Andrei Rybak
2019-03-31 23:09 ` SZEDER Gábor
2019-04-01  9:07   ` Junio C Hamano
2019-04-01 10:11   ` Jeff King
2019-04-01  9:27 ` Duy Nguyen
2019-04-01 21:54   ` Andrei Rybak
2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak
2019-04-01 21:53   ` [PATCH v2 2/2] mailinfo: support Unicode scissors Andrei Rybak
2019-04-02 14:36     ` Jeff King
2019-04-03  6:47       ` Junio C Hamano
2019-04-02 14:28   ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Jeff King

git@vger.kernel.org list mirror (unofficial, one of many)

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 git git/ https://public-inbox.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git