* [PATCH] mailinfo: support Unicode scissors @ 2019-03-31 22:01 Andrei Rybak 2019-03-31 23:09 ` SZEDER Gábor ` (2 more replies) 0 siblings, 3 replies; 16+ messages in thread From: Andrei Rybak @ 2019-03-31 22:01 UTC (permalink / raw) To: git; +Cc: Andrei Rybak 'git am --scissors' allows cutting a patch from an email at a scissors line. Such a line should contain perforation, i.e. hyphens, and a scissors symbol. Only ASCII graphics scissors '8<' '>8' '%<' '>%' are recognized by 'git am --scissors' command at the moment. Unicode character 'BLACK SCISSORS' (U+2702) has been a part of Unicode since version 1.0.0 [1]. Since then 'BLACK SCISSORS' also became part of character set Emoji 1.0, published in 2015 [2]. With its adoption as an emoji, availability of this character on keyboards has increased. Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am --scissors' to be able to cut at Unicode perforation lines in emails. Note, that Unicode character '✂' is three bytes in UTF-8 encoding. 1. https://www.unicode.org/versions/Unicode1.0.0/CodeCharts1.pdf https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt 2. https://unicode.org/Public/emoji/1.0/emoji-data.txt Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com> --- This applies on top of ar/t4150-remove-cruft merged into next in commit a0106a8d5c, which also edited the test setup in t4150. mailinfo.c | 7 +++++++ t/t4150-am.sh | 26 +++++++++++++++++++++++++- 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/mailinfo.c b/mailinfo.c index b395adbdf2..4ef6cdee85 100644 --- a/mailinfo.c +++ b/mailinfo.c @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line) c++; continue; } + if (!memcmp(c, "✂", 3)) { + in_perforation = 1; + perforation += 3; + scissors += 3; + c++; + continue; + } in_perforation = 0; } diff --git a/t/t4150-am.sh b/t/t4150-am.sh index 3f7f750cc8..3ea8e8a2cf 100755 --- a/t/t4150-am.sh +++ b/t/t4150-am.sh @@ -77,12 +77,20 @@ test_expect_success 'setup: messages' ' printf "Subject: " >subject-prefix && - cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF + cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF && This line should not be included in the commit message with --scissors enabled. - - >8 - - remove everything above this line - - >8 - - EOF + + cat - subject-prefix msg-without-scissors-line >msg-with-unicode-scissors <<-\EOF + Lines above unicode scissors line should not be included in the commit + message with --scissors enabled. + + - - - ✂ - - - ✂ - - - + + EOF ' test_expect_success setup ' @@ -161,6 +169,12 @@ test_expect_success setup ' git format-patch --stdout expected-for-no-scissors^ >patch-with-scissors-line.eml && git reset --hard HEAD^ && + echo file >file && + git add file && + git commit -F msg-with-unicode-scissors && + git format-patch --stdout HEAD^ >patch-with-unicode-scissors.eml && + git reset --hard HEAD^ && + sed -n -e "3,\$p" msg >file && git add file && test_tick && @@ -421,6 +435,16 @@ test_expect_success 'am --scissors cuts the message at the scissors line' ' test_cmp_rev expected-for-scissors HEAD ' +test_expect_success 'am --scissors cuts the message at the unicode scissors line' ' + rm -fr .git/rebase-apply && + git reset --hard && + git checkout second && + git am --scissors patch-with-unicode-scissors.eml && + test_path_is_missing .git/rebase-apply && + git diff --exit-code expected-for-scissors && + test_cmp_rev expected-for-scissors HEAD +' + test_expect_success 'am --no-scissors overrides mailinfo.scissors' ' rm -fr .git/rebase-apply && git reset --hard && -- 2.21.0 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] mailinfo: support Unicode scissors 2019-03-31 22:01 [PATCH] mailinfo: support Unicode scissors Andrei Rybak @ 2019-03-31 23:09 ` SZEDER Gábor 2019-04-01 9:07 ` Junio C Hamano 2019-04-01 10:11 ` Jeff King 2019-04-01 9:27 ` Duy Nguyen 2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak 2 siblings, 2 replies; 16+ messages in thread From: SZEDER Gábor @ 2019-03-31 23:09 UTC (permalink / raw) To: Andrei Rybak; +Cc: git On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote: > diff --git a/mailinfo.c b/mailinfo.c > index b395adbdf2..4ef6cdee85 100644 > --- a/mailinfo.c > +++ b/mailinfo.c > @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line) > c++; > continue; > } > + if (!memcmp(c, "✂", 3)) { This character is tiny. Please add a comment that it's supposed to be a Unicode scissors character. Should we worry about this memcmp() potentially reading past the end of the string when 'c' points to the last character? > + in_perforation = 1; > + perforation += 3; > + scissors += 3; > + c++; Here you should jump past the three byte long Unicode character, so this should be c += 2. > + continue; > + } > in_perforation = 0; > } > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mailinfo: support Unicode scissors 2019-03-31 23:09 ` SZEDER Gábor @ 2019-04-01 9:07 ` Junio C Hamano 2019-04-01 10:11 ` Jeff King 1 sibling, 0 replies; 16+ messages in thread From: Junio C Hamano @ 2019-04-01 9:07 UTC (permalink / raw) To: SZEDER Gábor; +Cc: Andrei Rybak, git SZEDER Gábor <szeder.dev@gmail.com> writes: > On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote: >> diff --git a/mailinfo.c b/mailinfo.c >> index b395adbdf2..4ef6cdee85 100644 >> --- a/mailinfo.c >> +++ b/mailinfo.c >> @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line) >> c++; >> continue; >> } >> + if (!memcmp(c, "✂", 3)) { > > This character is tiny. Please add a comment that it's supposed to be > a Unicode scissors character. > > Should we worry about this memcmp() potentially reading past the end > of the string when 'c' points to the last character? Quite honestly, I'd rather document what "scissors" line looks like exactly and make sure no readers would mistake that we'd accept any Unicode character whose name has substring "scissors" in it. Ah, wait, we already do. It is very clear that scissors are either ">8" (for right handers) or "8<" (for lefties) and nothing else. Unless you are sure that you are (and more importantly, can stay to be) exhaustive, adding allowed representations for a thing will force users to learn more non-essential things ("we allow only 8< and >8" vs "we allow only these 7, even though we are aware that there are at least 14 more that we do not allow"---the end-user needs to remember which 7 are allowed) and does not help users. Taking only "black scissors" U+2702 but not all of U+2700 - U+2704 will be a cause for unnecessary end-user complaints "why do you take this but not that one?" Then the next noise would be "why is '-' the only perforation and not U+2014 Em Dash or U+2013 En Dash?" Let's try not to be cute in non-essential things like how a pair of scissors ought to be spelled. If "8<" had worked well for us for the past 10 years, we should just stick to it. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mailinfo: support Unicode scissors 2019-03-31 23:09 ` SZEDER Gábor 2019-04-01 9:07 ` Junio C Hamano @ 2019-04-01 10:11 ` Jeff King 1 sibling, 0 replies; 16+ messages in thread From: Jeff King @ 2019-04-01 10:11 UTC (permalink / raw) To: SZEDER Gábor; +Cc: Andrei Rybak, git On Mon, Apr 01, 2019 at 01:09:47AM +0200, SZEDER Gábor wrote: > On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote: > > diff --git a/mailinfo.c b/mailinfo.c > > index b395adbdf2..4ef6cdee85 100644 > > --- a/mailinfo.c > > +++ b/mailinfo.c > > @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line) > > c++; > > continue; > > } > > + if (!memcmp(c, "✂", 3)) { > > This character is tiny. Please add a comment that it's supposed to be > a Unicode scissors character. I think it might also be the first raw UTF-8 character in our source, which is otherwise ASCII. Usually we'd spell out the binary (with a comment). I think I agree with Junio's response, tough, that this is probably not a road we want to go down, unless this micro-format is being actively used in the wild (I have no idea, but I have never seen it). > Should we worry about this memcmp() potentially reading past the end > of the string when 'c' points to the last character? I also wondered if the existing memcmps for ">8", etc, would have this problem. They don't, but it's somewhat subtle. They are only 2 characters long, and the outer loop guarantees we have at least 1 character. So at most we will look at the NUL. But obviously a 3-byte sequence like this may invoke undefined behavior, and the existing memcmps encourage anybody adding code to do it wrong. I wonder if it's worth re-writing it like: diff --git a/mailinfo.c b/mailinfo.c index b395adbdf2..46b1b2a4a8 100644 --- a/mailinfo.c +++ b/mailinfo.c @@ -693,8 +693,8 @@ static int is_scissors_line(const char *line) perforation++; continue; } - if ((!memcmp(c, ">8", 2) || !memcmp(c, "8<", 2) || - !memcmp(c, ">%", 2) || !memcmp(c, "%<", 2))) { + if ((starts_with(c, ">8") || starts_with(c, "8<") || + starts_with(c, ">%") || starts_with(c, "%<"))) { in_perforation = 1; perforation += 2; scissors += 2; -Peff ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] mailinfo: support Unicode scissors 2019-03-31 22:01 [PATCH] mailinfo: support Unicode scissors Andrei Rybak 2019-03-31 23:09 ` SZEDER Gábor @ 2019-04-01 9:27 ` Duy Nguyen 2019-04-01 21:54 ` Andrei Rybak 2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak 2 siblings, 1 reply; 16+ messages in thread From: Duy Nguyen @ 2019-04-01 9:27 UTC (permalink / raw) To: Andrei Rybak; +Cc: Git Mailing List On Mon, Apr 1, 2019 at 5:03 AM Andrei Rybak <rybak.a.v@gmail.com> wrote: > > 'git am --scissors' allows cutting a patch from an email at a scissors > line. Such a line should contain perforation, i.e. hyphens, and a > scissors symbol. Only ASCII graphics scissors '8<' '>8' '%<' '>%' are > recognized by 'git am --scissors' command at the moment. > > Unicode character 'BLACK SCISSORS' (U+2702) has been a part of Unicode > since version 1.0.0 [1]. Since then 'BLACK SCISSORS' also became part > of character set Emoji 1.0, published in 2015 [2]. With its adoption as > an emoji, availability of this character on keyboards has increased. > > Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am > --scissors' to be able to cut at Unicode perforation lines in emails. > Note, that Unicode character '✂' is three bytes in UTF-8 encoding. On top of what was already said in this thread. For some reason (bad font?) these scissors are drawn cutting _down_ for me instead of left or right. It looks a bit strange. -- Duy ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mailinfo: support Unicode scissors 2019-04-01 9:27 ` Duy Nguyen @ 2019-04-01 21:54 ` Andrei Rybak 0 siblings, 0 replies; 16+ messages in thread From: Andrei Rybak @ 2019-04-01 21:54 UTC (permalink / raw) To: Duy Nguyen; +Cc: Git Mailing List On Mon, 1 Apr 2019 16:27:02 +0700, Duy Nguyen wrote: > On Mon, Apr 1, 2019 at 5:03 AM Andrei Rybak <rybak.a.v@gmail.com> wrote: >> Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am >> --scissors' to be able to cut at Unicode perforation lines in emails. >> Note, that Unicode character '✂' is three bytes in UTF-8 encoding. > > On top of what was already said in this thread. For some reason (bad > font?) these scissors are drawn cutting _down_ for me instead of left > or right. It looks a bit strange. This might be an indication that the font used is rendering the symbol as an emoji. Most scissors in emoji fonts are vertical: https://emojipedia.org/black-scissors/ ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 1/2] mailinfo: use starts_with() for clarity 2019-03-31 22:01 [PATCH] mailinfo: support Unicode scissors Andrei Rybak 2019-03-31 23:09 ` SZEDER Gábor 2019-04-01 9:27 ` Duy Nguyen @ 2019-04-01 21:53 ` Andrei Rybak 2019-04-01 21:53 ` [PATCH v2 2/2] mailinfo: support Unicode scissors Andrei Rybak ` (2 more replies) 2 siblings, 3 replies; 16+ messages in thread From: Andrei Rybak @ 2019-04-01 21:53 UTC (permalink / raw) To: git; +Cc: SZEDER Gábor, Junio C Hamano, Jeff King, Duy Nguyen Existing checks using memcmp(3) never read past the end of the line, because all substrings we are interested in are two characters long, and the outer loop guarantees we have at least one character. So at most we will look at the NUL. However, this is too subtle and may lead to bugs in code which copies this behavior without realizing substring length requirement. So use starts_with() instead, which will stop at NUL regardless of the length of the prefix. Remove extra pair of parentheses while we are here. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com> --- On Mon, Apr 01, 2019 at 06:11:57 -0400, Jeff King wrote: > I wonder if it's worth re-writing it like: Turned Peff's suggestion into a patch. mailinfo.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mailinfo.c b/mailinfo.c index b395adbdf2..f4aaa89788 100644 --- a/mailinfo.c +++ b/mailinfo.c @@ -693,8 +693,8 @@ static int is_scissors_line(const char *line) perforation++; continue; } - if ((!memcmp(c, ">8", 2) || !memcmp(c, "8<", 2) || - !memcmp(c, ">%", 2) || !memcmp(c, "%<", 2))) { + if (starts_with(c, ">8") || starts_with(c, "8<") || + starts_with(c, ">%") || starts_with(c, "%<")) { in_perforation = 1; perforation += 2; scissors += 2; -- 2.21.0 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v2 2/2] mailinfo: support Unicode scissors 2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak @ 2019-04-01 21:53 ` Andrei Rybak 2019-04-02 14:36 ` Jeff King 2019-04-02 14:28 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Jeff King 2021-06-08 20:48 ` [PATCH] mailinfo: use starts_with() when checking scissors Andrei Rybak 2 siblings, 1 reply; 16+ messages in thread From: Andrei Rybak @ 2019-04-01 21:53 UTC (permalink / raw) To: git; +Cc: SZEDER Gábor, Junio C Hamano, Jeff King, Duy Nguyen Thank you all for review. Below is the second version of original patch, addressing comments by Gábor and Peff. While preparing v2 I found out that U+2702 was already suggested on the list eight months before cutting at perforation lines was implemented: https://public-inbox.org/git/200901181656.37813.markus.heidelberg@web.de/T/#m3856d2e5c5f3e1900210b74bf2be8851b92d2271 ---- >8 ---- Subject: [PATCH v2 2/2] mailinfo: support Unicode scissors Date: Mon, 1 Apr 2019 00:00:00 +0000 'git am --scissors' allows cutting a patch from an email at a scissors line. Such a line should contain perforation, i.e. hyphens, and a scissors symbol. Only ASCII graphics scissors '8<' '>8' '%<' '>%' are recognized by 'git am --scissors' command at the moment. Unicode character 'BLACK SCISSORS' (U+2702) has been a part of Unicode since version 1.0.0 [1]. Since then 'BLACK SCISSORS' also became part of character set Emoji 1.0, published in 2015 [2]. With its adoption as an emoji, availability of this character on keyboards has increased. Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am --scissors' to be able to cut at Unicode perforation lines in emails. Note, that Unicode character '✂' is three bytes in UTF-8 encoding and is spelled out using hexadecimal escape sequence. 1. https://www.unicode.org/versions/Unicode1.0.0/CodeCharts1.pdf https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt 2. https://unicode.org/Public/emoji/1.0/emoji-data.txt Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com> --- mailinfo.c | 7 +++++++ t/t4150-am.sh | 26 +++++++++++++++++++++++++- 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/mailinfo.c b/mailinfo.c index f4aaa89788..804b07cd8a 100644 --- a/mailinfo.c +++ b/mailinfo.c @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line) c++; continue; } + if (starts_with(c, "\xE2\x9C\x82" /* U-2702 ✂ in UTF-8 */)) { + in_perforation = 1; + perforation += 3; + scissors += 3; + c += 2; + continue; + } in_perforation = 0; } diff --git a/t/t4150-am.sh b/t/t4150-am.sh index 3f7f750cc8..3ea8e8a2cf 100755 --- a/t/t4150-am.sh +++ b/t/t4150-am.sh @@ -77,12 +77,20 @@ test_expect_success 'setup: messages' ' printf "Subject: " >subject-prefix && - cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF + cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF && This line should not be included in the commit message with --scissors enabled. - - >8 - - remove everything above this line - - >8 - - EOF + + cat - subject-prefix msg-without-scissors-line >msg-with-unicode-scissors <<-\EOF + Lines above unicode scissors line should not be included in the commit + message with --scissors enabled. + + - - - ✂ - - - ✂ - - - + + EOF ' test_expect_success setup ' @@ -161,6 +169,12 @@ test_expect_success setup ' git format-patch --stdout expected-for-no-scissors^ >patch-with-scissors-line.eml && git reset --hard HEAD^ && + echo file >file && + git add file && + git commit -F msg-with-unicode-scissors && + git format-patch --stdout HEAD^ >patch-with-unicode-scissors.eml && + git reset --hard HEAD^ && + sed -n -e "3,\$p" msg >file && git add file && test_tick && @@ -421,6 +435,16 @@ test_expect_success 'am --scissors cuts the message at the scissors line' ' test_cmp_rev expected-for-scissors HEAD ' +test_expect_success 'am --scissors cuts the message at the unicode scissors line' ' + rm -fr .git/rebase-apply && + git reset --hard && + git checkout second && + git am --scissors patch-with-unicode-scissors.eml && + test_path_is_missing .git/rebase-apply && + git diff --exit-code expected-for-scissors && + test_cmp_rev expected-for-scissors HEAD +' + test_expect_success 'am --no-scissors overrides mailinfo.scissors' ' rm -fr .git/rebase-apply && git reset --hard && -- 2.21.0 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] mailinfo: support Unicode scissors 2019-04-01 21:53 ` [PATCH v2 2/2] mailinfo: support Unicode scissors Andrei Rybak @ 2019-04-02 14:36 ` Jeff King 2019-04-03 6:47 ` Junio C Hamano 0 siblings, 1 reply; 16+ messages in thread From: Jeff King @ 2019-04-02 14:36 UTC (permalink / raw) To: Andrei Rybak; +Cc: git, SZEDER Gábor, Junio C Hamano, Duy Nguyen On Mon, Apr 01, 2019 at 11:53:34PM +0200, Andrei Rybak wrote: > diff --git a/mailinfo.c b/mailinfo.c > index f4aaa89788..804b07cd8a 100644 > --- a/mailinfo.c > +++ b/mailinfo.c > @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line) > c++; > continue; > } > + if (starts_with(c, "\xE2\x9C\x82" /* U-2702 ✂ in UTF-8 */)) { > + in_perforation = 1; > + perforation += 3; > + scissors += 3; > + c += 2; > + continue; > + } It might be worth using skip_prefix() instead of starts_with() to compute the size automatically. E.g.: if (skip_prefix(c, "\xE2\x9C\x82", &end)) { size_t len = end - c; /* no magic number needed! */ } In fact, I think you could then combine this with the previous conditional and get: if (skip_prefix(c, ">8", &end) || skip_prefix(c, "8<", &end) || skip_prefix(c, ">%", &end) || skip_prefix(c, "%<", &end) || /* U-2702 in UTF-8 */ skip_prefix(c, "\xE2\x9C\x82", &end)) { in_perforation = 1; perforation += end - c; scissors += end - c; c = end - 1; /* minus one to account for loop increment */ } (Though I'm still on the fence regarding the whole idea, so do not take this as an endorsement ;) ). -Peff ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/2] mailinfo: support Unicode scissors 2019-04-02 14:36 ` Jeff King @ 2019-04-03 6:47 ` Junio C Hamano 0 siblings, 0 replies; 16+ messages in thread From: Junio C Hamano @ 2019-04-03 6:47 UTC (permalink / raw) To: Jeff King; +Cc: Andrei Rybak, git, SZEDER Gábor, Duy Nguyen Jeff King <peff@peff.net> writes: > In fact, I think you could then combine this with the previous > conditional and get: > > if (skip_prefix(c, ">8", &end) || > skip_prefix(c, "8<", &end) || > skip_prefix(c, ">%", &end) || > skip_prefix(c, "%<", &end) || > /* U-2702 in UTF-8 */ > skip_prefix(c, "\xE2\x9C\x82", &end)) { > in_perforation = 1; > perforation += end - c; > scissors += end - c; > c = end - 1; /* minus one to account for loop increment */ > } > > (Though I'm still on the fence regarding the whole idea, so do not take > this as an endorsement ;) ). I do not think we want to add more, but use of skip_prefix does sound sensible. I was very tempted to suggest static const char *scissors[] = { ">8", "8<", ">%", "%<", NULL, }; const char **s; for (s = scissors; *s; s++) if (skip_prefix, c, *s, &end) { in_perforation = 1; ... break; } } if (!s) ... we are not looking at any of the scissors[] ... but that would encourage adding more random entries to the array, which we would want to avoid in order to help reduce the cognirive load of end-users. In hindsight, addition of an undocumented '%' was already a mistake. I wonder how widely it is in use (yes, I am tempted to deprecate and remove these two to match the code to the docs). ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] mailinfo: use starts_with() for clarity 2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak 2019-04-01 21:53 ` [PATCH v2 2/2] mailinfo: support Unicode scissors Andrei Rybak @ 2019-04-02 14:28 ` Jeff King 2021-06-08 20:48 ` [PATCH] mailinfo: use starts_with() when checking scissors Andrei Rybak 2 siblings, 0 replies; 16+ messages in thread From: Jeff King @ 2019-04-02 14:28 UTC (permalink / raw) To: Andrei Rybak; +Cc: git, SZEDER Gábor, Junio C Hamano, Duy Nguyen On Mon, Apr 01, 2019 at 11:53:33PM +0200, Andrei Rybak wrote: > Existing checks using memcmp(3) never read past the end of the line, > because all substrings we are interested in are two characters long, and > the outer loop guarantees we have at least one character. So at most we > will look at the NUL. > > However, this is too subtle and may lead to bugs in code which copies > this behavior without realizing substring length requirement. So use > starts_with() instead, which will stop at NUL regardless of the length > of the prefix. Remove extra pair of parentheses while we are here. > > Helped-by: Jeff King <peff@peff.net> > Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com> > --- > > On Mon, Apr 01, 2019 at 06:11:57 -0400, Jeff King wrote: > > I wonder if it's worth re-writing it like: > > Turned Peff's suggestion into a patch. Thanks. I think this may be worth doing regardless of what happens with patch 2. -Peff ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH] mailinfo: use starts_with() when checking scissors 2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak 2019-04-01 21:53 ` [PATCH v2 2/2] mailinfo: support Unicode scissors Andrei Rybak 2019-04-02 14:28 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Jeff King @ 2021-06-08 20:48 ` Andrei Rybak 2021-06-08 21:57 ` Jeff King 2 siblings, 1 reply; 16+ messages in thread From: Andrei Rybak @ 2021-06-08 20:48 UTC (permalink / raw) To: git; +Cc: Jeff King Existing checks for scissors characters using memcmp(3) never read past the end of the line, because all substrings we are interested in are two characters long, and the outer loop guarantees we have at least one character. So at most we will look at the NUL. However, this is too subtle and may lead to bugs in code which copies this behavior without realizing substring length requirement. So use starts_with() instead, which will stop at NUL regardless of the length of the prefix. Remove extra pair of parentheses while we are here. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com> --- mailinfo.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) This patch was originally part of: https://public-inbox.org/git/20190401215334.18678-1-rybak.a.v@gmail.com/ I've finally gotten around to sending it on its own. diff --git a/mailinfo.c b/mailinfo.c index ccc6beb27e..8013e5c566 100644 --- a/mailinfo.c +++ b/mailinfo.c @@ -705,8 +705,8 @@ static int is_scissors_line(const char *line) perforation++; continue; } - if ((!memcmp(c, ">8", 2) || !memcmp(c, "8<", 2) || - !memcmp(c, ">%", 2) || !memcmp(c, "%<", 2))) { + if (starts_with(c, ">8") || starts_with(c, "8<") || + starts_with(c, ">%") || starts_with(c, "%<")) { in_perforation = 1; perforation += 2; scissors += 2; -- 2.31.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] mailinfo: use starts_with() when checking scissors 2021-06-08 20:48 ` [PATCH] mailinfo: use starts_with() when checking scissors Andrei Rybak @ 2021-06-08 21:57 ` Jeff King 2021-06-09 2:22 ` Junio C Hamano 0 siblings, 1 reply; 16+ messages in thread From: Jeff King @ 2021-06-08 21:57 UTC (permalink / raw) To: Andrei Rybak; +Cc: git On Tue, Jun 08, 2021 at 10:48:41PM +0200, Andrei Rybak wrote: > Existing checks for scissors characters using memcmp(3) never read past > the end of the line, because all substrings we are interested in are two > characters long, and the outer loop guarantees we have at least one > character. So at most we will look at the NUL. > > However, this is too subtle and may lead to bugs in code which copies > this behavior without realizing substring length requirement. So use > starts_with() instead, which will stop at NUL regardless of the length > of the prefix. Remove extra pair of parentheses while we are here. > > Helped-by: Jeff King <peff@peff.net> > Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com> > --- > mailinfo.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > This patch was originally part of: > https://public-inbox.org/git/20190401215334.18678-1-rybak.a.v@gmail.com/ > I've finally gotten around to sending it on its own. Well, after seeing the date I don't feel too bad for not remembering it. :) I think this is an improvement in safety and readability, and worth applying. I'd also be fine going further and using skip_prefix(), which would let us drop the magic-number "2", though: - as Junio said in that earlier thread, we hope people aren't encouraged to add more versions of scissors here anyway). - I am not 100% sure that the increment of "c" in the conditional is not also relying on that "2", as well (i.e., it is incrementing one, and then the for-loop increments one). There's a non-trivial risk of introducing off-by-one errors or even more subtlety here. :) So I'm OK leaving that. Thanks for resurrecting this cleanup. -Peff ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mailinfo: use starts_with() when checking scissors 2021-06-08 21:57 ` Jeff King @ 2021-06-09 2:22 ` Junio C Hamano 2021-06-09 3:59 ` Jeff King 0 siblings, 1 reply; 16+ messages in thread From: Junio C Hamano @ 2021-06-09 2:22 UTC (permalink / raw) To: Jeff King; +Cc: Andrei Rybak, git Jeff King <peff@peff.net> writes: > I think this is an improvement in safety and readability, and worth > applying. I'd also be fine going further and using skip_prefix(), which > would let us drop the magic-number "2", though: > > - as Junio said in that earlier thread, we hope people aren't > encouraged to add more versions of scissors here anyway). > > - I am not 100% sure that the increment of "c" in the conditional > is not also relying on that "2", as well (i.e., it is incrementing > one, and then the for-loop increments one). There's a non-trivial > risk of introducing off-by-one errors or even more subtlety here. :) > > So I'm OK leaving that. Thanks for resurrecting this cleanup. You are correct to point out that the increment of "c" relies on 2. We increment by one here, and let the loop control to increment another, to skip over these two bytes. The posted patch _might_ make it easier to read, but I do not think it improves safety at all. At the point of doing memcmp(), we know that *c is not NUL and because we are dealing with NUL-terminated string, we know we can check c[1] (otherwise we wouldn't even be able to see if c is pointing at the end of string), so we check c[0] and c[1] against four variants of two-byte scissors patterns. The current code uses memcmp() of 2 bytes, which is perfectly safe under the condition, and starts_with() would also be equally safe. If we were to teach new scissors sequence that is longer than two bytes, then starts_with() would start becoming safer, but that will not happen, so... ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] mailinfo: use starts_with() when checking scissors 2021-06-09 2:22 ` Junio C Hamano @ 2021-06-09 3:59 ` Jeff King 2021-06-09 5:11 ` Junio C Hamano 0 siblings, 1 reply; 16+ messages in thread From: Jeff King @ 2021-06-09 3:59 UTC (permalink / raw) To: Junio C Hamano; +Cc: Andrei Rybak, git On Wed, Jun 09, 2021 at 11:22:10AM +0900, Junio C Hamano wrote: > The posted patch _might_ make it easier to read, but I do not think > it improves safety at all. At the point of doing memcmp(), we know > that *c is not NUL and because we are dealing with NUL-terminated > string, we know we can check c[1] (otherwise we wouldn't even be > able to see if c is pointing at the end of string), so we check c[0] > and c[1] against four variants of two-byte scissors patterns. The > current code uses memcmp() of 2 bytes, which is perfectly safe under > the condition, and starts_with() would also be equally safe. > > If we were to teach new scissors sequence that is longer than two > bytes, then starts_with() would start becoming safer, but that will > not happen, so... Right, I agree the current code is safe. The main value would be trying to make the correctness of the code more apparent. Perhaps a comment would be better there, like: diff --git a/mailinfo.c b/mailinfo.c index ccc6beb27e..25b606db28 100644 --- a/mailinfo.c +++ b/mailinfo.c @@ -705,12 +705,17 @@ static int is_scissors_line(const char *line) perforation++; continue; } + /* + * passing 2 bytes to memcmp is safe here; we have at least + * one non-NUL character from the loop condition, plus a + * terminating NUL + */ if ((!memcmp(c, ">8", 2) || !memcmp(c, "8<", 2) || !memcmp(c, ">%", 2) || !memcmp(c, "%<", 2))) { in_perforation = 1; perforation += 2; scissors += 2; - c++; + c++; /* only 1 here to account for loop increment */ continue; } in_perforation = 0; Though since this is code we'd not plan to modify, and presumably anybody touching it would have to fully grok the loop anyway, it might not be that important. I dunno. I offer it as an alternative (and am happy to add a commit message). But I'm fine with leaving it as-is, too. -Peff ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] mailinfo: use starts_with() when checking scissors 2021-06-09 3:59 ` Jeff King @ 2021-06-09 5:11 ` Junio C Hamano 0 siblings, 0 replies; 16+ messages in thread From: Junio C Hamano @ 2021-06-09 5:11 UTC (permalink / raw) To: Jeff King; +Cc: Andrei Rybak, git Jeff King <peff@peff.net> writes: > I dunno. I offer it as an alternative (and am happy to add a commit > message). But I'm fine with leaving it as-is, too. I am fine with Andrei's version as posted, which I've queued. Thanks, both. ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2021-06-09 5:11 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-03-31 22:01 [PATCH] mailinfo: support Unicode scissors Andrei Rybak 2019-03-31 23:09 ` SZEDER Gábor 2019-04-01 9:07 ` Junio C Hamano 2019-04-01 10:11 ` Jeff King 2019-04-01 9:27 ` Duy Nguyen 2019-04-01 21:54 ` Andrei Rybak 2019-04-01 21:53 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Andrei Rybak 2019-04-01 21:53 ` [PATCH v2 2/2] mailinfo: support Unicode scissors Andrei Rybak 2019-04-02 14:36 ` Jeff King 2019-04-03 6:47 ` Junio C Hamano 2019-04-02 14:28 ` [PATCH v2 1/2] mailinfo: use starts_with() for clarity Jeff King 2021-06-08 20:48 ` [PATCH] mailinfo: use starts_with() when checking scissors Andrei Rybak 2021-06-08 21:57 ` Jeff King 2021-06-09 2:22 ` Junio C Hamano 2021-06-09 3:59 ` Jeff King 2021-06-09 5:11 ` Junio C Hamano
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).