bug report: symbolic-ref --short command echos the wrong text while use Chinese language

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* bug report: symbolic-ref --short command echos the wrong text while use Chinese language
@ 2023-02-13  6:38 孟子易
  2023-02-13 20:18 ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: 孟子易 @ 2023-02-13  6:38 UTC (permalink / raw)
  To: git

System: Mac Os (Ventura 13.2)
Language: Chinese simplified
Preconditions:
# git checkout -b 测试-加-增加-加-增加
# git symbolic-ref --short HEAD
Wrong Echo (Current Echo):
测试-�
Correct Echo:
// I Don't know, may be "测试-加" ?

Tip:
not all Chinese words can cause this bug.
"加" is one of them, but not the only one.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-13  6:38 bug report: symbolic-ref --short command echos the wrong text while use Chinese language 孟子易
@ 2023-02-13 20:18 ` Jeff King
  2023-02-13 22:58   ` Eric Sunshine
  2023-02-15 16:26   ` Torsten Bögershausen
  0 siblings, 2 replies; 46+ messages in thread
From: Jeff King @ 2023-02-13 20:18 UTC (permalink / raw)
  To: 孟子易; +Cc: git

On Mon, Feb 13, 2023 at 02:38:08PM +0800, 孟子易 wrote:

> System: Mac Os (Ventura 13.2)
> Language: Chinese simplified
> Preconditions:
> # git checkout -b 测试-加-增加-加-增加
> # git symbolic-ref --short HEAD
> Wrong Echo (Current Echo):
> 测试-�
> Correct Echo:
> // I Don't know, may be "测试-加" ?

Hmm, I can't reproduce here on Linux:

  $ git init
  $ git commit --allow-empty -m foo
  $ git checkout -b 测试-加-增加-加-增加
  $ git symbolic-ref --short HEAD
  测试-加-增加-加-增加

I wonder if it is related to using macOS. The refs are stored as
individual files in the filesystem, and HFS+ will do some unicode
normalization. So I get:

  $ ls .git/refs/heads/ | xxd
  00000000: 6d61 696e 0ae6 b58b e8af 952d e58a a02d  main.......-...-
  00000010: e5a2 9ee5 8aa0 2de5 8aa0 2de5 a29e e58a  ......-...-.....
  00000020: a00a          

Are your on-disk bytes different?

My instinct was that this might be related to the shortening code
treating the names as bytes, rather than characters. But looking at
shorten_unambiguous_ref(), it is really operating at the level of path
components, and should never split a partial string.

Another possibility: the shortening is done by applying our usual
ref-resolving rules one by one via scanf(). There's an assumption in the
code that the resulting string can never be longer than the input:

	/* buffer for scanf result, at most refname must fit */
	short_name = xstrdup(refname);

	...
        for (i = nr_rules - 1; i > 0 ; --i) {
		...
                if (1 != sscanf(refname, scanf_fmts[i], short_name))
                        continue;

Is it possible that this assumption is violated based on some particular
combination of unicode normalization and locale? That seems unlikely to
me, but it wouldn't be the first time I've been surprised by subtle
unicode implications.

Is it possible for you to run Git in a debugger and check the
intermediate steps happening in refs_shorten_unambiguous_ref()?

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-13 20:18 ` Jeff King
@ 2023-02-13 22:58   ` Eric Sunshine
  2023-02-14  1:39     ` Jeff King
  2023-02-15 16:26   ` Torsten Bögershausen
  1 sibling, 1 reply; 46+ messages in thread
From: Eric Sunshine @ 2023-02-13 22:58 UTC (permalink / raw)
  To: Jeff King; +Cc: 孟子易, git

On Mon, Feb 13, 2023 at 3:49 PM Jeff King <peff@peff.net> wrote:
> On Mon, Feb 13, 2023 at 02:38:08PM +0800, 孟子易 wrote:
> > System: Mac Os (Ventura 13.2)
> > Language: Chinese simplified
> > Preconditions:
> > # git checkout -b 测试-加-增加-加-增加
> > # git symbolic-ref --short HEAD
> > Wrong Echo (Current Echo):
> > 测试-�
> > Correct Echo:
> > // I Don't know, may be "测试-加" ?
>
> Hmm, I can't reproduce here on Linux:
>
>   $ git init
>   $ git commit --allow-empty -m foo
>   $ git checkout -b 测试-加-增加-加-增加
>   $ git symbolic-ref --short HEAD
>   测试-加-增加-加-增加

I am able to reproduce the broken behavior on maOS 10.13.

> I wonder if it is related to using macOS. The refs are stored as
> individual files in the filesystem, and HFS+ will do some unicode
> normalization. So I get:
>
>   $ ls .git/refs/heads/ | xxd
>   00000000: 6d61 696e 0ae6 b58b e8af 952d e58a a02d  main.......-...-
>   00000010: e5a2 9ee5 8aa0 2de5 8aa0 2de5 a29e e58a  ......-...-.....
>   00000020: a00a
>
> Are your on-disk bytes different?

I get the exact same results (on HFS+) as you show:

  $ ls .git/refs/heads/ | xxd
  00000000: 6d61 696e 0ae6 b58b e8af 952d e58a a02d  main.......-...-
  00000010: e5a2 9ee5 8aa0 2de5 8aa0 2de5 a29e e58a  ......-...-.....
  00000020: a00a                                     ..

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-13 22:58   ` Eric Sunshine
@ 2023-02-14  1:39     ` Jeff King
  2023-02-14  5:15       ` Eric Sunshine
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-02-14  1:39 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: 孟子易, git

On Mon, Feb 13, 2023 at 05:58:06PM -0500, Eric Sunshine wrote:

> I am able to reproduce the broken behavior on maOS 10.13.

That's helpful.

> > Are your on-disk bytes different?
> 
> I get the exact same results (on HFS+) as you show:

Hmm, so we can probably rule out HFS weirdness (and anyway, I guess we'd
be getting the name from the HEAD file, which would be stored verbatim).

If you do something like the patch below:

diff --git a/refs.c b/refs.c
index e31dbcda59..2fa53fef7f 100644
--- a/refs.c
+++ b/refs.c
@@ -1350,7 +1350,8 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 		return xstrdup(refname);
 
 	/* buffer for scanf result, at most refname must fit */
-	short_name = xstrdup(refname);
+	warning("strlen(refname) = %d", (int)strlen(refname));
+	short_name = xcalloc(1, 2*strlen(refname));
 
 	/* skip first rule, it will always match */
 	for (i = nr_rules - 1; i > 0 ; --i) {
@@ -1362,6 +1363,7 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 			continue;
 
 		short_name_len = strlen(short_name);
+		warning("strlen(short_name) = %d", (int)short_name_len);
 
 		/*
 		 * in strict mode, all (except the matched one) rules

Does it help at all? And if so, is short_name_len longer than we might
expect it to be (I get 39 for the full refname and 28 for the scanf'd
name).  I'm having trouble coming up with a reason that the scanf value
_would_ be unexpectedly long, but just trying to rule things out.

-Peff

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-14  1:39     ` Jeff King
@ 2023-02-14  5:15       ` Eric Sunshine
  2023-02-14  5:33         ` Eric Sunshine
  0 siblings, 1 reply; 46+ messages in thread
From: Eric Sunshine @ 2023-02-14  5:15 UTC (permalink / raw)
  To: Jeff King; +Cc: 孟子易, git

On Mon, Feb 13, 2023 at 8:39 PM Jeff King <peff@peff.net> wrote:
> On Mon, Feb 13, 2023 at 05:58:06PM -0500, Eric Sunshine wrote:
> > I am able to reproduce the broken behavior on maOS 10.13.
>
> If you do something like the patch below:
>
> -       short_name = xstrdup(refname);
> +       warning("strlen(refname) = %d", (int)strlen(refname));
> +       short_name = xcalloc(1, 2*strlen(refname));
>
>                 short_name_len = strlen(short_name);
> +               warning("strlen(short_name) = %d", (int)short_name_len);
>
> Does it help at all? And if so, is short_name_len longer than we might
> expect it to be (I get 39 for the full refname and 28 for the scanf'd
> name).  I'm having trouble coming up with a reason that the scanf value
> _would_ be unexpectedly long, but just trying to rule things out.

I get results different from yours:

  warning: strlen(refname) = 39
  warning: strlen(short_name) = 9

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-14  5:15       ` Eric Sunshine
@ 2023-02-14  5:33         ` Eric Sunshine
  2023-02-14  5:40           ` Junio C Hamano
  0 siblings, 1 reply; 46+ messages in thread
From: Eric Sunshine @ 2023-02-14  5:33 UTC (permalink / raw)
  To: Jeff King; +Cc: 孟子易, git

On Tue, Feb 14, 2023 at 12:15 AM Eric Sunshine <sunshine@sunshineco.com> wrote:
> On Mon, Feb 13, 2023 at 8:39 PM Jeff King <peff@peff.net> wrote:
> > If you do something like the patch below:
> >
> > -       short_name = xstrdup(refname);
> > +       warning("strlen(refname) = %d", (int)strlen(refname));
> > +       short_name = xcalloc(1, 2*strlen(refname));
> >
> >                 short_name_len = strlen(short_name);
> > +               warning("strlen(short_name) = %d", (int)short_name_len);
> >
> > Does it help at all? And if so, is short_name_len longer than we might
> > expect it to be (I get 39 for the full refname and 28 for the scanf'd
> > name).  I'm having trouble coming up with a reason that the scanf value
> > _would_ be unexpectedly long, but just trying to rule things out.
>
> I get results different from yours:
>
>   warning: strlen(refname) = 39
>   warning: strlen(short_name) = 9

sscanf() seems to be the culprit. If I set LANG, then I get the same
result as you get:

  $ LANG=zh-CN.UTF-8 git symbolic-ref --short HEAD
  warning: strlen(refname) = 39
  warning: strlen(short_name) = 28
  测试-加-增加-加-增加

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-14  5:33         ` Eric Sunshine
@ 2023-02-14  5:40           ` Junio C Hamano
  2023-02-14  6:05             ` Eric Sunshine
  0 siblings, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2023-02-14  5:40 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Jeff King, 孟子易, git

Eric Sunshine <sunshine@sunshineco.com> writes:

> sscanf() seems to be the culprit. If I set LANG, then I get the same
> result as you get:
>
>   $ LANG=zh-CN.UTF-8 git symbolic-ref --short HEAD
>   warning: strlen(refname) = 39
>   warning: strlen(short_name) = 28
>   测试-加-增加-加-增加

Well, that's ... criminal.

I wonder if LANG=$ANY_VALID_ONE.UTF-8 would work the same way.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-14  5:40           ` Junio C Hamano
@ 2023-02-14  6:05             ` Eric Sunshine
  2023-02-14  6:45               ` Junio C Hamano
  0 siblings, 1 reply; 46+ messages in thread
From: Eric Sunshine @ 2023-02-14  6:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, 孟子易, git

On Tue, Feb 14, 2023 at 12:40 AM Junio C Hamano <gitster@pobox.com> wrote:
> Eric Sunshine <sunshine@sunshineco.com> writes:
> > sscanf() seems to be the culprit. If I set LANG, then I get the same
> > result as you get:
> >
> >   $ LANG=zh-CN.UTF-8 git symbolic-ref --short HEAD
> >   warning: strlen(refname) = 39
> >   warning: strlen(short_name) = 28
> >   测试-加-增加-加-增加
>
> Well, that's ... criminal.
>
> I wonder if LANG=$ANY_VALID_ONE.UTF-8 would work the same way.

Um, I suppose you are pointing out that I messed up by using invalid
LANG code "zh-CN"?

Using (presumably) valid LANG codes results in the buggy truncated
output, but "LANG=C" produces the correct result:

  $ for i in C en_US fr_FR de_DE ru_RU zh_CN; do printf "$i: " &&
LANG=$i.UTF-8 git symbolic-ref --short HEAD; done
  C: 测试-加-增加-加-增加
  en_US: 测试-?
  fr_FR: 测试-?
  de_DE: 测试-?
  ru_RU: 测试-?
  zh_CN: 测试-?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-14  6:05             ` Eric Sunshine
@ 2023-02-14  6:45               ` Junio C Hamano
  2023-02-14  6:55                 ` Eric Sunshine
  0 siblings, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2023-02-14  6:45 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Jeff King, 孟子易, git

Eric Sunshine <sunshine@sunshineco.com> writes:

> Using (presumably) valid LANG codes results in the buggy truncated
> output, but "LANG=C" produces the correct result:
>
>   $ for i in C en_US fr_FR de_DE ru_RU zh_CN; do printf "$i: " &&
> LANG=$i.UTF-8 git symbolic-ref --short HEAD; done
>   C: 测试-加-增加-加-增加
>   en_US: 测试-?
>   fr_FR: 测试-?
>   de_DE: 测试-?
>   ru_RU: 测试-?
>   zh_CN: 测试-?

Interesting.  

So the system cares more than just "is this a valid UTF-8 sequence?"
but somehow knows that the given sequence is a valid Chinese and not
valid English?  ---oh, no, zh_CN is rejected, but your earlier zh-CN
somehow was accepted?

Now, it is beyond my ability to guess what macOS is internally doing
wrong X-<.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-14  6:45               ` Junio C Hamano
@ 2023-02-14  6:55                 ` Eric Sunshine
  2023-02-14 16:01                   ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: Eric Sunshine @ 2023-02-14  6:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, 孟子易, git

On Tue, Feb 14, 2023 at 1:45 AM Junio C Hamano <gitster@pobox.com> wrote:
> Eric Sunshine <sunshine@sunshineco.com> writes:
> > Using (presumably) valid LANG codes results in the buggy truncated
> > output, but "LANG=C" produces the correct result:
> >
> >   $ for i in C en_US fr_FR de_DE ru_RU zh_CN; do printf "$i: " &&
> > LANG=$i.UTF-8 git symbolic-ref --short HEAD; done
> >   C: 测试-加-增加-加-增加
> >   en_US: 测试-?
> >   fr_FR: 测试-?
> >   de_DE: 测试-?
> >   ru_RU: 测试-?
> >   zh_CN: 测试-?
>
> So the system cares more than just "is this a valid UTF-8 sequence?"
> but somehow knows that the given sequence is a valid Chinese and not
> valid English?  ---oh, no, zh_CN is rejected, but your earlier zh-CN
> somehow was accepted?
>
> Now, it is beyond my ability to guess what macOS is internally doing
> wrong X-<.

I  don't think the earlier incorrect "zh-CN" (in which I used "-"
rather than "_") was accepted. Rather, the system simply didn't
recognize it, thus presumably fell back to "C" locale. The same
"correct" output results from any bogus LANG value:

  $ LANG=bogus git symbolic-ref --short HEAD
  测试-加-增加-加-增加

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-14  6:55                 ` Eric Sunshine
@ 2023-02-14 16:01                   ` Jeff King
  2023-02-14 16:29                     ` Eric Sunshine
  2023-02-14 16:40                     ` bug report: symbolic-ref --short command echos the wrong text while use Chinese language Junio C Hamano
  0 siblings, 2 replies; 46+ messages in thread
From: Jeff King @ 2023-02-14 16:01 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Junio C Hamano, 孟子易, git

On Tue, Feb 14, 2023 at 01:55:32AM -0500, Eric Sunshine wrote:

> On Tue, Feb 14, 2023 at 1:45 AM Junio C Hamano <gitster@pobox.com> wrote:
> > Eric Sunshine <sunshine@sunshineco.com> writes:
> > > Using (presumably) valid LANG codes results in the buggy truncated
> > > output, but "LANG=C" produces the correct result:
> > >
> > >   $ for i in C en_US fr_FR de_DE ru_RU zh_CN; do printf "$i: " &&
> > > LANG=$i.UTF-8 git symbolic-ref --short HEAD; done
> > >   C: 测试-加-增加-加-增加
> > >   en_US: 测试-?
> > >   fr_FR: 测试-?
> > >   de_DE: 测试-?
> > >   ru_RU: 测试-?
> > >   zh_CN: 测试-?
> >
> > So the system cares more than just "is this a valid UTF-8 sequence?"
> > but somehow knows that the given sequence is a valid Chinese and not
> > valid English?  ---oh, no, zh_CN is rejected, but your earlier zh-CN
> > somehow was accepted?
> >
> > Now, it is beyond my ability to guess what macOS is internally doing
> > wrong X-<.
> 
> I  don't think the earlier incorrect "zh-CN" (in which I used "-"
> rather than "_") was accepted. Rather, the system simply didn't
> recognize it, thus presumably fell back to "C" locale. The same
> "correct" output results from any bogus LANG value:
> 
>   $ LANG=bogus git symbolic-ref --short HEAD
>   测试-加-增加-加-增加

Oof. So it is some weird locale thing that scanf is doing. I don't even
want to think about what the details could be. ;)

Since scanf is such a bad and error-prone interface in the first place
(and I'd actually like to put it on the banned list), what about just
parsing manually here? We are already implicitly assuming that each
rev-parse rule has a single "%.*s" in it. Armed with that knowledge,
it's not too hard to match using skip_prefix() and strip_suffix(). Or
with a little bit more custom code, we can avoid the step to pre-process
the rule strings completely. Something like:

 refs.c | 80 ++++++++++++++++++-----------------
 1 file changed, 42 insertions(+), 38 deletions(-)

diff --git a/refs.c b/refs.c
index e31dbcda59..8ec7426c05 100644
--- a/refs.c
+++ b/refs.c
@@ -1310,55 +1310,59 @@ int update_ref(const char *msg, const char *refname,
 			       old_oid, flags, onerr);
 }
 
+/*
+ * Check that the string refname matches a rule of the form
+ * "{prefix}%.*s{suffix}". So "foo/bar/baz" would match the rule
+ * "foo/%.*s/baz", and return the string "bar".
+ */
+static char *match_parse_rule(const char *refname, const char *rule)
+{
+	size_t len;
+
+	/*
+	 * Check that rule matches refname up to the first percent
+	 * in the rule. This is basically skip_prefix(), but
+	 * ending at percent in the prefix, rather than end-of-string.
+	 */
+	do {
+		if (!*rule)
+			BUG("rev-parse rule did not have percent");
+		if (*rule == '%')
+			break;
+	} while (*refname++ == *rule++);
+
+	/*
+	 * Check that we matched all the way to the "%" placeholder,
+	 * and skip past it within the rule string. If so, "refname" at
+	 * this point is the beginning of a potential match.
+	 */
+	if (!skip_prefix(rule, "%.*s", &rule))
+		return NULL;
+
+	/*
+	 * And now check that our suffix (if any) matches.
+	 */
+	if (!strip_suffix(refname, rule, &len))
+		return NULL;
+
+	return xmemdupz(refname, len);
+}
+
 char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 				   const char *refname, int strict)
 {
 	int i;
-	static char **scanf_fmts;
-	static int nr_rules;
 	char *short_name;
 	struct strbuf resolved_buf = STRBUF_INIT;
 
-	if (!nr_rules) {
-		/*
-		 * Pre-generate scanf formats from ref_rev_parse_rules[].
-		 * Generate a format suitable for scanf from a
-		 * ref_rev_parse_rules rule by interpolating "%s" at the
-		 * location of the "%.*s".
-		 */
-		size_t total_len = 0;
-		size_t offset = 0;
-
-		/* the rule list is NULL terminated, count them first */
-		for (nr_rules = 0; ref_rev_parse_rules[nr_rules]; nr_rules++)
-			/* -2 for strlen("%.*s") - strlen("%s"); +1 for NUL */
-			total_len += strlen(ref_rev_parse_rules[nr_rules]) - 2 + 1;
-
-		scanf_fmts = xmalloc(st_add(st_mult(sizeof(char *), nr_rules), total_len));
-
-		offset = 0;
-		for (i = 0; i < nr_rules; i++) {
-			assert(offset < total_len);
-			scanf_fmts[i] = (char *)&scanf_fmts[nr_rules] + offset;
-			offset += xsnprintf(scanf_fmts[i], total_len - offset,
-					    ref_rev_parse_rules[i], 2, "%s") + 1;
-		}
-	}
-
-	/* bail out if there are no rules */
-	if (!nr_rules)
-		return xstrdup(refname);
-
-	/* buffer for scanf result, at most refname must fit */
-	short_name = xstrdup(refname);
-
 	/* skip first rule, it will always match */
-	for (i = nr_rules - 1; i > 0 ; --i) {
+	for (i = NUM_REV_PARSE_RULES - 1; i > 0 ; --i) {
 		int j;
 		int rules_to_fail = i;
 		int short_name_len;
 
-		if (1 != sscanf(refname, scanf_fmts[i], short_name))
+		short_name = match_parse_rule(refname, ref_rev_parse_rules[i]);
+		if (!short_name)
 			continue;
 
 		short_name_len = strlen(short_name);
@@ -1368,7 +1372,7 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 		 * must fail to resolve to a valid non-ambiguous ref
 		 */
 		if (strict)
-			rules_to_fail = nr_rules;
+			rules_to_fail = NUM_REV_PARSE_RULES;
 
 		/*
 		 * check if the short name resolves to a valid ref,

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-14 16:01                   ` Jeff King
@ 2023-02-14 16:29                     ` Eric Sunshine
  2023-02-14 17:07                       ` Jeff King
  2023-02-14 16:40                     ` bug report: symbolic-ref --short command echos the wrong text while use Chinese language Junio C Hamano
  1 sibling, 1 reply; 46+ messages in thread
From: Eric Sunshine @ 2023-02-14 16:29 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, 孟子易, git

On Tue, Feb 14, 2023 at 11:01 AM Jeff King <peff@peff.net> wrote:
> On Tue, Feb 14, 2023 at 01:55:32AM -0500, Eric Sunshine wrote:
> > On Tue, Feb 14, 2023 at 1:45 AM Junio C Hamano <gitster@pobox.com> wrote:
> > > Eric Sunshine <sunshine@sunshineco.com> writes:
> > > > Using (presumably) valid LANG codes results in the buggy truncated
> > > > output, but "LANG=C" produces the correct result:
> > > >
> > > >   $ for i in C en_US fr_FR de_DE ru_RU zh_CN; do printf "$i: " &&
> > > > LANG=$i.UTF-8 git symbolic-ref --short HEAD; done
> > > >   C: 测试-加-增加-加-增加
> > > >   en_US: 测试-?

Imadummy. The example loop incorrectly uses LANG=C.UTF-8 rather than
LANG=C for the first item. However, LANG=C does indeed result in the
correct output:

  $ LANG=C git symbolic-ref --short HEAD
  测试-加-增加-加-增加

> >   $ LANG=bogus git symbolic-ref --short HEAD
> >   测试-加-增加-加-增加
>
> Oof. So it is some weird locale thing that scanf is doing. I don't even
> want to think about what the details could be. ;)
>
> Since scanf is such a bad and error-prone interface in the first place
> (and I'd actually like to put it on the banned list), what about just
> parsing manually here? We are already implicitly assuming that each
> rev-parse rule has a single "%.*s" in it. Armed with that knowledge,
> it's not too hard to match using skip_prefix() and strip_suffix(). Or
> with a little bit more custom code, we can avoid the step to pre-process
> the rule strings completely. Something like:
>
> +/*
> + * Check that the string refname matches a rule of the form
> + * "{prefix}%.*s{suffix}". So "foo/bar/baz" would match the rule
> + * "foo/%.*s/baz", and return the string "bar".
> + */
> +static char *match_parse_rule(const char *refname, const char *rule)

Yes, this works nicely and fixes the reported problem:

  % $GIT_DIR/bin-wrappers/git symbolic-ref --short HEAD
  测试-加-增加-加-增加

I'm all for this approach. Fewer scanf()'s, the better.

The new code itself looks correct; I think it properly covers all the
edge-cases (at least those that came to my mind).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-14 16:01                   ` Jeff King
  2023-02-14 16:29                     ` Eric Sunshine
@ 2023-02-14 16:40                     ` Junio C Hamano
  2023-02-14 17:40                       ` Jeff King
  1 sibling, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2023-02-14 16:40 UTC (permalink / raw)
  To: Jeff King; +Cc: Eric Sunshine, 孟子易, git

Jeff King <peff@peff.net> writes:

> Oof. So it is some weird locale thing that scanf is doing. I don't even
> want to think about what the details could be. ;)
>
> Since scanf is such a bad and error-prone interface in the first place
> (and I'd actually like to put it on the banned list), what about just
> parsing manually here?

Me likee.

This will eradicate the only use of sscanf() from the tree;
unfortunately there is a topic that adds a new one or two back in
flight X-<.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-14 16:29                     ` Eric Sunshine
@ 2023-02-14 17:07                       ` Jeff King
  2023-02-14 18:38                         ` [PATCH 0/3] get rid of sscanf() when shortening refs Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-02-14 17:07 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Junio C Hamano, 孟子易, git

On Tue, Feb 14, 2023 at 11:29:15AM -0500, Eric Sunshine wrote:

> > +static char *match_parse_rule(const char *refname, const char *rule)
> 
> Yes, this works nicely and fixes the reported problem:
> 
>   % $GIT_DIR/bin-wrappers/git symbolic-ref --short HEAD
>   测试-加-增加-加-增加
> 
> I'm all for this approach. Fewer scanf()'s, the better.
> 
> The new code itself looks correct; I think it properly covers all the
> edge-cases (at least those that came to my mind).

There's a leak in it when there is an ambiguous ref (because we allocate
short_name in each iteration of the loop, rather than overwriting the
buffer). I'll post a fixed version in a moment.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-14 16:40                     ` bug report: symbolic-ref --short command echos the wrong text while use Chinese language Junio C Hamano
@ 2023-02-14 17:40                       ` Jeff King
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-02-14 17:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Eric Sunshine, 孟子易, git

On Tue, Feb 14, 2023 at 08:40:23AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Oof. So it is some weird locale thing that scanf is doing. I don't even
> > want to think about what the details could be. ;)
> >
> > Since scanf is such a bad and error-prone interface in the first place
> > (and I'd actually like to put it on the banned list), what about just
> > parsing manually here?
> 
> Me likee.
> 
> This will eradicate the only use of sscanf() from the tree;
> unfortunately there is a topic that adds a new one or two back in
> flight X-<.

Sadly, not quite. There's one in test-date.c, though it would be pretty
easy to convert to strtol() or similar. But there's also an fscanf() in
builtin/gc.c. To my mind, fscanf() is even worse, as you cannot know the
size of the "%s" you're about to receive (the case here works around it
with a max-size specifier, so I don't think it's buggy, but the
contortions one must go through seem...not worth it).

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 0/3] get rid of sscanf() when shortening refs
  2023-02-14 17:07                       ` Jeff King
@ 2023-02-14 18:38                         ` Jeff King
  2023-02-14 18:39                           ` [PATCH 1/3] shorten_unambiguous_ref(): avoid integer truncation Jeff King
                                             ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Jeff King @ 2023-02-14 18:38 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Junio C Hamano, 孟子易, git

On Tue, Feb 14, 2023 at 12:07:17PM -0500, Jeff King wrote:

> There's a leak in it when there is an ambiguous ref (because we allocate
> short_name in each iteration of the loop, rather than overwriting the
> buffer). I'll post a fixed version in a moment.

OK, here it is. I split it into a few patches to hopefully make it a bit
easier to follow.

  [1/3]: shorten_unambiguous_ref(): avoid integer truncation
  [2/3]: shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant
  [3/3]: shorten_unambiguous_ref(): avoid sscanf()

 refs.c | 92 ++++++++++++++++++++++++++++++----------------------------
 1 file changed, 47 insertions(+), 45 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 1/3] shorten_unambiguous_ref(): avoid integer truncation
  2023-02-14 18:38                         ` [PATCH 0/3] get rid of sscanf() when shortening refs Jeff King
@ 2023-02-14 18:39                           ` Jeff King
  2023-02-14 18:40                           ` [PATCH 2/3] shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant Jeff King
                                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-02-14 18:39 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Junio C Hamano, 孟子易, git

We parse the shortened name "foo" out of the full refname
"refs/heads/foo", and then assign the result of strlen(short_name) to an
int, which may truncate or wrap to negative.

In practice, this should never happen, as it requires a 2GB refname. And
even somebody trying to do something malicious should at worst end up
with a confused answer (we use the size only to feed back as a
placeholder length to strbuf_addf() to see if there are any collisions
in the lookup rules).

And it may even be impossible to trigger this, as we parse the string
with sscanf(), and stdio formatting functions are not known for handling
large strings well. I didn't test, but I wouldn't be surprised if
sscanf() on many platforms simply reports no match here.

But even if it is not a problem in practice so far, it is worth fixing
for two reasons:

  1. We'll shortly be replacing the sscanf() call with a real parser
     which will handle arbitrary-sized strings.

  2. Assigning strlen() to an int is an anti-pattern that requires
     people to look twice when auditing for real overflow problems.

So we'll make this a size_t. Unfortunately we still have to cast to int
eventually for the strbuf_addf() call, but at least we can localize the
cast there, and check that it will be valid. I used our new cast helper
here, which will just bail completely. That should be OK, as anybody
with a 2GB refname is up to no good, but if we really wanted to, we
could detect it manually and just refuse to shorten the refname.

Signed-off-by: Jeff King <peff@peff.net>
---
 refs.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/refs.c b/refs.c
index e31dbcda59..94d938390d 100644
--- a/refs.c
+++ b/refs.c
@@ -1356,7 +1356,7 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 	for (i = nr_rules - 1; i > 0 ; --i) {
 		int j;
 		int rules_to_fail = i;
-		int short_name_len;
+		size_t short_name_len;

 		if (1 != sscanf(refname, scanf_fmts[i], short_name))
 			continue;
@@ -1388,7 +1388,8 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 			 */
 			strbuf_reset(&resolved_buf);
 			strbuf_addf(&resolved_buf, rule,
-				    short_name_len, short_name);
+				    cast_size_t_to_int(short_name_len),
+				    short_name);
 			if (refs_ref_exists(refs, resolved_buf.buf))
 				break;
 		}
-- 
2.39.1.849.g86e176252e

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 2/3] shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant
  2023-02-14 18:38                         ` [PATCH 0/3] get rid of sscanf() when shortening refs Jeff King
  2023-02-14 18:39                           ` [PATCH 1/3] shorten_unambiguous_ref(): avoid integer truncation Jeff King
@ 2023-02-14 18:40                           ` Jeff King
  2023-02-14 21:34                             ` Junio C Hamano
  2023-02-14 18:41                           ` [PATCH 3/3] shorten_unambiguous_ref(): avoid sscanf() Jeff King
  2023-02-15 15:16                           ` [PATCH v2 0/3] get rid of sscanf() when shortening refs Jeff King
  3 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-02-14 18:40 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Junio C Hamano, 孟子易, git

The ref_rev_parse_rules[] array is terminated with a NULL entry, and we
count it and store the result in the local nr_rules variable. But we
don't need to do so; since the array is a constant, we can compute its
size directly. The original code probably didn't do that because it was
written as part of for-each-ref, and saw the array only as a pointer. It
was migrated in 7c2b3029df (make get_short_ref a public function,
2009-04-07) and could have been updated then, but that subtlety was not
noticed.

We even have a constant that represents this value already, courtesy of
60650a48c0 (remote: make refspec follow the same disambiguation rule as
local refs, 2018-08-01), though again, nobody noticed at the time that
it could be used here, too.

The current count-up isn't a big deal, as we need to preprocess that
array anyway. But it will become more cumbersome as we refactor the
shortening code. So let's get rid of it and just use the constant
everywhere.

Note that there are two things here that aren't just simple text
replacements:

  1. We also use nr_rules to see if a previous call has initialized the
     static pre-processing variables. We can just use the scanf_fmts
     pointer to do the same thing, as it is non-NULL only after we've
     done that initialization.

  2. If nr_rules is zero after we've counted it up, we bail from the
     function. This code is unreachable, though, as the set of rules is
     hard-coded and non-empty. And that becomes even more apparent now
     that we are using the constant. So we can drop this conditional
     completely (and ironically, the code would have the same output if
     it _did_ trigger, as we'd simply skip the loop entirely and return
     the whole refname).

Signed-off-by: Jeff King <peff@peff.net>
---
 refs.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/refs.c b/refs.c
index 94d938390d..84f344d8af 100644
--- a/refs.c
+++ b/refs.c
@@ -1315,11 +1315,10 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 {
 	int i;
 	static char **scanf_fmts;
-	static int nr_rules;
 	char *short_name;
 	struct strbuf resolved_buf = STRBUF_INIT;
 
-	if (!nr_rules) {
+	if (!scanf_fmts) {
 		/*
 		 * Pre-generate scanf formats from ref_rev_parse_rules[].
 		 * Generate a format suitable for scanf from a
@@ -1329,31 +1328,26 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 		size_t total_len = 0;
 		size_t offset = 0;
 
-		/* the rule list is NULL terminated, count them first */
-		for (nr_rules = 0; ref_rev_parse_rules[nr_rules]; nr_rules++)
+		for (i = 0; i < NUM_REV_PARSE_RULES; i++)
 			/* -2 for strlen("%.*s") - strlen("%s"); +1 for NUL */
-			total_len += strlen(ref_rev_parse_rules[nr_rules]) - 2 + 1;
+			total_len += strlen(ref_rev_parse_rules[i]) - 2 + 1;
 
-		scanf_fmts = xmalloc(st_add(st_mult(sizeof(char *), nr_rules), total_len));
+		scanf_fmts = xmalloc(st_add(st_mult(sizeof(char *), NUM_REV_PARSE_RULES), total_len));
 
 		offset = 0;
-		for (i = 0; i < nr_rules; i++) {
+		for (i = 0; i < NUM_REV_PARSE_RULES; i++) {
 			assert(offset < total_len);
-			scanf_fmts[i] = (char *)&scanf_fmts[nr_rules] + offset;
+			scanf_fmts[i] = (char *)&scanf_fmts[NUM_REV_PARSE_RULES] + offset;
 			offset += xsnprintf(scanf_fmts[i], total_len - offset,
 					    ref_rev_parse_rules[i], 2, "%s") + 1;
 		}
 	}
 
-	/* bail out if there are no rules */
-	if (!nr_rules)
-		return xstrdup(refname);
-
 	/* buffer for scanf result, at most refname must fit */
 	short_name = xstrdup(refname);
 
 	/* skip first rule, it will always match */
-	for (i = nr_rules - 1; i > 0 ; --i) {
+	for (i = NUM_REV_PARSE_RULES - 1; i > 0 ; --i) {
 		int j;
 		int rules_to_fail = i;
 		size_t short_name_len;
@@ -1368,7 +1362,7 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 		 * must fail to resolve to a valid non-ambiguous ref
 		 */
 		if (strict)
-			rules_to_fail = nr_rules;
+			rules_to_fail = NUM_REV_PARSE_RULES;
 
 		/*
 		 * check if the short name resolves to a valid ref,
-- 
2.39.1.849.g86e176252e


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-14 18:38                         ` [PATCH 0/3] get rid of sscanf() when shortening refs Jeff King
  2023-02-14 18:39                           ` [PATCH 1/3] shorten_unambiguous_ref(): avoid integer truncation Jeff King
  2023-02-14 18:40                           ` [PATCH 2/3] shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant Jeff King
@ 2023-02-14 18:41                           ` Jeff King
  2023-02-14 21:48                             ` Junio C Hamano
  2023-02-15 15:16                           ` [PATCH v2 0/3] get rid of sscanf() when shortening refs Jeff King
  3 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-02-14 18:41 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Junio C Hamano, 孟子易, git

To shorten a fully qualified ref (e.g., taking "refs/heads/foo" to just
"foo"), we munge the usual lookup rules ("refs/heads/%.*s", etc) to drop
the ".*" modifier (so "refs/heads/%s"), and then use sscanf() to match
that against the refname, pulling the "%s" content into a separate
buffer.

This has two downsides:

  - sscanf("%s") reportedly misbehaves on macOS with some input and
    locale combinations, returning a partial or garbled string. See
    this thread:

      https://lore.kernel.org/git/CAGF3oAcCi+fG12j-1U0hcrWwkF5K_9WhOi6ZPHBzUUzfkrZDxA@mail.gmail.com/

  - scanf in general is an error-prone interface. For example, scanning
    for "%s" will copy bytes into a destination string, which must have
    been correctly sized ahead of time to avoid a buffer overflow. In
    this case, the code is OK (the buffer is pessimistically sized to
    match the original string, which should give us a maximum). But in
    general, we do not want to encourage people to use scanf at all.

So instead, let's note that our lookup rules are not arbitrary format
strings, but all contain exactly one "%.*s" placeholder. We already rely
on this, both for lookup (we feed the lookup format along with exactly
one int/ptr combo to snprintf, etc) and for shortening (we munge "%.*s"
to "%s", and then insist that sscanf() finds exactly one result).

We can parse this manually by just matching the bytes that occur before
and after the "%.*s" placeholder. While we have a few extra lines of
parsing code, the result is arguably simpler, as can skip the
preprocessing step and its tricky memory management entirely.

The in-code comments should explain the parsing strategy, but there's
one subtle change here. The original code allocated a single buffer, and
then overwrote it in each loop iteration, since that's the only option
sscanf() gives us. But our parser can actually return a ptr/len combo
for the matched string, which is all we need (since we just feed it back
to the lookup rules with "%.*s"), and then copy it only when returning
to the caller.

Reported-by: 孟子易 <mengziyi540841@gmail.com>
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Jeff King <peff@peff.net>
---
BTW, this diff is generated with --patience, which generates a _much_
nicer output in this case. Not important to this series, but since there
was discussion of switching the default in a nearby thread, it seemed
like an interesting example.

 refs.c | 77 ++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 42 insertions(+), 35 deletions(-)

diff --git a/refs.c b/refs.c
index 84f344d8af..d8ce7e9ee1 100644
--- a/refs.c
+++ b/refs.c
@@ -1310,53 +1310,61 @@ int update_ref(const char *msg, const char *refname,
 			       old_oid, flags, onerr);
 }
 
+/*
+ * Check that the string refname matches a rule of the form
+ * "{prefix}%.*s{suffix}". So "foo/bar/baz" would match the rule
+ * "foo/%.*s/baz", and return the string "bar".
+ */
+static const char *match_parse_rule(const char *refname, const char *rule,
+				    size_t *len)
+{
+	/*
+	 * Check that rule matches refname up to the first percent
+	 * in the rule. This is basically skip_prefix(), but
+	 * ending at percent in the prefix, rather than end-of-string.
+	 */
+	do {
+		if (!*rule)
+			BUG("rev-parse rule did not have percent");
+		if (*rule == '%')
+			break;
+	} while (*refname++ == *rule++);
+
+	/*
+	 * Check that we matched all the way to the "%" placeholder,
+	 * and skip past it within the rule string. If so, "refname" at
+	 * this point is the beginning of a potential match.
+	 */
+	if (!skip_prefix(rule, "%.*s", &rule))
+		return NULL;
+
+	/*
+	 * And now check that our suffix (if any) matches.
+	 */
+	if (!strip_suffix(refname, rule, len))
+		return NULL;
+
+	return refname; /* len set by strip_suffix() */
+}
+
 char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 				   const char *refname, int strict)
 {
 	int i;
-	static char **scanf_fmts;
-	char *short_name;
 	struct strbuf resolved_buf = STRBUF_INIT;
 
-	if (!scanf_fmts) {
-		/*
-		 * Pre-generate scanf formats from ref_rev_parse_rules[].
-		 * Generate a format suitable for scanf from a
-		 * ref_rev_parse_rules rule by interpolating "%s" at the
-		 * location of the "%.*s".
-		 */
-		size_t total_len = 0;
-		size_t offset = 0;
-
-		for (i = 0; i < NUM_REV_PARSE_RULES; i++)
-			/* -2 for strlen("%.*s") - strlen("%s"); +1 for NUL */
-			total_len += strlen(ref_rev_parse_rules[i]) - 2 + 1;
-
-		scanf_fmts = xmalloc(st_add(st_mult(sizeof(char *), NUM_REV_PARSE_RULES), total_len));
-
-		offset = 0;
-		for (i = 0; i < NUM_REV_PARSE_RULES; i++) {
-			assert(offset < total_len);
-			scanf_fmts[i] = (char *)&scanf_fmts[NUM_REV_PARSE_RULES] + offset;
-			offset += xsnprintf(scanf_fmts[i], total_len - offset,
-					    ref_rev_parse_rules[i], 2, "%s") + 1;
-		}
-	}
-
-	/* buffer for scanf result, at most refname must fit */
-	short_name = xstrdup(refname);
-
 	/* skip first rule, it will always match */
 	for (i = NUM_REV_PARSE_RULES - 1; i > 0 ; --i) {
 		int j;
 		int rules_to_fail = i;
+		const char *short_name;
 		size_t short_name_len;
 
-		if (1 != sscanf(refname, scanf_fmts[i], short_name))
+		short_name = match_parse_rule(refname, ref_rev_parse_rules[i],
+					      &short_name_len);
+		if (!short_name)
 			continue;
 
-		short_name_len = strlen(short_name);
-
 		/*
 		 * in strict mode, all (except the matched one) rules
 		 * must fail to resolve to a valid non-ambiguous ref
@@ -1394,12 +1402,11 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 		 */
 		if (j == rules_to_fail) {
 			strbuf_release(&resolved_buf);
-			return short_name;
+			return xmemdupz(short_name, short_name_len);
 		}
 	}
 
 	strbuf_release(&resolved_buf);
-	free(short_name);
 	return xstrdup(refname);
 }
 
-- 
2.39.1.849.g86e176252e

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/3] shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant
  2023-02-14 18:40                           ` [PATCH 2/3] shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant Jeff King
@ 2023-02-14 21:34                             ` Junio C Hamano
  2023-02-14 22:23                               ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2023-02-14 21:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Eric Sunshine, 孟子易, git

Jeff King <peff@peff.net> writes:

> The current count-up isn't a big deal, as we need to preprocess that
> array anyway. But it will become more cumbersome as we refactor the
> shortening code. So let's get rid of it and just use the constant
> everywhere.

OK.  As the array is constant, we could lose its NULL-termination
and -1 from the definition of NUM_REV_PARSE_RULES, but that has iffy
upside, and can come on top of the series if we really wanted to.

Looking good so far.

Thanks.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-14 18:41                           ` [PATCH 3/3] shorten_unambiguous_ref(): avoid sscanf() Jeff King
@ 2023-02-14 21:48                             ` Junio C Hamano
  2023-02-14 22:25                               ` Junio C Hamano
                                                 ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Junio C Hamano @ 2023-02-14 21:48 UTC (permalink / raw)
  To: Jeff King; +Cc: Eric Sunshine, 孟子易, git

Jeff King <peff@peff.net> writes:

> +/*
> + * Check that the string refname matches a rule of the form
> + * "{prefix}%.*s{suffix}". So "foo/bar/baz" would match the rule
> + * "foo/%.*s/baz", and return the string "bar".
> + */
> +static const char *match_parse_rule(const char *refname, const char *rule,
> +				    size_t *len)
> +{
> +	/*
> +	 * Check that rule matches refname up to the first percent
> +	 * in the rule. This is basically skip_prefix(), but
> +	 * ending at percent in the prefix, rather than end-of-string.
> +	 */
> +	do {
> +		if (!*rule)
> +			BUG("rev-parse rule did not have percent");
> +		if (*rule == '%')
> +			break;
> +	} while (*refname++ == *rule++);

So, if we have refname="refs/heads/frotz" and rule="refs/%.*s", then
we'll scan refname and rule to skip over their "refs/" prefix, and
the next iteration, where post-increment moved the pointers to point
at 'h' (at the beginning of "heads/frotz") on the refname side and
'%' on the rule side, we iterate once more, notice *rule is '%', and
break out of the loop.  We have refname="heads/frotz" and rule="%.*s"

If we have refname="refsXheads/frotz" and rule="refs/%.*s", after
skipping over "refs", refname points at 'X' while rule points at '/'
and the loop needs to break.  Both pointers are post-incremented,
and now we have refname="heads/frotz" and rule="%.*s".

Am I reading the loop correctly?  I wanted the bogus refname not to
match the rule, but without peeking back refname[-1], I cannot tell
the two cases apart at this point.

> +	/*
> +	 * Check that we matched all the way to the "%" placeholder,
> +	 * and skip past it within the rule string. If so, "refname" at
> +	 * this point is the beginning of a potential match.
> +	 */
> +	if (!skip_prefix(rule, "%.*s", &rule))
> +		return NULL;

And we now have rule pointing at "" (i.e. "refs/%.*s" has been fully
consumed).  refname points at "heads/frotz".

> +	/*
> +	 * And now check that our suffix (if any) matches.
> +	 */
> +	if (!strip_suffix(refname, rule, len))
> +		return NULL;
> +
> +	return refname; /* len set by strip_suffix() */
> +}

And the suffix "" is stripped and we yield "heads/frotz".

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/3] shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant
  2023-02-14 21:34                             ` Junio C Hamano
@ 2023-02-14 22:23                               ` Jeff King
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-02-14 22:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Eric Sunshine, 孟子易, git

On Tue, Feb 14, 2023 at 01:34:10PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > The current count-up isn't a big deal, as we need to preprocess that
> > array anyway. But it will become more cumbersome as we refactor the
> > shortening code. So let's get rid of it and just use the constant
> > everywhere.
> 
> OK.  As the array is constant, we could lose its NULL-termination
> and -1 from the definition of NUM_REV_PARSE_RULES, but that has iffy
> upside, and can come on top of the series if we really wanted to.

Yeah. We'd have to update the other callers. Not hard, but I agree it
doesn't buy us much.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-14 21:48                             ` Junio C Hamano
@ 2023-02-14 22:25                               ` Junio C Hamano
  2023-02-14 22:30                               ` Jeff King
  2023-02-14 23:20                               ` Eric Sunshine
  2 siblings, 0 replies; 46+ messages in thread
From: Junio C Hamano @ 2023-02-14 22:25 UTC (permalink / raw)
  To: Jeff King; +Cc: Eric Sunshine, 孟子易, git

Junio C Hamano <gitster@pobox.com> writes:

> Am I reading the loop correctly?  I wanted the bogus refname not to
> match the rule, but without peeking back refname[-1], I cannot tell
> the two cases apart at this point.


Heh.

    $ git symbolic-ref BOGO refsXheads/naster
    $ git symbolic-ref --short BOGO
    heads/naster


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-14 21:48                             ` Junio C Hamano
  2023-02-14 22:25                               ` Junio C Hamano
@ 2023-02-14 22:30                               ` Jeff King
  2023-02-14 22:34                                 ` Junio C Hamano
  2023-02-14 23:20                               ` Eric Sunshine
  2 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-02-14 22:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Eric Sunshine, 孟子易, git

On Tue, Feb 14, 2023 at 01:48:57PM -0800, Junio C Hamano wrote:

> > +	do {
> > +		if (!*rule)
> > +			BUG("rev-parse rule did not have percent");
> > +		if (*rule == '%')
> > +			break;
> > +	} while (*refname++ == *rule++);
> 
> So, if we have refname="refs/heads/frotz" and rule="refs/%.*s", then
> we'll scan refname and rule to skip over their "refs/" prefix, and
> the next iteration, where post-increment moved the pointers to point
> at 'h' (at the beginning of "heads/frotz") on the refname side and
> '%' on the rule side, we iterate once more, notice *rule is '%', and
> break out of the loop.  We have refname="heads/frotz" and rule="%.*s"
> 
> If we have refname="refsXheads/frotz" and rule="refs/%.*s", after
> skipping over "refs", refname points at 'X' while rule points at '/'
> and the loop needs to break.  Both pointers are post-incremented,
> and now we have refname="heads/frotz" and rule="%.*s".

Thanks for being careful. I had originally detected a match by setting a
flag in the loop when we see the "%", but then thought it wasn't needed.
And it's not for the matching case, but it is for the non-match.

This would fix it:

diff --git a/refs.c b/refs.c
index d8ce7e9ee1..2c26cf02d3 100644
--- a/refs.c
+++ b/refs.c
@@ -1318,6 +1318,8 @@ int update_ref(const char *msg, const char *refname,
 static const char *match_parse_rule(const char *refname, const char *rule,
 				    size_t *len)
 {
+	int matched = 0;
+
 	/*
 	 * Check that rule matches refname up to the first percent
 	 * in the rule. This is basically skip_prefix(), but
@@ -1326,10 +1328,15 @@ static const char *match_parse_rule(const char *refname, const char *rule,
 	do {
 		if (!*rule)
 			BUG("rev-parse rule did not have percent");
-		if (*rule == '%')
+		if (*rule == '%') {
+			matched = 1;
 			break;
+		}
 	} while (*refname++ == *rule++);
 
+	if (!matched)
+		return 0;
+
 	/*
 	 * Check that we matched all the way to the "%" placeholder,
 	 * and skip past it within the rule string. If so, "refname" at

but I have a feeling that it gets more readable if we flip the break
conditional and the loop condition.

I had also imagined this as a skip_prefix_to_percent() helper, which
makes the logic nicer, but we actually need to advance in both the
refname and the prefix, which makes for a weird interface.

-Peff

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-14 22:30                               ` Jeff King
@ 2023-02-14 22:34                                 ` Junio C Hamano
  2023-02-14 22:40                                   ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2023-02-14 22:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Eric Sunshine, 孟子易, git

Jeff King <peff@peff.net> writes:

> but I have a feeling that it gets more readable if we flip the break
> conditional and the loop condition.

Yeah, the somewhoat unusual loop structure was what motivated me to
look at its corner case.  Flipping the logic around may make it more
straight forward.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-14 22:34                                 ` Junio C Hamano
@ 2023-02-14 22:40                                   ` Jeff King
  2023-02-15  5:10                                     ` Junio C Hamano
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-02-14 22:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Eric Sunshine, 孟子易, git

On Tue, Feb 14, 2023 at 02:34:01PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > but I have a feeling that it gets more readable if we flip the break
> > conditional and the loop condition.
> 
> Yeah, the somewhoat unusual loop structure was what motivated me to
> look at its corner case.  Flipping the logic around may make it more
> straight forward.

It does indeed. I pulled the logic from skip_prefix(), thinking that by
relying on it I would avoid making a stupid mistake. Oh well. :)

Doing it like this is much more readable:

diff --git a/refs.c b/refs.c
index d8ce7e9ee1..725adafcd8 100644
--- a/refs.c
+++ b/refs.c
@@ -1323,12 +1323,12 @@ static const char *match_parse_rule(const char *refname, const char *rule,
 	 * in the rule. This is basically skip_prefix(), but
 	 * ending at percent in the prefix, rather than end-of-string.
 	 */
-	do {
+	while (*rule != '%') {
 		if (!*rule)
 			BUG("rev-parse rule did not have percent");
-		if (*rule == '%')
-			break;
-	} while (*refname++ == *rule++);
+		if (*refname++ != *rule++)
+			return 0;
+	}
 
 	/*
 	 * Check that we matched all the way to the "%" placeholder,

I'll hold on to that (plus an adjustment to the comment below to match,
and perhaps a test for this negative-match case) for a day or so to give
anybody else a chance to comment, and then send out a v2 tomorrow.

-Peff

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-14 21:48                             ` Junio C Hamano
  2023-02-14 22:25                               ` Junio C Hamano
  2023-02-14 22:30                               ` Jeff King
@ 2023-02-14 23:20                               ` Eric Sunshine
  2 siblings, 0 replies; 46+ messages in thread
From: Eric Sunshine @ 2023-02-14 23:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, 孟子易, git

On Tue, Feb 14, 2023 at 4:48 PM Junio C Hamano <gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
> > +     do {
> > +             if (!*rule)
> > +                     BUG("rev-parse rule did not have percent");
> > +             if (*rule == '%')
> > +                     break;
> > +     } while (*refname++ == *rule++);
>
> So, if we have refname="refs/heads/frotz" and rule="refs/%.*s", then
> we'll scan refname and rule to skip over their "refs/" prefix, and
> the next iteration, where post-increment moved the pointers to point
> at 'h' (at the beginning of "heads/frotz") on the refname side and
> '%' on the rule side, we iterate once more, notice *rule is '%', and
> break out of the loop.  We have refname="heads/frotz" and rule="%.*s"
>
> If we have refname="refsXheads/frotz" and rule="refs/%.*s", after
> skipping over "refs", refname points at 'X' while rule points at '/'
> and the loop needs to break.  Both pointers are post-incremented,
> and now we have refname="heads/frotz" and rule="%.*s".

Nice catch. I had sat staring at and worrying about the combined
comparison and post-increment, trying to come up with a failing edge
case but missed this one entirely.

This logic error missed by two people suggests that the patch probably
ought to be accompanied by some new tests.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-14 22:40                                   ` Jeff King
@ 2023-02-15  5:10                                     ` Junio C Hamano
  2023-02-15 14:30                                       ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2023-02-15  5:10 UTC (permalink / raw)
  To: Jeff King; +Cc: Eric Sunshine, 孟子易, git

Jeff King <peff@peff.net> writes:

> It does indeed. I pulled the logic from skip_prefix(), thinking that by
> relying on it I would avoid making a stupid mistake. Oh well. :)
>
> Doing it like this is much more readable:
> ...
> I'll hold on to that (plus an adjustment to the comment below to match,
> and perhaps a test for this negative-match case) for a day or so to give
> anybody else a chance to comment, and then send out a v2 tomorrow.

Thanks, and surely that is very readable.

Alternatively, I think you can just compare refname and rule until
they diverge, without doing any special casing for per-cent on the
rule side inside the loop.

If you do not find any difference, or the byte that differ is not
the per-cent at the beginning of "%.*s" on the rule side, they they
do not match.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-15  5:10                                     ` Junio C Hamano
@ 2023-02-15 14:30                                       ` Jeff King
  2023-02-15 16:41                                         ` Junio C Hamano
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-02-15 14:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Eric Sunshine, 孟子易, git

On Tue, Feb 14, 2023 at 09:10:04PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > It does indeed. I pulled the logic from skip_prefix(), thinking that by
> > relying on it I would avoid making a stupid mistake. Oh well. :)
> >
> > Doing it like this is much more readable:
> > ...
> > I'll hold on to that (plus an adjustment to the comment below to match,
> > and perhaps a test for this negative-match case) for a day or so to give
> > anybody else a chance to comment, and then send out a v2 tomorrow.
> 
> Thanks, and surely that is very readable.
> 
> Alternatively, I think you can just compare refname and rule until
> they diverge, without doing any special casing for per-cent on the
> rule side inside the loop.
> 
> If you do not find any difference, or the byte that differ is not
> the per-cent at the beginning of "%.*s" on the rule side, they they
> do not match.

I had a similar thought, but I think it is fooled by "refs/heads/%foo".
The correct shortening there is "%foo".  But we'd parse the
"refs/heads/%.*s" rule up to the ".", and then complain that they do not
match.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v2 0/3] get rid of sscanf() when shortening refs
  2023-02-14 18:38                         ` [PATCH 0/3] get rid of sscanf() when shortening refs Jeff King
                                             ` (2 preceding siblings ...)
  2023-02-14 18:41                           ` [PATCH 3/3] shorten_unambiguous_ref(): avoid sscanf() Jeff King
@ 2023-02-15 15:16                           ` Jeff King
  2023-02-15 15:16                             ` [PATCH v2 1/3] shorten_unambiguous_ref(): avoid integer truncation Jeff King
                                               ` (3 more replies)
  3 siblings, 4 replies; 46+ messages in thread
From: Jeff King @ 2023-02-15 15:16 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Junio C Hamano, 孟子易, git

On Tue, Feb 14, 2023 at 01:38:13PM -0500, Jeff King wrote:

> OK, here it is. I split it into a few patches to hopefully make it a bit
> easier to follow.
> 
>   [1/3]: shorten_unambiguous_ref(): avoid integer truncation
>   [2/3]: shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant
>   [3/3]: shorten_unambiguous_ref(): avoid sscanf()

And here's a v2 fixing the "refs/headsXfoo" problem from v1. I also
added a few tests to cover that case along with some others, including
the original problem that spawned this thread.

I hope setting LC_ALL is OK in the test suite (i.e., at worst it just
becomes a noop on platforms that don't have that locale).  We can drop
it, but then the test itself becomes kind of meaningless, as I do not
think it would even fail on macOS. I would be curious to see if the test
as written does fail with the current code (I lack access to a system to
test).

In writing these, I noticed that this patch is actually fixing another
bug in the original. ;) The "%s" match in scanf is greedy, so we'd never
turn "refs/remotes/foo/HEAD" into "foo" (instead we say "foo/HEAD"). I
doubt anybody cares very much, and you might even argue that "foo/HEAD",
while not the shortest possible answer, is preferable because it's more
descriptive. But I think "foo" matches the intent of the code, and
certainly it's an unambiguous shortening.

Anyway, here's the patches and the range diff.

  [1/3]: shorten_unambiguous_ref(): avoid integer truncation
  [2/3]: shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant
  [3/3]: shorten_unambiguous_ref(): avoid sscanf()

 refs.c                  | 93 +++++++++++++++++++++--------------------
 t/t1401-symbolic-ref.sh | 34 +++++++++++++++
 2 files changed, 82 insertions(+), 45 deletions(-)

1:  56762dc84b = 1:  f84edd1791 shorten_unambiguous_ref(): avoid integer truncation
2:  83326a396a = 2:  9287afb50f shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant
3:  754ea4eb40 ! 3:  7cc6a8b89d shorten_unambiguous_ref(): avoid sscanf()
    @@ Commit message
         that against the refname, pulling the "%s" content into a separate
         buffer.
     
    -    This has two downsides:
    +    This has a few downsides:
     
           - sscanf("%s") reportedly misbehaves on macOS with some input and
             locale combinations, returning a partial or garbled string. See
             this thread:
     
               https://lore.kernel.org/git/CAGF3oAcCi+fG12j-1U0hcrWwkF5K_9WhOi6ZPHBzUUzfkrZDxA@mail.gmail.com/
     
    +      - scanf's matching of "%s" is greedy. So the "refs/remotes/%s/HEAD"
    +        rule would never pull "origin" out of "refs/remotes/origin/HEAD".
    +        Instead it always produced "origin/HEAD", which is redundant with
    +        the "refs/remotes/%s" rule.
    +
           - scanf in general is an error-prone interface. For example, scanning
             for "%s" will copy bytes into a destination string, which must have
             been correctly sized ahead of time to avoid a buffer overflow. In
    @@ Commit message
         to the lookup rules with "%.*s"), and then copy it only when returning
         to the caller.
     
    +    There are a few new tests here, all using symbolic-ref (the code can be
    +    triggered in many ways, but symrefs are convenient in that we don't need
    +    to create a real ref, which avoids any complications from the filesystem
    +    munging the name):
    +
    +      - the first covers the real-world case which misbehaved on macOS.
    +        Setting LC_ALL is required to trigger the problem there (since
    +        otherwise our tests use LC_ALL=C), and hopefully is at worst simply
    +        ignored on other systems (and doesn't cause libc to complain, etc,
    +        on systems without that locale).
    +
    +      - the second covers the "origin/HEAD" case as discussed above, which
    +        is now fixed
    +
    +      - the remainder are for "weird" cases that work both before and after
    +        this patch, but would be easy to get wrong with off-by-one problems
    +        in the parsing (and came out of discussions and earlier iterations
    +        of the patch that did get them wrong).
    +
    +      - absent here are tests of boring, expected-to-work cases like
    +        "refs/heads/foo", etc. Those are covered all over the test suite
    +        both explicitly (for-each-ref's refname:short) and implicitly (in
    +        the output of git-status, etc).
    +
         Reported-by: 孟子易 <mengziyi540841@gmail.com>
         Helped-by: Eric Sunshine <sunshine@sunshineco.com>
         Signed-off-by: Jeff King <peff@peff.net>
    @@ refs.c: int update_ref(const char *msg, const char *refname,
     -		size_t total_len = 0;
     -		size_t offset = 0;
     +	/*
    -+	 * Check that rule matches refname up to the first percent
    -+	 * in the rule. This is basically skip_prefix(), but
    -+	 * ending at percent in the prefix, rather than end-of-string.
    ++	 * Check that rule matches refname up to the first percent in the rule.
    ++	 * We can bail immediately if not, but otherwise we leave "rule" at the
    ++	 * %-placeholder, and "refname" at the start of the potential matched
    ++	 * name.
     +	 */
    -+	do {
    ++	while (*rule != '%') {
     +		if (!*rule)
     +			BUG("rev-parse rule did not have percent");
    -+		if (*rule == '%')
    -+			break;
    -+	} while (*refname++ == *rule++);
    ++		if (*refname++ != *rule++)
    ++			return NULL;
    ++	}
      
     -		for (i = 0; i < NUM_REV_PARSE_RULES; i++)
     -			/* -2 for strlen("%.*s") - strlen("%s"); +1 for NUL */
     -			total_len += strlen(ref_rev_parse_rules[i]) - 2 + 1;
     +	/*
    -+	 * Check that we matched all the way to the "%" placeholder,
    -+	 * and skip past it within the rule string. If so, "refname" at
    -+	 * this point is the beginning of a potential match.
    ++	 * Check that our "%" is the expected placeholder. This assumes there
    ++	 * are no other percents (placeholder or quoted) in the string, but
    ++	 * that is sufficient for our rev-parse rules.
     +	 */
     +	if (!skip_prefix(rule, "%.*s", &rule))
     +		return NULL;
    @@ refs.c: char *refs_shorten_unambiguous_ref(struct ref_store *refs,
      	return xstrdup(refname);
      }
      
    +
    + ## t/t1401-symbolic-ref.sh ##
    +@@ t/t1401-symbolic-ref.sh: test_expect_success 'symbolic-ref pointing at another' '
    + 	test_cmp expect actual
    + '
    + 
    ++test_expect_success 'symbolic-ref --short handles complex utf8 case' '
    ++	name="测试-加-增加-加-增加" &&
    ++	git symbolic-ref TEST_SYMREF "refs/heads/$name" &&
    ++	# In the real world, we saw problems with this case only
    ++	# when the locale includes UTF-8. Set it here to try to make things as
    ++	# hard as possible for us to pass, but in practice we should do the
    ++	# right thing regardless (and of course some platforms may not even
    ++	# have this locale).
    ++	LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual &&
    ++	echo "$name" >expect &&
    ++	test_cmp expect actual
    ++'
    ++
    ++test_expect_success 'symbolic-ref --short handles name with suffix' '
    ++	git symbolic-ref TEST_SYMREF "refs/remotes/origin/HEAD" &&
    ++	git symbolic-ref --short TEST_SYMREF >actual &&
    ++	echo "origin" >expect &&
    ++	test_cmp expect actual
    ++'
    ++
    ++test_expect_success 'symbolic-ref --short handles almost-matching name' '
    ++	git symbolic-ref TEST_SYMREF "refs/headsXfoo" &&
    ++	git symbolic-ref --short TEST_SYMREF >actual &&
    ++	echo "headsXfoo" >expect &&
    ++	test_cmp expect actual
    ++'
    ++
    ++test_expect_success 'symbolic-ref --short handles name with percent' '
    ++	git symbolic-ref TEST_SYMREF "refs/heads/%foo" &&
    ++	git symbolic-ref --short TEST_SYMREF >actual &&
    ++	echo "%foo" >expect &&
    ++	test_cmp expect actual
    ++'
    ++
    + test_done

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v2 1/3] shorten_unambiguous_ref(): avoid integer truncation
  2023-02-15 15:16                           ` [PATCH v2 0/3] get rid of sscanf() when shortening refs Jeff King
@ 2023-02-15 15:16                             ` Jeff King
  2023-02-15 15:16                             ` [PATCH v2 2/3] shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant Jeff King
                                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-02-15 15:16 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Junio C Hamano, 孟子易, git

We parse the shortened name "foo" out of the full refname
"refs/heads/foo", and then assign the result of strlen(short_name) to an
int, which may truncate or wrap to negative.

In practice, this should never happen, as it requires a 2GB refname. And
even somebody trying to do something malicious should at worst end up
with a confused answer (we use the size only to feed back as a
placeholder length to strbuf_addf() to see if there are any collisions
in the lookup rules).

And it may even be impossible to trigger this, as we parse the string
with sscanf(), and stdio formatting functions are not known for handling
large strings well. I didn't test, but I wouldn't be surprised if
sscanf() on many platforms simply reports no match here.

But even if it is not a problem in practice so far, it is worth fixing
for two reasons:

  1. We'll shortly be replacing the sscanf() call with a real parser
     which will handle arbitrary-sized strings.

  2. Assigning strlen() to an int is an anti-pattern that requires
     people to look twice when auditing for real overflow problems.

So we'll make this a size_t. Unfortunately we still have to cast to int
eventually for the strbuf_addf() call, but at least we can localize the
cast there, and check that it will be valid. I used our new cast helper
here, which will just bail completely. That should be OK, as anybody
with a 2GB refname is up to no good, but if we really wanted to, we
could detect it manually and just refuse to shorten the refname.

Signed-off-by: Jeff King <peff@peff.net>
---
 refs.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/refs.c b/refs.c
index e31dbcda59..94d938390d 100644
--- a/refs.c
+++ b/refs.c
@@ -1356,7 +1356,7 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 	for (i = nr_rules - 1; i > 0 ; --i) {
 		int j;
 		int rules_to_fail = i;
-		int short_name_len;
+		size_t short_name_len;

 		if (1 != sscanf(refname, scanf_fmts[i], short_name))
 			continue;
@@ -1388,7 +1388,8 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 			 */
 			strbuf_reset(&resolved_buf);
 			strbuf_addf(&resolved_buf, rule,
-				    short_name_len, short_name);
+				    cast_size_t_to_int(short_name_len),
+				    short_name);
 			if (refs_ref_exists(refs, resolved_buf.buf))
 				break;
 		}
-- 
2.39.2.881.gb6410a20aa

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 2/3] shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant
  2023-02-15 15:16                           ` [PATCH v2 0/3] get rid of sscanf() when shortening refs Jeff King
  2023-02-15 15:16                             ` [PATCH v2 1/3] shorten_unambiguous_ref(): avoid integer truncation Jeff King
@ 2023-02-15 15:16                             ` Jeff King
  2023-02-15 15:16                             ` [PATCH v2 3/3] shorten_unambiguous_ref(): avoid sscanf() Jeff King
  2023-02-15 18:00                             ` [PATCH v2 0/3] get rid of sscanf() when shortening refs Junio C Hamano
  3 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2023-02-15 15:16 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Junio C Hamano, 孟子易, git

The ref_rev_parse_rules[] array is terminated with a NULL entry, and we
count it and store the result in the local nr_rules variable. But we
don't need to do so; since the array is a constant, we can compute its
size directly. The original code probably didn't do that because it was
written as part of for-each-ref, and saw the array only as a pointer. It
was migrated in 7c2b3029df (make get_short_ref a public function,
2009-04-07) and could have been updated then, but that subtlety was not
noticed.

We even have a constant that represents this value already, courtesy of
60650a48c0 (remote: make refspec follow the same disambiguation rule as
local refs, 2018-08-01), though again, nobody noticed at the time that
it could be used here, too.

The current count-up isn't a big deal, as we need to preprocess that
array anyway. But it will become more cumbersome as we refactor the
shortening code. So let's get rid of it and just use the constant
everywhere.

Note that there are two things here that aren't just simple text
replacements:

  1. We also use nr_rules to see if a previous call has initialized the
     static pre-processing variables. We can just use the scanf_fmts
     pointer to do the same thing, as it is non-NULL only after we've
     done that initialization.

  2. If nr_rules is zero after we've counted it up, we bail from the
     function. This code is unreachable, though, as the set of rules is
     hard-coded and non-empty. And that becomes even more apparent now
     that we are using the constant. So we can drop this conditional
     completely (and ironically, the code would have the same output if
     it _did_ trigger, as we'd simply skip the loop entirely and return
     the whole refname).

Signed-off-by: Jeff King <peff@peff.net>
---
 refs.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/refs.c b/refs.c
index 94d938390d..84f344d8af 100644
--- a/refs.c
+++ b/refs.c
@@ -1315,11 +1315,10 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 {
 	int i;
 	static char **scanf_fmts;
-	static int nr_rules;
 	char *short_name;
 	struct strbuf resolved_buf = STRBUF_INIT;
 
-	if (!nr_rules) {
+	if (!scanf_fmts) {
 		/*
 		 * Pre-generate scanf formats from ref_rev_parse_rules[].
 		 * Generate a format suitable for scanf from a
@@ -1329,31 +1328,26 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 		size_t total_len = 0;
 		size_t offset = 0;
 
-		/* the rule list is NULL terminated, count them first */
-		for (nr_rules = 0; ref_rev_parse_rules[nr_rules]; nr_rules++)
+		for (i = 0; i < NUM_REV_PARSE_RULES; i++)
 			/* -2 for strlen("%.*s") - strlen("%s"); +1 for NUL */
-			total_len += strlen(ref_rev_parse_rules[nr_rules]) - 2 + 1;
+			total_len += strlen(ref_rev_parse_rules[i]) - 2 + 1;
 
-		scanf_fmts = xmalloc(st_add(st_mult(sizeof(char *), nr_rules), total_len));
+		scanf_fmts = xmalloc(st_add(st_mult(sizeof(char *), NUM_REV_PARSE_RULES), total_len));
 
 		offset = 0;
-		for (i = 0; i < nr_rules; i++) {
+		for (i = 0; i < NUM_REV_PARSE_RULES; i++) {
 			assert(offset < total_len);
-			scanf_fmts[i] = (char *)&scanf_fmts[nr_rules] + offset;
+			scanf_fmts[i] = (char *)&scanf_fmts[NUM_REV_PARSE_RULES] + offset;
 			offset += xsnprintf(scanf_fmts[i], total_len - offset,
 					    ref_rev_parse_rules[i], 2, "%s") + 1;
 		}
 	}
 
-	/* bail out if there are no rules */
-	if (!nr_rules)
-		return xstrdup(refname);
-
 	/* buffer for scanf result, at most refname must fit */
 	short_name = xstrdup(refname);
 
 	/* skip first rule, it will always match */
-	for (i = nr_rules - 1; i > 0 ; --i) {
+	for (i = NUM_REV_PARSE_RULES - 1; i > 0 ; --i) {
 		int j;
 		int rules_to_fail = i;
 		size_t short_name_len;
@@ -1368,7 +1362,7 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 		 * must fail to resolve to a valid non-ambiguous ref
 		 */
 		if (strict)
-			rules_to_fail = nr_rules;
+			rules_to_fail = NUM_REV_PARSE_RULES;
 
 		/*
 		 * check if the short name resolves to a valid ref,
-- 
2.39.2.881.gb6410a20aa


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-15 15:16                           ` [PATCH v2 0/3] get rid of sscanf() when shortening refs Jeff King
  2023-02-15 15:16                             ` [PATCH v2 1/3] shorten_unambiguous_ref(): avoid integer truncation Jeff King
  2023-02-15 15:16                             ` [PATCH v2 2/3] shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant Jeff King
@ 2023-02-15 15:16                             ` Jeff King
  2023-02-16  5:56                               ` Torsten Bögershausen
  2023-02-15 18:00                             ` [PATCH v2 0/3] get rid of sscanf() when shortening refs Junio C Hamano
  3 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-02-15 15:16 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Junio C Hamano, 孟子易, git

To shorten a fully qualified ref (e.g., taking "refs/heads/foo" to just
"foo"), we munge the usual lookup rules ("refs/heads/%.*s", etc) to drop
the ".*" modifier (so "refs/heads/%s"), and then use sscanf() to match
that against the refname, pulling the "%s" content into a separate
buffer.

This has a few downsides:

  - sscanf("%s") reportedly misbehaves on macOS with some input and
    locale combinations, returning a partial or garbled string. See
    this thread:

      https://lore.kernel.org/git/CAGF3oAcCi+fG12j-1U0hcrWwkF5K_9WhOi6ZPHBzUUzfkrZDxA@mail.gmail.com/

  - scanf's matching of "%s" is greedy. So the "refs/remotes/%s/HEAD"
    rule would never pull "origin" out of "refs/remotes/origin/HEAD".
    Instead it always produced "origin/HEAD", which is redundant with
    the "refs/remotes/%s" rule.

  - scanf in general is an error-prone interface. For example, scanning
    for "%s" will copy bytes into a destination string, which must have
    been correctly sized ahead of time to avoid a buffer overflow. In
    this case, the code is OK (the buffer is pessimistically sized to
    match the original string, which should give us a maximum). But in
    general, we do not want to encourage people to use scanf at all.

So instead, let's note that our lookup rules are not arbitrary format
strings, but all contain exactly one "%.*s" placeholder. We already rely
on this, both for lookup (we feed the lookup format along with exactly
one int/ptr combo to snprintf, etc) and for shortening (we munge "%.*s"
to "%s", and then insist that sscanf() finds exactly one result).

We can parse this manually by just matching the bytes that occur before
and after the "%.*s" placeholder. While we have a few extra lines of
parsing code, the result is arguably simpler, as can skip the
preprocessing step and its tricky memory management entirely.

The in-code comments should explain the parsing strategy, but there's
one subtle change here. The original code allocated a single buffer, and
then overwrote it in each loop iteration, since that's the only option
sscanf() gives us. But our parser can actually return a ptr/len combo
for the matched string, which is all we need (since we just feed it back
to the lookup rules with "%.*s"), and then copy it only when returning
to the caller.

There are a few new tests here, all using symbolic-ref (the code can be
triggered in many ways, but symrefs are convenient in that we don't need
to create a real ref, which avoids any complications from the filesystem
munging the name):

  - the first covers the real-world case which misbehaved on macOS.
    Setting LC_ALL is required to trigger the problem there (since
    otherwise our tests use LC_ALL=C), and hopefully is at worst simply
    ignored on other systems (and doesn't cause libc to complain, etc,
    on systems without that locale).

  - the second covers the "origin/HEAD" case as discussed above, which
    is now fixed

  - the remainder are for "weird" cases that work both before and after
    this patch, but would be easy to get wrong with off-by-one problems
    in the parsing (and came out of discussions and earlier iterations
    of the patch that did get them wrong).

  - absent here are tests of boring, expected-to-work cases like
    "refs/heads/foo", etc. Those are covered all over the test suite
    both explicitly (for-each-ref's refname:short) and implicitly (in
    the output of git-status, etc).

Reported-by: 孟子易 <mengziyi540841@gmail.com>
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 refs.c                  | 78 +++++++++++++++++++++++------------------
 t/t1401-symbolic-ref.sh | 34 ++++++++++++++++++
 2 files changed, 77 insertions(+), 35 deletions(-)

diff --git a/refs.c b/refs.c
index 84f344d8af..aeae31c972 100644
--- a/refs.c
+++ b/refs.c
@@ -1310,53 +1310,62 @@ int update_ref(const char *msg, const char *refname,
 			       old_oid, flags, onerr);
 }
 
-char *refs_shorten_unambiguous_ref(struct ref_store *refs,
-				   const char *refname, int strict)
+/*
+ * Check that the string refname matches a rule of the form
+ * "{prefix}%.*s{suffix}". So "foo/bar/baz" would match the rule
+ * "foo/%.*s/baz", and return the string "bar".
+ */
+static const char *match_parse_rule(const char *refname, const char *rule,
+				    size_t *len)
 {
-	int i;
-	static char **scanf_fmts;
-	char *short_name;
-	struct strbuf resolved_buf = STRBUF_INIT;
-
-	if (!scanf_fmts) {
-		/*
-		 * Pre-generate scanf formats from ref_rev_parse_rules[].
-		 * Generate a format suitable for scanf from a
-		 * ref_rev_parse_rules rule by interpolating "%s" at the
-		 * location of the "%.*s".
-		 */
-		size_t total_len = 0;
-		size_t offset = 0;
+	/*
+	 * Check that rule matches refname up to the first percent in the rule.
+	 * We can bail immediately if not, but otherwise we leave "rule" at the
+	 * %-placeholder, and "refname" at the start of the potential matched
+	 * name.
+	 */
+	while (*rule != '%') {
+		if (!*rule)
+			BUG("rev-parse rule did not have percent");
+		if (*refname++ != *rule++)
+			return NULL;
+	}
 
-		for (i = 0; i < NUM_REV_PARSE_RULES; i++)
-			/* -2 for strlen("%.*s") - strlen("%s"); +1 for NUL */
-			total_len += strlen(ref_rev_parse_rules[i]) - 2 + 1;
+	/*
+	 * Check that our "%" is the expected placeholder. This assumes there
+	 * are no other percents (placeholder or quoted) in the string, but
+	 * that is sufficient for our rev-parse rules.
+	 */
+	if (!skip_prefix(rule, "%.*s", &rule))
+		return NULL;
 
-		scanf_fmts = xmalloc(st_add(st_mult(sizeof(char *), NUM_REV_PARSE_RULES), total_len));
+	/*
+	 * And now check that our suffix (if any) matches.
+	 */
+	if (!strip_suffix(refname, rule, len))
+		return NULL;
 
-		offset = 0;
-		for (i = 0; i < NUM_REV_PARSE_RULES; i++) {
-			assert(offset < total_len);
-			scanf_fmts[i] = (char *)&scanf_fmts[NUM_REV_PARSE_RULES] + offset;
-			offset += xsnprintf(scanf_fmts[i], total_len - offset,
-					    ref_rev_parse_rules[i], 2, "%s") + 1;
-		}
-	}
+	return refname; /* len set by strip_suffix() */
+}
 
-	/* buffer for scanf result, at most refname must fit */
-	short_name = xstrdup(refname);
+char *refs_shorten_unambiguous_ref(struct ref_store *refs,
+				   const char *refname, int strict)
+{
+	int i;
+	struct strbuf resolved_buf = STRBUF_INIT;
 
 	/* skip first rule, it will always match */
 	for (i = NUM_REV_PARSE_RULES - 1; i > 0 ; --i) {
 		int j;
 		int rules_to_fail = i;
+		const char *short_name;
 		size_t short_name_len;
 
-		if (1 != sscanf(refname, scanf_fmts[i], short_name))
+		short_name = match_parse_rule(refname, ref_rev_parse_rules[i],
+					      &short_name_len);
+		if (!short_name)
 			continue;
 
-		short_name_len = strlen(short_name);
-
 		/*
 		 * in strict mode, all (except the matched one) rules
 		 * must fail to resolve to a valid non-ambiguous ref
@@ -1394,12 +1403,11 @@ char *refs_shorten_unambiguous_ref(struct ref_store *refs,
 		 */
 		if (j == rules_to_fail) {
 			strbuf_release(&resolved_buf);
-			return short_name;
+			return xmemdupz(short_name, short_name_len);
 		}
 	}
 
 	strbuf_release(&resolved_buf);
-	free(short_name);
 	return xstrdup(refname);
 }
 
diff --git a/t/t1401-symbolic-ref.sh b/t/t1401-symbolic-ref.sh
index d708acdb81..be23be30c7 100755
--- a/t/t1401-symbolic-ref.sh
+++ b/t/t1401-symbolic-ref.sh
@@ -189,4 +189,38 @@ test_expect_success 'symbolic-ref pointing at another' '
 	test_cmp expect actual
 '
 
+test_expect_success 'symbolic-ref --short handles complex utf8 case' '
+	name="测试-加-增加-加-增加" &&
+	git symbolic-ref TEST_SYMREF "refs/heads/$name" &&
+	# In the real world, we saw problems with this case only
+	# when the locale includes UTF-8. Set it here to try to make things as
+	# hard as possible for us to pass, but in practice we should do the
+	# right thing regardless (and of course some platforms may not even
+	# have this locale).
+	LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual &&
+	echo "$name" >expect &&
+	test_cmp expect actual
+'
+
+test_expect_success 'symbolic-ref --short handles name with suffix' '
+	git symbolic-ref TEST_SYMREF "refs/remotes/origin/HEAD" &&
+	git symbolic-ref --short TEST_SYMREF >actual &&
+	echo "origin" >expect &&
+	test_cmp expect actual
+'
+
+test_expect_success 'symbolic-ref --short handles almost-matching name' '
+	git symbolic-ref TEST_SYMREF "refs/headsXfoo" &&
+	git symbolic-ref --short TEST_SYMREF >actual &&
+	echo "headsXfoo" >expect &&
+	test_cmp expect actual
+'
+
+test_expect_success 'symbolic-ref --short handles name with percent' '
+	git symbolic-ref TEST_SYMREF "refs/heads/%foo" &&
+	git symbolic-ref --short TEST_SYMREF >actual &&
+	echo "%foo" >expect &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.39.2.881.gb6410a20aa

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-13 20:18 ` Jeff King
  2023-02-13 22:58   ` Eric Sunshine
@ 2023-02-15 16:26   ` Torsten Bögershausen
  2023-02-15 16:37     ` Eric Sunshine
  1 sibling, 1 reply; 46+ messages in thread
From: Torsten Bögershausen @ 2023-02-15 16:26 UTC (permalink / raw)
  To: Jeff King; +Cc: 孟子易, git

On Mon, Feb 13, 2023 at 03:18:28PM -0500, Jeff King wrote:
> On Mon, Feb 13, 2023 at 02:38:08PM +0800, 孟子易 wrote:
>
> > System: Mac Os (Ventura 13.2)
> > Language: Chinese simplified
> > Preconditions:
> > # git checkout -b 测试-加-增加-加-增加
> > # git symbolic-ref --short HEAD
> > Wrong Echo (Current Echo):
> > 测试-�
> > Correct Echo:
> > // I Don't know, may be "测试-加" ?
>
> Hmm, I can't reproduce here on Linux:
>
>   $ git init
>   $ git commit --allow-empty -m foo
>   $ git checkout -b 测试-加-增加-加-增加
>   $ git symbolic-ref --short HEAD
>   测试-加-增加-加-增加

Neither can I - MacOs pre-Ventura ;-)

>
> I wonder if it is related to using macOS. The refs are stored as
> individual files in the filesystem, and HFS+ will do some unicode
> normalization. So I get:
>
>   $ ls .git/refs/heads/ | xxd
>   00000000: 6d61 696e 0ae6 b58b e8af 952d e58a a02d  main.......-...-
>   00000010: e5a2 9ee5 8aa0 2de5 8aa0 2de5 a29e e58a  ......-...-.....
>   00000020: a00a
>
> Are your on-disk bytes different?

In my case there are the same.
Trying to convert from UTF-8 into UTF-8-MAC didn't change anything here.
Side note:
MacOs Ventura is probably not using HFS+, but apfs, which doesn't do
the unicode decomposition on file system level.

It would be helpful, to pipe the result into xxd:
git symbolic-ref --short HEAD | xxd
And then see, if there is any garbling inside or outside of Git ?

>
> My instinct was that this might be related to the shortening code
> treating the names as bytes, rather than characters. But looking at
> shorten_unambiguous_ref(), it is really operating at the level of path
> components, and should never split a partial string.
>
> Another possibility: the shortening is done by applying our usual
> ref-resolving rules one by one via scanf(). There's an assumption in the
> code that the resulting string can never be longer than the input:
>
> 	/* buffer for scanf result, at most refname must fit */
> 	short_name = xstrdup(refname);
>
> 	...
>         for (i = nr_rules - 1; i > 0 ; --i) {
> 		...
>                 if (1 != sscanf(refname, scanf_fmts[i], short_name))
>                         continue;
>
> Is it possible that this assumption is violated based on some particular
> combination of unicode normalization and locale? That seems unlikely to
> me, but it wouldn't be the first time I've been surprised by subtle
> unicode implications.
>
> Is it possible for you to run Git in a debugger and check the
> intermediate steps happening in refs_shorten_unambiguous_ref()?
>
> -Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-15 16:26   ` Torsten Bögershausen
@ 2023-02-15 16:37     ` Eric Sunshine
  2023-02-15 17:19       ` Torsten Bögershausen
  0 siblings, 1 reply; 46+ messages in thread
From: Eric Sunshine @ 2023-02-15 16:37 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Jeff King, 孟子易, git

On Wed, Feb 15, 2023 at 11:33 AM Torsten Bögershausen <tboegi@web.de> wrote:
> It would be helpful, to pipe the result into xxd:
> git symbolic-ref --short HEAD | xxd
> And then see, if there is any garbling inside or outside of Git ?

Here's what I get:

  $ git symbolic-ref --short HEAD | xxd
  00000000: e6b5 8be8 af95 2de5 8a0a                 ......-...
  $ LANG=C git symbolic-ref --short HEAD | xxd
  00000000: e6b5 8be8 af95 2de5 8aa0 2de5 a29e e58a  ......-...-.....
  00000010: a02d e58a a02d e5a2 9ee5 8aa0 0a         .-...-.......

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-15 14:30                                       ` Jeff King
@ 2023-02-15 16:41                                         ` Junio C Hamano
  0 siblings, 0 replies; 46+ messages in thread
From: Junio C Hamano @ 2023-02-15 16:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Eric Sunshine, 孟子易, git

Jeff King <peff@peff.net> writes:

> I had a similar thought, but I think it is fooled by "refs/heads/%foo".
> The correct shortening there is "%foo".  But we'd parse the
> "refs/heads/%.*s" rule up to the ".", and then complain that they do not
> match.

Yeah, you're right.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-15 16:37     ` Eric Sunshine
@ 2023-02-15 17:19       ` Torsten Bögershausen
  2023-02-16  6:08         ` Eric Sunshine
  0 siblings, 1 reply; 46+ messages in thread
From: Torsten Bögershausen @ 2023-02-15 17:19 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Jeff King, 孟子易, git

On Wed, Feb 15, 2023 at 11:37:11AM -0500, Eric Sunshine wrote:
> On Wed, Feb 15, 2023 at 11:33 AM Torsten Bögershausen <tboegi@web.de> wrote:
> > It would be helpful, to pipe the result into xxd:
> > git symbolic-ref --short HEAD | xxd
> > And then see, if there is any garbling inside or outside of Git ?
>
> Here's what I get:
>
>   $ git symbolic-ref --short HEAD | xxd
>   00000000: e6b5 8be8 af95 2de5 8a0a                 ......-...
>   $ LANG=C git symbolic-ref --short HEAD | xxd
>   00000000: e6b5 8be8 af95 2de5 8aa0 2de5 a29e e58a  ......-...-.....
>   00000010: a02d e58a a02d e5a2 9ee5 8aa0 0a         .-...-.......

Interesting.
I just saw that there are already fixes going on:
Is the fix from Peff helping here ?

And what do you see on disk ?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 0/3] get rid of sscanf() when shortening refs
  2023-02-15 15:16                           ` [PATCH v2 0/3] get rid of sscanf() when shortening refs Jeff King
                                               ` (2 preceding siblings ...)
  2023-02-15 15:16                             ` [PATCH v2 3/3] shorten_unambiguous_ref(): avoid sscanf() Jeff King
@ 2023-02-15 18:00                             ` Junio C Hamano
  3 siblings, 0 replies; 46+ messages in thread
From: Junio C Hamano @ 2023-02-15 18:00 UTC (permalink / raw)
  To: Jeff King; +Cc: Eric Sunshine, 孟子易, git

Jeff King <peff@peff.net> writes:

>     +      - scanf's matching of "%s" is greedy. So the "refs/remotes/%s/HEAD"
>     +        rule would never pull "origin" out of "refs/remotes/origin/HEAD".
>     +        Instead it always produced "origin/HEAD", which is redundant with
>     +        the "refs/remotes/%s" rule.

Wow, that thing is new.  Fixing it may break expectations by
existing script by end-users, but it may probably be OK.

Everything looks good.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-15 15:16                             ` [PATCH v2 3/3] shorten_unambiguous_ref(): avoid sscanf() Jeff King
@ 2023-02-16  5:56                               ` Torsten Bögershausen
  2023-02-16  6:16                                 ` Eric Sunshine
  2023-02-16 17:31                                 ` Jeff King
  0 siblings, 2 replies; 46+ messages in thread
From: Torsten Bögershausen @ 2023-02-16  5:56 UTC (permalink / raw)
  To: Jeff King; +Cc: Eric Sunshine, Junio C Hamano, 孟子易, git

On Wed, Feb 15, 2023 at 10:16:21AM -0500, Jeff King wrote:
[]
> +test_expect_success 'symbolic-ref --short handles complex utf8 case' '
> +	name="测试-加-增加-加-增加" &&
> +	git symbolic-ref TEST_SYMREF "refs/heads/$name" &&
> +	# In the real world, we saw problems with this case only
> +	# when the locale includes UTF-8. Set it here to try to make things as
> +	# hard as possible for us to pass, but in practice we should do the
> +	# right thing regardless (and of course some platforms may not even
> +	# have this locale).
> +	LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual &&
> +	echo "$name" >expect &&
> +	test_cmp expect actual
> +'
> +

Hm, I may have 1 or two questions here.
The real world was MacOs, should that be mentioned here?

The other thing seems to be that there is a bug even with
LANG=C, see the response from Eric here:

$ git symbolic-ref --short HEAD | xxd
00000000: e6b5 8be8 af95 2de5 8a0a                 ......-...
$ LANG=C git symbolic-ref --short HEAD | xxd
00000000: e6b5 8be8 af95 2de5 8aa0 2de5 a29e e58a  ......-...-.....
00000010: a02d e58a a02d e5a2 9ee5 8aa0 0a         .-...-.......

Does it make sense to
a) Use the local locale, what ever that is
b) Re-run with LC_ALL=en_US.UTF-8
c) Re-run with LANG=C (that is where I had suspected problems when using UTF-8)
d) Mention MacOs here ?

Somewhat in that style:

test_expect_success 'symbolic-ref --short handles complex utf8 case' '
	name="测试-加-增加-加-增加" &&
	git symbolic-ref TEST_SYMREF "refs/heads/$name" &&
	# In the real world, we saw problems with this case only under MacOs Ventura
	# when the locale includes UTF-8. Try it here to try to make things as
	# hard as possible for us to pass, but in practice we should do the
	# right thing regardless (and of course some platforms may not even
	# have this locale).
	# Use try even the default and LANG=C
	git symbolic-ref --short TEST_SYMREF >actual &&
	echo "$name" >expect &&
	test_cmp expect actual &&
	LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual &&
	echo "$name" >expect &&
	test_cmp expect actual &&
	LANG=C git symbolic-ref --short TEST_SYMREF >actual &&
	echo "$name" >expect &&
	test_cmp expect actual
'

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: bug report: symbolic-ref --short command echos the wrong text while use Chinese language
  2023-02-15 17:19       ` Torsten Bögershausen
@ 2023-02-16  6:08         ` Eric Sunshine
  0 siblings, 0 replies; 46+ messages in thread
From: Eric Sunshine @ 2023-02-16  6:08 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Jeff King, 孟子易, git

On Wed, Feb 15, 2023 at 12:19 PM Torsten Bögershausen <tboegi@web.de> wrote:
> On Wed, Feb 15, 2023 at 11:37:11AM -0500, Eric Sunshine wrote:
> > On Wed, Feb 15, 2023 at 11:33 AM Torsten Bögershausen <tboegi@web.de> wrote:
> > > It would be helpful, to pipe the result into xxd:
> > > git symbolic-ref --short HEAD | xxd
> >
> >   $ git symbolic-ref --short HEAD | xxd
> >   00000000: e6b5 8be8 af95 2de5 8a0a                 ......-...
> >   $ LANG=C git symbolic-ref --short HEAD | xxd
> >   00000000: e6b5 8be8 af95 2de5 8aa0 2de5 a29e e58a  ......-...-.....
> >   00000010: a02d e58a a02d e5a2 9ee5 8aa0 0a         .-...-.......
>
> Interesting.
> I just saw that there are already fixes going on:
> Is the fix from Peff helping here ?

Peff's fix successfully sidesteps the problem on my machine.

(main) % git symbolic-ref --short HEAD
测试-?
(peff) % $(GIT)/bin-wrappers/git symbolic-ref --short HEAD
测试-加-增加-加-增加

> And what do you see on disk ?

I'm not sure what you mean by "on disk". Do you mean the output of `ls
.git/refs/heads/ | xxd` which I posted in [1]?

FWIW: I'm on High Sierra (10.13) using HFS+.

[1]: https://lore.kernel.org/git/CAPig+cQ9f0aW0TcP9A5WrKmYcQsEZvPOiPrgmzsy1frWkHd34w@mail.gmail.com/

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-16  5:56                               ` Torsten Bögershausen
@ 2023-02-16  6:16                                 ` Eric Sunshine
  2023-02-16 17:21                                   ` Junio C Hamano
  2023-02-16 17:31                                 ` Jeff King
  1 sibling, 1 reply; 46+ messages in thread
From: Eric Sunshine @ 2023-02-16  6:16 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Jeff King, Junio C Hamano, 孟子易, git

On Thu, Feb 16, 2023 at 12:56 AM Torsten Bögershausen <tboegi@web.de> wrote:
> On Wed, Feb 15, 2023 at 10:16:21AM -0500, Jeff King wrote:
> > +test_expect_success 'symbolic-ref --short handles complex utf8 case' '
> > +     name="测试-加-增加-加-增加" &&
> > +     git symbolic-ref TEST_SYMREF "refs/heads/$name" &&
> > +     # In the real world, we saw problems with this case only
> > +     # when the locale includes UTF-8. Set it here to try to make things as
> > +     # hard as possible for us to pass, but in practice we should do the
> > +     # right thing regardless (and of course some platforms may not even
> > +     # have this locale).
> > +     LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual &&
> > +     echo "$name" >expect &&
> > +     test_cmp expect actual
> > +'
>
> The other thing seems to be that there is a bug even with
> LANG=C, see the response from Eric here:
>
> $ git symbolic-ref --short HEAD | xxd
> 00000000: e6b5 8be8 af95 2de5 8a0a                 ......-...
> $ LANG=C git symbolic-ref --short HEAD | xxd
> 00000000: e6b5 8be8 af95 2de5 8aa0 2de5 a29e e58a  ......-...-.....
> 00000010: a02d e58a a02d e5a2 9ee5 8aa0 0a         .-...-.......

I'm confused. To what bug do you refer? In my tests, LANG=C seemed to
sidestep the problem.

> Does it make sense to
> a) Use the local locale, what ever that is
> b) Re-run with LC_ALL=en_US.UTF-8
> c) Re-run with LANG=C (that is where I had suspected problems when using UTF-8)

In my tests, LANG=C is the only case which seemed to work correctly
when the implementation used fscanf().

> d) Mention MacOs here ?

Certainly, a good idea.

> Somewhat in that style:
>
> test_expect_success 'symbolic-ref --short handles complex utf8 case' '
>         name="测试-加-增加-加-增加" &&
>         git symbolic-ref TEST_SYMREF "refs/heads/$name" &&
>         # In the real world, we saw problems with this case only under MacOs Ventura

I'm on ancient High Sierra (10.13) using HFS+, so the problem is not
Ventura-specific. The original bug report did mention Ventura (which
presumably is using APFS).

>         # when the locale includes UTF-8. Try it here to try to make things as
>         # hard as possible for us to pass, but in practice we should do the
>         # right thing regardless (and of course some platforms may not even
>         # have this locale).
>         # Use try even the default and LANG=C
>         git symbolic-ref --short TEST_SYMREF >actual &&
>         echo "$name" >expect &&
>         test_cmp expect actual &&
>         LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual &&
>         echo "$name" >expect &&
>         test_cmp expect actual &&
>         LANG=C git symbolic-ref --short TEST_SYMREF >actual &&
>         echo "$name" >expect &&
>         test_cmp expect actual
> '

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-16  6:16                                 ` Eric Sunshine
@ 2023-02-16 17:21                                   ` Junio C Hamano
  2023-02-16 17:28                                     ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2023-02-16 17:21 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Torsten Bögershausen, Jeff King, 孟子易,
	git

Eric Sunshine <sunshine@sunshineco.com> writes:

>> d) Mention MacOs here ?
>
> Certainly, a good idea.

Yes.

>> test_expect_success 'symbolic-ref --short handles complex utf8 case' '
>>         name="测试-加-增加-加-增加" &&
>>         git symbolic-ref TEST_SYMREF "refs/heads/$name" &&
>>         # In the real world, we saw problems with this case only under MacOs Ventura
>
> I'm on ancient High Sierra (10.13) using HFS+, so the problem is not
> Ventura-specific. The original bug report did mention Ventura (which
> presumably is using APFS).

I do not think the bug is in sscanf() and not in the code that deals
with filenames in the filesystem.  The symbolic-ref TEST_SYMREF is
implemented by writing the problematic string into a regular file,
and we read it as a stream of bytes---there is no chance for things
like filename normalization the filesystem tries to do to corrupt
it.  So reference to "under macOS Ventuara" I think is about their C
library, not a particular filesystem it uses.

How about a bit more detail on sscanf(), like this?

 t/t1401-symbolic-ref.sh | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git c/t/t1401-symbolic-ref.sh w/t/t1401-symbolic-ref.sh
index be23be30c7..dafcb4d61b 100755
--- c/t/t1401-symbolic-ref.sh
+++ w/t/t1401-symbolic-ref.sh
@@ -192,11 +192,13 @@ test_expect_success 'symbolic-ref pointing at another' '
 test_expect_success 'symbolic-ref --short handles complex utf8 case' '
 	name="测试-加-增加-加-增加" &&
 	git symbolic-ref TEST_SYMREF "refs/heads/$name" &&
-	# In the real world, we saw problems with this case only
-	# when the locale includes UTF-8. Set it here to try to make things as
-	# hard as possible for us to pass, but in practice we should do the
-	# right thing regardless (and of course some platforms may not even
-	# have this locale).
+	# In the real world, we saw this case misbehaved on macOS only
+	# when the locale includes UTF-8, back when "symbolic-ref --short"
+	# used sscanf(3) as part of its implementation.  Set it here to
+	# try to make things as hard as possible for us to pass, but in
+	# practice we should do the right thing regardless (and of course
+	# some platforms may not even have this locale), as we no longer
+	# use platform sscanf(3) there.
 	LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual &&
 	echo "$name" >expect &&
 	test_cmp expect actual


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-16 17:21                                   ` Junio C Hamano
@ 2023-02-16 17:28                                     ` Jeff King
  2023-02-16 23:36                                       ` Junio C Hamano
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-02-16 17:28 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Eric Sunshine, Torsten Bögershausen,
	孟子易, git

On Thu, Feb 16, 2023 at 09:21:15AM -0800, Junio C Hamano wrote:

> How about a bit more detail on sscanf(), like this?
> 
>  t/t1401-symbolic-ref.sh | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git c/t/t1401-symbolic-ref.sh w/t/t1401-symbolic-ref.sh
> index be23be30c7..dafcb4d61b 100755
> --- c/t/t1401-symbolic-ref.sh
> +++ w/t/t1401-symbolic-ref.sh
> @@ -192,11 +192,13 @@ test_expect_success 'symbolic-ref pointing at another' '
>  test_expect_success 'symbolic-ref --short handles complex utf8 case' '
>  	name="测试-加-增加-加-增加" &&
>  	git symbolic-ref TEST_SYMREF "refs/heads/$name" &&
> -	# In the real world, we saw problems with this case only
> -	# when the locale includes UTF-8. Set it here to try to make things as
> -	# hard as possible for us to pass, but in practice we should do the
> -	# right thing regardless (and of course some platforms may not even
> -	# have this locale).
> +	# In the real world, we saw this case misbehaved on macOS only
> +	# when the locale includes UTF-8, back when "symbolic-ref --short"
> +	# used sscanf(3) as part of its implementation.  Set it here to
> +	# try to make things as hard as possible for us to pass, but in
> +	# practice we should do the right thing regardless (and of course
> +	# some platforms may not even have this locale), as we no longer
> +	# use platform sscanf(3) there.
>  	LC_ALL=en_US.UTF-8 git symbolic-ref --short TEST_SYMREF >actual &&
>  	echo "$name" >expect &&
>  	test_cmp expect actual

I am OK with that squashed in. I hadn't bothered to mention macOS, etc,
because that was covered in the commit message. My point in this comment
was mostly to say "don't just remove this LC_ALL! It is doing
something".

But your text makes the situation even more clear.

I do kind of wonder if this test is even doing much. It is nice to
verify the fix (and hopefully somebody with macOS did indeed verify that
it fails before the fix!). But it does not seem all that likely that we
are going to regress in this area. I think it's reasonable to include
it, but if the LC_ALL bit starts creating any portability problems, my
first instinct would be to drop the test.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-16  5:56                               ` Torsten Bögershausen
  2023-02-16  6:16                                 ` Eric Sunshine
@ 2023-02-16 17:31                                 ` Jeff King
  2023-02-17  6:46                                   ` Torsten Bögershausen
  1 sibling, 1 reply; 46+ messages in thread
From: Jeff King @ 2023-02-16 17:31 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Eric Sunshine, Junio C Hamano, 孟子易, git

On Thu, Feb 16, 2023 at 06:56:41AM +0100, Torsten Bögershausen wrote:

> Does it make sense to
> a) Use the local locale, what ever that is

It is always "C", because that's what our tests do. And we found that
wasn't particularly interesting to this case.

> b) Re-run with LC_ALL=en_US.UTF-8

That part's what we already do, and...

> c) Re-run with LANG=C (that is where I had suspected problems when using UTF-8)

This won't do anything, because we set LC_ALL in the test scripts, which
takes precedence over LANG. Eric used it in his real-world testing
because he quite sensibly does not have LC_ALL set in his usual shell. :)

> d) Mention MacOs here ?

Yup, discussed downthread.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-16 17:28                                     ` Jeff King
@ 2023-02-16 23:36                                       ` Junio C Hamano
  0 siblings, 0 replies; 46+ messages in thread
From: Junio C Hamano @ 2023-02-16 23:36 UTC (permalink / raw)
  To: Jeff King
  Cc: Eric Sunshine, Torsten Bögershausen,
	孟子易, git

Jeff King <peff@peff.net> writes:

> I do kind of wonder if this test is even doing much. It is nice to
> verify the fix (and hopefully somebody with macOS did indeed verify that
> it fails before the fix!). But it does not seem all that likely that we
> are going to regress in this area. I think it's reasonable to include
> it, but if the LC_ALL bit starts creating any portability problems, my
> first instinct would be to drop the test.

That was my thought as well.  I do not see this particular test to
be about protecting against regression.  With the greediness of its
patterns, it is not likely that our future update will reintroduce
use of sscanf(3) and suffer from the same platform bug.  It is about
documenting the historical bug, i.e. the value of the test part of
the patch primarily is that macOS folks can apply it (and revert the
code fix) and demonstrate the presence of a problem.

The refs/remotes/%s/HEAD test does have value---it is to keep us
away from the temptation to go back to the sscanf(3) based solution.
So do the other two to lessor degree, as they cover corner cases
that a similar/rejected implementation would have failed to handle.

Thanks.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/3] shorten_unambiguous_ref(): avoid sscanf()
  2023-02-16 17:31                                 ` Jeff King
@ 2023-02-17  6:46                                   ` Torsten Bögershausen
  0 siblings, 0 replies; 46+ messages in thread
From: Torsten Bögershausen @ 2023-02-17  6:46 UTC (permalink / raw)
  To: Jeff King; +Cc: Eric Sunshine, Junio C Hamano, 孟子易, git

On Thu, Feb 16, 2023 at 12:31:48PM -0500, Jeff King wrote:
> On Thu, Feb 16, 2023 at 06:56:41AM +0100, Torsten Bögershausen wrote:

[]
Thanks for the patience with my somewhat confusing emails-
I was to much confused by the fact that I couldn't reproduce the problem here.
In short: The bug is in good hands.

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2023-02-17  6:46 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-13  6:38 bug report: symbolic-ref --short command echos the wrong text while use Chinese language 孟子易
2023-02-13 20:18 ` Jeff King
2023-02-13 22:58   ` Eric Sunshine
2023-02-14  1:39     ` Jeff King
2023-02-14  5:15       ` Eric Sunshine
2023-02-14  5:33         ` Eric Sunshine
2023-02-14  5:40           ` Junio C Hamano
2023-02-14  6:05             ` Eric Sunshine
2023-02-14  6:45               ` Junio C Hamano
2023-02-14  6:55                 ` Eric Sunshine
2023-02-14 16:01                   ` Jeff King
2023-02-14 16:29                     ` Eric Sunshine
2023-02-14 17:07                       ` Jeff King
2023-02-14 18:38                         ` [PATCH 0/3] get rid of sscanf() when shortening refs Jeff King
2023-02-14 18:39                           ` [PATCH 1/3] shorten_unambiguous_ref(): avoid integer truncation Jeff King
2023-02-14 18:40                           ` [PATCH 2/3] shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant Jeff King
2023-02-14 21:34                             ` Junio C Hamano
2023-02-14 22:23                               ` Jeff King
2023-02-14 18:41                           ` [PATCH 3/3] shorten_unambiguous_ref(): avoid sscanf() Jeff King
2023-02-14 21:48                             ` Junio C Hamano
2023-02-14 22:25                               ` Junio C Hamano
2023-02-14 22:30                               ` Jeff King
2023-02-14 22:34                                 ` Junio C Hamano
2023-02-14 22:40                                   ` Jeff King
2023-02-15  5:10                                     ` Junio C Hamano
2023-02-15 14:30                                       ` Jeff King
2023-02-15 16:41                                         ` Junio C Hamano
2023-02-14 23:20                               ` Eric Sunshine
2023-02-15 15:16                           ` [PATCH v2 0/3] get rid of sscanf() when shortening refs Jeff King
2023-02-15 15:16                             ` [PATCH v2 1/3] shorten_unambiguous_ref(): avoid integer truncation Jeff King
2023-02-15 15:16                             ` [PATCH v2 2/3] shorten_unambiguous_ref(): use NUM_REV_PARSE_RULES constant Jeff King
2023-02-15 15:16                             ` [PATCH v2 3/3] shorten_unambiguous_ref(): avoid sscanf() Jeff King
2023-02-16  5:56                               ` Torsten Bögershausen
2023-02-16  6:16                                 ` Eric Sunshine
2023-02-16 17:21                                   ` Junio C Hamano
2023-02-16 17:28                                     ` Jeff King
2023-02-16 23:36                                       ` Junio C Hamano
2023-02-16 17:31                                 ` Jeff King
2023-02-17  6:46                                   ` Torsten Bögershausen
2023-02-15 18:00                             ` [PATCH v2 0/3] get rid of sscanf() when shortening refs Junio C Hamano
2023-02-14 16:40                     ` bug report: symbolic-ref --short command echos the wrong text while use Chinese language Junio C Hamano
2023-02-14 17:40                       ` Jeff King
2023-02-15 16:26   ` Torsten Bögershausen
2023-02-15 16:37     ` Eric Sunshine
2023-02-15 17:19       ` Torsten Bögershausen
2023-02-16  6:08         ` Eric Sunshine

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).