ruby-core@ruby-lang.org archive (unofficial mirror)
* [ruby-core:115070] [Ruby master Feature#19930] [Documentation] class Regexp: Character Classes ranges
@ 2023-10-17 13:42 noraj-acceis (Alexandre ZANNI) via ruby-core
From: noraj-acceis (Alexandre ZANNI) via ruby-core @ 2023-10-17 13:42 UTC
  To: ruby-core; +Cc: noraj-acceis (Alexandre ZANNI)

Issue #19930 has been reported by noraj-acceis (Alexandre ZANNI).

----------------------------------------
Feature #19930: [Documentation] class Regexp: Character Classes ranges
https://bugs.ruby-lang.org/issues/19930

* Author: noraj-acceis (Alexandre ZANNI)
* Status: Open
* Priority: Normal
----------------------------------------
cf. https://ruby-doc.org/3.2.2/Regexp.html#class-Regexp-label-Character+Classes

> POSIX bracket expressions are also similar to character classes. They provide a portable alternative to the above, with the added benefit that they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.

Reading this description, one would expect the shorthand metacharacters to be ASCII-only and POSIX _bracket expressions_ to be Unicode-aware. But because _bracket expressions_ are POSIX-compliant, `[:xdigit:]`, for example, covers only the ASCII range `[A-Fa-f0-9]` and not the `Hex_Digit` Unicode property, which also includes decimal digits from the Halfwidth and Fullwidth Forms block such as `０` (U+FF10, FULLWIDTH DIGIT ZERO). So the description above is confusing, since we would expect `[[:xdigit:]]` to _encompass non-ASCII characters_ too. Conversely, `[:space:]` matches `[\p{Z}\t\r\n\v\f]` (`\s` plus `\p{Z}` (Separator)), while the description mentions only `[:blank:], newline, carriage return`.
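The mismatch can be checked with a few probes. This is only a minimal sketch: the helper name `probe_results` and its result keys are made up for illustration, and it assumes a Ruby with `Regexp#match?` (2.4 or later) and UTF-8 strings.

```ruby
# Hypothetical helper: probes two non-ASCII characters against the
# shorthand metacharacters and the POSIX bracket expressions.
def probe_results
  fullwidth_zero = "\uFF10" # FULLWIDTH DIGIT ZERO (general category Nd)
  no_break_space = "\u00A0" # NO-BREAK SPACE (general category Zs)

  {
    digit_shorthand:  /\d/.match?(fullwidth_zero),           # ASCII-only shorthand
    digit_posix:      /[[:digit:]]/.match?(fullwidth_zero),  # Unicode-aware
    xdigit_shorthand: /\h/.match?(fullwidth_zero),           # ASCII-only shorthand
    xdigit_posix:     /[[:xdigit:]]/.match?(fullwidth_zero), # ASCII-only as well
    space_shorthand:  /\s/.match?(no_break_space),           # ASCII-only shorthand
    space_posix:      /[[:space:]]/.match?(no_break_space)   # Unicode-aware
  }
end

probe_results.each { |name, matched| puts "#{name}: #{matched}" }
```

Only `digit_posix` and `space_posix` should come out as `true`, which is the inconsistency described above: `[[:digit:]]` and `[[:space:]]` reach beyond ASCII, while `[[:xdigit:]]` does not.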

My point is that, in the end, it is hard to determine which ranges to expect for character classes from the Ruby Regexp documentation alone. To know the exact behavior, I have to read the source code or at least the POSIX spec.

My feature request is to add a comparison table like the one on https://www.regular-expressions.info/posixbrackets.html (for Java), with these columns: the POSIX bracket expression, the description, the exact ASCII range, the exact Unicode range, the shorthand metacharacter (ASCII), and the long escape sequence (Unicode). That way we could know precisely what to expect by reading the doc.
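Until such a table exists, the ASCII and non-ASCII coverage of each class can be estimated empirically. The sketch below is an illustration only, not how the documentation would be generated: `class_coverage` is a hypothetical helper, it scans just the Basic Multilingual Plane, and the non-ASCII counts depend on the Unicode version bundled with the Ruby in use.

```ruby
# Hypothetical helper: for each labelled pattern, count how many BMP code
# points it matches, split into ASCII and non-ASCII.
def class_coverage(patterns)
  # Skip the surrogate range U+D800..U+DFFF, which cannot be encoded in UTF-8.
  code_points = [*0x0000..0xD7FF, *0xE000..0xFFFF]
  chars = code_points.map { |cp| [cp].pack("U") }

  patterns.to_h do |label, pattern|
    matched = chars.select { |c| pattern.match?(c) }
    ascii = matched.count(&:ascii_only?)
    [label, { ascii: ascii, non_ascii: matched.size - ascii }]
  end
end

coverage = class_coverage(
  '\d'           => /\d/,
  '[[:digit:]]'  => /[[:digit:]]/,
  '[[:xdigit:]]' => /[[:xdigit:]]/,
  '\s'           => /\s/,
  '[[:space:]]'  => /[[:space:]]/
)

coverage.each do |label, counts|
  puts format("%-14s ascii=%-4d non_ascii=%d", label, counts[:ascii], counts[:non_ascii])
end
```

A class whose `non_ascii` count is zero is ASCII-only in practice, whatever the prose description suggests.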

---Files--------------------------------
Screenshot_20231017_154208.png (145 KB)


-- 
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/
