[ruby-core:60038] Clarification on the behaviour of String#scrub

ruby-core@ruby-lang.org archive (unofficial mirror)

* [ruby-core:60038] Clarification on the behaviour of String#scrub
@ 2014-01-23 23:34 Yorick Peterse
  2014-01-31  2:26 ` [ruby-core:60365] " NARUSE, Yui
  0 siblings, 1 reply; 2+ messages in thread
From: Yorick Peterse @ 2014-01-23 23:34 UTC
  To: ruby-core

I am currently working on porting String#scrub and String#scrub! to
Rubinius (https://github.com/rubinius/rubinius/issues/2901). Looking at
the source code of this method in MRI
(https://github.com/ruby/ruby/blob/trunk/string.c#L8022) and the
corresponding tests, I see several different paths the code can take.
For example, if I'm reading it correctly, it uses different
replacement values depending on the input encoding.

Since my knowledge of C and of the MRI internals is
limited, I'd like to request some clarification on the behaviour of these
methods. In particular, I'd like to know the following:

* In what cases are certain replacement values used when no custom one
  is given?

* How exactly are groups of invalid sequences determined and replaced?
  It seems that in some cases two invalid characters are replaced
  separately whereas in other cases they are replaced as a group.

* When exactly would Encoding::CompatibilityError be raised? When both
  the input String and replacement are in non matching encodings?

To clarify the second item, consider the following snippet:

    "\xE3\x80".scrub('-') # => "-"

Here the two bytes are replaced as a group, resulting in only one
instance of "-". However, in the following snippet the two bytes are
replaced separately:

    "\x80\x80".scrub('-') # => "--"

Maybe I'm not fully understanding Unicode, but it would be nice if this
behaviour were documented somewhere, as right now it's not clear whether
it is intentional or a bug.

The closest thing to a spec of the behaviour I could find is
https://bugs.ruby-lang.org/issues/6752, but most of it is in Japanese,
a language I sadly can't read.

Thanks for the info!

* [ruby-core:60365] Re: Clarification on the behaviour of String#scrub
  2014-01-23 23:34 [ruby-core:60038] Clarification on the behaviour of String#scrub Yorick Peterse
@ 2014-01-31  2:26 ` NARUSE, Yui
  0 siblings, 0 replies; 2+ messages in thread
From: NARUSE, Yui @ 2014-01-31  2:26 UTC
  To: Ruby developers

Hi,

> * In what cases are certain replacement values used when no custom one
>   is given?

Current CRuby uses:
Unicode family: U+FFFD
others: "?" (a literal question mark)
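
To make the two defaults concrete, here is a minimal sketch. It is not taken from the CRuby source; the helper name `default_scrub_replacement` is made up for illustration, and it assumes a Ruby that ships String#scrub (2.1 or later). UTF-8 stands in for the Unicode family and US-ASCII for the "others".

```ruby
# Hypothetical helper (not part of Ruby): scrub a single invalid byte in the
# given encoding and return whatever String#scrub substituted by default.
def default_scrub_replacement(encoding)
  # 0xFF is never a valid byte in UTF-8 or US-ASCII.
  broken = "\xFF".b.force_encoding(encoding)
  broken.scrub
end

utf8  = default_scrub_replacement(Encoding::UTF_8)
ascii = default_scrub_replacement(Encoding::US_ASCII)

p utf8.codepoints  # => [65533], i.e. U+FFFD
p ascii            # => "?"
```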

> * How exactly are groups of invalid sequences determined and replaced?
>   It seems that in some cases two invalid characters are replaced
>   separately whereas in other cases they are replaced as a group.

It follows the Unicode spec (section 5.22, "Best Practice for U+FFFD
Substitution"):
http://www.unicode.org/versions/Unicode6.2.0/ch05.pdf
The practice says "The maximal subpart should be replaced".
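
Applied to the two snippets from the question, the rule can be sketched as follows. The comments are an interpretation of "maximal subpart" for UTF-8, not wording from the Unicode text, and the third case is an extra example that does not appear in the original question.

```ruby
# "\xE3\x80" is the beginning of a three-byte UTF-8 sequence that is cut
# short. Both bytes belong to one maximal subpart, so they are replaced once.
truncated = "\xE3\x80".scrub("-")

# "\x80" is a continuation byte without a lead byte. Each one is an
# ill-formed sequence on its own, so each gets its own replacement.
stray = "\x80\x80".scrub("-")

# The truncated prefix is still replaced as a unit when a valid
# character follows it.
followed = "\xE3\x80a".scrub("-")

p truncated  # => "-"
p stray      # => "--"
p followed   # => "-a"
```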

> * When exactly would Encoding::CompatibilityError be raised? When both
>   the input String and replacement are in non matching encodings?

It follows this logic:

if the replacement string is broken
  raise ArgumentError
elsif the coderange of the replacement is 7bit
  if the input is not ASCII compatible
    raise Encoding::CompatibilityError
  end
else
  if the encoding of the input and the encoding of the replacement are different
    raise Encoding::CompatibilityError
  end
end
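
The branches above can be observed from Ruby with a small sketch like the one below. The helper `scrub_outcome` and the sample strings are made up for illustration. Every input is deliberately invalid, on the assumption that the replacement may not be checked at all when the input needs no scrubbing.

```ruby
# Hypothetical helper (not part of Ruby): return the scrubbed string, or the
# class of the exception that String#scrub raised for this replacement.
def scrub_outcome(input, replacement)
  input.scrub(replacement)
rescue ArgumentError, Encoding::CompatibilityError => e
  e.class
end

invalid_utf8  = "abc\x80".b.force_encoding(Encoding::UTF_8)
# A lone high surrogate: invalid UTF-16LE, and UTF-16LE is not ASCII compatible.
invalid_utf16 = "\x00\xD8".b.force_encoding(Encoding::UTF_16LE)

broken_repl = "\xFF".b.force_encoding(Encoding::UTF_8)      # invalid itself
ascii_repl  = "-".b.force_encoding(Encoding::US_ASCII)      # 7bit coderange
eucjp_repl  = "\xA4\xA2".b.force_encoding(Encoding::EUC_JP) # valid, non-ASCII

results = {
  broken_replacement:    scrub_outcome(invalid_utf8,  broken_repl),
  ascii_into_utf16:      scrub_outcome(invalid_utf16, ascii_repl),
  eucjp_into_utf8:       scrub_outcome(invalid_utf8,  eucjp_repl),
  ascii_into_utf8:       scrub_outcome(invalid_utf8,  ascii_repl),
  same_encoding_non_ascii: scrub_outcome(invalid_utf8, "\u3042")
}

results.each { |name, outcome| puts "#{name}: #{outcome.inspect}" }
```

Only the last two cases return a string. A 7bit replacement is accepted for any ASCII-compatible input even when its encoding differs, while a non-ASCII replacement has to be in exactly the input's encoding.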

Thanks,



-- 
NARUSE, Yui  <naruse@airemix.jp>
