[ruby-core:20125] Playing with String#bytes

ruby-core@ruby-lang.org archive (unofficial mirror)

* [ruby-core:20125] Playing with String#bytes
@ 2008-11-26 16:07 Emiel van de Laar
  2008-11-26 18:03 ` [ruby-core:20126] " Radosław Bułat
  2008-11-27 17:03 ` [ruby-core:20140] Re: Playing with String#bytes Ken Bloom
  0 siblings, 2 replies; 7+ messages in thread
From: Emiel van de Laar @ 2008-11-26 16:07 UTC
  To: ruby-core

Hello ruby-core,

Today I was playing around with manipulating strings containing
binary data, e.g. "\xaa\xab\xac\xad\xae", using the new String
methods available in Ruby 1.9.

The exercise I was trying was to extract a range of bytes as
Fixnums, much like String#bytes, except that I was only interested
in a subarray. Like so:

"\xaa\xab\xac\xad\xae".bytes.to_a[1,3] # => [171, 172, 173]

This works but operates on the entire data set which I imagine is
fairly expensive... So I chopped it up beforehand, like so:

"\xaa\xab\xac\xad\xae"[1,3].bytes.to_a # => [171, 172, 173]

Much better. This works fine when using the ASCII-8BIT encoding.
As soon as you use something like UTF-8 it fails because the []
method now works on characters instead of bytes.

data = "\xc3\xa9\xc3\xa9" # => "\xC3\xA9\xC3\xA9"
data.force_encoding("utf-8") # => "éé"
data[0,2].bytes.to_a  # => [195, 169, 195, 169]

Here I get four bytes instead of the first two which I wanted.

data.bytes.to_a[0,2] # => [195, 169]
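
A minimal sketch of the same contrast, assuming a Ruby that has
String#byteslice (added in 1.9.3, so not available when this message
was written). The \u escapes are used only so that the literal is
tagged UTF-8 whatever the source encoding is:

```ruby
data = "\u00e9\u00e9"                        # "éé": four bytes, tagged UTF-8

by_chars = data[0, 2].bytes.to_a             # [] counts characters here
by_bytes = data.byteslice(0, 2).bytes.to_a   # byteslice counts bytes

p by_chars  # => [195, 169, 195, 169]
p by_bytes  # => [195, 169]
```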

So, having said that, I must ensure that the binary data I am working
with is tagged as ASCII-8BIT. This is of course completely
reasonable and recommended.

Anyway, my real comment is that it might be nice to have String#getbyte
or String#bytes be able to get a subset of bytes from the string.
For example:

"abcde".bytes(1,3).to_a # => [98, 99, 100]
"abcde".bytes(1..3).to_a # => [98, 99, 100]
"abcde".bytes(-2,2).to_a # => [100, 101]

"abcde".getbyte(1,3) # => [98, 99, 100]
"abcde".getbyte(1..3) # => [98, 99, 100]
"abcde".getbyte(-2,2) # => [100, 101]

String#getbyte is singular as opposed to plural, which doesn't sit
well with me.
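
To make the idea concrete, here is a rough sketch of the (start, length)
form written as a plain helper. The name byte_range is hypothetical and
is not part of core String; the Range form is left out to keep it short:

```ruby
# Hypothetical helper, not part of core String: returns up to `length`
# bytes starting at byte offset `start`. Negative offsets count from
# the end of the string.
def byte_range(str, start, length)
  start += str.bytesize if start < 0
  return [] if start < 0 || length < 0
  stop = [start + length, str.bytesize].min
  (start...stop).map { |i| str.getbyte(i) }
end

p byte_range("abcde", 1, 3)         # => [98, 99, 100]
p byte_range("abcde", -2, 2)        # => [100, 101]
p byte_range("\u00e9\u00e9", 0, 2)  # => [195, 169]
```

Because it only calls getbyte, it never looks at the encoding tag, so
it behaves the same for UTF-8 and ASCII-8BIT strings.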

Thanks for reading!

 - Emiel van de Laar

* [ruby-core:20126] Re: Playing with String#bytes
  2008-11-26 16:07 [ruby-core:20125] Playing with String#bytes Emiel van de Laar
@ 2008-11-26 18:03 ` Radosław Bułat
  2008-11-26 21:49   ` [ruby-core:20130] " Brian Candler
  2008-11-27 17:03 ` [ruby-core:20140] Re: Playing with String#bytes Ken Bloom
  1 sibling, 1 reply; 7+ messages in thread
From: Radosław Bułat @ 2008-11-26 18:03 UTC
  To: ruby-core

What about:
data.force_encoding("ASCII-8BIT")[1,3].bytes.to_a
?

-- 
Regards

Radosław Bułat
http://radarek.jogger.pl - my blog

* [ruby-core:20130] Re: Playing with String#bytes
  2008-11-26 18:03 ` [ruby-core:20126] " Radosław Bułat
@ 2008-11-26 21:49   ` Brian Candler
  2008-11-26 23:38     ` [ruby-core:20133] " Michael Selig
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Candler @ 2008-11-26 21:49 UTC
  To: ruby-core

On Thu, Nov 27, 2008 at 03:03:49AM +0900, Radosław Bułat wrote:
> What about:
> data.force_encoding("ASCII-8BIT")[1,3].bytes.to_a
> ?

But that changes the encoding of 'data' as a side-effect. To prevent that,
you'd need

  data.dup.force_encoding("ASCII-8BIT")[1,3].bytes.to_a

which is getting a bit messy. OTOH, I'm not sure how often you'd want to
handle a string which has been tagged as UTF-8 in this way.
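
A small sketch of the difference, assuming a UTF-8 string built with \u
escapes so that it does not depend on the source encoding:

```ruby
data = "\u00e9\u00e9"   # "éé", tagged UTF-8

# dup first, so that force_encoding retags the copy and not `data` itself
slice = data.dup.force_encoding("ASCII-8BIT")[1, 3].bytes.to_a

p slice          # => [169, 195, 169]
p data.encoding  # => #<Encoding:UTF-8>
```

On Ruby 2.0 and later, String#b returns a binary-tagged copy directly,
which would shorten the dup.force_encoding step; that method did not
exist at the time of this thread.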

BTW, "BINARY" is a synonym for "ASCII-8BIT" and probably makes more sense
here. But oddly, you can't use a Symbol.

irb(main):036:0> data.force_encoding("binary")
=> "\xC3\xA9\xC3\xA9"
irb(main):037:0> data.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):038:0> data.force_encoding(:binary)
TypeError: can't convert Symbol into String
	from (irb):38:in `force_encoding'
	from (irb):38
	from /usr/local/bin/irb19:12:in `<main>'
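
As a sketch of the alias, and of one way to avoid spelling the name as
a String at all, the Encoding object can be passed instead. The variable
names here are only for illustration:

```ruby
binary     = Encoding.find("binary")
ascii_8bit = Encoding.find("ASCII-8BIT")
p binary == ascii_8bit   # => true

# force_encoding also accepts an Encoding object
data = "\u00e9\u00e9".dup.force_encoding(Encoding::BINARY)
p data.bytesize          # => 4
p data.length            # => 4, since each byte is now one character
```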

* [ruby-core:20133] Re: Playing with String#bytes
  2008-11-26 21:49   ` [ruby-core:20130] " Brian Candler
@ 2008-11-26 23:38     ` Michael Selig
  2008-11-27  0:05       ` [ruby-core:20134] " Yukihiro Matsumoto
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Selig @ 2008-11-26 23:38 UTC
  To: ruby-core

"Brian Candler" <B.Candler@pobox.com> wrote:

> On Thu, Nov 27, 2008 at 03:03:49AM +0900, Radosław Bułat wrote:
>> What about:
>> data.force_encoding("ASCII-8BIT")[1,3].bytes.to_a
>> ?
>
> But that changes the encoding of 'data' as a side-effect. To prevent that,
> you'd need
>
>  data.dup.force_encoding("ASCII-8BIT")[1,3].bytes.to_a
>
> which is getting a bit messy.

In retrospect it might have been nice to have String#force_encoding! doing 
what force_encoding now does, as well as a "duplicating" 
String#force_encoding, but I think it's way too late for that now.

> OTOH, I'm not sure how often you'd want to
> handle a string which has been tagged as UTF-8 in this way.

I think one of the problems here is that string literals containing \x are 
not always set to ASCII-8BIT, but to the source encoding, which may very 
well be UTF-8. This is an issue that I have been trying to highlight for a 
while.
I would much prefer string literals with "\x" to be always ASCII-8BIT.
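
A quick way to see which case applies is to print the encoding of such
a literal next to the source encoding. This is only a sketch: the first
two lines of output depend on the Ruby version and on how the file is
read, so no expected values are given for them:

```ruby
literal = "\xc3\xa9"

p __ENCODING__        # the source encoding of this file
p literal.encoding    # follows the source encoding when that is UTF-8
p literal.bytes.to_a  # => [195, 169]
```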

Cheers
Mike 

* [ruby-core:20134] Re: Playing with String#bytes
  2008-11-26 23:38     ` [ruby-core:20133] " Michael Selig
@ 2008-11-27  0:05       ` Yukihiro Matsumoto
  2008-11-27  4:21         ` [ruby-core:20137] Re: ASCII-8BIT String literals (Was: Re: Playing with String#bytes) Michael Selig
  0 siblings, 1 reply; 7+ messages in thread
From: Yukihiro Matsumoto @ 2008-11-27  0:05 UTC
  To: ruby-core

Hi,

In message "Re: [ruby-core:20133] Re: Playing with String#bytes"
    on Thu, 27 Nov 2008 08:38:03 +0900, "Michael Selig" <michael.selig@fs.com.au> writes:

|I would much prefer string literals with "\x" to be always ASCII-8BIT.

It appeared nice at first sight, but it turned out to cause more
trouble than it helps.  \x notation for multibyte strings is useful
when you don't have proper input facilities.  Besides that, #dump
would no longer round-trip for non-Unicode multibyte strings.

							matz.

* [ruby-core:20137] Re: ASCII-8BIT String literals (Was: Re: Playing with String#bytes)
  2008-11-27  0:05       ` [ruby-core:20134] " Yukihiro Matsumoto
@ 2008-11-27  4:21         ` Michael Selig
  0 siblings, 0 replies; 7+ messages in thread
From: Michael Selig @ 2008-11-27  4:21 UTC
  To: ruby-core

Hi,

From: "Yukihiro Matsumoto" <matz@ruby-lang.org> wrote:

> |I would much prefer string literals with "\x" to be always ASCII-8BIT.
>
> It appeared nice at first sight, but it turned out to cause more
> trouble than it helps.  \x notation for multibyte strings is useful
> when you don't have proper input facilities.  Besides that, #dump
> would no longer round-trip for non-Unicode multibyte strings.

Thank you very much for explaining this.

Then what about a construct like %q/%Q (perhaps %b/%B) to quote a binary 
(ASCII-8BIT) string instead of having to use force_encoding? At least that 
way it is very obvious that the programmer means to use a binary (byte) 
string.

Cheers
Mike 

* [ruby-core:20140] Re: Playing with String#bytes
  2008-11-26 16:07 [ruby-core:20125] Playing with String#bytes Emiel van de Laar
  2008-11-26 18:03 ` [ruby-core:20126] " Radosław Bułat
@ 2008-11-27 17:03 ` Ken Bloom
  1 sibling, 0 replies; 7+ messages in thread
From: Ken Bloom @ 2008-11-27 17:03 UTC
  To: ruby-core

On Thu, 27 Nov 2008 01:07:08 +0900, Emiel van de Laar wrote:

> Hello ruby-core,
> 
> Today I was playing around with manipulating strings containing binary
> data, i.e. "\xaa\xab\xac\xad\xae" and using the new String methods
> available in Ruby 1.9.
> 
> The exercise I was trying out was to extract out a range of bytes as
> Fixnums. Kind of like String#bytes but I was only interested in a
> subarray. Like so:
> 
> "\xaa\xab\xac\xad\xae".bytes.to_a[1,3] # => [171, 172, 173]
> 
> This works but operates on the entire data set which I imagine is fairly
> expensive... So I chopped it up beforehand, like so:

In most cases, it probably isn't, so trying to change things around may 
be premature optimization.

> data = "\xc3\xa9\xc3\xa9" # => "\xC3\xA9\xC3\xA9"
> data.force_encoding("utf-8") # => "éé"
> data[0,2].bytes.to_a  # => [195, 169, 195, 169]
> 
> Here I get four bytes instead of the first two which I wanted.
> 
> data.bytes.to_a[0,2] # => [195, 169]

If you do need to optimize, try unpack, which treats the string as an 
array of bytes anyway.

data = "\xc3\xa9\xc3\xa9\xc3\xa9\xc3\xa9\xc3\xa9\xc3\xa9\xc3\xa9\xc3\xa9"
data.force_encoding("utf-8")
data.unpack("@5C2") # => [169, 195]
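
In that format string, "@5" moves to absolute byte offset 5 and "C2"
reads two unsigned 8-bit integers. A sketch of the same thing wrapped
in a helper; the name bytes_at is hypothetical:

```ruby
# Hypothetical helper: `count` bytes starting at byte offset `offset`.
# "@N" moves to absolute byte offset N; "C" reads an unsigned 8-bit integer.
def bytes_at(str, offset, count)
  str.unpack("@#{offset}C#{count}")
end

data = "\u00e9" * 8     # 16 bytes, tagged UTF-8
p bytes_at(data, 5, 2)  # => [169, 195]
p data.encoding         # => #<Encoding:UTF-8>
```

The string keeps its UTF-8 tag, so no dup or force_encoding is needed.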

-- 
Chanoch (Ken) Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/
