Posted to dev@thrift.apache.org by "Nathan Beyer (JIRA)" <ji...@apache.org> on 2012/07/07 01:40:34 UTC

[jira] [Commented] (THRIFT-1023) Thrift encoding (UTF-8) issue with Ruby 1.9.2

    [ https://issues.apache.org/jira/browse/THRIFT-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13408454#comment-13408454 ] 

Nathan Beyer commented on THRIFT-1023:
--------------------------------------

I've been doing some research and experimentation; here's what I think needs to be done for this issue. Once I get a moment, I'll try to create a patch for it.

For Ruby 1.9+ support -
* Make sure all Transport-related classes work with byte buffers (i.e. Strings with BINARY/ASCII-8BIT encoding). Any String passed to a Transport class should be treated as a byte buffer, so it should be force-encoded to BINARY (via String#force_encoding) if it isn't already.
* Make sure all Protocol-related classes work with Strings in UTF-8 encoding. If a String in a different encoding is passed, it should be transcoded (via String#encode). When passing string data to the Transport classes, UTF-8 Strings should be converted to byte buffers (force-encoded to BINARY) before being passed on.
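
To illustrate the two rules above, here is a rough sketch in plain Ruby (1.9+). The helper names to_utf8 and to_bytes are made up for this example and are not part of the Thrift library; the sketch only shows the difference between transcoding (String#encode) and relabeling bytes (String#force_encoding), plus the kind of failure reported in this issue.

```ruby
# encoding: utf-8

# Hypothetical helpers for illustration only; not Thrift API.

# Protocol side: make sure a String is UTF-8, transcoding when it is not.
# (String#encode changes the bytes; it raises if a character cannot be converted.)
def to_utf8(str)
  str.encoding == Encoding::UTF_8 ? str : str.encode(Encoding::UTF_8)
end

# Transport side: treat a String as raw bytes.
# (String#force_encoding only relabels; the bytes stay the same. dup avoids
# mutating the caller's String.)
def to_bytes(str)
  str.encoding == Encoding::BINARY ? str : str.dup.force_encoding(Encoding::BINARY)
end

# A Latin-1 String: "ø" is the single byte 0xF8 in ISO-8859-1.
latin1 = "Thorbj\xF8rn".dup.force_encoding(Encoding::ISO_8859_1)

utf8  = to_utf8(latin1)  # transcoded: "ø" becomes two bytes
bytes = to_bytes(utf8)   # same bytes as utf8, tagged BINARY

# Appending a non-ASCII UTF-8 String to a binary buffer that already holds
# non-ASCII bytes raises Encoding::CompatibilityError.
buffer = "\xFF".dup.force_encoding(Encoding::BINARY)
begin
  buffer << utf8
rescue Encoding::CompatibilityError => e
  error = e
end

# Appending the byte-buffer form works.
buffer << bytes
```

Whether the conversion belongs in the Protocol classes, the Transport classes, or both is a design decision for the patch; the sketch only shows the String operations involved.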

Currently, the Transport classes use Strings for the byte buffers, which is fine, but I think it might be cleaner and easier to understand if a buffer class were introduced that encapsulated the encoding manipulation code. There's already some code in a utility class that seems to match this description; perhaps it just needs a bit of refactoring.
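
As a rough idea of what such a buffer class could look like, here is a minimal sketch. ByteBuffer and its methods are hypothetical names chosen for this example; this is not the existing utility class and not an actual Thrift API.

```ruby
# encoding: utf-8

# Hypothetical sketch; ByteBuffer is not an existing Thrift class.
class ByteBuffer
  def initialize
    # The internal storage is always a BINARY (ASCII-8BIT) String.
    @data = String.new.force_encoding(Encoding::BINARY)
  end

  # Append any String as raw bytes, regardless of its declared encoding.
  # dup keeps the caller's String (and its encoding) untouched.
  def write(str)
    @data << str.dup.force_encoding(Encoding::BINARY)
    self
  end

  # Remove and return up to n bytes from the front of the buffer.
  def read(n)
    @data.slice!(0, n)
  end

  def bytesize
    @data.bytesize
  end
end

buf = ByteBuffer.new
buf.write("Thorbjørn").write("ø")
```

With something like this, callers would never touch encodings directly; all of the force_encoding calls would live in one place.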

Please post if this doesn't make sense or there are other thoughts.
                
> Thrift encoding  (UTF-8) issue with Ruby 1.9.2
> ----------------------------------------------
>
>                 Key: THRIFT-1023
>                 URL: https://issues.apache.org/jira/browse/THRIFT-1023
>             Project: Thrift
>          Issue Type: Bug
>          Components: Ruby - Library
>    Affects Versions: 0.5
>         Environment: OSX, Ruby 1.9.2, Thrift Gem version 0.5.0
>            Reporter: Vincent Peres
>            Assignee: Jake Farrell
>         Attachments: thrift-1023-utf8-encoding-issue.path
>
>
> I came across an encoding issue in the Thrift library, specifically in the BufferedTransport class.
> I've written a few tests to give you a concrete example:
> # encoding: utf-8
> require 'spec_helper'
> describe "encoding" do
>  before do
>    transport = Thrift::BufferedTransport.new(Thrift::Socket.new(MR_CONFIG['host'], 9090))
>    protocol  = Thrift::BinaryProtocol.new(transport)
>    @client   = Apache::Hadoop::Hbase::Thrift::Hbase::Client.new(protocol)
>    transport.open()
>    @table_name = "encoding_test"
>    @column_family = "info:"
>  end
>  it "should create a new table" do
>    column = Apache::Hadoop::Hbase::Thrift::ColumnDescriptor.new{|c| c.name= @column_family}
>    @client.createTable(@table_name, [column]).should be_nil
>  end
>  it "should save standard caracteres" do
>    m        = Apache::Hadoop::Hbase::Thrift::Mutation.new
>    m.column = "info:first_name"
>    m.value  = "Vincent"
>    m.value.encoding.should == Encoding::UTF_8
>    @client.mutateRow(@table_name, "ID1", [m]).should be_nil
>  end
>  it "should save UTF8 caracteres" do
>    m        = Apache::Hadoop::Hbase::Thrift::Mutation.new
>    m.column = "info:first_name"
>    m.value  = "Thorbjørn"
>    m.value.encoding.should == Encoding::UTF_8
>    @client.mutateRow(@table_name, "ID1", [m]).should be_nil
>  end
>  it "should destroy the table" do
>    @client.disableTable(@table_name).should be_nil
>    @client.deleteTable(@table_name).should be_nil
>  end
> end
> It fails when it tries to save the UTF-8 string containing the character 'ø'.
> Here is the output:
>  1) encoding should save UTF8 caracteres
>     Failure/Error: @client.mutateRow(@table_name, "ID1", [m]).should be_nil
>     incompatible character encodings: ASCII-8BIT and UTF-8
>     #/Users/vincentp/.rvm/gems/ruby-1.9.2-p0/gems/thrift-0.5.0/lib/thrift/transport/buffered_transport.rb:59:in `write'
>     #/Users/vincentp/.rvm/gems/ruby-1.9.2-p0/gems/thrift-0.5.0/lib/thrift/protocol/binary_protocol.rb:107:in `write_string'
>     #/Users/vincentp/.rvm/gems/ruby-1.9.2-p0/gems/thrift-0.5.0/lib/thrift/client.rb:35:in `write'
>     #/Users/vincentp/.rvm/gems/ruby-1.9.2-p0/gems/thrift-0.5.0/lib/thrift/client.rb:35:in `send_message'
>     # ./lib/thrift/hbase.rb:289:in `send_mutateRow'
>     # ./lib/thrift/hbase.rb:284:in `mutateRow'
>     # ./spec/thrift/cases/encoding_spec.rb:37:in `block (2 levels) in <top (required)>'
> Let me know if you need any other details, thank you !
