Posted to dev@thrift.apache.org by "Nathan Beyer (JIRA)" <ji...@apache.org> on 2012/11/12 22:15:13 UTC

[jira] [Commented] (THRIFT-1727) Ruby-1.9: data loss: "binary" fields are re-encoded

    [ https://issues.apache.org/jira/browse/THRIFT-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495645#comment-13495645 ] 

Nathan Beyer commented on THRIFT-1727:
--------------------------------------

I believe the core issue is that 'binary' is not a distinct base type. According to the Thrift Types document (http://thrift.apache.org/docs/types/), there is only a 'string' base type, and 'binary' is a special type that is a specialized form of 'string'.

I'm not sure how this manifests in other languages, but in Ruby, when an IDL has a field of 'binary' type, the code generator adds some metadata to the field definitions. Here's an example:
{code}
# IDL with a struct that has string and binary types
struct Combo {
  1: string sdata
  2: binary bdata
}

# Generated Ruby code
    class Combo
      include ::Thrift::Struct, ::Thrift::Struct_Union
      SDATA = 1
      BDATA = 2

      FIELDS = {
        SDATA => {:type => ::Thrift::Types::STRING, :name => 'sdata'},
        BDATA => {:type => ::Thrift::Types::STRING, :name => 'bdata', :binary => true}
      }

      def struct_fields; FIELDS; end

      def validate
      end

      ::Thrift::Struct.generate_accessors self
    end
{code}

Unfortunately, this field information is not available to the protocol classes when serializing and deserializing. Since 'binary' is not a base type, there is no 'write_binary' or 'read_binary'. All that gets invoked is 'write_string' or 'read_string', and these methods don't seem to have enough context to reach the field definition data. Please let me know if there is a way to access this information, as it could be used to avoid transcoding the data and instead force the encoding to BINARY.
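
To make the idea concrete, here is a rough sketch of what a protocol could do if it had access to the field definition. It is plain Ruby and does not use the Thrift library; COMBO_FIELDS and encode_field_value are hypothetical names made up for illustration, and the :binary flag is simply read the way it appears in the generated FIELDS hash above.

```ruby
# Hypothetical sketch, not the Thrift library's API.
# Field metadata shaped like the generated FIELDS hash (types simplified to symbols).
COMBO_FIELDS = {
  1 => { :type => :string, :name => 'sdata' },
  2 => { :type => :string, :name => 'bdata', :binary => true }
}

# Hypothetical helper: returns the bytes a protocol would put on the wire.
def encode_field_value(field_info, value)
  if field_info[:binary]
    # Binary field: only relabel the bytes as BINARY, never transcode them.
    value.dup.force_encoding(Encoding::BINARY)
  else
    # Text field: transcode to UTF-8 first, then treat the result as raw bytes.
    value.encode(Encoding::UTF_8).force_encoding(Encoding::BINARY)
  end
end

raw = [0xFF, 0x80, 0x41].pack('C*')                       # ASCII-8BIT string
binary_out = encode_field_value(COMBO_FIELDS[2], raw)
text_out   = encode_field_value(COMBO_FIELDS[1], "caf\u00E9")

puts binary_out.bytes.to_a.inspect
puts text_out.bytesize
```

With the flag available, the binary field's bytes pass through unchanged, while the string field is still written as UTF-8. Whether the flag can actually be passed down to 'write_string' is exactly the open question.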

How are the other libraries dealing with this special 'binary' type?
                
> Ruby-1.9: data loss: "binary" fields are re-encoded
> ---------------------------------------------------
>
>                 Key: THRIFT-1727
>                 URL: https://issues.apache.org/jira/browse/THRIFT-1727
>             Project: Thrift
>          Issue Type: Bug
>          Components: Ruby - Library
>    Affects Versions: 0.9
>         Environment: JRuby 1.6.8 using "--1.9" command line parameter.
>            Reporter: XB
>
> When setting a binary field of a Thrift object with some binary data (e.g. a string whose encoding is "ASCII-8BIT") and then serializing this object, the binary data is re-encoded. That is, it is encoded as if it were not a sequence of bytes but a sequence of characters, encoded using the ISO-8859-1 encoding. This assumed ISO-8859-1 sequence of characters is then converted into UTF-8 (by BinaryProtocol or CompactProtocol). This basically means that all bytes whose values are between 0x80 (inclusive) and 0x100 (exclusive) are converted into multi-byte sequences. This leads to data corruption.
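
The re-encoding described in the report can be reproduced in plain Ruby without Thrift. This is only a minimal sketch: it assumes, as the report states, that the bytes are effectively reinterpreted as ISO-8859-1 and then transcoded to UTF-8, and it does not call any library code.

```ruby
# Three raw bytes; pack('C*') yields an ASCII-8BIT (binary) string.
raw = [0x41, 0x80, 0xFF].pack('C*')

# Assumed behaviour from the report: treat the bytes as ISO-8859-1 characters...
as_latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)
# ...and convert those characters to UTF-8.
reencoded = as_latin1.encode(Encoding::UTF_8)

puts raw.bytes.to_a.map { |b| format('%02x', b) }.join(' ')        # 41 80 ff
puts reencoded.bytes.to_a.map { |b| format('%02x', b) }.join(' ')  # 41 c2 80 c3 bf
```

The byte 0x41 survives, but 0x80 and 0xFF each become a two-byte UTF-8 sequence, so the payload grows from 3 to 5 bytes and no longer matches what was set on the field.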
