Posted to dev@avro.apache.org by "Jackie Murphy (JIRA)" <ji...@apache.org> on 2015/02/02 18:03:34 UTC

[jira] [Created] (AVRO-1637) Handling multibyte UTF-8 characters in Ruby

Jackie Murphy created AVRO-1637:
-----------------------------------

             Summary: Handling multibyte UTF-8 characters in Ruby
                 Key: AVRO-1637
                 URL: https://issues.apache.org/jira/browse/AVRO-1637
             Project: Avro
          Issue Type: Bug
            Reporter: Jackie Murphy
            Priority: Minor


It looks like the Ruby implementation of Avro doesn't successfully round-trip UTF-8 encoded strings containing multibyte characters.

Example:

{code}
require 'avro'
require 'stringio'

# Serialize obj to Avro binary using the given schema.
def serialize(obj, schema)
  buffer = StringIO.new
  encoder = Avro::IO::BinaryEncoder.new(buffer)
  datum_writer = Avro::IO::DatumWriter.new(schema)
  datum_writer.write(obj, encoder)
  buffer.seek(0)
  buffer.read
end

# Deserialize Avro binary back into a Ruby object using the same schema.
def deserialize(avro_obj, schema)
  reader = StringIO.new(avro_obj)
  decoder = Avro::IO::BinaryDecoder.new(reader)
  datum_reader = Avro::IO::DatumReader.new(schema)
  datum_reader.read(decoder)
end
{code}

{code}
> schema = Avro::Schema.parse("{\"type\":\"record\",\"name\":\"Example\",\"fields\":[{\"name\":\"example_field\",\"type\":\"string\"}, {\"name\":\"other_field\",\"type\":\"string\"}]}")

> deserialize(serialize({'example_field'=> 'héllö world', 'other_field'=>'goodbye world'}, schema), schema)

{"example_field"=>"h\xC3\xA9ll\xC3\xB6 wor", "other_field"=>"d\x1Agoodbye world"}
{code}

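Note that the length of the first field appears to be computed incorrectly: 'héllö world' is 11 characters but 13 bytes in UTF-8, and only 11 bytes are read back, which suggests the length is written in characters rather than in bytes. The end of the first field then spills into the second field.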
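
To illustrate the suspected mismatch without involving Avro at all, here is a minimal sketch in plain Ruby. It assumes (this is a guess based on the output above, not a reading of the library source) that the writer emits the character count as the length prefix while still writing every byte. The variable names are made up for illustration.

```ruby
# "héllö world", written with \u escapes so the snippet does not depend
# on the encoding of the source file.
s = "h\u00E9ll\u00F6 world"

char_len = s.length    # number of characters
byte_len = s.bytesize  # number of bytes in UTF-8

# Assumption: the length prefix holds char_len, but all byte_len bytes
# are written. A reader that trusts the prefix consumes too few bytes.
payload   = s.b                              # binary (ASCII-8BIT) copy
truncated = payload.byteslice(0, char_len)   # what the reader takes
leftover  = payload.byteslice(char_len, byte_len - char_len)

puts "characters: #{char_len}, bytes: #{byte_len}"
puts "read back:  #{truncated.inspect}"
puts "left over:  #{leftover.inspect}"
```

Under that assumption the truncated value matches the {{example_field}} shown above, and the two leftover bytes are what get misread as the start of the next field.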

Also, if the leftover bytes happen to line up badly, the decoder reads a negative length and raises an {{ArgumentError}}:

{code}
> deserialize(serialize({'example_field'=> '‘hello’ world', 'other_field'=>'goodbye world'}, schema), schema)
ArgumentError: negative length -56 given
{code}
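
The -56 is consistent with a leftover string byte being read as a length prefix. Avro encodes lengths as zigzag varints, so a stray single byte decodes as sketched below. The helper name is hypothetical and the sketch only handles one-byte varints; it is not the library's decoder.

```ruby
# Decode a one-byte zigzag varint, as used for Avro int/long values.
# Hypothetical helper for illustration; multi-byte varints are rejected.
def zigzag_decode_single_byte(byte)
  raise ArgumentError, "multi-byte varint not handled in this sketch" if byte >= 0x80
  (byte >> 1) ^ -(byte & 1)
end

# If the reader stops after 13 bytes of "‘hello’ world" (17 bytes in
# UTF-8), the next unread byte is the "o" of "world".
puts zigzag_decode_single_byte("o".ord)

# In the first example the next unread byte is "l", which decodes to a
# positive length, so the read succeeds but returns the wrong data.
puts zigzag_decode_single_byte("l".ord)

# 0x1A is the correctly written prefix for "goodbye world".
puts zigzag_decode_single_byte(0x1A)
```

An odd byte value such as "o" (0x6F) maps to a negative number, which would explain why only some inputs raise while others silently corrupt the following field.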

This looks similar to a previous issue with the Perl implementation, reported in AVRO-1517.



