Posted to user@avro.apache.org by Ryon Day <ry...@yahoo.com> on 2015/08/28 20:52:44 UTC

Utf8 to String ClassCastException upon deserialization, setting properties on Union schema

Thanks in advance for any help anyone can offer me with this issue. Forgive the scattered presentation; I am new to Avro and am unsure how to phrase the exact issue I'm experiencing.
I have been using Java to read Avro messages off of a Kafka topic in a heterogeneous environment (the messages are put onto the Kafka queue by a Go client written by a different team; I have no control over what is put onto the wire or how).
My team controls the schema for the project; we have a .avsc schema file and use the "com.commercehub.gradle.plugin.avro" Gradle plug-in to generate our Java classes. Currently, I can deserialize messages into our auto-generated classes effectively. However, the Utf8 data type used in lieu of String is a pretty constant vexation: we use String quite often, CharSequence rarely, and Utf8 never. Hence, our code is cluttered with constant toString() calls on Utf8 objects, which makes navigating and understanding the source very cumbersome.
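To give a flavor of the clutter (illustrative only; these field names are made up):

    // Every string-valued getter returns a Utf8, so call sites end up like this:
    String userName = record.getUserName().toString();
    if ("create".equals(record.getEventType().toString())) {
        handleCreate(record);
    }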
The Gradle plugin in question has a "stringType" option which, when set to String, does indeed generate the domain classes with standard Strings instead of Utf8 objects. However, when I attempt to deserialize generic messages off the queue, I receive the following stack trace (we are using the Avro 1.7.7 Java library):
java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be cast to java.lang.String
    at com.employer.avro.schema.DomainObject.put(DomainObject.java:64)
    at org.apache.avro.generic.GenericData.setField(GenericData.java:573)
    at org.apache.avro.generic.GenericData.setField(GenericData.java:590)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
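For reference, the "stringType" option mentioned above is enabled in our build.gradle roughly like this (quoted from memory, so the exact syntax may differ):

    avro {
        stringType = "String"
    }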

I am using a similar method to the following to read byte arrays that come off the queue:
byte[] message = ReadTheQueueHere.....
SpecificDatumReader<SomeType> specificDatumReader = new SpecificDatumReader<>(ourSchema);
BinaryDecoder someTypeDecoder = DecoderFactory.get().binaryDecoder(message, null);
SomeType record = specificDatumReader.read(null, someTypeDecoder); // <-- Exception thrown here
Other things I have tried: 
I tried setting the "avro.java.string": "String" property on all String fields within the schema. This actually did nothing; the problem still occurs. In fact, the exception above occurs on a field that is explicitly marked with "avro.java.string": "String" in the schema file:
Java:

public void put(int field$, java.lang.Object value$) {
    ...
    case 3: some_field = (java.lang.String)value$; break;
    ...
}

Schema file:

...
{
    "name": "some_field",
    "type": "string",
    "avro.java.string": "String"
},
...
In fact, when I toString() the schema, the property shows up there too:

{"name":"some_field","type":{"type":"string","avro.java.string":"String"},"avro.java.string":"String"}
I tried using GenericData.setStringType(DomainObject.getClassSchema(), StringType.String). This also did nothing (but see below).

I tried setting the String type on the schema as a whole:

    Schema schema = new Schema.Parser().parse(this.getClass().getResourceAsStream("/ourSchema.avsc"));
    GenericData.setStringType(schema, StringType.String);

This resulted in the following exception:

    org.apache.avro.AvroRuntimeException: Can't set properties on a union: [Our Schema as a String]

I suppose this is because of some subtlety in our schema arising from our inexperience with the platform. From what little poking I've done, our entire schema (which contains several record types) is defined as a union, and I've found very little explanation of what that means or how to work with it.
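One follow-up I may try (untested; this assumes the parsed top-level schema really is a union): setting the string type on each branch of the union individually rather than on the union itself, which should at least avoid the AvroRuntimeException, though I don't know whether it actually influences decoding:

    Schema schema = new Schema.Parser().parse(this.getClass().getResourceAsStream("/ourSchema.avsc"));
    if (schema.getType() == Schema.Type.UNION) {
        // A union itself cannot hold properties, but its branches (our record schemas) can.
        for (Schema branch : schema.getTypes()) {
            GenericData.setStringType(branch, StringType.String);
        }
    }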
I worked up a small sample method to reproduce what I think is happening. I believe that the Go end of things is writing generic records to Kafka (I don't even know if this makes a difference).
public static SomeType encodeToSpecificRecord(GenericRecord someRecord, Schema writerSchema, Schema readerSchema) throws Exception {
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(writerSchema);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    datumWriter.write(someRecord, encoder);
    encoder.flush();
    byte[] inBytes = out.toByteArray();
    out.close();

    SpecificDatumReader<SomeType> reader = new SpecificDatumReader<>(readerSchema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(inBytes, null);
    SomeType result = reader.read(null, decoder);
    logger.debug("Read record {}", result.toString());
    return result;
}
The only way that I can get this to work is if I do the following:

    GenericData.setStringType(SomeType.getClassSchema(), StringType.String);

and call the method with SomeType.getClassSchema() as both the reader and writer schema.
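In other words, the only working combination looks like this (a sketch; someRecord is a placeholder for a record read off the queue):

    GenericData.setStringType(SomeType.getClassSchema(), StringType.String);
    SomeType result = encodeToSpecificRecord(someRecord, SomeType.getClassSchema(), SomeType.getClassSchema());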
This will not do, because I am pretty sure that this is not how the Go end of things is writing the data; they are most likely using the overarching schema as read from the file. If I mismatch the reader and writer schemas, I get the following:
java.lang.ArrayIndexOutOfBoundsException: 9 at org.apache.avro.io.ResolvingDecoder.readEnum(ResolvingDecoder.java:257)
Can anyone offer any assistance in untangling this problem? The constant String <-> Utf8 conversions clutter up the code, reduce efficiency, and hurt readability and maintainability.
Thank you very much!