Posted to user@avro.apache.org by Doug Cutting <cu...@apache.org> on 2012/06/07 19:03:21 UTC

Re: How to declare an optional field

It looks like you're perhaps using GenericData#toString() for output
and then JsonDecoder for input.  Unfortunately these do not encode
Avro data in JSON compatibly.

JsonEncoder/JsonDecoder are lossless (implementing the rules in
http://avro.apache.org/docs/current/spec.html#json_encoding) while
GenericData#toString() generates the JSON that most folks expect.  The
difference centers around unions.  Avro distinguishes between int and
long, between string, bytes, and enum, and between record and map; so
when these types are combined in a union it must tag them in the JSON
with the intended branch.  For example, if you have a record X with a
field "a" in the union ["X", {"type":"map", "values":"int"}], then Avro
wouldn't know which was meant when reading {"a":1}, so it must encode
this as {"X": {"a":1}} or {"map": {"a":1}} in order to tell.

Perhaps GenericData#toString() should use this encoding, but in many
cases folks want the simpler JSON when producing output that won't
be consumed by Avro.

If this is indeed what's causing you problems, the fix is to replace
your use of  GenericData#toString() with a DatumWriter that uses a
JsonEncoder.
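
For example, a minimal sketch of that replacement (variable names are
just illustrative; it assumes you already have the record's Schema and
a populated GenericRecord in hand):

  import java.io.ByteArrayOutputStream;
  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.EncoderFactory;
  import org.apache.avro.io.JsonEncoder;

  // Instead of: String json = record.toString();
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
  new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
  encoder.flush();
  String json = out.toString("UTF-8");  // union branches are now tagged

JSON produced this way can then be read back with a JsonDecoder and a
GenericDatumReader without the "Expected start-union" errors below.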

Cheers,

Doug

On Thu, Jun 7, 2012 at 1:48 AM, François Kawala <fk...@bestofmedia.com> wrote:
> Hello,
>
> First of all, thanks for your help. I've corrected my schema according to
> your advice, but I still have the same kind of issue:
>
> ________________________________
>
> With this schema:
>
> (...)
> {"name": "in_reply_to", "type": ["null", "long" ], "default": null },
> (...)
> {"name":"urls","type":["null",{"type":"array","items": (record) }]}
> (...)
>
> Using this schema, the following data:
>
> {"created_at": "Mon, 28 May 2012 00:01:25 +0000", "emitter": 405427230,
> "emitter_name": "CallmeOceane_", "geo": null, "hashtags": null,
> "in_reply_to": 206897508021055489,
> "lang": "fr", "msg": "@Chloe_OneD Aaaah puuuuutain j'ai toujours pas finis
> Wild Souls machin truc", "uid": 206897932501385217, "urls": null,
> "usermentions":
> [{"id": 288136906, "indices": [0, 11], "name": "Happiness \u10e6",
> "screen_name": "Chloe_OneD"}]}|
>
> Ends with this error:
>
> 2012-06-07 10:16:07,831 WARN org.apache.hadoop.streaming.PipeMapRed:
> org.apache.avro.AvroTypeException: Expected start-union. Got
> VALUE_NUMBER_INT
> 	at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:460)
> 	at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:418)
> 	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
> 	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> 	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
> 	at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
> 	at
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166)
> 	at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
> 	at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
> 	at
> com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:102)
> 	at
> com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:88)
> 	at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:446)
> 	at
> org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:421)
>
> ________________________________
>
> While using this data:
>
> {"created_at": "Mon, 28 May 2012 00:00:10 +0000", "emitter": 59809965,
> "emitter_name": "Droolius", "geo": null, "hashtags": null, "in_reply_to":
> null, "lang": "en", "msg":
> "RT @davidchang: Thank you again Amy Rowat &amp; team UCLA @scienceandfood :
> Umami Reverse Engineering + The Joy of MSG http://t.co/nk1QBGbg", "uid":
> 206897616326377472,
> "urls": [{"display_url": "bit.ly/KvD0QZ", "expanded_url":
> "http://bit.ly/KvD0QZ", "indices": [119, 139], "url":
> "http://t.co/nk1QBGbg"}],
> "usermentions": [{"id": 221185711, "indices": [3, 14], "name": "Dave Chang",
> "screen_name": "davidchang"},
> {"id": 526175293, "indices": [58, 73], "name": "UCLA Science & Food",
> "screen_name": "scienceandfood"}]}|
>
> It ends with:
>
> 2012-06-07 10:38:19,530 WARN org.apache.hadoop.streaming.PipeMapRed:
> org.apache.avro.AvroTypeException: Expected start-union. Got START_ARRAY
> 	at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:460)
> 	at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:418)
> 	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
> 	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> 	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
> 	at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
> 	at
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166)
> 	at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
> 	at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
> 	at
> com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:102)
> 	at
> com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:88)
> 	at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:446)
> 	at
> org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:421)
>
> ________________________________
>
> According to these error stacks, I guess that my problem has something to
> do with the custom output format, which relies on org.apache.avro.generic
> (and consequently on the strict Java implementation). Am I right?
>
> All the best, and thanks again for reading :)
>
> Regards,
> François.
>
>
>
>
>
> According to the spec, the default value for a union is assumed to have
> the type of the first element of the union.
>
> http://avro.apache.org/docs/current/spec.html#schema_record
>
> So some valid fields would be:
>
> {"name":"x", "type":["long", "null"], "default": 0}
> {"name":"y", "type":["null", "long"], "default": null}
>
> The following are invalid fields, since the type of the default value
> does not match that of the first union element.
>
> {"name":"x", "type":["long", "null"], "default": null}
> {"name":"y", "type":["null", "long"], "default": 0}
>
> Python may not implement this strictly, but Java does.
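>
> For what it's worth, here's a minimal, hypothetical sketch of where such
> a default actually gets used in Java: during schema resolution, when the
> reader's schema declares an optional field that the writer's schema
> lacks (field names below are only illustrative):
>
>   import java.io.ByteArrayOutputStream;
>   import org.apache.avro.Schema;
>   import org.apache.avro.generic.GenericData;
>   import org.apache.avro.generic.GenericDatumReader;
>   import org.apache.avro.generic.GenericDatumWriter;
>   import org.apache.avro.generic.GenericRecord;
>   import org.apache.avro.io.BinaryEncoder;
>   import org.apache.avro.io.Decoder;
>   import org.apache.avro.io.DecoderFactory;
>   import org.apache.avro.io.EncoderFactory;
>
>   public class OptionalFieldDefault {
>     public static void main(String[] args) throws Exception {
>       // Writer's schema: no optional field.
>       Schema writer = new Schema.Parser().parse(
>           "{\"type\":\"record\",\"name\":\"T\",\"fields\":["
>           + "{\"name\":\"msg\",\"type\":\"string\"}]}");
>       // Reader's schema: optional field declared as ["null","long"]
>       // with a null default, matching the first union branch.
>       Schema reader = new Schema.Parser().parse(
>           "{\"type\":\"record\",\"name\":\"T\",\"fields\":["
>           + "{\"name\":\"msg\",\"type\":\"string\"},"
>           + "{\"name\":\"in_reply_to\",\"type\":[\"null\",\"long\"],"
>           + "\"default\":null}]}");
>
>       // Write a record that has no in_reply_to field at all.
>       GenericRecord r = new GenericData.Record(writer);
>       r.put("msg", "hello");
>       ByteArrayOutputStream out = new ByteArrayOutputStream();
>       BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
>       new GenericDatumWriter<GenericRecord>(writer).write(r, enc);
>       enc.flush();
>
>       // Read it back with the reader's schema; the missing field takes
>       // its default value, null.
>       Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
>       GenericRecord read =
>           new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
>       System.out.println(read);  // prints roughly: {"msg": "hello", "in_reply_to": null}
>     }
>   }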
>
> This is a common point of confusion.  We should probably document it
> better.  I'm not sure whether it's causing the problem you're seeing,
> but perhaps it is.
>
> Cheers,
>
> Doug
>
> On 06/06/2012 04:15 AM, François Kawala wrote:
>> Dear all,
>>
>> Despite my desperate efforts to get a working schema, I cannot manage to
>> specify that a field of a given record can be either "a given type" or
>> "null". I've tried with unions, but the back-end that I have to use seems
>> to be unhappy with it. More precisely: I'm trying to output the result
>> of a Streaming MR job within an Avro container. This job is written in
>> Python and executed through dumbo (http://www.dumbotics.com), and a
>> custom OutputFormat is used
>>
>> (https://github.com/tomslabs/avro-utils/tree/master/src/main/java/com/tomslabs/grid/avro)
>>
>>
>> However, since this custom OutputFormat relies on org.apache.avro
>> sources, I thought this list could be a good spot to call for help.
>>
>> Thanks for reading,
>> François.
>>
>> ------------------------------------------------------------------------
>>
>> Here are some complementary elements:
>>
>> Fragment of the schema that I think is responsible for my troubles:
>>
>> {"name": "in_reply_to", "type": [{"type": "long"},"null"],
>> "default":"null"}
>>
>> I've also unsuccessfully tried:
>>
>> {"name": "in_reply_to", "type": [{"type": "long"},"null"]}
>> {"name": "in_reply_to", "type": ["null",{"type": "long"}]}
>>
>>     Each ending with the same error message :
>>
>>         org.apache.avro.AvroTypeException: Expected start-union. Got
>> VALUE_NUMBER_INT
>>
>>     Error Stack :
>>
>>     	at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:460)
>>     	at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:418)
>>     	at
>> org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
>>     	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>>     	at
>> org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
>>     	at
>> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
>>     	at
>> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166)
>>     	at
>> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
>>     	at
>> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
>>     	at
>> com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:102)
>>     	at
>> com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:88)
>>     	at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:446)
>>     	at
>> org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:421)
>>
>>
>> 	
>>
>>
>>
>>
>
>