Posted to user@avro.apache.org by David Arthur <mu...@gmail.com> on 2013/05/08 20:49:51 UTC

Jackson and Avro, nested schema

I'm attempting to use Jackson and Avro together to map JSON documents to 
a generated Avro class. I have looked at the Json schema included with 
Avro, but this requires a top-level "value" element which I don't want. 
Essentially, I have JSON documents that have a few typed top level 
fields, and one field called "fields" which is more or less arbitrary JSON.

I've reduced this down to strings and ints for simplicity.

My first attempt was:

  {
    "type": "record",
    "name": "Json",
    "fields": [
      {
        "name": "value",
        "type": [ "string", "int", {"type": "map", "values": "Json"} ]
      }
    ]
  },

  {
    "name": "Document",
    "type": "record",
    "fields": [
      {
        "name": "id",
        "type": "string"
      },
      {
        "name": "fields",
        "type": {"type": "map", "values": ["string", "int", {"type": "map", "values": "Json"}]}
      }
    ]
  }

Given a JSON document like:

{
   "id": "doc1",
   "fields": {
     "foo": "bar",
     "spam": "eggs",
     "answer": 42,
     "x": {"a": 1}
   }
}

Jackson maps this onto the generated class without complaint, so it seems
to work. However, when I turn around and try to serialize the resulting
object with Avro, I get the following exception:

java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.avro.generic.IndexedRecord
     at org.apache.avro.generic.GenericData.getField(GenericData.java:526)
     at org.apache.avro.generic.GenericData.getField(GenericData.java:541)
     at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
     at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
     at org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.java:173)
     at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:69)
     at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73)
     at org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.java:173)
     at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:69)
     at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:106)
     at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
     at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)

My best guess is that since the values of the "fields" map are a union,
they are represented in the generated class as Object, into which Jackson
happily puts whatever it finds.
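
The trace is consistent with that guess: the writer descends from the
Document record into the "fields" map, through the union, into the nested
"x" map, and then calls writeRecord. It therefore expects each value inside
"x" to be a "Json" record, but finds the bare Integer 1. The following is
only a sketch of one possible workaround, assuming the untyped maps are
post-processed before serialization. It uses only the JDK: JsonHolder is a
hypothetical stand-in for the generated Json class, and WrapSketch,
wrapValues, and fixFields are illustrative names, not part of Avro or
Jackson.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the generated "Json" record (one field, "value").
final class JsonHolder {
    final Object value;
    JsonHolder(Object value) { this.value = value; }
}

final class WrapSketch {
    // Wrap one value as a "Json" record; a nested map becomes map<Json> recursively.
    static JsonHolder wrap(Object v) {
        if (v instanceof Map) {
            return new JsonHolder(wrapValues((Map<?, ?>) v));
        }
        return new JsonHolder(v); // leaf: string or int in the reduced schema
    }

    // Build a map whose values are all "Json" records, matching
    // {"type": "map", "values": "Json"}.
    static Map<String, JsonHolder> wrapValues(Map<?, ?> raw) {
        Map<String, JsonHolder> out = new LinkedHashMap<>();
        for (Map.Entry<?, ?> e : raw.entrySet()) {
            out.put(String.valueOf(e.getKey()), wrap(e.getValue()));
        }
        return out;
    }

    // Top-level "fields" values are string | int | map<Json>, so leaves stay
    // as they are and only nested maps have their values wrapped.
    static Map<String, Object> fixFields(Map<String, ?> fields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, ?> e : fields.entrySet()) {
            Object v = e.getValue();
            out.put(e.getKey(), v instanceof Map ? wrapValues((Map<?, ?>) v) : v);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> x = new LinkedHashMap<>();
        x.put("a", 1);
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("foo", "bar");
        fields.put("answer", 42);
        fields.put("x", x);
        Map<?, ?> fixedX = (Map<?, ?>) fixFields(fields).get("x");
        System.out.println(fixedX.get("a").getClass().getSimpleName());
    }
}
```

With real Avro classes the wrapper would be the generated record (or a
generic record built against the Json schema) rather than JsonHolder; the
point is only that every value below the first level of nesting has to be
a record, not a bare Integer or String.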

If I change my schema to explicitly use "int" instead of the "Json"
type, it works fine for my test document:

         "type": {"type": "map", "values": ["string", "int", {"type": "map", "values": "int"}]}

However, I now need to enumerate the types for each level of nesting I
want. This is not ideal, and it limits me to a fixed level of nesting.

To be clear, my issue is not modelling my schema in Avro, but rather 
getting Jackson to map JSON onto the generated classes without too much 
pain. I have also tried 
https://github.com/FasterXML/jackson-dataformat-avro without much luck.

Any help is appreciated

-David

Re: Jackson and Avro, nested schema

Posted by Doug Cutting <cu...@apache.org>.
On Wed, May 8, 2013 at 11:49 AM, David Arthur <mu...@gmail.com> wrote:
> I have looked at the Json schema included with Avro, but this requires a
> top-level "value" element which I don't want.

There's code in Avro that will read and write Jackson JsonNode
directly, without creating any intermediate "value" structure.

http://avro.apache.org/docs/current/api/java/org/apache/avro/data/Json.html

One should be able to easily write a JsonParser and JsonGenerator that
read and write directly using this schema, so that Jackson's
ObjectCodec could then be used to read and write arbitrary Pojos to
Avro data files.

Doug

Re: Jackson and Avro, nested schema

Posted by Pankaj Shroff <sh...@gmail.com>.
It seems to me that you defined "fields" as an Array (an IndexedRecord) but
you provided input as a single Record. It might help if you change your
JSON document so that "fields" is an array with one element in it (notice
the additional square brackets [ ] for array notation):

"fields" : [  { "foo": "bar", "spam": "eggs",
                  "answer": 42,
                  "x": {"a": 1}
                 }
              ]

Have you tried this input, and if so, does it work?

Pankaj



-- 
Pankaj Shroff
shroffG@Gmail.com

Re: Jackson and Avro, nested schema

Posted by Scott Carey <sc...@apache.org>.
It appears that you will need to modify the JSON decoder in Avro to
achieve this.

Avro's JSON encoding was built to represent data of any Avro schema in JSON
with 100% fidelity, so that the decoder can read it back.  The decoder
does not work with arbitrary JSON.

This is because there are ambiguities:

In your example:
{
  "id": "doc1",
  "fields": {
    "foo": "bar",
    "spam": "eggs",
    "answer": 42,
    "x": {"a": 1}
  }
}


This can be interpreted by Avro in several ways.  Is the value of "fields"
a map or a record with four fields?  Is the value of "x" a map or a record
with one field?  Is "answer" an int, long, float, or double?  Is "doc1" a
string or a bytes literal?

If you want to bake in the assumption that it is "maps, all the way down",
you'll need to extend / modify the JSON Decoder.
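
As a rough illustration of what baking in that assumption means, here is a
JDK-only sketch that picks exactly one Avro type name for each parsed JSON
value. It is not Avro's decoder and touches no Avro classes. The class and
method names (MapsAllTheWayDown, avroTypeOf) are hypothetical, and the
specific choices (objects are always maps, text is always a string, whole
numbers take the narrowest integer type) are assumptions made for the
example.

```java
import java.util.List;
import java.util.Map;

final class MapsAllTheWayDown {
    // Resolve each ambiguity with one fixed rule (assumptions, not Avro behavior).
    static String avroTypeOf(Object v) {
        if (v == null) return "null";
        if (v instanceof String) return "string";   // never "bytes"
        if (v instanceof Boolean) return "boolean";
        if (v instanceof Integer) return "int";     // narrowest integer type
        if (v instanceof Long) return "long";
        if (v instanceof Float || v instanceof Double) return "double";
        if (v instanceof Map) return "map";         // never "record"
        if (v instanceof List) return "array";
        throw new IllegalArgumentException("unsupported value: " + v.getClass());
    }

    public static void main(String[] args) {
        System.out.println(avroTypeOf(42));
        System.out.println(avroTypeOf("doc1"));
    }
}
```

A decoder built on rules like these gives up the round-trip fidelity
described above: a record written as a JSON object would come back as a
map.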

It would be a useful contribution to have a generic JSON schema and
decoder for it.  We could have a "JSON" schema record (one field, a union
of null, string, double, and map of string to self) and this type's field
would automatically be un-nested by the special JSON decoder and not
interpreted as a record.

-Scott
