Posted to user@avro.apache.org by "Gregory (Grisha) Trubetskoy" <gr...@apache.org> on 2013/05/22 23:26:44 UTC

Newb question on importing JSON and defaults

Hello!

I have a test.json file that looks like this:

{"first":"John", "last":"Doe", "middle":"C"}
{"first":"John", "last":"Doe"}

(Second line does NOT have a "middle" element).

And I have a test.schema file that looks like this:

{"name":"test",
  "type":"record",
  "fields": [
     {"name":"first",  "type":"string"},
     {"name":"middle", "type":"string", "default":""},
     {"name":"last",   "type":"string"}
]}

I then try to use fromjson, as follows, and it chokes on the second line:

$ java -jar avro-tools-1.7.4.jar fromjson --schema-file test.schema test.json > test.avro
Exception in thread "main" org.apache.avro.AvroTypeException: Expected field name not found: middle
         at org.apache.avro.io.JsonDecoder.doAction(JsonDecoder.java:477)
         at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
         at org.apache.avro.io.JsonDecoder.advance(JsonDecoder.java:139)
         at org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:219)
         at org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:214)
         at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
         at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
         at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:341)
         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
         at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
         at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:105)
         at org.apache.avro.tool.Main.run(Main.java:80)
         at org.apache.avro.tool.Main.main(Main.java:69)


The short story is: I need to convert a bunch of JSON where an element 
may sometimes be missing, in which case I'd want it to default to 
something sensible, e.g. blank or null.

According to the Schema Resolution section of the spec, "if the reader's 
record schema has a field that contains a default value, and writer's 
schema does not have a field with the same name, then the reader should 
use the default value from its field."
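
As a minimal sketch of what that rule means with the Avro Java API (my own
illustration, assuming a writer schema that simply omits "middle"), a
resolving GenericDatumReader fills in the reader's default:

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ResolveDefaultDemo {
  public static void main(String[] args) throws Exception {
    // Writer schema: no "middle" field at all.
    Schema writer = new Schema.Parser().parse(
        "{\"name\":\"test\",\"type\":\"record\",\"fields\":["
        + "{\"name\":\"first\",\"type\":\"string\"},"
        + "{\"name\":\"last\",\"type\":\"string\"}]}");
    // Reader schema: same record, plus "middle" defaulting to "".
    Schema reader = new Schema.Parser().parse(
        "{\"name\":\"test\",\"type\":\"record\",\"fields\":["
        + "{\"name\":\"first\",\"type\":\"string\"},"
        + "{\"name\":\"middle\",\"type\":\"string\",\"default\":\"\"},"
        + "{\"name\":\"last\",\"type\":\"string\"}]}");

    // Write one record with the writer schema (binary, in memory).
    GenericRecord rec = new GenericData.Record(writer);
    rec.put("first", "John");
    rec.put("last", "Doe");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
    enc.flush();

    // Read it back, resolving writer -> reader; "middle" gets the default.
    GenericDatumReader<GenericRecord> resolving =
        new GenericDatumReader<GenericRecord>(writer, reader);
    GenericRecord resolved = resolving.read(null,
        DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
    System.out.println(resolved);  // "middle" comes back as ""
  }
}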

I'm clearly missing something obvious, any help would be appreciated!

Grisha


Re: Newb question on importing JSON and defaults

Posted by "Gregory (Grisha) Trubetskoy" <gr...@apache.org>.
Thanks Scott!

So it looks like "fromjson" is mainly meant for processing JSON generated 
by "tojson", not as a general JSON importing tool (although it can be used 
as one) - it's probably my short attention span, but somehow that point 
got lost on me. (As I later learned, it also seems that the schema fromjson 
accepts is a simplified one - e.g. a schema with a union gives an error on 
plain JSON input.)

So if I expect to be dealing with data coming in as JSON and need to 
convert it to Avro - is the current "best practice" to write a program of 
your own? This seems like a fairly common thing to do; if there isn't a 
general tool, perhaps this would be something useful to hack on for the 
Avro project...
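
For what it's worth, a rough sketch of what such a program might look like
(the class name and file paths are made up; Jackson does the JSON parsing,
and GenericRecordBuilder fills any unset field from the schema default):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.Iterator;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonToAvro {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(new File("test.schema"));
    ObjectMapper mapper = new ObjectMapper();

    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("test.avro"));

    BufferedReader in = new BufferedReader(new FileReader("test.json"));
    String line;
    while ((line = in.readLine()) != null) {
      JsonNode node = mapper.readTree(line);
      GenericRecordBuilder builder = new GenericRecordBuilder(schema);
      Iterator<Map.Entry<String, JsonNode>> fields = node.fields();
      while (fields.hasNext()) {
        Map.Entry<String, JsonNode> e = fields.next();
        if (schema.getField(e.getKey()) != null) {
          builder.set(e.getKey(), e.getValue().asText()); // strings only in this toy
        }
      }
      // Unset fields fall back to their schema defaults; a field with no
      // default that is also missing from the input will throw here.
      writer.append(builder.build());
    }
    in.close();
    writer.close();
  }
}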

Grisha


Re: Newb question on importing JSON and defaults

Posted by Scott Carey <sc...@apache.org>.


There are two things that seem to be missing here:
 1. The fromjson tool configures both the "writer's schema" and the
reader's schema to be the one you provided.  Avro expects every JSON
fragment you give it to conform to that same schema.
 2. The tool will not work for arbitrary JSON; it expects JSON in the
format that the Avro JSON Encoder writes.  There are a few differences
in expectations, primarily around disambiguating union types, and maps
from records (a short example of the union encoding follows).
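
To make the union point concrete, here is a small sketch (my own, not from
the tool's docs): if "middle" were declared as the union ["null","string"],
Avro's JSON encoder wraps the value in a type tag instead of writing a bare
string, and the JSON decoder expects that same wrapped form back:

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class UnionJsonDemo {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"name\":\"test\",\"type\":\"record\",\"fields\":["
        + "{\"name\":\"first\",\"type\":\"string\"},"
        + "{\"name\":\"middle\",\"type\":[\"null\",\"string\"],\"default\":null},"
        + "{\"name\":\"last\",\"type\":\"string\"}]}");

    GenericRecord rec = new GenericData.Record(schema);
    rec.put("first", "John");
    rec.put("middle", "C");
    rec.put("last", "Doe");

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    JsonEncoder enc = EncoderFactory.get().jsonEncoder(schema, out);
    new GenericDatumWriter<GenericRecord>(schema).write(rec, enc);
    enc.flush();
    // Prints roughly: {"first":"John","middle":{"string":"C"},"last":"Doe"}
    // -- not {"middle":"C"} as hand-written JSON would have it.
    System.out.println(out.toString());
  }
}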

To perform schema evolution while reading, you may need to separate the
JSON fragments missing "middle" from those that have it, and run the tool
twice, with a corresponding schema for each case (example commands below).
Alternatively, the tool could be modified to handle schema resolution, or
to deal with different JSON encodings as well
(tools/src/main/java/org/apache/avro/tool/DataFileWriteTool).
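
For example, something along these lines (the split input files and
no-middle.schema are made up; no-middle.schema would be the same record
minus the "middle" field):

$ java -jar avro-tools-1.7.4.jar fromjson --schema-file test.schema with-middle.json > part1.avro
$ java -jar avro-tools-1.7.4.jar fromjson --schema-file no-middle.schema no-middle.json > part2.avro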

Alternatively, you can avoid schema resolution at import time and write
two files, one per schema, after separating the records.  Then you can
deal with schema resolution in a later pass over the data with other tools
(e.g. a data file reader plus writer), or resolve the data into the schema
you want lazily, only when you read it.
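
A rough sketch of that later pass (file names are made up): open the file
that was written without "middle" using the full schema as the datum
reader's expected schema, and copy the resolved records into a new file:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class ResolvePass {
  public static void main(String[] args) throws Exception {
    // Full schema: includes "middle" with its default.
    Schema full = new Schema.Parser().parse(new File("test.schema"));

    // The expected (reader) schema triggers normal schema resolution;
    // the writer's schema is taken from the data file itself.
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<GenericRecord>(null, full);
    DataFileReader<GenericRecord> in = new DataFileReader<GenericRecord>(
        new File("written-without-middle.avro"), datumReader);

    DataFileWriter<GenericRecord> out = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(full));
    out.create(full, new File("resolved.avro"));

    for (GenericRecord rec : in) {
      out.append(rec); // "middle" is now present, filled with the default ""
    }
    out.close();
    in.close();
  }
}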



>
>Grisha
>