Posted to issues@orc.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2017/05/25 15:45:04 UTC

[jira] [Commented] (ORC-200) json-schema and convert commands should support schema evolution of json documents

    [ https://issues.apache.org/jira/browse/ORC-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024887#comment-16024887 ] 

Owen O'Malley commented on ORC-200:
-----------------------------------

Actually, how will it create trouble? The schema evolution part of the reader will map the columns by name, assuming that the reader passes down the schema it wants to read with. That said, I'm not against preserving the order of the fields instead of sorting them. You'll just have different issues in the common case where the writer of the JSON documents doesn't pick a particular order for the attributes. Manually comparing schemas becomes much more annoying in that case.

Take a look at what I've been doing on the converter in [Owen's orc-199|https://github.com/omalley/orc/tree/orc-199], which adds a CSV reader to the converter. In particular, I extended the schema discoverer with the ability to merge in the schema directly. It will still lose on some things like maps.
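
As a rough illustration of why mapping columns by name makes field order irrelevant, here is a minimal sketch. It is not the ORC reader's schema evolution code; the class NameMappingSketch and the method mapByName are hypothetical names made up for this example, and the column lists are a subset of the fields from the issue below.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not taken from the ORC code base.
class NameMappingSketch {
  // For each column the reader asks for, return the index of the
  // same-named column in the file, or -1 if the file does not have it.
  static int[] mapByName(List<String> fileColumns, List<String> readerColumns) {
    int[] result = new int[readerColumns.size()];
    for (int i = 0; i < result.length; i++) {
      result[i] = fileColumns.indexOf(readerColumns.get(i));
    }
    return result;
  }

  public static void main(String[] args) {
    // Columns of a file written from the v1 documents (subset).
    List<String> file = Arrays.asList("about", "address", "age", "name", "phone");
    // Columns the reader wants after newField was added in v2.
    List<String> reader =
        Arrays.asList("about", "address", "age", "name", "newField", "phone");
    // prints [0, 1, 2, 3, -1, 4]
    System.out.println(Arrays.toString(mapByName(file, reader)));
  }
}
```

In this sketch newField lands in the middle of the reader schema, yet every existing column still resolves to the right file column because the lookup is by name and not by position.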

> json-schema and convert commands should support schema evolution of json documents
> ----------------------------------------------------------------------------------
>
>                 Key: ORC-200
>                 URL: https://issues.apache.org/jira/browse/ORC-200
>             Project: ORC
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 1.5.0
>            Reporter: Shawn Hooton
>            Assignee: Shawn Hooton
>         Attachments: example-v1.json, example-v2.json
>
>
> Using the command (sample payloads attached):
> java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema  -t ~/example-v1.json
> Produces the following output:
> create table tbl (
>   about string,
>   address string,
>   age tinyint,
>   balance string,
>   company string,
>   email string,
>   eyeColor string,
>   favoriteFruit string,
>   friends array <struct <
>       id: tinyint,
>       name: string>>,
>   gender string,
>   greeting string,
>   guid string,
>   id binary,
>   index tinyint,
>   isActive boolean,
>   latitude decimal(8,6),
>   longitude decimal(8,6),
>   name string,
>   phone string,
>   picture string,
>   registered timestamp,
>   tags array <string>
> )
> Notice that because org/apache/orc/tools/json/StructType.java uses a java.util.TreeMap for the fields instance variable, the generated DDL is sorted alphabetically rather than in the order the fields appear in the document.  This causes problems for the convert command as well.
> java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar meta -j -p output-v1.orc
> <output omitted for brevity>
>   "schemaString": "struct<about:string,address:string,age:tinyint,balance:string,company:string,email:string,eyeColor:string,favoriteFruit:string,friends:array<struct<id:tinyint,name:string>>,gender:string,greeting:string,guid:string,id:binary,index:tinyint,isActive:boolean,latitude:decimal(8,6),longitude:decimal(8,6),name:string,phone:string,picture:string,registered:timestamp,tags:array<string>>",
>   "schema": [
>     {
>       "columnId": 0,
>       "columnType": "STRUCT",
>       "childColumnNames": [
>         "about",
>         "address",
>         "age",
>         "balance",
>         "company",
>         "email",
>         "eyeColor",
>         "favoriteFruit",
>         "friends",
>         "gender",
>         "greeting",
>         "guid",
>         "id",
>         "index",
>         "isActive",
>         "latitude",
>         "longitude",
>         "name",
>         "phone",
>         "picture",
>         "registered",
>         "tags"
>       ],
> <output omitted for brevity>
> This causes *major* problems when a field is added to the JSON document later
> e.g.
> java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema  -t ~/example-v2.json
> Examine where the newField field is added in the example-v2.json document and then examine the output below.  This also affects the convert command.
> create table tbl (
>   about string,
>   address string,
>   age tinyint,
>   balance string,
>   company string,
>   email string,
>   eyeColor string,
>   favoriteFruit string,
>   friends array <struct <
>       id: tinyint,
>       name: string>>,
>   gender string,
>   greeting string,
>   guid string,
>   id binary,
>   index tinyint,
>   isActive boolean,
>   latitude decimal(8,6),
>   longitude decimal(8,6),
>   name string,
>   newField string,
>   phone string,
>   picture string,
>   registered timestamp,
>   tags array <string>
> )
> The org/apache/orc/tools/json/StructType.java class should use java.util.LinkedHashMap for the fields instance variable so order is maintained across changes to the JSON schema.
> Pull request *with* test cases incoming :)
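
The ordering difference described in the issue can be reproduced with the two map types alone. The following is a minimal sketch, not the StructType code itself; the class FieldOrderDemo, the method fieldOrder, and the document order used for the fields are assumptions made up for the example (the attached JSON files are not shown here).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not taken from the ORC code base.
class FieldOrderDemo {
  // Adds the discovered fields in the order given and returns the
  // order in which the map iterates over their names.
  static List<String> fieldOrder(Map<String, String> fields, String[][] discovered) {
    for (String[] field : discovered) {
      fields.put(field[0], field[1]);
    }
    return new ArrayList<>(fields.keySet());
  }

  public static void main(String[] args) {
    // Assumed document order; newField is appended last.
    String[][] discovered = {
      {"id", "binary"}, {"index", "tinyint"}, {"guid", "string"}, {"newField", "string"}
    };
    // TreeMap sorts by key: prints [guid, id, index, newField]
    System.out.println(fieldOrder(new TreeMap<>(), discovered));
    // LinkedHashMap keeps insertion order: prints [id, index, guid, newField]
    System.out.println(fieldOrder(new LinkedHashMap<>(), discovered));
  }
}
```

With the sorted map, a newly added field is placed wherever its name sorts; with the insertion-ordered map, it stays where it was first seen.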


