Posted to issues@orc.apache.org by "Shawn Hooton (JIRA)" <ji...@apache.org> on 2017/05/25 05:27:04 UTC
[jira] [Updated] (ORC-200) json-schema and convert commands should support schema evolution of json documents
[ https://issues.apache.org/jira/browse/ORC-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shawn Hooton updated ORC-200:
-----------------------------
Description:
Using the command (sample payloads attached):
java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema -t ~/example-v1.json
Produces the following output:
create table tbl (
about string,
address string,
age tinyint,
balance string,
company string,
email string,
eyeColor string,
favoriteFruit string,
friends array <struct <
id: tinyint,
name: string>>,
gender string,
greeting string,
guid string,
id binary,
index tinyint,
isActive boolean,
latitude decimal(8,6),
longitude decimal(8,6),
name string,
phone string,
picture string,
registered timestamp,
tags array <string>
)
Notice that because org/apache/orc/tools/json/StructType.java uses a java.util.TreeMap for the fields instance variable, the generated DDL is sorted alphabetically by field name rather than following the order in which the fields appear in the JSON document. This causes problems for the convert command as well:
java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar meta -j -p output-v1.orc
<output omitted for brevity>
"schemaString": "struct<about:string,address:string,age:tinyint,balance:string,company:string,email:string,eyeColor:string,favoriteFruit:string,friends:array<struct<id:tinyint,name:string>>,gender:string,greeting:string,guid:string,id:binary,index:tinyint,isActive:boolean,latitude:decimal(8,6),longitude:decimal(8,6),name:string,phone:string,picture:string,registered:timestamp,tags:array<string>>",
"schema": [
{
"columnId": 0,
"columnType": "STRUCT",
"childColumnNames": [
"about",
"address",
"age",
"balance",
"company",
"email",
"eyeColor",
"favoriteFruit",
"friends",
"gender",
"greeting",
"guid",
"id",
"index",
"isActive",
"latitude",
"longitude",
"name",
"phone",
"picture",
"registered",
"tags"
],
<output omitted for brevity>
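To illustrate the ordering behaviour described above, here is a minimal standalone sketch. It is not the StructType code itself; the class name, the helper method, and the order in which the four sample fields are "discovered" are all assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class, not part of orc-tools.
class FieldOrderDemo {
    // Puts each {name, type} pair into the given map and returns the
    // order in which the map then iterates over the field names.
    static List<String> iterationOrder(Map<String, String> fields, String[][] discovered) {
        for (String[] field : discovered) {
            fields.put(field[0], field[1]);
        }
        return new ArrayList<>(fields.keySet());
    }

    public static void main(String[] args) {
        // Assumed document order; the real order in example-v1.json may differ.
        String[][] discovered = {
            {"id", "binary"}, {"index", "tinyint"}, {"guid", "string"}, {"isActive", "boolean"}
        };
        // TreeMap iterates in sorted key order: [guid, id, index, isActive]
        System.out.println(iterationOrder(new TreeMap<String, String>(), discovered));
        // LinkedHashMap iterates in insertion order: [id, index, guid, isActive]
        System.out.println(iterationOrder(new LinkedHashMap<String, String>(), discovered));
    }
}
```

The sorted result matches the guid, id, index, isActive sequence seen in the generated DDL, regardless of the order in which the fields were added.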
The alphabetical ordering causes *major* problems when a field is added to the JSON document later, e.g.:
java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema -t ~/example-v2.json
Compare where the newField field is added in the example-v2.json document with where it lands in the output below (marked with *). This also affects the convert command.
create table tbl (
about string,
address string,
age tinyint,
balance string,
company string,
email string,
eyeColor string,
favoriteFruit string,
friends array <struct <
id: tinyint,
name: string>>,
gender string,
greeting string,
guid string,
id binary,
index tinyint,
isActive boolean,
latitude decimal(8,6),
longitude decimal(8,6),
name string,
* newField string,
phone string,
picture string,
registered timestamp,
tags array <string>
)
The org/apache/orc/tools/json/StructType.java class should use java.util.LinkedHashMap for the fields instance variable so that field order is maintained across changes to the JSON schema.
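As a rough sketch of why this matters for schema evolution, the snippet below compares the position of an existing column before and after a field is added. The class and method names are hypothetical, every field is given the type string for simplicity, and it assumes newField is appended at the end of the document, which may not match the attached example-v2.json.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class, not part of orc-tools.
class SchemaEvolutionDemo {
    // Registers each field name in the given map and returns the resulting column order.
    static List<String> columns(Map<String, String> fields, List<String> names) {
        for (String name : names) {
            fields.put(name, "string");
        }
        return new ArrayList<>(fields.keySet());
    }

    public static void main(String[] args) {
        List<String> v1 = Arrays.asList("name", "phone", "picture", "tags");
        // Assumption: the new field is appended at the end of the document.
        List<String> v2 = Arrays.asList("name", "phone", "picture", "tags", "newField");

        int sortedBefore = columns(new TreeMap<String, String>(), v1).indexOf("phone");
        int sortedAfter = columns(new TreeMap<String, String>(), v2).indexOf("phone");
        int orderedBefore = columns(new LinkedHashMap<String, String>(), v1).indexOf("phone");
        int orderedAfter = columns(new LinkedHashMap<String, String>(), v2).indexOf("phone");

        // With TreeMap, newField sorts between name and phone, so phone moves.
        System.out.println("TreeMap: phone " + sortedBefore + " -> " + sortedAfter);
        // With LinkedHashMap, the existing columns keep their positions.
        System.out.println("LinkedHashMap: phone " + orderedBefore + " -> " + orderedAfter);
    }
}
```

Under these assumptions the TreeMap variant moves phone from position 1 to position 2, while the LinkedHashMap variant leaves it at position 1.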
Pull request *with* test cases incoming :)
was:
Using the command (sample payloads attached):
java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema -t ~/example-v1.json
Produces the following output:
create table tbl (
about string,
address string,
age tinyint,
balance string,
company string,
email string,
eyeColor string,
favoriteFruit string,
friends array <struct <
id: tinyint,
name: string>>,
gender string,
greeting string,
guid string,
id binary,
index tinyint,
isActive boolean,
latitude decimal(8,6),
longitude decimal(8,6),
name string,
phone string,
picture string,
registered timestamp,
tags array <string>
)
Notice that because org/apache/orc/tools/json/StructType.java uses a java.util.TreeMap for the fields instance variable the generated DDL is sorted alphabetically and not ordered by structure. This causes problems for the convert command as well.
java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar meta -j -p output-v1.orc
*** output omitted for brevity
"schemaString": "struct<about:string,address:string,age:tinyint,balance:string,company:string,email:string,eyeColor:string,favoriteFruit:string,friends:array<struct<id:tinyint,name:string>>,gender:string,greeting:string,guid:string,id:binary,index:tinyint,isActive:boolean,latitude:decimal(8,6),longitude:decimal(8,6),name:string,phone:string,picture:string,registered:timestamp,tags:array<string>>",
"schema": [
{
"columnId": 0,
"columnType": "STRUCT",
"childColumnNames": [
"about",
"address",
"age",
"balance",
"company",
"email",
"eyeColor",
"favoriteFruit",
"friends",
"gender",
"greeting",
"guid",
"id",
"index",
"isActive",
"latitude",
"longitude",
"name",
"phone",
"picture",
"registered",
"tags"
],
*** output omitted for brevity
This causes *major* problems when a field is added to the JSON document later
e.g.
java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema -t ~/example-v2.json
Examine where the newField field is added in the example-v2.json document and then examine the output below. This also affects the convert command.
create table tbl (
about string,
address string,
age tinyint,
balance string,
company string,
email string,
eyeColor string,
favoriteFruit string,
friends array <struct <
id: tinyint,
name: string>>,
gender string,
greeting string,
guid string,
id binary,
index tinyint,
isActive boolean,
latitude decimal(8,6),
longitude decimal(8,6),
name string,
***** newField string,
phone string,
picture string,
registered timestamp,
tags array <string>
)
The org/apache/orc/tools/json/StructType.java class should use java.util.LinkedHashMap for the fields instance variable so order is maintained across changes to the JSON schema.
Pull request *with* test cases incoming :)
> json-schema and convert commands should support schema evolution of json documents
> ----------------------------------------------------------------------------------
>
> Key: ORC-200
> URL: https://issues.apache.org/jira/browse/ORC-200
> Project: ORC
> Issue Type: Bug
> Components: Java
> Affects Versions: 1.5.0
> Reporter: Shawn Hooton
> Assignee: Shawn Hooton
> Attachments: example-v1.json, example-v2.json
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)