Posted to issues@orc.apache.org by "Shawn Hooton (JIRA)" <ji...@apache.org> on 2017/05/25 05:27:04 UTC
[jira] [Updated] (ORC-200) json-schema and convert commands should support schema evolution of json documents

     [ https://issues.apache.org/jira/browse/ORC-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shawn Hooton updated ORC-200:
-----------------------------
    Description: 
Using the command (sample payloads attached):
java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema  -t ~/example-v1.json

Produces the following output:
create table tbl (
  about string,
  address string,
  age tinyint,
  balance string,
  company string,
  email string,
  eyeColor string,
  favoriteFruit string,
  friends array <struct <
      id: tinyint,
      name: string>>,
  gender string,
  greeting string,
  guid string,
  id binary,
  index tinyint,
  isActive boolean,
  latitude decimal(8,6),
  longitude decimal(8,6),
  name string,
  phone string,
  picture string,
  registered timestamp,
  tags array <string>
)

Because org/apache/orc/tools/json/StructType.java uses a java.util.TreeMap for its fields instance variable, the generated DDL is sorted alphabetically by field name instead of following the order of the fields in the JSON document.  The convert command is affected in the same way, as the metadata of a converted file shows:
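
The ordering difference can be reproduced outside of ORC with a few lines of plain Java.  This is only a sketch: the FieldOrderDemo class, the keyOrder helper and the four field names in "document order" are made up for illustration and are not taken from StructType.java or from the attached payloads.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FieldOrderDemo {
  // Hypothetical helper: registers each field name in the given map
  // (every field gets the placeholder type "string") and returns the
  // order in which the map iterates over its keys.
  static List<String> keyOrder(Map<String, String> fields, String... names) {
    for (String name : names) {
      fields.put(name, "string");
    }
    return new ArrayList<>(fields.keySet());
  }

  public static void main(String[] args) {
    // Assumed order of the fields as they appear in a JSON document.
    String[] documentOrder = {"index", "guid", "isActive", "balance"};

    // TreeMap iterates in the natural (alphabetical) order of its keys.
    System.out.println(keyOrder(new TreeMap<>(), documentOrder));
    // prints [balance, guid, index, isActive]

    // LinkedHashMap iterates in insertion order.
    System.out.println(keyOrder(new LinkedHashMap<>(), documentOrder));
    // prints [index, guid, isActive, balance]
  }
}
```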

java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar meta -j -p output-v1.orc

<output omitted for brevity>

  "schemaString": "struct<about:string,address:string,age:tinyint,balance:string,company:string,email:string,eyeColor:string,favoriteFruit:string,friends:array<struct<id:tinyint,name:string>>,gender:string,greeting:string,guid:string,id:binary,index:tinyint,isActive:boolean,latitude:decimal(8,6),longitude:decimal(8,6),name:string,phone:string,picture:string,registered:timestamp,tags:array<string>>",
  "schema": [
    {
      "columnId": 0,
      "columnType": "STRUCT",
      "childColumnNames": [
        "about",
        "address",
        "age",
        "balance",
        "company",
        "email",
        "eyeColor",
        "favoriteFruit",
        "friends",
        "gender",
        "greeting",
        "guid",
        "id",
        "index",
        "isActive",
        "latitude",
        "longitude",
        "name",
        "phone",
        "picture",
        "registered",
        "tags"
      ],
<output omitted for brevity>

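This causes *major* problems when a field is added to the JSON document later, because the new field is sorted into the middle of the schema instead of being appended at the end.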

e.g.
java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema  -t ~/example-v2.json

Examine where the newField field is added in the example-v2.json document, then compare that with its position (marked with *) in the output below.  The convert command is affected as well.

create table tbl (
  about string,
  address string,
  age tinyint,
  balance string,
  company string,
  email string,
  eyeColor string,
  favoriteFruit string,
  friends array <struct <
      id: tinyint,
      name: string>>,
  gender string,
  greeting string,
  guid string,
  id binary,
  index tinyint,
  isActive boolean,
  latitude decimal(8,6),
  longitude decimal(8,6),
  name string,
  * newField string,
  phone string,
  picture string,
  registered timestamp,
  tags array <string>
)

The org/apache/orc/tools/json/StructType.java class should use java.util.LinkedHashMap for the fields instance variable so order is maintained across changes to the JSON schema.
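
To illustrate why the map type matters for schema evolution, here is a small self-contained sketch.  The SchemaEvolutionDemo class, its columns and isPrefix helpers, and the shortened field lists are hypothetical and are not part of orc-tools.  The sketch also assumes that the new field is appended at the end of the v2 document; a LinkedHashMap only preserves whatever order the document itself uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

public class SchemaEvolutionDemo {
  // Hypothetical helper: builds the column order a schema would get if the
  // field names were collected into the map produced by mapFactory.
  static List<String> columns(Supplier<Map<String, String>> mapFactory,
                              List<String> fieldNames) {
    Map<String, String> fields = mapFactory.get();
    for (String name : fieldNames) {
      fields.put(name, "string");
    }
    return new ArrayList<>(fields.keySet());
  }

  // True when every column of the older schema keeps its position in the
  // newer schema, i.e. new columns were only appended at the end.
  static boolean isPrefix(List<String> older, List<String> newer) {
    return newer.size() >= older.size()
        && newer.subList(0, older.size()).equals(older);
  }

  public static void main(String[] args) {
    // Assumed document order; v2 appends newField at the end.
    List<String> v1 = Arrays.asList("name", "phone", "picture");
    List<String> v2 = Arrays.asList("name", "phone", "picture", "newField");

    // TreeMap: newField is sorted between name and phone, so phone and
    // picture move to different positions.
    System.out.println(isPrefix(columns(TreeMap::new, v1),
                                columns(TreeMap::new, v2)));
    // prints false

    // LinkedHashMap: the v1 columns keep their positions in v2.
    System.out.println(isPrefix(columns(LinkedHashMap::new, v1),
                                columns(LinkedHashMap::new, v2)));
    // prints true
  }
}
```

With insertion order preserved, the v1 columns stay a prefix of the v2 columns, which is the property a consumer that matches columns by position would rely on.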

Pull request *with* test cases incoming :)

  was:
Using the command (sample payloads attached):
java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema  -t ~/example-v1.json

Produces the following output:
create table tbl (
  about string,
  address string,
  age tinyint,
  balance string,
  company string,
  email string,
  eyeColor string,
  favoriteFruit string,
  friends array <struct <
      id: tinyint,
      name: string>>,
  gender string,
  greeting string,
  guid string,
  id binary,
  index tinyint,
  isActive boolean,
  latitude decimal(8,6),
  longitude decimal(8,6),
  name string,
  phone string,
  picture string,
  registered timestamp,
  tags array <string>
)

Notice that because org/apache/orc/tools/json/StructType.java uses a java.util.TreeMap for the fields instance variable the generated DDL is sorted alphabetically and not ordered by structure.  This causes problems for the convert command as well.

java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar meta -j -p output-v1.orc

*** output omitted for brevity

  "schemaString": "struct<about:string,address:string,age:tinyint,balance:string,company:string,email:string,eyeColor:string,favoriteFruit:string,friends:array<struct<id:tinyint,name:string>>,gender:string,greeting:string,guid:string,id:binary,index:tinyint,isActive:boolean,latitude:decimal(8,6),longitude:decimal(8,6),name:string,phone:string,picture:string,registered:timestamp,tags:array<string>>",
  "schema": [
    {
      "columnId": 0,
      "columnType": "STRUCT",
      "childColumnNames": [
        "about",
        "address",
        "age",
        "balance",
        "company",
        "email",
        "eyeColor",
        "favoriteFruit",
        "friends",
        "gender",
        "greeting",
        "guid",
        "id",
        "index",
        "isActive",
        "latitude",
        "longitude",
        "name",
        "phone",
        "picture",
        "registered",
        "tags"
      ],
*** output omitted for brevity

This causes *major* problems when a field is added to the JSON document later

e.g.
java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema  -t ~/example-v2.json

Examine where the newField field is added in the example-v2.json document and then examine the output below.  This also affects the convert command.

create table tbl (
  about string,
  address string,
  age tinyint,
  balance string,
  company string,
  email string,
  eyeColor string,
  favoriteFruit string,
  friends array <struct <
      id: tinyint,
      name: string>>,
  gender string,
  greeting string,
  guid string,
  id binary,
  index tinyint,
  isActive boolean,
  latitude decimal(8,6),
  longitude decimal(8,6),
  name string,
*****  newField string,
  phone string,
  picture string,
  registered timestamp,
  tags array <string>
)

The org/apache/orc/tools/json/StructType.java class should use java.util.LinkedHashMap for the fields instance variable so order is maintained across changes to the JSON schema.

Pull request *with* test cases incoming :)


> json-schema and convert commands should support schema evolution of json documents
> ----------------------------------------------------------------------------------
>
>                 Key: ORC-200
>                 URL: https://issues.apache.org/jira/browse/ORC-200
>             Project: ORC
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 1.5.0
>            Reporter: Shawn Hooton
>            Assignee: Shawn Hooton
>         Attachments: example-v1.json, example-v2.json



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)