Posted to dev@gobblin.apache.org by "Tilak Patidar (JIRA)" <ji...@apache.org> on 2019/01/24 14:21:00 UTC

[jira] [Commented] (GOBBLIN-571) JsonIntermediateToParquetGroupConverter generates wrong parquet schema for complex types such as enums, arrays and maps

    [ https://issues.apache.org/jira/browse/GOBBLIN-571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751171#comment-16751171 ] 

Tilak Patidar commented on GOBBLIN-571:
---------------------------------------

Is this issue assigned yet?

> JsonIntermediateToParquetGroupConverter generates wrong parquet schema for complex types such as enums, arrays and maps
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: GOBBLIN-571
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-571
>             Project: Apache Gobblin
>          Issue Type: Bug
>            Reporter: Tilak Patidar
>            Priority: Critical
>             Fix For: 0.15.0
>
>
> For complex types such as arrays, maps and enums, JsonIntermediateToParquetGroupConverter generates the wrong Parquet schema: the OPTIONAL and REQUIRED attribute of the SchemaField is set incorrectly for these types.
>  
> Because of this, Spark throws the following error when reading Parquet files generated using JsonIntermediateToParquetGroupConverter:
> {code:java}
> Caused by: parquet.io.ParquetDecodingException: Can not read value at 0
> {code}
> An example of a wrong generated schema is below. Notice that the field payload.action is marked as required:
> {code:java}
> message EventData {
>   optional int64 id;
>   optional binary type (UTF8);
>   required group actor {
>     optional int64 id;
>     optional binary login (UTF8);
>     optional binary gravatar_id (UTF8);
>     optional binary url (UTF8);
>     optional binary avatar_url (UTF8);
>   }
>   required group repo {
>     optional int64 id;
>     optional binary name (UTF8);
>     optional binary url (UTF8);
>     optional binary urlid (UTF8);
>   }
>   required group payload {
>     optional int64 id;
>     optional binary ref (UTF8);
>     optional binary ref_type (UTF8);
>     optional binary master_branch (UTF8);
>     optional binary description (UTF8);
>     optional binary pusher_type (UTF8);
>     optional binary before (UTF8);
>     required binary action (UTF8);
>   }
>   optional boolean public;
>   optional binary created_at (UTF8);
>   optional binary created_at_id (UTF8);
> }
> {code}
> However, the field payload.action is defined in the source.schema property with isNullable: true:
> {code:java}
> [ ....
>   {
>     "columnName": "payload",
>     "dataType": {
>       "type": "record",
>       "name": "payloadDetails",
>       "values": [
>         ....
>         {
>           "columnName": "action",
>           "isNullable": true,
>           "dataType": {
>             "type": "enum",
>             "name": "actionType",
>             "symbols": [
>               "started",
>               "published",
>               "opened",
>               "closed",
>               "created",
>               "reopened",
>               "added"
>             ]
>           }
>         }
>       ]
>     }
>   }....
> ]
> {code}
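
For illustration, the mapping the description expects (a field with isNullable: true should come out as optional, otherwise required) can be sketched as below. This is a minimal, self-contained sketch and not Gobblin's converter code: the class RepetitionSketch, the Repetition enum, and the methods repetitionFor and enumField are hypothetical names introduced only for this example, and it assumes an enum column is written as a UTF8 binary, as in the schema above.

```java
import java.util.Locale;

public class RepetitionSketch {
    // Hypothetical stand-in for Parquet's repetition levels; not a Gobblin or Parquet type.
    enum Repetition { OPTIONAL, REQUIRED }

    // A nullable field maps to OPTIONAL, a non-nullable field to REQUIRED.
    static Repetition repetitionFor(boolean isNullable) {
        return isNullable ? Repetition.OPTIONAL : Repetition.REQUIRED;
    }

    // Renders one schema line for an enum column, assuming it is stored as a UTF8 binary.
    static String enumField(String columnName, boolean isNullable) {
        String repetition = repetitionFor(isNullable).name().toLowerCase(Locale.ROOT);
        return repetition + " binary " + columnName + " (UTF8);";
    }

    public static void main(String[] args) {
        // payload.action is declared with isNullable: true in source.schema.
        System.out.println(enumField("action", true)); // prints "optional binary action (UTF8);"
    }
}
```

With this mapping, payload.action would be written as optional rather than required, which matches the isNullable: true setting in source.schema.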


