Posted to dev@gobblin.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/03/19 17:59:00 UTC

[jira] [Work logged] (GOBBLIN-571) JsonIntermediateToParquetGroupConverter generates wrong parquet schema for complex types such as enums, arrays and maps

     [ https://issues.apache.org/jira/browse/GOBBLIN-571?focusedWorklogId=569031&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-569031 ]

ASF GitHub Bot logged work on GOBBLIN-571:
------------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Mar/21 17:58
            Start Date: 19/Mar/21 17:58
    Worklog Time Spent: 10m 
      Work Description: vikrambohra opened a new pull request #3248:
URL: https://github.com/apache/gobblin/pull/3248


   …g to task output directory
   
   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
   
   
   ### JIRA
   - [ ] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-571] My Gobblin PR"
       - https://issues.apache.org/jira/browse/GOBBLIN-571
   
   
   ### Description
   - [ ] Here are some details about my PR, including screenshots (if applicable):
   
   
   ### Tests
   - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:
   
   
   ### Commits
   - [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       2. Subject is limited to 50 characters
       3. Subject does not end with a period
       4. Subject uses the imperative mood ("add", not "adding")
       5. Body wraps at 72 characters
       6. Body explains "what" and "why", not "how"
   
   




Issue Time Tracking
-------------------

            Worklog Id:     (was: 569031)
    Remaining Estimate: 0h
            Time Spent: 10m

> JsonIntermediateToParquetGroupConverter generates wrong parquet schema for complex types such as enums, arrays and maps
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: GOBBLIN-571
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-571
>             Project: Apache Gobblin
>          Issue Type: Bug
>            Reporter: Tilak Patidar
>            Assignee: Shirshanka Das
>            Priority: Critical
>             Fix For: 0.15.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> For complex types such as arrays, maps and enums, JsonIntermediateToParquetGroupConverter generates the wrong Parquet schema.
> For these types, the OPTIONAL and REQUIRED attributes of the SchemaField are assigned incorrectly.
>  
> Because of this, Spark throws the following error when reading Parquet files generated using JsonIntermediateToParquetGroupConverter:
> {code:java}
> Caused by: parquet.io.ParquetDecodingException: Can not read value at 0
> {code}
> An example of a wrongly generated schema is below. Note that the field payload.action is marked as required:
> {code:java}
> message EventData {
>   optional int64 id;
>   optional binary type (UTF8);
>   required group actor {
>     optional int64 id;
>     optional binary login (UTF8);
>     optional binary gravatar_id (UTF8);
>     optional binary url (UTF8);
>     optional binary avatar_url (UTF8);
>   }
>   required group repo {
>     optional int64 id;
>     optional binary name (UTF8);
>     optional binary url (UTF8);
>     optional binary urlid (UTF8);
>   }
>   required group payload {
>     optional int64 id;
>     optional binary ref (UTF8);
>     optional binary ref_type (UTF8);
>     optional binary master_branch (UTF8);
>     optional binary description (UTF8);
>     optional binary pusher_type (UTF8);
>     optional binary before (UTF8);
>     required binary action (UTF8);
>   }
>   optional boolean public;
>   optional binary created_at (UTF8);
>   optional binary created_at_id (UTF8);
> }
> {code}
> However, the field payload.action is defined in the source.schema property with isNullable: true:
> {code:java}
> [ ....
>     {
>     "columnName": "payload",
>     "dataType": {
>       "type": "record",
>       "name": "payloadDetails",
>       "values": [
>         ....
>         {
>           "columnName": "action",
>           "isNullable": true,
>           "dataType": {
>             "type": "enum",
>             "name": "actionType",
>             "symbols": [
>               "started",
>               "published",
>               "opened",
>               "closed",
>               "created",
>               "reopened",
>               "added"
>             ]
>           }
>         }
>       ]
>     }
>   }....
> ]
> {code}
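
The mapping that the description expects can be sketched as follows. This is a minimal illustration only, not the converter's actual code: the class RepetitionSketch, its Repetition enum, and the helper methods are hypothetical names invented here, and the sketch deliberately avoids the real Parquet classes. It assumes, as the description implies, that a column with isNullable: true should become an optional field and any other column a required one, regardless of whether its data type is an enum, array or map.

```java
import java.util.Locale;

// Hypothetical sketch: none of these names come from Gobblin or Parquet.
class RepetitionSketch {

    // Stand-in for Parquet's field repetition; only the two values discussed in the issue.
    enum Repetition { OPTIONAL, REQUIRED }

    // Assumption: nullability alone decides the repetition, independent of the data type.
    static Repetition repetitionFor(boolean isNullable) {
        return isNullable ? Repetition.OPTIONAL : Repetition.REQUIRED;
    }

    // Renders an enum column the way it appears in the schema dump above:
    // as a UTF8-annotated binary field prefixed with its repetition.
    static String enumField(String columnName, boolean isNullable) {
        String repetition = repetitionFor(isNullable).name().toLowerCase(Locale.ROOT);
        return repetition + " binary " + columnName + " (UTF8);";
    }

    public static void main(String[] args) {
        // payload.action is declared with isNullable: true in source.schema.
        System.out.println(enumField("action", true));
    }
}
```

Under that assumption, the nullable payload.action column would be rendered as "optional binary action (UTF8);" rather than the "required binary action (UTF8);" line shown in the generated schema.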


