Posted to dev@gobblin.apache.org by "Shirshanka Das (Jira)" <ji...@apache.org> on 2019/12/18 19:55:00 UTC
[jira] [Assigned] (GOBBLIN-571)
JsonIntermediateToParquetGroupConverter generates wrong parquet schema for
complex types such as enums, arrays and maps
[ https://issues.apache.org/jira/browse/GOBBLIN-571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shirshanka Das reassigned GOBBLIN-571:
--------------------------------------
Assignee: Shirshanka Das
> JsonIntermediateToParquetGroupConverter generates wrong parquet schema for complex types such as enums, arrays and maps
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: GOBBLIN-571
> URL: https://issues.apache.org/jira/browse/GOBBLIN-571
> Project: Apache Gobblin
> Issue Type: Bug
> Reporter: Tilak Patidar
> Assignee: Shirshanka Das
> Priority: Critical
> Fix For: 0.15.0
>
>
> For complex types such as arrays, maps and enums, JsonIntermediateToParquetGroupConverter generates the wrong Parquet schema: the OPTIONAL and REQUIRED attributes of the SchemaField are set incorrectly.
>
> Because of this, Spark throws the following error when reading Parquet files generated using JsonIntermediateToParquetGroupConverter:
> {code:java}
> Caused by: parquet.io.ParquetDecodingException: Can not read value at 0
> {code}
> An example of a wrong generated schema is below. Note that the field payload.action is marked as required:
> {code:java}
> message EventData {
>   optional int64 id;
>   optional binary type (UTF8);
>   required group actor {
>     optional int64 id;
>     optional binary login (UTF8);
>     optional binary gravatar_id (UTF8);
>     optional binary url (UTF8);
>     optional binary avatar_url (UTF8);
>   }
>   required group repo {
>     optional int64 id;
>     optional binary name (UTF8);
>     optional binary url (UTF8);
>     optional binary urlid (UTF8);
>   }
>   required group payload {
>     optional int64 id;
>     optional binary ref (UTF8);
>     optional binary ref_type (UTF8);
>     optional binary master_branch (UTF8);
>     optional binary description (UTF8);
>     optional binary pusher_type (UTF8);
>     optional binary before (UTF8);
>     required binary action (UTF8);
>   }
>   optional boolean public;
>   optional binary created_at (UTF8);
>   optional binary created_at_id (UTF8);
> }
> {code}
> However, the field payload.action is defined in the source.schema property with isNullable: true:
> {code:java}
> [ ....
>   {
>     "columnName": "payload",
>     "dataType": {
>       "type": "record",
>       "name": "payloadDetails",
>       "values": [
>         ....
>         {
>           "columnName": "action",
>           "isNullable": true,
>           "dataType": {
>             "type": "enum",
>             "name": "actionType",
>             "symbols": [
>               "started",
>               "published",
>               "opened",
>               "closed",
>               "created",
>               "reopened",
>               "added"
>             ]
>           }
>         }
>       ]
>     }
>   }....
> ]
> {code}
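
The expected behaviour described above can be illustrated with a minimal, self-contained sketch. It is not Gobblin's implementation: the class RepetitionSketch, the Field holder and both methods are hypothetical names introduced only for illustration. It assumes, as in the schema quoted in the report, that an enum column is written as a UTF8 binary, and shows that a field with isNullable: true should come out as optional rather than required.

```java
import java.util.Locale;

public class RepetitionSketch {

    /** The two Parquet repetition levels relevant to this report. */
    enum Repetition { REQUIRED, OPTIONAL }

    /** Hypothetical stand-in for one source.schema entry. */
    static final class Field {
        final String columnName;
        final boolean isNullable;

        Field(String columnName, boolean isNullable) {
            this.columnName = columnName;
            this.isNullable = isNullable;
        }
    }

    /** A nullable field maps to OPTIONAL; a non-nullable field maps to REQUIRED. */
    static Repetition repetitionFor(Field field) {
        return field.isNullable ? Repetition.OPTIONAL : Repetition.REQUIRED;
    }

    /** Renders the field as a schema line, assuming a UTF8 binary (as for the enum in the report). */
    static String toParquetLine(Field field) {
        String repetition = repetitionFor(field).name().toLowerCase(Locale.ROOT);
        return repetition + " binary " + field.columnName + " (UTF8);";
    }

    public static void main(String[] args) {
        // payload.action from the report: an enum column declared with isNullable: true
        Field action = new Field("action", true);
        System.out.println(toParquetLine(action));
    }
}
```

For the action field from the report, this sketch prints "optional binary action (UTF8);", in contrast to the "required binary action (UTF8);" line in the generated schema.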
--
This message was sent by Atlassian Jira
(v8.3.4#803005)