Posted to dev@parquet.apache.org by "Werner Daehn (JIRA)" <ji...@apache.org> on 2018/01/04 17:47:00 UTC

[jira] [Comment Edited] (PARQUET-129) AvroParquetWriter can't save object who can have link to object of same type

    [ https://issues.apache.org/jira/browse/PARQUET-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311732#comment-16311732 ] 

Werner Daehn edited comment on PARQUET-129 at 1/4/18 5:46 PM:
--------------------------------------------------------------

Gee, I keep hitting the same old issues. It feels like I'm stalking Ryan ;-)

Kidding aside, I would like to reopen this. If I have valid Avro data and want to persist it in Parquet, what should I do? Failing at the Parquet write is too late; it would need to fail already when the Avro message is created. Imagine a Kafka server where people put all kinds of Avro messages into it and long-term persistence is supposed to be Parquet.
The max-depth solution has the character of a workaround. Yes, it helps in 90% of the cases, but the same argument as before applies.
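For reference, a minimal sketch of the failure, assuming the UserTestOne schema quoted at the bottom of this issue and a recent parquet-avro where the converter lives under org.apache.parquet.avro (the repro class name is just a placeholder). Avro happily parses the recursive schema, but the Avro-to-Parquet schema conversion recurses into the friend field without end:

{{
import org.apache.avro.Schema;
import org.apache.parquet.avro.AvroSchemaConverter;

public class RecursiveSchemaRepro {
    public static void main(String[] args) {
        // The recursive UserTestOne schema from this issue: a record whose optional
        // "friend" field may point to another record of the same type.
        Schema schema = new Schema.Parser().parse(
            "{\"namespace\": \"com.example.avro\", \"type\": \"record\", \"name\": \"UserTestOne\","
          + " \"fields\": [{\"name\": \"name\", \"type\": \"string\"},"
          + " {\"name\": \"friend\", \"type\": [\"null\", \"UserTestOne\"], \"default\": null}]}");

        // Avro accepts the schema, but the schema conversion that AvroParquetWriter relies on
        // follows "friend" forever and dies with a StackOverflowError.
        new AvroSchemaConverter().convert(schema);
    }
}
}}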

Given my limited understanding of Parquet, I would redefine the problem as "How would you store this in a relational table?". Parquet wants to store every primitive field of a schema and of its nested schemas in its own column, so I hope the redefinition makes sense. The answer would then be "by reusing the columns and adding a parent_key".

Example with the schema above:

The input data shall be
{"Fritz", [Fritz-friend1: {"Walter", [Walter-friend1: {"Joe", null}, Walter-friend2: {"Jil", null}]}]}

or rendered as a tree:
{{
Fritz
  Walter
    Joe
    Jil
}}

Converting that to a Parquet structure:
||id||name||friend||root||
|1|Fritz|2|true|
|2|Walter|3|false|
|3|Joe|null|false|
|4|Walter|5|false|
|5|Jil|null|false|

In other words:
# Whenever you encounter a record definition in the Avro schema for the first time, create its columns.
# If that record definition is reused by name (without an actual definition of its fields), its data goes into the already created structure.
# If such a schema is used more than once, add the id and root columns. Ideally the id could be a Parquet-internal pointer exposed as a number (a rough sketch of the flattening follows below).
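To make the list above concrete, here is a rough sketch of how the flattening inside the AvroWriter could look. This is only an illustration under assumptions: a single optional friend field as in UserTestOne, a plain long counter instead of a Parquet-internal pointer, and a placeholder emitRow() standing in for the actual write into the flat (id, name, friend, root) Parquet table:

{{
import org.apache.avro.generic.GenericRecord;

// Hypothetical sketch: walk a chain of UserTestOne records and emit one flat
// row (id, name, friend, root) per record, as in the table above.
public class RecursiveRecordFlattener {
    private long nextId = 1;

    /** Flattens a record and everything reachable through "friend"; returns its id. */
    public long flatten(GenericRecord record, boolean isRoot) {
        long id = nextId++;
        GenericRecord friend = (GenericRecord) record.get("friend");

        // Flatten the friend first so its id is known when this row is written.
        Long friendId = (friend == null) ? null : flatten(friend, false);

        emitRow(id, record.get("name").toString(), friendId, isRoot);
        return id;
    }

    /** Placeholder for writing one flat row into the Parquet table. */
    private void emitRow(long id, String name, Long friendId, boolean root) {
        System.out.printf("%d|%s|%s|%b%n", id, name, friendId, root);
    }
}
}}

Applied to a chain Fritz -> Walter -> Joe, Fritz gets id 1, Walter id 2 and Joe id 3, matching the first rows of the table; for very deep chains the recursion would of course have to be replaced by an iterative walk.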

The nice thing about this is that it delays the max-depth decision to Spark query time. Such a table is easy to read with Spark and supports the full flexibility of tree structures: unbalanced trees, recursion in the data, everything. And you do not need to change Spark or Parquet itself; it is just logic within the AvroWriter.
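On the query side, a hedged sketch of what delaying the decision to query time could look like from Spark (table, column and path names follow the example above and are placeholders): every additional level of the tree is just one more self-join.

{{
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadFlattenedTree {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("flattened-tree").getOrCreate();

        // Read the flat (id, name, friend, root) table produced by the proposed AvroWriter logic.
        Dataset<Row> people = spark.read().parquet("/data/usertestone");  // placeholder path
        people.createOrReplaceTempView("people");

        // Reconstruct one level of the tree at query time: each root person with their direct friend.
        // Deeper levels are just further self-joins; the reader decides how deep to go.
        Dataset<Row> withFriend = spark.sql(
            "SELECT p.name AS person, f.name AS friend "
          + "FROM people p LEFT JOIN people f ON p.friend = f.id "
          + "WHERE p.root = true");
        withFriend.show();
    }
}
}}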

How does that sound?




> AvroParquetWriter can't save object who can have link to object of same type
> ----------------------------------------------------------------------------
>
>                 Key: PARQUET-129
>                 URL: https://issues.apache.org/jira/browse/PARQUET-129
>             Project: Parquet
>          Issue Type: Bug
>         Environment: parquet version 1.5.0
>            Reporter: Dmitriy
>
> When I try to write an instance of UserTestOne created from the following schema
> {"namespace": "com.example.avro",
>  "type": "record",
>  "name": "UserTestOne",
>  "fields": [
>    {"name": "name",   "type": "string"},
>    {"name": "friend", "type": ["null", "UserTestOne"], "default": null}
>  ]
> }
> I get a java.lang.StackOverflowError.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)