You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Gengliang Wang (Jira)" <ji...@apache.org> on 2023/02/14 23:48:00 UTC

[jira] [Reopened] (SPARK-42406) [PROTOBUF] Recursive field handling is incompatible with delta

     [ https://issues.apache.org/jira/browse/SPARK-42406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang reopened SPARK-42406:
------------------------------------

> [PROTOBUF] Recursive field handling is incompatible with delta
> --------------------------------------------------------------
>
>                 Key: SPARK-42406
>                 URL: https://issues.apache.org/jira/browse/SPARK-42406
>             Project: Spark
>          Issue Type: Bug
>          Components: Protobuf
>    Affects Versions: 3.4.0
>            Reporter: Raghu Angadi
>            Assignee: Raghu Angadi
>            Priority: Major
>             Fix For: 3.4.0
>
>
> Protobuf deserializer (`from_protobuf()` function()) optionally supports recursive fields by limiting the depth to certain level. See example below. It assigns a 'NullType' for such a field when allowed depth is reached. 
> It causes a few issues. E.g. a repeated field as in the following example results in a Array field with 'NullType'. Delta does not support null type in a complex type.
> Actually `Array[NullType]` is not really useful anyway.
> How about this fix: Drop the recursive field when the limit reached rather than using a NullType. 
> The example below makes it clear:
> Consider a recursive Protobuf:
>  
> {code:python}
> message TreeNode {
>   string value = 1;
>   repeated TreeNode children = 2;
> }
> {code}
> Allow depth of 2: 
>  
> {code:python}
>    df.select(
>     'proto',
>      messageName = 'TreeNode',
>      options = { ... "recursive.fields.max.depth" : "2" }
>   ).printSchema()
> {code}
> Schema looks like this:
> {noformat}
> root
> |– from_protobuf(proto): struct (nullable = true)|
> | |– value: string (nullable = true)|
> | |– children: array (nullable = false)|
> | | |– element: struct (containsNull = false)|
> | | | |– value: string (nullable = true)|
> | | | |– children: array (nullable = false)|
> | | | | |– element: struct (containsNull = false)|
> | | | | | |– value: string (nullable = true)|
> | | | | | |– children: array (nullable = false). [ === Proposed fix: Drop this field === ]|
> | | | | | | |– element: void (containsNull = false) [ === NOTICE 'void' HERE === ] 
> {noformat}
> When we try to write this to a delta table, we get an error:
> {noformat}
> AnalysisException: Found nested NullType in column from_protobuf(proto).children which is of ArrayType. Delta doesn't support writing NullType in complex types.
> {noformat}
>  
> We could just drop the field 'element' when recursion depth is reached. It is simpler and does not need to deal with NullType. We are ignoring the value anyway. There is no use in keeping the field.
> Another issue is setting for 'recursive.fields.max.depth': It is not enforced correctly. '0' does not make sense. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org