You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Raghu Angadi (Jira)" <ji...@apache.org> on 2023/02/11 20:54:00 UTC

[jira] [Created] (SPARK-42406) [PROTOBUF] Recursive field handling is incompatible with delta

Raghu Angadi created SPARK-42406:
------------------------------------

             Summary: [PROTOBUF] Recursive field handling is incompatible with delta
                 Key: SPARK-42406
                 URL: https://issues.apache.org/jira/browse/SPARK-42406
             Project: Spark
          Issue Type: Bug
          Components: Protobuf
    Affects Versions: 3.4.0
            Reporter: Raghu Angadi
             Fix For: 3.4.1


Protobuf deserializer (`from_protobuf()` function()) optionally supports recursive fields by limiting the depth to certain level. See example below. It assigns a 'NullType' for such a field when allowed depth is reached. 

It causes a few issues. E.g. a repeated field as in the following example results in a Array field with 'NullType'. Delta does not support null type in a complex type.

Actually `Array[NullType]` is not really useful anyway.

How about this fix: Drop the recursive field when the limit reached rather than using a NullType. 

The example below makes it clear:

Consider a recursive Protobuf:
```
message TreeNode {
  string value = 1;
  repeated TreeNode children = 2;
}
```

Allow depth of 2: 

```python
   df.select(
    'proto',
     messageName = 'TreeNode',
    options = { ... "recursive.fields.max.depth" : "2" }
  ).printSchema()
```
Schema looks like this:
```
root
 |-- from_protobuf(proto): struct (nullable = true)
 |    |-- value: string (nullable = true)
 |    |-- children: array (nullable = false)
 |    |    |-- element: struct (containsNull = false)
 |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- children: array (nullable = false)
 |    |    |    |    |-- element: struct (containsNull = false)
 |    |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |    |-- children: array (nullable = false). [ === Proposed fix: Drop this field === ] 
 |    |    |    |    |    |    |-- element: void (containsNull = false) [ === NOTICE 'void' HERE === ] 
```

When we try to write this to a delta table, we get an error:
```
AnalysisException:  Found nested NullType in column from_protobuf(proto).children which is of ArrayType. Delta doesn't support writing NullType in complex types.
```
 
We could just drop the field 'element' when recursion depth is reached. It is simpler and does not need to deal with NullType. We are ignoring the value anyway. There is no use in keeping the field. 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org