Posted to issues@spark.apache.org by "XiDuo You (Jira)" <ji...@apache.org> on 2022/01/11 17:07:00 UTC

[jira] [Commented] (SPARK-37855) IllegalStateException when transforming an array inside a nested struct

    [ https://issues.apache.org/jira/browse/SPARK-37855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472938#comment-17472938 ] 

XiDuo You commented on SPARK-37855:
-----------------------------------

The regression seems to come from SPARK-35636. As a quick workaround, you can set the config:

{code:java}
set spark.sql.optimizer.nestedSchemaPruning.enabled=false;
{code}
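The same config can also be set programmatically for the current session. A minimal sketch, assuming `spark` is an active {{SparkSession}} (e.g. in {{spark-shell}}):

{code:java}
// Disable nested schema pruning for this session only, as a workaround.
// Assumes `spark` is an active SparkSession.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "false")
{code}

Note that this turns off the nested-column pruning optimization globally for the session, which may increase I/O when reading wide nested schemas from columnar sources.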


> IllegalStateException when transforming an array inside a nested struct
> -----------------------------------------------------------------------
>
>                 Key: SPARK-37855
>                 URL: https://issues.apache.org/jira/browse/SPARK-37855
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>         Environment: OS: Ubuntu 20.04.3 LTS
> Scala version: 2.12.12
>  
>            Reporter: G Muciaccia
>            Priority: Major
>
> *NOTE*: this bug is only present in version {{3.2.0}}. Downgrading to {{3.1.2}} solves the problem.
> h3. Prerequisites to reproduce the bug
> # use Spark version 3.2.0
> # create a DataFrame with an array field, which contains a struct field with a nested array field
> # *apply a limit* to the DataFrame
> # transform the outer array, renaming one of its fields
> # transform the inner array too, which requires two {{getField}} in sequence
> h3. Example that reproduces the bug
> This is a minimal example (as minimal as I could make it) to reproduce the bug:
> {code}
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.{DataFrame, Row}
> def makeInput(): DataFrame = {
>     val innerElement1 = Row(3, 3.12)
>     val innerElement2 = Row(4, 2.1)
>     val innerElement3 = Row(1, 985.2)
>     val innerElement4 = Row(10, 757548.0)
>     val innerElement5 = Row(1223, 0.665)
>     val outerElement1 = Row(1, Row(List(innerElement1, innerElement2)))
>     val outerElement2 = Row(2, Row(List(innerElement3)))
>     val outerElement3 = Row(3, Row(List(innerElement4, innerElement5)))
>     val data = Seq(
>         Row("row1", List(outerElement1)),
>         Row("row2", List(outerElement2, outerElement3)),
>     )
>     val schema = new StructType()
>         .add("name", StringType)
>         .add("outer_array", ArrayType(new StructType()
>             .add("id", IntegerType)
>             .add("inner_array_struct", new StructType()
>                 .add("inner_array", ArrayType(new StructType()
>                     .add("id", IntegerType)
>                     .add("value", DoubleType)
>                 ))
>             )
>         ))
>     spark.createDataFrame(spark.sparkContext
>         .parallelize(data),schema)
> }
> // val df = makeInput()
> val df = makeInput().limit(2)
> // val df = makeInput().limit(2).cache()
> val res = df.withColumn("extracted", transform(
>     col("outer_array"),
>     c1 => {
>         struct(
>             c1.getField("id").alias("outer_id"),
>             transform(
>                 c1.getField("inner_array_struct").getField("inner_array"),
>                 c2 => {
>                     struct(
>                         c2.getField("value").alias("inner_value")
>                     )
>                 }
>             )
>         )
>     }
> ))
> res.printSchema()
> res.show(false)
> {code}
> h4. Executing the example code
> When executing it as-is, the execution will fail on the {{show}} statement, with
> {code}
> java.lang.IllegalStateException Couldn't find _extract_inner_array#23 in [name#2,outer_array#3]
> {code}
> However, *if the limit is not applied, or if the DataFrame is cached after the limit, everything works* (you can uncomment the corresponding lines in the example to try it).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
