Posted to issues@spark.apache.org by "Sergey Kotlov (Jira)" <ji...@apache.org> on 2021/11/03 01:41:00 UTC

[jira] [Created] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)

Sergey Kotlov created SPARK-37201:
-------------------------------------

             Summary: Spark SQL reads unnecessary nested fields (filter after explode)
                 Key: SPARK-37201
                 URL: https://issues.apache.org/jira/browse/SPARK-37201
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Sergey Kotlov


In this example, Spark still reads unnecessary nested fields.

Data preparation:

 
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, array: Seq[String])

Seq(
  Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
).toDF().write.mode("overwrite").saveAsTable("table")
{code}
 

The v2 and v3 columns aren't needed here, but they still appear in the ReadSchema of the physical plan.
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
 
== Physical Plan ==
... ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>

{code}
If you just remove the _filter_ or move the _explode_ into a second _select_, everything is fine:
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  //.filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>

spark.table("table")
  .select($"struct.v1", $"array")
  .select($"v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
{code}
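For reference, nested schema pruning is governed by the spark.sql.optimizer.nestedSchemaPruning.enabled configuration (on by default since Spark 3.0), so it may be worth confirming the flag is not the cause before digging into the optimizer rules. A minimal check, assuming an active SparkSession named spark:
{code:java}
// Nested schema pruning should be enabled by default in Spark 3.x.
// If this returns "false", the extra v2/v3 columns in ReadSchema are expected;
// if it returns "true", the pruning rule is failing on the filter-after-explode pattern.
spark.conf.get("spark.sql.optimizer.nestedSchemaPruning.enabled")
{code}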



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
