Posted to issues@spark.apache.org by "Sergey Kotlov (Jira)" <ji...@apache.org> on 2021/11/03 01:41:00 UTC
[jira] [Created] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)
Sergey Kotlov created SPARK-37201:
-------------------------------------
Summary: Spark SQL reads unnecessary nested fields (filter after explode)
Key: SPARK-37201
URL: https://issues.apache.org/jira/browse/SPARK-37201
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.2.0
Reporter: Sergey Kotlov
In the example below, Spark still reads nested fields that the query does not use.
Data preparation:
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, array: Seq[String])
Seq(
  Event(Struct("v1", "v2", "v3"), Seq("cx1", "cx2"))
).toDF().write.mode("overwrite").saveAsTable("table")
{code}
The columns v2 and v3 aren't needed by the query below, yet they still appear in the read schema of the physical plan.
{code:java}
spark.table("table")
.select($"struct.v1", explode($"array").as("el"))
.filter($"el" === "cx1")
.explain(true)
== Physical Plan ==
... ReadSchema: struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>
{code}
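The read schema can also be inspected programmatically instead of scanning the full explain output. A small sketch, assuming the table created above (the _queryExecution_ / _executedPlan_ accessors are part of Spark's Dataset API):
{code:java}
val df = spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  .filter($"el" === "cx1")

// The executed plan string contains the scan node's ReadSchema;
// with this bug present it still lists v2 and v3.
println(df.queryExecution.executedPlan.toString)
{code}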
If you just remove the _filter_ or move the _explode_ to a second _select_, everything is fine:
{code:java}
spark.table("table")
.select($"struct.v1", explode($"array").as("el"))
//.filter($"el" === "cx1")
.explain(true)
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
spark.table("table")
.select($"struct.v1", $"array")
.select($"v1", explode($"array").as("el"))
.filter($"el" === "cx1")
.explain(true)
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
{code}
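Until this is fixed, projecting the nested field in a separate _select_ before applying _explode_ and _filter_ (as in the second snippet above) keeps the read schema pruned. It may also be worth confirming that nested schema pruning is enabled at all; the configuration key below exists in Spark 3.x and defaults to true, so this is a sanity check rather than a fix:
{code:java}
// Nested schema pruning must be on for struct fields to be pruned at all
// (default true since Spark 3.0).
spark.conf.get("spark.sql.optimizer.nestedSchemaPruning.enabled")

// Workaround: select the nested field first, then explode and filter.
spark.table("table")
  .select($"struct.v1", $"array")
  .select($"v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
{code}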
--
This message was sent by Atlassian Jira
(v8.3.4#803005)