Posted to issues@spark.apache.org by "Jiri Humpolicek (Jira)" <ji...@apache.org> on 2023/03/21 07:01:00 UTC
[jira] [Created] (SPARK-42879) Spark SQL reads unnecessary nested fields
Jiri Humpolicek created SPARK-42879:
---------------------------------------
Summary: Spark SQL reads unnecessary nested fields
Key: SPARK-42879
URL: https://issues.apache.org/jira/browse/SPARK-42879
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.3.2
Reporter: Jiri Humpolicek
When more than one field of a struct is used after an explode, all fields of the struct are read instead of only the referenced ones.
Example:
1) Loading data
{code:scala}
val jsonStr = """{
  "items": [
    {"itemId": 1, "itemData1": "a", "itemData2": 11},
    {"itemId": 2, "itemData1": "b", "itemData2": 22}
  ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}
2) read query with explain
{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read
.select(explode('items).as('item))
.select($"item.itemId", $"item.itemData1")
.explain
// ReadSchema: struct<items:array<struct<itemData1:string,itemData2:bigint,itemId:bigint>>>
{code}
Only the *itemId* and *itemData1* fields of the struct inside the array are used, but the read schema contains the *itemData2* field as well.
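As a possible workaround until the optimizer handles this case, the array can be rebuilt with only the needed fields before the explode, e.g. with {{transform}} and {{named_struct}}. This is an untested sketch assuming the same persisted table and the same nestedSchemaPruning setting; whether the scan is actually pruned would need to be confirmed via explain:
{code:scala}
import org.apache.spark.sql.functions._

val read = spark.table("persisted")

// Rebuild each array element as a struct containing only the two fields
// we need, then explode the narrowed array.
read
  .select(explode(expr(
    "transform(items, x -> named_struct('itemId', x.itemId, 'itemData1', x.itemData1))"
  )).as("item"))
  .select($"item.itemId", $"item.itemData1")
  .explain
{code}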
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org