Posted to issues@spark.apache.org by "Jiri Humpolicek (Jira)" <ji...@apache.org> on 2023/03/20 13:56:00 UTC
[jira] [Created] (SPARK-42872) Spark SQL reads unnecessary nested fields
Jiri Humpolicek created SPARK-42872:
---------------------------------------
Summary: Spark SQL reads unnecessary nested fields
Key: SPARK-42872
URL: https://issues.apache.org/jira/browse/SPARK-42872
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.3.2
Reporter: Jiri Humpolicek
When using higher-order functions in a Spark SQL query, it would be great if the following example could somehow be written so that Spark reads only the necessary nested fields.
Example:
1) Loading data
{code:scala}
val jsonStr = """{
"items": [
{"itemId": 1, "itemData": "a"},
{"itemId": 2, "itemData": "b"}
]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}
2) read query with explain
{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select(transform($"items", i => i.getItem("itemId")).as("itemIds")).explain(true)
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
{code}
We use only the *itemId* field of the struct inside the array, but the read schema contains all fields of the struct.
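As a possible workaround (a sketch, not verified against 3.3.2): extracting the field with a direct field path on the array column, instead of going through {{transform}}, may allow the nested schema pruning rule to narrow the read schema to *itemId* only, since a plain field-path extraction is a pattern the pruning rule is known to handle.

{code:scala}
// Hypothetical workaround sketch: use a field path on the array of structs
// instead of transform(). Assumes the "persisted" table from step 1 above.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
val read = spark.table("persisted")

// $"items.itemId" on an array<struct<...>> yields an array of itemId values,
// which the optimizer may prune down to just the itemId field on read.
read.select($"items.itemId".as("itemIds")).explain(true)
{code}

Whether the resulting ReadSchema actually shrinks to {{struct<items:array<struct<itemId:bigint>>>}} would need to be confirmed on the affected version; the {{transform}} form in the example above is the case where pruning does not happen.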
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org