Posted to issues@spark.apache.org by "Jiri Humpolicek (Jira)" <ji...@apache.org> on 2023/03/20 13:56:00 UTC

[jira] [Created] (SPARK-42872) Spark SQL reads unnecessary nested fields

Jiri Humpolicek created SPARK-42872:
---------------------------------------

             Summary: Spark SQL reads unnecessary nested fields
                 Key: SPARK-42872
                 URL: https://issues.apache.org/jira/browse/SPARK-42872
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.2
            Reporter: Jiri Humpolicek


When we use higher-order functions in a Spark SQL query, it would be great if it were somehow possible to write the following example in a way that lets Spark read only the necessary nested fields.

Example:
1) Loading data
{code:scala}
import spark.implicits._  // for Seq(...).toDS; already in scope in spark-shell

val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}
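For reference, the inferred schema of the persisted table should look roughly like this (JSON schema inference sorts the struct fields alphabetically and maps whole numbers to bigint/long):
{code:scala}
spark.table("persisted").printSchema()
// root
//  |-- items: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- itemData: string (nullable = true)
//  |    |    |-- itemId: long (nullable = true)
{code}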
2) read query with explain
{code:scala}
import org.apache.spark.sql.functions.transform  // not auto-imported in spark-shell

spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
val read = spark.table("persisted")

read.select(transform($"items", i => i.getItem("itemId")).as("itemIds")).explain(true)
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
{code}
We use only the *itemId* field of the struct inside the array, but the read schema still contains all fields of the struct.
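A possible workaround today (a sketch only, not verified on 3.3.2) is to bypass the catalog and read the underlying Parquet files with an explicitly pruned schema, so the reader never materializes {{itemData}}; {{pathToPersistedTable}} below is a hypothetical placeholder for the table's storage location:
{code:scala}
// Hypothetical placeholder: resolve the actual location of the "persisted" table.
val pathToPersistedTable = "/path/to/warehouse/persisted"

// Supplying an explicit schema restricts the Parquet read to itemId only.
val pruned = spark.read
  .schema("items array<struct<itemId: bigint>>")
  .parquet(pathToPersistedTable)

pruned.select(transform($"items", i => i.getItem("itemId")).as("itemIds")).explain(true)
// Expected ReadSchema: struct<items:array<struct<itemId:bigint>>>
{code}
The downside is that the pruned schema string has to be kept in sync with the table by hand, which is exactly what the optimizer should be doing automatically.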


