Posted to issues@spark.apache.org by "Jiri Humpolicek (Jira)" <ji...@apache.org> on 2023/03/21 07:01:00 UTC

[jira] [Created] (SPARK-42879) Spark SQL reads unnecessary nested fields

Jiri Humpolicek created SPARK-42879:
---------------------------------------

             Summary: Spark SQL reads unnecessary nested fields
                 Key: SPARK-42879
                 URL: https://issues.apache.org/jira/browse/SPARK-42879
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.2
            Reporter: Jiri Humpolicek


When more than one field of a struct is used after an explode, all fields of the struct are read.

Example:
1) Loading data
{code:scala}
// .toDS needs spark.implicits._, which is in scope by default in spark-shell
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData1": "a", "itemData2": 11},
   {"itemId": 2, "itemData1": "b", "itemData2": 22}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}
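For reference, the persisted layout can be inspected in the same spark-shell session (this check is an addition, not part of the original repro):
{code:scala}
// Sanity check: the persisted table holds an array of structs with
// fields itemData1, itemData2 and itemId.
spark.table("persisted").printSchema
{code}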
2) read query with explain
{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

read
    .select(explode('items).as('item))
    .select($"item.itemId", $"item.itemData1")
    .explain
// ReadSchema: struct<items:array<struct<itemData1:string,itemData2:bigint,itemId:bigint>>>
{code}
We use only the *itemId* and *itemData1* fields of the structs in the array, but the read schema contains the *itemData2* field as well.
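A possible workaround (a sketch added here, not part of the original report): rebuild the array with only the needed fields before exploding, so the scan has a chance to prune the unused column. It assumes the same spark-shell session and the *persisted* table created above; whether ReadSchema actually shrinks depends on the optimizer version.
{code:scala}
import org.apache.spark.sql.functions.{col, explode, struct, transform}

// Narrow each struct to itemId and itemData1 before exploding, so that
// itemData2 is never referenced by the plan.
val pruned = spark.table("persisted")
  .select(explode(transform(col("items"),
    x => struct(x("itemId").as("itemId"), x("itemData1").as("itemData1")))).as("item"))
  .select(col("item.itemId"), col("item.itemData1"))

pruned.explain()
{code}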



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org