Posted to issues@spark.apache.org by "L. C. Hsieh (Jira)" <ji...@apache.org> on 2021/11/24 02:33:00 UTC

[jira] [Comment Edited] (SPARK-37450) Spark SQL reads unnecessary nested fields (another type of pruning case)

    [ https://issues.apache.org/jira/browse/SPARK-37450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448328#comment-17448328 ] 

L. C. Hsieh edited comment on SPARK-37450 at 11/24/21, 2:32 AM:
----------------------------------------------------------------

Hmm, this is the optimized plan.

{code}
== Optimized Logical Plan ==
Aggregate [count(1) AS count(true)#20299L]
+- Project
   +- Generate explode(items#20293), [0], false, [item#20296]
      +- Filter ((size(items#20293, true) > 0) AND isnotnull(items#20293))
         +- Relation default.table[items#20293] parquet
{code}

Because here you are counting "item", Spark must read "items" and explode it to count the nested elements. And because no particular nested field is specified, Spark reads the full nested struct ("itemId" and "itemData") without any pruning.
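
For reference, this is the query from the report that produces the unpruned read (a minimal sketch, assuming the "persisted" table from the reproduction in the description and spark-shell-style implicits):

{code:scala}
// spark-shell style; outside the shell these imports are needed explicitly
import spark.implicits._
import org.apache.spark.sql.functions.{explode, count, lit}

// Count of exploded elements: no nested field of "item" is referenced,
// so the scan keeps the full struct (itemId and itemData) in ReadSchema.
val read = spark.table("persisted")
read.select(explode($"items").as('item))
  .select(count(lit(true)))
  .explain(true)
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
{code}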

For example, if you change the query to "read.select(explode($"items").as('item)).select(count($"item.itemData")).explain(true)", Spark will prune "itemId":


{code}
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(_extract_itemData#20302)], output=[count(item.itemData)#20300L])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#24668]
      +- HashAggregate(keys=[], functions=[partial_count(_extract_itemData#20302)], output=[count#20307L])
         +- Project [item#20296 AS _extract_itemData#20302]
            +- Generate explode(_extract_itemData#20304), false, [item#20296]
               +- Project [items#20293.itemData AS _extract_itemData#20304]
                  +- Filter ((size(items#20293.itemData, true) > 0) AND isnotnull(items#20293.itemData))
                     +- FileScan parquet default.table[items#20293] Batched: false, DataFilters: [(size(items#20293.itemData, true) > 0), isnotnull(items#20293.itemData)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<items:array<struct<itemData:string>>>
{code}
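
Put together, a minimal sketch of that rewritten query (assuming the same "persisted" table and the nested-schema-pruning config from the reproduction):

{code:scala}
// spark-shell style; outside the shell these imports are needed explicitly
import spark.implicits._
import org.apache.spark.sql.functions.{explode, count}

// Counting a specific nested field lets the optimizer push the field access
// into the scan, so "itemId" is pruned from ReadSchema.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
val read = spark.table("persisted")
read.select(explode($"items").as('item))
  .select(count($"item.itemData"))
  .explain(true)
// ReadSchema: struct<items:array<struct<itemData:string>>>
{code}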

That said, I don't think this is an issue.



> Spark SQL reads unnecessary nested fields (another type of pruning case)
> ------------------------------------------------------------------------
>
>                 Key: SPARK-37450
>                 URL: https://issues.apache.org/jira/browse/SPARK-37450
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Jiri Humpolicek
>            Priority: Major
>
> Based on [SPARK-34638|https://issues.apache.org/jira/browse/SPARK-34638], maybe I found another nested field pruning case. In this case I found a full read with the `count` function.
> Example:
> 1) Loading data
> {code:scala}
> val jsonStr = """{
>  "items": [
>    {"itemId": 1, "itemData": "a"},
>    {"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {code}
> 2) read query with explain
> {code:scala}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select(explode($"items").as('item)).select(count(lit(true))).explain(true)
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org