
Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2020/01/25 06:19:00 UTC

[jira] [Resolved] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

     [ https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-29721.
-----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 26978
[https://github.com/apache/spark/pull/26978]

> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --------------------------------------------------------------------------
>
>                 Key: SPARK-29721
>                 URL: https://issues.apache.org/jira/browse/SPARK-29721
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>            Reporter: Kai Kang
>            Assignee: L. C. Hsieh
>            Priority: Major
>             Fix For: 3.0.0
>
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column pruning for nested structures. However, when explode() is called on a nested field, all columns of that nested structure are still fetched from the data source.
> We are working on a project to create a Parquet store for a large pre-joined table built from two tables that have a one-to-many relationship, and this is a blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>    {"itemId": 1, "itemData": "a"},
>    {"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true) 
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>  
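The ReadSchema values quoted in the comments above can also be checked programmatically instead of being read off the explain() output. The following is a minimal sketch, not part of the original report. It assumes a spark-shell session in which `read` from Part 2 is already defined, and that the table is scanned through the V1 file source (`FileSourceScanExec`). The helper name `readSchemas` is hypothetical, and `FileSourceScanExec` belongs to Spark's internal execution package, so it may change between versions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.FileSourceScanExec
import org.apache.spark.sql.functions.explode

// Hypothetical helper: returns the schema each file source scan in the
// physical plan requests from the reader (the "ReadSchema" in explain()).
def readSchemas(df: DataFrame): Seq[String] =
  df.queryExecution.sparkPlan.collect {
    case scan: FileSourceScanExec => scan.requiredSchema.catalogString
  }

// Assumes `read` is the DataFrame from Part 2 and nested schema pruning is enabled.
val withoutExplode = readSchemas(read.select($"items.itemId"))
val withExplode    = readSchemas(read.select(explode($"items.itemId")))

// On an affected version the two results differ: the explode() query
// also requests itemData.
println(withoutExplode.mkString("\n"))
println(withExplode.mkString("\n"))
```

Comparing the two results before and after upgrading is one way to confirm whether a given Spark build prunes the explode() case.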


