Posted to issues@spark.apache.org by "Kai Kang (Jira)" <ji...@apache.org> on 2019/11/02 05:15:00 UTC
[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode
[ https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kai Kang updated SPARK-29721:
-----------------------------
Description:
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column pruning for nested structures. However, when explode() is called on a nested field, all columns of that nested structure are still fetched from the data source.
We are working on a project that creates a Parquet store for a large pre-joined table built from two tables with a one-to-many relationship, and this is a blocking issue for us.
The following code illustrates the issue.
Part 1: loading some nested data
{quote}{{import spark.implicits._}}
val jsonStr = """{
  "items": [
    {
      "itemId": 1,
      "itemData": "a"
    },
    {
      "itemId": 1,
      "itemData": "b"
    }
  ]
}"""
{{val df = spark.read.json(Seq(jsonStr).toDS)}}
{{df.write.format("parquet").mode("overwrite").saveAsTable("persisted")}}
{quote}
Part 2: reading it back and explaining the queries
{quote}import org.apache.spark.sql.functions.explode
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
{quote}
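One way to confirm what the scan actually reads is to look at the ReadSchema entry Spark prints in the FileScan node of the physical plan. The following is a minimal sketch, assuming the "persisted" table from Part 1 and an active SparkSession named {{spark}} (e.g. in spark-shell); the ReadSchema strings in the comments are illustrative of the expected shape, not verbatim output:

```scala
// Sketch: compare the Parquet ReadSchema of the pruned and unpruned queries.
// Assumes the "persisted" table from Part 1 and a SparkSession `spark`.
import org.apache.spark.sql.functions.explode
import spark.implicits._

spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
val read = spark.table("persisted")

// Pruned: the FileScan's ReadSchema should mention only itemId, roughly:
//   ReadSchema: struct<items:array<struct<itemId:bigint>>>
read.select($"items.itemId").explain()

// Not pruned (this issue): itemData appears in ReadSchema as well, roughly:
//   ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
read.select(explode($"items.itemId")).explain()
```

Inspecting ReadSchema this way is only a debugging aid; the underlying fix has to happen in the optimizer's nested schema pruning rule.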
was:
This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column pruning for nested structures. However, when explode() is called on a nested field, all columns of that nested structure are still fetched from the data source.
We are working on a project that creates a Parquet store for a large pre-joined table built from two tables with a one-to-many relationship, and this is a blocking issue for us.
The following code illustrates the issue.
Part 1: loading some nested data
{quote}{{import spark.implicits._}}
{{val jsonStr = """{}}
{{  "items": [}}
{{    {}}
{{      "itemId": 1,}}
{{      "itemData": "a"}}
{{    },}}
{{    {}}
{{      "itemId": 1,}}
{{      "itemData": "b"}}
{{    }}}}
{{  ]}"""}}
{{val df = spark.read.json(Seq(jsonStr).toDS)}}
{{df.write.format("parquet").mode("overwrite").saveAsTable("persisted")}}
{quote}
Part 2: reading it back and explaining the queries
{quote}import org.apache.spark.sql.functions.explode
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
read.select($"items.itemId").explain(true) // pruned, only loading itemId
read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
{quote}
> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --------------------------------------------------------------------------
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: Kai Kang
> Priority: Critical
>
> This is a follow-up to SPARK-4502. SPARK-4502 correctly addressed column pruning for nested structures. However, when explode() is called on a nested field, all columns of that nested structure are still fetched from the data source.
> We are working on a project that creates a Parquet store for a large pre-joined table built from two tables with a one-to-many relationship, and this is a blocking issue for us.
>
> The following code illustrates the issue.
> Part 1: loading some nested data
> {quote}{{import spark.implicits._}}
> val jsonStr = """{
>   "items": [
>     {
>       "itemId": 1,
>       "itemData": "a"
>     },
>     {
>       "itemId": 1,
>       "itemData": "b"
>     }
>   ]
> }"""
> {{val df = spark.read.json(Seq(jsonStr).toDS)}}
> {{df.write.format("parquet").mode("overwrite").saveAsTable("persisted")}}
> {quote}
> Part 2: reading it back and explaining the queries
> {quote}import org.apache.spark.sql.functions.explode
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> read.select($"items.itemId").explain(true) // pruned, only loading itemId
> read.select(explode($"items.itemId")).explain(true) // not pruned, loading both itemId and itemData
> {quote}
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org