You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/13 13:12:49 UTC

[GitHub] [spark] peter-toth opened a new pull request, #38640: [SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

peter-toth opened a new pull request, #38640:
URL: https://github.com/apache/spark/pull/38640

   ### What changes were proposed in this pull request?
   This PR adds DSv2 plan stability tests to be able to track plan changes with new Spark changes.
   
   ### Why are the changes needed?
   We see a few issues with TPVDS DSv2 plans that needs fixes in the future. But, as a first step it would be good to track the plan changes.
   Please note that currently:
   - q14a, q14b, q38, q87 are removed from `TPCDSV1_4_V2PlanStabilitySuite`
   - q14, q14a are from `TPCDSV2_7_V2PlanStabilitySuite`
   
   as those queries would fail in`PushDownLeftSemiAntiJoin`. That is because the rule decides on pusing down joins over an aggregate if the join can be planned as a broadcast join, but the optimization need stats to be available. As the batch `Early Filter and Projection Push-Down` hasn't constructed the V2 Scans,`DataSourceV2Relation.computeStats()` throws an exception due to missing accurate stats.
   
   ### Does this PR introduce _any_ user-facing change?
   'No'
   
   ### How was this patch tested?
   This PR adds only new UTs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #38640: [WIP][SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
peter-toth commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1327229997

   I see your point, there are many commands that needs to be tested...
   
   How about we just enable v2 file sources for read paths first? https://github.com/apache/spark/pull/38640/commits/123103ffd49dce14a088ab1615d7c2159c494651


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #38640: [SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
peter-toth commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1313612631

   > This is great! Is it using v2 parquet?
   
   Yes it is, like other stability tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #38640: [SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
peter-toth commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1313560978

   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a diff in pull request #38640: [SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
peter-toth commented on code in PR #38640:
URL: https://github.com/apache/spark/pull/38640#discussion_r1022701848


##########
sql/core/src/test/scala/org/apache/spark/sql/PlanStabilitySuite.scala:
##########
@@ -351,6 +353,62 @@ class TPCDSModifiedPlanStabilityWithStatsSuite extends PlanStabilitySuite with T
   }
 }
 
+trait V2TPCDSBase extends TPCDSBase {
+  override protected def sparkConf: SparkConf =
+    super.sparkConf.set("spark.sql.catalog.testcat", classOf[InMemoryCatalog].getName)

Review Comment:
   You are right. Admittedly, I'm a bit lost when it comes to DSv2...
   I've just added `ParquetV2TableCatalog` for this test in: https://github.com/apache/spark/pull/38640/commits/fc9d086c10499446ec5f095368a1c06e803e9455



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #38640: [WIP][SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1592155345

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #38640: [WIP][SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
peter-toth commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1335242766

   @cloud-fan I've extracted the commit that enables using v2 file tables in session catalog into a separete PR: https://github.com/apache/spark/pull/38885, please take a look at it if you have some time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38640: [SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1316622905

   the in-memory v2 table has very limited pushdown abilities (it was created for tests) and I'm not sure it's worthwhile to add plan golden files for it.
   
   Can we keep this PR open and fix the issues we discovered? We can commit the golden files when v2 parquet table is available.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #38640: [SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
peter-toth commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1317171823

   
   > the in-memory v2 table has very limited pushdown abilities (it was created for tests) and I'm not sure it's worthwhile to add plan golden files for it.
   > 
   > Can we keep this PR open and fix the issues we discovered? We can commit the golden files when v2 parquet table is available.
   
   Sure, we can.
   Actually, I'm happy to work on making parquet v2 tables available in a separate ticket/PR if you can give my some guidance. We need something like this commit: https://github.com/apache/spark/pull/38640/commits/426a5086c46ed4bb6ddffa0a77b710d363494619, do we?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38640: [SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1313585893

   This is great! Is it using v2 parquet?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38640: [SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38640:
URL: https://github.com/apache/spark/pull/38640#discussion_r1022316324


##########
sql/core/src/test/scala/org/apache/spark/sql/PlanStabilitySuite.scala:
##########
@@ -351,6 +353,62 @@ class TPCDSModifiedPlanStabilityWithStatsSuite extends PlanStabilitySuite with T
   }
 }
 
+trait V2TPCDSBase extends TPCDSBase {
+  override protected def sparkConf: SparkConf =
+    super.sparkConf.set("spark.sql.catalog.testcat", classOf[InMemoryCatalog].getName)

Review Comment:
   seems this is testing in memory v2 table, but not parquet v2?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38640: [WIP][SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1327051116

   This looks promising, please make sure all the behaviors are well defined: ALTER TABLE, REFRESH TABLE, df.write.mode("overwrite").saveAsTable, etc.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #38640: [SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
peter-toth commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1312731989

   Side note: There seem to be many issues with V2 Scan construction in batch `Early Filter and Projection Push-Down` currently. As that is the only place where the scans are constructed the batch seems to be:
   - too late for `PushDownLeftSemiAntiJoin` (see the issue above with some of the queries)
   - and too early for `RewritePredicateSubquery` + `ColumnPruning` in batch `RewriteSubquery` as obviuosly the scans should be (re)constructed after the latest `ColumnPruning` or filter modifying rule to not return unecessary data. (This actually causes serious performance degradation in q94, q16 with DSv2 sources.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38640: [SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1315299552

   Ah I just realized that there is no way to use v2 parquet table today. Shall we support it first before benchmarking it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38640: [SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1315256115

   @peter-toth there is an easy way to enable parquet v2: set `spark.sql.sources.useV1SourceList` to empty.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38640: [WIP][SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1325947062

   > Actually, I'm happy to work on making parquet v2 tables available in a separate ticket/PR if you can give my some guidance.
   
   I tried to do it long time ago but failed as there are some design issues. We need to fully understand the use cases of `CREATE TABLE ... USING v1Source` and see how to make it work for v2 sources:
   1. Just a name mapping, so that people can use table name instead of providing all the data source information every time. JDBC data source is a good example of it.
   2. Schema cache. The data source may have a large cost to infer the data schema and need the Spark catalog to cache it. File source is a good example here.
   
   We also need to think about the semantic of `ALTER TABLE`, `REFRESH TABLE`, `df.write.mode("overwrite").saveAsTable`, etc.
   
   Some code references. For v1 source, we have a rule `FindDataSourceTable` to resolve table with v1 source. For v2 source, we probably should have a similar rule to resolve v2 source to `TableProvider`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #38640: [WIP][SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
peter-toth commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1326507386

   @cloud-fan, thanks for the details!
   
   In this PR I'm focusing only on file source v2 implementation. I've found that for v1 sources are resolved in `FindDataSourceTable` but why do we need a new rule for v2? I thought that v2 tables should be resolved in `ResolveRelations`.
   
   Basically what I was trying to do in the first commit of this PR: https://github.com/apache/spark/pull/38640/commits/efe4311a150c4dbee0c09dec9e409bca76cfdc25 (and probably it should be extracted to a separate ticket/PR if it works) is to:
   - enable v2 file sources in `ResolveSessionCatalog.isV2Provider()` to not construct v1 commands
   - modify `V2SessionCatalog.loadTable()` to return with `FileTabes`s (instead of `V1Table`s) for v2 file sources to let `ResolveRelations` resolve v2 file tables
   - modify `FileTable` implementations to optionally contain `CatalogTable`s and implement `V2TableWithV1Fallback` to provide v1 fallback if needed (streaming usecases) and `FileTable`s file index and schema can be constructed from the catalog table
   - split `SQLQuerySuite` into v1 and v2 versions to add various v2 tests
   - adjust some of the existing v2 tests
   
   Do you think this is a viable solution? Or did I miss something?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #38640: [SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
peter-toth commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1315310978

   > Ah I just realized that there is no way to use v2 parquet table today. Shall we support it first before benchmarking it?
   
   I'm ok with switching this benchmark to parquet v2 when it becomes available. But do you think the in memory v2 version of this PR can go in now? As it also shows the general issues (https://github.com/apache/spark/pull/38640#issuecomment-1312731989) with DSv2 optimization.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #38640: [SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by GitBox <gi...@apache.org>.
peter-toth commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1315293892

   > @peter-toth there is an easy way to enable parquet v2: set `spark.sql.sources.useV1SourceList` to empty.
   
   I thought that config only controls sources when `spark.read.parquet(...)` is used.
   
   Seemingly DSv2 sources are not used when I try to use it through the default catalog:
   ```
   bin/spark-shell --conf "spark.sql.sources.useV1SourceList="
   
   scala> sql("create table t(id int) using parquet")
   res0: org.apache.spark.sql.DataFrame = []
   
   scala> sql("select * from t").explain(true)
   == Parsed Logical Plan ==
   'Project [*]
   +- 'UnresolvedRelation [t], [], false
   
   == Analyzed Logical Plan ==
   id: int
   Project [id#21]
   +- SubqueryAlias spark_catalog.default.t
      +- Relation spark_catalog.default.t[id#21] parquet
   
   == Optimized Logical Plan ==
   Relation spark_catalog.default.t[id#21] parquet
   
   == Physical Plan ==
   *(1) ColumnarToRow
   +- FileScan parquet spark_catalog.default.t[id#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/petertoth/git/apache/spark/spark-warehouse/t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
   ```
   Or did I get it wrong?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #38640: [WIP][SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #38640: [WIP][SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites
URL: https://github.com/apache/spark/pull/38640


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org