Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/04/09 06:34:28 UTC

[GitHub] [spark] francis0407 opened a new pull request #24321: SPARK-27411: DataSourceV2Strategy should not eliminate subquery

URL: https://github.com/apache/spark/pull/24321
 
 
   ## What changes were proposed in this pull request?
   
   In DataSourceV2Strategy, it seems we eliminate subqueries by mistake after normalizing the filters.
   Consider a SQL query with a scalar subquery:
   
   ``` scala
   val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)")
   plan.explain(true)
   ```
   
   DataSourceV2Strategy then logs:
   ```
   Pushing operators to csv:examples/src/main/resources/t2.txt
   Pushed Filters: 
   Post-Scan Filters: isnotnull(t2a#30)
   Output: t2a#30, t2b#31
   ```
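   
   The dropping step can be reproduced with a toy model (every name here — `Expr`, `hasSubquery`, `ScalarSubquery` — is a simplified stand-in for Catalyst's classes, not Spark's real API): filters containing a subquery are filtered out while normalizing and never added back, so they vanish from the post-scan set.
   
   ``` scala
   // Toy model of the bug; all names are simplified stand-ins, not Spark's API.
   object SubqueryDrop extends App {
     sealed trait Expr
     case class IsNotNull(col: String) extends Expr
     case class GreaterThan(col: String, right: Expr) extends Expr
     case class ScalarSubquery(sql: String) extends Expr
   
     def hasSubquery(e: Expr): Boolean = e match {
       case ScalarSubquery(_)     => true
       case GreaterThan(_, right) => hasSubquery(right)
       case _                     => false
     }
   
     val filters: Seq[Expr] = Seq(
       IsNotNull("t2a"),
       GreaterThan("t2a", ScalarSubquery("select max(t1a) from t1")))
   
     // Subquery filters are stripped during normalization and never restored,
     // so only isnotnull(t2a) survives as a post-scan filter.
     val postScanFilters = filters.filterNot(hasSubquery)
     println(s"Post-Scan Filters: ${postScanFilters.mkString(", ")}")
   }
   ```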
   
   The `Post-Scan Filters` should contain the scalar-subquery predicate, but it is eliminated by mistake:
   ```
   == Parsed Logical Plan ==
   'Project [*]
   +- 'Filter ('t2a > scalar-subquery#56 [])
      :  +- 'Project [unresolvedalias('max('t1a), None)]
      :     +- 'UnresolvedRelation `t1`
      +- 'UnresolvedRelation `t2`
   
   == Analyzed Logical Plan ==
   t2a: string, t2b: string
   Project [t2a#30, t2b#31]
   +- Filter (t2a#30 > scalar-subquery#56 [])
      :  +- Aggregate [max(t1a#13) AS max(t1a)#63]
      :     +- SubqueryAlias `t1`
      :        +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
      +- SubqueryAlias `t2`
         +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt
   
   == Optimized Logical Plan ==
   Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 []))
   :  +- Aggregate [max(t1a#13) AS max(t1a)#63]
   :     +- Project [t1a#13]
   :        +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
   +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt
   
   == Physical Plan ==
   *(1) Project [t2a#30, t2b#31]
   +- *(1) Filter isnotnull(t2a#30)
      +- *(1) BatchScan[t2a#30, t2b#31] class org.apache.spark.sql.execution.datasources.v2.csv.CSVScan
   ```
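   
   A plausible shape for the fix, using the same hypothetical toy model (not Spark's real API): partition the subquery filters out before the push-down split, then append them back to the post-scan filters so the `Filter` node keeps the predicate.
   
   ``` scala
   // Sketch of the fix under the same toy model; names are hypothetical.
   object SubqueryKeep extends App {
     sealed trait Expr
     case class IsNotNull(col: String) extends Expr
     case class GreaterThan(col: String, right: Expr) extends Expr
     case class ScalarSubquery(sql: String) extends Expr
   
     def hasSubquery(e: Expr): Boolean = e match {
       case ScalarSubquery(_)     => true
       case GreaterThan(_, right) => hasSubquery(right)
       case _                     => false
     }
   
     // Assume the source accepts no filters, as in the CSV example above.
     def pushable(e: Expr): Boolean = false
   
     val filters: Seq[Expr] = Seq(
       IsNotNull("t2a"),
       GreaterThan("t2a", ScalarSubquery("select max(t1a) from t1")))
   
     // Only subquery-free filters are push-down candidates...
     val (withSubquery, candidates) = filters.partition(hasSubquery)
     val (pushed, leftover) = candidates.partition(pushable)
     // ...and the subquery filters are appended back to the post-scan set
     // instead of being discarded.
     val postScanFilters = leftover ++ withSubquery
   
     println(s"Pushed Filters: ${pushed.mkString(", ")}")
     println(s"Post-Scan Filters: ${postScanFilters.mkString(", ")}")
   }
   ```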
   ## How was this patch tested?
   
   Unit test.
   
