Posted to issues@spark.apache.org by "wang-zhun (Jira)" <ji...@apache.org> on 2021/12/09 09:43:00 UTC

[jira] [Created] (SPARK-37595) DatasourceV2 `exists ... select *` column push down

wang-zhun created SPARK-37595:
---------------------------------

             Summary: DatasourceV2 `exists ... select *` column push down
                 Key: SPARK-37595
                 URL: https://issues.apache.org/jira/browse/SPARK-37595
             Project: Spark
          Issue Type: Wish
          Components: SQL
    Affects Versions: 3.2.0, 3.1.2
            Reporter: wang-zhun


DataSourceV2 tables are very slow when running TPC-DS queries, because for `exists ... select *` subqueries the pruned column set is not pushed down to the data source.

To reproduce, add the following test to `org.apache.spark.sql.connector.DataSourceV2SQLSuite`:
```
test("datasourcev2 exists") {
  val t1 = s"${catalogAndNamespace}t1"
  withTable(t1) {
    sql(s"CREATE TABLE $t1 (col1 string, col2 string) USING $v2Format")
    val t2 = s"${catalogAndNamespace}t2"
    withTable(t2) {
      sql(s"CREATE TABLE $t2 (col1 string, col2 string) USING $v2Format")
      // The subquery only needs t2.col1, so only col1 should reach the t2 scan.
      // `exists` is planned as the LeftSemi join shown in the plan below.
      val query = sql(s"select * from $t1 where exists " +
        s"(select * from $t2 where t1.col1=t2.col1)").queryExecution
      // scalastyle:off println
      println(query.executedPlan)
      // scalastyle:on println
    }
  }
}

```

Executed plan:
```
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [col1#17], [col1#19], LeftSemi, BuildRight, false
   :- Project [col1#17, col2#18]
   :  +- BatchScan[col1#17, col2#18] class org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan RuntimeFilters: []
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]),false), [id=#28]
      +- Project [col1#19]
         +- BatchScan[col1#19, col2#20] class org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan RuntimeFilters: []
```

The expected scan for `t2` is `BatchScan[col1#19] class org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan RuntimeFilters: []`

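Reason: `Batch("Early Filter and Projection Push-Down"`, which contains `V2ScanRelationPushDown`, is executed before `Batch("RewriteSubquery"`, so the V2 scan is built with all columns before the subquery is rewritten into a join and its unused columns are pruned. In addition, DataSourceV2 does not go through `FileSourceStrategy`, which prunes columns for V1 file sources during physical planning.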
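
The ordering problem can be illustrated with a small toy model. This is only a sketch: `Relation`, `Scan`, `Project`, `ExistsSubquery`, `buildScan` and `rewriteSubquery` are hypothetical stand-ins, not Spark classes or rules, and the model assumes that a scan's column list is fixed once the scan has been built.

```scala
// Toy model only: none of these types are Spark classes.
sealed trait Plan
// A table reference that has not yet been turned into a scan.
case class Relation(table: String, columns: Seq[String]) extends Plan
// A scan whose column list is fixed once it has been built.
case class Scan(table: String, columns: Seq[String]) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan
// An EXISTS subquery: only the join keys are needed, but that is not
// visible as a Project until the subquery has been rewritten.
case class ExistsSubquery(joinKeys: Seq[String], child: Plan) extends Plan

object PushDownOrder {
  // Stand-in for scan push-down: prune only when a Project sits directly
  // on top of the relation at the time the scan is built.
  def buildScan(plan: Plan): Plan = plan match {
    case Project(cols, Relation(t, _)) => Project(cols, Scan(t, cols))
    case Relation(t, cols)             => Scan(t, cols)
    case Project(cols, child)          => Project(cols, buildScan(child))
    case ExistsSubquery(keys, child)   => ExistsSubquery(keys, buildScan(child))
    case s: Scan                       => s
  }

  // Stand-in for subquery rewrite plus column pruning: expose the join keys
  // as a Project over the subquery's child.
  def rewriteSubquery(plan: Plan): Plan = plan match {
    case ExistsSubquery(keys, child) => Project(keys, rewriteSubquery(child))
    case Project(cols, child)        => Project(cols, rewriteSubquery(child))
    case other                       => other
  }

  // Columns that the underlying scan (or relation) would read.
  def scannedColumns(plan: Plan): Seq[String] = plan match {
    case Scan(_, cols)            => cols
    case Relation(_, cols)        => cols
    case Project(_, child)        => scannedColumns(child)
    case ExistsSubquery(_, child) => scannedColumns(child)
  }

  def main(args: Array[String]): Unit = {
    val sub = ExistsSubquery(Seq("col1"), Relation("t2", Seq("col1", "col2")))
    // Scan built first, subquery rewritten afterwards: the scan keeps col2.
    println(scannedColumns(rewriteSubquery(buildScan(sub))).mkString(", "))
    // Subquery rewritten first: the scan is built with col1 only.
    println(scannedColumns(buildScan(rewriteSubquery(sub))).mkString(", "))
  }
}
```

In this model, building the scan first leaves `col1, col2` in the scan, which mirrors the reported plan, while rewriting the subquery first yields a scan over `col1` only, which mirrors the expected plan.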


