Posted to reviews@spark.apache.org by "grundprinzip (via GitHub)" <gi...@apache.org> on 2023/07/20 22:24:13 UTC

[GitHub] [spark] grundprinzip opened a new pull request, #42099: [SPARK-44505][DSv2] Provide override for columnar support in Scan

grundprinzip opened a new pull request, #42099:
URL: https://github.com/apache/spark/pull/42099

   ### What changes were proposed in this pull request?
   Previously, when a new DSv2 data source is implemented, planning will always call `BatchScanExec.supportsColumnar`, which in turn iterates over all input partitions to check whether they support columnar execution or not.
   
   When the `planInputPartitions` method is expensive, this can be problematic. This patch adds an option to the Scan interface that lets an implementation specify the columnar support mode up front. For backward compatibility the default provided by the Scan interface is `PARTITION_DEFINED`, but a Scan can override it.
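   
   As a rough, hypothetical sketch (not part of the PR itself), a connector whose scan never produces columnar batches could short-circuit that check like this; it uses the `columnarSupportMode`/`Scan.ColumnarSupportMode` names the review below converges on, and the class name is invented:
   
   ```scala
   // Hypothetical DSv2 scan that opts out of per-partition columnar detection,
   // so the planner never has to call the potentially expensive
   // planInputPartitions() just to answer the columnar question.
   import org.apache.spark.sql.connector.read.{Batch, Scan}
   import org.apache.spark.sql.types.StructType
   
   class RowBasedScan(schema: StructType, batch: Batch) extends Scan {
     override def readSchema(): StructType = schema
     override def toBatch: Batch = batch
   
     override def columnarSupportMode(): Scan.ColumnarSupportMode =
       Scan.ColumnarSupportMode.UNSUPPORTED
   }
   ```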
   
   ### Why are the changes needed?
   Avoids costly partition planning during explain.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Added new UT.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #42099:
URL: https://github.com/apache/spark/pull/42099#issuecomment-1653563500

   thanks, merging to master/3.5 (fixes a perf bug)




[GitHub] [spark] LuciferYang commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1271399150


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Scan.java:
##########
@@ -125,4 +125,26 @@ default CustomMetric[] supportedCustomMetrics() {
   default CustomTaskMetric[] reportDriverMetrics() {
     return new CustomTaskMetric[]{};
   }
+
+  /**
+   * This enum defines how the columnar support for the partitions of the data source
+   * should be determined. The default value is `PARTITION_DEFINED` which indicates that each
+   * partition can deterine if it should be columnar or not. SUPPORTED and UNSUPPORTED provide
+   * default shortcuts to indicate support for columnar data or not.
+   *
+   * @since 3.5.0
+   */
+  enum ColumnarSupportType {
+    PARTITION_DEFINED,
+    SUPPORTED,
+    UNSUPPORTED
+  }
+
+  /**
+   * Subclasses can implement this method to indicate if the support for columnar data should
+   * be determined by each partition or is set as a default for the whole scan.
+   *
+   * @since 3.5.0
+   */
+  default ColumnarSupportType supportsColumnar() { return ColumnarSupportType.PARTITION_DEFINED; }

Review Comment:
   Yeah, `columnarSupportMode` is fine to me



##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Scan.java:
##########
@@ -125,4 +125,26 @@ default CustomMetric[] supportedCustomMetrics() {
   default CustomTaskMetric[] reportDriverMetrics() {
     return new CustomTaskMetric[]{};
   }
+
+  /**
+   * This enum defines how the columnar support for the partitions of the data source
+   * should be determined. The default value is `PARTITION_DEFINED` which indicates that each
+   * partition can deterine if it should be columnar or not. SUPPORTED and UNSUPPORTED provide
+   * default shortcuts to indicate support for columnar data or not.
+   *
+   * @since 3.5.0
+   */
+  enum ColumnarSupportType {
+    PARTITION_DEFINED,
+    SUPPORTED,
+    UNSUPPORTED
+  }
+
+  /**
+   * Subclasses can implement this method to indicate if the support for columnar data should
+   * be determined by each partition or is set as a default for the whole scan.
+   *
+   * @since 3.5.0
+   */
+  default ColumnarSupportType supportsColumnar() { return ColumnarSupportType.PARTITION_DEFINED; }

Review Comment:
   Yeah, `columnarSupportMode` is fine to me





[GitHub] [spark] grundprinzip commented on pull request #42099: [SPARK-44505][DSv2] Provide override for columnar support in Scan

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on PR #42099:
URL: https://github.com/apache/spark/pull/42099#issuecomment-1644701265

   @hvanhovell PTAL if you have some time.




[GitHub] [spark] grundprinzip commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1270685014


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Scan.java:
##########
@@ -125,4 +125,26 @@ default CustomMetric[] supportedCustomMetrics() {
   default CustomTaskMetric[] reportDriverMetrics() {
     return new CustomTaskMetric[]{};
   }
+
+  /**
+   * This enum defines how the columnar support for the partitions of the data source
+   * should be determined. The default value is `PARTITION_DEFINED` which indicates that each
+   * partition can deterine if it should be columnar or not. SUPPORTED and UNSUPPORTED provide
+   * default shortcuts to indicate support for columnar data or not.
+   *
+   * @since 3.5.0
+   */
+  enum ColumnarSupportType {
+    PARTITION_DEFINED,
+    SUPPORTED,
+    UNSUPPORTED
+  }
+
+  /**
+   * Subclasses can implement this method to indicate if the support for columnar data should
+   * be determined by each partition or is set as a default for the whole scan.
+   *
+   * @since 3.5.0
+   */
+  default ColumnarSupportType supportsColumnar() { return ColumnarSupportType.PARTITION_DEFINED; }

Review Comment:
   Renamed the method and enum type accordingly.





[GitHub] [spark] cloud-fan commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1270667366


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Scan.java:
##########
@@ -125,4 +125,26 @@ default CustomMetric[] supportedCustomMetrics() {
   default CustomTaskMetric[] reportDriverMetrics() {
     return new CustomTaskMetric[]{};
   }
+
+  /**
+   * This enum defines how the columnar support for the partitions of the data source
+   * should be determined. The default value is `PARTITION_DEFINED` which indicates that each
+   * partition can deterine if it should be columnar or not. SUPPORTED and UNSUPPORTED provide
+   * default shortcuts to indicate support for columnar data or not.
+   *
+   * @since 3.5.0
+   */
+  enum ColumnarSupportType {
+    PARTITION_DEFINED,
+    SUPPORTED,
+    UNSUPPORTED
+  }
+
+  /**
+   * Subclasses can implement this method to indicate if the support for columnar data should
+   * be determined by each partition or is set as a default for the whole scan.
+   *
+   * @since 3.5.0
+   */
+  default ColumnarSupportType supportsColumnar() { return ColumnarSupportType.PARTITION_DEFINED; }

Review Comment:
   then how about we name it `columnarSupportMode`?





[GitHub] [spark] cloud-fan commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1270668141


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:
##########
@@ -388,6 +388,28 @@ class DataSourceV2Suite extends QueryTest with SharedSparkSession with AdaptiveS
     assert(df.queryExecution.executedPlan.collect { case e: Exchange => e }.isEmpty)
   }
 
+  test("SPARK-44505: should not call planInputPartitions() on explain") {
+    val df = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)
+      .option("columnar", "PARTITION_DEFINED").load()
+    // Default mode will throw an exception on explain.
+    var ex = intercept[IllegalArgumentException](df.explain())
+    assert(ex.getMessage == "planInputPartitions must not be called")
+
+    val dfScan = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)
+      .option("columnar", "SUPPORTED").load()
+    dfScan.explain()
+    // Will fail during regular execution.
+    ex = intercept[IllegalArgumentException](dfScan.count())
+    assert(ex.getMessage == "planInputPartitions must not be called")
+
+    val dfScanUnsupported = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)
+      .option("columnar", "SUPPORTED").load()

Review Comment:
   ```suggestion
         .option("columnar", "UNSUPPORTED").load()
   ```





[GitHub] [spark] grundprinzip commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1270355484


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:
##########
@@ -388,6 +388,24 @@ class DataSourceV2Suite extends QueryTest with SharedSparkSession with AdaptiveS
     assert(df.queryExecution.executedPlan.collect { case e: Exchange => e }.isEmpty)
   }
 
+  test("SPARK-44505: should not call planInputPartitions() on explain") {
+    val df = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)
+      .option("columnar", "PARTITION_DEFINED").load()
+    // Default mode will throw an exception on explain.
+    intercept[IllegalArgumentException](df.explain())
+    val df_scan = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)

Review Comment:
   Done





[GitHub] [spark] grundprinzip commented on pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on PR #42099:
URL: https://github.com/apache/spark/pull/42099#issuecomment-1646771365

   @cloud-fan unfortunately, I found another issue where explain triggers checking the input partitions, and it is not as easily solvable: when `supportsColumnar` becomes false, we trigger a check in `EnsureRequirements` that the output distribution is compatible, which in turn triggers resolving the input partitions.




[GitHub] [spark] cloud-fan closed pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2
URL: https://github.com/apache/spark/pull/42099




[GitHub] [spark] LuciferYang commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1270208568


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:
##########
@@ -388,6 +388,24 @@ class DataSourceV2Suite extends QueryTest with SharedSparkSession with AdaptiveS
     assert(df.queryExecution.executedPlan.collect { case e: Exchange => e }.isEmpty)
   }
 
+  test("SPARK-44505: should not call planInputPartitions() on explain") {
+    val df = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)
+      .option("columnar", "PARTITION_DEFINED").load()
+    // Default mode will throw an exception on explain.
+    intercept[IllegalArgumentException](df.explain())
+    val df_scan = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)
+      .option("columnar", "SUPPORTED").load()
+    df_scan.explain()
+    // Will fail during regular exeuction
+    intercept[IllegalArgumentException](df_scan.count())
+
+    val df_scan_unsupported = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)

Review Comment:
   ditto





[GitHub] [spark] grundprinzip commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1270355197


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:
##########
@@ -388,6 +388,24 @@ class DataSourceV2Suite extends QueryTest with SharedSparkSession with AdaptiveS
     assert(df.queryExecution.executedPlan.collect { case e: Exchange => e }.isEmpty)
   }
 
+  test("SPARK-44505: should not call planInputPartitions() on explain") {
+    val df = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)
+      .option("columnar", "PARTITION_DEFINED").load()
+    // Default mode will throw an exception on explain.
+    intercept[IllegalArgumentException](df.explain())

Review Comment:
   Done.



##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:
##########
@@ -388,6 +388,24 @@ class DataSourceV2Suite extends QueryTest with SharedSparkSession with AdaptiveS
     assert(df.queryExecution.executedPlan.collect { case e: Exchange => e }.isEmpty)
   }
 
+  test("SPARK-44505: should not call planInputPartitions() on explain") {
+    val df = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)
+      .option("columnar", "PARTITION_DEFINED").load()
+    // Default mode will throw an exception on explain.
+    intercept[IllegalArgumentException](df.explain())
+    val df_scan = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)
+      .option("columnar", "SUPPORTED").load()
+    df_scan.explain()
+    // Will fail during regular exeuction
+    intercept[IllegalArgumentException](df_scan.count())
+
+    val df_scan_unsupported = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)

Review Comment:
   Done





[GitHub] [spark] cloud-fan commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1270666646


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Scan.java:
##########
@@ -125,4 +125,26 @@ default CustomMetric[] supportedCustomMetrics() {
   default CustomTaskMetric[] reportDriverMetrics() {
     return new CustomTaskMetric[]{};
   }
+
+  /**
+   * This enum defines how the columnar support for the partitions of the data source
+   * should be determined. The default value is `PARTITION_DEFINED` which indicates that each
+   * partition can deterine if it should be columnar or not. SUPPORTED and UNSUPPORTED provide

Review Comment:
   ```suggestion
      * partition can determine if it should be columnar or not. SUPPORTED and UNSUPPORTED provide
   ```





[GitHub] [spark] grundprinzip commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1276040206


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlignAssignmentsSuiteBase.scala:
##########
@@ -147,7 +147,7 @@ abstract class AlignAssignmentsSuiteBase extends AnalysisTest {
   private val v2Catalog = {
     val newCatalog = mock(classOf[TableCatalog])
     when(newCatalog.loadTable(any())).thenAnswer((invocation: InvocationOnMock) => {
-      val ident = invocation.getArgument[Identifier](0)
+      val ident = invocation.getArguments()(0).asInstanceOf[Identifier]

Review Comment:
   I encountered this in testing, as you said. The PR that I mentioned in the description fixes it the same way. Personally, I feel that it leaves the code better than it was before.





[GitHub] [spark] LuciferYang commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1270213549


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Scan.java:
##########
@@ -125,4 +125,26 @@ default CustomMetric[] supportedCustomMetrics() {
   default CustomTaskMetric[] reportDriverMetrics() {
     return new CustomTaskMetric[]{};
   }
+
+  /**
+   * This enum defines how the columnar support for the partitions of the data source
+   * should be determined. The default value is `PARTITION_DEFINED` which indicates that each
+   * partition can deterine if it should be columnar or not. SUPPORTED and UNSUPPORTED provide
+   * default shortcuts to indicate support for columnar data or not.
+   *
+   * @since 3.5.0
+   */
+  enum ColumnarSupportType {
+    PARTITION_DEFINED,
+    SUPPORTED,
+    UNSUPPORTED
+  }
+
+  /**
+   * Subclasses can implement this method to indicate if the support for columnar data should
+   * be determined by each partition or is set as a default for the whole scan.
+   *
+   * @since 3.5.0
+   */
+  default ColumnarSupportType supportsColumnar() { return ColumnarSupportType.PARTITION_DEFINED; }

Review Comment:
   My initial impression is that the return value of this method should be a Boolean, since other methods or functions in Spark prefixed with `support(s)` usually return one.





[GitHub] [spark] LuciferYang commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1270215070


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:
##########
@@ -388,6 +388,24 @@ class DataSourceV2Suite extends QueryTest with SharedSparkSession with AdaptiveS
     assert(df.queryExecution.executedPlan.collect { case e: Exchange => e }.isEmpty)
   }
 
+  test("SPARK-44505: should not call planInputPartitions() on explain") {
+    val df = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)
+      .option("columnar", "PARTITION_DEFINED").load()
+    // Default mode will throw an exception on explain.
+    intercept[IllegalArgumentException](df.explain())

Review Comment:
   I think it would be better to add assertions for the exception content, if possible.
   
   





[GitHub] [spark] grundprinzip commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1270672439


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Scan.java:
##########
@@ -125,4 +125,26 @@ default CustomMetric[] supportedCustomMetrics() {
   default CustomTaskMetric[] reportDriverMetrics() {
     return new CustomTaskMetric[]{};
   }
+
+  /**
+   * This enum defines how the columnar support for the partitions of the data source
+   * should be determined. The default value is `PARTITION_DEFINED` which indicates that each
+   * partition can deterine if it should be columnar or not. SUPPORTED and UNSUPPORTED provide
+   * default shortcuts to indicate support for columnar data or not.
+   *
+   * @since 3.5.0
+   */
+  enum ColumnarSupportType {
+    PARTITION_DEFINED,
+    SUPPORTED,
+    UNSUPPORTED
+  }
+
+  /**
+   * Subclasses can implement this method to indicate if the support for columnar data should
+   * be determined by each partition or is set as a default for the whole scan.
+   *
+   * @since 3.5.0
+   */
+  default ColumnarSupportType supportsColumnar() { return ColumnarSupportType.PARTITION_DEFINED; }

Review Comment:
   Will do!





[GitHub] [spark] LuciferYang commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1276135736


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlignAssignmentsSuiteBase.scala:
##########
@@ -147,7 +147,7 @@ abstract class AlignAssignmentsSuiteBase extends AnalysisTest {
   private val v2Catalog = {
     val newCatalog = mock(classOf[TableCatalog])
     when(newCatalog.loadTable(any())).thenAnswer((invocation: InvocationOnMock) => {
-      val ident = invocation.getArgument[Identifier](0)
+      val ident = invocation.getArguments()(0).asInstanceOf[Identifier]

Review Comment:
   OK, fine to me





[GitHub] [spark] LuciferYang commented on pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #42099:
URL: https://github.com/apache/spark/pull/42099#issuecomment-1646770535

   @grundprinzip please fix the scalastyle, then we can merge this one, thanks ~
   
   ```
   [error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:32:0: org.apache.spark.sql.connector.read._ is in wrong order relative to org.apache.spark.sql.connector.read.Scan.ColumnarSupportMode.
   ```




[GitHub] [spark] grundprinzip commented on pull request #42099: [SPARK-44505][DSv2] Provide override for columnar support in Scan

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on PR #42099:
URL: https://github.com/apache/spark/pull/42099#issuecomment-1644699847

   @cloud-fan PTAL! Thanks




[GitHub] [spark] grundprinzip commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.
grundprinzip commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1270303330


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Scan.java:
##########
@@ -125,4 +125,26 @@ default CustomMetric[] supportedCustomMetrics() {
   default CustomTaskMetric[] reportDriverMetrics() {
     return new CustomTaskMetric[]{};
   }
+
+  /**
+   * This enum defines how the columnar support for the partitions of the data source
+   * should be determined. The default value is `PARTITION_DEFINED` which indicates that each
+   * partition can deterine if it should be columnar or not. SUPPORTED and UNSUPPORTED provide
+   * default shortcuts to indicate support for columnar data or not.
+   *
+   * @since 3.5.0
+   */
+  enum ColumnarSupportType {
+    PARTITION_DEFINED,
+    SUPPORTED,
+    UNSUPPORTED
+  }
+
+  /**
+   * Subclasses can implement this method to indicate if the support for columnar data should
+   * be determined by each partition or is set as a default for the whole scan.
+   *
+   * @since 3.5.0
+   */
+  default ColumnarSupportType supportsColumnar() { return ColumnarSupportType.PARTITION_DEFINED; }

Review Comment:
   it can't be a boolean because there are three different values. 
   
   Since this is used as an override, we must preserve today's behavior for backwards compatibility. The backward compatible behavior is to iterate over all partitions to check. 
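   
   For illustration, here is a rough sketch (not the actual `BatchScanExec` code) of how the planner side can consume the three values, written against the renamed `columnarSupportMode`/`ColumnarSupportMode` API; only the backward-compatible `PARTITION_DEFINED` case has to plan the input partitions:
   
   ```scala
   import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory, Scan}
   
   object ColumnarSupportSketch {
     // `inputPartitions` is by-name so the listing only happens when it is needed.
     def scanSupportsColumnar(
         scan: Scan,
         inputPartitions: => Seq[InputPartition],
         readerFactory: PartitionReaderFactory): Boolean = {
       scan.columnarSupportMode() match {
         case Scan.ColumnarSupportMode.SUPPORTED => true
         case Scan.ColumnarSupportMode.UNSUPPORTED => false
         case Scan.ColumnarSupportMode.PARTITION_DEFINED =>
           // Backward-compatible default: ask every partition, which forces planning them.
           inputPartitions.forall(p => readerFactory.supportColumnarReads(p))
       }
     }
   }
   ```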





[GitHub] [spark] cloud-fan commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1270668613


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:
##########
@@ -686,6 +708,26 @@ class SimpleSinglePartitionSource extends TestingV2Source {
   }
 }
 
+class ScanDefinedColumnarSupport extends TestingV2Source {
+
+  class MyScanBuilder(st: Scan.ColumnarSupportType) extends SimpleScanBuilder {
+    override def planInputPartitions(): Array[InputPartition] = {
+      throw new IllegalArgumentException("planInputPartitions must not be called")
+    }
+
+    override def supportsColumnar()
+        : Scan.ColumnarSupportType = st

Review Comment:
   ```suggestion
       override def supportsColumnar(): Scan.ColumnarSupportType = st
   ```





[GitHub] [spark] LuciferYang commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1270208487


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:
##########
@@ -388,6 +388,24 @@ class DataSourceV2Suite extends QueryTest with SharedSparkSession with AdaptiveS
     assert(df.queryExecution.executedPlan.collect { case e: Exchange => e }.isEmpty)
   }
 
+  test("SPARK-44505: should not call planInputPartitions() on explain") {
+    val df = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)
+      .option("columnar", "PARTITION_DEFINED").load()
+    // Default mode will throw an exception on explain.
+    intercept[IllegalArgumentException](df.explain())
+    val df_scan = spark.read.format(classOf[ScanDefinedColumnarSupport].getName)

Review Comment:
   should use camel case





[GitHub] [spark] cloud-fan commented on pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #42099:
URL: https://github.com/apache/spark/pull/42099#issuecomment-1649026633

   I think the problem is
   ```
       if (partitions.length == 1) {
         SinglePartition
       }
   ```
   
   For `SinglePartition`, the perf doesn't matter as the data is small and we only use one task to process it. It isn't worth the cost of triggering the partition listing during planning. We should just remove this if branch.
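   
   As a hedged sketch of that interaction (approximate names, not the actual `DataSourceV2ScanExecBase` source): because `partitions` comes from `planInputPartitions()`, even an explain() that only needs the output partitioning ends up listing partitions while that branch exists.
   
   ```scala
   import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, SinglePartition, UnknownPartitioning}
   import org.apache.spark.sql.connector.read.{InputPartition, Scan}
   
   class ScanPlanningSketch(scan: Scan) {
     // Materializing this forces the (possibly expensive) partition listing.
     lazy val partitions: Seq[InputPartition] = scan.toBatch.planInputPartitions().toSeq
   
     def outputPartitioning: Partitioning =
       if (partitions.length == 1) {
         SinglePartition   // this shortcut alone touches `partitions` during planning
       } else {
         UnknownPartitioning(0)
       }
     // Dropping the `length == 1` shortcut, as suggested above, avoids touching
     // `partitions` here at all.
   }
   ```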




[GitHub] [spark] LuciferYang commented on a diff in pull request #42099: [SPARK-44505][SQL] Provide override for columnar support in Scan for DSv2

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #42099:
URL: https://github.com/apache/spark/pull/42099#discussion_r1276032428


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlignAssignmentsSuiteBase.scala:
##########
@@ -147,7 +147,7 @@ abstract class AlignAssignmentsSuiteBase extends AnalysisTest {
   private val v2Catalog = {
     val newCatalog = mock(classOf[TableCatalog])
     when(newCatalog.loadTable(any())).thenAnswer((invocation: InvocationOnMock) => {
-      val ident = invocation.getArgument[Identifier](0)
+      val ident = invocation.getArguments()(0).asInstanceOf[Identifier]

Review Comment:
   Why do we need to change this part of the code in this PR?
   
   I guess you encountered a Mockito dependency conflict when running the UT in IntelliJ? There will be a mockito-all-1.10.19.jar in `External Libraries`, and that jar lacks the corresponding API. But this may only occur when importing Spark as an sbt project and running the UT in IntelliJ ...
   
   I'm not sure if we should fix it this way.
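   
   For context, a small illustrative sketch (the trait and catalog here are made up, not the suite's actual mock): `getArgument[T](i)` only exists on the Mockito 2+ `InvocationOnMock`, while `getArguments()(i).asInstanceOf[T]` also resolves against an older mockito-all 1.10.19 jar.
   
   ```scala
   import org.mockito.ArgumentMatchers.any
   import org.mockito.Mockito.{mock, when}
   import org.mockito.invocation.InvocationOnMock
   
   object MockitoArgumentSketch {
     trait FakeCatalog { def loadTable(name: String): String }
   
     val catalog: FakeCatalog = mock(classOf[FakeCatalog])
     when(catalog.loadTable(any[String]())).thenAnswer((invocation: InvocationOnMock) => {
       val viaNewApi = invocation.getArgument[String](0)                 // Mockito 2+ only
       val viaOldApi = invocation.getArguments()(0).asInstanceOf[String] // works on 1.x too
       assert(viaNewApi == viaOldApi)
       s"loaded:$viaOldApi"
     })
     // Invoking the stubbed method, e.g. catalog.loadTable("t1"), returns "loaded:t1".
   }
   ```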
   
   
   


