You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "clownxc (via GitHub)" <gi...@apache.org> on 2023/04/03 01:04:36 UTC

[GitHub] [spark] clownxc opened a new pull request, #40638: [SPARK-42774][SQL]Expose VectorTypes API for DataSourceV2 Batch Scans

clownxc opened a new pull request, #40638:
URL: https://github.com/apache/spark/pull/40638

   What changes were proposed in this pull request?
   Add an override to BatchScanExecBase which delegates to a new default method on PartitionReaderFactory to expose vectoryTypes.
   
   Why are the changes needed?
   SparkPlan's vectorType's attribute can be used to specialize codegen however BatchScanExecBase does not override this so we DSv2 sources do not get any benefit of concrete class dispatch.
   
   Does this PR introduce any user-facing change?
   NO
   
   How was this patch tested?
   Pass GitHub Actions
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] clownxc commented on a diff in pull request #40638: [SPARK-42774][SQL]Expose VectorTypes API for DataSourceV2 Batch Scans

Posted by "clownxc (via GitHub)" <gi...@apache.org>.
clownxc commented on code in PR #40638:
URL: https://github.com/apache/spark/pull/40638#discussion_r1155932178


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:
##########
@@ -905,6 +908,13 @@ object ColumnarReaderFactory extends PartitionReaderFactory {
 
   override def supportColumnarReads(partition: InputPartition): Boolean = true
 
+  override def getVectorTypes: Optional[java.lang.Iterable[String]] = {
+    val data: Iterable[String] = Iterable.fill(2)(
+        classOf[OnHeapColumnVector].getName
+    )

Review Comment:
   > Oh, that's also very nice (re my previous comment to use `Collections.nCopies`).
   > 
   > Can you rename `data`?
   
   Thank you, I will try to modify the code as you say



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] clownxc commented on pull request #40638: [SPARK-42774][SQL]Expose VectorTypes API for DataSourceV2 Batch Scans

Posted by "clownxc (via GitHub)" <gi...@apache.org>.
clownxc commented on PR #40638:
URL: https://github.com/apache/spark/pull/40638#issuecomment-1501887451

   Could you please take a look when you have time? thanks! @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] jaceklaskowski commented on a diff in pull request #40638: [SPARK-42774][SQL]Expose VectorTypes API for DataSourceV2 Batch Scans

Posted by "jaceklaskowski (via GitHub)" <gi...@apache.org>.
jaceklaskowski commented on code in PR #40638:
URL: https://github.com/apache/spark/pull/40638#discussion_r1172969651


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReaderFactory.java:
##########
@@ -36,33 +37,42 @@
 public interface PartitionReaderFactory extends Serializable {
 
   /**
-   * Returns a row-based partition reader to read data from the given {@link InputPartition}.
+   * @return a row-based partition reader to read data from the given {@link InputPartition}.

Review Comment:
   nit: A (uppercase)



##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReaderFactory.java:
##########
@@ -36,33 +37,42 @@
 public interface PartitionReaderFactory extends Serializable {
 
   /**
-   * Returns a row-based partition reader to read data from the given {@link InputPartition}.
+   * @return a row-based partition reader to read data from the given {@link InputPartition}.
    * <p>
-   * Implementations probably need to cast the input partition to the concrete
-   * {@link InputPartition} class defined for the data source.
+   * @note Implementations probably need to cast the input partition to the concrete
+   *       {@link InputPartition} class defined for the data source.
    */
   PartitionReader<InternalRow> createReader(InputPartition partition);
 
   /**
-   * Returns a columnar partition reader to read data from the given {@link InputPartition}.
+   * @return a columnar partition reader to read data from the given {@link InputPartition}.

Review Comment:
   Same here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] jaceklaskowski commented on a diff in pull request #40638: [SPARK-42774][SQL]Expose VectorTypes API for DataSourceV2 Batch Scans

Posted by "jaceklaskowski (via GitHub)" <gi...@apache.org>.
jaceklaskowski commented on code in PR #40638:
URL: https://github.com/apache/spark/pull/40638#discussion_r1155668281


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReaderFactory.java:
##########
@@ -65,4 +66,12 @@ default PartitionReader<ColumnarBatch> createColumnarReader(InputPartition parti
   default boolean supportColumnarReads(InputPartition partition) {
     return false;
   }
+
+  /**
+   * Returns exact java types of the columns that are output in columnar processing mode.

Review Comment:
   nit: Use `@return` instead?



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala:
##########
@@ -62,6 +64,10 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
     redact(result)
   }
 
+  override def vectorTypes: Option[Seq[String]] = {
+    Option(readerFactory.getVectorTypes.get.asScala.toSeq)

Review Comment:
   Just curious if `getVectorTypes.get` can throw an exception since by default it's `Optional.empty`?



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala:
##########
@@ -79,6 +84,18 @@ case class OrcPartitionReaderFactory(
     }
   }
 
+  override def getVectorTypes: Optional[java.lang.Iterable[String]] = {
+
+    val vectorTypes: Iterable[String] = readDataSchema.fields.map(_.dataType match {

Review Comment:
   nit: Remove `Iterable[String]` (since `vectorTypes` is local).
   
   Can you also move `.getName` (from all the `classOf`s) in a separate `.map`? Less repetition and easier to understand the main idea of this code to make a conversion (with `getName` being just a mere "addition").



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala:
##########
@@ -108,6 +112,17 @@ case class ParquetPartitionReaderFactory(
     supportsColumnar
   }
 
+  override def getVectorTypes: Optional[java.lang.Iterable[String]] = {
+    val data: Iterable[String] = Iterable.fill(readDataSchema.fields.length)(

Review Comment:
   1. Remove the type declaration. Not needed.
   2. Why is it `data` (not `vectorTypes` as previously)?



##########
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala:
##########
@@ -905,6 +908,13 @@ object ColumnarReaderFactory extends PartitionReaderFactory {
 
   override def supportColumnarReads(partition: InputPartition): Boolean = true
 
+  override def getVectorTypes: Optional[java.lang.Iterable[String]] = {
+    val data: Iterable[String] = Iterable.fill(2)(
+        classOf[OnHeapColumnVector].getName
+    )

Review Comment:
   Oh, that's also very nice (re my previous comment to use `Collections.nCopies`).
   
   Can you rename `data`?



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala:
##########
@@ -108,6 +112,17 @@ case class ParquetPartitionReaderFactory(
     supportsColumnar
   }
 
+  override def getVectorTypes: Optional[java.lang.Iterable[String]] = {
+    val data: Iterable[String] = Iterable.fill(readDataSchema.fields.length)(
+      if (!enableOffHeapColumnVector) {
+        classOf[OnHeapColumnVector].getName
+      } else {
+        classOf[OffHeapColumnVector].getName

Review Comment:
   Can you move `.getName` to a separate `.map`?



##########
sql/core/src/test/java/test/org/apache/spark/sql/connector/JavaColumnarDataSourceV2.java:
##########
@@ -69,6 +72,14 @@ public boolean supportColumnarReads(InputPartition partition) {
       return true;
     }
 
+    @Override
+    public Optional<Iterable<String>> getVectorTypes() {
+      List<String> vectorTypes = new ArrayList<>();
+      vectorTypes.add(OnHeapColumnVector.class.getName());
+      vectorTypes.add(OnHeapColumnVector.class.getName());

Review Comment:
   nit: [Collections.nCopies(int, Object)](https://docs.oracle.com/javase/8/docs/api/java/util/Collections.html#nCopies-int-T-)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] jaceklaskowski commented on a diff in pull request #40638: [SPARK-42774][SQL]Expose VectorTypes API for DataSourceV2 Batch Scans

Posted by "jaceklaskowski (via GitHub)" <gi...@apache.org>.
jaceklaskowski commented on code in PR #40638:
URL: https://github.com/apache/spark/pull/40638#discussion_r1172968872


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReaderFactory.java:
##########
@@ -68,7 +68,7 @@ default boolean supportColumnarReads(InputPartition partition) {
   }
 
   /**
-   * Returns exact java types of the columns that are output in columnar processing mode.
+   * @return exact java types of the columns that are output in columnar processing mode.

Review Comment:
   `@return The fully-qualified Java class names of...`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] clownxc commented on pull request #40638: [SPARK-42774][SQL]Expose VectorTypes API for DataSourceV2 Batch Scans

Posted by "clownxc (via GitHub)" <gi...@apache.org>.
clownxc commented on PR #40638:
URL: https://github.com/apache/spark/pull/40638#issuecomment-1512288253

   I applied the suggested code changes. can you review the code one more time in your spare time, thanks a lot @jaceklaskowski


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] jaceklaskowski commented on a diff in pull request #40638: [SPARK-42774][SQL]Expose VectorTypes API for DataSourceV2 Batch Scans

Posted by "jaceklaskowski (via GitHub)" <gi...@apache.org>.
jaceklaskowski commented on code in PR #40638:
URL: https://github.com/apache/spark/pull/40638#discussion_r1162889479


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReaderFactory.java:
##########
@@ -37,15 +37,15 @@
 public interface PartitionReaderFactory extends Serializable {
 
   /**
-   * Returns a row-based partition reader to read data from the given {@link InputPartition}.
+   * @return a row-based partition reader to read data from the given {@link InputPartition}.
    * <p>
    * Implementations probably need to cast the input partition to the concrete

Review Comment:
   Can we add `@note` here so `@return` is just one-liner?



##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReaderFactory.java:
##########
@@ -55,7 +55,7 @@ default PartitionReader<ColumnarBatch> createColumnarReader(InputPartition parti
   }
 
   /**
-   * Returns true if the given {@link InputPartition} should be read by Spark in a columnar way.
+   * @return true if the given {@link InputPartition} should be read by Spark in a columnar way.
    * This means, implementations must also implement {@link #createColumnarReader(InputPartition)}

Review Comment:
   and here



##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReaderFactory.java:
##########
@@ -37,15 +37,15 @@
 public interface PartitionReaderFactory extends Serializable {
 
   /**
-   * Returns a row-based partition reader to read data from the given {@link InputPartition}.
+   * @return a row-based partition reader to read data from the given {@link InputPartition}.
    * <p>
    * Implementations probably need to cast the input partition to the concrete
    * {@link InputPartition} class defined for the data source.
    */
   PartitionReader<InternalRow> createReader(InputPartition partition);
 
   /**
-   * Returns a columnar partition reader to read data from the given {@link InputPartition}.
+   * @return a columnar partition reader to read data from the given {@link InputPartition}.
    * <p>
    * Implementations probably need to cast the input partition to the concrete

Review Comment:
   same here



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala:
##########
@@ -65,7 +63,11 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
   }
 
   override def vectorTypes: Option[Seq[String]] = {
-    Option(readerFactory.getVectorTypes.get.asScala.toSeq)
+    val vectorTypes = readerFactory.getVectorTypes
+    if (vectorTypes.isPresent) {
+      vectorTypes.get()

Review Comment:
   What does this line do? Looks like a missing `return`?



##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReaderFactory.java:
##########
@@ -68,7 +68,7 @@ default boolean supportColumnarReads(InputPartition partition) {
   }
 
   /**
-   * Returns exact java types of the columns that are output in columnar processing mode.
+   * @return exact java types of the columns that are output in columnar processing mode.

Review Comment:
   "Fully-qualified Java class names of"?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] clownxc commented on pull request #40638: [SPARK-42774][SQL]Expose VectorTypes API for DataSourceV2 Batch Scans

Posted by "clownxc (via GitHub)" <gi...@apache.org>.
clownxc commented on PR #40638:
URL: https://github.com/apache/spark/pull/40638#issuecomment-1498287802

   I applied the suggested code changes.Can you please do another review @jaceklaskowski 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] clownxc commented on a diff in pull request #40638: [SPARK-42774][SQL]Expose VectorTypes API for DataSourceV2 Batch Scans

Posted by "clownxc (via GitHub)" <gi...@apache.org>.
clownxc commented on code in PR #40638:
URL: https://github.com/apache/spark/pull/40638#discussion_r1164246953


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReaderFactory.java:
##########
@@ -55,7 +55,7 @@ default PartitionReader<ColumnarBatch> createColumnarReader(InputPartition parti
   }
 
   /**
-   * Returns true if the given {@link InputPartition} should be read by Spark in a columnar way.
+   * @return true if the given {@link InputPartition} should be read by Spark in a columnar way.
    * This means, implementations must also implement {@link #createColumnarReader(InputPartition)}

Review Comment:
   > and here
   
   Thank you very much for review and sorry for the late response. I will try to modify the code according to your suggestions



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] clownxc commented on a diff in pull request #40638: [SPARK-42774][SQL]Expose VectorTypes API for DataSourceV2 Batch Scans

Posted by "clownxc (via GitHub)" <gi...@apache.org>.
clownxc commented on code in PR #40638:
URL: https://github.com/apache/spark/pull/40638#discussion_r1169386398


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReaderFactory.java:
##########
@@ -68,7 +68,7 @@ default boolean supportColumnarReads(InputPartition partition) {
   }
 
   /**
-   * Returns exact java types of the columns that are output in columnar processing mode.
+   * @return exact java types of the columns that are output in columnar processing mode.

Review Comment:
   > "Fully-qualified Java class names of"?
   
   like this `org.apache.spark.sql.execution.vectorized.OnHeapColumnVector`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #40638: [SPARK-42774][SQL]Expose VectorTypes API for DataSourceV2 Batch Scans

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #40638:
URL: https://github.com/apache/spark/pull/40638#issuecomment-1656977493

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #40638: [SPARK-42774][SQL]Expose VectorTypes API for DataSourceV2 Batch Scans

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #40638: [SPARK-42774][SQL]Expose VectorTypes API for DataSourceV2 Batch Scans
URL: https://github.com/apache/spark/pull/40638


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org