You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/03 19:38:53 UTC

[GitHub] [spark] kazuyukitanimura opened a new pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

kazuyukitanimura opened a new pull request #35393:
URL: https://github.com/apache/spark/pull/35393


   ### What changes were proposed in this pull request?
   This PR proposes to make ArrowColumnVector extendable by relaxing access modifier restrictions
   
   
   ### Why are the changes needed?
   Some Spark extension libraries need to extend ArrowColumnVector.java. For example, Rapids copies the entire ArrowColumnVector class in order to work around the issue
   https://github.com/NVIDIA/spark-rapids/blob/main/sql-plugin/src/main/java/org/apache/spark/sql/vectorized/rapids/AccessibleArrowColumnVector.java
   
   It is impossible extend ArrowColumnVector class for now because the class is final and the accessors are all private.
   Proposing to relax private/final restrictions to make ArrowColumnVector extendable.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #35393:
URL: https://github.com/apache/spark/pull/35393#issuecomment-1033168895


   #35448 is merged. Please rebase to the `master` branch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #35393:
URL: https://github.com/apache/spark/pull/35393#issuecomment-1033331798


   Merged to master for Apache Spark 3.3.0.
   Thank you, @kazuyukitanimura and all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] kazuyukitanimura commented on pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
kazuyukitanimura commented on pull request #35393:
URL: https://github.com/apache/spark/pull/35393#issuecomment-1032125040


   It would be great if you could take another look when you get a chance @huaxingao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] kazuyukitanimura commented on pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
kazuyukitanimura commented on pull request #35393:
URL: https://github.com/apache/spark/pull/35393#issuecomment-1033113039


   > @kazuyukitanimura . I realized that you are moving `ArrowColumnVectorSuite.scala` silently in this PR. Please do that in a separate JIRA before this.
   
   Thank you @dongjoon-hyun I just opened #35448 FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a change in pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #35393:
URL: https://github.com/apache/spark/pull/35393#discussion_r801246303



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/vectorized/ArrowColumnVectorSuite.scala
##########
@@ -431,4 +430,35 @@ class ArrowColumnVectorSuite extends SparkFunSuite {
     columnVector.close()
     allocator.close()
   }
+
+  test ("SPARK-38086: subclassing") {
+    class ChildArrowColumnVector(vector: ValueVector, n: Int)
+      extends ArrowColumnVector(vector: ValueVector) {
+
+      override def getValueVector: ValueVector = accessor.vector
+      override def getInt(rowId: Int): Int = accessor.getInt(rowId) + n
+    }
+
+    val allocator = ArrowUtils.rootAllocator.newChildAllocator("int", 0, Long.MaxValue)
+    val vector = ArrowUtils.toArrowField("int", IntegerType, nullable = true, null)
+      .createVector(allocator).asInstanceOf[IntVector]
+    vector.allocateNew()
+
+    (0 until 10).foreach { i =>
+      vector.setSafe(i, i)
+    }
+
+    val columnVector = new ChildArrowColumnVector(vector, 1)
+    assert(columnVector.dataType === IntegerType)
+    assert(!columnVector.hasNull)
+
+    val intVector = columnVector.getValueVector.asInstanceOf[IntVector]
+    (0 until 10).foreach { i =>
+      assert(columnVector.getInt(i) === i + 1)
+      intVector.get(i) === i + 1

Review comment:
       nit: remove this line?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #35393:
URL: https://github.com/apache/spark/pull/35393#issuecomment-1029516169


   Shall we add a test for an extensible class?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] BryanCutler commented on a change in pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on a change in pull request #35393:
URL: https://github.com/apache/spark/pull/35393#discussion_r799820612



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
##########
@@ -130,9 +130,16 @@ public ColumnarMap getMap(int rowId) {
   @Override
   public ArrowColumnVector getChild(int ordinal) { return childColumns[ordinal]; }
 
+  ArrowColumnVector(DataType type) {

Review comment:
       So is the idea to call this constructor and then override `initAccessor` to provide a custom accessor?
   If so, would it be easier to just add a constructor `public ArrowColumnVector(ArrowVectorAccessor accessor)` with the relaxed access modifiers?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] kazuyukitanimura commented on a change in pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
kazuyukitanimura commented on a change in pull request #35393:
URL: https://github.com/apache/spark/pull/35393#discussion_r801196053



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
##########
@@ -28,10 +28,10 @@
 /**
  * A column vector backed by Apache Arrow.
  */
-public final class ArrowColumnVector extends ColumnVector {
+public class ArrowColumnVector extends ColumnVector {

Review comment:
       Thanks, added




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a change in pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #35393:
URL: https://github.com/apache/spark/pull/35393#discussion_r799064065



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
##########
@@ -28,10 +28,10 @@
 /**
  * A column vector backed by Apache Arrow.
  */
-public final class ArrowColumnVector extends ColumnVector {
+public class ArrowColumnVector extends ColumnVector {

Review comment:
       Shall we mark this class as `DeveloperApi`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] kazuyukitanimura commented on a change in pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
kazuyukitanimura commented on a change in pull request #35393:
URL: https://github.com/apache/spark/pull/35393#discussion_r801259527



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/vectorized/ArrowColumnVectorSuite.scala
##########
@@ -431,4 +430,35 @@ class ArrowColumnVectorSuite extends SparkFunSuite {
     columnVector.close()
     allocator.close()
   }
+
+  test ("SPARK-38086: subclassing") {
+    class ChildArrowColumnVector(vector: ValueVector, n: Int)
+      extends ArrowColumnVector(vector: ValueVector) {
+
+      override def getValueVector: ValueVector = accessor.vector
+      override def getInt(rowId: Int): Int = accessor.getInt(rowId) + n
+    }
+
+    val allocator = ArrowUtils.rootAllocator.newChildAllocator("int", 0, Long.MaxValue)
+    val vector = ArrowUtils.toArrowField("int", IntegerType, nullable = true, null)
+      .createVector(allocator).asInstanceOf[IntVector]
+    vector.allocateNew()
+
+    (0 until 10).foreach { i =>
+      vector.setSafe(i, i)
+    }
+
+    val columnVector = new ChildArrowColumnVector(vector, 1)
+    assert(columnVector.dataType === IntegerType)
+    assert(!columnVector.hasNull)
+
+    val intVector = columnVector.getValueVector.asInstanceOf[IntVector]
+    (0 until 10).foreach { i =>
+      assert(columnVector.getInt(i) === i + 1)
+      intVector.get(i) === i + 1

Review comment:
       Sorry, I meant to `assert` here as I am testing the `getValueVector`. Updated




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] BryanCutler commented on a change in pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on a change in pull request #35393:
URL: https://github.com/apache/spark/pull/35393#discussion_r802164443



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
##########
@@ -130,9 +130,16 @@ public ColumnarMap getMap(int rowId) {
   @Override
   public ArrowColumnVector getChild(int ordinal) { return childColumns[ordinal]; }
 
+  ArrowColumnVector(DataType type) {

Review comment:
       Sounds reasonable to me.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35393:
URL: https://github.com/apache/spark/pull/35393#issuecomment-1029574252


   cc @BryanCutler FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] kazuyukitanimura commented on pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
kazuyukitanimura commented on pull request #35393:
URL: https://github.com/apache/spark/pull/35393#issuecomment-1029433530


   cc @sunchao @dongjoon-hyun @viirya 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] kazuyukitanimura commented on a change in pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
kazuyukitanimura commented on a change in pull request #35393:
URL: https://github.com/apache/spark/pull/35393#discussion_r799981350



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
##########
@@ -130,9 +130,16 @@ public ColumnarMap getMap(int rowId) {
   @Override
   public ArrowColumnVector getChild(int ordinal) { return childColumns[ordinal]; }
 
+  ArrowColumnVector(DataType type) {

Review comment:
       Thanks. My intention was to allow sub classes to override the constructor so that the subclass constructor can do more stuff before calling `initAccessor`. At least that is my use case.
   The subclasses can choose to override `initAccessor` or the subclass constructor can directly assign the `accessor`.
   
   Is your idea for `public ArrowColumnVector(ArrowVectorAccessor accessor)` more for using ArrowColumnVector as is rather than extending?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #35393:
URL: https://github.com/apache/spark/pull/35393


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #35393: [SPARK-38086][SQL] Make ArrowColumnVector Extendable

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #35393:
URL: https://github.com/apache/spark/pull/35393#issuecomment-1029435080


   cc @huaxingao , too


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org