You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/13 13:10:45 UTC

[GitHub] [spark] HeartSaVioR opened a new pull request, #37864: [SPARK-40414][SQL][PYSPARK] More generic type on PythonArrowInput and PythonArrowOutput

HeartSaVioR opened a new pull request, #37864:
URL: https://github.com/apache/spark/pull/37864

   ### What changes were proposed in this pull request?
   
   This PR proposes to change PythonArrowInput and PythonArrowOutput to be more generic to cover the complex data type on both input and output. This is a baseline work for #37863.
   
   ### Why are the changes needed?
   
   The traits PythonArrowInput and PythonArrowOutput can be further generalized to cover complex data type on both input and output. E.g. Not all operators would have simple InternalRow as input data to pass to Python worker and vice versa for output data.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #37864: [SPARK-40414][SQL][PYSPARK] More generic type on PythonArrowInput and PythonArrowOutput

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37864:
URL: https://github.com/apache/spark/pull/37864#issuecomment-1245394839

   Here is a direct example of further usage:
   https://github.com/apache/spark/blob/ce85c20f4e8f37d5f99a22bd2c890c36e5f31e95/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStatePythonRunner.scala


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37864: [SPARK-40414][SQL][PYSPARK] More generic type on PythonArrowInput and PythonArrowOutput

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37864:
URL: https://github.com/apache/spark/pull/37864#discussion_r969651811


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowInput.scala:
##########
@@ -26,21 +26,30 @@ import org.apache.spark.{SparkEnv, TaskContext}
 import org.apache.spark.api.python.{BasePythonRunner, PythonRDD}
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.execution.arrow.ArrowWriter
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.sql.util.ArrowUtils
 import org.apache.spark.util.Utils
 
 /**
  * A trait that can be mixed-in with [[BasePythonRunner]]. It implements the logic from
- * JVM (an iterator of internal rows) to Python (Arrow).
+ * JVM (an iterator of internal rows + additional data if required) to Python (Arrow).
  */
-private[python] trait PythonArrowInput { self: BasePythonRunner[Iterator[InternalRow], _] =>
+private[python] trait PythonArrowInput[IN] { self: BasePythonRunner[IN, _] =>
+  protected val sqlConf = SQLConf.get

Review Comment:
   Seems like we don't need this (in https://github.com/apache/spark/pull/37863, it's not used too)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR closed pull request #37864: [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput

Posted by GitBox <gi...@apache.org>.
HeartSaVioR closed pull request #37864: [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput
URL: https://github.com/apache/spark/pull/37864


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on a diff in pull request #37864: [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput

Posted by GitBox <gi...@apache.org>.
ueshin commented on code in PR #37864:
URL: https://github.com/apache/spark/pull/37864#discussion_r969963457


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowInput.scala:
##########
@@ -107,3 +107,27 @@ private[python] trait PythonArrowInput { self: BasePythonRunner[Iterator[Interna
     }
   }
 }
+
+private[python] trait BasicPythonArrowInput extends PythonArrowInput[Iterator[InternalRow]] {
+  self: BasePythonRunner[Iterator[InternalRow], _] =>
+
+  protected def writeIteratorToArrowStream(
+    root: VectorSchemaRoot,
+    writer: ArrowStreamWriter,
+    dataOut: DataOutputStream,
+    inputIterator: Iterator[Iterator[InternalRow]]): Unit = {

Review Comment:
   nit: indent?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #37864: [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37864:
URL: https://github.com/apache/spark/pull/37864#issuecomment-1246245394

   Thanks for reviewing! Merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #37864: [SPARK-40414][SQL][PYSPARK] More generic type on PythonArrowInput and PythonArrowOutput

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37864:
URL: https://github.com/apache/spark/pull/37864#issuecomment-1245396507

   cc. @HyukjinKwon @ueshin Please take a look, thanks! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37864: [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on code in PR #37864:
URL: https://github.com/apache/spark/pull/37864#discussion_r970208295


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowOutput.scala:
##########
@@ -33,12 +33,14 @@ import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnarBatch, Column
 
 /**
  * A trait that can be mixed-in with [[BasePythonRunner]]. It implements the logic from
- * Python (Arrow) to JVM (ColumnarBatch).
+ * Python (Arrow) to JVM (output type being deserialized from ColumnarBatch).
  */
-private[python] trait PythonArrowOutput { self: BasePythonRunner[_, ColumnarBatch] =>
+private[python] trait PythonArrowOutput[OUT <: AnyRef] { self: BasePythonRunner[_, OUT] =>

Review Comment:
   We assign `null` to the OUT type (although that's a trick) hence need to be AnyRef at least if I understand correctly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37864: [SPARK-40414][SQL][PYSPARK] More generic type on PythonArrowInput and PythonArrowOutput

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37864:
URL: https://github.com/apache/spark/pull/37864#discussion_r969649861


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowOutput.scala:
##########
@@ -33,12 +33,14 @@ import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnarBatch, Column
 
 /**
  * A trait that can be mixed-in with [[BasePythonRunner]]. It implements the logic from
- * Python (Arrow) to JVM (ColumnarBatch).
+ * Python (Arrow) to JVM (output type being deserialized from ColumnarBatch).
  */
-private[python] trait PythonArrowOutput { self: BasePythonRunner[_, ColumnarBatch] =>
+private[python] trait PythonArrowOutput[OUT <: AnyRef] { self: BasePythonRunner[_, OUT] =>

Review Comment:
   qq: should it be `<: AnyRef`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37864: [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on code in PR #37864:
URL: https://github.com/apache/spark/pull/37864#discussion_r970207233


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowInput.scala:
##########
@@ -26,21 +26,30 @@ import org.apache.spark.{SparkEnv, TaskContext}
 import org.apache.spark.api.python.{BasePythonRunner, PythonRDD}
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.execution.arrow.ArrowWriter
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.sql.util.ArrowUtils
 import org.apache.spark.util.Utils
 
 /**
  * A trait that can be mixed-in with [[BasePythonRunner]]. It implements the logic from
- * JVM (an iterator of internal rows) to Python (Arrow).
+ * JVM (an iterator of internal rows + additional data if required) to Python (Arrow).
  */
-private[python] trait PythonArrowInput { self: BasePythonRunner[Iterator[InternalRow], _] =>
+private[python] trait PythonArrowInput[IN] { self: BasePythonRunner[IN, _] =>
+  protected val sqlConf = SQLConf.get

Review Comment:
   Ah OK didn't notice this. Will remove this. Thanks for the pointer.



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowInput.scala:
##########
@@ -107,3 +107,27 @@ private[python] trait PythonArrowInput { self: BasePythonRunner[Iterator[Interna
     }
   }
 }
+
+private[python] trait BasicPythonArrowInput extends PythonArrowInput[Iterator[InternalRow]] {
+  self: BasePythonRunner[Iterator[InternalRow], _] =>
+
+  protected def writeIteratorToArrowStream(
+    root: VectorSchemaRoot,
+    writer: ArrowStreamWriter,
+    dataOut: DataOutputStream,
+    inputIterator: Iterator[Iterator[InternalRow]]): Unit = {

Review Comment:
   Nice finding!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #37864: [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37864:
URL: https://github.com/apache/spark/pull/37864#issuecomment-1246141934

   I'll merge this PR once build is green.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org