You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/13 13:10:45 UTC
[GitHub] [spark] HeartSaVioR opened a new pull request, #37864: [SPARK-40414][SQL][PYSPARK] More generic type on PythonArrowInput and PythonArrowOutput
HeartSaVioR opened a new pull request, #37864:
URL: https://github.com/apache/spark/pull/37864
### What changes were proposed in this pull request?
This PR proposes to change PythonArrowInput and PythonArrowOutput to be more generic to cover the complex data type on both input and output. This is a baseline work for #37863.
### Why are the changes needed?
The traits PythonArrowInput and PythonArrowOutput can be further generalized to cover complex data type on both input and output. E.g. Not all operators would have simple InternalRow as input data to pass to Python worker and vice versa for output data.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #37864: [SPARK-40414][SQL][PYSPARK] More generic type on PythonArrowInput and PythonArrowOutput
Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37864:
URL: https://github.com/apache/spark/pull/37864#issuecomment-1245394839
Here is a direct example of further usage:
https://github.com/apache/spark/blob/ce85c20f4e8f37d5f99a22bd2c890c36e5f31e95/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStatePythonRunner.scala
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37864: [SPARK-40414][SQL][PYSPARK] More generic type on PythonArrowInput and PythonArrowOutput
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37864:
URL: https://github.com/apache/spark/pull/37864#discussion_r969651811
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowInput.scala:
##########
@@ -26,21 +26,30 @@ import org.apache.spark.{SparkEnv, TaskContext}
import org.apache.spark.api.python.{BasePythonRunner, PythonRDD}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.arrow.ArrowWriter
+import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.ArrowUtils
import org.apache.spark.util.Utils
/**
* A trait that can be mixed-in with [[BasePythonRunner]]. It implements the logic from
- * JVM (an iterator of internal rows) to Python (Arrow).
+ * JVM (an iterator of internal rows + additional data if required) to Python (Arrow).
*/
-private[python] trait PythonArrowInput { self: BasePythonRunner[Iterator[InternalRow], _] =>
+private[python] trait PythonArrowInput[IN] { self: BasePythonRunner[IN, _] =>
+ protected val sqlConf = SQLConf.get
Review Comment:
Seems like we don't need this (in https://github.com/apache/spark/pull/37863, it's not used too)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HeartSaVioR closed pull request #37864: [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput
Posted by GitBox <gi...@apache.org>.
HeartSaVioR closed pull request #37864: [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput
URL: https://github.com/apache/spark/pull/37864
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] ueshin commented on a diff in pull request #37864: [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput
Posted by GitBox <gi...@apache.org>.
ueshin commented on code in PR #37864:
URL: https://github.com/apache/spark/pull/37864#discussion_r969963457
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowInput.scala:
##########
@@ -107,3 +107,27 @@ private[python] trait PythonArrowInput { self: BasePythonRunner[Iterator[Interna
}
}
}
+
+private[python] trait BasicPythonArrowInput extends PythonArrowInput[Iterator[InternalRow]] {
+ self: BasePythonRunner[Iterator[InternalRow], _] =>
+
+ protected def writeIteratorToArrowStream(
+ root: VectorSchemaRoot,
+ writer: ArrowStreamWriter,
+ dataOut: DataOutputStream,
+ inputIterator: Iterator[Iterator[InternalRow]]): Unit = {
Review Comment:
nit: indent?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #37864: [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput
Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37864:
URL: https://github.com/apache/spark/pull/37864#issuecomment-1246245394
Thanks for reviewing! Merging to master.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #37864: [SPARK-40414][SQL][PYSPARK] More generic type on PythonArrowInput and PythonArrowOutput
Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37864:
URL: https://github.com/apache/spark/pull/37864#issuecomment-1245396507
cc. @HyukjinKwon @ueshin Please take a look, thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37864: [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput
Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on code in PR #37864:
URL: https://github.com/apache/spark/pull/37864#discussion_r970208295
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowOutput.scala:
##########
@@ -33,12 +33,14 @@ import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnarBatch, Column
/**
* A trait that can be mixed-in with [[BasePythonRunner]]. It implements the logic from
- * Python (Arrow) to JVM (ColumnarBatch).
+ * Python (Arrow) to JVM (output type being deserialized from ColumnarBatch).
*/
-private[python] trait PythonArrowOutput { self: BasePythonRunner[_, ColumnarBatch] =>
+private[python] trait PythonArrowOutput[OUT <: AnyRef] { self: BasePythonRunner[_, OUT] =>
Review Comment:
We assign `null` to the OUT type (although that's a trick) hence need to be AnyRef at least if I understand correctly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37864: [SPARK-40414][SQL][PYSPARK] More generic type on PythonArrowInput and PythonArrowOutput
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37864:
URL: https://github.com/apache/spark/pull/37864#discussion_r969649861
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowOutput.scala:
##########
@@ -33,12 +33,14 @@ import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnarBatch, Column
/**
* A trait that can be mixed-in with [[BasePythonRunner]]. It implements the logic from
- * Python (Arrow) to JVM (ColumnarBatch).
+ * Python (Arrow) to JVM (output type being deserialized from ColumnarBatch).
*/
-private[python] trait PythonArrowOutput { self: BasePythonRunner[_, ColumnarBatch] =>
+private[python] trait PythonArrowOutput[OUT <: AnyRef] { self: BasePythonRunner[_, OUT] =>
Review Comment:
qq: should it be `<: AnyRef`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37864: [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput
Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on code in PR #37864:
URL: https://github.com/apache/spark/pull/37864#discussion_r970207233
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowInput.scala:
##########
@@ -26,21 +26,30 @@ import org.apache.spark.{SparkEnv, TaskContext}
import org.apache.spark.api.python.{BasePythonRunner, PythonRDD}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.arrow.ArrowWriter
+import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.ArrowUtils
import org.apache.spark.util.Utils
/**
* A trait that can be mixed-in with [[BasePythonRunner]]. It implements the logic from
- * JVM (an iterator of internal rows) to Python (Arrow).
+ * JVM (an iterator of internal rows + additional data if required) to Python (Arrow).
*/
-private[python] trait PythonArrowInput { self: BasePythonRunner[Iterator[InternalRow], _] =>
+private[python] trait PythonArrowInput[IN] { self: BasePythonRunner[IN, _] =>
+ protected val sqlConf = SQLConf.get
Review Comment:
Ah OK didn't notice this. Will remove this. Thanks for the pointer.
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowInput.scala:
##########
@@ -107,3 +107,27 @@ private[python] trait PythonArrowInput { self: BasePythonRunner[Iterator[Interna
}
}
}
+
+private[python] trait BasicPythonArrowInput extends PythonArrowInput[Iterator[InternalRow]] {
+ self: BasePythonRunner[Iterator[InternalRow], _] =>
+
+ protected def writeIteratorToArrowStream(
+ root: VectorSchemaRoot,
+ writer: ArrowStreamWriter,
+ dataOut: DataOutputStream,
+ inputIterator: Iterator[Iterator[InternalRow]]): Unit = {
Review Comment:
Nice finding!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #37864: [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput
Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37864:
URL: https://github.com/apache/spark/pull/37864#issuecomment-1246141934
I'll merge this PR once build is green.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org