You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/12 04:14:37 UTC

[GitHub] HyukjinKwon commented on a change in pull request #23760: [SPARK-26762][SQL][R] Arrow optimization for conversion from Spark DataFrame to R DataFrame

HyukjinKwon commented on a change in pull request #23760: [SPARK-26762][SQL][R] Arrow optimization for conversion from Spark DataFrame to R DataFrame
URL: https://github.com/apache/spark/pull/23760#discussion_r255797537
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
 ##########
 @@ -3198,9 +3199,66 @@ class Dataset[T] private[sql](
   }
 
   /**
-   * Collect a Dataset as Arrow batches and serve stream to PySpark.
+   * Collect a Dataset as Arrow batches and serve stream to SparkR. It sends
+   * arrow batches in an ordered manner with buffering. This is inevitable
+   * due to missing R API that reads batches from socket directly. See ARROW-4512.
+   * Eventually, this code should be deduplicated by `collectAsArrowToPython`.
    */
-  private[sql] def collectAsArrowToPython(): Array[Any] = {
+  private[sql] def collectAsArrowToR(): Array[Any] = {
+    val timeZoneId = sparkSession.sessionState.conf.sessionLocalTimeZone
+
+    withAction("collectAsArrowToR", queryExecution) { plan =>
+      RRDD.serveToStream("serve-Arrow") { outputStream =>
 
 Review comment:
   These codes were restored from previous Python side's Arrow optimization (see https://github.com/apache/spark/commit/ecaa495b1fe532c36e952ccac42f4715809476af). For now, due to the limitation of R API (see ARROW-3204), it should send Arrow batches at once (it cannot read batch by batch like Python side).
   
   Once this is fixed, we can deduplicate the codes with `collectAsArrowToPython`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org