Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/26 11:39:19 UTC

[GitHub] [spark] jpivarski commented on pull request #26783: [SPARK-30153][PYTHON][WIP] Extend data exchange options for vectorized UDF functions with vanilla Arrow serialization

jpivarski commented on pull request #26783:
URL: https://github.com/apache/spark/pull/26783#issuecomment-951848597


   For Arrow-based access, I agree that a developer-level API is appropriate. In fact, if it had data analyst-oriented features, developers might end up fighting those features when building their backends on top of it.
   
   I think that both "map" in chunks, like `mapPartitions` and `mapInPandas`, and the first step of a "reduce" tree are likely applications. The first is pretty direct. The second could be built from map-in-chunks by having the Arrow-based process return single-row Arrow buffers, which are then flattened and further reduced on the Spark side, or flattened, repartitioned on the Spark side, and sent back to the Arrow-based process for further reduction (a sketch of the first variant follows below). As long as the Arrow-based process can return a different number of rows than it's given (the same number of chunks, but an arbitrary number of rows per chunk), all of these become possible.
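   
   To make the reduce-tree idea concrete, here is a minimal sketch that uses the existing pandas-based `mapInPandas` as a stand-in for the Arrow-native API this PR proposes; the column name and toy data are hypothetical. Each chunk is reduced to a single-row result, and Spark flattens and finishes the reduction:
   
   ```python
   import pandas as pd
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.getOrCreate()
   df = spark.range(1_000_000).withColumnRenamed("id", "x")
   
   def partial_sums(batches):
       # One single-row frame out per chunk in: the function may return
       # a different number of rows than it receives.
       for batch in batches:
           yield pd.DataFrame({"partial": [batch["x"].sum()]})
   
   # First step of the reduce tree: each chunk collapses to one row...
   partials = df.mapInPandas(partial_sums, schema="partial long")
   
   # ...and Spark flattens the single-row results and reduces the rest.
   total = partials.agg({"partial": "sum"}).collect()
   ```
   
   The repartition variant would instead do something like `partials.repartition(1).mapInPandas(...)` to send the flattened single-row results back to the same kind of process for the next level of the tree.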


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


