Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/07 03:15:51 UTC

[GitHub] [spark] jpivarski commented on issue #26783: [SPARK-30153][PYTHON][WIP] Extend data exchange options for vectorized UDF functions with vanilla Arrow serialization

URL: https://github.com/apache/spark/pull/26783#issuecomment-562808553
 
 
   (Full disclosure: I'm a co-author.) Considering that Arrow is lower-level than Pandas, I would have thought that the Pandas API would have been built on top of an Arrow API, as a (very popular) special case.
   
   We're making the argument in terms of performance because, in our application, we have to work around DataFrame construction that we neither want nor need, and it carries a considerable runtime cost. But the argument could also be made in terms of layered architecture: wouldn't it make more sense for pandas_udf to be an application built on an arrow_udf?
   
   I just want to make sure that the architectural point isn't lost in discussions about performance.

