Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/07 13:55:42 UTC

[GitHub] [spark] lgray edited a comment on issue #26783: [SPARK-30153][PYTHON][WIP] Extend data exchange options for vectorized UDF functions with vanilla Arrow serialization

URL: https://github.com/apache/spark/pull/26783#issuecomment-562851829
 
 
   (Also an author of the PR.) I want to chime in on a few things and hopefully clarify some points.
   
   We are not proposing to inject new packages into the dependencies of pyspark; we are only using awkward-array as an example to show the power of directly exposing the arrow interface to spark in python. There is no duplication of arrow's capabilities or scope: awkward-array uses arrow as a substrate to feed data to a columnar analytics tool, similarly to how pandas uses arrow. In our demonstration we also use an analytics tool that is optimized for dense, ragged columnar data.
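   For concreteness, here is a minimal sketch of that handoff, assuming pyarrow plus an awkward-array release that provides `ak.from_arrow` (the ingestion function's name has varied across awkward-array versions):
   
   ```python
   import pyarrow as pa
   import awkward as ak  # assumes a release providing ak.from_arrow
   
   # A ragged column: variable-length lists per row, the shape of data
   # awkward-array is optimized for and pandas is not.
   ragged = pa.array([[1.1, 2.2], [], [3.3, 4.4, 5.5]])
   
   # Hand the Arrow buffers directly to awkward-array -- no pandas detour.
   jagged = ak.from_arrow(ragged)
   ```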
   
   The issue we are tackling is that pandas, while a very powerful and useful package, is not optimized for every data analysis task in every domain. We demonstrate a significant speedup in our domain (High Energy Physics) by skipping the generation of pandas Series/DataFrames, which suggests that similar optimizations could be implemented for other domains. This could be achieved by allowing users to opt in to removing the pandas overhead, by providing direct access to the arrow buffers containing (parts of) spark tables.
   
   Moreover, creating the pandas objects from arrow buffers is typically a single function call, so skipping it requires no deep changes in spark or pyspark itself; see the sketch below.
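   A minimal sketch of what I mean, with a toy batch standing in for what pyspark's Arrow serializer actually receives from the JVM:
   
   ```python
   import pyarrow as pa
   
   # Toy stand-in for one Arrow batch of a spark table partition.
   batch = pa.RecordBatch.from_arrays([pa.array([1.0, 2.0, 3.0])], names=["x"])
   
   # The pandas path today: one extra conversion call per batch.
   pdf = batch.to_pandas()   # materializes a pandas DataFrame
   
   # The proposed opt-in: hand the batch to the UDF as-is, with zero-copy
   # access to the underlying Arrow buffers.
   col = batch.column(0)     # a pyarrow.Array; no pandas objects created
   ```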
   
   The user interface for `pandas_udf` should not change from what is there currently, and I don't think it needs to. However, there can be additions that expose the direct arrow interface we would like.
   
   I think an opt-in pattern ameliorates the counter-argument that arrow is "too low level": if a user actively requests that interface, they know what they are doing, or they are using a library that supplies that knowledge in some way.
   
   However, I find that exposing the interface as `udf = pandas_udf(some_func, returnType, PandasUDFType.SCALAR, skip_pandas=True)` (or `expose_arrow` instead of `skip_pandas`) could be a confusing way to opt in. What is pandas without pandas, after all?
   
   Just to spitball something that seems a bit smoother in my mind:
   We can create an interface `arrow_udf` whose data transformation type is determined by `ArrowUDFType`, which can be a python alias of `PandasUDFType`. `pandas_udf` can then, under the hood, call `arrow_udf` with serializers that add the small number of extra function calls needed to generate pandas objects on the fly (sketched below).
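   A hedged sketch of that layering; `arrow_udf` is the proposed, not-yet-existing entry point, and the real `pandas_udf` decorator machinery and per-column Series handling are elided:
   
   ```python
   import pyarrow as pa
   
   def pandas_udf(func, returnType, functionType):
       # Wrap the user's pandas function so it speaks Arrow at the boundary;
       # these two conversions are the "small number of extra calls".
       def arrow_func(batch):
           result = func(batch.to_pandas())           # arrow -> pandas
           return pa.RecordBatch.from_pandas(result)  # pandas -> arrow
       # Delegate to the proposed raw-Arrow interface.
       return arrow_udf(arrow_func, returnType, functionType)
   ```
   
   Users who want raw arrow would then call `arrow_udf` directly with a function that takes and returns arrow data, while existing `pandas_udf` code keeps working unchanged.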
   
   This way the interface for `pandas_udf` remains the same and operates the same, while the raw arrow interface is exposed to the users who want it. Both can be achieved without impacting the performance of spark itself.
    
