Posted to user@spark.apache.org by Corey Nolet <cj...@gmail.com> on 2018/05/23 20:30:08 UTC

PySpark API on top of Apache Arrow

Please forgive me if this question has been asked already.

I'm working in Python with Arrow + Plasma + Pandas DataFrames. I'm curious if
anyone knows of any efforts to implement the PySpark API on top of Apache
Arrow directly. In my case, I'm doing data science on a machine with 288
cores and 1 TB of RAM.

It would make life much easier if I were able to use the flexibility of the
PySpark API (rather than being tied to the operations in Pandas). It
seems like an implementation would be fairly straightforward using the
Plasma server and object_ids.
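
For a rough idea of what I'm picturing, here is a sketch of the Plasma round
trip (it follows the pattern in the Arrow docs; the socket path and sizes are
just placeholders):

    # Sketch only: share a Pandas DataFrame between processes through Plasma.
    # Assumes a store was started separately, e.g.:
    #   plasma_store -m 1000000000 -s /tmp/plasma
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.plasma as plasma

    client = plasma.connect("/tmp/plasma")  # older pyarrow: connect(path, "", 0)

    df = pd.DataFrame({"x": np.arange(5)})
    batch = pa.RecordBatch.from_pandas(df)

    # First pass just measures the serialized size of the batch.
    mock = pa.MockOutputStream()
    writer = pa.RecordBatchStreamWriter(mock, batch.schema)
    writer.write_batch(batch)
    writer.close()

    # Second pass writes the batch into a shared-memory buffer of that size.
    object_id = plasma.ObjectID(np.random.bytes(20))
    buf = client.create(object_id, mock.size())
    writer = pa.RecordBatchStreamWriter(pa.FixedSizeBufferWriter(buf), batch.schema)
    writer.write_batch(batch)
    writer.close()
    client.seal(object_id)  # any process on this store can now read it by id

    # Reader side: a zero-copy view over the same shared memory.
    [data] = client.get_buffers([object_id])
    reader = pa.RecordBatchStreamReader(pa.BufferReader(data))
    df_again = reader.read_next_batch().to_pandas()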

If you have not heard of an effort underway to accomplish this, are there any
reasons why it would be a bad idea?


Thanks!

Re: PySpark API on top of Apache Arrow

Posted by Jules Damji <dm...@comcast.net>.
Actually, we do mention that Pandas UDFs are built upon Apache Arrow :-) and point to the blog post by its contributors from Two Sigma. :-)

“On the other hand, Pandas UDF built atop Apache Arrow accords high-performance to Python developers, whether you use Pandas UDFs on a single-node machine or distributed cluster.”
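
For instance, a scalar Pandas UDF looks roughly like this (a minimal sketch
against the Spark 2.3 API):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    # plus_one receives and returns a whole pandas Series per batch; Spark
    # ships the data to the Python worker as Arrow record batches rather
    # than pickling rows one at a time.
    @pandas_udf("double", PandasUDFType.SCALAR)
    def plus_one(v):
        return v + 1.0

    df = spark.range(8).selectExpr("cast(id AS double) AS x")
    df.select(plus_one("x").alias("x_plus_one")).show()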

Cheers
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)


Re: PySpark API on top of Apache Arrow

Posted by Corey Nolet <cj...@gmail.com>.
Gourav & Nicholas,

Thank you! It does look like the PySpark Pandas UDF is exactly what I want;
the article I read didn't mention that it uses Arrow underneath. It looks
like Wes McKinney was also a key part of building the Pandas UDF.

Gourav,

I totally apologize for my long and drawn-out response to you. I initially
misunderstood your response. I also need to take the time to dive into the
PySpark source code; I was assuming that it was just firing up JVMs under
the hood.
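
One easy place to start looking seems to be the Arrow-backed toPandas() path,
e.g. (quick sketch, Spark 2.3+):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # With this flag on, toPandas() transfers columnar Arrow batches from
    # the JVM instead of pickling rows one at a time.
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    pdf = spark.range(1000 * 1000).toPandas()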

Thanks again! I'll report back with findings.


Re: PySpark API on top of Apache Arrow

Posted by Nicolas Paris <ni...@gmail.com>.
Hi Corey,

I'm not familiar with Arrow or Plasma. However, I recently read an article
about Spark on a standalone machine (your case). It sounds like you could
benefit from PySpark "as-is":

https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
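
Something along these lines, maybe (untested sketch; the numbers are just
your machine's specs):

    from pyspark.sql import SparkSession

    # Local mode gives one task slot per core; driver and executor share
    # the same JVM, so give the driver most of the RAM.
    # (spark.driver.memory only takes effect if the JVM isn't started yet,
    # e.g. when launching a plain python script rather than a pyspark shell.)
    spark = (
        SparkSession.builder
        .master("local[288]")
        .config("spark.driver.memory", "512g")
        .appName("single-node")
        .getOrCreate()
    )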

Regards,
