Posted to dev@spark.apache.org by Hyukjin Kwon <gu...@gmail.com> on 2021/03/14 01:57:04 UTC

[DISCUSS] Support pandas API layer on PySpark

Hi all,

I would like to start the discussion on supporting a pandas API layer on
Spark.



If we have a general consensus on having it in PySpark, I will initiate and
drive an SPIP with a detailed explanation about the implementation’s
overview and structure.

I would appreciate it if I could know whether you support this or not
before starting the SPIP.

What do you want to propose?

I have been working on the Koalas <https://github.com/databricks/koalas>
project, which essentially provides pandas API support on Spark, and I would
like to propose embracing Koalas in PySpark.


More specifically, I am thinking about adding a separate package to PySpark
for pandas APIs on PySpark. Therefore, it wouldn't break anything in the
existing code. The overview would look as below:

pyspark_dataframe.[... PySpark APIs ...]
pandas_dataframe.[... pandas APIs (local) ...]

# The package names will change in the final proposal and during review.
koalas_dataframe = koalas.from_pandas(pandas_dataframe)
koalas_dataframe = koalas.from_spark(pyspark_dataframe)
koalas_dataframe.[... pandas APIs on Spark ...]

pyspark_dataframe = koalas_dataframe.to_spark()
pandas_dataframe = koalas_dataframe.to_pandas()

Koalas provides a pandas API layer on PySpark. It supports almost the same
API usage, so users can leverage their existing Spark cluster to scale their
pandas workloads. It works interchangeably with PySpark by exposing both the
pandas and PySpark APIs to users.
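
For illustration, here is a minimal sketch of that interchangeability using
the Koalas 1.x package as it exists today (assuming the databricks.koalas
package is installed; the names would change if this is ported into PySpark):

import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({"x": [1, 2, 3]})
kdf = ks.from_pandas(pdf)       # distribute a local pandas DataFrame on Spark
kdf["y"] = kdf["x"] * 2         # pandas-style API, executed by Spark
sdf = kdf.to_spark()            # switch to the PySpark DataFrame API
kdf2 = sdf.to_koalas()          # and back to the pandas-style API
pdf2 = kdf2.to_pandas()         # collect to a local pandas DataFrame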

The project has grown separately for more than two years and has been going
successfully. With version 1.7.0, Koalas has greatly improved in maturity and
stability. Its usability has been proven by numerous users' adoptions and by
reaching more than 75% API coverage of pandas' Index, Series and DataFrame.

I strongly think this is the direction we should go for Apache Spark, and
it is a win-win strategy for the growth of both Apache Spark and pandas.
Please see the reasons below.

Why do we need it?

   - Python has grown dramatically in the last few years and became one of
     the most popular languages; see also the StackOverFlow trend
     <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
     for the Python, Java, R and Scala languages.

   - pandas became almost the standard library of data science. Please also
     see the StackOverFlow trend
     <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
     for pandas, Apache Spark and PySpark.

   - PySpark is not Pythonic enough. At least I myself hear a lot of
     complaints. That initiated Project Zen
     <https://issues.apache.org/jira/browse/SPARK-32082>, and we have greatly
     improved PySpark usability and made it more Pythonic.

Nevertheless, according to these trends, data scientists tend to prefer the
pandas libraries, while APIs are hard to change in PySpark. We would have to
redesign all the APIs and improve them from scratch, which is very difficult.

One straightforward and fast approach is to benchmark against a successful
case, and pandas itself does not support distributed execution. Once PySpark
supports pandas-like APIs, it can become a good option for pandas users to
scale their workloads easily. I do believe this is a win-win strategy for the
growth of both pandas and PySpark.

In fact, there are already similar attempts such as Dask <https://dask.org/>
and Modin <https://modin.readthedocs.io/en/latest/> (other than Koalas
<https://github.com/databricks/koalas>). They are all growing fast and
successfully, and I find that people compare them to PySpark from time to
time; for example, see Beyond Pandas: Spark, Dask, Vaex and other big data
technologies battling head to head
<https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13>.



   - There are many important features missing that are very common in data
     science. One of the most important features is plotting and drawing a
     chart. Almost every data scientist plots and draws a chart to understand
     their data quickly and visually in their daily work, but this is missing
     in PySpark. Please see one example in pandas below:
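
As a rough sketch (assuming matplotlib is available), this is the kind of
one-line plotting pandas users rely on, which Koalas already mirrors on top
of Spark:

import pandas as pd

pdf = pd.DataFrame({"year": [2018, 2019, 2020, 2021],
                    "users": [10, 25, 60, 130]})
pdf.plot.line(x="year", y="users")   # pandas: a chart in one line

# The same call works on a Koalas DataFrame backed by Spark:
# import databricks.koalas as ks
# ks.from_pandas(pdf).plot.line(x="year", y="users")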




I do recommend taking a quick look at the blog posts and talks made for
pandas on Spark:
https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
They explain far better why we need this.

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Hyukjin Kwon <gu...@gmail.com>.
Yeah, that's a good point, Georg. I think we will port it as is first, and
discuss that indexing system further.
We should probably either add a non-index mode or switch to a distributed
default index type that minimizes the side effects in the query plan.
We still have some months left. I will very likely raise another discussion
about it in a PR or on the dev mailing list after finishing the initial
porting.

2021년 3월 17일 (수) 오후 8:33, Georg Heiler <ge...@gmail.com>님이 작성:

> Would you plan to keep the existing indexing mechanism then?
>
> https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#use-distributed-or-distributed-sequence-default-index
> For me, it always even when trying to use the distributed version resulted
> in various window functions being chained, a different query plan than the
> default query plan, and slower execution of the job due to this overhead.
>
> Especially when some people here are thinking about making it the
> default/replacing the regular API I would strongly suggest defaulting to an
> indexing mechanism that is not changing the query plan.
>
> Best,
> Georg
>
> Am Mi., 17. März 2021 um 12:13 Uhr schrieb Hyukjin Kwon <
> gurwls223@gmail.com>:
>
>> > Just out of curiosity, does Koalas pretty much implement all of the
>> Pandas APIs now? If there are some that are yet to be implemented or others
>> that have differences, are these documented so users won't be caught
>> off-guard?
>>
>> It's roughly 75% done so far (in Series, DataFrame and Index).
>> Yeah, and it throws an exception that says it's not implemented yet
>> properly (or intentionally not implemented, e.g.) Series.__iter__ that will
>> easily make users shoot their feet by, for example, for loop ... ).
>>
>>
>> 2021년 3월 17일 (수) 오후 2:17, Bryan Cutler <cu...@gmail.com>님이 작성:
>>
>>> +1 the proposal sounds good to me. Having a familiar API built-in will
>>> really help new users get into using Spark that might only have Pandas
>>> experience. It sounds like maintenance costs should be manageable, once the
>>> hurdle with setting up tests is done. Just out of curiosity, does Koalas
>>> pretty much implement all of the Pandas APIs now? If there are some that
>>> are yet to be implemented or others that have differences, are these
>>> documented so users won't be caught off-guard?
>>>
>>> On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo <an...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Integrating Koalas with pyspark might help enable a richer integration
>>>> between the two. Something that would be useful with a tighter
>>>> integration is support for custom column array types. Currently, Spark
>>>> takes dataframes, converts them to arrow buffers then transmits them
>>>> over the socket to Python. On the other side, pyspark takes the arrow
>>>> buffer and converts it to a Pandas dataframe. Unfortunately, the
>>>> default Pandas representation of a list-type for a column causes it to
>>>> turn what was contiguous value/offset arrays in Arrow into
>>>> deserialized Python objects for each row. Obviously, this kills
>>>> performance.
>>>>
>>>> A PR to extend the pyspark API to elide the pandas conversion
>>>> (https://github.com/apache/spark/pull/26783) was submitted and
>>>> rejected, which is unfortunate, but perhaps this proposed integration
>>>> would provide the hooks via Pandas' ExtensionArray interface to allow
>>>> Spark to performantly interchange jagged/ragged lists to/from python
>>>> UDFs.
>>>>
>>>> Cheers
>>>> Andrew
>>>>
>>>> On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon <gu...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Thank you guys for all your feedback. I will start working on SPIP
>>>> with Koalas team.
>>>> > I would expect the SPIP can be sent late this week or early next week.
>>>> >
>>>> >
>>>> > I inlined and answered the questions unanswered as below:
>>>> >
>>>> > Is the community developing the pandas API layer for Spark interested
>>>> in being part of Spark or do they prefer having their own release cycle?
>>>> >
>>>> > Yeah, Koalas team used to have its own release cycle to develop and
>>>> move quickly.
>>>> > Now it became pretty mature with reaching 1.7.0, and the team thinks
>>>> that it’s now
>>>> > fine to have less frequent releases, and they are happy to work
>>>> together with Spark with
>>>> > contributing to it. The active contributors in the Koalas community
>>>> will continue to
>>>> > make the contributions in Spark.
>>>> >
>>>> > How about test code? Does it fit into the PySpark test framework?
>>>> >
>>>> > Yes, this will be one of the places where it needs some efforts.
>>>> Koalas currently uses pytest
>>>> > with various dependency version combinations (e.g., Python version,
>>>> conda vs pip) whereas
>>>> > PySpark uses the plain unittests with less dependency version
>>>> combinations.
>>>> >
>>>> > For pytest in Koalas <> unittests in PySpark:
>>>> >
>>>> >   I am currently thinking we will have to convert the Koalas tests to
>>>> use unittests to match
>>>> >   with PySpark for now.
>>>> >   It is a feasible option for PySpark to migrate to pytest too but it
>>>> will need extra effort to
>>>> >   make it working with our own PySpark testing framework seamlessly.
>>>> >   Koalas team (presumably and likely I) will take a look in any event.
>>>> >
>>>> > For the combinations of dependency versions:
>>>> >
>>>> >   Due to the lack of the resources in GitHub Actions, I currently
>>>> plan to just add the
>>>> >   Koalas tests into the matrix PySpark is currently using.
>>>> >
>>>> > one question I have; what’s an initial goal of the proposal?
>>>> > Is that to port all the pandas interfaces that Koalas has already
>>>> implemented?
>>>> > Or, the basic set of them?
>>>> >
>>>> > The goal of the proposal is to port all of Koalas project into
>>>> PySpark.
>>>> > For example,
>>>> >
>>>> > import koalas
>>>> >
>>>> > will be equivalent to
>>>> >
>>>> > # Names, etc. might change in the final proposal or during the review
>>>> > from pyspark.sql import pandas
>>>> >
>>>> > Koalas supports pandas APIs with a separate layer to cover a bit of
>>>> difference between
>>>> > DataFrame structures in pandas and PySpark, e.g.) other types as
>>>> column names (labels),
>>>> > index (something like row number in DBMSs) and so on. So I think it
>>>> would make more sense
>>>> > to port the whole layer instead of a subset of the APIs.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > 2021년 3월 17일 (수) 오전 12:32, Wenchen Fan <cl...@gmail.com>님이 작성:
>>>> >>
>>>> >> +1, it's great to have Pandas support in Spark out of the box.
>>>> >>
>>>> >> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <
>>>> linguin.m.s@gmail.com> wrote:
>>>> >>>
>>>> >>> +1; the pandas interfaces are pretty popular and supporting them in
>>>> pyspark looks promising, I think.
>>>> >>> one question I have; what's an initial goal of the proposal?
>>>> >>> Is that to port all the pandas interfaces that Koalas has already
>>>> implemented?
>>>> >>> Or, the basic set of them?
>>>> >>>
>>>> >>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ie...@gmail.com>
>>>> wrote:
>>>> >>>>
>>>> >>>> +1
>>>> >>>>
>>>> >>>> Bringing a Pandas API for pyspark to upstream Spark will only bring
>>>> >>>> benefits for everyone (more eyes to use/see/fix/improve the API) as
>>>> >>>> well as better alignment with core Spark improvements, the extra
>>>> >>>> weight looks manageable.
>>>> >>>>
>>>> >>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
>>>> >>>> <ni...@gmail.com> wrote:
>>>> >>>> >
>>>> >>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>> >>>> >>
>>>> >>>> >> I don't think we should deprecate existing APIs.
>>>> >>>> >
>>>> >>>> >
>>>> >>>> > +1
>>>> >>>> >
>>>> >>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas
>>>> API. I could be wrong, but I wager most people who have worked with both
>>>> Spark and Pandas feel the same way.
>>>> >>>> >
>>>> >>>> > For the large community of current PySpark users, or users
>>>> switching to PySpark from another Spark language API, it doesn't make sense
>>>> to deprecate the current API, even by convention.
>>>> >>>>
>>>> >>>>
>>>> ---------------------------------------------------------------------
>>>> >>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>> >>>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> ---
>>>> >>> Takeshi Yamamuro
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>>>

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Georg Heiler <ge...@gmail.com>.
Would you plan to keep the existing indexing mechanism then?
https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#use-distributed-or-distributed-sequence-default-index
For me, even when trying to use the distributed version, it always resulted
in various window functions being chained, a query plan different from the
default one, and slower execution of the job due to this overhead.

Especially since some people here are thinking about making it the
default/replacing the regular API, I would strongly suggest defaulting to an
indexing mechanism that does not change the query plan.
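
(As a side note, a small sketch of switching the default index type via the
Koalas 1.x options, per the linked best-practices guide:

import databricks.koalas as ks

# "sequence" (default), "distributed-sequence", or "distributed";
# "distributed" avoids the global ordering/window overhead, at the cost of
# not guaranteeing sequential index values.
ks.set_option("compute.default_index_type", "distributed")
)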

Best,
Georg

Am Mi., 17. März 2021 um 12:13 Uhr schrieb Hyukjin Kwon <gurwls223@gmail.com
>:

> > Just out of curiosity, does Koalas pretty much implement all of the
> Pandas APIs now? If there are some that are yet to be implemented or others
> that have differences, are these documented so users won't be caught
> off-guard?
>
> It's roughly 75% done so far (in Series, DataFrame and Index).
> Yeah, and it throws an exception that says it's not implemented yet
> properly (or intentionally not implemented, e.g.) Series.__iter__ that will
> easily make users shoot their feet by, for example, for loop ... ).
>
>
> 2021년 3월 17일 (수) 오후 2:17, Bryan Cutler <cu...@gmail.com>님이 작성:
>
>> +1 the proposal sounds good to me. Having a familiar API built-in will
>> really help new users get into using Spark that might only have Pandas
>> experience. It sounds like maintenance costs should be manageable, once the
>> hurdle with setting up tests is done. Just out of curiosity, does Koalas
>> pretty much implement all of the Pandas APIs now? If there are some that
>> are yet to be implemented or others that have differences, are these
>> documented so users won't be caught off-guard?
>>
>> On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo <an...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Integrating Koalas with pyspark might help enable a richer integration
>>> between the two. Something that would be useful with a tighter
>>> integration is support for custom column array types. Currently, Spark
>>> takes dataframes, converts them to arrow buffers then transmits them
>>> over the socket to Python. On the other side, pyspark takes the arrow
>>> buffer and converts it to a Pandas dataframe. Unfortunately, the
>>> default Pandas representation of a list-type for a column causes it to
>>> turn what was contiguous value/offset arrays in Arrow into
>>> deserialized Python objects for each row. Obviously, this kills
>>> performance.
>>>
>>> A PR to extend the pyspark API to elide the pandas conversion
>>> (https://github.com/apache/spark/pull/26783) was submitted and
>>> rejected, which is unfortunate, but perhaps this proposed integration
>>> would provide the hooks via Pandas' ExtensionArray interface to allow
>>> Spark to performantly interchange jagged/ragged lists to/from python
>>> UDFs.
>>>
>>> Cheers
>>> Andrew
>>>
>>> On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon <gu...@gmail.com>
>>> wrote:
>>> >
>>> > Thank you guys for all your feedback. I will start working on SPIP
>>> with Koalas team.
>>> > I would expect the SPIP can be sent late this week or early next week.
>>> >
>>> >
>>> > I inlined and answered the questions unanswered as below:
>>> >
>>> > Is the community developing the pandas API layer for Spark interested
>>> in being part of Spark or do they prefer having their own release cycle?
>>> >
>>> > Yeah, Koalas team used to have its own release cycle to develop and
>>> move quickly.
>>> > Now it became pretty mature with reaching 1.7.0, and the team thinks
>>> that it’s now
>>> > fine to have less frequent releases, and they are happy to work
>>> together with Spark with
>>> > contributing to it. The active contributors in the Koalas community
>>> will continue to
>>> > make the contributions in Spark.
>>> >
>>> > How about test code? Does it fit into the PySpark test framework?
>>> >
>>> > Yes, this will be one of the places where it needs some efforts.
>>> Koalas currently uses pytest
>>> > with various dependency version combinations (e.g., Python version,
>>> conda vs pip) whereas
>>> > PySpark uses the plain unittests with less dependency version
>>> combinations.
>>> >
>>> > For pytest in Koalas <> unittests in PySpark:
>>> >
>>> >   I am currently thinking we will have to convert the Koalas tests to
>>> use unittests to match
>>> >   with PySpark for now.
>>> >   It is a feasible option for PySpark to migrate to pytest too but it
>>> will need extra effort to
>>> >   make it working with our own PySpark testing framework seamlessly.
>>> >   Koalas team (presumably and likely I) will take a look in any event.
>>> >
>>> > For the combinations of dependency versions:
>>> >
>>> >   Due to the lack of the resources in GitHub Actions, I currently plan
>>> to just add the
>>> >   Koalas tests into the matrix PySpark is currently using.
>>> >
>>> > one question I have; what’s an initial goal of the proposal?
>>> > Is that to port all the pandas interfaces that Koalas has already
>>> implemented?
>>> > Or, the basic set of them?
>>> >
>>> > The goal of the proposal is to port all of Koalas project into PySpark.
>>> > For example,
>>> >
>>> > import koalas
>>> >
>>> > will be equivalent to
>>> >
>>> > # Names, etc. might change in the final proposal or during the review
>>> > from pyspark.sql import pandas
>>> >
>>> > Koalas supports pandas APIs with a separate layer to cover a bit of
>>> difference between
>>> > DataFrame structures in pandas and PySpark, e.g.) other types as
>>> column names (labels),
>>> > index (something like row number in DBMSs) and so on. So I think it
>>> would make more sense
>>> > to port the whole layer instead of a subset of the APIs.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > 2021년 3월 17일 (수) 오전 12:32, Wenchen Fan <cl...@gmail.com>님이 작성:
>>> >>
>>> >> +1, it's great to have Pandas support in Spark out of the box.
>>> >>
>>> >> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <
>>> linguin.m.s@gmail.com> wrote:
>>> >>>
>>> >>> +1; the pandas interfaces are pretty popular and supporting them in
>>> pyspark looks promising, I think.
>>> >>> one question I have; what's an initial goal of the proposal?
>>> >>> Is that to port all the pandas interfaces that Koalas has already
>>> implemented?
>>> >>> Or, the basic set of them?
>>> >>>
>>> >>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ie...@gmail.com>
>>> wrote:
>>> >>>>
>>> >>>> +1
>>> >>>>
>>> >>>> Bringing a Pandas API for pyspark to upstream Spark will only bring
>>> >>>> benefits for everyone (more eyes to use/see/fix/improve the API) as
>>> >>>> well as better alignment with core Spark improvements, the extra
>>> >>>> weight looks manageable.
>>> >>>>
>>> >>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
>>> >>>> <ni...@gmail.com> wrote:
>>> >>>> >
>>> >>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <rx...@databricks.com>
>>> wrote:
>>> >>>> >>
>>> >>>> >> I don't think we should deprecate existing APIs.
>>> >>>> >
>>> >>>> >
>>> >>>> > +1
>>> >>>> >
>>> >>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas
>>> API. I could be wrong, but I wager most people who have worked with both
>>> Spark and Pandas feel the same way.
>>> >>>> >
>>> >>>> > For the large community of current PySpark users, or users
>>> switching to PySpark from another Spark language API, it doesn't make sense
>>> to deprecate the current API, even by convention.
>>> >>>>
>>> >>>>
>>> ---------------------------------------------------------------------
>>> >>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>> >>>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> ---
>>> >>> Takeshi Yamamuro
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Hyukjin Kwon <gu...@gmail.com>.
> Just out of curiosity, does Koalas pretty much implement all of the
Pandas APIs now? If there are some that are yet to be implemented or others
that have differences, are these documented so users won't be caught
off-guard?

It's roughly 75% done so far (in Series, DataFrame and Index).
Yes, and it properly throws an exception saying that an API is not
implemented yet (or is intentionally not implemented, e.g. Series.__iter__,
which would easily let users shoot themselves in the foot by, for example,
using it in a for loop).
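
As a small sketch (illustrative, not the exact error) of why Series.__iter__
is intentionally left unimplemented:

import databricks.koalas as ks

kser = ks.Series([1, 2, 3])
# for v in kser:        # iterating would pull the distributed data to the
#     ...               # driver row by row, so Koalas raises here instead
# Vectorized operations such as (kser + 1) are the intended pattern.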


2021년 3월 17일 (수) 오후 2:17, Bryan Cutler <cu...@gmail.com>님이 작성:

> +1 the proposal sounds good to me. Having a familiar API built-in will
> really help new users get into using Spark that might only have Pandas
> experience. It sounds like maintenance costs should be manageable, once the
> hurdle with setting up tests is done. Just out of curiosity, does Koalas
> pretty much implement all of the Pandas APIs now? If there are some that
> are yet to be implemented or others that have differences, are these
> documented so users won't be caught off-guard?
>
> On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo <an...@gmail.com> wrote:
>
>> Hi,
>>
>> Integrating Koalas with pyspark might help enable a richer integration
>> between the two. Something that would be useful with a tighter
>> integration is support for custom column array types. Currently, Spark
>> takes dataframes, converts them to arrow buffers then transmits them
>> over the socket to Python. On the other side, pyspark takes the arrow
>> buffer and converts it to a Pandas dataframe. Unfortunately, the
>> default Pandas representation of a list-type for a column causes it to
>> turn what was contiguous value/offset arrays in Arrow into
>> deserialized Python objects for each row. Obviously, this kills
>> performance.
>>
>> A PR to extend the pyspark API to elide the pandas conversion
>> (https://github.com/apache/spark/pull/26783) was submitted and
>> rejected, which is unfortunate, but perhaps this proposed integration
>> would provide the hooks via Pandas' ExtensionArray interface to allow
>> Spark to performantly interchange jagged/ragged lists to/from python
>> UDFs.
>>
>> Cheers
>> Andrew
>>
>> On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon <gu...@gmail.com> wrote:
>> >
>> > Thank you guys for all your feedback. I will start working on SPIP with
>> Koalas team.
>> > I would expect the SPIP can be sent late this week or early next week.
>> >
>> >
>> > I inlined and answered the questions unanswered as below:
>> >
>> > Is the community developing the pandas API layer for Spark interested
>> in being part of Spark or do they prefer having their own release cycle?
>> >
>> > Yeah, Koalas team used to have its own release cycle to develop and
>> move quickly.
>> > Now it became pretty mature with reaching 1.7.0, and the team thinks
>> that it’s now
>> > fine to have less frequent releases, and they are happy to work
>> together with Spark with
>> > contributing to it. The active contributors in the Koalas community
>> will continue to
>> > make the contributions in Spark.
>> >
>> > How about test code? Does it fit into the PySpark test framework?
>> >
>> > Yes, this will be one of the places where it needs some efforts. Koalas
>> currently uses pytest
>> > with various dependency version combinations (e.g., Python version,
>> conda vs pip) whereas
>> > PySpark uses the plain unittests with less dependency version
>> combinations.
>> >
>> > For pytest in Koalas <> unittests in PySpark:
>> >
>> >   I am currently thinking we will have to convert the Koalas tests to
>> use unittests to match
>> >   with PySpark for now.
>> >   It is a feasible option for PySpark to migrate to pytest too but it
>> will need extra effort to
>> >   make it working with our own PySpark testing framework seamlessly.
>> >   Koalas team (presumably and likely I) will take a look in any event.
>> >
>> > For the combinations of dependency versions:
>> >
>> >   Due to the lack of the resources in GitHub Actions, I currently plan
>> to just add the
>> >   Koalas tests into the matrix PySpark is currently using.
>> >
>> > one question I have; what’s an initial goal of the proposal?
>> > Is that to port all the pandas interfaces that Koalas has already
>> implemented?
>> > Or, the basic set of them?
>> >
>> > The goal of the proposal is to port all of Koalas project into PySpark.
>> > For example,
>> >
>> > import koalas
>> >
>> > will be equivalent to
>> >
>> > # Names, etc. might change in the final proposal or during the review
>> > from pyspark.sql import pandas
>> >
>> > Koalas supports pandas APIs with a separate layer to cover a bit of
>> difference between
>> > DataFrame structures in pandas and PySpark, e.g.) other types as column
>> names (labels),
>> > index (something like row number in DBMSs) and so on. So I think it
>> would make more sense
>> > to port the whole layer instead of a subset of the APIs.
>> >
>> >
>> >
>> >
>> >
>> > 2021년 3월 17일 (수) 오전 12:32, Wenchen Fan <cl...@gmail.com>님이 작성:
>> >>
>> >> +1, it's great to have Pandas support in Spark out of the box.
>> >>
>> >> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <
>> linguin.m.s@gmail.com> wrote:
>> >>>
>> >>> +1; the pandas interfaces are pretty popular and supporting them in
>> pyspark looks promising, I think.
>> >>> one question I have; what's an initial goal of the proposal?
>> >>> Is that to port all the pandas interfaces that Koalas has already
>> implemented?
>> >>> Or, the basic set of them?
>> >>>
>> >>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ie...@gmail.com>
>> wrote:
>> >>>>
>> >>>> +1
>> >>>>
>> >>>> Bringing a Pandas API for pyspark to upstream Spark will only bring
>> >>>> benefits for everyone (more eyes to use/see/fix/improve the API) as
>> >>>> well as better alignment with core Spark improvements, the extra
>> >>>> weight looks manageable.
>> >>>>
>> >>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
>> >>>> <ni...@gmail.com> wrote:
>> >>>> >
>> >>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <rx...@databricks.com>
>> wrote:
>> >>>> >>
>> >>>> >> I don't think we should deprecate existing APIs.
>> >>>> >
>> >>>> >
>> >>>> > +1
>> >>>> >
>> >>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas
>> API. I could be wrong, but I wager most people who have worked with both
>> Spark and Pandas feel the same way.
>> >>>> >
>> >>>> > For the large community of current PySpark users, or users
>> switching to PySpark from another Spark language API, it doesn't make sense
>> to deprecate the current API, even by convention.
>> >>>>
>> >>>> ---------------------------------------------------------------------
>> >>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>> >>>>
>> >>>
>> >>>
>> >>> --
>> >>> ---
>> >>> Takeshi Yamamuro
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Bryan Cutler <cu...@gmail.com>.
+1, the proposal sounds good to me. Having a familiar API built in will
really help new users who might only have Pandas experience get into using
Spark. It sounds like maintenance costs should be manageable once the
hurdle of setting up tests is cleared. Just out of curiosity, does Koalas
pretty much implement all of the Pandas APIs now? If there are some that
are yet to be implemented or others that have differences, are these
documented so users won't be caught off-guard?

On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo <an...@gmail.com> wrote:

> Hi,
>
> Integrating Koalas with pyspark might help enable a richer integration
> between the two. Something that would be useful with a tighter
> integration is support for custom column array types. Currently, Spark
> takes dataframes, converts them to arrow buffers then transmits them
> over the socket to Python. On the other side, pyspark takes the arrow
> buffer and converts it to a Pandas dataframe. Unfortunately, the
> default Pandas representation of a list-type for a column causes it to
> turn what was contiguous value/offset arrays in Arrow into
> deserialized Python objects for each row. Obviously, this kills
> performance.
>
> A PR to extend the pyspark API to elide the pandas conversion
> (https://github.com/apache/spark/pull/26783) was submitted and
> rejected, which is unfortunate, but perhaps this proposed integration
> would provide the hooks via Pandas' ExtensionArray interface to allow
> Spark to performantly interchange jagged/ragged lists to/from python
> UDFs.
>
> Cheers
> Andrew
>
> On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon <gu...@gmail.com> wrote:
> >
> > Thank you guys for all your feedback. I will start working on SPIP with
> Koalas team.
> > I would expect the SPIP can be sent late this week or early next week.
> >
> >
> > I inlined and answered the questions unanswered as below:
> >
> > Is the community developing the pandas API layer for Spark interested in
> being part of Spark or do they prefer having their own release cycle?
> >
> > Yeah, Koalas team used to have its own release cycle to develop and move
> quickly.
> > Now it became pretty mature with reaching 1.7.0, and the team thinks
> that it’s now
> > fine to have less frequent releases, and they are happy to work together
> with Spark with
> > contributing to it. The active contributors in the Koalas community will
> continue to
> > make the contributions in Spark.
> >
> > How about test code? Does it fit into the PySpark test framework?
> >
> > Yes, this will be one of the places where it needs some efforts. Koalas
> currently uses pytest
> > with various dependency version combinations (e.g., Python version,
> conda vs pip) whereas
> > PySpark uses the plain unittests with less dependency version
> combinations.
> >
> > For pytest in Koalas <> unittests in PySpark:
> >
> >   I am currently thinking we will have to convert the Koalas tests to
> use unittests to match
> >   with PySpark for now.
> >   It is a feasible option for PySpark to migrate to pytest too but it
> will need extra effort to
> >   make it working with our own PySpark testing framework seamlessly.
> >   Koalas team (presumably and likely I) will take a look in any event.
> >
> > For the combinations of dependency versions:
> >
> >   Due to the lack of the resources in GitHub Actions, I currently plan
> to just add the
> >   Koalas tests into the matrix PySpark is currently using.
> >
> > one question I have; what’s an initial goal of the proposal?
> > Is that to port all the pandas interfaces that Koalas has already
> implemented?
> > Or, the basic set of them?
> >
> > The goal of the proposal is to port all of Koalas project into PySpark.
> > For example,
> >
> > import koalas
> >
> > will be equivalent to
> >
> > # Names, etc. might change in the final proposal or during the review
> > from pyspark.sql import pandas
> >
> > Koalas supports pandas APIs with a separate layer to cover a bit of
> difference between
> > DataFrame structures in pandas and PySpark, e.g.) other types as column
> names (labels),
> > index (something like row number in DBMSs) and so on. So I think it
> would make more sense
> > to port the whole layer instead of a subset of the APIs.
> >
> >
> >
> >
> >
> > 2021년 3월 17일 (수) 오전 12:32, Wenchen Fan <cl...@gmail.com>님이 작성:
> >>
> >> +1, it's great to have Pandas support in Spark out of the box.
> >>
> >> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <
> linguin.m.s@gmail.com> wrote:
> >>>
> >>> +1; the pandas interfaces are pretty popular and supporting them in
> pyspark looks promising, I think.
> >>> one question I have; what's an initial goal of the proposal?
> >>> Is that to port all the pandas interfaces that Koalas has already
> implemented?
> >>> Or, the basic set of them?
> >>>
> >>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ie...@gmail.com>
> wrote:
> >>>>
> >>>> +1
> >>>>
> >>>> Bringing a Pandas API for pyspark to upstream Spark will only bring
> >>>> benefits for everyone (more eyes to use/see/fix/improve the API) as
> >>>> well as better alignment with core Spark improvements, the extra
> >>>> weight looks manageable.
> >>>>
> >>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
> >>>> <ni...@gmail.com> wrote:
> >>>> >
> >>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <rx...@databricks.com>
> wrote:
> >>>> >>
> >>>> >> I don't think we should deprecate existing APIs.
> >>>> >
> >>>> >
> >>>> > +1
> >>>> >
> >>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas
> API. I could be wrong, but I wager most people who have worked with both
> Spark and Pandas feel the same way.
> >>>> >
> >>>> > For the large community of current PySpark users, or users
> switching to PySpark from another Spark language API, it doesn't make sense
> to deprecate the current API, even by convention.
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >>>>
> >>>
> >>>
> >>> --
> >>> ---
> >>> Takeshi Yamamuro
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Andrew Melo <an...@gmail.com>.
Hi,

Integrating Koalas with pyspark might help enable a richer integration
between the two. Something that would be useful with a tighter
integration is support for custom column array types. Currently, Spark
takes dataframes, converts them to Arrow buffers, then transmits them
over the socket to Python. On the other side, pyspark takes the Arrow
buffer and converts it to a Pandas dataframe. Unfortunately, the
default Pandas representation of a list-type column causes it to
turn what were contiguous value/offset arrays in Arrow into
deserialized Python objects for each row. Obviously, this kills
performance.
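
To make the cost concrete, here is a small self-contained sketch with pyarrow
and pandas (not Spark itself, just the same conversion path) showing
contiguous Arrow list data turning into per-row Python objects:

import pyarrow as pa

arr = pa.array([[1, 2, 3], [4, 5]], type=pa.list_(pa.int64()))
print(arr.values)    # contiguous child values: [1, 2, 3, 4, 5]
print(arr.offsets)   # contiguous offsets: [0, 3, 5]

ser = pa.table({"col": arr}).to_pandas()["col"]
print(ser.dtype)     # object: each row is now a separate, deserialized object
print(type(ser[0]))  # e.g. a numpy.ndarray materialized per row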

A PR to extend the pyspark API to elide the pandas conversion
(https://github.com/apache/spark/pull/26783) was submitted and
rejected, which is unfortunate, but perhaps this proposed integration
would provide the hooks via Pandas' ExtensionArray interface to allow
Spark to performantly interchange jagged/ragged lists to/from python
UDFs.

Cheers
Andrew

On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon <gu...@gmail.com> wrote:
>
> Thank you guys for all your feedback. I will start working on SPIP with Koalas team.
> I would expect the SPIP can be sent late this week or early next week.
>
>
> I inlined and answered the questions unanswered as below:
>
> Is the community developing the pandas API layer for Spark interested in being part of Spark or do they prefer having their own release cycle?
>
> Yeah, Koalas team used to have its own release cycle to develop and move quickly.
> Now it became pretty mature with reaching 1.7.0, and the team thinks that it’s now
> fine to have less frequent releases, and they are happy to work together with Spark with
> contributing to it. The active contributors in the Koalas community will continue to
> make the contributions in Spark.
>
> How about test code? Does it fit into the PySpark test framework?
>
> Yes, this will be one of the places where it needs some efforts. Koalas currently uses pytest
> with various dependency version combinations (e.g., Python version, conda vs pip) whereas
> PySpark uses the plain unittests with less dependency version combinations.
>
> For pytest in Koalas <> unittests in PySpark:
>
>   I am currently thinking we will have to convert the Koalas tests to use unittests to match
>   with PySpark for now.
>   It is a feasible option for PySpark to migrate to pytest too but it will need extra effort to
>   make it working with our own PySpark testing framework seamlessly.
>   Koalas team (presumably and likely I) will take a look in any event.
>
> For the combinations of dependency versions:
>
>   Due to the lack of the resources in GitHub Actions, I currently plan to just add the
>   Koalas tests into the matrix PySpark is currently using.
>
> one question I have; what’s an initial goal of the proposal?
> Is that to port all the pandas interfaces that Koalas has already implemented?
> Or, the basic set of them?
>
> The goal of the proposal is to port all of Koalas project into PySpark.
> For example,
>
> import koalas
>
> will be equivalent to
>
> # Names, etc. might change in the final proposal or during the review
> from pyspark.sql import pandas
>
> Koalas supports pandas APIs with a separate layer to cover a bit of difference between
> DataFrame structures in pandas and PySpark, e.g.) other types as column names (labels),
> index (something like row number in DBMSs) and so on. So I think it would make more sense
> to port the whole layer instead of a subset of the APIs.
>
>
>
>
>
> 2021년 3월 17일 (수) 오전 12:32, Wenchen Fan <cl...@gmail.com>님이 작성:
>>
>> +1, it's great to have Pandas support in Spark out of the box.
>>
>> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <li...@gmail.com> wrote:
>>>
>>> +1; the pandas interfaces are pretty popular and supporting them in pyspark looks promising, I think.
>>> one question I have; what's an initial goal of the proposal?
>>> Is that to port all the pandas interfaces that Koalas has already implemented?
>>> Or, the basic set of them?
>>>
>>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>>>
>>>> +1
>>>>
>>>> Bringing a Pandas API for pyspark to upstream Spark will only bring
>>>> benefits for everyone (more eyes to use/see/fix/improve the API) as
>>>> well as better alignment with core Spark improvements, the extra
>>>> weight looks manageable.
>>>>
>>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
>>>> <ni...@gmail.com> wrote:
>>>> >
>>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <rx...@databricks.com> wrote:
>>>> >>
>>>> >> I don't think we should deprecate existing APIs.
>>>> >
>>>> >
>>>> > +1
>>>> >
>>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could be wrong, but I wager most people who have worked with both Spark and Pandas feel the same way.
>>>> >
>>>> > For the large community of current PySpark users, or users switching to PySpark from another Spark language API, it doesn't make sense to deprecate the current API, even by convention.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Hyukjin Kwon <gu...@gmail.com>.
Thanks Nicholas for the pointer :-).

On Thu, 18 Mar 2021, 00:11 Nicholas Chammas, <ni...@gmail.com>
wrote:

> On Tue, Mar 16, 2021 at 9:15 PM Hyukjin Kwon <gu...@gmail.com> wrote:
>
>>   I am currently thinking we will have to convert the Koalas tests to use
>> unittests to match with PySpark for now.
>>
> Keep in mind that pytest supports unittest-based tests out of the box
> <https://docs.pytest.org/en/stable/unittest.html>, so you should be able
> to run pytest against the PySpark codebase without changing much about the
> tests.
>

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Nicholas Chammas <ni...@gmail.com>.
On Tue, Mar 16, 2021 at 9:15 PM Hyukjin Kwon <gu...@gmail.com> wrote:

>   I am currently thinking we will have to convert the Koalas tests to use
> unittests to match with PySpark for now.
>
Keep in mind that pytest supports unittest-based tests out of the box
<https://docs.pytest.org/en/stable/unittest.html>, so you should be able to
run pytest against the PySpark codebase without changing much about the
tests.
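
For example (a hypothetical test file), pytest collects a plain
unittest.TestCase without modification:

# test_example.py (hypothetical file name)
import unittest

class ArithmeticTest(unittest.TestCase):
    def test_add(self):
        self.assertEqual(1 + 1, 2)

# `pytest test_example.py` discovers and runs this unittest-style test as-is.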

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Hyukjin Kwon <gu...@gmail.com>.
Thank you guys for all your feedback. I will start working on SPIP with
Koalas team.
I would expect the SPIP can be sent late this week or early next week.


I inlined and answered the previously unanswered questions below:

Is the community developing the pandas API layer for Spark interested in
being part of Spark or do they prefer having their own release cycle?

Yeah, the Koalas team used to have its own release cycle to develop and move
quickly.
Now that it has become pretty mature, reaching 1.7.0, the team thinks it is
fine to have less frequent releases, and they are happy to work together
with Spark by contributing to it. The active contributors in the Koalas
community will continue to make contributions in Spark.

How about test code? Does it fit into the PySpark test framework?

Yes, this is one of the places that will need some effort. Koalas currently
uses pytest with various dependency version combinations (e.g., Python
version, conda vs pip), whereas PySpark uses plain unittest with fewer
dependency version combinations.

For pytest in Koalas <> unittest in PySpark:

  I am currently thinking we will have to convert the Koalas tests to use
  unittest to match with PySpark for now.
  It is a feasible option for PySpark to migrate to pytest too, but it will
  need extra effort to make it work seamlessly with our own PySpark testing
  framework.
  The Koalas team (presumably and likely I) will take a look in any event.

For the combinations of dependency versions:

  Due to the lack of resources in GitHub Actions, I currently plan to just
  add the Koalas tests into the matrix PySpark is currently using.

one question I have; what’s an initial goal of the proposal?
Is that to port all the pandas interfaces that Koalas has already
implemented?
Or, the basic set of them?

The goal of the proposal is to port the whole Koalas project into PySpark.
For example,

import koalas

will be equivalent to

# Names, etc. might change in the final proposal or during the review
from pyspark.sql import pandas

Koalas supports pandas APIs with a separate layer that covers the
differences between the DataFrame structures in pandas and PySpark, e.g.
other types as column names (labels), an index (something like a row number
in DBMSs), and so on. So I think it would make more sense to port the whole
layer instead of a subset of the APIs.
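
As a small sketch of the index concept that this layer maintains on top of
Spark (using the current Koalas 1.x API):

import databricks.koalas as ks

kdf = ks.DataFrame({"a": [10, 20, 30]})
print(kdf.index)      # pandas-style row index, which a plain Spark DataFrame lacks
print(kdf.loc[1:2])   # label-based row selection through that index (inclusive, as in pandas)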





2021년 3월 17일 (수) 오전 12:32, Wenchen Fan <cl...@gmail.com>님이 작성:

> +1, it's great to have Pandas support in Spark out of the box.
>
> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <li...@gmail.com>
> wrote:
>
>> +1; the pandas interfaces are pretty popular and supporting them in
>> pyspark looks promising, I think.
>> one question I have; what's an initial goal of the proposal?
>> Is that to port all the pandas interfaces that Koalas has already
>> implemented?
>> Or, the basic set of them?
>>
>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Bringing a Pandas API for pyspark to upstream Spark will only bring
>>> benefits for everyone (more eyes to use/see/fix/improve the API) as
>>> well as better alignment with core Spark improvements, the extra
>>> weight looks manageable.
>>>
>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
>>> <ni...@gmail.com> wrote:
>>> >
>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <rx...@databricks.com>
>>> wrote:
>>> >>
>>> >> I don't think we should deprecate existing APIs.
>>> >
>>> >
>>> > +1
>>> >
>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas API. I
>>> could be wrong, but I wager most people who have worked with both Spark and
>>> Pandas feel the same way.
>>> >
>>> > For the large community of current PySpark users, or users switching
>>> to PySpark from another Spark language API, it doesn't make sense to
>>> deprecate the current API, even by convention.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Wenchen Fan <cl...@gmail.com>.
+1, it's great to have Pandas support in Spark out of the box.

On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <li...@gmail.com>
wrote:

> +1; the pandas interfaces are pretty popular and supporting them in
> pyspark looks promising, I think.
> one question I have; what's an initial goal of the proposal?
> Is that to port all the pandas interfaces that Koalas has already
> implemented?
> Or, the basic set of them?
>
> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ie...@gmail.com> wrote:
>
>> +1
>>
>> Bringing a Pandas API for pyspark to upstream Spark will only bring
>> benefits for everyone (more eyes to use/see/fix/improve the API) as
>> well as better alignment with core Spark improvements, the extra
>> weight looks manageable.
>>
>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
>> <ni...@gmail.com> wrote:
>> >
>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <rx...@databricks.com>
>> wrote:
>> >>
>> >> I don't think we should deprecate existing APIs.
>> >
>> >
>> > +1
>> >
>> > I strongly prefer Spark's immutable DataFrame API to the Pandas API. I
>> could be wrong, but I wager most people who have worked with both Spark and
>> Pandas feel the same way.
>> >
>> > For the large community of current PySpark users, or users switching to
>> PySpark from another Spark language API, it doesn't make sense to deprecate
>> the current API, even by convention.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Takeshi Yamamuro <li...@gmail.com>.
+1; the pandas interfaces are pretty popular and supporting them in pyspark
looks promising, I think.
one question I have; what's an initial goal of the proposal?
Is that to port all the pandas interfaces that Koalas has already
implemented?
Or, the basic set of them?

On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ie...@gmail.com> wrote:

> +1
>
> Bringing a Pandas API for pyspark to upstream Spark will only bring
> benefits for everyone (more eyes to use/see/fix/improve the API) as
> well as better alignment with core Spark improvements, the extra
> weight looks manageable.
>
> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
> <ni...@gmail.com> wrote:
> >
> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <rx...@databricks.com> wrote:
> >>
> >> I don't think we should deprecate existing APIs.
> >
> >
> > +1
> >
> > I strongly prefer Spark's immutable DataFrame API to the Pandas API. I
> could be wrong, but I wager most people who have worked with both Spark and
> Pandas feel the same way.
> >
> > For the large community of current PySpark users, or users switching to
> PySpark from another Spark language API, it doesn't make sense to deprecate
> the current API, even by convention.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Ismaël Mejía <ie...@gmail.com>.
+1

Bringing a Pandas API for pyspark to upstream Spark will only bring
benefits for everyone (more eyes to use/see/fix/improve the API) as
well as better alignment with core Spark improvements, the extra
weight looks manageable.

On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
<ni...@gmail.com> wrote:
>
> On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <rx...@databricks.com> wrote:
>>
>> I don't think we should deprecate existing APIs.
>
>
> +1
>
> I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could be wrong, but I wager most people who have worked with both Spark and Pandas feel the same way.
>
> For the large community of current PySpark users, or users switching to PySpark from another Spark language API, it doesn't make sense to deprecate the current API, even by convention.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Nicholas Chammas <ni...@gmail.com>.
On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <rx...@databricks.com> wrote:

> I don't think we should deprecate existing APIs.
>

+1

I strongly prefer Spark's immutable DataFrame API to the Pandas API. I
could be wrong, but I wager most people who have worked with both Spark and
Pandas feel the same way.

For the large community of current PySpark users, or users switching to
PySpark from another Spark language API, it doesn't make sense to deprecate
the current API, even by convention.

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Maciej <ms...@gmail.com>.
I concur. These two don't have the same target audience or
expressiveness. I cannot imagine most of the PySpark projects I've seen
switching to a Pandas-style API.

If this is to be included, it would be great if we could model it similarly
to SQLAlchemy, with its core and ORM components being equally important
parts of the API.

On 3/15/21 7:12 AM, Reynold Xin wrote:
> I don't think we should deprecate existing APIs.
>
> Spark's own Python API is relatively stable and not difficult to
> support. It has a pretty large number of users and existing code. Also
> pretty easy to learn by data engineers.
>
> pandas API is a great for data science, but isn't that great for some
> other tasks. It's super wide. Great for data scientists that have
> learned it, or great for copy paste from Stackoverflow.
>
>
>
> On Sun, Mar 14, 2021 at 11:08 PM, Dongjoon Hyun
> <dongjoon.hyun@gmail.com <ma...@gmail.com>> wrote:
>
>     Thank you for the proposal. It looks like a good addition.
>     BTW, what is the future plan for the existing APIs?
>     Are we going to deprecate it eventually in favor of Koalas
>     (because we don't remove the existing APIs in general)?
>
>     > Fourthly, PySpark is still not Pythonic enough. For example, I
>     hear complaints such as "why does
>     > PySpark follow pascalCase?" or "PySpark APIs are difficult to
>     learn", and APIs are very difficult to change
>     > in Spark (as I emphasized above). 
>
>
>     On Sun, Mar 14, 2021 at 4:03 AM Hyukjin Kwon <gurwls223@gmail.com
>     <ma...@gmail.com>> wrote:
>
>         Firstly my biggest reason is that I would like to promote this
>         more as a built-in support because it is simply
>         important to have it with the impact on the large user group,
>         and the needs are increasing
>         as the charts indicate. I usually think that features or
>         add-ons stay as third parties when it’s rather for a
>         smaller set of users, it addresses a corner case of needs,
>         etc. I think this is similar to the datasources
>         we have added. Spark ported CSV and Avro because more and more
>         people use it, and it became important
>         to have it as a built-in support.
>
>         Secondly, Koalas needs more help from Spark, PySpark, Python
>         and pandas experts from the
>         bigger community. Koalas’ team isn’t experts in all the areas,
>         and there are many missing corner
>         cases to fix, Some require deep expertise from specific areas.
>
>         One example is the type hints. Koalas uses type hints for
>         schema inference.
>         Due to the lack of Python’s type hinting way, Koalas added its
>         own (hacky) way
>         <https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hints-in-koalas>.
>         Fortunately the way Koalas implemented is now partially
>         proposed into Python officially (PEP 646).
>         But Koalas could have been better with interacting with the
>         Python community more and actively
>         joining in the design issues together to lead the best output
>         that benefits both and more projects.
>
>         Thirdly, I would like to contribute to the growth of PySpark.
>         The growth of the Koalas is very fast given the
>         internal and external stats. The number of users has jumped up
>         twice almost every 4 ~ 6 months.
>         I think Koalas will be a good momentum to keep Spark up.
>
>         Fourthly, PySpark is still not Pythonic enough. For example, I
>         hear complaints such as "why does
>         PySpark follow pascalCase?" or "PySpark APIs are difficult to
>         learn", and APIs are very difficult to change
>         in Spark (as I emphasized above). This set of Koalas APIs will
>         be able to address these concerns
>         in PySpark.
>
>         Lastly, I really think PySpark needs its native plotting
>         features. As I emphasized before with
>         elaboration, I do think this is an important feature missing
>         in PySpark that users need.
>         I do think Koalas completes what PySpark is currently missing.
>
>
>
>         2021년 3월 14일 (일) 오후 7:12, Sean Owen <srowen@gmail.com
>         <ma...@gmail.com>>님이 작성:
>
>             I like koalas a lot. Playing devil's advocate, why not
>             just let it continue to live as an add on? Usually the
>             argument is it'll be maintained better in Spark but it's
>             well maintained. It adds some overhead to maintaining
>             Spark conversely. On the upside it makes it a little more
>             discoverable. Are there more 'synergies'?
>
>             On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon
>             <gurwls223@gmail.com <ma...@gmail.com>> wrote:
>
>                 Hi all,
>
>
>                 I would like to start the discussion on supporting
>                 pandas API layer on Spark.
>
>                  
>
>                 If we have a general consensus on having it in
>                 PySpark, I will initiate and drive an SPIP with a
>                 detailed explanation about the implementation’s
>                 overview and structure.
>
>                 I would appreciate it if I can know whether you guys
>                 support this or not before starting the SPIP.
>
>
>                     What do you want to propose?
>
>                 I have been working on the Koalas
>                 <https://github.com/databricks/koalas>project that is
>                 essentially: pandas API support on Spark, and I would
>                 like to propose embracing Koalas in PySpark.
>
>                  
>
>                 More specifically, I am thinking about adding a
>                 separate package, to PySpark, for pandas APIs on
>                 PySpark Therefore it wouldn’t break anything in the
>                 existing codes. The overview would look as below:
>
>                 pyspark_dataframe.[... PySpark APIs ...]
>                 pandas_dataframe.[... pandas APIs (local) ...]
>
>                 # The package names will change in the final proposal and during review.
>                 koalas_dataframe = koalas.from_pandas(pandas_dataframe)
>                 koalas_dataframe = koalas.from_spark(pyspark_dataframe)
>                 koalas_dataframe.[... pandas APIs on Spark ...]
>
>                 pyspark_dataframe = koalas_dataframe.to_spark()
>                 pandas_dataframe = koalas_dataframe.to_pandas()
>
>                 Koalas provides a pandas API layer on PySpark. It
>                 supports almost the same API usages. Users can
>                 leverage their existing Spark cluster to scale their
>                 pandas workloads. It works interchangeably with
>                 PySpark by allowing both pandas and PySpark APIs to users.
>
>                 The project has grown separately more than two years,
>                 and this has been successfully going. With version
>                 1.7.0 Koalas has greatly improved maturity and
>                 stability. Its usability has been proven with numerous
>                 users’ adoptions and by reaching more than 75% API
>                 coverage in pandas’ Index, Series and DataFrame.
>
>                 I strongly think this is the direction we should go
>                 for Apache Spark, and it is a win-win strategy for the
>                 growth of both Apache Spark and pandas. Please see the
>                 reasons below.
>
>
>                     Why do we need it?
>
>                  *
>
>                     Python has grown dramatically in the last few
>                     years and became one of the most popular
>                     languages, see also StackOverFlow trend
>                     <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>for
>                     Python, Java, R and Scala languages.
>
>                  *
>
>                     pandas became almost the standard library of data
>                     science. Please also see the StackOverFlow trend
>                     <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>                     for pandas, Apache Spark and PySpark.
>
>                  *
>
>                     PySpark is not Pythonic enough. At least I myself
>                     hear a lot of complaints. That initiated Project
>                     Zen
>                     <https://issues.apache.org/jira/browse/SPARK-32082>,
>                     and we have greatly improved PySpark usability and
>                     made it more Pythonic. 
>
>                 Nevertheless, data scientists tend to prefer pandas
>                 libraries according to the trends but APIs are hard to
>                 change in PySpark. We should redesign all APIs and
>                 improve them from scratch, which is very difficult.
>
>
>                 One straightforward and fast approach is to benchmark
>                 a successful case, and pandas does not support
>                 distributed execution. Once PySpark supports
>                 pandas-like APIs, it can be a good option for pandas
>                 users to scale their workloads easily. I do believe
>                 this is a win-win strategy for the growth of both
>                 pandas and PySpark.
>
>
>                 In fact, there are already similar tries such as Dask
>                 <https://dask.org/>and Modin
>                 <https://modin.readthedocs.io/en/latest/>(other than
>                 Koalas <https://github.com/databricks/koalas>). They
>                 are all growing fast and successfully, and I find that
>                 people compare it to PySpark from time to time, for
>                 example, see Beyond Pandas: Spark, Dask, Vaex and
>                 other big data technologies battling head to head
>                 <https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13>.
>
>                  
>
>                  *
>
>                     There are many important features missing that are
>                     very common in data science. One of the most
>                     important features is plotting and drawing a
>                     chart. Almost every data scientist plots and draws
>                     a chart to understand their data quickly and
>                     visually in their daily work but this is missing
>                     in PySpark. Please see one example in pandas:
>
>
>                  
>
>                 I do recommend taking a quick look for blog posts and
>                 talks made for pandas on Spark:
>                 https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html
>                 <https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html>.
>                 They explain why we need this far more better.
>
>

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC


Re: Support User Defined Types in pandas_udf for Spark's own Python API

Posted by Hyukjin Kwon <gu...@gmail.com>.
Yeah, we should still improve PySpark APIs together. I am currently tied up
with some work and with porting Koalas at the moment, so I haven't had a chance
to take a very close look (but I do drop some comments and skim).

On Tue, Apr 6, 2021 at 5:31 PM, Darcy Shen <sa...@zoho.com.cn> wrote:

> was: [DISCUSS] Support pandas API layer on PySpark
>
>
> I'm working on [SPARK-34600] Support user defined types in Pandas UDF -
> ASF JIRA (apache.org) <https://issues.apache.org/jira/browse/SPARK-34600>.
>
> I'm wondering if we are still working on improving Spark's own Python API.
>
> SPARK-34600 is a relatively big feature for PySpark. I split it into
> several small tickets and submitted the first small PR:
>
> [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support
> Enabled by sadhen · Pull Request #32026 · apache/spark (github.com)
> <https://github.com/apache/spark/pull/32026>
>
> I'm afraid that the Spark community is busy working on the pandas API layer
> on PySpark and that the improvements for Spark's own Python API will be
> postponed again and again.
>
> As Dongjoon Hyun said:
> > BTW, what is the future plan for the existing APIs?
>
> If we are keeping these existing APIs, will we add new features for
> Spark's own Python API?
>
> Or will we fix bugs for Spark's own Python API?
>
> Specifically, will we add support for User Defined Types in pandas_udf for
> Spark's own Python API?
>
>
> ---- On Mon, 2021-03-15 14:12:28 Reynold Xin <rxin@databricks.com
> <rx...@databricks.com>> wrote ----
>
> I don't think we should deprecate existing APIs.
>
> Spark's own Python API is relatively stable and not difficult to support.
> It has a pretty large number of users and existing code. Also pretty easy
> to learn by data engineers.
>
> pandas API is a great for data science, but isn't that great for some
> other tasks. It's super wide. Great for data scientists that have learned
> it, or great for copy paste from Stackoverflow.
>
>
>
>
>
> On Sun, Mar 14, 2021 at 11:08 PM, Dongjoon Hyun <do...@gmail.com>
> wrote:
>
> Thank you for the proposal. It looks like a good addition.
> BTW, what is the future plan for the existing APIs?
> Are we going to deprecate it eventually in favor of Koalas (because we
> don't remove the existing APIs in general)?
>
> > Fourthly, PySpark is still not Pythonic enough. For example, I hear
> complaints such as "why does
> > PySpark follow pascalCase?" or "PySpark APIs are difficult to learn",
> and APIs are very difficult to change
> > in Spark (as I emphasized above).
>
>
> On Sun, Mar 14, 2021 at 4:03 AM Hyukjin Kwon <gu...@gmail.com> wrote:
>
> Firstly my biggest reason is that I would like to promote this more as a
> built-in support because it is simply
> important to have it with the impact on the large user group, and the
> needs are increasing
> as the charts indicate. I usually think that features or add-ons stay as
> third parties when it’s rather for a
> smaller set of users, it addresses a corner case of needs, etc. I think
> this is similar to the datasources
> we have added. Spark ported CSV and Avro because more and more people use
> it, and it became important
> to have it as a built-in support.
>
> Secondly, Koalas needs more help from Spark, PySpark, Python and pandas
> experts from the
> bigger community. Koalas’ team isn’t experts in all the areas, and there
> are many missing corner
> cases to fix, Some require deep expertise from specific areas.
>
> One example is the type hints. Koalas uses type hints for schema inference.
> Due to the lack of Python’s type hinting way, Koalas added its own
> (hacky) way
> <https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hints-in-koalas>
> .
> Fortunately the way Koalas implemented is now partially proposed into
> Python officially (PEP 646).
> But Koalas could have been better with interacting with the Python
> community more and actively
> joining in the design issues together to lead the best output that
> benefits both and more projects.
>
> Thirdly, I would like to contribute to the growth of PySpark. The growth
> of the Koalas is very fast given the
> internal and external stats. The number of users has jumped up twice
> almost every 4 ~ 6 months.
> I think Koalas will be a good momentum to keep Spark up.
> Fourthly, PySpark is still not Pythonic enough. For example, I hear
> complaints such as "why does
> PySpark follow pascalCase?" or "PySpark APIs are difficult to learn", and
> APIs are very difficult to change
> in Spark (as I emphasized above). This set of Koalas APIs will be able to
> address these concerns
> in PySpark.
>
> Lastly, I really think PySpark needs its native plotting features. As I
> emphasized before with
> elaboration, I do think this is an important feature missing in PySpark
> that users need.
> I do think Koalas completes what PySpark is currently missing.
>
>
>
> On Sun, Mar 14, 2021 at 7:12 PM, Sean Owen <sr...@gmail.com> wrote:
>
> I like koalas a lot. Playing devil's advocate, why not just let it
> continue to live as an add on? Usually the argument is it'll be maintained
> better in Spark but it's well maintained. It adds some overhead to
> maintaining Spark conversely. On the upside it makes it a little more
> discoverable. Are there more 'synergies'?
>
> On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon <gu...@gmail.com> wrote:
>
> Hi all,
>
> I would like to start the discussion on supporting pandas API layer on
> Spark.
>
>
>
> If we have a general consensus on having it in PySpark, I will initiate
> and drive an SPIP with a detailed explanation about the implementation’s
> overview and structure.
>
> I would appreciate it if I can know whether you guys support this or not
> before starting the SPIP.
> What do you want to propose?
>
> I have been working on the Koalas <https://github.com/databricks/koalas>
> project that is essentially: pandas API support on Spark, and I would like
> to propose embracing Koalas in PySpark.
>
>
>
> More specifically, I am thinking about adding a separate package, to
> PySpark, for pandas APIs on PySpark Therefore it wouldn’t break anything in
> the existing codes. The overview would look as below:
>
>
> pyspark_dataframe.[... PySpark APIs ...]
> pandas_dataframe.[... pandas APIs (local) ...]
>
> # The package names will change in the final proposal and during review.
> koalas_dataframe = koalas.from_pandas(pyspark_dataframe)
> koalas_dataframe = koalas.from_spark(pandas_dataframe)
> koalas_dataframe.[... pandas APIs on Spark ...]
>
> pyspark_dataframe = koalas_dataframe.to_spark()
> pandas_dataframe = koalas_dataframe.to_pandas()
>
>
>
> Koalas provides a pandas API layer on PySpark. It supports almost the same
> API usages. Users can leverage their existing Spark cluster to scale their
> pandas workloads. It works interchangeably with PySpark by allowing both
> pandas and PySpark APIs to users.
>
> The project has grown separately more than two years, and this has been
> successfully going. With version 1.7.0 Koalas has greatly improved maturity
> and stability. Its usability has been proven with numerous users’ adoptions
> and by reaching more than 75% API coverage in pandas’ Index, Series and
> DataFrame.
>
>
> I strongly think this is the direction we should go for Apache Spark, and
> it is a win-win strategy for the growth of both Apache Spark and pandas.
> Please see the reasons below.
> Why do we need it?
>
>    -
>
>    Python has grown dramatically in the last few years and became one of
>    the most popular languages, see also StackOverFlow trend
>    <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>    for Python, Java, R and Scala languages.
>    -
>
>    pandas became almost the standard library of data science. Please also
>    see the StackOverFlow trend
>    <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>    for pandas, Apache Spark and PySpark.
>    -
>
>    PySpark is not Pythonic enough. At least I myself hear a lot of
>    complaints. That initiated Project Zen
>    <https://issues.apache.org/jira/browse/SPARK-32082>, and we have
>    greatly improved PySpark usability and made it more Pythonic.
>
> Nevertheless, data scientists tend to prefer pandas libraries according to
> the trends but APIs are hard to change in PySpark. We should redesign all
> APIs and improve them from scratch, which is very difficult.
>
> One straightforward and fast approach is to benchmark a successful case,
> and pandas does not support distributed execution. Once PySpark supports
> pandas-like APIs, it can be a good option for pandas users to scale their
> workloads easily. I do believe this is a win-win strategy for the growth of
> both pandas and PySpark.
>
> In fact, there are already similar tries such as Dask <https://dask.org/>
> and Modin <https://modin.readthedocs.io/en/latest/> (other than Koalas
> <https://github.com/databricks/koalas>). They are all growing fast and
> successfully, and I find that people compare it to PySpark from time to
> time, for example, see Beyond Pandas: Spark, Dask, Vaex and other big
> data technologies battling head to head
> <https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13>
> .
>
>
>
>    -
>
>    There are many important features missing that are very common in data
>    science. One of the most important features is plotting and drawing a
>    chart. Almost every data scientist plots and draws a chart to understand
>    their data quickly and visually in their daily work but this is missing in
>    PySpark. Please see one example in pandas:
>
>
>
>
>
> I do recommend taking a quick look for blog posts and talks made for
> pandas on Spark:
> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
> They explain why we need this far more better.
>
>
>
>
>

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Reynold Xin <rx...@databricks.com>.
I don't think we should deprecate existing APIs.

Spark's own Python API is relatively stable and not difficult to support. It has a pretty large number of users and existing code. Also pretty easy to learn by data engineers.

The pandas API is great for data science, but isn't that great for some other tasks. It's super wide. Great for data scientists who have learned it, or great for copy-pasting from Stack Overflow.

On Sun, Mar 14, 2021 at 11:08 PM, Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:

> 
> Thank you for the proposal. It looks like a good addition.
> BTW, what is the future plan for the existing APIs?
> Are we going to deprecate it eventually in favor of Koalas (because we
> don't remove the existing APIs in general)?
> 
> 
> > Fourthly, PySpark is still not Pythonic enough. For example, I hear
> complaints such as "why does
> 
> > PySpark follow pascalCase?" or "PySpark APIs are difficult to learn",
> and APIs are very difficult to change
> > in Spark (as I emphasized above).
> 
> 
> 
> 
> On Sun, Mar 14, 2021 at 4:03 AM Hyukjin Kwon < gurwls223@gmail.com > wrote:
> 
> 
> 
>> 
>> 
>> Firstly my biggest reason is that I would like to promote this more as a
>> built-in support because it is simply
>> important to have it with the impact on the large user group, and the
>> needs are increasing
>> as the charts indicate. I usually think that features or add-ons stay as
>> third parties when it’s rather for a
>> smaller set of users, it addresses a corner case of needs, etc. I think
>> this is similar to the datasources
>> we have added. Spark ported CSV and Avro because more and more people use
>> it, and it became important
>> to have it as a built-in support.
>> 
>> 
>> 
>> Secondly, Koalas needs more help from Spark, PySpark, Python and pandas
>> experts from the
>> bigger community. Koalas’ team isn’t experts in all the areas, and there
>> are many missing corner
>> cases to fix, Some require deep expertise from specific areas.
>> 
>> 
>> 
>> One example is the type hints. Koalas uses type hints for schema
>> inference.
>> Due to the lack of Python’s type hinting way, Koalas added its own (hacky)
>> way (
>> https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hints-in-koalas
>> ).
>> Fortunately the way Koalas implemented is now partially proposed into
>> Python officially (PEP 646).
>> But Koalas could have been better with interacting with the Python
>> community more and actively
>> joining in the design issues together to lead the best output that
>> benefits both and more projects.
>> 
>> 
>> 
>> Thirdly, I would like to contribute to the growth of PySpark. The growth
>> of the Koalas is very fast given the
>> internal and external stats. The number of users has jumped up twice
>> almost every 4 ~ 6 months.
>> I think Koalas will be a good momentum to keep Spark up.
>> 
>> 
>> Fourthly, PySpark is still not Pythonic enough. For example, I hear
>> complaints such as "why does
>> PySpark follow pascalCase?" or "PySpark APIs are difficult to learn", and
>> APIs are very difficult to change
>> in Spark (as I emphasized above). This set of Koalas APIs will be able to
>> address these concerns
>> in PySpark.
>> 
>> Lastly, I really think PySpark needs its native plotting features. As I
>> emphasized before with
>> elaboration, I do think this is an important feature missing in PySpark
>> that users need.
>> I do think Koalas completes what PySpark is currently missing.
>> 
>> 
>> 
>> 
>> 
>> On Sun, Mar 14, 2021 at 7:12 PM, Sean Owen < srowen@gmail.com > wrote:
>> 
>> 
>>> I like koalas a lot. Playing devil's advocate, why not just let it
>>> continue to live as an add on? Usually the argument is it'll be maintained
>>> better in Spark but it's well maintained. It adds some overhead to
>>> maintaining Spark conversely. On the upside it makes it a little more
>>> discoverable. Are there more 'synergies'?
>>> 
>>> On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon < gurwls223@gmail.com > wrote:
>>> 
>>> 
>>>> 
>>>> 
>>>> Hi all,
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I would like to start the discussion on supporting pandas API layer on
>>>> Spark.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> If we have a general consensus on having it in PySpark, I will initiate
>>>> and drive an SPIP with a detailed explanation about the implementation’s
>>>> overview and structure.
>>>> 
>>>> 
>>>> 
>>>> I would appreciate it if I can know whether you guys support this or not
>>>> before starting the SPIP.
>>>> 
>>>> 
>>>> 
>>>> ----------------------------
>>>>  What do you want to propose?
>>>> ----------------------------
>>>> 
>>>> 
>>>> 
>>>> I have been working on the Koalas ( https://github.com/databricks/koalas )
>>>> project that is essentially: pandas API support on Spark, and I would like
>>>> to propose embracing Koalas in PySpark.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> More specifically, I am thinking about adding a separate package, to
>>>> PySpark, for pandas APIs on PySpark Therefore it wouldn’t break anything
>>>> in the existing codes. The overview would look as below:
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> pyspark_dataframe.[... PySpark APIs ...]
>>>> pandas_dataframe.[... pandas APIs (local) ...]
>>>> 
>>>> # The package names will change in the final proposal and during review. 
>>>> koalas_dataframe = koalas.from_pandas(pyspark_dataframe)
>>>> koalas_dataframe = koalas.from_spark(pandas_dataframe)
>>>> koalas_dataframe.[... pandas APIs on Spark ...]
>>>> 
>>>> pyspark_dataframe = koalas_dataframe.to_spark()
>>>> pandas_dataframe = koalas_dataframe.to_pandas()
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Koalas provides a pandas API layer on PySpark. It supports almost the same
>>>> API usages. Users can leverage their existing Spark cluster to scale their
>>>> pandas workloads. It works interchangeably with PySpark by allowing both
>>>> pandas and PySpark APIs to users.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> The project has grown separately more than two years, and this has been
>>>> successfully going. With version 1.7.0 Koalas has greatly improved
>>>> maturity and stability. Its usability has been proven with numerous users’
>>>> adoptions and by reaching more than 75% API coverage in pandas’ Index,
>>>> Series and DataFrame.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I strongly think this is the direction we should go for Apache Spark, and
>>>> it is a win-win strategy for the growth of both Apache Spark and pandas.
>>>> Please see the reasons below.
>>>> 
>>>> 
>>>> 
>>>> ------------------
>>>>  Why do we need it?
>>>> ------------------
>>>> 
>>>> 
>>>> 
>>>> * 
>>>> 
>>>> Python has grown dramatically in the last few years and became one of the
>>>> most popular languages, see also StackOverFlow trend (
>>>> https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr )
>>>> for Python, Java, R and Scala languages.
>>>> 
>>>> 
>>>> * 
>>>> 
>>>> pandas became almost the standard library of data science. Please also see
>>>> the StackOverFlow trend (
>>>> https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr )
>>>> for pandas, Apache Spark and PySpark.
>>>> 
>>>> 
>>>> * 
>>>> 
>>>> PySpark is not Pythonic enough. At least I myself hear a lot of
>>>> complaints. That initiated Project Zen (
>>>> https://issues.apache.org/jira/browse/SPARK-32082 ) , and we have greatly
>>>> improved PySpark usability and made it more Pythonic.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Nevertheless, data scientists tend to prefer pandas libraries according to
>>>> the trends but APIs are hard to change in PySpark. We should redesign all
>>>> APIs and improve them from scratch, which is very difficult.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> One straightforward and fast approach is to benchmark a successful case,
>>>> and pandas does not support distributed execution. Once PySpark supports
>>>> pandas-like APIs, it can be a good option for pandas users to scale their
>>>> workloads easily. I do believe this is a win-win strategy for the growth
>>>> of both pandas and PySpark.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> In fact, there are already similar tries such as Dask ( https://dask.org/ )
>>>> and Modin ( https://modin.readthedocs.io/en/latest/ ) (other than Koalas (
>>>> https://github.com/databricks/koalas ) ). They are all growing fast and
>>>> successfully, and I find that people compare it to PySpark from time to
>>>> time, for example, see Beyond Pandas: Spark, Dask, Vaex and other big data
>>>> technologies battling head to head (
>>>> https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13
>>>> ).
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> * 
>>>> 
>>>> There are many important features missing that are very common in data
>>>> science. One of the most important features is plotting and drawing a
>>>> chart. Almost every data scientist plots and draws a chart to understand
>>>> their data quickly and visually in their daily work but this is missing in
>>>> PySpark. Please see one example in pandas:
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I do recommend taking a quick look for blog posts and talks made for
>>>> pandas on Spark: https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html
>>>> (
>>>> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html
>>>> ). They explain why we need this far more better.
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you for the proposal. It looks like a good addition.
BTW, what is the future plan for the existing APIs?
Are we going to deprecate it eventually in favor of Koalas (because we
don't remove the existing APIs in general)?

> Fourthly, PySpark is still not Pythonic enough. For example, I hear
complaints such as "why does
> PySpark follow pascalCase?" or "PySpark APIs are difficult to learn", and
APIs are very difficult to change
> in Spark (as I emphasized above).


On Sun, Mar 14, 2021 at 4:03 AM Hyukjin Kwon <gu...@gmail.com> wrote:

> Firstly my biggest reason is that I would like to promote this more as a
> built-in support because it is simply
> important to have it with the impact on the large user group, and the
> needs are increasing
> as the charts indicate. I usually think that features or add-ons stay as
> third parties when it’s rather for a
> smaller set of users, it addresses a corner case of needs, etc. I think
> this is similar to the datasources
> we have added. Spark ported CSV and Avro because more and more people use
> it, and it became important
> to have it as a built-in support.
>
> Secondly, Koalas needs more help from Spark, PySpark, Python and pandas
> experts from the
> bigger community. Koalas’ team isn’t experts in all the areas, and there
> are many missing corner
> cases to fix, Some require deep expertise from specific areas.
>
> One example is the type hints. Koalas uses type hints for schema inference.
> Due to the lack of Python’s type hinting way, Koalas added its own
> (hacky) way
> <https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hints-in-koalas>
> .
> Fortunately the way Koalas implemented is now partially proposed into
> Python officially (PEP 646).
> But Koalas could have been better with interacting with the Python
> community more and actively
> joining in the design issues together to lead the best output that
> benefits both and more projects.
>
> Thirdly, I would like to contribute to the growth of PySpark. The growth
> of the Koalas is very fast given the
> internal and external stats. The number of users has jumped up twice
> almost every 4 ~ 6 months.
> I think Koalas will be a good momentum to keep Spark up.
> Fourthly, PySpark is still not Pythonic enough. For example, I hear
> complaints such as "why does
> PySpark follow pascalCase?" or "PySpark APIs are difficult to learn", and
> APIs are very difficult to change
> in Spark (as I emphasized above). This set of Koalas APIs will be able to
> address these concerns
> in PySpark.
>
> Lastly, I really think PySpark needs its native plotting features. As I
> emphasized before with
> elaboration, I do think this is an important feature missing in PySpark
> that users need.
> I do think Koalas completes what PySpark is currently missing.
>
>
>
> On Sun, Mar 14, 2021 at 7:12 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> I like koalas a lot. Playing devil's advocate, why not just let it
>> continue to live as an add on? Usually the argument is it'll be maintained
>> better in Spark but it's well maintained. It adds some overhead to
>> maintaining Spark conversely. On the upside it makes it a little more
>> discoverable. Are there more 'synergies'?
>>
>> On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon <gu...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I would like to start the discussion on supporting pandas API layer on
>>> Spark.
>>>
>>>
>>>
>>> If we have a general consensus on having it in PySpark, I will initiate
>>> and drive an SPIP with a detailed explanation about the implementation’s
>>> overview and structure.
>>>
>>> I would appreciate it if I can know whether you guys support this or not
>>> before starting the SPIP.
>>> What do you want to propose?
>>>
>>> I have been working on the Koalas <https://github.com/databricks/koalas>
>>> project that is essentially: pandas API support on Spark, and I would like
>>> to propose embracing Koalas in PySpark.
>>>
>>>
>>>
>>> More specifically, I am thinking about adding a separate package, to
>>> PySpark, for pandas APIs on PySpark Therefore it wouldn’t break anything in
>>> the existing codes. The overview would look as below:
>>>
>>> pyspark_dataframe.[... PySpark APIs ...]
>>> pandas_dataframe.[... pandas APIs (local) ...]
>>>
>>> # The package names will change in the final proposal and during review.
>>> koalas_dataframe = koalas.from_pandas(pyspark_dataframe)
>>> koalas_dataframe = koalas.from_spark(pandas_dataframe)
>>> koalas_dataframe.[... pandas APIs on Spark ...]
>>>
>>> pyspark_dataframe = koalas_dataframe.to_spark()
>>> pandas_dataframe = koalas_dataframe.to_pandas()
>>>
>>> Koalas provides a pandas API layer on PySpark. It supports almost the
>>> same API usages. Users can leverage their existing Spark cluster to scale
>>> their pandas workloads. It works interchangeably with PySpark by allowing
>>> both pandas and PySpark APIs to users.
>>>
>>> The project has grown separately more than two years, and this has been
>>> successfully going. With version 1.7.0 Koalas has greatly improved maturity
>>> and stability. Its usability has been proven with numerous users’ adoptions
>>> and by reaching more than 75% API coverage in pandas’ Index, Series and
>>> DataFrame.
>>>
>>> I strongly think this is the direction we should go for Apache Spark,
>>> and it is a win-win strategy for the growth of both Apache Spark and
>>> pandas. Please see the reasons below.
>>> Why do we need it?
>>>
>>>    -
>>>
>>>    Python has grown dramatically in the last few years and became one
>>>    of the most popular languages, see also StackOverFlow trend
>>>    <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>>>    for Python, Java, R and Scala languages.
>>>    -
>>>
>>>    pandas became almost the standard library of data science. Please
>>>    also see the StackOverFlow trend
>>>    <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>>>    for pandas, Apache Spark and PySpark.
>>>    -
>>>
>>>    PySpark is not Pythonic enough. At least I myself hear a lot of
>>>    complaints. That initiated Project Zen
>>>    <https://issues.apache.org/jira/browse/SPARK-32082>, and we have
>>>    greatly improved PySpark usability and made it more Pythonic.
>>>
>>> Nevertheless, data scientists tend to prefer pandas libraries according
>>> to the trends but APIs are hard to change in PySpark. We should redesign
>>> all APIs and improve them from scratch, which is very difficult.
>>>
>>> One straightforward and fast approach is to benchmark a successful case,
>>> and pandas does not support distributed execution. Once PySpark supports
>>> pandas-like APIs, it can be a good option for pandas users to scale their
>>> workloads easily. I do believe this is a win-win strategy for the growth of
>>> both pandas and PySpark.
>>>
>>> In fact, there are already similar tries such as Dask
>>> <https://dask.org/> and Modin <https://modin.readthedocs.io/en/latest/>
>>> (other than Koalas <https://github.com/databricks/koalas>). They are
>>> all growing fast and successfully, and I find that people compare it to
>>> PySpark from time to time, for example, see Beyond Pandas: Spark, Dask,
>>> Vaex and other big data technologies battling head to head
>>> <https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13>
>>> .
>>>
>>>
>>>
>>>    -
>>>
>>>    There are many important features missing that are very common in
>>>    data science. One of the most important features is plotting and drawing a
>>>    chart. Almost every data scientist plots and draws a chart to understand
>>>    their data quickly and visually in their daily work but this is missing in
>>>    PySpark. Please see one example in pandas:
>>>
>>>
>>>
>>>
>>> I do recommend taking a quick look for blog posts and talks made for
>>> pandas on Spark:
>>> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
>>> They explain why we need this far more better.
>>>
>>>

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Hyukjin Kwon <gu...@gmail.com>.
Firstly, my biggest reason is that I would like to promote this as built-in
support because it is simply important to have, given its impact on a large
user group, and the needs are increasing as the charts indicate. I usually
think that features or add-ons stay as third parties when they serve a
smaller set of users, address a corner case of needs, etc. I think this is
similar to the data sources we have added: Spark ported CSV and Avro because
more and more people used them, and it became important to have them as
built-in support.

Secondly, Koalas needs more help from Spark, PySpark, Python and pandas
experts in the bigger community. The Koalas team isn't expert in all of
these areas, and there are many missing corner cases to fix; some require
deep expertise in specific areas.

One example is type hints. Koalas uses type hints for schema inference.
Because Python's type hinting cannot yet express such schemas, Koalas added
its own (hacky) way
<https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hints-in-koalas>.
Fortunately, the approach Koalas implemented is now partially proposed into
Python officially (PEP 646). But Koalas could have done better by interacting
with the Python community more and actively joining the design discussions,
to reach the best outcome for these and other projects.
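
To make the schema-inference idea concrete, below is a minimal sketch in the
Koalas style. The data, column names and function name are only illustrative,
and details such as the output column names can vary between versions:

import databricks.koalas as ks

kdf = ks.DataFrame({"A": ["a", "a", "b"], "B": [1.0, 2.0, 3.0]})

def demean(pdf) -> ks.DataFrame[str, float]:
    # pdf arrives as a plain pandas DataFrame for one group; the return-type
    # annotation tells Koalas the output schema without running an extra job.
    pdf["B"] = pdf["B"] - pdf["B"].mean()
    return pdf

kdf.groupby("A").apply(demean)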

Thirdly, I would like to contribute to the growth of PySpark. The growth of
Koalas is very fast given the internal and external stats: the number of
users has roughly doubled almost every 4 ~ 6 months. I think Koalas will be
good momentum to keep Spark growing.

Fourthly, PySpark is still not Pythonic enough. For example, I hear
complaints such as "why does PySpark follow camelCase?" or "PySpark APIs are
difficult to learn", and APIs are very difficult to change in Spark (as I
emphasized above). This set of Koalas APIs will be able to address these
concerns in PySpark. See the small illustration below.
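
As a rough, hypothetical illustration of the style gap (the column names are
made up; Koalas is assumed to be installed so that to_koalas() is patched onto
Spark DataFrames):

from pyspark.sql import SparkSession
import databricks.koalas as ks  # importing also adds to_koalas() to Spark DataFrames

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 10.0)], ["id", "Sale Amount"])

renamed_sdf = sdf.withColumnRenamed("Sale Amount", "amount")             # PySpark: camelCase, JVM-style
renamed_kdf = sdf.to_koalas().rename(columns={"Sale Amount": "amount"})  # pandas idiom on Spark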

Lastly, I really think PySpark needs its own native plotting features. As I
elaborated before, I do think this is an important feature missing from
PySpark that users need. I do think Koalas completes what PySpark is
currently missing. For instance, see the sketch below.
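
A minimal sketch, assuming matplotlib (or another supported backend) is
installed; the data is made up, and the point is only that the Koalas
DataFrame keeps pandas' .plot entry point:

import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({"year": [2018, 2019, 2020], "users": [10, 25, 60]})
pdf.plot.bar(x="year", y="users")    # plain pandas: works, but only on local data

kdf = ks.from_pandas(pdf)
kdf.plot.bar(x="year", y="users")    # same call on a Koalas DataFrame, computed with Spark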



On Sun, Mar 14, 2021 at 7:12 PM, Sean Owen <sr...@gmail.com> wrote:

> I like koalas a lot. Playing devil's advocate, why not just let it
> continue to live as an add on? Usually the argument is it'll be maintained
> better in Spark but it's well maintained. It adds some overhead to
> maintaining Spark conversely. On the upside it makes it a little more
> discoverable. Are there more 'synergies'?
>
> On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon <gu...@gmail.com> wrote:
>
>> Hi all,
>>
>> I would like to start the discussion on supporting pandas API layer on
>> Spark.
>>
>>
>>
>> If we have a general consensus on having it in PySpark, I will initiate
>> and drive an SPIP with a detailed explanation about the implementation’s
>> overview and structure.
>>
>> I would appreciate it if I can know whether you guys support this or not
>> before starting the SPIP.
>> What do you want to propose?
>>
>> I have been working on the Koalas <https://github.com/databricks/koalas>
>> project that is essentially: pandas API support on Spark, and I would like
>> to propose embracing Koalas in PySpark.
>>
>>
>>
>> More specifically, I am thinking about adding a separate package, to
>> PySpark, for pandas APIs on PySpark Therefore it wouldn’t break anything in
>> the existing codes. The overview would look as below:
>>
>> pyspark_dataframe.[... PySpark APIs ...]
>> pandas_dataframe.[... pandas APIs (local) ...]
>>
>> # The package names will change in the final proposal and during review.
>> koalas_dataframe = koalas.from_pandas(pyspark_dataframe)
>> koalas_dataframe = koalas.from_spark(pandas_dataframe)
>> koalas_dataframe.[... pandas APIs on Spark ...]
>>
>> pyspark_dataframe = koalas_dataframe.to_spark()
>> pandas_dataframe = koalas_dataframe.to_pandas()
>>
>> Koalas provides a pandas API layer on PySpark. It supports almost the
>> same API usages. Users can leverage their existing Spark cluster to scale
>> their pandas workloads. It works interchangeably with PySpark by allowing
>> both pandas and PySpark APIs to users.
>>
>> The project has grown separately more than two years, and this has been
>> successfully going. With version 1.7.0 Koalas has greatly improved maturity
>> and stability. Its usability has been proven with numerous users’ adoptions
>> and by reaching more than 75% API coverage in pandas’ Index, Series and
>> DataFrame.
>>
>> I strongly think this is the direction we should go for Apache Spark, and
>> it is a win-win strategy for the growth of both Apache Spark and pandas.
>> Please see the reasons below.
>> Why do we need it?
>>
>>    -
>>
>>    Python has grown dramatically in the last few years and became one of
>>    the most popular languages, see also StackOverFlow trend
>>    <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>>    for Python, Java, R and Scala languages.
>>    -
>>
>>    pandas became almost the standard library of data science. Please
>>    also see the StackOverFlow trend
>>    <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>>    for pandas, Apache Spark and PySpark.
>>    -
>>
>>    PySpark is not Pythonic enough. At least I myself hear a lot of
>>    complaints. That initiated Project Zen
>>    <https://issues.apache.org/jira/browse/SPARK-32082>, and we have
>>    greatly improved PySpark usability and made it more Pythonic.
>>
>> Nevertheless, data scientists tend to prefer pandas libraries according
>> to the trends but APIs are hard to change in PySpark. We should redesign
>> all APIs and improve them from scratch, which is very difficult.
>>
>> One straightforward and fast approach is to benchmark a successful case,
>> and pandas does not support distributed execution. Once PySpark supports
>> pandas-like APIs, it can be a good option for pandas users to scale their
>> workloads easily. I do believe this is a win-win strategy for the growth of
>> both pandas and PySpark.
>>
>> In fact, there are already similar tries such as Dask <https://dask.org/>
>> and Modin <https://modin.readthedocs.io/en/latest/> (other than Koalas
>> <https://github.com/databricks/koalas>). They are all growing fast and
>> successfully, and I find that people compare it to PySpark from time to
>> time, for example, see Beyond Pandas: Spark, Dask, Vaex and other big
>> data technologies battling head to head
>> <https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13>
>> .
>>
>>
>>
>>    -
>>
>>    There are many important features missing that are very common in
>>    data science. One of the most important features is plotting and drawing a
>>    chart. Almost every data scientist plots and draws a chart to understand
>>    their data quickly and visually in their daily work but this is missing in
>>    PySpark. Please see one example in pandas:
>>
>>
>>
>>
>> I do recommend taking a quick look for blog posts and talks made for
>> pandas on Spark:
>> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
>> They explain why we need this far more better.
>>
>>

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Sean Owen <sr...@gmail.com>.
I like Koalas a lot. Playing devil's advocate, why not just let it continue
to live as an add-on? Usually the argument is that it'll be maintained better
in Spark, but it's already well maintained. Conversely, it adds some overhead
to maintaining Spark. On the upside, it makes it a little more discoverable.
Are there more 'synergies'?

On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon <gu...@gmail.com> wrote:

> Hi all,
>
> I would like to start the discussion on supporting pandas API layer on
> Spark.
>
>
>
> If we have a general consensus on having it in PySpark, I will initiate
> and drive an SPIP with a detailed explanation about the implementation’s
> overview and structure.
>
> I would appreciate it if I can know whether you guys support this or not
> before starting the SPIP.
> What do you want to propose?
>
> I have been working on the Koalas <https://github.com/databricks/koalas>
> project that is essentially: pandas API support on Spark, and I would like
> to propose embracing Koalas in PySpark.
>
>
>
> More specifically, I am thinking about adding a separate package, to
> PySpark, for pandas APIs on PySpark Therefore it wouldn’t break anything in
> the existing codes. The overview would look as below:
>
> pyspark_dataframe.[... PySpark APIs ...]
> pandas_dataframe.[... pandas APIs (local) ...]
>
> # The package names will change in the final proposal and during review.
> koalas_dataframe = koalas.from_pandas(pyspark_dataframe)
> koalas_dataframe = koalas.from_spark(pandas_dataframe)
> koalas_dataframe.[... pandas APIs on Spark ...]
>
> pyspark_dataframe = koalas_dataframe.to_spark()
> pandas_dataframe = koalas_dataframe.to_pandas()
>
> Koalas provides a pandas API layer on PySpark. It supports almost the same
> API usages. Users can leverage their existing Spark cluster to scale their
> pandas workloads. It works interchangeably with PySpark by allowing both
> pandas and PySpark APIs to users.
>
> The project has grown separately more than two years, and this has been
> successfully going. With version 1.7.0 Koalas has greatly improved maturity
> and stability. Its usability has been proven with numerous users’ adoptions
> and by reaching more than 75% API coverage in pandas’ Index, Series and
> DataFrame.
>
> I strongly think this is the direction we should go for Apache Spark, and
> it is a win-win strategy for the growth of both Apache Spark and pandas.
> Please see the reasons below.
> Why do we need it?
>
>    -
>
>    Python has grown dramatically in the last few years and became one of
>    the most popular languages, see also StackOverFlow trend
>    <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>    for Python, Java, R and Scala languages.
>    -
>
>    pandas became almost the standard library of data science. Please also
>    see the StackOverFlow trend
>    <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>    for pandas, Apache Spark and PySpark.
>    -
>
>    PySpark is not Pythonic enough. At least I myself hear a lot of
>    complaints. That initiated Project Zen
>    <https://issues.apache.org/jira/browse/SPARK-32082>, and we have
>    greatly improved PySpark usability and made it more Pythonic.
>
> Nevertheless, data scientists tend to prefer pandas libraries according to
> the trends but APIs are hard to change in PySpark. We should redesign all
> APIs and improve them from scratch, which is very difficult.
>
> One straightforward and fast approach is to benchmark a successful case,
> and pandas does not support distributed execution. Once PySpark supports
> pandas-like APIs, it can be a good option for pandas users to scale their
> workloads easily. I do believe this is a win-win strategy for the growth of
> both pandas and PySpark.
>
> In fact, there are already similar tries such as Dask <https://dask.org/>
> and Modin <https://modin.readthedocs.io/en/latest/> (other than Koalas
> <https://github.com/databricks/koalas>). They are all growing fast and
> successfully, and I find that people compare it to PySpark from time to
> time, for example, see Beyond Pandas: Spark, Dask, Vaex and other big
> data technologies battling head to head
> <https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13>
> .
>
>
>
>    -
>
>    There are many important features missing that are very common in data
>    science. One of the most important features is plotting and drawing a
>    chart. Almost every data scientist plots and draws a chart to understand
>    their data quickly and visually in their daily work but this is missing in
>    PySpark. Please see one example in pandas:
>
>
>
>
> I do recommend taking a quick look for blog posts and talks made for
> pandas on Spark:
> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
> They explain why we need this far more better.
>
>

Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Liang-Chi Hsieh <vi...@gmail.com>.
From a Python developer's perspective, this direction makes sense to me.
As pandas is almost the standard library in this area, if PySpark supports
the pandas API out of the box, usability would reach a higher level.

As for maintenance cost, IIUC, there are some Spark committers in the Koalas
community and they are pretty active. So it seems we don't need to worry
about who will be interested in doing the maintenance.

It is good that it comes as a separate package and does not break anything
in the existing code. How about the test code? Does it fit into the PySpark
test framework?


Hyukjin Kwon wrote
> Hi all,
> 
> I would like to start the discussion on supporting pandas API layer on
> Spark.
> 
> If we have a general consensus on having it in PySpark, I will initiate
> and
> drive an SPIP with a detailed explanation about the implementation’s
> overview and structure.
> 
> I would appreciate it if I can know whether you guys support this or not
> before starting the SPIP.
> 
> I do recommend taking a quick look for blog posts and talks made for
> pandas
> on Spark:
> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
> They explain why we need this far more better.





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [DISCUSS] Support pandas API layer on PySpark

Posted by Holden Karau <ho...@pigscanfly.ca>.
I think having pandas support inside of Spark makes sense. One of my
questions is who the major contributors to this effort are: is the
community developing the pandas API layer for Spark interested in being
part of Spark, or do they prefer having their own release cycle?

On Sat, Mar 13, 2021 at 5:57 PM Hyukjin Kwon <gu...@gmail.com> wrote:

> Hi all,
>
> I would like to start the discussion on supporting pandas API layer on
> Spark.
>
>
>
> If we have a general consensus on having it in PySpark, I will initiate
> and drive an SPIP with a detailed explanation about the implementation’s
> overview and structure.
>
> I would appreciate it if I can know whether you guys support this or not
> before starting the SPIP.
> What do you want to propose?
>
> I have been working on the Koalas <https://github.com/databricks/koalas>
> project that is essentially: pandas API support on Spark, and I would like
> to propose embracing Koalas in PySpark.
>
>
>
> More specifically, I am thinking about adding a separate package, to
> PySpark, for pandas APIs on PySpark Therefore it wouldn’t break anything in
> the existing codes. The overview would look as below:
>
> pyspark_dataframe.[... PySpark APIs ...]
> pandas_dataframe.[... pandas APIs (local) ...]
>
> # The package names will change in the final proposal and during review.
> koalas_dataframe = koalas.from_pandas(pyspark_dataframe)
> koalas_dataframe = koalas.from_spark(pandas_dataframe)
> koalas_dataframe.[... pandas APIs on Spark ...]
>
> pyspark_dataframe = koalas_dataframe.to_spark()
> pandas_dataframe = koalas_dataframe.to_pandas()
>
> Koalas provides a pandas API layer on PySpark. It supports almost the same
> API usages. Users can leverage their existing Spark cluster to scale their
> pandas workloads. It works interchangeably with PySpark by allowing both
> pandas and PySpark APIs to users.
>
> The project has grown separately more than two years, and this has been
> successfully going. With version 1.7.0 Koalas has greatly improved maturity
> and stability. Its usability has been proven with numerous users’ adoptions
> and by reaching more than 75% API coverage in pandas’ Index, Series and
> DataFrame.
>
> I strongly think this is the direction we should go for Apache Spark, and
> it is a win-win strategy for the growth of both Apache Spark and pandas.
> Please see the reasons below.
> Why do we need it?
>
>    -
>
>    Python has grown dramatically in the last few years and became one of
>    the most popular languages, see also StackOverFlow trend
>    <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>    for Python, Java, R and Scala languages.
>    -
>
>    pandas became almost the standard library of data science. Please also
>    see the StackOverFlow trend
>    <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>    for pandas, Apache Spark and PySpark.
>    -
>
>    PySpark is not Pythonic enough. At least I myself hear a lot of
>    complaints. That initiated Project Zen
>    <https://issues.apache.org/jira/browse/SPARK-32082>, and we have
>    greatly improved PySpark usability and made it more Pythonic.
>
> Nevertheless, data scientists tend to prefer pandas libraries according to
> the trends but APIs are hard to change in PySpark. We should redesign all
> APIs and improve them from scratch, which is very difficult.
>
> One straightforward and fast approach is to benchmark a successful case,
> and pandas does not support distributed execution. Once PySpark supports
> pandas-like APIs, it can be a good option for pandas users to scale their
> workloads easily. I do believe this is a win-win strategy for the growth of
> both pandas and PySpark.
>
> In fact, there are already similar tries such as Dask <https://dask.org/>
> and Modin <https://modin.readthedocs.io/en/latest/> (other than Koalas
> <https://github.com/databricks/koalas>). They are all growing fast and
> successfully, and I find that people compare it to PySpark from time to
> time, for example, see Beyond Pandas: Spark, Dask, Vaex and other big
> data technologies battling head to head
> <https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13>
> .
>
>
>
>    -
>
>    There are many important features missing that are very common in data
>    science. One of the most important features is plotting and drawing a
>    chart. Almost every data scientist plots and draws a chart to understand
>    their data quickly and visually in their daily work but this is missing in
>    PySpark. Please see one example in pandas:
>
>
>
>
> I do recommend taking a quick look for blog posts and talks made for
> pandas on Spark:
> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
> They explain why we need this far more better.
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau