Posted to user@spark.apache.org by Mohit Jaggi <mo...@gmail.com> on 2014/09/04 18:27:46 UTC

pandas-like dataframe in spark

Folks,
I have been working on a pandas-like dataframe DSL on top of Spark. It is
written in Scala and can be used from the spark-shell. The APIs have the look
and feel of pandas, which is a wildly popular tool among data
scientists. The goal is to let people familiar with pandas scale their
work to larger datasets using Spark, without having to climb the
steep learning curve of Spark and Scala.
It is open-sourced under the Apache License and can be found here:
https://github.com/AyasdiOpenSource/df
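
For readers unfamiliar with pandas, here is a minimal sketch of the kind of API such a DSL aims to mirror. This uses pandas itself, not the df library above, whose own method names may differ:

```python
import pandas as pd

# A small frame of the kind a data scientist might explore interactively.
df = pd.DataFrame({
    "city":  ["SF", "SF", "NY", "NY"],
    "sales": [100, 150, 200, 250],
})

# Filter rows, then aggregate per group -- the idiomatic pandas workflow
# that a Spark-backed DSL would let users apply to much larger datasets.
big = df[df["sales"] > 120]
totals = big.groupby("city")["sales"].sum()
print(totals)
```

The appeal of this style is that filtering, selection, and aggregation compose as ordinary expressions, with no SQL strings or explicit map/reduce plumbing.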

I welcome your comments, suggestions and feedback. Any help in developing
it further is much appreciated. I have the following items on the roadmap
(and I'm happy to change them based on your comments):
- Python wrappers, most likely done the same way as in MLlib
- Sliding window aggregations
- Row indexing
- Graphing/charting
- Efficient row-based operations
- Pretty-printing of output in the spark-shell
- Unit test completeness and automated nightly runs
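
As a point of reference for the sliding-window roadmap item, this is the pandas idiom it would presumably mirror (again using pandas' own API, not df's):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Sliding (rolling) window of size 3: each output value is the mean of the
# current element and the two before it; the first two positions have no
# full window yet and come out as NaN.
windowed = s.rolling(window=3).mean()
print(windowed.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```

The same `.rolling(...)` accessor also supports `sum`, `min`, `max`, `std`, and user-supplied functions via `apply`, which suggests the surface area such a roadmap item would need to cover.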

Mohit.

P.S.: Thanks to my awesome employer Ayasdi <http://www.ayasdi.com> for
open-sourcing this software.

P.P.S.: I need some design advice on making row operations efficient;
I'll start a new thread for that.

Re: pandas-like dataframe in spark

Posted by Mohit Jaggi <mo...@gmail.com>.
Thanks Matei. I will take a look at SchemaRDDs.



Re: pandas-like dataframe in spark

Posted by Matei Zaharia <ma...@gmail.com>.
Hi Mohit,

This looks pretty interesting, but just a note on the implementation -- it might be worthwhile to try doing this on top of Spark SQL SchemaRDDs. The reason is that SchemaRDDs already have an efficient in-memory representation (columnar storage), and can be read from a variety of data sources (JSON, Hive, soon things like CSV as well). Using the operators in Spark SQL you can also get really efficient code-generated operations on them. I know that stuff like zipping two data frames might become harder, but the overall benefit in performance could be substantial.

Matei
