Posted to dev@arrow.apache.org by Jorge Cardoso Leitão <jo...@gmail.com> on 2020/10/20 16:35:09 UTC

Experiment with DataFusion + Pyarrow

Hi,

Over the past few weeks I have been running an experiment whose main goal
is to run a query in (Rust's) DataFusion and use Python within it, so that
we can embed Python's ecosystem in the query (à la PySpark) (details here
<https://github.com/jorgecarleitao/datafusion-python>).

I am super happy to say that with the code on PR 8401
<https://github.com/apache/arrow/pull/8401>, I was able to achieve its main
goal: a DataFusion query that uses a Python UDF, relying on the C data
interface for zero-copy data exchange between C++/Python and Rust. The
Python UDFs' signature is dead simple: f(pa.Array, ...) -> pa.Array.

This is a pure Arrow implementation (no Python dependencies other than
pyarrow) and achieves near-optimal execution performance under the
constraints, with the full flexibility of Python and its ecosystem to
perform non-trivial transformations. Folks can of course convert PyArrow
arrays to other Python formats within the UDF (e.g. pandas or NumPy), and
use them for interpolation, ML, CUDA, etc.

Finally, the coolest thing about this is that a single execution passes
through most of the Rust code base (DataFusion, Arrow, Parquet), but also
through a lot of the Python, C and C++ code bases. IMO this reaffirms the
success of this project, in which the different parts stick to the contract
(the spec, the C data interface) and enable things like this. So, big kudos
to all of you!

Best,
Jorge

Re: Experiment with DataFusion + Pyarrow

Posted by Antoine Pitrou <an...@python.org>.
Very promising, congratulations :-)

Re: Experiment with DataFusion + Pyarrow

Posted by Wes McKinney <we...@gmail.com>.
Exciting to see, this is exactly the kind of interop we've been
working diligently toward since the start of the project!

Re: Experiment with DataFusion + Pyarrow

Posted by Micah Kornfield <em...@gmail.com>.
Really cool work.  Very nice to see this type of integration!