Posted to dev@arrow.apache.org by Tom Scheffers <to...@youngbulls.nl.INVALID> on 2021/02/12 14:52:27 UTC

[Python] Python based Query Engine for Arrow

Dear devs,

I am really interested in an in-memory query interface to Arrow tables
(like DataFusion for Rust), preferably in Python. In my opinion, there
are three routes: (1) create a wrapper/interface to DataFusion directly,
(2) copy Arrow data to pandas and use an existing framework (like Ibis),
and (3) build/extend something new based on pyarrow (with small
conversions back and forth to numpy or pandas).

The Arrow / DataFusion route currently lacks some capabilities, such as
reading Parquet files directly from S3 and predicate pushdown.
Therefore, I would rather wait for things to mature. Besides, the C++
implementation of Arrow seems to be more mature and integrates nicely
with Python.

The pandas route is probably more convenient; however, it will be much
less efficient. Columnar storage, predicate pushdown, and statistics-based
optimizations are the main reasons for using Arrow, and they would not be
fully utilized on this route.

Is there already something like DataFusion on the roadmap for C++ (and thus
Python)? Or is there an Ibis-like engine which acts directly on pyarrow? I
would like to help advance things in this direction, but I am struggling
to find where to start.

Thanks for your help.

Kind regards,

Tom

Re: [Python] Python based Query Engine for Arrow

Posted by Micah Kornfield <em...@gmail.com>.
Welcome Tom,

> Is there already something like DataFusion on the roadmap for C++ (and thus
> Python)?


Yes, it is [1], and the components are being developed.  In terms of
contributions, others might have a better idea, but I think the two big
pieces of functionality missing from a kernel/operator perspective are
aggregates and joins.

There is also the work of tying together datasets with kernels and
materializing the output.

[1]
https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit


Re: [Python] Python based Query Engine for Arrow

Posted by Wes McKinney <we...@gmail.com>.
I'm actively building an engineering team to work on this -- so anyone
who would like to work on this as part of their day job can reach out
to me to discuss. We are doing some research about what aspects of
prior art in columnar database systems we can pull into the Arrow C++
project (to make sure we are building something that is close to
state-of-the-art) and will post some design docs about various topics
as we collect our thoughts. The threading improvement proposal that
Weston posted the other day is closely related to this.


Re: [Python] Python based Query Engine for Arrow

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Hi, Tom,

This does not address the question directly, but for what it's worth, I had
the same issue and thus released a Python binding for DataFusion
<https://pypi.org/project/datafusion/>. It allows you, for example, to
create a pyarrow RecordBatch by reading from S3 (via pyarrow) and use it as
a source for DataFusion's plan via the SQL or DataFrame API. Because it
uses the C data interface, there is virtually no cost in moving between
datafusion and pyarrow. It supports UDFs and UDAFs on native pyarrow
arrays, which means that there is also no performance hit when using a UDF
with a pyarrow/C++ kernel. Performance degrades when you need to map the
pyarrow array to some other format (e.g. numpy), typically to push it to
sklearn, scipy, etc.

`pip install datafusion`, but FYI this is *not* production ready and many
of the pyarrow types are not supported yet. :)

Best,
Jorge



