Posted to dev@arrow.apache.org by Andy Grove <an...@gmail.com> on 2020/08/13 15:11:05 UTC

My focus for Rust implementation for 2.0.0

Some of you may have noticed a sudden flurry of activity from me after a
bit of a break from the project, so I thought it might be useful to explain
what I am up to.

As of 1.0.0, DataFusion isn't really useful against any real-world data
sets for a number of reasons, but most of all due to the simplistic
threading/partitioning model. There are a few small bugs as well.

My current focus is to be able to run TPC-H query 1 against decent-sized
datasets (starting with the 100 GB dataset) with hundreds of partitions. I
believe that I can get this working with some fairly small changes. Later,
we can experiment with more advanced threading models and async, using the
same benchmark to measure improvements.

Let me know if you have any questions.

Thanks,

Andy.

Re: My focus for Rust implementation for 2.0.0

Posted by Andy Grove <an...@gmail.com>.

First, an update on progress. Once the PRs for ARROW-9711 and ARROW-9716
are merged, it will be possible to run TPC-H query 1 against a 100 GB data
set with performance similar to Apache Spark in local mode. I plan on
testing larger datasets over the weekend.

To answer Kirill's question, I wouldn't necessarily characterize it as
giving up on exploring any integration with Gandiva. There are several
integrations that I would be interested in exploring, including with the
Arrow C Data Interface, and the C++ Dataset work that is happening, but I
only have so much time available to contribute to this project and I have
some specific goals that I am working towards that are a much higher
priority for me right now.

Also, I am encouraged by the performance I'm seeing from DataFusion after
some of the changes this week, and I know there is plenty of room for
improvement still. This perhaps makes it less compelling to explore
delegating to C++ at this point. However, it would be nice to see some
performance comparisons between DataFusion and the C++ Dataset work.

Thanks,

Andy.

On Fri, Aug 14, 2020 at 2:18 AM Kirill Lykov <ly...@gmail.com> wrote:

> Sounds interesting as we wanted to start using DataFusion.
> Btw, I vaguely remember that in the original repository you had an issue
> like "investigate DataFusion with Gandiva". I'm curious why you
> decided to give up on it?

Re: My focus for Rust implementation for 2.0.0

Posted by Kirill Lykov <ly...@gmail.com>.

Sounds interesting as we wanted to start using DataFusion.
Btw, I vaguely remember that in the original repository you had an issue
like "investigate DataFusion with Gandiva". I'm curious why you
decided to give up on it?


-- 
Best regards,
Kirill Lykov