Posted to dev@arrow.apache.org by Jacques Nadeau <ja...@apache.org> on 2018/06/21 19:15:20 UTC

Gandiva Initiative

Hey Guys,

Dremio just open sourced a new framework for processing data in Arrow data
structures [1], built on top of the Apache Arrow C++ APIs and leveraging
LLVM (Apache licensed). It also includes Java APIs that leverage the Apache
Arrow Java libraries. I expect the developers who have been working on this
will introduce themselves soon. To read more about it, take a look at
Ravindra's blog post (he's the lead developer driving this work): [2].
Hopefully people will find this interesting/useful.

Let us know what you all think!

thanks,
Jacques


[1] https://github.com/dremio/gandiva
[2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/

Re: Gandiva Initiative

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Antoine,

the LLVM API is an interesting point. I've been using PyArrow and Numba together for quite a while, and this would definitely clash. A quick Google search did not reveal any workaround for this issue. In the other cases where we have such clashes, boost and jemalloc, the library itself already provides the infrastructure to vendor it under a private namespace. LLVM does not seem to have such infrastructure.

From my experience, the maintainers of llvmlite (the Python LLVM binding that Numba uses) have been quite quick in updating to new LLVM versions, and I would expect that we would also update quite frequently, so the newest releases should work nicely together (assuming we finally get the infrastructure for monthly releases running). The problematic situation is when a user has two Python packages installed that were built against differing LLVM versions. With conda this would probably be detected at the package-manager level, but with pip we would likely only hit the problem at execution time.
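
For example, here is the kind of check a packaging layer could do at import time to surface such a mismatch early (llvmlite really does expose its LLVM version via llvmlite.binding.llvm_version_info; the pyarrow-side constant below is purely hypothetical, just to illustrate the idea):

  import llvmlite.binding as llvm

  # LLVM version that llvmlite (and hence Numba) was built against.
  numba_llvm = llvm.llvm_version_info          # e.g. (6, 0, 0)

  # Hypothetical: the LLVM version the Arrow/Gandiva wheels were built
  # against would have to be exported somewhere like this.
  PYARROW_LLVM_VERSION = (6, 0, 0)

  if numba_llvm[0] != PYARROW_LLVM_VERSION[0]:
      raise ImportError(
          "llvmlite was built against LLVM %d.x, pyarrow/gandiva against "
          "LLVM %d.x; loading both in one process may clash"
          % (numba_llvm[0], PYARROW_LLVM_VERSION[0]))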

Uwe

On Sun, Jun 24, 2018, at 7:02 PM, Antoine Pitrou wrote:
> 
> Hi,
> 
> I think JIT-compiling of kernels operating on Arrow data is an important
> development path, but just for the record, LLVM doesn't have a stable
> C++ API (the API changes at each feature release).  Just something to
> keep a mind for the ensuing packaging discussions ;-)
> 
> (it also raises interesting questions such as "what happens if a user
> wants to use both PyArrow and Numba in a given process, and they don't
> target the same LLVM API version")
> 
> Regards
> 
> Antoine.
> 
> 
> Le 22/06/2018 à 01:26, Wes McKinney a écrit :
> > hi Jacques,
> > 
> > This is very exciting! LLVM codegen for Arrow has been on my wishlist
> > since the early days of the project. I always considered it more of a
> > "when" question more than "if".
> > 
> > I will take a closer look at the codebase to make some comments, but
> > my biggest initial question is whether we could work to make Gandiva
> > the official community-supported LLVM framework for creating
> > JIT-compiled Arrow kernels. In the Ursa Labs (a new lab I am building
> > to focus 90+% on Apache Arrow development) tech roadmap we discussed
> > the need for a subgraph compiler using LLVM:
> > https://ursalabs.org/tech/#subgraph-compilation-code-generation.
> > 
> > I would be interesting in getting involved in the project, and I
> > expect in time many others will, as well. An obvious question would be
> > whether you would be interested in donating the project to Apache
> > Arrow and continuing the work there. We would benefit from common
> > build, testing/CI, and packaging/deployment infrastructure. I'm keen
> > to see JIT-powered predicate pushdown in Parquet files, for example.
> > Phillip and I could look into building a Gandiva backend for compiling
> > a subset of expressions originating from Ibis, a lazy-evaluation DSL
> > system with similar API to pandas
> > (https://github.com/ibis-project/ibis).
> > 
> > best
> > Wes
> > 
> > On Thu, Jun 21, 2018 at 4:13 PM, Dimitri Vorona
> > <al...@googlemail.com.invalid> wrote:
> >> Hey Jaques,
> >>
> >> Great stuff! I'm actually researching the integration of arrow and flight
> >> into a main memory database which also uses LLVM for dynamic query
> >> generation! Excited to have a more detailed look at Gandiva!
> >>
> >> Cheers,
> >> Dimitri.
> >>
> >> On Thu, Jun 21, 2018, 21:15 Jacques Nadeau <ja...@apache.org> wrote:
> >>
> >>> Hey Guys,
> >>>
> >>> Dremio just open sourced a new framework for processing data in Arrow data
> >>> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
> >>> LLVM (Apache licensed). It also includes Java APIs that leverage the Apache
> >>> Arrow Java libraries. I expect the developers who have been working on this
> >>> will introduce themselves soon. To read more about it, take a look at our
> >>> Ravindra's blog post (he's the lead developer driving this work): [2].
> >>> Hopefully people will find this interesting/useful.
> >>>
> >>> Let us know what you all think!
> >>>
> >>> thanks,
> >>> Jacques
> >>>
> >>>
> >>> [1] https://github.com/dremio/gandiva
> >>> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
> >>>

Re: Gandiva Initiative

Posted by Antoine Pitrou <an...@python.org>.
Hi,

I think JIT-compiling of kernels operating on Arrow data is an important
development path, but just for the record, LLVM doesn't have a stable
C++ API (the API changes at each feature release).  Just something to
keep in mind for the ensuing packaging discussions ;-)

(it also raises interesting questions such as "what happens if a user
wants to use both PyArrow and Numba in a given process, and they don't
target the same LLVM API version")

Regards

Antoine.


Le 22/06/2018 à 01:26, Wes McKinney a écrit :
> hi Jacques,
> 
> This is very exciting! LLVM codegen for Arrow has been on my wishlist
> since the early days of the project. I always considered it more of a
> "when" question more than "if".
> 
> I will take a closer look at the codebase to make some comments, but
> my biggest initial question is whether we could work to make Gandiva
> the official community-supported LLVM framework for creating
> JIT-compiled Arrow kernels. In the Ursa Labs (a new lab I am building
> to focus 90+% on Apache Arrow development) tech roadmap we discussed
> the need for a subgraph compiler using LLVM:
> https://ursalabs.org/tech/#subgraph-compilation-code-generation.
> 
> I would be interesting in getting involved in the project, and I
> expect in time many others will, as well. An obvious question would be
> whether you would be interested in donating the project to Apache
> Arrow and continuing the work there. We would benefit from common
> build, testing/CI, and packaging/deployment infrastructure. I'm keen
> to see JIT-powered predicate pushdown in Parquet files, for example.
> Phillip and I could look into building a Gandiva backend for compiling
> a subset of expressions originating from Ibis, a lazy-evaluation DSL
> system with similar API to pandas
> (https://github.com/ibis-project/ibis).
> 
> best
> Wes
> 
> On Thu, Jun 21, 2018 at 4:13 PM, Dimitri Vorona
> <al...@googlemail.com.invalid> wrote:
>> Hey Jaques,
>>
>> Great stuff! I'm actually researching the integration of arrow and flight
>> into a main memory database which also uses LLVM for dynamic query
>> generation! Excited to have a more detailed look at Gandiva!
>>
>> Cheers,
>> Dimitri.
>>
>> On Thu, Jun 21, 2018, 21:15 Jacques Nadeau <ja...@apache.org> wrote:
>>
>>> Hey Guys,
>>>
>>> Dremio just open sourced a new framework for processing data in Arrow data
>>> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
>>> LLVM (Apache licensed). It also includes Java APIs that leverage the Apache
>>> Arrow Java libraries. I expect the developers who have been working on this
>>> will introduce themselves soon. To read more about it, take a look at our
>>> Ravindra's blog post (he's the lead developer driving this work): [2].
>>> Hopefully people will find this interesting/useful.
>>>
>>> Let us know what you all think!
>>>
>>> thanks,
>>> Jacques
>>>
>>>
>>> [1] https://github.com/dremio/gandiva
>>> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
>>>

Re: Gandiva Initiative

Posted by Wes McKinney <we...@gmail.com>.
This is cool, thanks for putting together a prototype!

> it would be great if we could find a good solution to integrate the two projects and build systems

At the moment I'm thinking of Gandiva as analogous to Plasma, a
subcomponent of the C++ codebase to stand alongside the core Arrow
codebase (or it could go in arrow/gandiva, too), so everything would
get built and shipped as a single artifact containing several shared
libraries. Similarly, when the user runs "pip install pyarrow", they
would receive all of the libraries, including Gandiva, ready to go. It
looks like this is basically already what you've done in your PR.
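
Concretely, the end state I'm picturing for users would look something like this (the library names and layout are illustrative, not final):

  # A single "pip install pyarrow" wheel would bundle several shared
  # libraries side by side, e.g.
  #
  #   pyarrow/
  #     libarrow.so
  #     libarrow_python.so
  #     libplasma.so
  #     libgandiva.so        <- new
  #
  # so the bindings are importable with no extra setup:
  import pyarrow as pa
  import pyarrow.gandiva as gandiva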

To do that, we would have to conduct an IP clearance to import the
code and then refactor to incorporate the components into
the Arrow codebase. I'll be standing by to help with that effort if
the Gandiva developers wish to go that route.

- Wes

On Fri, Jun 22, 2018 at 5:22 AM, Philipp Moritz <pc...@gmail.com> wrote:
> This is really exciting, thanks a lot for sharing!
>
> In case anybody wants to try this out from Python, I wrote up some Cython
> bindings (very limited so far, but they can already be used to construct
> some computation graphs and do some benchmarks):
> https://github.com/apache/arrow/pull/2153
>
> They are developed in the Arrow repo for now, it would be great if we could
> find a good solution to integrate the two projects and build systems
> seamlessly (for example setting up a Cython environment in the Gandiva repo
> in a way that interoperates well with PyArrow would be hard right now).
>
> -- Philipp.
>
> On Thu, Jun 21, 2018 at 4:26 PM, Wes McKinney <we...@gmail.com> wrote:
>
>> hi Jacques,
>>
>> This is very exciting! LLVM codegen for Arrow has been on my wishlist
>> since the early days of the project. I always considered it more of a
>> "when" question more than "if".
>>
>> I will take a closer look at the codebase to make some comments, but
>> my biggest initial question is whether we could work to make Gandiva
>> the official community-supported LLVM framework for creating
>> JIT-compiled Arrow kernels. In the Ursa Labs (a new lab I am building
>> to focus 90+% on Apache Arrow development) tech roadmap we discussed
>> the need for a subgraph compiler using LLVM:
>> https://ursalabs.org/tech/#subgraph-compilation-code-generation.
>>
>> I would be interesting in getting involved in the project, and I
>> expect in time many others will, as well. An obvious question would be
>> whether you would be interested in donating the project to Apache
>> Arrow and continuing the work there. We would benefit from common
>> build, testing/CI, and packaging/deployment infrastructure. I'm keen
>> to see JIT-powered predicate pushdown in Parquet files, for example.
>> Phillip and I could look into building a Gandiva backend for compiling
>> a subset of expressions originating from Ibis, a lazy-evaluation DSL
>> system with similar API to pandas
>> (https://github.com/ibis-project/ibis).
>>
>> best
>> Wes
>>
>> On Thu, Jun 21, 2018 at 4:13 PM, Dimitri Vorona
>> <al...@googlemail.com.invalid> wrote:
>> > Hey Jaques,
>> >
>> > Great stuff! I'm actually researching the integration of arrow and flight
>> > into a main memory database which also uses LLVM for dynamic query
>> > generation! Excited to have a more detailed look at Gandiva!
>> >
>> > Cheers,
>> > Dimitri.
>> >
>> > On Thu, Jun 21, 2018, 21:15 Jacques Nadeau <ja...@apache.org> wrote:
>> >
>> >> Hey Guys,
>> >>
>> >> Dremio just open sourced a new framework for processing data in Arrow
>> data
>> >> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
>> >> LLVM (Apache licensed). It also includes Java APIs that leverage the
>> Apache
>> >> Arrow Java libraries. I expect the developers who have been working on
>> this
>> >> will introduce themselves soon. To read more about it, take a look at
>> our
>> >> Ravindra's blog post (he's the lead developer driving this work): [2].
>> >> Hopefully people will find this interesting/useful.
>> >>
>> >> Let us know what you all think!
>> >>
>> >> thanks,
>> >> Jacques
>> >>
>> >>
>> >> [1] https://github.com/dremio/gandiva
>> >> [2] https://www.dremio.com/announcing-gandiva-initiative-
>> for-apache-arrow/
>> >>
>>

Re: Gandiva Initiative

Posted by Philipp Moritz <pc...@gmail.com>.
This is really exciting, thanks a lot for sharing!

In case anybody wants to try this out from Python, I wrote up some Cython
bindings (very limited so far, but they can already be used to construct
some computation graphs and do some benchmarks):
https://github.com/apache/arrow/pull/2153

They are developed in the Arrow repo for now; it would be great if we could
find a good solution to integrate the two projects and build systems
seamlessly (for example, setting up a Cython environment in the Gandiva repo
in a way that interoperates well with PyArrow would be hard right now).
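
To give a feel for what the bindings can express so far, here is roughly what building and running a small projection looks like (the names below may still change as the PR evolves, so treat this as a sketch of the API rather than its final form):

  import pyarrow as pa
  import pyarrow.gandiva as gandiva

  # Describe the input schema and the expression c = a + b.
  field_a = pa.field('a', pa.int32())
  field_b = pa.field('b', pa.int32())
  schema = pa.schema([field_a, field_b])

  builder = gandiva.TreeExprBuilder()
  sum_node = builder.make_function(
      'add',
      [builder.make_field(field_a), builder.make_field(field_b)],
      pa.int32())
  expr = builder.make_expression(sum_node, pa.field('c', pa.int32()))

  # Compile the expression once, then evaluate it against record batches.
  projector = gandiva.make_projector(schema, [expr], pa.default_memory_pool())
  batch = pa.RecordBatch.from_arrays(
      [pa.array([1, 2, 3], type=pa.int32()),
       pa.array([10, 20, 30], type=pa.int32())],
      ['a', 'b'])
  result, = projector.evaluate(batch)   # -> Arrow int32 array [11, 22, 33]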

-- Philipp.

On Thu, Jun 21, 2018 at 4:26 PM, Wes McKinney <we...@gmail.com> wrote:

> hi Jacques,
>
> This is very exciting! LLVM codegen for Arrow has been on my wishlist
> since the early days of the project. I always considered it more of a
> "when" question more than "if".
>
> I will take a closer look at the codebase to make some comments, but
> my biggest initial question is whether we could work to make Gandiva
> the official community-supported LLVM framework for creating
> JIT-compiled Arrow kernels. In the Ursa Labs (a new lab I am building
> to focus 90+% on Apache Arrow development) tech roadmap we discussed
> the need for a subgraph compiler using LLVM:
> https://ursalabs.org/tech/#subgraph-compilation-code-generation.
>
> I would be interesting in getting involved in the project, and I
> expect in time many others will, as well. An obvious question would be
> whether you would be interested in donating the project to Apache
> Arrow and continuing the work there. We would benefit from common
> build, testing/CI, and packaging/deployment infrastructure. I'm keen
> to see JIT-powered predicate pushdown in Parquet files, for example.
> Phillip and I could look into building a Gandiva backend for compiling
> a subset of expressions originating from Ibis, a lazy-evaluation DSL
> system with similar API to pandas
> (https://github.com/ibis-project/ibis).
>
> best
> Wes
>
> On Thu, Jun 21, 2018 at 4:13 PM, Dimitri Vorona
> <al...@googlemail.com.invalid> wrote:
> > Hey Jaques,
> >
> > Great stuff! I'm actually researching the integration of arrow and flight
> > into a main memory database which also uses LLVM for dynamic query
> > generation! Excited to have a more detailed look at Gandiva!
> >
> > Cheers,
> > Dimitri.
> >
> > On Thu, Jun 21, 2018, 21:15 Jacques Nadeau <ja...@apache.org> wrote:
> >
> >> Hey Guys,
> >>
> >> Dremio just open sourced a new framework for processing data in Arrow
> data
> >> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
> >> LLVM (Apache licensed). It also includes Java APIs that leverage the
> Apache
> >> Arrow Java libraries. I expect the developers who have been working on
> this
> >> will introduce themselves soon. To read more about it, take a look at
> our
> >> Ravindra's blog post (he's the lead developer driving this work): [2].
> >> Hopefully people will find this interesting/useful.
> >>
> >> Let us know what you all think!
> >>
> >> thanks,
> >> Jacques
> >>
> >>
> >> [1] https://github.com/dremio/gandiva
> >> [2] https://www.dremio.com/announcing-gandiva-initiative-
> for-apache-arrow/
> >>
>

Re: Gandiva Initiative

Posted by Phillip Cloud <cp...@gmail.com>.
This is super exciting. In particular, I think for ibis (
http://docs.ibis-project.org/) building up expressions and executing them
using gandiva would fit nicely as another in-memory backend alongside the
pandas backend. I think it would also drive forward some use cases for
more complex datatype support. I'm likely going to hack together a very basic
POC using Philipp's bindings to see where I trip up and submit patches for
any non-PEBKAC errors I encounter.
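
Roughly, the mapping I have in mind would lower each node of an ibis expression tree to a Gandiva tree node. A hand-translated sketch of the shape of that mapping (the real POC would go through ibis's compiler machinery rather than doing this by hand, and the Gandiva binding names may still change):

  import ibis
  import pyarrow as pa
  import pyarrow.gandiva as gandiva

  # An ibis expression over an unbound table: c = a + b
  t = ibis.table([('a', 'int32'), ('b', 'int32')], name='t')
  ibis_expr = (t.a + t.b).name('c')

  # The Gandiva expression tree a backend would build for it.
  field_a = pa.field('a', pa.int32())
  field_b = pa.field('b', pa.int32())
  schema = pa.schema([field_a, field_b])
  builder = gandiva.TreeExprBuilder()
  add_node = builder.make_function(
      'add',
      [builder.make_field(field_a), builder.make_field(field_b)],
      pa.int32())
  gandiva_expr = builder.make_expression(add_node, pa.field('c', pa.int32()))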

I think it would be wonderful if this code were donated to the arrow
project as it would make the testing, release, deployment aspects of both
pieces of software much easier.

On Sat, Jun 23, 2018 at 11:11 AM Jacques Nadeau <ja...@apache.org> wrote:

> >
> > I'd be willing to carve out some extra time to make sure patches are
> > getting
> > merged promptly until Ravindra or others become committers.
> >
>
> Thanks!
>
>
>
> > I think the fact that Philipp incorporated Gandiva into the Arrow
> > codebase and built Python bindings in less than 24 hours is a pretty
> > strong indicator!
> >
>
> Agreed
>

Re: Gandiva Initiative

Posted by Jacques Nadeau <ja...@apache.org>.
>
> I'd be willing to carve out some extra time to make sure patches are
> getting
> merged promptly until Ravindra or others become committers.
>

Thanks!



> I think the fact that Philipp incorporated Gandiva into the Arrow
> codebase and built Python bindings in less than 24 hours is a pretty
> strong indicator!
>

Agreed

Re: Gandiva Initiative

Posted by Wes McKinney <we...@gmail.com>.
Thanks Jacques -- sounds good to me. I'll be interested in the
feedback of others, but speaking for myself at least I'm definitely
committed to helping you move quickly, and dividing some of the
burdens of developing the initiative (particularly around packaging
and deployment) may actually make things move more quickly. I'd be
willing to carve out some extra time to make sure patches are getting
merged promptly until Ravindra or others become committers.

I think the fact that Philipp incorporated Gandiva into the Arrow
codebase and built Python bindings in less than 24 hours is a pretty
strong indicator!

- Wes

On Fri, Jun 22, 2018 at 11:05 PM, Jacques Nadeau <ja...@apache.org> wrote:
> Hey Wes et al,
>
> Our goal at Dremio was to contribute the project to the Apache Arrow
> project if people think that makes sense. (So much so that you have noticed
> many things are already namespaced Apache.)  Gandiva is another way we can
> continue to have Apache projects drive the vision of a deconstructed
> database.
>
> The additional reality is we are trying to get a bunch of things done for
> Dremio so we need to figure out how to make sure the current Gandiva
> developers like Ravindra can stay productive through a transition (they are
> not Arrow committers).
>
> I've got several things going on and traveling a lot over the next little
> bit but our goal is definitely to share this with community. Look forward
> to more feedback of whether others also think Arrow would be a good home
> for this project. (As opposed to other options shall as GitHub managed or a
> new Apache project.)
>
> thanks for all the support!
>
> On Thu, Jun 21, 2018 at 4:26 PM, Wes McKinney <we...@gmail.com> wrote:
>
>> hi Jacques,
>>
>> This is very exciting! LLVM codegen for Arrow has been on my wishlist
>> since the early days of the project. I always considered it more of a
>> "when" question more than "if".
>>
>> I will take a closer look at the codebase to make some comments, but
>> my biggest initial question is whether we could work to make Gandiva
>> the official community-supported LLVM framework for creating
>> JIT-compiled Arrow kernels. In the Ursa Labs (a new lab I am building
>> to focus 90+% on Apache Arrow development) tech roadmap we discussed
>> the need for a subgraph compiler using LLVM:
>> https://ursalabs.org/tech/#subgraph-compilation-code-generation.
>>
>> I would be interesting in getting involved in the project, and I
>> expect in time many others will, as well. An obvious question would be
>> whether you would be interested in donating the project to Apache
>> Arrow and continuing the work there. We would benefit from common
>> build, testing/CI, and packaging/deployment infrastructure. I'm keen
>> to see JIT-powered predicate pushdown in Parquet files, for example.
>> Phillip and I could look into building a Gandiva backend for compiling
>> a subset of expressions originating from Ibis, a lazy-evaluation DSL
>> system with similar API to pandas
>> (https://github.com/ibis-project/ibis).
>>
>> best
>> Wes
>>
>> On Thu, Jun 21, 2018 at 4:13 PM, Dimitri Vorona
>> <al...@googlemail.com.invalid> wrote:
>> > Hey Jaques,
>> >
>> > Great stuff! I'm actually researching the integration of arrow and flight
>> > into a main memory database which also uses LLVM for dynamic query
>> > generation! Excited to have a more detailed look at Gandiva!
>> >
>> > Cheers,
>> > Dimitri.
>> >
>> > On Thu, Jun 21, 2018, 21:15 Jacques Nadeau <ja...@apache.org> wrote:
>> >
>> >> Hey Guys,
>> >>
>> >> Dremio just open sourced a new framework for processing data in Arrow
>> data
>> >> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
>> >> LLVM (Apache licensed). It also includes Java APIs that leverage the
>> Apache
>> >> Arrow Java libraries. I expect the developers who have been working on
>> this
>> >> will introduce themselves soon. To read more about it, take a look at
>> our
>> >> Ravindra's blog post (he's the lead developer driving this work): [2].
>> >> Hopefully people will find this interesting/useful.
>> >>
>> >> Let us know what you all think!
>> >>
>> >> thanks,
>> >> Jacques
>> >>
>> >>
>> >> [1] https://github.com/dremio/gandiva
>> >> [2] https://www.dremio.com/announcing-gandiva-initiative-
>> for-apache-arrow/
>> >>
>>

Re: Gandiva Initiative

Posted by Jacques Nadeau <ja...@apache.org>.
Hey Wes et al,

Our goal at Dremio was to contribute the project to the Apache Arrow
project if people think that makes sense. (So much so that, as you may have
noticed, many things are already namespaced Apache.)  Gandiva is another way we can
continue to have Apache projects drive the vision of a deconstructed
database.

The additional reality is that we are trying to get a bunch of things done for
Dremio, so we need to figure out how to make sure the current Gandiva
developers like Ravindra can stay productive through a transition (they are
not Arrow committers).

I've got several things going on and am traveling a lot over the next little
while, but our goal is definitely to share this with the community. I look
forward to more feedback on whether others also think Arrow would be a good
home for this project (as opposed to other options such as staying
GitHub-managed or becoming a new Apache project).

thanks for all the support!

On Thu, Jun 21, 2018 at 4:26 PM, Wes McKinney <we...@gmail.com> wrote:

> hi Jacques,
>
> This is very exciting! LLVM codegen for Arrow has been on my wishlist
> since the early days of the project. I always considered it more of a
> "when" question more than "if".
>
> I will take a closer look at the codebase to make some comments, but
> my biggest initial question is whether we could work to make Gandiva
> the official community-supported LLVM framework for creating
> JIT-compiled Arrow kernels. In the Ursa Labs (a new lab I am building
> to focus 90+% on Apache Arrow development) tech roadmap we discussed
> the need for a subgraph compiler using LLVM:
> https://ursalabs.org/tech/#subgraph-compilation-code-generation.
>
> I would be interesting in getting involved in the project, and I
> expect in time many others will, as well. An obvious question would be
> whether you would be interested in donating the project to Apache
> Arrow and continuing the work there. We would benefit from common
> build, testing/CI, and packaging/deployment infrastructure. I'm keen
> to see JIT-powered predicate pushdown in Parquet files, for example.
> Phillip and I could look into building a Gandiva backend for compiling
> a subset of expressions originating from Ibis, a lazy-evaluation DSL
> system with similar API to pandas
> (https://github.com/ibis-project/ibis).
>
> best
> Wes
>
> On Thu, Jun 21, 2018 at 4:13 PM, Dimitri Vorona
> <al...@googlemail.com.invalid> wrote:
> > Hey Jaques,
> >
> > Great stuff! I'm actually researching the integration of arrow and flight
> > into a main memory database which also uses LLVM for dynamic query
> > generation! Excited to have a more detailed look at Gandiva!
> >
> > Cheers,
> > Dimitri.
> >
> > On Thu, Jun 21, 2018, 21:15 Jacques Nadeau <ja...@apache.org> wrote:
> >
> >> Hey Guys,
> >>
> >> Dremio just open sourced a new framework for processing data in Arrow
> data
> >> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
> >> LLVM (Apache licensed). It also includes Java APIs that leverage the
> Apache
> >> Arrow Java libraries. I expect the developers who have been working on
> this
> >> will introduce themselves soon. To read more about it, take a look at
> our
> >> Ravindra's blog post (he's the lead developer driving this work): [2].
> >> Hopefully people will find this interesting/useful.
> >>
> >> Let us know what you all think!
> >>
> >> thanks,
> >> Jacques
> >>
> >>
> >> [1] https://github.com/dremio/gandiva
> >> [2] https://www.dremio.com/announcing-gandiva-initiative-
> for-apache-arrow/
> >>
>

Re: Gandiva Initiative

Posted by Wes McKinney <we...@gmail.com>.
hi Jacques,

This is very exciting! LLVM codegen for Arrow has been on my wishlist
since the early days of the project. I always considered it more of a
"when" question than an "if".

I will take a closer look at the codebase to make some comments, but
my biggest initial question is whether we could work to make Gandiva
the official community-supported LLVM framework for creating
JIT-compiled Arrow kernels. In the Ursa Labs (a new lab I am building
to focus 90+% on Apache Arrow development) tech roadmap we discussed
the need for a subgraph compiler using LLVM:
https://ursalabs.org/tech/#subgraph-compilation-code-generation.

I would be interested in getting involved in the project, and I
expect in time many others will, as well. An obvious question would be
whether you would be interested in donating the project to Apache
Arrow and continuing the work there. We would benefit from common
build, testing/CI, and packaging/deployment infrastructure. I'm keen
to see JIT-powered predicate pushdown in Parquet files, for example.
Phillip and I could look into building a Gandiva backend for compiling
a subset of expressions originating from Ibis, a lazy-evaluation DSL
system with a similar API to pandas
(https://github.com/ibis-project/ibis).

best
Wes

On Thu, Jun 21, 2018 at 4:13 PM, Dimitri Vorona
<al...@googlemail.com.invalid> wrote:
> Hey Jaques,
>
> Great stuff! I'm actually researching the integration of arrow and flight
> into a main memory database which also uses LLVM for dynamic query
> generation! Excited to have a more detailed look at Gandiva!
>
> Cheers,
> Dimitri.
>
> On Thu, Jun 21, 2018, 21:15 Jacques Nadeau <ja...@apache.org> wrote:
>
>> Hey Guys,
>>
>> Dremio just open sourced a new framework for processing data in Arrow data
>> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
>> LLVM (Apache licensed). It also includes Java APIs that leverage the Apache
>> Arrow Java libraries. I expect the developers who have been working on this
>> will introduce themselves soon. To read more about it, take a look at our
>> Ravindra's blog post (he's the lead developer driving this work): [2].
>> Hopefully people will find this interesting/useful.
>>
>> Let us know what you all think!
>>
>> thanks,
>> Jacques
>>
>>
>> [1] https://github.com/dremio/gandiva
>> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
>>

Re: Gandiva Initiative

Posted by Dimitri Vorona <al...@googlemail.com.INVALID>.
Hey Jacques,

Great stuff! I'm actually researching the integration of Arrow and Flight
into a main-memory database which also uses LLVM for dynamic query
generation! Excited to have a more detailed look at Gandiva!

Cheers,
Dimitri.

On Thu, Jun 21, 2018, 21:15 Jacques Nadeau <ja...@apache.org> wrote:

> Hey Guys,
>
> Dremio just open sourced a new framework for processing data in Arrow data
> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
> LLVM (Apache licensed). It also includes Java APIs that leverage the Apache
> Arrow Java libraries. I expect the developers who have been working on this
> will introduce themselves soon. To read more about it, take a look at our
> Ravindra's blog post (he's the lead developer driving this work): [2].
> Hopefully people will find this interesting/useful.
>
> Let us know what you all think!
>
> thanks,
> Jacques
>
>
> [1] https://github.com/dremio/gandiva
> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
>

Re: Gandiva Initiative

Posted by Pindikura Ravindra <ra...@gmail.com>.
Sorry for the delay, Julian. My replies inline.

On Fri, Jun 22, 2018 at 11:39 PM Julian Hyde <jh...@apache.org> wrote:

> This is exciting. We have wanted to build an Arrow adapter in Calcite for
> some time and have a prototype (see
> https://issues.apache.org/jira/browse/CALCITE-2173 <
> https://issues.apache.org/jira/browse/CALCITE-2173>) but I hope that we
> can use Gandiva. I know that Gandiva has Java bindings, but will these
> allow queries to be compiled and executed from a pure Java process?
>

Yes. Dremio is a Java process and uses the Java bindings for Gandiva. You
could take a look at the Maven unit tests for an example.


>
> Can you describe Gandiva’s governance model? Without an open governance
> model, companies that compete with Dremio may be wary about contributing.
>

Jacques has replied on this.


>
> Can you compare and contrast your approach to Hyper[1]? Hyper is also
> concerned with efficient use to the bus, and also uses LLVM, but it has a
> different memory format and places much emphasis on lock-free data
> structures.
>
> I just attended SIGMOD and there were interesting industry papers from
> MemSQL[2][3] and Oracle RAPID[4]. I was impressed with some of the tricks
> MemSQL uses to achieve SIMD parallelism on queries such as “select k4,
> sum(x) from t group by k4” (where k4 has 4 values).
>
> I missed part of the RAPID talk, but I got the impression that they are
> using disk-based algorithms (e.g. hybrid hash join) to handle data spread
> between fast and slow memory.
>
> MemSQL uses TPC-H query 1 as a motivating benchmark and I think this would
> be good target for Gandiva also. It is a table scan with a range filter
> (returning 98% of rows), a low-cardinality aggregate (grouping by two
> fields with 3 values each), and several aggregate functions, the arguments
> of which contain common sub-expressions.
>


Thanks for the references - I'll look into them and get back.

Gandiva doesn't attempt to solve query optimization, efficient disk reads
or work distribution across threads/VMs. We expect the higher layers (i.e.,
users of Gandiva) to handle this.

The expression builder returns a compiled, immutable "LLVM module", which
can be shared across threads. Once an expression is built, both the
inputs and outputs are Arrow vectors (actually, the input is a row batch).
There is no locking within Gandiva in the evaluation path.

We are also targeting performance evaluation using TPC-H, but we plan to
first address projections and filters before moving to aggregations.
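
To make the compile-once, evaluate-many shape concrete, here is a small filter sketch using the Python bindings Philipp is putting together (the binding names may differ from what finally lands, so treat this as a sketch): the condition is compiled once up front, and the resulting module is reused, without locking, for every incoming batch.

  import pyarrow as pa
  import pyarrow.gandiva as gandiva

  field_x = pa.field('x', pa.int64())
  schema = pa.schema([field_x])

  # Build and compile the condition once: x > 10
  builder = gandiva.TreeExprBuilder()
  cond = builder.make_condition(
      builder.make_function(
          'greater_than',
          [builder.make_field(field_x),
           builder.make_literal(10, pa.int64())],
          pa.bool_()))
  row_filter = gandiva.make_filter(schema, cond)

  # The compiled module is immutable, so it can be shared across threads
  # and reused for every incoming batch.
  for values in ([1, 15, 7], [30, 2, 11]):
      batch = pa.RecordBatch.from_arrays(
          [pa.array(values, type=pa.int64())], ['x'])
      selected = row_filter.evaluate(batch, pa.default_memory_pool())
      print(selected.to_array())   # indices of rows where x > 10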


>
>   SELECT
>     l_returnflag,
>     l_linestatus,
>     sum(l_quantity),
>     sum(l_extendedprice),
>     sum(l_extendedprice * (1 - l_discount)),
>     sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
>     avg(l_quantity),
>     avg(l_extendedprice),
>     avg(l_discount),
>     count(*)
>   FROM lineitem
>   WHERE l_shipdate <= date '1998-12-01' - interval '90' day
>   GROUP BY
>     l_returnflag,
>     l_linestatus
>   ORDER BY
>     l_returnflag,
>     l_linestatus;
>
> Julian
>
> [1] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf <
> http://www.vldb.org/pvldb/vol4/p539-neumann.pdf>
>
> [2]
> http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
> <
> http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
> >
>
> [3] https://dl.acm.org/citation.cfm?id=3183713.3190658 <
> https://dl.acm.org/citation.cfm?id=3183713.3190658>
>
> [4] https://dl.acm.org/citation.cfm?id=3183713.3190655 <
> https://dl.acm.org/citation.cfm?id=3183713.3190655>
>
> > On Jun 22, 2018, at 7:22 AM, ravindrap@gmail.com wrote:
> >
> > Hi everyone,
> >
> > I'm Ravindra and I'm a developer on the Gandiva project. I do believe
> that the combination of arrow and llvm for efficient expression evaluation
> is powerful, and has a broad range of use-cases. We've just started and
> hope to finesse and add a lot of functionality over the next few months.
> >
> > Welcome your feedback and participation in gandiva !!
> >
> > thanks & regards,
> > ravindra.
> >
> > On 2018/06/21 19:15:20, Jacques Nadeau <ja...@apache.org> wrote:
> >> Hey Guys,
> >>
> >> Dremio just open sourced a new framework for processing data in Arrow
> data
> >> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
> >> LLVM (Apache licensed). It also includes Java APIs that leverage the
> Apache
> >> Arrow Java libraries. I expect the developers who have been working on
> this
> >> will introduce themselves soon. To read more about it, take a look at
> our
> >> Ravindra's blog post (he's the lead developer driving this work): [2].
> >> Hopefully people will find this interesting/useful.
> >>
> >> Let us know what you all think!
> >>
> >> thanks,
> >> Jacques
> >>
> >>
> >> [1] https://github.com/dremio/gandiva
> >> [2]
> https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
> >>
>
>

Re: Gandiva Initiative

Posted by Julian Hyde <jh...@apache.org>.
This is exciting. We have wanted to build an Arrow adapter in Calcite for some time and have a prototype (see https://issues.apache.org/jira/browse/CALCITE-2173 <https://issues.apache.org/jira/browse/CALCITE-2173>) but I hope that we can use Gandiva. I know that Gandiva has Java bindings, but will these allow queries to be compiled and executed from a pure Java process?

Can you describe Gandiva’s governance model? Without an open governance model, companies that compete with Dremio may be wary about contributing.

Can you compare and contrast your approach to Hyper[1]? Hyper is also concerned with efficient use of the bus, and also uses LLVM, but it has a different memory format and places much emphasis on lock-free data structures.

I just attended SIGMOD and there were interesting industry papers from MemSQL[2][3] and Oracle RAPID[4]. I was impressed with some of the tricks MemSQL uses to achieve SIMD parallelism on queries such as “select k4, sum(x) from t group by k4” (where k4 has 4 values).

I missed part of the RAPID talk, but I got the impression that they are using disk-based algorithms (e.g. hybrid hash join) to handle data spread between fast and slow memory.

MemSQL uses TPC-H query 1 as a motivating benchmark and I think this would be a good target for Gandiva also. It is a table scan with a range filter (returning 98% of rows), a low-cardinality aggregate (grouping by two fields with 3 values each), and several aggregate functions, the arguments of which contain common sub-expressions.

  SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity),
    sum(l_extendedprice),
    sum(l_extendedprice * (1 - l_discount)),
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
    avg(l_quantity),
    avg(l_extendedprice),
    avg(l_discount),
    count(*)
  FROM lineitem
  WHERE l_shipdate <= date '1998-12-01' - interval '90' day
  GROUP BY
    l_returnflag,
    l_linestatus
  ORDER BY
    l_returnflag,
    l_linestatus;

Julian

[1] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf <http://www.vldb.org/pvldb/vol4/p539-neumann.pdf>

[2] http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/ <http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/>

[3] https://dl.acm.org/citation.cfm?id=3183713.3190658 <https://dl.acm.org/citation.cfm?id=3183713.3190658>

[4] https://dl.acm.org/citation.cfm?id=3183713.3190655 <https://dl.acm.org/citation.cfm?id=3183713.3190655>

> On Jun 22, 2018, at 7:22 AM, ravindrap@gmail.com wrote:
> 
> Hi everyone,
> 
> I'm Ravindra and I'm a developer on the Gandiva project. I do believe that the combination of arrow and llvm for efficient expression evaluation is powerful, and has a broad range of use-cases. We've just started and hope to finesse and add a lot of functionality over the next few months.
> 
> Welcome your feedback and participation in gandiva !!
> 
> thanks & regards,
> ravindra.
> 
> On 2018/06/21 19:15:20, Jacques Nadeau <ja...@apache.org> wrote: 
>> Hey Guys,
>> 
>> Dremio just open sourced a new framework for processing data in Arrow data
>> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
>> LLVM (Apache licensed). It also includes Java APIs that leverage the Apache
>> Arrow Java libraries. I expect the developers who have been working on this
>> will introduce themselves soon. To read more about it, take a look at our
>> Ravindra's blog post (he's the lead developer driving this work): [2].
>> Hopefully people will find this interesting/useful.
>> 
>> Let us know what you all think!
>> 
>> thanks,
>> Jacques
>> 
>> 
>> [1] https://github.com/dremio/gandiva
>> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
>> 


Re: Gandiva Initiative

Posted by ra...@gmail.com.
Hi everyone,

I'm Ravindra and I'm a developer on the Gandiva project. I do believe that the combination of Arrow and LLVM for efficient expression evaluation is powerful and has a broad range of use cases. We've just started and hope to refine it and add a lot of functionality over the next few months.

We welcome your feedback and participation in Gandiva!

thanks & regards,
ravindra.

On 2018/06/21 19:15:20, Jacques Nadeau <ja...@apache.org> wrote: 
> Hey Guys,
> 
> Dremio just open sourced a new framework for processing data in Arrow data
> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
> LLVM (Apache licensed). It also includes Java APIs that leverage the Apache
> Arrow Java libraries. I expect the developers who have been working on this
> will introduce themselves soon. To read more about it, take a look at our
> Ravindra's blog post (he's the lead developer driving this work): [2].
> Hopefully people will find this interesting/useful.
> 
> Let us know what you all think!
> 
> thanks,
> Jacques
> 
> 
> [1] https://github.com/dremio/gandiva
> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
>