Posted to dev@arrow.apache.org by Yibo Cai <yi...@arm.com> on 2019/10/31 05:04:48 UTC

questions about Gandiva

Hi,

Arrow C++ integrates Gandiva to provide low-level operations on Arrow buffers. [1][2]
I have some questions; any help is appreciated:
- Arrow C++ already has compute kernels[3]; do they duplicate what Gandiva provides? I see a JIRA discussing this. [4]
- Is Gandiva only for Arrow C++? What about other languages (Go, Rust, ...)?
- Gandiva leverages SIMD for vectorized operations[1], but I didn't see any related code. Am I missing something?

[1] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
[2] https://github.com/apache/arrow/tree/master/cpp/src/gandiva
[3] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
[4] https://issues.apache.org/jira/browse/ARROW-7017

Thanks,
Yibo

Re: questions about Gandiva

Posted by Ted Gooch <te...@gmail.com>.
You can also see some of the Gandiva Python bindings exercised in the
pyarrow tests:
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py


On Thu, Oct 31, 2019 at 10:26 AM Wes McKinney <we...@gmail.com> wrote:

> hi
>
> On Thu, Oct 31, 2019 at 12:11 AM Yibo Cai <yi...@arm.com> wrote:
> >
> > Hi,
> >
> > Arrow cpp integrates Gandiva to provide low level operations on arrow
> buffers. [1][2]
> > I have some questions, any help is appreciated:
> > - Arrow cpp already has a compute kernel[3], does it duplicate what
> Gandiva provides? I see a Jira talk about it.[4]
>
> No. There are some cases of functional overlap but we are servicing a
> spectrum of use cases beyond the scope of Gandiva. Additionally, it is
> unclear to me that an LLVM JIT compilation step should be required to
> evaluate simple expressions such as "a > 5" -- in addition to
> introducing latency (due to the compilation step) it is also a heavy
> dependency to require the LLVM runtime in all applications.
>
> Personally I'm interested in supporting a wide gamut of analytics
> workloads, from data frame / data science type libraries to SQL-like
> systems. Gandiva is designed for the needs of a SQL-based execution
> engine where chunks of data are fed into Projection or Filter nodes in
> a computation graph -- Gandiva generates a specialized kernel to
> perform a unit of work inside those nodes. Realistically, I expect
> many real world applications will contain a mixture of pre-compiled
> analytic kernels and JIT-compiled kernels.
>
> Rome wasn't built in a day, so I'm expecting several years of work
> ahead of us at the present rate. We need more help in this domain.
>
> > - Is Gandiva only for arrow cpp? What about other languages(go, rust,
> ...)?
>
> It's being used in Java via JNI. The same approach could be applied
> for the other languages as they have their own C FFI mechanisms.
>
> > - Gandiva leverages SIMD for vectorized operations[1], but I didn't see
> any related code. Am I missing something?
>
> My understanding is that LLVM inserts many SIMD instructions
> automatically based on the host CPU architecture version. Gandiva
> developers may have some comments / pointers about this
>
> >
> > [1]
> https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
> > [2] https://github.com/apache/arrow/tree/master/cpp/src/gandiva
> > [3] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
> > [4] https://issues.apache.org/jira/browse/ARROW-7017
> >
> > Thanks,
> > Yibo
>

Re: questions about Gandiva

Posted by Ravindra Pindikura <ra...@dremio.com>.
On Thu, Oct 31, 2019 at 10:56 PM Wes McKinney <we...@gmail.com> wrote:

> hi
>
> On Thu, Oct 31, 2019 at 12:11 AM Yibo Cai <yi...@arm.com> wrote:
> >
> > Hi,
> >
> > Arrow cpp integrates Gandiva to provide low level operations on arrow
> buffers. [1][2]
> > I have some questions, any help is appreciated:
> > - Arrow cpp already has a compute kernel[3], does it duplicate what
> Gandiva provides? I see a Jira talk about it.[4]
>
> No. There are some cases of functional overlap but we are servicing a
> spectrum of use cases beyond the scope of Gandiva. Additionally, it is
> unclear to me that an LLVM JIT compilation step should be required to
> evaluate simple expressions such as "a > 5" -- in addition to
> introducing latency (due to the compilation step) it is also a heavy
> dependency to require the LLVM runtime in all applications.
>

Like other JIT-based systems, Gandiva takes a hit at "build" time in the
hope that it can be amortized by faster "expression evaluate" times. This
works really well when we build an expression once and evaluate thousands
or millions of record batches against the same built expression.

The build time is negligible (~1 or 2 ms) for simple expressions like the
one Wes gave here, but it can be much higher for complex expressions
involving lots of if/else/case/in statements. We have seen very large
expressions (containing thousands of case statements) for which the build
time starts to show up as a significant factor, especially if the
expression is used to evaluate only a few hundred batches or so.

Our approach to this is twofold:

1. Cache built expressions: this is already done
- build the expression once, cache it, and reuse it
- helps a lot for query reattempts

2. Tiered compilation: not done yet
- the compilation time increases as we do more optimisation passes or
try to inline functions more aggressively
- we could do this in a tiered fashion, e.g.:


   - tier 1: for the first M batches, use LLVM's interpreter evaluation
   (build the tier-2 module in parallel)
   - tier 2: for the next N batches, use the Gandiva-compiled module with
   minimal optimisation passes (build the tier-3 module in parallel)
   - tier 3: for the rest, use the Gandiva-compiled, fully optimised module

That way, if a complex expression is used to evaluate just a few batches,
it doesn't pay the cost of the fully optimised build.
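
To make the build-once / evaluate-many pattern concrete, here is a minimal
C++ sketch against the Gandiva API as I understand it (the helper name
ProjectAPlusFive and the choice of the registered "add" function are
illustrative assumptions, not code from this thread):

    #include <arrow/api.h>
    #include <gandiva/projector.h>
    #include <gandiva/tree_expr_builder.h>

    // Sketch: build the expression "res = a + 5" once, then reuse the
    // JIT-compiled projector for every incoming record batch.
    arrow::Status ProjectAPlusFive(
        const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches) {
      auto field_a = arrow::field("a", arrow::int32());
      auto schema = arrow::schema({field_a});
      auto field_res = arrow::field("res", arrow::int32());

      auto node_a = gandiva::TreeExprBuilder::MakeField(field_a);
      auto five = gandiva::TreeExprBuilder::MakeLiteral(int32_t(5));
      auto sum = gandiva::TreeExprBuilder::MakeFunction(
          "add", {node_a, five}, arrow::int32());
      auto expr = gandiva::TreeExprBuilder::MakeExpression(sum, field_res);

      // The LLVM JIT "build" cost is paid once, here.
      std::shared_ptr<gandiva::Projector> projector;
      ARROW_RETURN_NOT_OK(gandiva::Projector::Make(schema, {expr}, &projector));

      // Evaluation reuses the compiled kernel, amortizing the build cost
      // across all batches.
      for (const auto& batch : batches) {
        arrow::ArrayVector outputs;
        ARROW_RETURN_NOT_OK(projector->Evaluate(
            *batch, arrow::default_memory_pool(), &outputs));
      }
      return arrow::Status::OK();
    }

Caching the projector (keyed by schema and expression) is what makes the
build cost disappear on query reattempts.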


> Personally I'm interested in supporting a wide gamut of analytics
> workloads, from data frame / data science type libraries to SQL-like
> systems. Gandiva is designed for the needs of a SQL-based execution
> engine where chunks of data are fed into Projection or Filter nodes in
> a computation graph -- Gandiva generates a specialized kernel to
> perform a unit of work inside those nodes. Realistically, I expect
> many real world applications will contain a mixture of pre-compiled
> analytic kernels and JIT-compiled kernels.
>
> Rome wasn't built in a day, so I'm expecting several years of work
> ahead of us at the present rate. We need more help in this domain.
>
> > - Is Gandiva only for arrow cpp? What about other languages(go, rust,
> ...)?
>
> It's being used in Java via JNI. The same approach could be applied
> for the other languages as they have their own C FFI mechanisms.
>
> > - Gandiva leverages SIMD for vectorized operations[1], but I didn't see
> any related code. Am I missing something?
>
> My understanding is that LLVM inserts many SIMD instructions
> automatically based on the host CPU architecture version. Gandiva
> developers may have some comments / pointers about this
>

Wes is correct - we depend on the LLVM optimisation passes to do this.

https://github.com/apache/arrow/blob/master/cpp/src/gandiva/engine.cc#L214
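
For illustration, here is a rough sketch of enabling LLVM's loop and SLP
vectorizers on a generated module, assuming an older LLVM release that
still ships the legacy PassManagerBuilder (this approximates the idea, it
is not the exact code behind that link):

    #include <llvm/IR/LegacyPassManager.h>
    #include <llvm/IR/Module.h>
    #include <llvm/Transforms/IPO/PassManagerBuilder.h>

    // Run module-level optimisation passes so that LLVM's loop and SLP
    // vectorizers can turn scalar loops in the generated IR into SIMD
    // code for the host CPU.
    void OptimizeModule(llvm::Module& module) {
      llvm::PassManagerBuilder builder;
      builder.OptLevel = 3;          // aggressive optimisation
      builder.LoopVectorize = true;  // auto-vectorize loops
      builder.SLPVectorize = true;   // vectorize straight-line code

      llvm::legacy::PassManager pass_manager;
      builder.populateModulePassManager(pass_manager);
      pass_manager.run(module);
    }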



>
> >
> > [1]
> https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
> > [2] https://github.com/apache/arrow/tree/master/cpp/src/gandiva
> > [3] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
> > [4] https://issues.apache.org/jira/browse/ARROW-7017
> >
> > Thanks,
> > Yibo
>


-- 
Thanks and regards,
Ravindra.

Re: questions about Gandiva

Posted by Ravindra Pindikura <ra...@dremio.com>.
On Fri, Nov 1, 2019 at 10:41 AM Yibo Cai <yi...@arm.com> wrote:

> Thanks Wes. Arrow is a very exciting project.
> I'm from Arm. We are interested in arrow and would like to study and help
> improving arrow.
>

If you are familiar with LLVM/JIT, you could help us improve the
optimisation passes in Gandiva (tweaking existing ones, adding new ones,
or any other tricks).


>
> Yibo
>
> On 11/1/19 1:25 AM, Wes McKinney wrote:
> > hi
> >
> > On Thu, Oct 31, 2019 at 12:11 AM Yibo Cai <yi...@arm.com> wrote:
> >>
> >> Hi,
> >>
> >> Arrow cpp integrates Gandiva to provide low level operations on arrow
> buffers. [1][2]
> >> I have some questions, any help is appreciated:
> >> - Arrow cpp already has a compute kernel[3], does it duplicate what
> Gandiva provides? I see a Jira talk about it.[4]
> >
> > No. There are some cases of functional overlap but we are servicing a
> > spectrum of use cases beyond the scope of Gandiva. Additionally, it is
> > unclear to me that an LLVM JIT compilation step should be required to
> > evaluate simple expressions such as "a > 5" -- in addition to
> > introducing latency (due to the compilation step) it is also a heavy
> > dependency to require the LLVM runtime in all applications.
> >
> > Personally I'm interested in supporting a wide gamut of analytics
> > workloads, from data frame / data science type libraries to SQL-like
> > systems. Gandiva is designed for the needs of a SQL-based execution
> > engine where chunks of data are fed into Projection or Filter nodes in
> > a computation graph -- Gandiva generates a specialized kernel to
> > perform a unit of work inside those nodes. Realistically, I expect
> > many real world applications will contain a mixture of pre-compiled
> > analytic kernels and JIT-compiled kernels.
> >
> > Rome wasn't built in a day, so I'm expecting several years of work
> > ahead of us at the present rate. We need more help in this domain.
> >
> >> - Is Gandiva only for arrow cpp? What about other languages(go, rust,
> ...)?
> >
> > It's being used in Java via JNI. The same approach could be applied
> > for the other languages as they have their own C FFI mechanisms.
> >
> >> - Gandiva leverages SIMD for vectorized operations[1], but I didn't see
> any related code. Am I missing something?
> >
> > My understanding is that LLVM inserts many SIMD instructions
> > automatically based on the host CPU architecture version. Gandiva
> > developers may have some comments / pointers about this
> >
> >>
> >> [1]
> https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
> >> [2] https://github.com/apache/arrow/tree/master/cpp/src/gandiva
> >> [3] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
> >> [4] https://issues.apache.org/jira/browse/ARROW-7017
> >>
> >> Thanks,
> >> Yibo
>


-- 
Thanks and regards,
Ravindra.

Re: questions about Gandiva

Posted by Yibo Cai <yi...@arm.com>.
Thanks Wes. Arrow is a very exciting project.
I'm from Arm. We are interested in Arrow and would like to study it and help improve it.

Yibo

On 11/1/19 1:25 AM, Wes McKinney wrote:
> hi
> 
> On Thu, Oct 31, 2019 at 12:11 AM Yibo Cai <yi...@arm.com> wrote:
>>
>> Hi,
>>
>> Arrow cpp integrates Gandiva to provide low level operations on arrow buffers. [1][2]
>> I have some questions, any help is appreciated:
>> - Arrow cpp already has a compute kernel[3], does it duplicate what Gandiva provides? I see a Jira talk about it.[4]
> 
> No. There are some cases of functional overlap but we are servicing a
> spectrum of use cases beyond the scope of Gandiva. Additionally, it is
> unclear to me that an LLVM JIT compilation step should be required to
> evaluate simple expressions such as "a > 5" -- in addition to
> introducing latency (due to the compilation step) it is also a heavy
> dependency to require the LLVM runtime in all applications.
> 
> Personally I'm interested in supporting a wide gamut of analytics
> workloads, from data frame / data science type libraries to SQL-like
> systems. Gandiva is designed for the needs of a SQL-based execution
> engine where chunks of data are fed into Projection or Filter nodes in
> a computation graph -- Gandiva generates a specialized kernel to
> perform a unit of work inside those nodes. Realistically, I expect
> many real world applications will contain a mixture of pre-compiled
> analytic kernels and JIT-compiled kernels.
> 
> Rome wasn't built in a day, so I'm expecting several years of work
> ahead of us at the present rate. We need more help in this domain.
> 
>> - Is Gandiva only for arrow cpp? What about other languages(go, rust, ...)?
> 
> It's being used in Java via JNI. The same approach could be applied
> for the other languages as they have their own C FFI mechanisms.
> 
>> - Gandiva leverages SIMD for vectorized operations[1], but I didn't see any related code. Am I missing something?
> 
> My understanding is that LLVM inserts many SIMD instructions
> automatically based on the host CPU architecture version. Gandiva
> developers may have some comments / pointers about this
> 
>>
>> [1] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
>> [2] https://github.com/apache/arrow/tree/master/cpp/src/gandiva
>> [3] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
>> [4] https://issues.apache.org/jira/browse/ARROW-7017
>>
>> Thanks,
>> Yibo

Re: questions about Gandiva

Posted by Wes McKinney <we...@gmail.com>.
hi

On Thu, Oct 31, 2019 at 12:11 AM Yibo Cai <yi...@arm.com> wrote:
>
> Hi,
>
> Arrow cpp integrates Gandiva to provide low level operations on arrow buffers. [1][2]
> I have some questions, any help is appreciated:
> - Arrow cpp already has a compute kernel[3], does it duplicate what Gandiva provides? I see a Jira talk about it.[4]

No. There are some cases of functional overlap but we are servicing a
spectrum of use cases beyond the scope of Gandiva. Additionally, it is
unclear to me that an LLVM JIT compilation step should be required to
evaluate simple expressions such as "a > 5" -- in addition to
introducing latency (due to the compilation step) it is also a heavy
dependency to require the LLVM runtime in all applications.
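
As a point of comparison, here is a minimal sketch of evaluating "a > 5"
and keeping the matching rows with pre-compiled kernels and no LLVM
dependency. It assumes a recent arrow::compute API where CallFunction and
the "greater" and "filter" kernels exist; those names are my assumption
and postdate the compute module as it stood when this was written.

    #include <arrow/api.h>
    #include <arrow/compute/api.h>

    // Sketch: evaluate "a > 5" and filter the array with pre-compiled
    // kernels -- no JIT step, no LLVM runtime.
    arrow::Result<arrow::Datum> KeepGreaterThanFive(
        const std::shared_ptr<arrow::Array>& a) {
      // Boolean mask for "a > 5" (scalar is broadcast against the array).
      ARROW_ASSIGN_OR_RAISE(
          arrow::Datum mask,
          arrow::compute::CallFunction(
              "greater", {arrow::Datum(a), arrow::Datum(int64_t(5))}));
      // Keep only the rows where the mask is true.
      return arrow::compute::CallFunction("filter", {arrow::Datum(a), mask});
    }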

Personally I'm interested in supporting a wide gamut of analytics
workloads, from data frame / data science type libraries to SQL-like
systems. Gandiva is designed for the needs of a SQL-based execution
engine where chunks of data are fed into Projection or Filter nodes in
a computation graph -- Gandiva generates a specialized kernel to
perform a unit of work inside those nodes. Realistically, I expect
many real world applications will contain a mixture of pre-compiled
analytic kernels and JIT-compiled kernels.

Rome wasn't built in a day, so I'm expecting several years of work
ahead of us at the present rate. We need more help in this domain.

> - Is Gandiva only for arrow cpp? What about other languages(go, rust, ...)?

It's being used in Java via JNI. The same approach could be applied
for the other languages as they have their own C FFI mechanisms.

> - Gandiva leverages SIMD for vectorized operations[1], but I didn't see any related code. Am I missing something?

My understanding is that LLVM inserts many SIMD instructions
automatically based on the host CPU architecture version. Gandiva
developers may have some comments / pointers about this.

>
> [1] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
> [2] https://github.com/apache/arrow/tree/master/cpp/src/gandiva
> [3] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
> [4] https://issues.apache.org/jira/browse/ARROW-7017
>
> Thanks,
> Yibo