Posted to dev@drill.apache.org by weijie tong <to...@gmail.com> on 2019/04/04 06:51:18 UTC

[Discuss] Integrate Arrow gandiva into Drill

Hi:

Gandiva is a subproject of Arrow. By using LLVM code generation and SIMD, Gandiva can achieve better query performance. Arrow and Drill have similar columnar memory formats; the main difference now is the null representation. Arrow has also made great changes to its ValueVector. Adopting Arrow to replace Drill's ValueVector has been discussed before, and that would be a big job. But leveraging Gandiva by working at the physical memory address level would be a relatively small amount of work.

I have now done the integration work on our own branch by making some changes to an Arrow branch, and filed DRILL-7087 and ARROW-4819. The main change in ARROW-4819 is to make some package-level methods public, but the Arrow community does not seem to plan to accept that change. Their advice is to maintain our own Arrow branch.

So what do you think?

1. Maintain our own branch of Arrow.
2. Wait for the complete Arrow integration.
Or some other ideas?

Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by weijie tong <to...@gmail.com>.
Got it. Thanks to all of you!


Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by Karthikeyan Manivannan <km...@mapr.com>.
Hi Weijie,

You are right. Before DRILL-6340 the purpose of the hasRemainder() logic
was not clear: projector.projectRecords() always took incomingRowCount as
its argument and returned the same value on non-exceptional paths. So I
think the whole hasRemainder() path was dead code back then. I did not
investigate it further because I knew that with DRILL-6340 that code would
definitely become necessary.

Karthik



Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by Sorabh Hamirwasia <so...@gmail.com>.
Hi Weijie,
I think the only case in which that line will be executed is when a UDF,
such as a flatten operation, produces multiple rows for each input row.
Even though Flatten is currently a separate operator in Drill, I think that
code is there to handle such cases.

Thanks,
Sorabh
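A toy sketch (not Drill code; the class and method names here are purely illustrative) of how a flatten-style operation fans each input row out into a different number of output rows, so the output row count no longer matches the input row count:

```java
import java.util.ArrayList;
import java.util.List;

public class FlattenSketch {

    // Each input "row" holds a repeated value; flattening emits one output
    // row per element, so 2 input rows can become 5 output rows.
    static List<Integer> flatten(List<List<Integer>> incoming) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> row : incoming) {
            out.addAll(row);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> batch = List.of(List.of(1, 2, 3), List.of(4, 5));
        System.out.println(flatten(batch).size());  // 5 output rows from 2 input rows
    }
}
```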


Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by weijie tong <to...@gmail.com>.
The comparison code first appeared in DRILL-620:
https://github.com/apache/drill/commit/a2355d42dbff51b858fc28540915cf793f1c0fac#diff-e87beb3f2aa0fbc06b07b1d55c3d3536
Before DRILL-6340, judging from ProjectorTemplate's projectRecords method
and its actual input parameter values, I think line 234 of
ProjectRecordBatch could never be executed. Only since DRILL-6340, where we
control the output batch memory size, has that part of the code finally
come into use.

If I am wrong, please let me know.


Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by weijie tong <to...@gmail.com>.
Thanks for the reply, but it seems the code had been there even before
DRILL-6340.


Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by Vova Vysotskyi <vv...@gmail.com>.
Hi Weijie,

It is possible if maxOutputRecordCount (received from
memoryManager.getOutputRowCount()) is less than incomingRecordCount.
For more details, please see DRILL-6340
<https://issues.apache.org/jira/browse/DRILL-6340> and the design document
<https://docs.google.com/document/d/1h0WsQsen6xqqAyyYSrtiAniQpVZGmQNQqC1I2DJaxAA/edit?usp=sharing>
attached to that Jira.

Kind regards,
Volodymyr Vysotskyi
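A simplified, hypothetical sketch of that condition (the method and variable names below are illustrative, not Drill's exact code): when the memory manager caps the output batch below the incoming row count, the projector stops early and the unprojected rows become a remainder for the next output batch.

```java
public class ProjectRemainderSketch {

    // Projects at most maxOutputRecordCount rows out of the incoming batch
    // and returns how many rows were actually projected.
    static int projectRecords(int incomingRecordCount, int maxOutputRecordCount) {
        return Math.min(incomingRecordCount, maxOutputRecordCount);
    }

    public static void main(String[] args) {
        int incoming = 1000;
        int cap = 640;  // stand-in for memoryManager.getOutputRowCount()

        int projected = projectRecords(incoming, cap);
        boolean hasRemainder = projected < incoming;

        System.out.println(hasRemainder);          // true
        System.out.println(incoming - projected);  // 360 rows left for the next batch
    }
}
```

Before DRILL-6340 the cap effectively always equaled the incoming count, which is why this branch looked like dead code in the earlier discussion.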



Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by weijie tong <to...@gmail.com>.
I have a doubt about the ProjectRecordBatch implementation and hope someone
can explain it. Regarding line 234 of ProjectRecordBatch: in what case is
the projector's output row count less than the input row count?


Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by weijie tong <to...@gmail.com>.
Hi Igor:
That's a good idea! It could resolve that issue, so the basic question is
solved. To use the official Arrow, there are still two issues that need to
be contributed to Arrow, which I will do:
1. Statically link the gcc libraries into the JNI dynamic library.
   Without this, the platform must have the right gcc version installed.
2. Add a convertToNull function to Gandiva.
   This would let project expressions containing the convertToNull function
   be executed by Gandiva.

Of course, even without these two issues solved, I can still provide an
integration implementation.

BTW, once the integration is done, how do we supply the Gandiva JNI lib?
Leave it to users to build it, or supply distributions for different
platforms?


On Thu, Apr 4, 2019 at 3:53 PM Igor Guzenko <ih...@gmail.com>
wrote:

> Hello Weijie,
>
> Did you try to create same package as in Arrow, but in Drill and use
> wrapper class around target for publishing
> desired methods with package access ?
>
> Thanks, Igor
>
> On Thu, Apr 4, 2019 at 9:51 AM weijie tong <to...@gmail.com>
> wrote:
> >
> > HI :
> >
> > Gandiva is a sub project of Arrow. Arrow gandiva using LLVM codegen and
> > simd skill could achieve better query performance.  Arrow and Drill has
> > similar column memory format. The main difference now is the null
> > representation. Also Arrow has made great changes to the ValueVector. To
> > adopt Arrow to replace Drill's VV has been discussed before. That would
> be
> > a great job. But to leverage gandiva , by working at the physical memory
> > address level , this work could be little relatively.
> >
> > Now I have done the integration work at our own branch by make some
> changes
> > to the Arrow branch, and issued DRILL-7087 and ARROW-4819. The main
> changes
> > to ARROW-4819 is to make some package level method to be public. But
> arrow
> > community seems not plan to accept this change. Their advice is to have a
> > arrow branch.
> >
> > So what do you think?
> >
> > 1、Have a self branch of Arrow.
> > 2、waiting for the Arrow integration completely.
> > or some other ideas?
>

Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by Igor Guzenko <ih...@gmail.com>.
Hello Weijie,

Did you try creating the same package as in Arrow inside Drill, and using
a wrapper class around the target class to publish the desired
package-private methods?
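
Igor's suggestion relies on Java's package-private visibility: a class compiled into the same package as the target can call its package-private members. A minimal self-contained sketch of the trick (the class and method names here are invented for illustration, not Arrow's actual classes):

```java
// SamplePackageTrick.java — PackagePrivateTarget stands in for an Arrow
// class with a package-private method; the public wrapper, compiled into
// the same package, re-exports it without modifying the target's source.
class PackagePrivateTarget {
    // package-private: visible only to classes in the same package
    static String internalEval(String expr) {
        return "evaluated:" + expr;
    }
}

public class SamplePackageTrick {
    // Public wrapper publishing the package-private method.
    public static String eval(String expr) {
        return PackagePrivateTarget.internalEval(expr);
    }

    public static void main(String[] args) {
        System.out.println(SamplePackageTrick.eval("a + b")); // evaluated:a + b
    }
}
```

In the real integration, the wrapper would live in Drill's source tree but declare Arrow's package name, giving it compile-time access to the package-private JNI entry points.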

Thanks, Igor

On Thu, Apr 4, 2019 at 9:51 AM weijie tong <to...@gmail.com> wrote:
>
> HI :
>
> Gandiva is a sub project of Arrow. Arrow gandiva using LLVM codegen and
> simd skill could achieve better query performance.  Arrow and Drill has
> similar column memory format. The main difference now is the null
> representation. Also Arrow has made great changes to the ValueVector. To
> adopt Arrow to replace Drill's VV has been discussed before. That would be
> a great job. But to leverage gandiva , by working at the physical memory
> address level , this work could be little relatively.
>
> Now I have done the integration work at our own branch by make some changes
> to the Arrow branch, and issued DRILL-7087 and ARROW-4819. The main changes
> to ARROW-4819 is to make some package level method to be public. But arrow
> community seems not plan to accept this change. Their advice is to have a
> arrow branch.
>
> So what do you think?
>
> 1、Have a self branch of Arrow.
> 2、waiting for the Arrow integration completely.
> or some other ideas?

Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by Parth Chandra <pa...@apache.org>.
Finally!
We can definitely go ahead with Gandiva if it doesn't depend on Arrow's
memory allocator.
Last time I checked, the C++ version of Arrow still had its own memory
allocation, but Gandiva likely does not use that in its C++ code.



On Tue, Apr 23, 2019 at 5:46 AM weijie tong <to...@gmail.com> wrote:

> Gandiva 's Project does not allocate any more memory to execute. It just
> calculates the input memory data whatever they are var-length or
> fixed-width. The output memory will also be allocated by the Drill ahead
> which needs to be fixed-width vectors. The var-width output vector cases
> should not be allowed the Gandiva to evaluate since that will need Gandiva
> to allocate additional memory which is not controlled by the JVM.
>
> I guess that's why Gandiva does not implement operator like HashJoin or
> HashAggregate which need to allocate additional memory to implement. But
> Arrow's WIP PR ARROW-3191 https://github.com/apache/arrow/pull/4151 will
> make that possible.
>
> On Tue, Apr 23, 2019 at 7:15 AM Parth Chandra <pa...@apache.org> wrote:
>
> > Is there a way to provide Drill's memory allocator to Gandiva/Arrow? If
> > not, then how do we keep a proper accounting of any memory used by
> > Gandiva/Arrow?
> >
> > On Sat, Apr 20, 2019 at 7:05 PM Paul Rogers <pa...@yahoo.com.invalid>
> > wrote:
> >
> > > Hi Weijie,
> > >
> > > Thanks much for the explanation. Sounds like you are making good
> > progress.
> > >
> > >
> > > For which operator is the filter pushed into the scan? Although Impala
> > > does this for all scans, AFAIK, Drill does not do so. For example, the
> > text
> > > and JSON reader do not handle filtering. Filtering is instead done by
> the
> > > Filter operator in these cases. Perhaps you have your own special scan
> > > which handles filtering?
> > >
> > >
> > > The concern in DRILL-6340 was the user might do a project operation
> that
> > > causes the output batch to be much larger than the input batch. Someone
> > > suggested flatten as one example. String concatenation is another
> > example.
> > > The input batch might be large. The result of the concatenation could
> be
> > > too large for available memory. So, the idea was to project the single
> > > input batch into two (or more) output batches to control batch size.
> > >
> > >
> > > II like how you've categorized the vectors into the set that Gandiva
> can
> > > project, and the set that Drill must handle. Maybe you can extend this
> > idea
> > > for the case where input batches are split into multiple output
> batches.
> > >
> > >  Let Drill handle VarChar expressions that could increase column width
> > > (such as the concatenate operator.) Let Drill decide the number of rows
> > in
> > > the output batch. Then, for the columns that Gandiva can handle,
> project
> > > just those rows needed for the current output batch.
> > >
> > > Your solution might also be extended to handle the Gandiva library
> issue.
> > > Since you are splitting vectors into the Drill group and the Gandiva
> > group,
> > > if Drill runs on a platform without Gandiva support, or if the Gandiva
> > > library can't be found, just let all vectors fall into the Drill vector
> > > group.
> > >
> > > If the user wants to use Gandiva, he/she could set a config option to
> > > point to the Gandiva library (and supporting files, if any.) Or, use
> the
> > > existing LD_LIBRARY_PATH env. variable.
> > >
> > > Thanks,
> > > - Paul
> > >
> > >
> > >
> > >     On Thursday, April 18, 2019, 11:45:08 PM PDT, weijie tong <
> > > tongweijie178@gmail.com> wrote:
> > >
> > >  Hi Paul:
> > > Currently Gandiva only supports Project ,Filter operations. My work is
> to
> > > integrate Project operator. Since most of the Filter operator will be
> > > pushed down to the Scan.
> > >
> > > The Gandiva project interface works at the RecordBatch level. It
> accepts
> > > the memory address of the vectors of  input RecordBatch and . Before
> that
> > > it also need to construct a binary schema object to describe the input
> > > RecordBatch schema.
> > >
> > > The integration work mainly has two parts:
> > >   1. at the setup step, find the expressions which can be solved by the
> > > Gandiva . The matched expression will be solved by the Gandiva, others
> > will
> > > still be solved by Drill.
> > >   2. invoking the Gandiva native project method. The matched
> expressions'
> > > ValueVectors will all be allocated corresponding Arrow type null
> > > representation ValueVector. The null input vector's bit  will also be
> > set.
> > > The same work will also be done to the output ValueVectors, transfer
> the
> > > arrow output null vector to Drill's null vector. Since the native
> method
> > > only care the physical memory address, invoking that native method is
> > not a
> > > hard work.
> > >
> > > Since my current implementation is before DRILL-6340, it does not solve
> > the
> > > output size of the project which is less than the input size case. To
> > cover
> > > that case , there's some more work to do which I have not focused on.
> > >
> > > To contribute to community , there's also some test case problem which
> > > needs to be considered, since the Gandiva jar is platform dependent.
> > >
> > >
> > >
> > >
> > > On Fri, Apr 19, 2019 at 8:43 AM Paul Rogers <par0328@yahoo.com.invalid
> >
> > > wrote:
> > >
> > > > Hi Weijie,
> > > >
> > > > Thanks much for the update on your Gandiva work. It is great work.
> > > >
> > > > Can you say more about how you are doing the integration?
> > > >
> > > > As you mentioned the memory layout of Arrow's null vector differs
> from
> > > the
> > > > "is set" vector in Drill. How did you work around that?
> > > >
> > > > The Project operator is pretty simple if we are just copying or
> > removing
> > > > columns. However, much of Project deals with invoking Drill-provided
> > > > functions: simple ones (add two ints) and complex ones (perform a
> regex
> > > > match). To be useful, the integration would have to mimic Drill's
> > > behavior
> > > > for each of these many functions.
> > > >
> > > > Project currently works row-by-row. But, to get the maximum
> > performance,
> > > > it would work column-by-column to take full advantage of
> vectorization.
> > > > Doing that would require large changes to the code that sets up
> > codegen,
> > > > and iterates over the batch.
> > > >
> > > >
> > > > For operators such as Sort, the only vector-based operations are 1)
> > sort
> > > a
> > > > batch using defined keys to get an offset vector, and 2) create a new
> > > > vector by copying values, row-by-row, from one batch to another
> > according
> > > > to the offset vector.
> > > >
> > > > The join and aggregate operations are even more complex, as are the
> > > > partition senders and receivers.
> > > >
> > > > Can you tell us where you've used Gandiva? Which operators? How did
> you
> > > > handle the function integration? I am very curious how you were able
> to
> > > > solve these problems.
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > - Paul
> > > >
> > > >
> > > >
> > > >    On Wednesday, April 3, 2019, 11:51:34 PM PDT, weijie tong <
> > > > tongweijie178@gmail.com> wrote:
> > > >
> > > >  HI :
> > > >
> > > > Gandiva is a sub project of Arrow. Arrow gandiva using LLVM codegen
> and
> > > > simd skill could achieve better query performance.  Arrow and Drill
> has
> > > > similar column memory format. The main difference now is the null
> > > > representation. Also Arrow has made great changes to the ValueVector.
> > To
> > > > adopt Arrow to replace Drill's VV has been discussed before. That
> would
> > > be
> > > > a great job. But to leverage gandiva , by working at the physical
> > memory
> > > > address level , this work could be little relatively.
> > > >
> > > > Now I have done the integration work at our own branch by make some
> > > changes
> > > > to the Arrow branch, and issued DRILL-7087 and ARROW-4819. The main
> > > changes
> > > > to ARROW-4819 is to make some package level method to be public. But
> > > arrow
> > > > community seems not plan to accept this change. Their advice is to
> > have a
> > > > arrow branch.
> > > >
> > > > So what do you think?
> > > >
> > > > 1、Have a self branch of Arrow.
> > > > 2、waiting for the Arrow integration completely.
> > > > or some other ideas?
> >
>

Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by weijie tong <to...@gmail.com>.
Gandiva's Project does not allocate any additional memory to execute. It
just computes over the input memory, whether the inputs are var-length or
fixed-width. The output memory is also allocated ahead of time by Drill,
which requires the outputs to be fixed-width vectors. Var-width output
cases should not be given to Gandiva to evaluate, since they would require
Gandiva to allocate additional memory that is not controlled by the JVM.

I guess that's why Gandiva does not implement operators like HashJoin or
HashAggregate, which need to allocate additional memory. But Arrow's WIP
PR ARROW-3191 (https://github.com/apache/arrow/pull/4151) will make that
possible.
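
The contract described above — Drill sizes fixed-width output buffers up front so the native side never allocates — can be sketched as follows (plain Java arrays stand in for vectors; none of these names are real Gandiva or Drill APIs):

```java
import java.util.Arrays;

public class FixedWidthPrealloc {
    // For a fixed-width type, the output size is fully determined by the
    // row count, so the JVM side can allocate everything before calling
    // into native code. (A validity bitmap of ceil(rowCount / 8) bytes
    // would be preallocated the same way; it is omitted here.)
    static long[] preallocateOutput(int rowCount) {
        return new long[rowCount]; // e.g. one 8-byte slot per BIGINT row
    }

    // Stand-in for the native projection: reads inputs, writes into the
    // caller-provided buffer, allocates nothing itself.
    static void projectAddOne(long[] input, long[] output) {
        for (int i = 0; i < input.length; i++) {
            output[i] = input[i] + 1;
        }
    }

    public static void main(String[] args) {
        long[] in = {1, 2, 3};
        long[] out = preallocateOutput(in.length);
        projectAddOne(in, out);
        System.out.println(Arrays.toString(out)); // [2, 3, 4]
    }
}
```

A var-width output (say, string concatenation) breaks this contract because its size cannot be known from the row count alone, which is why those expressions stay on the Drill side.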

On Tue, Apr 23, 2019 at 7:15 AM Parth Chandra <pa...@apache.org> wrote:

> Is there a way to provide Drill's memory allocator to Gandiva/Arrow? If
> not, then how do we keep a proper accounting of any memory used by
> Gandiva/Arrow?
>
> On Sat, Apr 20, 2019 at 7:05 PM Paul Rogers <pa...@yahoo.com.invalid>
> wrote:
>
> > Hi Weijie,
> >
> > Thanks much for the explanation. Sounds like you are making good
> progress.
> >
> >
> > For which operator is the filter pushed into the scan? Although Impala
> > does this for all scans, AFAIK, Drill does not do so. For example, the
> text
> > and JSON reader do not handle filtering. Filtering is instead done by the
> > Filter operator in these cases. Perhaps you have your own special scan
> > which handles filtering?
> >
> >
> > The concern in DRILL-6340 was the user might do a project operation that
> > causes the output batch to be much larger than the input batch. Someone
> > suggested flatten as one example. String concatenation is another
> example.
> > The input batch might be large. The result of the concatenation could be
> > too large for available memory. So, the idea was to project the single
> > input batch into two (or more) output batches to control batch size.
> >
> >
> > II like how you've categorized the vectors into the set that Gandiva can
> > project, and the set that Drill must handle. Maybe you can extend this
> idea
> > for the case where input batches are split into multiple output batches.
> >
> >  Let Drill handle VarChar expressions that could increase column width
> > (such as the concatenate operator.) Let Drill decide the number of rows
> in
> > the output batch. Then, for the columns that Gandiva can handle, project
> > just those rows needed for the current output batch.
> >
> > Your solution might also be extended to handle the Gandiva library issue.
> > Since you are splitting vectors into the Drill group and the Gandiva
> group,
> > if Drill runs on a platform without Gandiva support, or if the Gandiva
> > library can't be found, just let all vectors fall into the Drill vector
> > group.
> >
> > If the user wants to use Gandiva, he/she could set a config option to
> > point to the Gandiva library (and supporting files, if any.) Or, use the
> > existing LD_LIBRARY_PATH env. variable.
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >     On Thursday, April 18, 2019, 11:45:08 PM PDT, weijie tong <
> > tongweijie178@gmail.com> wrote:
> >
> >  Hi Paul:
> > Currently Gandiva only supports Project ,Filter operations. My work is to
> > integrate Project operator. Since most of the Filter operator will be
> > pushed down to the Scan.
> >
> > The Gandiva project interface works at the RecordBatch level. It accepts
> > the memory address of the vectors of  input RecordBatch and . Before that
> > it also need to construct a binary schema object to describe the input
> > RecordBatch schema.
> >
> > The integration work mainly has two parts:
> >   1. at the setup step, find the expressions which can be solved by the
> > Gandiva . The matched expression will be solved by the Gandiva, others
> will
> > still be solved by Drill.
> >   2. invoking the Gandiva native project method. The matched expressions'
> > ValueVectors will all be allocated corresponding Arrow type null
> > representation ValueVector. The null input vector's bit  will also be
> set.
> > The same work will also be done to the output ValueVectors, transfer the
> > arrow output null vector to Drill's null vector. Since the native method
> > only care the physical memory address, invoking that native method is
> not a
> > hard work.
> >
> > Since my current implementation is before DRILL-6340, it does not solve
> the
> > output size of the project which is less than the input size case. To
> cover
> > that case , there's some more work to do which I have not focused on.
> >
> > To contribute to community , there's also some test case problem which
> > needs to be considered, since the Gandiva jar is platform dependent.
> >
> >
> >
> >
> > On Fri, Apr 19, 2019 at 8:43 AM Paul Rogers <pa...@yahoo.com.invalid>
> > wrote:
> >
> > > Hi Weijie,
> > >
> > > Thanks much for the update on your Gandiva work. It is great work.
> > >
> > > Can you say more about how you are doing the integration?
> > >
> > > As you mentioned the memory layout of Arrow's null vector differs from
> > the
> > > "is set" vector in Drill. How did you work around that?
> > >
> > > The Project operator is pretty simple if we are just copying or
> removing
> > > columns. However, much of Project deals with invoking Drill-provided
> > > functions: simple ones (add two ints) and complex ones (perform a regex
> > > match). To be useful, the integration would have to mimic Drill's
> > behavior
> > > for each of these many functions.
> > >
> > > Project currently works row-by-row. But, to get the maximum
> performance,
> > > it would work column-by-column to take full advantage of vectorization.
> > > Doing that would require large changes to the code that sets up
> codegen,
> > > and iterates over the batch.
> > >
> > >
> > > For operators such as Sort, the only vector-based operations are 1)
> sort
> > a
> > > batch using defined keys to get an offset vector, and 2) create a new
> > > vector by copying values, row-by-row, from one batch to another
> according
> > > to the offset vector.
> > >
> > > The join and aggregate operations are even more complex, as are the
> > > partition senders and receivers.
> > >
> > > Can you tell us where you've used Gandiva? Which operators? How did you
> > > handle the function integration? I am very curious how you were able to
> > > solve these problems.
> > >
> > >
> > > Thanks,
> > >
> > > - Paul
> > >
> > >
> > >
> > >    On Wednesday, April 3, 2019, 11:51:34 PM PDT, weijie tong <
> > > tongweijie178@gmail.com> wrote:
> > >
> > >  HI :
> > >
> > > Gandiva is a sub project of Arrow. Arrow gandiva using LLVM codegen and
> > > simd skill could achieve better query performance.  Arrow and Drill has
> > > similar column memory format. The main difference now is the null
> > > representation. Also Arrow has made great changes to the ValueVector.
> To
> > > adopt Arrow to replace Drill's VV has been discussed before. That would
> > be
> > > a great job. But to leverage gandiva , by working at the physical
> memory
> > > address level , this work could be little relatively.
> > >
> > > Now I have done the integration work at our own branch by make some
> > changes
> > > to the Arrow branch, and issued DRILL-7087 and ARROW-4819. The main
> > changes
> > > to ARROW-4819 is to make some package level method to be public. But
> > arrow
> > > community seems not plan to accept this change. Their advice is to
> have a
> > > arrow branch.
> > >
> > > So what do you think?
> > >
> > > 1、Have a self branch of Arrow.
> > > 2、waiting for the Arrow integration completely.
> > > or some other ideas?
>

Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by Parth Chandra <pa...@apache.org>.
Is there a way to provide Drill's memory allocator to Gandiva/Arrow? If
not, then how do we keep a proper accounting of any memory used by
Gandiva/Arrow?

On Sat, Apr 20, 2019 at 7:05 PM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> Hi Weijie,
>
> Thanks much for the explanation. Sounds like you are making good progress.
>
>
> For which operator is the filter pushed into the scan? Although Impala
> does this for all scans, AFAIK, Drill does not do so. For example, the text
> and JSON reader do not handle filtering. Filtering is instead done by the
> Filter operator in these cases. Perhaps you have your own special scan
> which handles filtering?
>
>
> The concern in DRILL-6340 was the user might do a project operation that
> causes the output batch to be much larger than the input batch. Someone
> suggested flatten as one example. String concatenation is another example.
> The input batch might be large. The result of the concatenation could be
> too large for available memory. So, the idea was to project the single
> input batch into two (or more) output batches to control batch size.
>
>
> II like how you've categorized the vectors into the set that Gandiva can
> project, and the set that Drill must handle. Maybe you can extend this idea
> for the case where input batches are split into multiple output batches.
>
>  Let Drill handle VarChar expressions that could increase column width
> (such as the concatenate operator.) Let Drill decide the number of rows in
> the output batch. Then, for the columns that Gandiva can handle, project
> just those rows needed for the current output batch.
>
> Your solution might also be extended to handle the Gandiva library issue.
> Since you are splitting vectors into the Drill group and the Gandiva group,
> if Drill runs on a platform without Gandiva support, or if the Gandiva
> library can't be found, just let all vectors fall into the Drill vector
> group.
>
> If the user wants to use Gandiva, he/she could set a config option to
> point to the Gandiva library (and supporting files, if any.) Or, use the
> existing LD_LIBRARY_PATH env. variable.
>
> Thanks,
> - Paul
>
>
>
>     On Thursday, April 18, 2019, 11:45:08 PM PDT, weijie tong <
> tongweijie178@gmail.com> wrote:
>
>  Hi Paul:
> Currently Gandiva only supports Project ,Filter operations. My work is to
> integrate Project operator. Since most of the Filter operator will be
> pushed down to the Scan.
>
> The Gandiva project interface works at the RecordBatch level. It accepts
> the memory address of the vectors of  input RecordBatch and . Before that
> it also need to construct a binary schema object to describe the input
> RecordBatch schema.
>
> The integration work mainly has two parts:
>   1. at the setup step, find the expressions which can be solved by the
> Gandiva . The matched expression will be solved by the Gandiva, others will
> still be solved by Drill.
>   2. invoking the Gandiva native project method. The matched expressions'
> ValueVectors will all be allocated corresponding Arrow type null
> representation ValueVector. The null input vector's bit  will also be set.
> The same work will also be done to the output ValueVectors, transfer the
> arrow output null vector to Drill's null vector. Since the native method
> only care the physical memory address, invoking that native method is not a
> hard work.
>
> Since my current implementation is before DRILL-6340, it does not solve the
> output size of the project which is less than the input size case. To cover
> that case , there's some more work to do which I have not focused on.
>
> To contribute to community , there's also some test case problem which
> needs to be considered, since the Gandiva jar is platform dependent.
>
>
>
>
> On Fri, Apr 19, 2019 at 8:43 AM Paul Rogers <pa...@yahoo.com.invalid>
> wrote:
>
> > Hi Weijie,
> >
> > Thanks much for the update on your Gandiva work. It is great work.
> >
> > Can you say more about how you are doing the integration?
> >
> > As you mentioned the memory layout of Arrow's null vector differs from
> the
> > "is set" vector in Drill. How did you work around that?
> >
> > The Project operator is pretty simple if we are just copying or removing
> > columns. However, much of Project deals with invoking Drill-provided
> > functions: simple ones (add two ints) and complex ones (perform a regex
> > match). To be useful, the integration would have to mimic Drill's
> behavior
> > for each of these many functions.
> >
> > Project currently works row-by-row. But, to get the maximum performance,
> > it would work column-by-column to take full advantage of vectorization.
> > Doing that would require large changes to the code that sets up codegen,
> > and iterates over the batch.
> >
> >
> > For operators such as Sort, the only vector-based operations are 1) sort
> a
> > batch using defined keys to get an offset vector, and 2) create a new
> > vector by copying values, row-by-row, from one batch to another according
> > to the offset vector.
> >
> > The join and aggregate operations are even more complex, as are the
> > partition senders and receivers.
> >
> > Can you tell us where you've used Gandiva? Which operators? How did you
> > handle the function integration? I am very curious how you were able to
> > solve these problems.
> >
> >
> > Thanks,
> >
> > - Paul
> >
> >
> >
> >    On Wednesday, April 3, 2019, 11:51:34 PM PDT, weijie tong <
> > tongweijie178@gmail.com> wrote:
> >
> >  HI :
> >
> > Gandiva is a sub project of Arrow. Arrow gandiva using LLVM codegen and
> > simd skill could achieve better query performance.  Arrow and Drill has
> > similar column memory format. The main difference now is the null
> > representation. Also Arrow has made great changes to the ValueVector. To
> > adopt Arrow to replace Drill's VV has been discussed before. That would
> be
> > a great job. But to leverage gandiva , by working at the physical memory
> > address level , this work could be little relatively.
> >
> > Now I have done the integration work at our own branch by make some
> changes
> > to the Arrow branch, and issued DRILL-7087 and ARROW-4819. The main
> changes
> > to ARROW-4819 is to make some package level method to be public. But
> arrow
> > community seems not plan to accept this change. Their advice is to have a
> > arrow branch.
> >
> > So what do you think?
> >
> > 1、Have a self branch of Arrow.
> > 2、waiting for the Arrow integration completely.
> > or some other ideas?

Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Weijie,

Thanks much for the explanation. Sounds like you are making good progress.


For which operator is the filter pushed into the scan? Although Impala does this for all scans, AFAIK, Drill does not do so. For example, the text and JSON reader do not handle filtering. Filtering is instead done by the Filter operator in these cases. Perhaps you have your own special scan which handles filtering?


The concern in DRILL-6340 was the user might do a project operation that causes the output batch to be much larger than the input batch. Someone suggested flatten as one example. String concatenation is another example. The input batch might be large. The result of the concatenation could be too large for available memory. So, the idea was to project the single input batch into two (or more) output batches to control batch size.


I like how you've categorized the vectors into the set that Gandiva can project, and the set that Drill must handle. Maybe you can extend this idea for the case where input batches are split into multiple output batches.

 Let Drill handle VarChar expressions that could increase column width (such as the concatenate operator.) Let Drill decide the number of rows in the output batch. Then, for the columns that Gandiva can handle, project just those rows needed for the current output batch.

Your solution might also be extended to handle the Gandiva library issue. Since you are splitting vectors into the Drill group and the Gandiva group, if Drill runs on a platform without Gandiva support, or if the Gandiva library can't be found, just let all vectors fall into the Drill vector group.
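
The split described above could be sketched as follows (a hypothetical partitioning step with assumed names, not real Drill code — the support predicate is a placeholder):

```java
import java.util.ArrayList;
import java.util.List;

public class ExpressionSplit {
    // Partition expressions: those Gandiva can evaluate go to gandivaOut;
    // the rest — or all of them, when the native library is unavailable —
    // fall back to Drill's own codegen via drillOut.
    static void partition(List<String> exprs, boolean gandivaAvailable,
                          List<String> gandivaOut, List<String> drillOut) {
        for (String e : exprs) {
            if (gandivaAvailable && isGandivaSupported(e)) {
                gandivaOut.add(e);
            } else {
                drillOut.add(e);
            }
        }
    }

    // Assumed predicate for illustration: fixed-width arithmetic is
    // supported; width-increasing VarChar work like concat is not.
    static boolean isGandivaSupported(String expr) {
        return !expr.contains("concat");
    }

    public static void main(String[] args) {
        List<String> gandiva = new ArrayList<>();
        List<String> drill = new ArrayList<>();
        partition(List.of("a + b", "concat(c, d)"), true, gandiva, drill);
        System.out.println(gandiva + " / " + drill); // [a + b] / [concat(c, d)]
    }
}
```

Passing `gandivaAvailable = false` sends every expression down the Drill path, which is the graceful-degradation behavior suggested for platforms without the native library.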

If the user wants to use Gandiva, he/she could set a config option to point to the Gandiva library (and supporting files, if any.) Or, use the existing LD_LIBRARY_PATH env. variable.

Thanks,
- Paul

 

    On Thursday, April 18, 2019, 11:45:08 PM PDT, weijie tong <to...@gmail.com> wrote:  
 
 Hi Paul:
Currently Gandiva only supports Project ,Filter operations. My work is to
integrate Project operator. Since most of the Filter operator will be
pushed down to the Scan.

The Gandiva project interface works at the RecordBatch level. It accepts
the memory address of the vectors of  input RecordBatch and . Before that
it also need to construct a binary schema object to describe the input
RecordBatch schema.

The integration work mainly has two parts:
  1. at the setup step, find the expressions which can be solved by the
Gandiva . The matched expression will be solved by the Gandiva, others will
still be solved by Drill.
  2. invoking the Gandiva native project method. The matched expressions'
ValueVectors will all be allocated corresponding Arrow type null
representation ValueVector. The null input vector's bit  will also be set.
The same work will also be done to the output ValueVectors, transfer the
arrow output null vector to Drill's null vector. Since the native method
only care the physical memory address, invoking that native method is not a
hard work.

Since my current implementation is before DRILL-6340, it does not solve the
output size of the project which is less than the input size case. To cover
that case , there's some more work to do which I have not focused on.

To contribute to community , there's also some test case problem which
needs to be considered, since the Gandiva jar is platform dependent.




On Fri, Apr 19, 2019 at 8:43 AM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> Hi Weijie,
>
> Thanks much for the update on your Gandiva work. It is great work.
>
> Can you say more about how you are doing the integration?
>
> As you mentioned the memory layout of Arrow's null vector differs from the
> "is set" vector in Drill. How did you work around that?
>
> The Project operator is pretty simple if we are just copying or removing
> columns. However, much of Project deals with invoking Drill-provided
> functions: simple ones (add two ints) and complex ones (perform a regex
> match). To be useful, the integration would have to mimic Drill's behavior
> for each of these many functions.
>
> Project currently works row-by-row. But, to get the maximum performance,
> it would work column-by-column to take full advantage of vectorization.
> Doing that would require large changes to the code that sets up codegen,
> and iterates over the batch.
>
>
> For operators such as Sort, the only vector-based operations are 1) sort a
> batch using defined keys to get an offset vector, and 2) create a new
> vector by copying values, row-by-row, from one batch to another according
> to the offset vector.
>
> The join and aggregate operations are even more complex, as are the
> partition senders and receivers.
>
> Can you tell us where you've used Gandiva? Which operators? How did you
> handle the function integration? I am very curious how you were able to
> solve these problems.
>
>
> Thanks,
>
> - Paul
>
>
>
>    On Wednesday, April 3, 2019, 11:51:34 PM PDT, weijie tong <
> tongweijie178@gmail.com> wrote:
>
>  HI :
>
> Gandiva is a sub project of Arrow. Arrow gandiva using LLVM codegen and
> simd skill could achieve better query performance.  Arrow and Drill has
> similar column memory format. The main difference now is the null
> representation. Also Arrow has made great changes to the ValueVector. To
> adopt Arrow to replace Drill's VV has been discussed before. That would be
> a great job. But to leverage gandiva , by working at the physical memory
> address level , this work could be little relatively.
>
> Now I have done the integration work at our own branch by make some changes
> to the Arrow branch, and issued DRILL-7087 and ARROW-4819. The main changes
> to ARROW-4819 is to make some package level method to be public. But arrow
> community seems not plan to accept this change. Their advice is to have a
> arrow branch.
>
> So what do you think?
>
> 1、Have a self branch of Arrow.
> 2、waiting for the Arrow integration completely.
> or some other ideas?  

Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by weijie tong <to...@gmail.com>.
Hi Paul:
Currently Gandiva only supports Project ,Filter operations. My work is to
integrate Project operator. Since most of the Filter operator will be
pushed down to the Scan.

The Gandiva project interface works at the RecordBatch level: it accepts
the memory addresses of the vectors of the input RecordBatch. Before
invoking it, one also needs to construct a binary schema object describing
the input RecordBatch's schema.

The integration work mainly has two parts:
  1. At the setup step, find the expressions that Gandiva can evaluate.
The matched expressions are handled by Gandiva; the others are still
handled by Drill's own code generation.
  2. Invoke the Gandiva native project method. Each matched expression's
input ValueVectors are given a corresponding Arrow-style null-representation
vector, and the bits for the null input values are set accordingly. The
same conversion is done in reverse for the output ValueVectors, transferring
the Arrow null vector back to Drill's null vector. Since the native method
only cares about physical memory addresses, invoking it is not hard.
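The setup step above can be sketched as follows. This is a minimal
illustration, not Drill's actual code: the `Expr` class and the `SUPPORTED`
set are hypothetical stand-ins (a real integration would consult Gandiva's
function registry), and an expression tree goes to Gandiva only if every
function in it is supported.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal expression node: a function name plus child expressions.
class Expr {
    final String fn;
    final List<Expr> children;

    Expr(String fn, Expr... children) {
        this.fn = fn;
        this.children = Arrays.asList(children);
    }
}

class GandivaSetup {
    // Hypothetical supported-function set; the real integration would
    // query Gandiva's registry of supported functions.
    static final Set<String> SUPPORTED =
        new HashSet<>(Arrays.asList("add", "multiply", "col", "literal"));

    // An expression can be handed to Gandiva only if every function in
    // its tree is supported; otherwise Drill's own codegen handles it.
    static boolean gandivaEligible(Expr e) {
        if (!SUPPORTED.contains(e.fn)) {
            return false;
        }
        for (Expr c : e.children) {
            if (!gandivaEligible(c)) {
                return false;
            }
        }
        return true;
    }
}
```

Expressions that fail this check simply stay on Drill's existing
code-generation path, so the two engines can coexist within one Project.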

Since my current implementation predates DRILL-6340, it does not handle the
case where the Project's output row count is less than its input row count.
Covering that case requires some more work that I have not yet focused on.

To contribute this to the community, there is also a test-case problem to
consider, since the Gandiva jar is platform dependent.




On Fri, Apr 19, 2019 at 8:43 AM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> Hi Weijie,
>
> Thanks much for the update on your Gandiva work. It is great work.
>
> Can you say more about how you are doing the integration?
>
> As you mentioned the memory layout of Arrow's null vector differs from the
> "is set" vector in Drill. How did you work around that?
>
> The Project operator is pretty simple if we are just copying or removing
> columns. However, much of Project deals with invoking Drill-provided
> functions: simple ones (add two ints) and complex ones (perform a regex
> match). To be useful, the integration would have to mimic Drill's behavior
> for each of these many functions.
>
> Project currently works row-by-row. But, to get the maximum performance,
> it would work column-by-column to take full advantage of vectorization.
> Doing that would require large changes to the code that sets up codegen,
> and iterates over the batch.
>
>
> For operators such as Sort, the only vector-based operations are 1) sort a
> batch using defined keys to get an offset vector, and 2) create a new
> vector by copying values, row-by-row, from one batch to another according
> to the offset vector.
>
> The join and aggregate operations are even more complex, as are the
> partition senders and receivers.
>
> Can you tell us where you've used Gandiva? Which operators? How did you
> handle the function integration? I am very curious how you were able to
> solve these problems.
>
>
> Thanks,
>
> - Paul
>
>
>
>     On Wednesday, April 3, 2019, 11:51:34 PM PDT, weijie tong <
> tongweijie178@gmail.com> wrote:
>
>  HI :
>
> Gandiva is a sub project of Arrow. Arrow gandiva using LLVM codegen and
> simd skill could achieve better query performance.  Arrow and Drill has
> similar column memory format. The main difference now is the null
> representation. Also Arrow has made great changes to the ValueVector. To
> adopt Arrow to replace Drill's VV has been discussed before. That would be
> a great job. But to leverage gandiva , by working at the physical memory
> address level , this work could be little relatively.
>
> Now I have done the integration work at our own branch by make some changes
> to the Arrow branch, and issued DRILL-7087 and ARROW-4819. The main changes
> to ARROW-4819 is to make some package level method to be public. But arrow
> community seems not plan to accept this change. Their advice is to have a
> arrow branch.
>
> So what do you think?
>
> 1、Have a self branch of Arrow.
> 2、waiting for the Arrow integration completely.
> or some other ideas?

Re: [Discuss] Integrate Arrow gandiva into Drill

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Weijie,

Thanks much for the update on your Gandiva work. It is great work.

Can you say more about how you are doing the integration?

As you mentioned the memory layout of Arrow's null vector differs from the "is set" vector in Drill. How did you work around that?
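For reference, the layout difference is that Drill's "is set" vector uses
one byte per value, while Arrow's validity buffer packs one bit per value
(least-significant bit first within each byte). A minimal sketch of the
conversion, with plain byte arrays standing in for the actual direct
buffers:

```java
class NullConversion {
    // Drill "is set" bytes (1 = set, 0 = null) -> Arrow validity bitmap.
    static byte[] toArrowValidity(byte[] drillIsSet) {
        byte[] bitmap = new byte[(drillIsSet.length + 7) / 8];
        for (int i = 0; i < drillIsSet.length; i++) {
            if (drillIsSet[i] != 0) {
                // Set bit i, LSB-first within each byte.
                bitmap[i / 8] |= (byte) (1 << (i % 8));
            }
        }
        return bitmap;
    }

    // Arrow validity bitmap -> Drill "is set" bytes.
    static byte[] toDrillIsSet(byte[] bitmap, int valueCount) {
        byte[] isSet = new byte[valueCount];
        for (int i = 0; i < valueCount; i++) {
            isSet[i] = (byte) ((bitmap[i / 8] >> (i % 8)) & 1);
        }
        return isSet;
    }
}
```

The conversion is O(n) per vector in each direction, which is presumably
the cost weijie's integration pays at the boundary of each native call.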

The Project operator is pretty simple if we are just copying or removing columns. However, much of Project deals with invoking Drill-provided functions: simple ones (add two ints) and complex ones (perform a regex match). To be useful, the integration would have to mimic Drill's behavior for each of these many functions.

Project currently works row-by-row. But, to get the maximum performance, it would work column-by-column to take full advantage of vectorization. Doing that would require large changes to the code that sets up codegen, and iterates over the batch.
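As a concrete illustration of the column-by-column style, a projection
kernel that runs a tight loop over whole columns is the kind of code a JIT
can auto-vectorize with SIMD; this is a simplified sketch using plain int
arrays rather than Drill's ValueVectors.

```java
class ColumnarAdd {
    // Column-at-a-time kernel: one tight loop per expression over the
    // whole batch, instead of evaluating every expression per row.
    static void addColumns(int[] a, int[] b, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }
}
```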


For operators such as Sort, the only vector-based operations are 1) sort a batch using defined keys to get an offset vector, and 2) create a new vector by copying values, row-by-row, from one batch to another according to the offset vector.
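Those two steps can be sketched with plain arrays, as a simplified stand-in
for Drill's selection vectors and ValueVectors:

```java
import java.util.Arrays;
import java.util.Comparator;

class SortGather {
    // Step 1: sort row indices by key to build an offset (selection) vector;
    // the data itself is not moved.
    static Integer[] buildOffsetVector(int[] keys) {
        Integer[] sv = new Integer[keys.length];
        for (int i = 0; i < sv.length; i++) {
            sv[i] = i;
        }
        Arrays.sort(sv, Comparator.comparingInt(i -> keys[i]));
        return sv;
    }

    // Step 2: copy values row by row into a new vector, following the
    // offset vector.
    static int[] gather(int[] values, Integer[] sv) {
        int[] out = new int[values.length];
        for (int i = 0; i < sv.length; i++) {
            out[i] = values[sv[i]];
        }
        return out;
    }
}
```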

The join and aggregate operations are even more complex, as are the partition senders and receivers.

Can you tell us where you've used Gandiva? Which operators? How did you handle the function integration? I am very curious how you were able to solve these problems.


Thanks,

- Paul

 

    On Wednesday, April 3, 2019, 11:51:34 PM PDT, weijie tong <to...@gmail.com> wrote:  
 
 HI :

Gandiva is a sub project of Arrow. Arrow gandiva using LLVM codegen and
simd skill could achieve better query performance.  Arrow and Drill has
similar column memory format. The main difference now is the null
representation. Also Arrow has made great changes to the ValueVector. To
adopt Arrow to replace Drill's VV has been discussed before. That would be
a great job. But to leverage gandiva , by working at the physical memory
address level , this work could be little relatively.

Now I have done the integration work at our own branch by make some changes
to the Arrow branch, and issued DRILL-7087 and ARROW-4819. The main changes
to ARROW-4819 is to make some package level method to be public. But arrow
community seems not plan to accept this change. Their advice is to have a
arrow branch.

So what do you think?

1、Have a self branch of Arrow.
2、waiting for the Arrow integration completely.
or some other ideas?