Posted to user@drill.apache.org by Adam Gilmore <dr...@gmail.com> on 2015/05/27 04:26:15 UTC

Custom UDFS slow

Hi guys,

I have written a couple of custom UDFs (specifically WEEK() and WEEKYEAR()
to get that date information out of timestamps).

I timed two queries (on approx. 11 million records in Parquet files):

select count(*) from `table` group by extract(day from `timestamp`)

750ms

select count(*) from `table` group by week(`timestamp`)

2100ms

The code for the WEEK() function is not far from the code from the source
for the EXTRACT(DAY) function.  Furthermore, even if I copy the exact code
for the EXTRACT(DAY) function into that, it has the same performance
detriments.

My question is, why would a UDF be so much slower?  Is this by design or is
there something I'm missing?

Happy to attach the source code of the function if that helps.

Re: Custom UDFS slow

Posted by Steven Phillips <sp...@maprtech.com>.
Could you include the physical plan generated for each query?

Since you say you tried copying the exact code from Drill's EXTRACT
function, you should see the same performance, unless for some reason the
plan is different. There is no difference whatsoever between UDFs and built
in functions. Built in functions are simply UDFs that happen to be packaged
with Drill, but otherwise there is nothing special about them.
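
For reference, a minimal sketch of how the two plans could be captured,
assuming the queries are run exactly as quoted in the original post (the
table and column names are taken from there, and week() is the poster's
custom UDF, not a function shipped with Drill):

```sql
-- Physical plan for the built-in EXTRACT version
EXPLAIN PLAN FOR
select count(*) from `table` group by extract(day from `timestamp`);

-- Physical plan for the custom UDF version (week() is the user's own function)
EXPLAIN PLAN FOR
select count(*) from `table` group by week(`timestamp`);
```

Comparing the two outputs side by side should show whether the second plan
contains anything the first does not, such as an extra cast or project step.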




-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Custom UDFS slow

Posted by Ted Dunning <te...@gmail.com>.
On Tue, May 26, 2015 at 7:26 PM, Adam Gilmore <dr...@gmail.com> wrote:

> The code for the WEEK() function is not far from the code from the source
> for the EXTRACT(DAY) function.  Furthermore, even if I copy the exact code
> for the EXTRACT(DAY) function into that, it has the same performance
> detriments.
>
> My question is, why would a UDF be so much slower?  Is this by design or is
> there something I'm missing?
>
> Happy to attach the source code of the function if that helps.
>

Well, you might want to try exactly copying the source of the extract
function.  I would expect that you would get just the same performance
since UDFs use the same mechanism as physical operators.

Two possibilities are:

1) the Java optimizer has seen something subtle about your code or the
built in code that allows for economical implementation

2) the Drill optimizer has some kind of special trick that it has figured
out

3) your code has forced the Drill optimizer to insert some sort of data type
conversion

(the third option is a bonus, just for you)


The fourth option that I don't know about is also quite a likely
possibility.

Seeing your code (put it in a gist, don't attach it) would help a lot.
Seeing queries and query plans would help as well.