Posted to dev@drill.apache.org by Igor Guzenko <ih...@gmail.com> on 2019/08/08 20:34:23 UTC

Re: [QUESTION]: Caching UDFs

Hello Charles,

Although the idea seems good in general, an actual implementation may cause
more problems than it solves. If you are thinking about a distributed cache,
then every cache miss on the current drillbit will cause a network request to
other drillbits, passing the function arguments along with the ids of the
executing query fragments. A local cache is more suitable, of course, but
another problem is that caching only pays off for repeated arguments, and
especially when identical arguments arrive one after another (for example,
when rows are presorted by the arguments before the function executes). When
the arguments are mostly distinct, caching will cause heavy memory overhead.
One case where caching may perform well is when the rows are known to be
sorted by the function arguments, so that a single function call result can
be cached locally for each run of repeated rows. For example, suppose that we
are executing the query

select x, y, slow_function(x, y) from (select x, y from dfs.`large_table`
order by x, y)

x y  slow_function(x,y)
1 1  (calculate and cache)
1 1  (get cached)
1 1  (get cached)
1 1  (get cached)
1 2  (calculate and cache)
1 2  (get cached)
1 2  (get cached)

in such a case the heavy logic in slow_function(x,y) will be executed only
twice for these seven rows. But here the ordering by x, y will most probably
cancel out all the benefits provided by caching.
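
As a rough illustration of the single-entry local cache described above, here
is a minimal Java sketch. It is not Drill code and does not use Drill's UDF
API: the class name LastCallCache, the evaluations() counter, and the lambda
standing in for slow_function are all hypothetical, and the arguments are
assumed to be plain long values. It keeps only the most recent arguments and
result, so memory use stays constant however many rows pass through.

```java
import java.util.function.LongBinaryOperator;

// Hypothetical helper, not part of Drill's UDF API: remembers only the most
// recent (x, y) -> result pair, which is enough when rows arrive sorted by x, y.
final class LastCallCache {
    private final LongBinaryOperator slowFunction;
    private boolean hasValue = false;
    private long lastX;
    private long lastY;
    private long lastResult;
    private int evaluations = 0; // counts real invocations, for illustration only

    LastCallCache(LongBinaryOperator slowFunction) {
        this.slowFunction = slowFunction;
    }

    long apply(long x, long y) {
        if (hasValue && x == lastX && y == lastY) {
            return lastResult; // same arguments as the previous row: reuse the result
        }
        // Arguments changed (or first call): compute and overwrite the single entry.
        lastResult = slowFunction.applyAsLong(x, y);
        lastX = x;
        lastY = y;
        hasValue = true;
        evaluations++;
        return lastResult;
    }

    int evaluations() {
        return evaluations;
    }

    public static void main(String[] args) {
        // Stand-in for an expensive function (assumption for the demo).
        LastCallCache cached = new LastCallCache((x, y) -> x * 31 + y);
        // The seven sorted rows from the example above.
        long[][] rows = {{1, 1}, {1, 1}, {1, 1}, {1, 1}, {1, 2}, {1, 2}, {1, 2}};
        for (long[] row : rows) {
            cached.apply(row[0], row[1]);
        }
        System.out.println("evaluations: " + cached.evaluations()); // prints "evaluations: 2"
    }
}
```

With unsorted input the same sketch degrades to recomputing on almost every
row, which is why it only helps when the sort order comes for free.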

Thanks,
Igor

On Thu, Aug 8, 2019 at 7:46 PM Charles Givre <ch...@gtkcyber.com>
wrote:

> Hello Drill Devs,
> I have a question about UDFs.  Let's say you have a non-trivial UDF called
> foo(x,y) which returns some value.  Assuming that if the arguments are the
> same, the function foo() will return the same result, does Drill have any
> optimizations to prevent running the non-trivial function?
>
> I was thinking that it might make sense to cache the arguments and results
> in memory and before the function is executed, check the cache to see if
> they're there.  If they are, return the cached results, and if not, execute
> the function.  I was thinking that for some functions, like date/time
> functions, we might want to include something in the code to ensure that
> the results do not get cached.
>
> Thoughts?
>
>
> Charles S. Givre CISSP
> Data Scientist,
> Co-Founder GTK Cyber LLC
>
> charles.givre@gtkcyber.com
> *Mobile*: (443) 762-3286
>
>
> <https://www.linkedin.com/in/cgivre/>
>
>