Posted to dev@drill.apache.org by Jason Altekruse <al...@gmail.com> on 2015/01/16 17:41:23 UTC
Possible modification to drill function interface

Hello all,

I was trying to document the interfaces we expect developers to interact
with when working with Drill and ran into a possible refactoring we might
want to do for UDFs. Currently the setup method of a UDF, which is designed
to run once before the function is evaluated on any of the input data,
takes an instance of RecordBatch. (See the other discussion thread about
renaming this; it is our base class for operators, not a data structure.)
This input is rarely used, and I think it should possibly be removed.

The only current use of this input is finding out the current time and
timezone from the fragment context of the record batch.

We already have a mechanism for giving UDFs a hook into the wider context
of execution, in the form of the @Inject annotation. This is currently only
implemented for a single type, DrillBuf, our primary storage buffer type
for all data in Drill. An injected DrillBuf provides a reusable temporary
buffer that can be re-allocated as needed. This is used for cases where we
have variable length data produced by a UDF and need a place to store the
intermediate work of the function. To allow these buffers to be accounted
for, they must be connected to the fragment's memory allocator, which is
done when they are created and injected into the runtime generated
expression evaluation code.

I believe we should do something similar and provide a wrapper object for
the current time and timezone information, which is currently gathered
through the direct reference to the RecordBatch provided in the setup
method.
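
To make the idea concrete, here is a rough sketch of what such a wrapper
could look like. None of these names exist in Drill: QueryDateTimeInfo,
FixedQueryDateTimeInfo and NowFunction are hypothetical, the field is
assigned by hand where an @Inject-style mechanism would populate it, and
java.time is used only to keep the example self-contained.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Hypothetical: the only piece of query context this UDF is allowed to see.
interface QueryDateTimeInfo {
    Instant queryStartTime();
    ZoneId timeZone();
}

// Simple immutable implementation; the engine (or a test) would build one
// per query and hand it to every function that asks for it.
final class FixedQueryDateTimeInfo implements QueryDateTimeInfo {
    private final Instant queryStartTime;
    private final ZoneId timeZone;

    FixedQueryDateTimeInfo(Instant queryStartTime, ZoneId timeZone) {
        this.queryStartTime = queryStartTime;
        this.timeZone = timeZone;
    }

    @Override public Instant queryStartTime() { return queryStartTime; }
    @Override public ZoneId timeZone() { return timeZone; }
}

// Hypothetical UDF shape: setup() takes no RecordBatch argument. The context
// arrives through a field, which an injection mechanism would set.
final class NowFunction {
    QueryDateTimeInfo dateTimeInfo; // would be injected, set by hand here
    private ZonedDateTime now;

    // Runs once, before the function is evaluated on any input.
    void setup() {
        now = dateTimeInfo.queryStartTime().atZone(dateTimeInfo.timeZone());
    }

    // Returns the same value for every row of the query.
    ZonedDateTime eval() {
        return now;
    }
}
```

The point of the narrow interface is that a UDF written this way cannot
reach anything else on the operator, so the set of context we expose stays
explicit.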

I had tabled this work, as it was not a bug, but instead a clarification of
an API. We should have a limited set of fragment/query context available to
UDF writers and be explicit about it.

This has re-emerged as I have been trying to allow for more advanced
filters against our generated partition columns, to allow for at least
constant expression evaluation in determining a folder or partition to
read. The current use case I am trying to enable is finding a 'most recent'
folder using the now() function and formatting the date to match a folder
naming pattern for dates.

To do this I have been looking at the Interpreted expression evaluation
code that was added to the codebase but has not been hooked up to partition
pruning. The interpreted expression evaluator currently passes a record
batch into the evaluator to satisfy the interface of the UDFs, but a
primary place where we were planning on using interpreted expression
evaluation is at planning time, such as the case with partition pruning. At
planning time we do not have a RecordBatch available to pass into the
evaluator, and trying to create a mock implementation of the interface
seems like a bit of a hack to say the least.
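
As an illustration of the planning-time case, the sketch below evaluates a
constant expression once, with no rows and no RecordBatch, and uses the
result to pick a folder. It is not the existing interpreter code:
PlanningTimePruner, ConstantContext and todayFolder are hypothetical names,
and the yyyy-MM-dd folder pattern and UTC zone are assumptions made for the
example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class PlanningTimePruner {

    // Hypothetical stand-in for the small context a constant expression
    // needs. It can be built by the planner, before any operator exists.
    static final class ConstantContext {
        final Instant queryStart;

        ConstantContext(Instant queryStart) {
            this.queryStart = queryStart;
        }
    }

    // Example constant expression: format now() to match an assumed
    // yyyy-MM-dd folder naming pattern.
    static String todayFolder(ConstantContext ctx) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd")
                .withZone(ZoneOffset.UTC)
                .format(ctx.queryStart);
    }

    // Evaluates the expression once and keeps only the matching folders.
    static List<String> prune(List<String> folders,
                              Function<ConstantContext, String> constantExpr,
                              ConstantContext ctx) {
        String wanted = constantExpr.apply(ctx);
        List<String> kept = new ArrayList<>();
        for (String folder : folders) {
            if (folder.equals(wanted)) {
                kept.add(folder);
            }
        }
        return kept;
    }
}
```

Calling prune with the folder list, PlanningTimePruner::todayFolder and a
context built from the query start time returns only the folder for that
date, which is the behaviour partition pruning would need from a now()
filter.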

Let me know your thoughts on how best to modify the interface.

Thanks,
Jason