You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by Ryan Shanks <ry...@softcomcorp.com> on 2018/04/07 21:08:13 UTC

Non-column filters in Drill

Hi Drill Dev Team!

I am writing a custom storage plugin and I am curious if it is possible 
in Drill to pass a filter value, in the form of a where clause, that is 
not related to a column. What I would like to accomplish is something like:

select * from myTable where notColumn = 'value';

In the example, notColumn is not a column in myTable, or any other 
table, it is just a specific parameter that the storage plugin will use 
in the filtering process. Additionally, notColumn would not be returned 
as a column so Drill needs to not expect it as a part of the 'select *'. 
I created a rule that will push down and remove these non-column filter 
calls, but I need to somehow tell drill/calcite that the filter name is 
valid, without actually registering it as a column. The following error 
occurs prior to submitting any rules:

org.apache.drill.common.exceptions.UserRemoteException: VALIDATION 
ERROR: From line 1, column 35 to line 1, column 39: Column 'notColumn' 
not found in any table


Alternatively, can I manipulate star queries to only return a subset of 
all the columns for a table?

Any insight would be greatly appreciated!

Thanks,
Ryan

Re: Non-column filters in Drill

Posted by Ted Dunning <te...@gmail.com>.
Ryan

What would happen if you defined a column so that you could use the normal
pushdown mechanism? In most cases, you wouldn't return the value of the
column since the only purpose is to use as a filter, but nothing should
prevent you from returning the value of this not-really-a-column.

By letting it be considered as if a column, all of the normal mechanisms
can be brought to bear.

Another case can be seen in how the CSV reader lets your inject separators
and such.

On Sat, Apr 7, 2018, 16:23 Ryan Shanks <ry...@softcomcorp.com> wrote:

> Hi Drill Dev Team!
>
> I am writing a custom storage plugin and I am curious if it is possible
> in Drill to pass a filter value, in the form of a where clause, that is
> not related to a column. What I would like to accomplish is something like:
>
> select * from myTable where notColumn = 'value';
>
> In the example, notColumn is not a column in myTable, or any other
> table, it is just a specific parameter that the storage plugin will use
> in the filtering process. Additionally, notColumn would not be returned
> as a column so Drill needs to not expect it as a part of the 'select *'.
> I created a rule that will push down and remove these non-column filter
> calls, but I need to somehow tell drill/calcite that the filter name is
> valid, without actually registering it as a column. The following error
> occurs prior to submitting any rules:
>
> org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
> ERROR: From line 1, column 35 to line 1, column 39: Column 'notColumn'
> not found in any table
>
>
> Alternatively, can I manipulate star queries to only return a subset of
> all the columns for a table?
>
> Any insight would be greatly appreciated!
>
> Thanks,
> Ryan
>

Re: Non-column filters in Drill

Posted by Ryan Shanks <ry...@softcomcorp.com>.
Thanks for all the suggestions everyone! It looks like table functions 
are the closest to what I am looking for.

- Ryan


On 2018-04-08 02:25 AM, Paul Rogers wrote:
> Hi Ryan,
>
> There is an obscure, but very handy feature of Drill called table functions. [1] These allow you to set parameters of your format plugin as part of a query.
>
> You mentioned a storage plugin. I've not tried a table function with a storage plugin. I have tested table functions with a format plugin.
>
> Your format or storage plugin has a Jackson-serializable Java class. Normally you set the properties for your plugin in the Drill web console. But, these can also be set in the table function.
>
> I had a use case something like yours. I defined an example "regex" plugin where the user can specify a regular expression to apply to to a text file to parse columns. The use can then provide a list of column names. Using the table function, I could specify the regex and column names per-query.
>
> This exercise did, however, point out two current limitations of table functions. First, they work only with simple data types (strings, ints). (DRILL-6169) So, my list of columns has to be a single string with a comma delimited list of columns. I could not use the more natural list of strings. Second, table functions do not retain the configured value of parameters: you have to include all parameters in the function, not just the ones you want to change. (DRILL-6168)
>
> Yet another option is to set a session option. However, unless you do a bit of clever coding, format plugins don't have visibility to session options (DRILL-5181).
>
> Perhaps your use case provides a compelling reason to fix some of these limitations...
>
> Thanks,
>
> - Paul
>
> [1] https://drill.apache.org/docs/plugin-configuration-basics/#using-the-formats-attributes-as-table-function-parameters, see the section "Using the Formats Attributes as Table Function Parameters".
>
>
>      On Saturday, April 7, 2018, 10:37:05 PM PDT, Aman Sinha <am...@apache.org> wrote:
>   
>   A better option would be to have a user-defined function that takes 2
> parameters and evaluates to a boolean value.
>   e.g  select * from myTable where MyUDF(notColumn, 'value')  IS TRUE;
>
> The Storage Plugin that you are developing would need to implement a
> pushdown rule that  looks
> at the filter condition and if it contains 'MyUDF()', it would pushdown to
> the scan/reader corresponding to your plugin.
>
>
> On Sat, Apr 7, 2018 at 6:58 PM, Hanumath Rao Maduri <ha...@gmail.com>
> wrote:
>
>> Hello Ryan,
>>
>> Thank you for trying out Drill. Drill/Calcite expects "notColumn" to be
>> supplied by the underlying scan.
>> However, I expect that this column will be present in the scan but not past
>> the filter (notColumn = 'value') in the plan.
>> In that case you may need to pushdown the filter to the groupScan and then
>> remove the column projections from your custom groupscan.
>>
>> It would be easy for us to guess what could be the issue, if you can post
>> the logical and physical query plan's for this query.
>>
>> Hope this helps. Please do let us know if you have any further issues.
>>
>> Thanks,
>>
>>
>> On Sat, Apr 7, 2018 at 2:08 PM, Ryan Shanks <ry...@softcomcorp.com>
>> wrote:
>>
>>> Hi Drill Dev Team!
>>>
>>> I am writing a custom storage plugin and I am curious if it is possible
>> in
>>> Drill to pass a filter value, in the form of a where clause, that is not
>>> related to a column. What I would like to accomplish is something like:
>>>
>>> select * from myTable where notColumn = 'value';
>>>
>>> In the example, notColumn is not a column in myTable, or any other table,
>>> it is just a specific parameter that the storage plugin will use in the
>>> filtering process. Additionally, notColumn would not be returned as a
>>> column so Drill needs to not expect it as a part of the 'select *'. I
>>> created a rule that will push down and remove these non-column filter
>>> calls, but I need to somehow tell drill/calcite that the filter name is
>>> valid, without actually registering it as a column. The following error
>>> occurs prior to submitting any rules:
>>>
>>> org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
>> ERROR:
>>>  From line 1, column 35 to line 1, column 39: Column 'notColumn' not found
>>> in any table
>>>
>>>
>>> Alternatively, can I manipulate star queries to only return a subset of
>>> all the columns for a table?
>>>
>>> Any insight would be greatly appreciated!
>>>
>>> Thanks,
>>> Ryan
>>>


Re: Non-column filters in Drill

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Ryan,

There is an obscure, but very handy feature of Drill called table functions. [1] These allow you to set parameters of your format plugin as part of a query.

You mentioned a storage plugin. I've not tried a table function with a storage plugin. I have tested table functions with a format plugin.

Your format or storage plugin has a Jackson-serializable Java class. Normally you set the properties for your plugin in the Drill web console. But, these can also be set in the table function.

I had a use case something like yours. I defined an example "regex" plugin where the user can specify a regular expression to apply to to a text file to parse columns. The use can then provide a list of column names. Using the table function, I could specify the regex and column names per-query.

This exercise did, however, point out two current limitations of table functions. First, they work only with simple data types (strings, ints). (DRILL-6169) So, my list of columns has to be a single string with a comma delimited list of columns. I could not use the more natural list of strings. Second, table functions do not retain the configured value of parameters: you have to include all parameters in the function, not just the ones you want to change. (DRILL-6168)

Yet another option is to set a session option. However, unless you do a bit of clever coding, format plugins don't have visibility to session options (DRILL-5181).

Perhaps your use case provides a compelling reason to fix some of these limitations...

Thanks,

- Paul

[1] https://drill.apache.org/docs/plugin-configuration-basics/#using-the-formats-attributes-as-table-function-parameters, see the section "Using the Formats Attributes as Table Function Parameters".


    On Saturday, April 7, 2018, 10:37:05 PM PDT, Aman Sinha <am...@apache.org> wrote:  
 
 A better option would be to have a user-defined function that takes 2
parameters and evaluates to a boolean value.
 e.g  select * from myTable where MyUDF(notColumn, 'value')  IS TRUE;

The Storage Plugin that you are developing would need to implement a
pushdown rule that  looks
at the filter condition and if it contains 'MyUDF()', it would pushdown to
the scan/reader corresponding to your plugin.


On Sat, Apr 7, 2018 at 6:58 PM, Hanumath Rao Maduri <ha...@gmail.com>
wrote:

> Hello Ryan,
>
> Thank you for trying out Drill. Drill/Calcite expects "notColumn" to be
> supplied by the underlying scan.
> However, I expect that this column will be present in the scan but not past
> the filter (notColumn = 'value') in the plan.
> In that case you may need to pushdown the filter to the groupScan and then
> remove the column projections from your custom groupscan.
>
> It would be easy for us to guess what could be the issue, if you can post
> the logical and physical query plan's for this query.
>
> Hope this helps. Please do let us know if you have any further issues.
>
> Thanks,
>
>
> On Sat, Apr 7, 2018 at 2:08 PM, Ryan Shanks <ry...@softcomcorp.com>
> wrote:
>
> > Hi Drill Dev Team!
> >
> > I am writing a custom storage plugin and I am curious if it is possible
> in
> > Drill to pass a filter value, in the form of a where clause, that is not
> > related to a column. What I would like to accomplish is something like:
> >
> > select * from myTable where notColumn = 'value';
> >
> > In the example, notColumn is not a column in myTable, or any other table,
> > it is just a specific parameter that the storage plugin will use in the
> > filtering process. Additionally, notColumn would not be returned as a
> > column so Drill needs to not expect it as a part of the 'select *'. I
> > created a rule that will push down and remove these non-column filter
> > calls, but I need to somehow tell drill/calcite that the filter name is
> > valid, without actually registering it as a column. The following error
> > occurs prior to submitting any rules:
> >
> > org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
> ERROR:
> > From line 1, column 35 to line 1, column 39: Column 'notColumn' not found
> > in any table
> >
> >
> > Alternatively, can I manipulate star queries to only return a subset of
> > all the columns for a table?
> >
> > Any insight would be greatly appreciated!
> >
> > Thanks,
> > Ryan
> >
>
  

Re: Non-column filters in Drill

Posted by Aman Sinha <am...@apache.org>.
A better option would be to have a user-defined function that takes 2
parameters and evaluates to a boolean value.
 e.g   select * from myTable where MyUDF(notColumn, 'value')  IS TRUE;

The Storage Plugin that you are developing would need to implement a
pushdown rule that  looks
at the filter condition and if it contains 'MyUDF()', it would pushdown to
the scan/reader corresponding to your plugin.


On Sat, Apr 7, 2018 at 6:58 PM, Hanumath Rao Maduri <ha...@gmail.com>
wrote:

> Hello Ryan,
>
> Thank you for trying out Drill. Drill/Calcite expects "notColumn" to be
> supplied by the underlying scan.
> However, I expect that this column will be present in the scan but not past
> the filter (notColumn = 'value') in the plan.
> In that case you may need to pushdown the filter to the groupScan and then
> remove the column projections from your custom groupscan.
>
> It would be easy for us to guess what could be the issue, if you can post
> the logical and physical query plan's for this query.
>
> Hope this helps. Please do let us know if you have any further issues.
>
> Thanks,
>
>
> On Sat, Apr 7, 2018 at 2:08 PM, Ryan Shanks <ry...@softcomcorp.com>
> wrote:
>
> > Hi Drill Dev Team!
> >
> > I am writing a custom storage plugin and I am curious if it is possible
> in
> > Drill to pass a filter value, in the form of a where clause, that is not
> > related to a column. What I would like to accomplish is something like:
> >
> > select * from myTable where notColumn = 'value';
> >
> > In the example, notColumn is not a column in myTable, or any other table,
> > it is just a specific parameter that the storage plugin will use in the
> > filtering process. Additionally, notColumn would not be returned as a
> > column so Drill needs to not expect it as a part of the 'select *'. I
> > created a rule that will push down and remove these non-column filter
> > calls, but I need to somehow tell drill/calcite that the filter name is
> > valid, without actually registering it as a column. The following error
> > occurs prior to submitting any rules:
> >
> > org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
> ERROR:
> > From line 1, column 35 to line 1, column 39: Column 'notColumn' not found
> > in any table
> >
> >
> > Alternatively, can I manipulate star queries to only return a subset of
> > all the columns for a table?
> >
> > Any insight would be greatly appreciated!
> >
> > Thanks,
> > Ryan
> >
>

Re: Non-column filters in Drill

Posted by Hanumath Rao Maduri <ha...@gmail.com>.
Hello Ryan,

Thank you for trying out Drill. Drill/Calcite expects "notColumn" to be
supplied by the underlying scan.
However, I expect that this column will be present in the scan but not past
the filter (notColumn = 'value') in the plan.
In that case you may need to pushdown the filter to the groupScan and then
remove the column projections from your custom groupscan.

It would be easy for us to guess what could be the issue, if you can post
the logical and physical query plan's for this query.

Hope this helps. Please do let us know if you have any further issues.

Thanks,


On Sat, Apr 7, 2018 at 2:08 PM, Ryan Shanks <ry...@softcomcorp.com>
wrote:

> Hi Drill Dev Team!
>
> I am writing a custom storage plugin and I am curious if it is possible in
> Drill to pass a filter value, in the form of a where clause, that is not
> related to a column. What I would like to accomplish is something like:
>
> select * from myTable where notColumn = 'value';
>
> In the example, notColumn is not a column in myTable, or any other table,
> it is just a specific parameter that the storage plugin will use in the
> filtering process. Additionally, notColumn would not be returned as a
> column so Drill needs to not expect it as a part of the 'select *'. I
> created a rule that will push down and remove these non-column filter
> calls, but I need to somehow tell drill/calcite that the filter name is
> valid, without actually registering it as a column. The following error
> occurs prior to submitting any rules:
>
> org.apache.drill.common.exceptions.UserRemoteException: VALIDATION ERROR:
> From line 1, column 35 to line 1, column 39: Column 'notColumn' not found
> in any table
>
>
> Alternatively, can I manipulate star queries to only return a subset of
> all the columns for a table?
>
> Any insight would be greatly appreciated!
>
> Thanks,
> Ryan
>