You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by Ryan Shanks <ry...@softcomcorp.com> on 2018/04/07 21:08:13 UTC
Non-column filters in Drill
Hi Drill Dev Team!
I am writing a custom storage plugin and I am curious if it is possible
in Drill to pass a filter value, in the form of a where clause, that is
not related to a column. What I would like to accomplish is something like:
select * from myTable where notColumn = 'value';
In the example, notColumn is not a column in myTable, or any other
table, it is just a specific parameter that the storage plugin will use
in the filtering process. Additionally, notColumn would not be returned
as a column so Drill needs to not expect it as a part of the 'select *'.
I created a rule that will push down and remove these non-column filter
calls, but I need to somehow tell drill/calcite that the filter name is
valid, without actually registering it as a column. The following error
occurs prior to submitting any rules:
org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
ERROR: From line 1, column 35 to line 1, column 39: Column 'notColumn'
not found in any table
Alternatively, can I manipulate star queries to only return a subset of
all the columns for a table?
Any insight would be greatly appreciated!
Thanks,
Ryan
Re: Non-column filters in Drill
Posted by Ted Dunning <te...@gmail.com>.
Ryan
What would happen if you defined a column so that you could use the normal
pushdown mechanism? In most cases, you wouldn't return the value of the
column since the only purpose is to use as a filter, but nothing should
prevent you from returning the value of this not-really-a-column.
By letting it be considered as if a column, all of the normal mechanisms
can be brought to bear.
Another case can be seen in how the CSV reader lets your inject separators
and such.
On Sat, Apr 7, 2018, 16:23 Ryan Shanks <ry...@softcomcorp.com> wrote:
> Hi Drill Dev Team!
>
> I am writing a custom storage plugin and I am curious if it is possible
> in Drill to pass a filter value, in the form of a where clause, that is
> not related to a column. What I would like to accomplish is something like:
>
> select * from myTable where notColumn = 'value';
>
> In the example, notColumn is not a column in myTable, or any other
> table, it is just a specific parameter that the storage plugin will use
> in the filtering process. Additionally, notColumn would not be returned
> as a column so Drill needs to not expect it as a part of the 'select *'.
> I created a rule that will push down and remove these non-column filter
> calls, but I need to somehow tell drill/calcite that the filter name is
> valid, without actually registering it as a column. The following error
> occurs prior to submitting any rules:
>
> org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
> ERROR: From line 1, column 35 to line 1, column 39: Column 'notColumn'
> not found in any table
>
>
> Alternatively, can I manipulate star queries to only return a subset of
> all the columns for a table?
>
> Any insight would be greatly appreciated!
>
> Thanks,
> Ryan
>
Re: Non-column filters in Drill
Posted by Ryan Shanks <ry...@softcomcorp.com>.
Thanks for all the suggestions everyone! It looks like table functions
are the closest to what I am looking for.
- Ryan
On 2018-04-08 02:25 AM, Paul Rogers wrote:
> Hi Ryan,
>
> There is an obscure, but very handy feature of Drill called table functions. [1] These allow you to set parameters of your format plugin as part of a query.
>
> You mentioned a storage plugin. I've not tried a table function with a storage plugin. I have tested table functions with a format plugin.
>
> Your format or storage plugin has a Jackson-serializable Java class. Normally you set the properties for your plugin in the Drill web console. But, these can also be set in the table function.
>
> I had a use case something like yours. I defined an example "regex" plugin where the user can specify a regular expression to apply to to a text file to parse columns. The use can then provide a list of column names. Using the table function, I could specify the regex and column names per-query.
>
> This exercise did, however, point out two current limitations of table functions. First, they work only with simple data types (strings, ints). (DRILL-6169) So, my list of columns has to be a single string with a comma delimited list of columns. I could not use the more natural list of strings. Second, table functions do not retain the configured value of parameters: you have to include all parameters in the function, not just the ones you want to change. (DRILL-6168)
>
> Yet another option is to set a session option. However, unless you do a bit of clever coding, format plugins don't have visibility to session options (DRILL-5181).
>
> Perhaps your use case provides a compelling reason to fix some of these limitations...
>
> Thanks,
>
> - Paul
>
> [1] https://drill.apache.org/docs/plugin-configuration-basics/#using-the-formats-attributes-as-table-function-parameters, see the section "Using the Formats Attributes as Table Function Parameters".
>
>
> On Saturday, April 7, 2018, 10:37:05 PM PDT, Aman Sinha <am...@apache.org> wrote:
>
> A better option would be to have a user-defined function that takes 2
> parameters and evaluates to a boolean value.
> e.g select * from myTable where MyUDF(notColumn, 'value') IS TRUE;
>
> The Storage Plugin that you are developing would need to implement a
> pushdown rule that looks
> at the filter condition and if it contains 'MyUDF()', it would pushdown to
> the scan/reader corresponding to your plugin.
>
>
> On Sat, Apr 7, 2018 at 6:58 PM, Hanumath Rao Maduri <ha...@gmail.com>
> wrote:
>
>> Hello Ryan,
>>
>> Thank you for trying out Drill. Drill/Calcite expects "notColumn" to be
>> supplied by the underlying scan.
>> However, I expect that this column will be present in the scan but not past
>> the filter (notColumn = 'value') in the plan.
>> In that case you may need to pushdown the filter to the groupScan and then
>> remove the column projections from your custom groupscan.
>>
>> It would be easy for us to guess what could be the issue, if you can post
>> the logical and physical query plan's for this query.
>>
>> Hope this helps. Please do let us know if you have any further issues.
>>
>> Thanks,
>>
>>
>> On Sat, Apr 7, 2018 at 2:08 PM, Ryan Shanks <ry...@softcomcorp.com>
>> wrote:
>>
>>> Hi Drill Dev Team!
>>>
>>> I am writing a custom storage plugin and I am curious if it is possible
>> in
>>> Drill to pass a filter value, in the form of a where clause, that is not
>>> related to a column. What I would like to accomplish is something like:
>>>
>>> select * from myTable where notColumn = 'value';
>>>
>>> In the example, notColumn is not a column in myTable, or any other table,
>>> it is just a specific parameter that the storage plugin will use in the
>>> filtering process. Additionally, notColumn would not be returned as a
>>> column so Drill needs to not expect it as a part of the 'select *'. I
>>> created a rule that will push down and remove these non-column filter
>>> calls, but I need to somehow tell drill/calcite that the filter name is
>>> valid, without actually registering it as a column. The following error
>>> occurs prior to submitting any rules:
>>>
>>> org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
>> ERROR:
>>> From line 1, column 35 to line 1, column 39: Column 'notColumn' not found
>>> in any table
>>>
>>>
>>> Alternatively, can I manipulate star queries to only return a subset of
>>> all the columns for a table?
>>>
>>> Any insight would be greatly appreciated!
>>>
>>> Thanks,
>>> Ryan
>>>
Re: Non-column filters in Drill
Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Ryan,
There is an obscure, but very handy feature of Drill called table functions. [1] These allow you to set parameters of your format plugin as part of a query.
You mentioned a storage plugin. I've not tried a table function with a storage plugin. I have tested table functions with a format plugin.
Your format or storage plugin has a Jackson-serializable Java class. Normally you set the properties for your plugin in the Drill web console. But, these can also be set in the table function.
I had a use case something like yours. I defined an example "regex" plugin where the user can specify a regular expression to apply to to a text file to parse columns. The use can then provide a list of column names. Using the table function, I could specify the regex and column names per-query.
This exercise did, however, point out two current limitations of table functions. First, they work only with simple data types (strings, ints). (DRILL-6169) So, my list of columns has to be a single string with a comma delimited list of columns. I could not use the more natural list of strings. Second, table functions do not retain the configured value of parameters: you have to include all parameters in the function, not just the ones you want to change. (DRILL-6168)
Yet another option is to set a session option. However, unless you do a bit of clever coding, format plugins don't have visibility to session options (DRILL-5181).
Perhaps your use case provides a compelling reason to fix some of these limitations...
Thanks,
- Paul
[1] https://drill.apache.org/docs/plugin-configuration-basics/#using-the-formats-attributes-as-table-function-parameters, see the section "Using the Formats Attributes as Table Function Parameters".
On Saturday, April 7, 2018, 10:37:05 PM PDT, Aman Sinha <am...@apache.org> wrote:
A better option would be to have a user-defined function that takes 2
parameters and evaluates to a boolean value.
e.g select * from myTable where MyUDF(notColumn, 'value') IS TRUE;
The Storage Plugin that you are developing would need to implement a
pushdown rule that looks
at the filter condition and if it contains 'MyUDF()', it would pushdown to
the scan/reader corresponding to your plugin.
On Sat, Apr 7, 2018 at 6:58 PM, Hanumath Rao Maduri <ha...@gmail.com>
wrote:
> Hello Ryan,
>
> Thank you for trying out Drill. Drill/Calcite expects "notColumn" to be
> supplied by the underlying scan.
> However, I expect that this column will be present in the scan but not past
> the filter (notColumn = 'value') in the plan.
> In that case you may need to pushdown the filter to the groupScan and then
> remove the column projections from your custom groupscan.
>
> It would be easy for us to guess what could be the issue, if you can post
> the logical and physical query plan's for this query.
>
> Hope this helps. Please do let us know if you have any further issues.
>
> Thanks,
>
>
> On Sat, Apr 7, 2018 at 2:08 PM, Ryan Shanks <ry...@softcomcorp.com>
> wrote:
>
> > Hi Drill Dev Team!
> >
> > I am writing a custom storage plugin and I am curious if it is possible
> in
> > Drill to pass a filter value, in the form of a where clause, that is not
> > related to a column. What I would like to accomplish is something like:
> >
> > select * from myTable where notColumn = 'value';
> >
> > In the example, notColumn is not a column in myTable, or any other table,
> > it is just a specific parameter that the storage plugin will use in the
> > filtering process. Additionally, notColumn would not be returned as a
> > column so Drill needs to not expect it as a part of the 'select *'. I
> > created a rule that will push down and remove these non-column filter
> > calls, but I need to somehow tell drill/calcite that the filter name is
> > valid, without actually registering it as a column. The following error
> > occurs prior to submitting any rules:
> >
> > org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
> ERROR:
> > From line 1, column 35 to line 1, column 39: Column 'notColumn' not found
> > in any table
> >
> >
> > Alternatively, can I manipulate star queries to only return a subset of
> > all the columns for a table?
> >
> > Any insight would be greatly appreciated!
> >
> > Thanks,
> > Ryan
> >
>
Re: Non-column filters in Drill
Posted by Aman Sinha <am...@apache.org>.
A better option would be to have a user-defined function that takes 2
parameters and evaluates to a boolean value.
e.g select * from myTable where MyUDF(notColumn, 'value') IS TRUE;
The Storage Plugin that you are developing would need to implement a
pushdown rule that looks
at the filter condition and if it contains 'MyUDF()', it would pushdown to
the scan/reader corresponding to your plugin.
On Sat, Apr 7, 2018 at 6:58 PM, Hanumath Rao Maduri <ha...@gmail.com>
wrote:
> Hello Ryan,
>
> Thank you for trying out Drill. Drill/Calcite expects "notColumn" to be
> supplied by the underlying scan.
> However, I expect that this column will be present in the scan but not past
> the filter (notColumn = 'value') in the plan.
> In that case you may need to pushdown the filter to the groupScan and then
> remove the column projections from your custom groupscan.
>
> It would be easy for us to guess what could be the issue, if you can post
> the logical and physical query plan's for this query.
>
> Hope this helps. Please do let us know if you have any further issues.
>
> Thanks,
>
>
> On Sat, Apr 7, 2018 at 2:08 PM, Ryan Shanks <ry...@softcomcorp.com>
> wrote:
>
> > Hi Drill Dev Team!
> >
> > I am writing a custom storage plugin and I am curious if it is possible
> in
> > Drill to pass a filter value, in the form of a where clause, that is not
> > related to a column. What I would like to accomplish is something like:
> >
> > select * from myTable where notColumn = 'value';
> >
> > In the example, notColumn is not a column in myTable, or any other table,
> > it is just a specific parameter that the storage plugin will use in the
> > filtering process. Additionally, notColumn would not be returned as a
> > column so Drill needs to not expect it as a part of the 'select *'. I
> > created a rule that will push down and remove these non-column filter
> > calls, but I need to somehow tell drill/calcite that the filter name is
> > valid, without actually registering it as a column. The following error
> > occurs prior to submitting any rules:
> >
> > org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
> ERROR:
> > From line 1, column 35 to line 1, column 39: Column 'notColumn' not found
> > in any table
> >
> >
> > Alternatively, can I manipulate star queries to only return a subset of
> > all the columns for a table?
> >
> > Any insight would be greatly appreciated!
> >
> > Thanks,
> > Ryan
> >
>
Re: Non-column filters in Drill
Posted by Hanumath Rao Maduri <ha...@gmail.com>.
Hello Ryan,
Thank you for trying out Drill. Drill/Calcite expects "notColumn" to be
supplied by the underlying scan.
However, I expect that this column will be present in the scan but not past
the filter (notColumn = 'value') in the plan.
In that case you may need to pushdown the filter to the groupScan and then
remove the column projections from your custom groupscan.
It would be easy for us to guess what could be the issue, if you can post
the logical and physical query plan's for this query.
Hope this helps. Please do let us know if you have any further issues.
Thanks,
On Sat, Apr 7, 2018 at 2:08 PM, Ryan Shanks <ry...@softcomcorp.com>
wrote:
> Hi Drill Dev Team!
>
> I am writing a custom storage plugin and I am curious if it is possible in
> Drill to pass a filter value, in the form of a where clause, that is not
> related to a column. What I would like to accomplish is something like:
>
> select * from myTable where notColumn = 'value';
>
> In the example, notColumn is not a column in myTable, or any other table,
> it is just a specific parameter that the storage plugin will use in the
> filtering process. Additionally, notColumn would not be returned as a
> column so Drill needs to not expect it as a part of the 'select *'. I
> created a rule that will push down and remove these non-column filter
> calls, but I need to somehow tell drill/calcite that the filter name is
> valid, without actually registering it as a column. The following error
> occurs prior to submitting any rules:
>
> org.apache.drill.common.exceptions.UserRemoteException: VALIDATION ERROR:
> From line 1, column 35 to line 1, column 39: Column 'notColumn' not found
> in any table
>
>
> Alternatively, can I manipulate star queries to only return a subset of
> all the columns for a table?
>
> Any insight would be greatly appreciated!
>
> Thanks,
> Ryan
>