Posted to user@drill.apache.org by luoc <lu...@apache.org> on 2021/08/24 11:38:59 UTC

Query the HBase data in Drill

Hello Guys,
  Do you use Drill to query Apache HBase? If so, what new features would you like to see in the HBase storage plugin? In addition, Drill has supported Apache Cassandra since 1.19.
Also, could you tell me which storage plugins (or data formats) you use most often? Thanks for your time.


-- luoc

Re: Query the HBase data in Drill

Posted by Charles Givre <cg...@gmail.com>.

Hi Luoc, 
There is actually a pending PR, DRILL-7985 [1], which we should merge before we do any additional work on the HBase plugin. This PR introduces a new framework for pushdowns which will make it a lot easier to implement pushdowns for the various storage plugins. I would recommend reading the docs for that, as we can really make the HBase plugin a lot more robust than it currently is.

Best,
-- C 

1: https://github.com/apache/drill/pull/2289




> On Aug 25, 2021, at 10:43 AM, luoc <lu...@apache.org> wrote:
> 
>  Thanks for the feedback. Apache HBase and Apache Phoenix are an important part of my work. Also, I'm not sure whether anyone has started the `HBase to EVF` work for Drill, but this improvement would be valuable.
>  In particular, I found a big improvement over the Phoenix 4.x and HBase 1.x series when I recently used Phoenix 5.1 + HBase 2.3 on Hadoop 3.3.
>  I look forward to seeing Drill benefit from these advantages.
> 
>> On Aug 24, 2021, at 23:16, Ted Dunning <te...@gmail.com> wrote:
>> 
>> I know somebody who is querying a very large table and has trouble with
>> pushdown.
>> 
>> They are looking for values indexed by primary key with a query like
>> "select * from table where key in s".  If s has a very small number of
>> values, this turns into primary key access, but if there are more than just
>> a few, it becomes a scan.
>> 
>> The situation that would be interesting to detect is where s has a few
>> tightly clustered groups. The ideal strategy would be to scan each group.
>> How this might be detected isn't clear to me, but it would make a massive
>> difference to this kind of query.
>> 
>> Currently, the best alternative is to try to avoid this kind of query and
>> build a data flow such that each cluster of keys flows into a separate
>> query. This would be made easier if a common table expression (CTE) query
>> could be done without having the optimizer try to globally optimize back to
>> a single big scan.
>> 
>> Anyway, I have absolutely no concrete suggestions for making this work, but
>> the need is there.
>> 
>> 
>>> On Tue, Aug 24, 2021 at 4:39 AM luoc <lu...@apache.org> wrote:
>>> 
>>> Hello Guys,
>>> Do you use Drill to query Apache HBase? If so, what new features would
>>> you like to see in the HBase storage plugin? In addition, Drill has
>>> supported Apache Cassandra since 1.19.
>>> Also, could you tell me which storage plugins (or data formats) you use
>>> most often? Thanks for your time.
>>> 
>>> 
>>> -- luoc
>

Re: Query the HBase data in Drill

Posted by luoc <lu...@apache.org>.

  Thanks for the feedback. Apache HBase and Apache Phoenix are an important part of my work. Also, I'm not sure whether anyone has started the `HBase to EVF` work for Drill, but this improvement would be valuable.
  In particular, I found a big improvement over the Phoenix 4.x and HBase 1.x series when I recently used Phoenix 5.1 + HBase 2.3 on Hadoop 3.3.
  I look forward to seeing Drill benefit from these advantages.

> On Aug 24, 2021, at 23:16, Ted Dunning <te...@gmail.com> wrote:
> 
> I know somebody who is querying a very large table and has trouble with
> pushdown.
> 
> They are looking for values indexed by primary key with a query like
> "select * from table where key in s".  If s has a very small number of
> values, this turns into primary key access, but if there are more than just
> a few, it becomes a scan.
> 
> The situation that would be interesting to detect is where s has a few
> tightly clustered groups. The ideal strategy would be to scan each group.
> How this might be detected isn't clear to me, but it would make a massive
> difference to this kind of query.
> 
> Currently, the best alternative is to try to avoid this kind of query and
> build a data flow such that each cluster of keys flows into a separate
> query. This would be made easier if a common table expression (CTE) query
> could be done without having the optimizer try to globally optimize back to
> a single big scan.
> 
> Anyway, I have absolutely no concrete suggestions for making this work, but
> the need is there.
> 
> 
>> On Tue, Aug 24, 2021 at 4:39 AM luoc <lu...@apache.org> wrote:
>> 
>> Hello Guys,
>>  Do you use Drill to query Apache HBase? If so, what new features would
>> you like to see in the HBase storage plugin? In addition, Drill has
>> supported Apache Cassandra since 1.19.
>> Also, could you tell me which storage plugins (or data formats) you use
>> most often? Thanks for your time.
>> 
>> 
>> -- luoc

Re: Query the HBase data in Drill

Posted by Ted Dunning <te...@gmail.com>.

I know somebody who is querying a very large table and has trouble with
pushdown.

They are looking for values indexed by primary key with a query like
"select * from table where key in s".  If s has a very small number of
values, this turns into primary key access, but if there are more than just
a few, it becomes a scan.

The situation that would be interesting to detect is where s has a few
tightly clustered groups. The ideal strategy would be to scan each group.
How this might be detected isn't clear to me, but it would make a massive
difference to this kind of query.
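
As a rough illustration of what "tightly clustered groups" could mean in practice, one simple heuristic is to sort the keys and start a new group wherever the gap between neighbouring keys exceeds a threshold. This is only a sketch under strong assumptions: the keys are integers, and the function name `cluster_keys` and the `max_gap` parameter are hypothetical. HBase row keys are byte arrays, so a real planner rule would need its own notion of distance, and nothing here reflects how Drill actually plans such queries.

```python
from typing import List, Tuple


def cluster_keys(keys: List[int], max_gap: int) -> List[Tuple[int, int]]:
    """Group keys into inclusive (low, high) ranges.

    A new range starts whenever two neighbouring keys (in sorted order)
    are more than max_gap apart. Hypothetical helper, integer keys only.
    """
    if not keys:
        return []
    ordered = sorted(set(keys))  # de-duplicate and sort
    ranges = []
    low = prev = ordered[0]
    for key in ordered[1:]:
        if key - prev > max_gap:
            # Gap is too wide: close the current range and open a new one.
            ranges.append((low, prev))
            low = key
        prev = key
    ranges.append((low, prev))  # close the final range
    return ranges


s = [1001, 1002, 1005, 52000, 52003, 990000]
print(cluster_keys(s, max_gap=10))
# prints [(1001, 1005), (52000, 52003), (990000, 990000)]
```

With ranges like these, each group could in principle be served by a short range scan instead of one scan over the whole table. Choosing `max_gap` is the hard part, since it trades the number of scans against the number of unwanted rows read.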

Currently, the best alternative is to try to avoid this kind of query and
build a data flow such that each cluster of keys flows into a separate
query. This would be made easier if a common table expression (CTE) query
could be done without having the optimizer try to globally optimize back to
a single big scan.
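
The workaround described above, one query per cluster of keys, might look roughly like the following on the client side. This is a minimal sketch and not a Drill API: `range_queries`, `my_table` and `key` are hypothetical names, the ranges are assumed to be inclusive pairs of integers, and whether each generated predicate is actually pushed down as a range scan depends on the storage plugin.

```python
from typing import List, Tuple


def range_queries(table: str, key_column: str,
                  ranges: List[Tuple[int, int]]) -> List[str]:
    """Build one SELECT statement per inclusive (low, high) key range.

    Values are formatted directly into the SQL text, which is acceptable
    here only because the keys are assumed to be integers.
    """
    queries = []
    for low, high in ranges:
        if low == high:
            predicate = f"{key_column} = {low}"
        else:
            predicate = f"{key_column} BETWEEN {low} AND {high}"
        queries.append(f"SELECT * FROM {table} WHERE {predicate}")
    return queries


for query in range_queries("my_table", "key", [(1001, 1005), (990000, 990000)]):
    print(query)
# prints:
# SELECT * FROM my_table WHERE key BETWEEN 1001 AND 1005
# SELECT * FROM my_table WHERE key = 990000
```

Running the statements separately keeps the optimizer from merging them back into one large scan, at the cost of extra round trips and of filtering out any rows in a range that were not in the original key set.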

Anyway, I have absolutely no concrete suggestions for making this work, but
the need is there.

On Tue, Aug 24, 2021 at 4:39 AM luoc <lu...@apache.org> wrote:

> Hello Guys,
>   Do you use Drill to query Apache HBase? If so, what new features would
> you like to see in the HBase storage plugin? In addition, Drill has
> supported Apache Cassandra since 1.19.
> Also, could you tell me which storage plugins (or data formats) you use
> most often? Thanks for your time.
>
>
> -- luoc

Re: Query the HBase data in Drill

Posted by Christian Pfarr <z0...@pm.me.INVALID>.

Hi Luoc,

I would like to have better filter pushdown to HBase.
We see that HBase works fine with "like" filters, but Tableau, for example, creates very strange queries.
In that case Drill loads most of the rows and has to filter them itself, which leads to poor performance.
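
For context on why a "like" filter with a literal prefix is comparatively easy to push down: the prefix can be turned into a start row and an exclusive stop row, so only that slice of the key space needs to be scanned. The sketch below shows this general technique and is not Drill's actual implementation. `like_prefix_range` is a hypothetical helper, it assumes UTF-8 string row keys, and it ignores `LIKE` escape characters.

```python
from typing import Optional, Tuple


def like_prefix_range(pattern: str) -> Optional[Tuple[bytes, bytes]]:
    """Return (start_row, stop_row) covering every key that can match pattern.

    start_row is inclusive and stop_row is exclusive. Returns None when the
    pattern starts with a wildcard, so no key range can be derived.
    """
    # The literal prefix is everything before the first wildcard.
    prefix_chars = []
    for ch in pattern:
        if ch in "%_":
            break
        prefix_chars.append(ch)
    prefix = "".join(prefix_chars).encode("utf-8")
    if not prefix:
        return None
    # Incrementing the last byte gives the smallest key that sorts after
    # every key starting with the prefix. UTF-8 never produces 0xff, so the
    # increment cannot overflow.
    stop = prefix[:-1] + bytes([prefix[-1] + 1])
    return prefix, stop


print(like_prefix_range("abc%"))   # prints (b'abc', b'abd')
print(like_prefix_range("%abc"))   # prints None
```

The range only narrows the scan. The full pattern still has to be checked against the rows that come back, and a pattern with a leading wildcard gives no range at all, which is one way a generated query can end up as a full table scan.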

Because of that, we try to use Phoenix for the most critical use cases, but we cannot union its results with other storage plugins because of https://issues.apache.org/jira/browse/DRILL-7720
(we add Phoenix through the Query Server as a JDBC storage plugin).
Additionally, we have a problem with Phoenix and impersonation within JDBC storage plugins, as described in https://github.com/apache/drill/issues/2296

So if I had one free wish, I would like a good Phoenix integration (without the need for the Query Server), including Kerberos security and impersonation support.

Regards,
Christian

-------- Original Message --------
On Aug 24, 2021, 13:38, luoc wrote:

> Hello Guys,
> Do you use Drill to query Apache HBase? If so, what new features would you like to see in the HBase storage plugin? In addition, Drill has supported Apache Cassandra since 1.19.
> Also, could you tell me which storage plugins (or data formats) you use most often? Thanks for your time.
> 

> -- luoc
