Posted to user@drill.apache.org by George Lu <lu...@gmail.com> on 2015/05/05 09:54:33 UTC

How to deploy Drill to achieve optimal performance

Hi all,

I have been evaluating Drill to see whether it fits our requirement for
real-time or near-real-time interactive queries.
I have an HBase server whose underlying HDFS contains three data nodes, and I
deployed 7 Drill nodes within the cluster.
I have several million records in the HBase table. When I issue queries like
SUM, MAX, and COUNT against the table, Drill takes 5 to 6 minutes on average
to return the result.

Such latency is not ideal for interactive use.

I know Drill is meant for low-latency queries, so I would like to ask how to
achieve that. How can I make Drill run queries with low latency (in seconds,
not minutes)?

Any suggestions are welcome!

Thanks!

George

Re: How to deploy Drill to achieve optimal performance

Posted by George Lu <lu...@gmail.com>.

Hi Ted and Steven,

This is a test cluster with 9 nodes in total, including two HDFS DataNodes
and two HBase RegionServers. I deployed Drill 0.9.0 to 7 of the nodes.

| test2      | 31010      | 31011        | 31012      | false      |
| test3      | 31010      | 31011        | 31012      | false      |
| test8      | 31010      | 31011        | 31012      | true       |
| test4      | 31010      | 31011        | 31012      | false      |
| test6      | 31010      | 31011        | 31012      | false      |
| test9      | 31010      | 31011        | 31012      | false      |
| test5      | 31010      | 31011        | 31012      | false      |

test5 and test6 are the DataNodes and RegionServers.

When I run select count(*) against a small table containing 18664 records,
it returns "1 row selected (5.786 seconds)".
When I run the same query against a table with 40397300 records, it returns
"1 row selected (579.322 seconds)".
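
For a rough sense of scale, the two count(*) timings above can be turned into an average scan rate. This is only back-of-the-envelope arithmetic on the reported numbers; the helper name records_per_second is made up for illustration, and the result says nothing about where the time is actually spent.

```python
# Timings reported above: (record count, elapsed seconds) from the sqlline output.
timings = {
    "small table": (18_664, 5.786),
    "large table": (40_397_300, 579.322),
}


def records_per_second(records: int, seconds: float) -> float:
    """Average rate implied by a single count(*) timing (hypothetical helper)."""
    return records / seconds


for name, (records, seconds) in timings.items():
    rate = records_per_second(records, seconds)
    print(f"{name}: about {rate:,.0f} records/s for the whole cluster")
```

The large table works out to roughly 70,000 records per second in total, or about 10,000 per second per Drillbit if the work were spread evenly over the 7 nodes (an assumption, not something the output shows).
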
If I run select count(*), convert_from(activities_perf.log.rt,'utf8')
from activities_perf group by activities_perf.log.rt, it always fails with "
Query failed: SYSTEM ERROR: Command failed while establishing connection.
Failure type CONNECTION.

Fragment 2:4

[7540323b-1db0-4220-8016-b3a7c950979c on test3:31010]
java.lang.RuntimeException: java.sql.SQLException: Failure while executing
query.
at sqlline.SqlLine$IncrementalRows.hasNext(SqlLine.java:2514)
at sqlline.SqlLine$TableOutputFormat.print(SqlLine.java:2148)
at sqlline.SqlLine.print(SqlLine.java:1809)
at sqlline.SqlLine$Commands.execute(SqlLine.java:3766)
at sqlline.SqlLine$Commands.sql(SqlLine.java:3663)
at sqlline.SqlLine.dispatch(SqlLine.java:889)
at sqlline.SqlLine.begin(SqlLine.java:763)
at sqlline.SqlLine.start(SqlLine.java:498)
at sqlline.SqlLine.main(SqlLine.java:460)"

It seems some nodes fail from time to time. Does Drill reschedule the query
on another node when that happens, or can it be configured to do so?

I have attached the log files from some of the nodes (as I cannot log into
some of the nodes in the cluster) for your reference.

Many thanks!

George Lu

On Wed, May 6, 2015 at 1:06 AM, Steven Phillips <sp...@maprtech.com>
wrote:

> It would be helpful if you could post the profile for the query somewhere,
> or send it directly to me as an attachment (since attachments won't post to
> the mailing list).
>
> To get the profile, go to the profile page in the Web UI:
>
>
> http://drill.apache.org/docs/monitoring-and-canceling-queries-in-the-drill-web-ui/
>
> When you find the profile for the query in question, if you add ".json" to
> the URL, this will display the raw JSON text for the profile. You can
> download this and save it somewhere.
>
> On Tue, May 5, 2015 at 3:38 AM, Ted Dunning <te...@gmail.com> wrote:
>
> > George,
> >
> > That sounds much too slow.
> >
> > Can you provide some samples of the data and queries?  How about actual
> > data counts?  Millions?  Hundreds of millions?
>
> --
>  Steven Phillips
>  Software Engineer
>
>  mapr.com
>

Re: How to deploy Drill to achieve optimal performance

Posted by Steven Phillips <sp...@maprtech.com>.

It would be helpful if you could post the profile for the query somewhere,
or send it directly to me as an attachment (since attachments won't post to
the mailing list).

To get the profile, go to the profile page in the Web UI:

http://drill.apache.org/docs/monitoring-and-canceling-queries-in-the-drill-web-ui/

When you find the profile for the query in question, if you add ".json" to
the URL, this will display the raw JSON text for the profile. You can
download this and save it somewhere.
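
The download step described above could also be scripted. The following is a minimal sketch, assuming the Web UI listens on Drill's default port 8047 and serves the profile at /profiles/<query id>.json (that is, the profile page URL with ".json" appended). The function names profile_url and save_profile are hypothetical, and the host and query id have to come from your own cluster.

```python
import json
import urllib.request


def profile_url(host: str, query_id: str, port: int = 8047) -> str:
    """Build the profile URL with ".json" appended (port 8047 is an assumption)."""
    return f"http://{host}:{port}/profiles/{query_id}.json"


def save_profile(host: str, query_id: str, out_path: str) -> None:
    """Fetch the JSON profile from the Web UI and write it to a local file."""
    with urllib.request.urlopen(profile_url(host, query_id)) as resp:
        profile = json.load(resp)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(profile, f, indent=2)
```

Calling save_profile with the Drillbit host running the Web UI, the query id shown on the profile page, and a file name such as "profile.json" would leave a file that can be shared off-list.
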

On Tue, May 5, 2015 at 3:38 AM, Ted Dunning <te...@gmail.com> wrote:

> George,
>
> That sounds much too slow.
>
> Can you provide some samples of the data and queries?  How about actual
> data counts?  Millions?  Hundreds of millions?



-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: How to deploy Drill to achieve optimal performance

Posted by Ted Dunning <te...@gmail.com>.

George,

That sounds much too slow.

Can you provide some samples of the data and queries?  How about actual
data counts?  Millions?  Hundreds of millions?




