Posted to user@drill.apache.org by Adam Gilmore <dr...@gmail.com> on 2015/03/25 08:00:40 UTC

Drill favouring a particular Drillbit

Hi guys,

I'm trying to understand how this could be possible.  I have a Hadoop
cluster set up with a name node and two data nodes.  All have identical
specs in terms of CPU/RAM etc.

The two data nodes have a replicated HDFS setup where I'm storing some
Parquet files.

A Drill cluster (with Zookeeper) is running with Drillbits on all three
servers.

When I submit a query to *any* of the Drillbits, no matter who the foreman
is, one particular data node gets picked to do the vast majority of the
work.

We've even added three more task nodes to the cluster and everything still
puts a huge load on one particular server.

There is nothing unique about this data node.  HDFS is fully replicated (no
unreplicated blocks) to the other data node.

I know that Drill tries to get data locality, so I'm wondering if this is
the cause, but this is essentially swamping this data node with 100% CPU
usage while leaving the others barely doing any work.

As soon as we shut down the Drillbit on this data node, query performance
increases significantly.

Any thoughts on how I can troubleshoot why Drill is picking that particular
node?

Re: Drill favouring a particular Drillbit

Posted by Steven Phillips <sp...@maprtech.com>.
Adam,

Could you give more info regarding the dataset, including:

- number and size of the Parquet files
- block locations of the Parquet files
- drillbit hosts

If you could send the profile json files for a couple of queries, that
could be helpful too.
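
If it helps, a couple of standard Hadoop commands will show the file sizes
and the block/replica locations (the path below is just a placeholder for
wherever the Parquet files live):

  # list the files and their sizes (placeholder path)
  hadoop fs -ls /data/parquet/mytable

  # show per-file block counts, sizes and replica locations
  hdfs fsck /data/parquet/mytable -files -blocks -locations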

On Wed, Mar 25, 2015 at 11:23 AM, Jacques Nadeau <ja...@apache.org> wrote:

> Adam,
>
> There is actually an option to control how much Drill uses locality versus
> distribution.  Not sure if that is influencing you but it could be.  If so,
> you can decrease the value to increase the importance of distribution.  The
> option is `planner.affinity_factor`.
>
>
>
> On Wed, Mar 25, 2015 at 12:00 AM, Adam Gilmore <dr...@gmail.com>
> wrote:
>
> > Hi guys,
> >
> > I'm trying to understand how this could be possible.  I have a Hadoop
> > cluster of a name node and two data nodes setup.  All have identical
> specs
> > in terms of CPU/RAM etc.
> >
> > The two data nodes have a replicated HDFS setup where I'm storing some
> > Parquet files.
> >
> > A Drill cluster (with Zookeeper) is running with Drillbits on all three
> > servers.
> >
> > When I submit a query to *any* of the Drillbits, no matter who the
> foreman
> > is, one particular data node gets picked to do the vast majority of the
> > work.
> >
> > We've even added three more task nodes to the cluster and everything
> still
> > puts a huge load on one particular server.
> >
> > There is nothing unique about this data node.  HDFS is fully replicated
> (no
> > unreplicated blocks) to the other data node.
> >
> > I know that Drill tries to get data locality, so I'm wondering if this is
> > the cause, but this essentially swamping this data node with 100% CPU
> usage
> > while leaving the others barely doing any work.
> >
> > As soon as we shut down the Drillbit on this data node, query performance
> > increases significantly.
> >
> > Any thoughts on how I can troubleshoot why Drill is picking that
> particular
> > node?
> >
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Drill favouring a particular Drillbit

Posted by Jacques Nadeau <ja...@apache.org>.
Adam,

There is actually an option to control how much Drill uses locality versus
distribution.  Not sure if that is influencing you but it could be.  If so,
you can decrease the value to increase the importance of distribution.  The
option is `planner.affinity_factor`.
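
For example, a quick sqlline session to check the current value and lower it
for the session (the ZooKeeper host and the 0.0 value are only placeholders;
tune as needed):

  $ bin/sqlline -u jdbc:drill:zk=<zk-host>:2181
  > SELECT * FROM sys.options WHERE name = 'planner.affinity_factor';
  > ALTER SESSION SET `planner.affinity_factor` = 0.0;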



On Wed, Mar 25, 2015 at 12:00 AM, Adam Gilmore <dr...@gmail.com>
wrote:

> Hi guys,
>
> I'm trying to understand how this could be possible.  I have a Hadoop
> cluster of a name node and two data nodes setup.  All have identical specs
> in terms of CPU/RAM etc.
>
> The two data nodes have a replicated HDFS setup where I'm storing some
> Parquet files.
>
> A Drill cluster (with Zookeeper) is running with Drillbits on all three
> servers.
>
> When I submit a query to *any* of the Drillbits, no matter who the foreman
> is, one particular data node gets picked to do the vast majority of the
> work.
>
> We've even added three more task nodes to the cluster and everything still
> puts a huge load on one particular server.
>
> There is nothing unique about this data node.  HDFS is fully replicated (no
> unreplicated blocks) to the other data node.
>
> I know that Drill tries to get data locality, so I'm wondering if this is
> the cause, but this essentially swamping this data node with 100% CPU usage
> while leaving the others barely doing any work.
>
> As soon as we shut down the Drillbit on this data node, query performance
> increases significantly.
>
> Any thoughts on how I can troubleshoot why Drill is picking that particular
> node?
>

Re: Drill favouring a particular Drillbit

Posted by Adam Gilmore <dr...@gmail.com>.
It fixed the foreman issue perfectly and has significantly increased
performance in our test cases.  We're still struggling a bit with the data
affinity challenge, but it may be unrelated to Drill (as in our environment
the name/data nodes are doing the normal HDFS work as well, so it may be a
contention issue).

The shuffling is probably not the most optimal way to balance load, but
it's better than just picking the first as the foreman each time.

On Thu, Apr 16, 2015 at 10:29 AM, Jacques Nadeau <ja...@apache.org> wrote:

> It doesn't currently have plan caching but a simple implementation probably
> wouldn't be that difficult (assuming you keep it node-level as opposed to
> cluster level).  We merged the auto shuffling per session so let us know
> how that looks.
>
> On Wed, Apr 15, 2015 at 4:35 PM, Adam Gilmore <dr...@gmail.com>
> wrote:
>
> > The workload does involve a fair number of short queries.  Although when
> I
> > say short, I'm talking about querying 2-10 million record Parquet files,
> so
> > they're not extremely short.
> >
> > Does Drill have plan caching built in at this stage?  Might help us
> reduce
> > some of that foreman overhead.
> >
> > On Tue, Apr 14, 2015 at 3:02 AM, Jacques Nadeau <ja...@apache.org>
> > wrote:
> >
> > > Yeah, it seems that way.  We should get your patch merged.  I just
> > reviewed
> > > and lgtm.
> > >
> > > What type of workload are you running?  Unless your workload is
> planning
> > > heavy (e.g. lots of short queries) or does a lot of sorts (the last
> merge
> > > is on the foreman node), work should be reasonably distributed.
> > >
> > > On Sun, Apr 12, 2015 at 10:29 PM, Adam Gilmore <dr...@gmail.com>
> > > wrote:
> > >
> > > > Looks like this definitely is the following bug:
> > > >
> > > > https://issues.apache.org/jira/browse/DRILL-2512
> > > >
> > > > It's a pretty severe performance bottleneck having the foreman doing
> so
> > > > much work.  In our environment, the foreman hits basically 95-100%
> CPU
> > > > while the other drillbits barely do much work.  Means it's nearly
> > > > impossible for us to scale out.
> > > >
> > > > On Wed, Apr 8, 2015 at 3:58 PM, Adam Gilmore <dr...@gmail.com>
> > > > wrote:
> > > >
> > > > > Anyone have any more thoughts on this?  Anywhere I can start trying
> > to
> > > > > troubleshoot?
> > > > >
> > > > > On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <
> dragoncurve@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > >> So there are 5 Parquet files, each ~125mb - not sure what I can
> > > provide
> > > > >> re the block locations?  I believe it's under the HDFS block size
> so
> > > > they
> > > > >> should be stored contiguously.
> > > > >>
> > > > >> I've tried setting the affinity factor to various values (1, 0,
> > etc.)
> > > > but
> > > > >> nothing seems to change that.  It always prefers certain nodes.
> > > > >>
> > > > >> Moreover, we added a stack more nodes and it started picking very
> > > > >> specific nodes as foremen (perhaps 2-3 nodes out of 20 were always
> > > > picked
> > > > >> as foremen).  Therefore, the foremen were being swamped with CPU
> > while
> > > > the
> > > > >> other nodes were doing very little work.
> > > > >>
> > > > >> On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <
> > > > sphillips@maprtech.com
> > > > >> > wrote:
> > > > >>
> > > > >>> Actually, I believe a query submitted through REST interface will
> > > > >>> instantiate a DrillClient, which uses the same
> ZKClusterCoordinator
> > > > that
> > > > >>> sqlline uses, and thus the foreman for the query is not
> necessarily
> > > on
> > > > >>> the
> > > > >>> same drillbit as it was submitted to. But I'm still not sure it's
> > > > related
> > > > >>> to DRILL-2512.
> > > > >>>
> > > > >>> I'll wait for your additional info before speculating further.
> > > > >>>
> > > > >>> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <
> > dragoncurve@gmail.com
> > > >
> > > > >>> wrote:
> > > > >>>
> > > > >>> > We actually setup a separate load balancer for port 8047 (we're
> > > > >>> submitting
> > > > >>> > these queries via the REST API at the moment) so Zookeeper etc.
> > is
> > > > out
> > > > >>> of
> > > > >>> > the equation, thus I doubt we're hitting DRILL-2512.
> > > > >>> >
> > > > >>> > When shutitng down the "troublesome" drillbit, it starts
> > > > parallelizing
> > > > >>> much
> > > > >>> > nicer again.  We even added 10+ nodes to the cluster and as
> long
> > as
> > > > >>> that
> > > > >>> > particular drillbit is shut down, it distributes very nicely.
> > The
> > > > >>> minute
> > > > >>> > we start the drillbit on that node again, it starts swamping it
> > > with
> > > > >>> work.
> > > > >>> >
> > > > >>> > I'll shoot through the JSON profiles and some more information
> on
> > > the
> > > > >>> > dataset etc. later today (Australian time!).
> > > > >>> >
> > > > >>> > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <
> > > > >>> sphillips@maprtech.com>
> > > > >>> > wrote:
> > > > >>> >
> > > > >>> > > I didn't notice at first that Adam said "no matter who the
> > > foreman
> > > > >>> is".
> > > > >>> > >
> > > > >>> > > Another suspicion I have is that our current logic for
> > assigning
> > > > work
> > > > >>> > will
> > > > >>> > > assign to the exact same nodes every time we query a
> particular
> > > > >>> table.
> > > > >>> > > Changing affinity factor may change it, but it will still be
> > the
> > > > same
> > > > >>> > every
> > > > >>> > > time. That is my suspicion, but I am not sure why shutting
> down
> > > the
> > > > >>> > > drillbit would improve performance. I would expect that
> > shutting
> > > > >>> down the
> > > > >>> > > drillbit would result in a different drillbit becoming the
> > > hotspot.
> > > > >>> > >
> > > > >>> > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <
> > > > jacques@apache.org
> > > > >>> >
> > > > >>> > > wrote:
> > > > >>> > >
> > > > >>> > > > On Steven's point, the node that the client connects to is
> > not
> > > > >>> > currently
> > > > >>> > > > randomized.  Given your description of behavior, I'm not
> sure
> > > > that
> > > > >>> > you're
> > > > >>> > > > hitting 2512 or just general undesirable distribution.
> > > > >>> > > >
> > > > >>> > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
> > > > >>> > > sphillips@maprtech.com>
> > > > >>> > > > wrote:
> > > > >>> > > >
> > > > >>> > > > > This is a known issue:
> > > > >>> > > > >
> > > > >>> > > > > https://issues.apache.org/jira/browse/DRILL-2512
> > > > >>> > > > >
> > > > >>> > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> > > > >>> > > > > aengelbrecht@maprtech.com> wrote:
> > > > >>> > > > >
> > > > >>> > > > > > What version of Drill are you running?
> > > > >>> > > > > >
> > > > >>> > > > > > Any hints when looking at the query profiles? Is the
> node
> > > > that
> > > > >>> is
> > > > >>> > > being
> > > > >>> > > > > > hammered the foreman for the queries and most of the
> > major
> > > > >>> > fragments
> > > > >>> > > > are
> > > > >>> > > > > > tied to the foreman?
> > > > >>> > > > > >
> > > > >>> > > > > > —Andries
> > > > >>> > > > > >
> > > > >>> > > > > >
> > > > >>> > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <
> > > > >>> dragoncurve@gmail.com>
> > > > >>> > > > > wrote:
> > > > >>> > > > > >
> > > > >>> > > > > > > Hi guys,
> > > > >>> > > > > > >
> > > > >>> > > > > > > I'm trying to understand how this could be
> possible.  I
> > > > have
> > > > >>> a
> > > > >>> > > Hadoop
> > > > >>> > > > > > > cluster of a name node and two data nodes setup.  All
> > > have
> > > > >>> > > identical
> > > > >>> > > > > > specs
> > > > >>> > > > > > > in terms of CPU/RAM etc.
> > > > >>> > > > > > >
> > > > >>> > > > > > > The two data nodes have a replicated HDFS setup where
> > I'm
> > > > >>> storing
> > > > >>> > > > some
> > > > >>> > > > > > > Parquet files.
> > > > >>> > > > > > >
> > > > >>> > > > > > > A Drill cluster (with Zookeeper) is running with
> > > Drillbits
> > > > >>> on all
> > > > >>> > > > three
> > > > >>> > > > > > > servers.
> > > > >>> > > > > > >
> > > > >>> > > > > > > When I submit a query to *any* of the Drillbits, no
> > > matter
> > > > >>> who
> > > > >>> > the
> > > > >>> > > > > > foreman
> > > > >>> > > > > > > is, one particular data node gets picked to do the
> vast
> > > > >>> majority
> > > > >>> > of
> > > > >>> > > > the
> > > > >>> > > > > > > work.
> > > > >>> > > > > > >
> > > > >>> > > > > > > We've even added three more task nodes to the cluster
> > and
> > > > >>> > > everything
> > > > >>> > > > > > still
> > > > >>> > > > > > > puts a huge load on one particular server.
> > > > >>> > > > > > >
> > > > >>> > > > > > > There is nothing unique about this data node.  HDFS
> is
> > > > fully
> > > > >>> > > > replicated
> > > > >>> > > > > > (no
> > > > >>> > > > > > > unreplicated blocks) to the other data node.
> > > > >>> > > > > > >
> > > > >>> > > > > > > I know that Drill tries to get data locality, so I'm
> > > > >>> wondering if
> > > > >>> > > > this
> > > > >>> > > > > is
> > > > >>> > > > > > > the cause, but this essentially swamping this data
> node
> > > > with
> > > > >>> 100%
> > > > >>> > > CPU
> > > > >>> > > > > > usage
> > > > >>> > > > > > > while leaving the others barely doing any work.
> > > > >>> > > > > > >
> > > > >>> > > > > > > As soon as we shut down the Drillbit on this data
> node,
> > > > query
> > > > >>> > > > > performance
> > > > >>> > > > > > > increases significantly.
> > > > >>> > > > > > >
> > > > >>> > > > > > > Any thoughts on how I can troubleshoot why Drill is
> > > picking
> > > > >>> that
> > > > >>> > > > > > particular
> > > > >>> > > > > > > node?
> > > > >>> > > > > >
> > > > >>> > > > > >
> > > > >>> > > > >
> > > > >>> > > > >
> > > > >>> > > > > --
> > > > >>> > > > >  Steven Phillips
> > > > >>> > > > >  Software Engineer
> > > > >>> > > > >
> > > > >>> > > > >  mapr.com
> > > > >>> > > > >
> > > > >>> > > >
> > > > >>> > >
> > > > >>> > >
> > > > >>> > >
> > > > >>> > > --
> > > > >>> > >  Steven Phillips
> > > > >>> > >  Software Engineer
> > > > >>> > >
> > > > >>> > >  mapr.com
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> --
> > > > >>>  Steven Phillips
> > > > >>>  Software Engineer
> > > > >>>
> > > > >>>  mapr.com
> > > > >>>
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: Drill favouring a particular Drillbit

Posted by Jacques Nadeau <ja...@apache.org>.
It doesn't currently have plan caching, but a simple implementation probably
wouldn't be that difficult (assuming you keep it node-level as opposed to
cluster-level).  We merged the auto shuffling per session, so let us know
how that looks.

On Wed, Apr 15, 2015 at 4:35 PM, Adam Gilmore <dr...@gmail.com> wrote:

> The workload does involve a fair number of short queries.  Although when I
> say short, I'm talking about querying 2-10 million record Parquet files, so
> they're not extremely short.
>
> Does Drill have plan caching built in at this stage?  Might help us reduce
> some of that foreman overhead.
>
> On Tue, Apr 14, 2015 at 3:02 AM, Jacques Nadeau <ja...@apache.org>
> wrote:
>
> > Yeah, it seems that way.  We should get your patch merged.  I just
> reviewed
> > and lgtm.
> >
> > What type of workload are you running?  Unless your workload is planning
> > heavy (e.g. lots of short queries) or does a lot of sorts (the last merge
> > is on the foreman node), work should be reasonably distributed.
> >
> > On Sun, Apr 12, 2015 at 10:29 PM, Adam Gilmore <dr...@gmail.com>
> > wrote:
> >
> > > Looks like this definitely is the following bug:
> > >
> > > https://issues.apache.org/jira/browse/DRILL-2512
> > >
> > > It's a pretty severe performance bottleneck having the foreman doing so
> > > much work.  In our environment, the foreman hits basically 95-100% CPU
> > > while the other drillbits barely do much work.  Means it's nearly
> > > impossible for us to scale out.
> > >
> > > On Wed, Apr 8, 2015 at 3:58 PM, Adam Gilmore <dr...@gmail.com>
> > > wrote:
> > >
> > > > Anyone have any more thoughts on this?  Anywhere I can start trying
> to
> > > > troubleshoot?
> > > >
> > > > On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <dragoncurve@gmail.com
> >
> > > > wrote:
> > > >
> > > >> So there are 5 Parquet files, each ~125mb - not sure what I can
> > provide
> > > >> re the block locations?  I believe it's under the HDFS block size so
> > > they
> > > >> should be stored contiguously.
> > > >>
> > > >> I've tried setting the affinity factor to various values (1, 0,
> etc.)
> > > but
> > > >> nothing seems to change that.  It always prefers certain nodes.
> > > >>
> > > >> Moreover, we added a stack more nodes and it started picking very
> > > >> specific nodes as foremen (perhaps 2-3 nodes out of 20 were always
> > > picked
> > > >> as foremen).  Therefore, the foremen were being swamped with CPU
> while
> > > the
> > > >> other nodes were doing very little work.
> > > >>
> > > >> On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <
> > > sphillips@maprtech.com
> > > >> > wrote:
> > > >>
> > > >>> Actually, I believe a query submitted through REST interface will
> > > >>> instantiate a DrillClient, which uses the same ZKClusterCoordinator
> > > that
> > > >>> sqlline uses, and thus the foreman for the query is not necessarily
> > on
> > > >>> the
> > > >>> same drillbit as it was submitted to. But I'm still not sure it's
> > > related
> > > >>> to DRILL-2512.
> > > >>>
> > > >>> I'll wait for your additional info before speculating further.
> > > >>>
> > > >>> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <
> dragoncurve@gmail.com
> > >
> > > >>> wrote:
> > > >>>
> > > >>> > We actually setup a separate load balancer for port 8047 (we're
> > > >>> submitting
> > > >>> > these queries via the REST API at the moment) so Zookeeper etc.
> is
> > > out
> > > >>> of
> > > >>> > the equation, thus I doubt we're hitting DRILL-2512.
> > > >>> >
> > > >>> > When shutitng down the "troublesome" drillbit, it starts
> > > parallelizing
> > > >>> much
> > > >>> > nicer again.  We even added 10+ nodes to the cluster and as long
> as
> > > >>> that
> > > >>> > particular drillbit is shut down, it distributes very nicely.
> The
> > > >>> minute
> > > >>> > we start the drillbit on that node again, it starts swamping it
> > with
> > > >>> work.
> > > >>> >
> > > >>> > I'll shoot through the JSON profiles and some more information on
> > the
> > > >>> > dataset etc. later today (Australian time!).
> > > >>> >
> > > >>> > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <
> > > >>> sphillips@maprtech.com>
> > > >>> > wrote:
> > > >>> >
> > > >>> > > I didn't notice at first that Adam said "no matter who the
> > foreman
> > > >>> is".
> > > >>> > >
> > > >>> > > Another suspicion I have is that our current logic for
> assigning
> > > work
> > > >>> > will
> > > >>> > > assign to the exact same nodes every time we query a particular
> > > >>> table.
> > > >>> > > Changing affinity factor may change it, but it will still be
> the
> > > same
> > > >>> > every
> > > >>> > > time. That is my suspicion, but I am not sure why shutting down
> > the
> > > >>> > > drillbit would improve performance. I would expect that
> shutting
> > > >>> down the
> > > >>> > > drillbit would result in a different drillbit becoming the
> > hotspot.
> > > >>> > >
> > > >>> > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <
> > > jacques@apache.org
> > > >>> >
> > > >>> > > wrote:
> > > >>> > >
> > > >>> > > > On Steven's point, the node that the client connects to is
> not
> > > >>> > currently
> > > >>> > > > randomized.  Given your description of behavior, I'm not sure
> > > that
> > > >>> > you're
> > > >>> > > > hitting 2512 or just general undesirable distribution.
> > > >>> > > >
> > > >>> > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
> > > >>> > > sphillips@maprtech.com>
> > > >>> > > > wrote:
> > > >>> > > >
> > > >>> > > > > This is a known issue:
> > > >>> > > > >
> > > >>> > > > > https://issues.apache.org/jira/browse/DRILL-2512
> > > >>> > > > >
> > > >>> > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> > > >>> > > > > aengelbrecht@maprtech.com> wrote:
> > > >>> > > > >
> > > >>> > > > > > What version of Drill are you running?
> > > >>> > > > > >
> > > >>> > > > > > Any hints when looking at the query profiles? Is the node
> > > that
> > > >>> is
> > > >>> > > being
> > > >>> > > > > > hammered the foreman for the queries and most of the
> major
> > > >>> > fragments
> > > >>> > > > are
> > > >>> > > > > > tied to the foreman?
> > > >>> > > > > >
> > > >>> > > > > > —Andries
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <
> > > >>> dragoncurve@gmail.com>
> > > >>> > > > > wrote:
> > > >>> > > > > >
> > > >>> > > > > > > Hi guys,
> > > >>> > > > > > >
> > > >>> > > > > > > I'm trying to understand how this could be possible.  I
> > > have
> > > >>> a
> > > >>> > > Hadoop
> > > >>> > > > > > > cluster of a name node and two data nodes setup.  All
> > have
> > > >>> > > identical
> > > >>> > > > > > specs
> > > >>> > > > > > > in terms of CPU/RAM etc.
> > > >>> > > > > > >
> > > >>> > > > > > > The two data nodes have a replicated HDFS setup where
> I'm
> > > >>> storing
> > > >>> > > > some
> > > >>> > > > > > > Parquet files.
> > > >>> > > > > > >
> > > >>> > > > > > > A Drill cluster (with Zookeeper) is running with
> > Drillbits
> > > >>> on all
> > > >>> > > > three
> > > >>> > > > > > > servers.
> > > >>> > > > > > >
> > > >>> > > > > > > When I submit a query to *any* of the Drillbits, no
> > matter
> > > >>> who
> > > >>> > the
> > > >>> > > > > > foreman
> > > >>> > > > > > > is, one particular data node gets picked to do the vast
> > > >>> majority
> > > >>> > of
> > > >>> > > > the
> > > >>> > > > > > > work.
> > > >>> > > > > > >
> > > >>> > > > > > > We've even added three more task nodes to the cluster
> and
> > > >>> > > everything
> > > >>> > > > > > still
> > > >>> > > > > > > puts a huge load on one particular server.
> > > >>> > > > > > >
> > > >>> > > > > > > There is nothing unique about this data node.  HDFS is
> > > fully
> > > >>> > > > replicated
> > > >>> > > > > > (no
> > > >>> > > > > > > unreplicated blocks) to the other data node.
> > > >>> > > > > > >
> > > >>> > > > > > > I know that Drill tries to get data locality, so I'm
> > > >>> wondering if
> > > >>> > > > this
> > > >>> > > > > is
> > > >>> > > > > > > the cause, but this essentially swamping this data node
> > > with
> > > >>> 100%
> > > >>> > > CPU
> > > >>> > > > > > usage
> > > >>> > > > > > > while leaving the others barely doing any work.
> > > >>> > > > > > >
> > > >>> > > > > > > As soon as we shut down the Drillbit on this data node,
> > > query
> > > >>> > > > > performance
> > > >>> > > > > > > increases significantly.
> > > >>> > > > > > >
> > > >>> > > > > > > Any thoughts on how I can troubleshoot why Drill is
> > picking
> > > >>> that
> > > >>> > > > > > particular
> > > >>> > > > > > > node?
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > > --
> > > >>> > > > >  Steven Phillips
> > > >>> > > > >  Software Engineer
> > > >>> > > > >
> > > >>> > > > >  mapr.com
> > > >>> > > > >
> > > >>> > > >
> > > >>> > >
> > > >>> > >
> > > >>> > >
> > > >>> > > --
> > > >>> > >  Steven Phillips
> > > >>> > >  Software Engineer
> > > >>> > >
> > > >>> > >  mapr.com
> > > >>> > >
> > > >>> >
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>>  Steven Phillips
> > > >>>  Software Engineer
> > > >>>
> > > >>>  mapr.com
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
>

Re: Drill favouring a particular Drillbit

Posted by Adam Gilmore <dr...@gmail.com>.
The workload does involve a fair number of short queries.  Although when I
say short, I'm talking about querying 2-10 million-record Parquet files, so
they're not extremely short.

Does Drill have plan caching built in at this stage?  Might help us reduce
some of that foreman overhead.

On Tue, Apr 14, 2015 at 3:02 AM, Jacques Nadeau <ja...@apache.org> wrote:

> Yeah, it seems that way.  We should get your patch merged.  I just reviewed
> and lgtm.
>
> What type of workload are you running?  Unless your workload is planning
> heavy (e.g. lots of short queries) or does a lot of sorts (the last merge
> is on the foreman node), work should be reasonably distributed.
>
> On Sun, Apr 12, 2015 at 10:29 PM, Adam Gilmore <dr...@gmail.com>
> wrote:
>
> > Looks like this definitely is the following bug:
> >
> > https://issues.apache.org/jira/browse/DRILL-2512
> >
> > It's a pretty severe performance bottleneck having the foreman doing so
> > much work.  In our environment, the foreman hits basically 95-100% CPU
> > while the other drillbits barely do much work.  Means it's nearly
> > impossible for us to scale out.
> >
> > On Wed, Apr 8, 2015 at 3:58 PM, Adam Gilmore <dr...@gmail.com>
> > wrote:
> >
> > > Anyone have any more thoughts on this?  Anywhere I can start trying to
> > > troubleshoot?
> > >
> > > On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <dr...@gmail.com>
> > > wrote:
> > >
> > >> So there are 5 Parquet files, each ~125mb - not sure what I can
> provide
> > >> re the block locations?  I believe it's under the HDFS block size so
> > they
> > >> should be stored contiguously.
> > >>
> > >> I've tried setting the affinity factor to various values (1, 0, etc.)
> > but
> > >> nothing seems to change that.  It always prefers certain nodes.
> > >>
> > >> Moreover, we added a stack more nodes and it started picking very
> > >> specific nodes as foremen (perhaps 2-3 nodes out of 20 were always
> > picked
> > >> as foremen).  Therefore, the foremen were being swamped with CPU while
> > the
> > >> other nodes were doing very little work.
> > >>
> > >> On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <
> > sphillips@maprtech.com
> > >> > wrote:
> > >>
> > >>> Actually, I believe a query submitted through REST interface will
> > >>> instantiate a DrillClient, which uses the same ZKClusterCoordinator
> > that
> > >>> sqlline uses, and thus the foreman for the query is not necessarily
> on
> > >>> the
> > >>> same drillbit as it was submitted to. But I'm still not sure it's
> > related
> > >>> to DRILL-2512.
> > >>>
> > >>> I'll wait for your additional info before speculating further.
> > >>>
> > >>> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <dragoncurve@gmail.com
> >
> > >>> wrote:
> > >>>
> > >>> > We actually setup a separate load balancer for port 8047 (we're
> > >>> submitting
> > >>> > these queries via the REST API at the moment) so Zookeeper etc. is
> > out
> > >>> of
> > >>> > the equation, thus I doubt we're hitting DRILL-2512.
> > >>> >
> > >>> > When shutitng down the "troublesome" drillbit, it starts
> > parallelizing
> > >>> much
> > >>> > nicer again.  We even added 10+ nodes to the cluster and as long as
> > >>> that
> > >>> > particular drillbit is shut down, it distributes very nicely.  The
> > >>> minute
> > >>> > we start the drillbit on that node again, it starts swamping it
> with
> > >>> work.
> > >>> >
> > >>> > I'll shoot through the JSON profiles and some more information on
> the
> > >>> > dataset etc. later today (Australian time!).
> > >>> >
> > >>> > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <
> > >>> sphillips@maprtech.com>
> > >>> > wrote:
> > >>> >
> > >>> > > I didn't notice at first that Adam said "no matter who the
> foreman
> > >>> is".
> > >>> > >
> > >>> > > Another suspicion I have is that our current logic for assigning
> > work
> > >>> > will
> > >>> > > assign to the exact same nodes every time we query a particular
> > >>> table.
> > >>> > > Changing affinity factor may change it, but it will still be the
> > same
> > >>> > every
> > >>> > > time. That is my suspicion, but I am not sure why shutting down
> the
> > >>> > > drillbit would improve performance. I would expect that shutting
> > >>> down the
> > >>> > > drillbit would result in a different drillbit becoming the
> hotspot.
> > >>> > >
> > >>> > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <
> > jacques@apache.org
> > >>> >
> > >>> > > wrote:
> > >>> > >
> > >>> > > > On Steven's point, the node that the client connects to is not
> > >>> > currently
> > >>> > > > randomized.  Given your description of behavior, I'm not sure
> > that
> > >>> > you're
> > >>> > > > hitting 2512 or just general undesirable distribution.
> > >>> > > >
> > >>> > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
> > >>> > > sphillips@maprtech.com>
> > >>> > > > wrote:
> > >>> > > >
> > >>> > > > > This is a known issue:
> > >>> > > > >
> > >>> > > > > https://issues.apache.org/jira/browse/DRILL-2512
> > >>> > > > >
> > >>> > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> > >>> > > > > aengelbrecht@maprtech.com> wrote:
> > >>> > > > >
> > >>> > > > > > What version of Drill are you running?
> > >>> > > > > >
> > >>> > > > > > Any hints when looking at the query profiles? Is the node
> > that
> > >>> is
> > >>> > > being
> > >>> > > > > > hammered the foreman for the queries and most of the major
> > >>> > fragments
> > >>> > > > are
> > >>> > > > > > tied to the foreman?
> > >>> > > > > >
> > >>> > > > > > —Andries
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <
> > >>> dragoncurve@gmail.com>
> > >>> > > > > wrote:
> > >>> > > > > >
> > >>> > > > > > > Hi guys,
> > >>> > > > > > >
> > >>> > > > > > > I'm trying to understand how this could be possible.  I
> > have
> > >>> a
> > >>> > > Hadoop
> > >>> > > > > > > cluster of a name node and two data nodes setup.  All
> have
> > >>> > > identical
> > >>> > > > > > specs
> > >>> > > > > > > in terms of CPU/RAM etc.
> > >>> > > > > > >
> > >>> > > > > > > The two data nodes have a replicated HDFS setup where I'm
> > >>> storing
> > >>> > > > some
> > >>> > > > > > > Parquet files.
> > >>> > > > > > >
> > >>> > > > > > > A Drill cluster (with Zookeeper) is running with
> Drillbits
> > >>> on all
> > >>> > > > three
> > >>> > > > > > > servers.
> > >>> > > > > > >
> > >>> > > > > > > When I submit a query to *any* of the Drillbits, no
> matter
> > >>> who
> > >>> > the
> > >>> > > > > > foreman
> > >>> > > > > > > is, one particular data node gets picked to do the vast
> > >>> majority
> > >>> > of
> > >>> > > > the
> > >>> > > > > > > work.
> > >>> > > > > > >
> > >>> > > > > > > We've even added three more task nodes to the cluster and
> > >>> > > everything
> > >>> > > > > > still
> > >>> > > > > > > puts a huge load on one particular server.
> > >>> > > > > > >
> > >>> > > > > > > There is nothing unique about this data node.  HDFS is
> > fully
> > >>> > > > replicated
> > >>> > > > > > (no
> > >>> > > > > > > unreplicated blocks) to the other data node.
> > >>> > > > > > >
> > >>> > > > > > > I know that Drill tries to get data locality, so I'm
> > >>> wondering if
> > >>> > > > this
> > >>> > > > > is
> > >>> > > > > > > the cause, but this essentially swamping this data node
> > with
> > >>> 100%
> > >>> > > CPU
> > >>> > > > > > usage
> > >>> > > > > > > while leaving the others barely doing any work.
> > >>> > > > > > >
> > >>> > > > > > > As soon as we shut down the Drillbit on this data node,
> > query
> > >>> > > > > performance
> > >>> > > > > > > increases significantly.
> > >>> > > > > > >
> > >>> > > > > > > Any thoughts on how I can troubleshoot why Drill is
> picking
> > >>> that
> > >>> > > > > > particular
> > >>> > > > > > > node?
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > --
> > >>> > > > >  Steven Phillips
> > >>> > > > >  Software Engineer
> > >>> > > > >
> > >>> > > > >  mapr.com
> > >>> > > > >
> > >>> > > >
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> > > --
> > >>> > >  Steven Phillips
> > >>> > >  Software Engineer
> > >>> > >
> > >>> > >  mapr.com
> > >>> > >
> > >>> >
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>>  Steven Phillips
> > >>>  Software Engineer
> > >>>
> > >>>  mapr.com
> > >>>
> > >>
> > >>
> > >
> >
>

Re: Drill favouring a particular Drillbit

Posted by Jacques Nadeau <ja...@apache.org>.
Yeah, it seems that way.  We should get your patch merged.  I just reviewed
and lgtm.

What type of workload are you running?  Unless your workload is
planning-heavy (e.g. lots of short queries) or does a lot of sorts (the last
merge is on the foreman node), work should be reasonably distributed.

On Sun, Apr 12, 2015 at 10:29 PM, Adam Gilmore <dr...@gmail.com>
wrote:

> Looks like this definitely is the following bug:
>
> https://issues.apache.org/jira/browse/DRILL-2512
>
> It's a pretty severe performance bottleneck having the foreman doing so
> much work.  In our environment, the foreman hits basically 95-100% CPU
> while the other drillbits barely do much work.  Means it's nearly
> impossible for us to scale out.
>
> On Wed, Apr 8, 2015 at 3:58 PM, Adam Gilmore <dr...@gmail.com>
> wrote:
>
> > Anyone have any more thoughts on this?  Anywhere I can start trying to
> > troubleshoot?
> >
> > On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <dr...@gmail.com>
> > wrote:
> >
> >> So there are 5 Parquet files, each ~125mb - not sure what I can provide
> >> re the block locations?  I believe it's under the HDFS block size so
> they
> >> should be stored contiguously.
> >>
> >> I've tried setting the affinity factor to various values (1, 0, etc.)
> but
> >> nothing seems to change that.  It always prefers certain nodes.
> >>
> >> Moreover, we added a stack more nodes and it started picking very
> >> specific nodes as foremen (perhaps 2-3 nodes out of 20 were always
> picked
> >> as foremen).  Therefore, the foremen were being swamped with CPU while
> the
> >> other nodes were doing very little work.
> >>
> >> On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <
> sphillips@maprtech.com
> >> > wrote:
> >>
> >>> Actually, I believe a query submitted through REST interface will
> >>> instantiate a DrillClient, which uses the same ZKClusterCoordinator
> that
> >>> sqlline uses, and thus the foreman for the query is not necessarily on
> >>> the
> >>> same drillbit as it was submitted to. But I'm still not sure it's
> related
> >>> to DRILL-2512.
> >>>
> >>> I'll wait for your additional info before speculating further.
> >>>
> >>> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <dr...@gmail.com>
> >>> wrote:
> >>>
> >>> > We actually setup a separate load balancer for port 8047 (we're
> >>> submitting
> >>> > these queries via the REST API at the moment) so Zookeeper etc. is
> out
> >>> of
> >>> > the equation, thus I doubt we're hitting DRILL-2512.
> >>> >
> >>> > When shutitng down the "troublesome" drillbit, it starts
> parallelizing
> >>> much
> >>> > nicer again.  We even added 10+ nodes to the cluster and as long as
> >>> that
> >>> > particular drillbit is shut down, it distributes very nicely.  The
> >>> minute
> >>> > we start the drillbit on that node again, it starts swamping it with
> >>> work.
> >>> >
> >>> > I'll shoot through the JSON profiles and some more information on the
> >>> > dataset etc. later today (Australian time!).
> >>> >
> >>> > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <
> >>> sphillips@maprtech.com>
> >>> > wrote:
> >>> >
> >>> > > I didn't notice at first that Adam said "no matter who the foreman
> >>> is".
> >>> > >
> >>> > > Another suspicion I have is that our current logic for assigning
> work
> >>> > will
> >>> > > assign to the exact same nodes every time we query a particular
> >>> table.
> >>> > > Changing affinity factor may change it, but it will still be the
> same
> >>> > every
> >>> > > time. That is my suspicion, but I am not sure why shutting down the
> >>> > > drillbit would improve performance. I would expect that shutting
> >>> down the
> >>> > > drillbit would result in a different drillbit becoming the hotspot.
> >>> > >
> >>> > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <
> jacques@apache.org
> >>> >
> >>> > > wrote:
> >>> > >
> >>> > > > On Steven's point, the node that the client connects to is not
> >>> > currently
> >>> > > > randomized.  Given your description of behavior, I'm not sure
> that
> >>> > you're
> >>> > > > hitting 2512 or just general undesirable distribution.
> >>> > > >
> >>> > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
> >>> > > sphillips@maprtech.com>
> >>> > > > wrote:
> >>> > > >
> >>> > > > > This is a known issue:
> >>> > > > >
> >>> > > > > https://issues.apache.org/jira/browse/DRILL-2512
> >>> > > > >
> >>> > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> >>> > > > > aengelbrecht@maprtech.com> wrote:
> >>> > > > >
> >>> > > > > > What version of Drill are you running?
> >>> > > > > >
> >>> > > > > > Any hints when looking at the query profiles? Is the node
> that
> >>> is
> >>> > > being
> >>> > > > > > hammered the foreman for the queries and most of the major
> >>> > fragments
> >>> > > > are
> >>> > > > > > tied to the foreman?
> >>> > > > > >
> >>> > > > > > —Andries
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <
> >>> dragoncurve@gmail.com>
> >>> > > > > wrote:
> >>> > > > > >
> >>> > > > > > > Hi guys,
> >>> > > > > > >
> >>> > > > > > > I'm trying to understand how this could be possible.  I
> have
> >>> a
> >>> > > Hadoop
> >>> > > > > > > cluster of a name node and two data nodes setup.  All have
> >>> > > identical
> >>> > > > > > specs
> >>> > > > > > > in terms of CPU/RAM etc.
> >>> > > > > > >
> >>> > > > > > > The two data nodes have a replicated HDFS setup where I'm
> >>> storing
> >>> > > > some
> >>> > > > > > > Parquet files.
> >>> > > > > > >
> >>> > > > > > > A Drill cluster (with Zookeeper) is running with Drillbits
> >>> on all
> >>> > > > three
> >>> > > > > > > servers.
> >>> > > > > > >
> >>> > > > > > > When I submit a query to *any* of the Drillbits, no matter
> >>> who
> >>> > the
> >>> > > > > > foreman
> >>> > > > > > > is, one particular data node gets picked to do the vast
> >>> majority
> >>> > of
> >>> > > > the
> >>> > > > > > > work.
> >>> > > > > > >
> >>> > > > > > > We've even added three more task nodes to the cluster and
> >>> > > everything
> >>> > > > > > still
> >>> > > > > > > puts a huge load on one particular server.
> >>> > > > > > >
> >>> > > > > > > There is nothing unique about this data node.  HDFS is
> fully
> >>> > > > replicated
> >>> > > > > > (no
> >>> > > > > > > unreplicated blocks) to the other data node.
> >>> > > > > > >
> >>> > > > > > > I know that Drill tries to get data locality, so I'm
> >>> wondering if
> >>> > > > this
> >>> > > > > is
> >>> > > > > > > the cause, but this essentially swamping this data node
> with
> >>> 100%
> >>> > > CPU
> >>> > > > > > usage
> >>> > > > > > > while leaving the others barely doing any work.
> >>> > > > > > >
> >>> > > > > > > As soon as we shut down the Drillbit on this data node,
> query
> >>> > > > > performance
> >>> > > > > > > increases significantly.
> >>> > > > > > >
> >>> > > > > > > Any thoughts on how I can troubleshoot why Drill is picking
> >>> that
> >>> > > > > > particular
> >>> > > > > > > node?
> >>> > > > > >
> >>> > > > > >
> >>> > > > >
> >>> > > > >
> >>> > > > > --
> >>> > > > >  Steven Phillips
> >>> > > > >  Software Engineer
> >>> > > > >
> >>> > > > >  mapr.com
> >>> > > > >
> >>> > > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > --
> >>> > >  Steven Phillips
> >>> > >  Software Engineer
> >>> > >
> >>> > >  mapr.com
> >>> > >
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>>  Steven Phillips
> >>>  Software Engineer
> >>>
> >>>  mapr.com
> >>>
> >>
> >>
> >
>

Re: Drill favouring a particular Drillbit

Posted by Adam Gilmore <dr...@gmail.com>.
Looks like this definitely is the following bug:

https://issues.apache.org/jira/browse/DRILL-2512

It's a pretty severe performance bottleneck having the foreman do so
much work.  In our environment, the foreman hits basically 95-100% CPU
while the other drillbits barely do any work.  That makes it nearly
impossible for us to scale out.

On Wed, Apr 8, 2015 at 3:58 PM, Adam Gilmore <dr...@gmail.com> wrote:

> Anyone have any more thoughts on this?  Anywhere I can start trying to
> troubleshoot?
>
> On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <dr...@gmail.com>
> wrote:
>
>> So there are 5 Parquet files, each ~125mb - not sure what I can provide
>> re the block locations?  I believe it's under the HDFS block size so they
>> should be stored contiguously.
>>
>> I've tried setting the affinity factor to various values (1, 0, etc.) but
>> nothing seems to change that.  It always prefers certain nodes.
>>
>> Moreover, we added a stack more nodes and it started picking very
>> specific nodes as foremen (perhaps 2-3 nodes out of 20 were always picked
>> as foremen).  Therefore, the foremen were being swamped with CPU while the
>> other nodes were doing very little work.
>>
>> On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <sphillips@maprtech.com
>> > wrote:
>>
>>> Actually, I believe a query submitted through REST interface will
>>> instantiate a DrillClient, which uses the same ZKClusterCoordinator that
>>> sqlline uses, and thus the foreman for the query is not necessarily on
>>> the
>>> same drillbit as it was submitted to. But I'm still not sure it's related
>>> to DRILL-2512.
>>>
>>> I'll wait for your additional info before speculating further.
>>>
>>> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <dr...@gmail.com>
>>> wrote:
>>>
>>> > We actually setup a separate load balancer for port 8047 (we're
>>> submitting
>>> > these queries via the REST API at the moment) so Zookeeper etc. is out
>>> of
>>> > the equation, thus I doubt we're hitting DRILL-2512.
>>> >
>>> > When shutitng down the "troublesome" drillbit, it starts parallelizing
>>> much
>>> > nicer again.  We even added 10+ nodes to the cluster and as long as
>>> that
>>> > particular drillbit is shut down, it distributes very nicely.  The
>>> minute
>>> > we start the drillbit on that node again, it starts swamping it with
>>> work.
>>> >
>>> > I'll shoot through the JSON profiles and some more information on the
>>> > dataset etc. later today (Australian time!).
>>> >
>>> > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <
>>> sphillips@maprtech.com>
>>> > wrote:
>>> >
>>> > > I didn't notice at first that Adam said "no matter who the foreman
>>> is".
>>> > >
>>> > > Another suspicion I have is that our current logic for assigning work
>>> > will
>>> > > assign to the exact same nodes every time we query a particular
>>> table.
>>> > > Changing affinity factor may change it, but it will still be the same
>>> > every
>>> > > time. That is my suspicion, but I am not sure why shutting down the
>>> > > drillbit would improve performance. I would expect that shutting
>>> down the
>>> > > drillbit would result in a different drillbit becoming the hotspot.
>>> > >
>>> > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <jacques@apache.org
>>> >
>>> > > wrote:
>>> > >
>>> > > > On Steven's point, the node that the client connects to is not
>>> > currently
>>> > > > randomized.  Given your description of behavior, I'm not sure that
>>> > you're
>>> > > > hitting 2512 or just general undesirable distribution.
>>> > > >
>>> > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
>>> > > sphillips@maprtech.com>
>>> > > > wrote:
>>> > > >
>>> > > > > This is a known issue:
>>> > > > >
>>> > > > > https://issues.apache.org/jira/browse/DRILL-2512
>>> > > > >
>>> > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
>>> > > > > aengelbrecht@maprtech.com> wrote:
>>> > > > >
>>> > > > > > What version of Drill are you running?
>>> > > > > >
>>> > > > > > Any hints when looking at the query profiles? Is the node that
>>> is
>>> > > being
>>> > > > > > hammered the foreman for the queries and most of the major
>>> > fragments
>>> > > > are
>>> > > > > > tied to the foreman?
>>> > > > > >
>>> > > > > > —Andries
>>> > > > > >
>>> > > > > >
>>> > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <
>>> dragoncurve@gmail.com>
>>> > > > > wrote:
>>> > > > > >
>>> > > > > > > Hi guys,
>>> > > > > > >
>>> > > > > > > I'm trying to understand how this could be possible.  I have
>>> a
>>> > > Hadoop
>>> > > > > > > cluster of a name node and two data nodes setup.  All have
>>> > > identical
>>> > > > > > specs
>>> > > > > > > in terms of CPU/RAM etc.
>>> > > > > > >
>>> > > > > > > The two data nodes have a replicated HDFS setup where I'm
>>> storing
>>> > > > some
>>> > > > > > > Parquet files.
>>> > > > > > >
>>> > > > > > > A Drill cluster (with Zookeeper) is running with Drillbits
>>> on all
>>> > > > three
>>> > > > > > > servers.
>>> > > > > > >
>>> > > > > > > When I submit a query to *any* of the Drillbits, no matter
>>> who
>>> > the
>>> > > > > > foreman
>>> > > > > > > is, one particular data node gets picked to do the vast
>>> majority
>>> > of
>>> > > > the
>>> > > > > > > work.
>>> > > > > > >
>>> > > > > > > We've even added three more task nodes to the cluster and
>>> > > everything
>>> > > > > > still
>>> > > > > > > puts a huge load on one particular server.
>>> > > > > > >
>>> > > > > > > There is nothing unique about this data node.  HDFS is fully
>>> > > > replicated
>>> > > > > > (no
>>> > > > > > > unreplicated blocks) to the other data node.
>>> > > > > > >
>>> > > > > > > I know that Drill tries to get data locality, so I'm
>>> wondering if
>>> > > > this
>>> > > > > is
>>> > > > > > > the cause, but this essentially swamping this data node with
>>> 100%
>>> > > CPU
>>> > > > > > usage
>>> > > > > > > while leaving the others barely doing any work.
>>> > > > > > >
>>> > > > > > > As soon as we shut down the Drillbit on this data node, query
>>> > > > > performance
>>> > > > > > > increases significantly.
>>> > > > > > >
>>> > > > > > > Any thoughts on how I can troubleshoot why Drill is picking
>>> that
>>> > > > > > particular
>>> > > > > > > node?
>>> > > > > >
>>> > > > > >
>>> > > > >
>>> > > > >
>>> > > > > --
>>> > > > >  Steven Phillips
>>> > > > >  Software Engineer
>>> > > > >
>>> > > > >  mapr.com
>>> > > > >
>>> > > >
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > >  Steven Phillips
>>> > >  Software Engineer
>>> > >
>>> > >  mapr.com
>>> > >
>>> >
>>>
>>>
>>>
>>> --
>>>  Steven Phillips
>>>  Software Engineer
>>>
>>>  mapr.com
>>>
>>
>>
>

Re: Drill favouring a particular Drillbit

Posted by Adam Gilmore <dr...@gmail.com>.
Anyone have any more thoughts on this?  Anywhere I can start trying to
troubleshoot?

On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <dr...@gmail.com> wrote:

> So there are 5 Parquet files, each ~125mb - not sure what I can provide re
> the block locations?  I believe it's under the HDFS block size so they
> should be stored contiguously.
>
> I've tried setting the affinity factor to various values (1, 0, etc.) but
> nothing seems to change that.  It always prefers certain nodes.
>
> Moreover, we added a stack more nodes and it started picking very specific
> nodes as foremen (perhaps 2-3 nodes out of 20 were always picked as
> foremen).  Therefore, the foremen were being swamped with CPU while the
> other nodes were doing very little work.
>
> On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <sp...@maprtech.com>
> wrote:
>
>> Actually, I believe a query submitted through REST interface will
>> instantiate a DrillClient, which uses the same ZKClusterCoordinator that
>> sqlline uses, and thus the foreman for the query is not necessarily on the
>> same drillbit as it was submitted to. But I'm still not sure it's related
>> to DRILL-2512.
>>
>> I'll wait for your additional info before speculating further.
>>
>> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <dr...@gmail.com>
>> wrote:
>>
>> > We actually setup a separate load balancer for port 8047 (we're
>> submitting
>> > these queries via the REST API at the moment) so Zookeeper etc. is out
>> of
>> > the equation, thus I doubt we're hitting DRILL-2512.
>> >
>> > When shutitng down the "troublesome" drillbit, it starts parallelizing
>> much
>> > nicer again.  We even added 10+ nodes to the cluster and as long as that
>> > particular drillbit is shut down, it distributes very nicely.  The
>> minute
>> > we start the drillbit on that node again, it starts swamping it with
>> work.
>> >
>> > I'll shoot through the JSON profiles and some more information on the
>> > dataset etc. later today (Australian time!).
>> >
>> > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <
>> sphillips@maprtech.com>
>> > wrote:
>> >
>> > > I didn't notice at first that Adam said "no matter who the foreman
>> is".
>> > >
>> > > Another suspicion I have is that our current logic for assigning work
>> > will
>> > > assign to the exact same nodes every time we query a particular table.
>> > > Changing affinity factor may change it, but it will still be the same
>> > every
>> > > time. That is my suspicion, but I am not sure why shutting down the
>> > > drillbit would improve performance. I would expect that shutting down
>> the
>> > > drillbit would result in a different drillbit becoming the hotspot.
>> > >
>> > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <ja...@apache.org>
>> > > wrote:
>> > >
>> > > > On Steven's point, the node that the client connects to is not
>> > currently
>> > > > randomized.  Given your description of behavior, I'm not sure that
>> > you're
>> > > > hitting 2512 or just general undesirable distribution.
>> > > >
>> > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
>> > > sphillips@maprtech.com>
>> > > > wrote:
>> > > >
>> > > > > This is a known issue:
>> > > > >
>> > > > > https://issues.apache.org/jira/browse/DRILL-2512
>> > > > >
>> > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
>> > > > > aengelbrecht@maprtech.com> wrote:
>> > > > >
>> > > > > > What version of Drill are you running?
>> > > > > >
>> > > > > > Any hints when looking at the query profiles? Is the node that
>> is
>> > > being
>> > > > > > hammered the foreman for the queries and most of the major
>> > fragments
>> > > > are
>> > > > > > tied to the foreman?
>> > > > > >
>> > > > > > —Andries
>> > > > > >
>> > > > > >
>> > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <
>> dragoncurve@gmail.com>
>> > > > > wrote:
>> > > > > >
>> > > > > > > Hi guys,
>> > > > > > >
>> > > > > > > I'm trying to understand how this could be possible.  I have a
>> > > Hadoop
>> > > > > > > cluster of a name node and two data nodes setup.  All have
>> > > identical
>> > > > > > specs
>> > > > > > > in terms of CPU/RAM etc.
>> > > > > > >
>> > > > > > > The two data nodes have a replicated HDFS setup where I'm
>> storing
>> > > > some
>> > > > > > > Parquet files.
>> > > > > > >
>> > > > > > > A Drill cluster (with Zookeeper) is running with Drillbits on
>> all
>> > > > three
>> > > > > > > servers.
>> > > > > > >
>> > > > > > > When I submit a query to *any* of the Drillbits, no matter who
>> > the
>> > > > > > foreman
>> > > > > > > is, one particular data node gets picked to do the vast
>> majority
>> > of
>> > > > the
>> > > > > > > work.
>> > > > > > >
>> > > > > > > We've even added three more task nodes to the cluster and
>> > > everything
>> > > > > > still
>> > > > > > > puts a huge load on one particular server.
>> > > > > > >
>> > > > > > > There is nothing unique about this data node.  HDFS is fully
>> > > > replicated
>> > > > > > (no
>> > > > > > > unreplicated blocks) to the other data node.
>> > > > > > >
>> > > > > > > I know that Drill tries to get data locality, so I'm
>> wondering if
>> > > > this
>> > > > > is
>> > > > > > > the cause, but this essentially swamping this data node with
>> 100%
>> > > CPU
>> > > > > > usage
>> > > > > > > while leaving the others barely doing any work.
>> > > > > > >
>> > > > > > > As soon as we shut down the Drillbit on this data node, query
>> > > > > performance
>> > > > > > > increases significantly.
>> > > > > > >
>> > > > > > > Any thoughts on how I can troubleshoot why Drill is picking
>> that
>> > > > > > particular
>> > > > > > > node?
>> > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > >  Steven Phillips
>> > > > >  Software Engineer
>> > > > >
>> > > > >  mapr.com
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > >  Steven Phillips
>> > >  Software Engineer
>> > >
>> > >  mapr.com
>> > >
>> >
>>
>>
>>
>> --
>>  Steven Phillips
>>  Software Engineer
>>
>>  mapr.com
>>
>
>

Re: Drill favouring a particular Drillbit

Posted by Adam Gilmore <dr...@gmail.com>.
So there are 5 Parquet files, each ~125 MB - not sure what I can provide re
the block locations?  I believe each is under the HDFS block size, so they
should be stored contiguously.

I've tried setting the affinity factor to various values (1, 0, etc.) but
nothing seems to change that.  It always prefers certain nodes.

Moreover, we added a stack more nodes and it started picking very specific
nodes as foremen (perhaps 2-3 nodes out of 20 were always picked as
foremen).  Therefore, the foremen were being swamped with CPU while the
other nodes were doing very little work.

On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <sp...@maprtech.com>
wrote:

> Actually, I believe a query submitted through REST interface will
> instantiate a DrillClient, which uses the same ZKClusterCoordinator that
> sqlline uses, and thus the foreman for the query is not necessarily on the
> same drillbit as it was submitted to. But I'm still not sure it's related
> to DRILL-2512.
>
> I'll wait for your additional info before speculating further.
>
> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <dr...@gmail.com>
> wrote:
>
> > We actually setup a separate load balancer for port 8047 (we're
> submitting
> > these queries via the REST API at the moment) so Zookeeper etc. is out of
> > the equation, thus I doubt we're hitting DRILL-2512.
> >
> > When shutitng down the "troublesome" drillbit, it starts parallelizing
> much
> > nicer again.  We even added 10+ nodes to the cluster and as long as that
> > particular drillbit is shut down, it distributes very nicely.  The minute
> > we start the drillbit on that node again, it starts swamping it with
> work.
> >
> > I'll shoot through the JSON profiles and some more information on the
> > dataset etc. later today (Australian time!).
> >
> > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <sphillips@maprtech.com
> >
> > wrote:
> >
> > > I didn't notice at first that Adam said "no matter who the foreman is".
> > >
> > > Another suspicion I have is that our current logic for assigning work
> > will
> > > assign to the exact same nodes every time we query a particular table.
> > > Changing affinity factor may change it, but it will still be the same
> > every
> > > time. That is my suspicion, but I am not sure why shutting down the
> > > drillbit would improve performance. I would expect that shutting down
> the
> > > drillbit would result in a different drillbit becoming the hotspot.
> > >
> > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <ja...@apache.org>
> > > wrote:
> > >
> > > > On Steven's point, the node that the client connects to is not
> > currently
> > > > randomized.  Given your description of behavior, I'm not sure that
> > you're
> > > > hitting 2512 or just general undesirable distribution.
> > > >
> > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
> > > sphillips@maprtech.com>
> > > > wrote:
> > > >
> > > > > This is a known issue:
> > > > >
> > > > > https://issues.apache.org/jira/browse/DRILL-2512
> > > > >
> > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> > > > > aengelbrecht@maprtech.com> wrote:
> > > > >
> > > > > > What version of Drill are you running?
> > > > > >
> > > > > > Any hints when looking at the query profiles? Is the node that is
> > > being
> > > > > > hammered the foreman for the queries and most of the major
> > fragments
> > > > are
> > > > > > tied to the foreman?
> > > > > >
> > > > > > —Andries
> > > > > >
> > > > > >
> > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <
> dragoncurve@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi guys,
> > > > > > >
> > > > > > > I'm trying to understand how this could be possible.  I have a
> > > Hadoop
> > > > > > > cluster of a name node and two data nodes setup.  All have
> > > identical
> > > > > > specs
> > > > > > > in terms of CPU/RAM etc.
> > > > > > >
> > > > > > > The two data nodes have a replicated HDFS setup where I'm
> storing
> > > > some
> > > > > > > Parquet files.
> > > > > > >
> > > > > > > A Drill cluster (with Zookeeper) is running with Drillbits on
> all
> > > > three
> > > > > > > servers.
> > > > > > >
> > > > > > > When I submit a query to *any* of the Drillbits, no matter who
> > the
> > > > > > foreman
> > > > > > > is, one particular data node gets picked to do the vast
> majority
> > of
> > > > the
> > > > > > > work.
> > > > > > >
> > > > > > > We've even added three more task nodes to the cluster and
> > > everything
> > > > > > still
> > > > > > > puts a huge load on one particular server.
> > > > > > >
> > > > > > > There is nothing unique about this data node.  HDFS is fully
> > > > replicated
> > > > > > (no
> > > > > > > unreplicated blocks) to the other data node.
> > > > > > >
> > > > > > > I know that Drill tries to get data locality, so I'm wondering
> if
> > > > this
> > > > > is
> > > > > > > the cause, but this essentially swamping this data node with
> 100%
> > > CPU
> > > > > > usage
> > > > > > > while leaving the others barely doing any work.
> > > > > > >
> > > > > > > As soon as we shut down the Drillbit on this data node, query
> > > > > performance
> > > > > > > increases significantly.
> > > > > > >
> > > > > > > Any thoughts on how I can troubleshoot why Drill is picking
> that
> > > > > > particular
> > > > > > > node?
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > >  Steven Phillips
> > > > >  Software Engineer
> > > > >
> > > > >  mapr.com
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > >  Steven Phillips
> > >  Software Engineer
> > >
> > >  mapr.com
> > >
> >
>
>
>
> --
>  Steven Phillips
>  Software Engineer
>
>  mapr.com
>

Re: Drill favouring a particular Drillbit

Posted by Steven Phillips <sp...@maprtech.com>.
Actually, I believe a query submitted through the REST interface will
instantiate a DrillClient, which uses the same ZKClusterCoordinator that
sqlline uses, and thus the foreman for the query is not necessarily the
same drillbit it was submitted to. But I'm still not sure it's related
to DRILL-2512.
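
A quick way to see that in practice is to POST the same query at two
different drillbits and then check in the web UI which node ended up as the
foreman for each (the host names, table path and /query.json endpoint below
are placeholders for a typical setup, so adjust as needed):

  for host in node1 node2; do
    # The drillbit we POST to is not necessarily the one that becomes foreman
    curl -s -X POST -H "Content-Type: application/json" \
      "http://${host}:8047/query.json" \
      -d '{"queryType": "SQL", "query": "SELECT COUNT(*) FROM dfs.`/data/parquet`"}'
    echo
  done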

I'll wait for your additional info before speculating further.

On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <dr...@gmail.com> wrote:

> We actually setup a separate load balancer for port 8047 (we're submitting
> these queries via the REST API at the moment) so Zookeeper etc. is out of
> the equation, thus I doubt we're hitting DRILL-2512.
>
> When shutting down the "troublesome" drillbit, it starts parallelizing much
> nicer again.  We even added 10+ nodes to the cluster and as long as that
> particular drillbit is shut down, it distributes very nicely.  The minute
> we start the drillbit on that node again, it starts swamping it with work.
>
> I'll shoot through the JSON profiles and some more information on the
> dataset etc. later today (Australian time!).
>
> On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <sp...@maprtech.com>
> wrote:
>
> > I didn't notice at first that Adam said "no matter who the foreman is".
> >
> > Another suspicion I have is that our current logic for assigning work
> will
> > assign to the exact same nodes every time we query a particular table.
> > Changing affinity factor may change it, but it will still be the same
> every
> > time. That is my suspicion, but I am not sure why shutting down the
> > drillbit would improve performance. I would expect that shutting down the
> > drillbit would result in a different drillbit becoming the hotspot.
> >
> > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <ja...@apache.org>
> > wrote:
> >
> > > On Steven's point, the node that the client connects to is not
> currently
> > > randomized.  Given your description of behavior, I'm not sure that
> you're
> > > hitting 2512 or just general undesirable distribution.
> > >
> > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
> > sphillips@maprtech.com>
> > > wrote:
> > >
> > > > This is a known issue:
> > > >
> > > > https://issues.apache.org/jira/browse/DRILL-2512
> > > >
> > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> > > > aengelbrecht@maprtech.com> wrote:
> > > >
> > > > > What version of Drill are you running?
> > > > >
> > > > > Any hints when looking at the query profiles? Is the node that is
> > being
> > > > > hammered the foreman for the queries and most of the major
> fragments
> > > are
> > > > > tied to the foreman?
> > > > >
> > > > > —Andries
> > > > >
> > > > >
> > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <dr...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi guys,
> > > > > >
> > > > > > I'm trying to understand how this could be possible.  I have a
> > Hadoop
> > > > > > cluster of a name node and two data nodes setup.  All have
> > identical
> > > > > specs
> > > > > > in terms of CPU/RAM etc.
> > > > > >
> > > > > > The two data nodes have a replicated HDFS setup where I'm storing
> > > some
> > > > > > Parquet files.
> > > > > >
> > > > > > A Drill cluster (with Zookeeper) is running with Drillbits on all
> > > three
> > > > > > servers.
> > > > > >
> > > > > > When I submit a query to *any* of the Drillbits, no matter who
> the
> > > > > foreman
> > > > > > is, one particular data node gets picked to do the vast majority
> of
> > > the
> > > > > > work.
> > > > > >
> > > > > > We've even added three more task nodes to the cluster and
> > everything
> > > > > still
> > > > > > puts a huge load on one particular server.
> > > > > >
> > > > > > There is nothing unique about this data node.  HDFS is fully
> > > replicated
> > > > > (no
> > > > > > unreplicated blocks) to the other data node.
> > > > > >
> > > > > > I know that Drill tries to get data locality, so I'm wondering if
> > > this
> > > > is
> > > > > > the cause, but this essentially swamping this data node with 100%
> > CPU
> > > > > usage
> > > > > > while leaving the others barely doing any work.
> > > > > >
> > > > > > As soon as we shut down the Drillbit on this data node, query
> > > > performance
> > > > > > increases significantly.
> > > > > >
> > > > > > Any thoughts on how I can troubleshoot why Drill is picking that
> > > > > particular
> > > > > > node?
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >  Steven Phillips
> > > >  Software Engineer
> > > >
> > > >  mapr.com
> > > >
> > >
> >
> >
> >
> > --
> >  Steven Phillips
> >  Software Engineer
> >
> >  mapr.com
> >
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Drill favouring a particular Drillbit

Posted by Adam Gilmore <dr...@gmail.com>.
We actually set up a separate load balancer for port 8047 (we're submitting
these queries via the REST API at the moment), so ZooKeeper etc. is out of
the equation, and thus I doubt we're hitting DRILL-2512.

When shutting down the "troublesome" drillbit, it starts parallelizing much
more nicely again.  We even added 10+ nodes to the cluster and as long as
that particular drillbit is shut down, work distributes very nicely.  The
minute we start the drillbit on that node again, Drill starts swamping it
with work.

I'll shoot through the JSON profiles and some more information on the
dataset etc. later today (Australian time!).
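
In case it helps with pulling the same info on your side, the profiles can
be fetched straight off the web UI port as JSON (the host and query id below
are placeholders, and the .json endpoints are from memory, so adjust as
needed):

  # List recent queries, then fetch the full JSON profile for one of them
  curl -s http://drill-lb:8047/profiles.json
  QUERY_ID="<query id from the Profiles page>"   # placeholder
  curl -s "http://drill-lb:8047/profiles/${QUERY_ID}.json" > profile.json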

On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <sp...@maprtech.com>
wrote:

> I didn't notice at first that Adam said "no matter who the foreman is".
>
> Another suspicion I have is that our current logic for assigning work will
> assign to the exact same nodes every time we query a particular table.
> Changing affinity factor may change it, but it will still be the same every
> time. That is my suspicion, but I am not sure why shutting down the
> drillbit would improve performance. I would expect that shutting down the
> drillbit would result in a different drillbit becoming the hotspot.
>
> On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <ja...@apache.org>
> wrote:
>
> > On Steven's point, the node that the client connects to is not currently
> > randomized.  Given your description of behavior, I'm not sure that you're
> > hitting 2512 or just general undesirable distribution.
> >
> > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
> sphillips@maprtech.com>
> > wrote:
> >
> > > This is a known issue:
> > >
> > > https://issues.apache.org/jira/browse/DRILL-2512
> > >
> > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> > > aengelbrecht@maprtech.com> wrote:
> > >
> > > > What version of Drill are you running?
> > > >
> > > > Any hints when looking at the query profiles? Is the node that is
> being
> > > > hammered the foreman for the queries and most of the major fragments
> > are
> > > > tied to the foreman?
> > > >
> > > > —Andries
> > > >
> > > >
> > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <dr...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi guys,
> > > > >
> > > > > I'm trying to understand how this could be possible.  I have a
> Hadoop
> > > > > cluster of a name node and two data nodes setup.  All have
> identical
> > > > specs
> > > > > in terms of CPU/RAM etc.
> > > > >
> > > > > The two data nodes have a replicated HDFS setup where I'm storing
> > some
> > > > > Parquet files.
> > > > >
> > > > > A Drill cluster (with Zookeeper) is running with Drillbits on all
> > three
> > > > > servers.
> > > > >
> > > > > When I submit a query to *any* of the Drillbits, no matter who the
> > > > foreman
> > > > > is, one particular data node gets picked to do the vast majority of
> > the
> > > > > work.
> > > > >
> > > > > We've even added three more task nodes to the cluster and
> everything
> > > > still
> > > > > puts a huge load on one particular server.
> > > > >
> > > > > There is nothing unique about this data node.  HDFS is fully
> > replicated
> > > > (no
> > > > > unreplicated blocks) to the other data node.
> > > > >
> > > > > I know that Drill tries to get data locality, so I'm wondering if
> > this
> > > is
> > > > > the cause, but this essentially swamping this data node with 100%
> CPU
> > > > usage
> > > > > while leaving the others barely doing any work.
> > > > >
> > > > > As soon as we shut down the Drillbit on this data node, query
> > > performance
> > > > > increases significantly.
> > > > >
> > > > > Any thoughts on how I can troubleshoot why Drill is picking that
> > > > particular
> > > > > node?
> > > >
> > > >
> > >
> > >
> > > --
> > >  Steven Phillips
> > >  Software Engineer
> > >
> > >  mapr.com
> > >
> >
>
>
>
> --
>  Steven Phillips
>  Software Engineer
>
>  mapr.com
>

Re: Drill favouring a particular Drillbit

Posted by Steven Phillips <sp...@maprtech.com>.
I didn't notice at first that Adam said "no matter who the foreman is".

Another suspicion I have is that our current logic for assigning work will
assign to the exact same nodes every time we query a particular table.
Changing the affinity factor may change it, but it will still be the same every
time. That is my suspicion, but I am not sure why shutting down the
drillbit would improve performance. I would expect that shutting down the
drillbit would result in a different drillbit becoming the hotspot.

On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <ja...@apache.org> wrote:

> On Steven's point, the node that the client connects to is not currently
> randomized.  Given your description of behavior, I'm not sure that you're
> hitting 2512 or just general undesirable distribution.
>
> On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <sp...@maprtech.com>
> wrote:
>
> > This is a known issue:
> >
> > https://issues.apache.org/jira/browse/DRILL-2512
> >
> > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> > aengelbrecht@maprtech.com> wrote:
> >
> > > What version of Drill are you running?
> > >
> > > Any hints when looking at the query profiles? Is the node that is being
> > > hammered the foreman for the queries and most of the major fragments
> are
> > > tied to the foreman?
> > >
> > > —Andries
> > >
> > >
> > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <dr...@gmail.com>
> > wrote:
> > >
> > > > Hi guys,
> > > >
> > > > I'm trying to understand how this could be possible.  I have a Hadoop
> > > > cluster of a name node and two data nodes setup.  All have identical
> > > specs
> > > > in terms of CPU/RAM etc.
> > > >
> > > > The two data nodes have a replicated HDFS setup where I'm storing
> some
> > > > Parquet files.
> > > >
> > > > A Drill cluster (with Zookeeper) is running with Drillbits on all
> three
> > > > servers.
> > > >
> > > > When I submit a query to *any* of the Drillbits, no matter who the
> > > foreman
> > > > is, one particular data node gets picked to do the vast majority of
> the
> > > > work.
> > > >
> > > > We've even added three more task nodes to the cluster and everything
> > > still
> > > > puts a huge load on one particular server.
> > > >
> > > > There is nothing unique about this data node.  HDFS is fully
> replicated
> > > (no
> > > > unreplicated blocks) to the other data node.
> > > >
> > > > I know that Drill tries to get data locality, so I'm wondering if
> this
> > is
> > > > the cause, but this essentially swamping this data node with 100% CPU
> > > usage
> > > > while leaving the others barely doing any work.
> > > >
> > > > As soon as we shut down the Drillbit on this data node, query
> > performance
> > > > increases significantly.
> > > >
> > > > Any thoughts on how I can troubleshoot why Drill is picking that
> > > particular
> > > > node?
> > >
> > >
> >
> >
> > --
> >  Steven Phillips
> >  Software Engineer
> >
> >  mapr.com
> >
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Drill favouring a particular Drillbit

Posted by Jacques Nadeau <ja...@apache.org>.
On Steven's point, the node that the client connects to is not currently
randomized.  Given your description of the behavior, I'm not sure whether
you're hitting 2512 or just seeing generally undesirable distribution.
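
One way to spread the load by hand in the meantime is to point different
clients at specific drillbits instead of going through ZooKeeper (connection
string syntax from memory; the hosts and port below are placeholders for the
defaults):

  # Client A connects directly to node1, client B to node2
  bin/sqlline -u "jdbc:drill:drillbit=node1:31010"
  bin/sqlline -u "jdbc:drill:drillbit=node2:31010"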

On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <sp...@maprtech.com>
wrote:

> This is a known issue:
>
> https://issues.apache.org/jira/browse/DRILL-2512
>
> On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
>
> > What version of Drill are you running?
> >
> > Any hints when looking at the query profiles? Is the node that is being
> > hammered the foreman for the queries and most of the major fragments are
> > tied to the foreman?
> >
> > —Andries
> >
> >
> > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <dr...@gmail.com>
> wrote:
> >
> > > Hi guys,
> > >
> > > I'm trying to understand how this could be possible.  I have a Hadoop
> > > cluster of a name node and two data nodes setup.  All have identical
> > specs
> > > in terms of CPU/RAM etc.
> > >
> > > The two data nodes have a replicated HDFS setup where I'm storing some
> > > Parquet files.
> > >
> > > A Drill cluster (with Zookeeper) is running with Drillbits on all three
> > > servers.
> > >
> > > When I submit a query to *any* of the Drillbits, no matter who the
> > foreman
> > > is, one particular data node gets picked to do the vast majority of the
> > > work.
> > >
> > > We've even added three more task nodes to the cluster and everything
> > still
> > > puts a huge load on one particular server.
> > >
> > > There is nothing unique about this data node.  HDFS is fully replicated
> > (no
> > > unreplicated blocks) to the other data node.
> > >
> > > I know that Drill tries to get data locality, so I'm wondering if this
> is
> > > the cause, but this essentially swamping this data node with 100% CPU
> > usage
> > > while leaving the others barely doing any work.
> > >
> > > As soon as we shut down the Drillbit on this data node, query
> performance
> > > increases significantly.
> > >
> > > Any thoughts on how I can troubleshoot why Drill is picking that
> > particular
> > > node?
> >
> >
>
>
> --
>  Steven Phillips
>  Software Engineer
>
>  mapr.com
>

Re: Drill favouring a particular Drillbit

Posted by Steven Phillips <sp...@maprtech.com>.
This is a known issue:

https://issues.apache.org/jira/browse/DRILL-2512

On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> What version of Drill are you running?
>
> Any hints when looking at the query profiles? Is the node that is being
> hammered the foreman for the queries and most of the major fragments are
> tied to the foreman?
>
> —Andries
>
>
> On Mar 25, 2015, at 12:00 AM, Adam Gilmore <dr...@gmail.com> wrote:
>
> > Hi guys,
> >
> > I'm trying to understand how this could be possible.  I have a Hadoop
> > cluster of a name node and two data nodes setup.  All have identical
> specs
> > in terms of CPU/RAM etc.
> >
> > The two data nodes have a replicated HDFS setup where I'm storing some
> > Parquet files.
> >
> > A Drill cluster (with Zookeeper) is running with Drillbits on all three
> > servers.
> >
> > When I submit a query to *any* of the Drillbits, no matter who the
> foreman
> > is, one particular data node gets picked to do the vast majority of the
> > work.
> >
> > We've even added three more task nodes to the cluster and everything
> still
> > puts a huge load on one particular server.
> >
> > There is nothing unique about this data node.  HDFS is fully replicated
> (no
> > unreplicated blocks) to the other data node.
> >
> > I know that Drill tries to get data locality, so I'm wondering if this is
> > the cause, but this essentially swamping this data node with 100% CPU
> usage
> > while leaving the others barely doing any work.
> >
> > As soon as we shut down the Drillbit on this data node, query performance
> > increases significantly.
> >
> > Any thoughts on how I can troubleshoot why Drill is picking that
> particular
> > node?
>
>


-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Drill favouring a particular Drillbit

Posted by Andries Engelbrecht <ae...@maprtech.com>.
What version of Drill are you running?

Any hints when looking at the query profiles? Is the node that is being hammered the foreman for the queries, and are most of the major fragments tied to the foreman?
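
For a quick check, you can pull the JSON profile for one of the slow queries
off the web UI and count how many fragment endpoints land on each host (the
URL pattern and the "address" field name are from memory, so treat this as a
sketch):

  # foreman-host and QUERY_ID are placeholders taken from the Profiles page
  QUERY_ID="<query id from the Profiles page>"
  curl -s "http://foreman-host:8047/profiles/${QUERY_ID}.json" \
    | grep -o '"address"[^,}]*' | sort | uniq -c | sort -rn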

—Andries


On Mar 25, 2015, at 12:00 AM, Adam Gilmore <dr...@gmail.com> wrote:

> Hi guys,
> 
> I'm trying to understand how this could be possible.  I have a Hadoop
> cluster of a name node and two data nodes setup.  All have identical specs
> in terms of CPU/RAM etc.
> 
> The two data nodes have a replicated HDFS setup where I'm storing some
> Parquet files.
> 
> A Drill cluster (with Zookeeper) is running with Drillbits on all three
> servers.
> 
> When I submit a query to *any* of the Drillbits, no matter who the foreman
> is, one particular data node gets picked to do the vast majority of the
> work.
> 
> We've even added three more task nodes to the cluster and everything still
> puts a huge load on one particular server.
> 
> There is nothing unique about this data node.  HDFS is fully replicated (no
> unreplicated blocks) to the other data node.
> 
> I know that Drill tries to get data locality, so I'm wondering if this is
> the cause, but this essentially swamping this data node with 100% CPU usage
> while leaving the others barely doing any work.
> 
> As soon as we shut down the Drillbit on this data node, query performance
> increases significantly.
> 
> Any thoughts on how I can troubleshoot why Drill is picking that particular
> node?