Posted to user@drill.apache.org by Adam Gilmore <dr...@gmail.com> on 2015/03/25 08:00:40 UTC

Drill favouring a particular Drillbit

Hi guys,

I'm trying to understand how this could be possible.  I have a Hadoop
cluster set up with a name node and two data nodes.  All have identical
specs in terms of CPU/RAM etc.

The two data nodes have a replicated HDFS setup where I'm storing some
Parquet files.

A Drill cluster (with Zookeeper) is running with Drillbits on all three
servers.

When I submit a query to *any* of the Drillbits, no matter who the foreman
is, one particular data node gets picked to do the vast majority of the
work.

We've even added three more task nodes to the cluster and everything still
puts a huge load on one particular server.

There is nothing unique about this data node.  HDFS is fully replicated (no
unreplicated blocks) to the other data node.

I know that Drill tries to get data locality, so I'm wondering if this is
the cause, but this is essentially swamping this data node with 100% CPU
usage while leaving the others barely doing any work.

As soon as we shut down the Drillbit on this data node, query performance
increases significantly.

Any thoughts on how I can troubleshoot why Drill is picking that particular
node?

Re: Drill favouring a particular Drillbit

Posted by Steven Phillips <sp...@maprtech.com>.
Adam,

Could you give more info regarding the dataset, including:

- number and size of the Parquet files
- block locations of the Parquet files
- drillbit hosts

If you could send the profile json files for a couple of queries, that
could be helpful too.
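
If it helps, a couple of standard Hadoop commands will show the file sizes
and the block/replica locations (the path below is just a placeholder for
wherever the Parquet files live):

  # list the files and their sizes (placeholder path)
  hadoop fs -ls /data/parquet/mytable

  # show per-file block counts, sizes and replica locations
  hdfs fsck /data/parquet/mytable -files -blocks -locations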

On Wed, Mar 25, 2015 at 11:23 AM, Jacques Nadeau <ja...@apache.org> wrote:

> Adam,
>
> There is actually an option to control how much Drill uses locality versus
> distribution.  Not sure if that is influencing you but it could be.  If so,
> you can decrease the value to increase the importance of distribution.  The
> option is `planner.affinity_factor`.
>
>
>
> On Wed, Mar 25, 2015 at 12:00 AM, Adam Gilmore <dr...@gmail.com>
> wrote:
>
> > Hi guys,
> >
> > I'm trying to understand how this could be possible.  I have a Hadoop
> > cluster of a name node and two data nodes setup.  All have identical
> specs
> > in terms of CPU/RAM etc.
> >
> > The two data nodes have a replicated HDFS setup where I'm storing some
> > Parquet files.
> >
> > A Drill cluster (with Zookeeper) is running with Drillbits on all three
> > servers.
> >
> > When I submit a query to *any* of the Drillbits, no matter who the
> foreman
> > is, one particular data node gets picked to do the vast majority of the
> > work.
> >
> > We've even added three more task nodes to the cluster and everything
> still
> > puts a huge load on one particular server.
> >
> > There is nothing unique about this data node.  HDFS is fully replicated
> (no
> > unreplicated blocks) to the other data node.
> >
> > I know that Drill tries to get data locality, so I'm wondering if this is
> > the cause, but this essentially swamping this data node with 100% CPU
> usage
> > while leaving the others barely doing any work.
> >
> > As soon as we shut down the Drillbit on this data node, query performance
> > increases significantly.
> >
> > Any thoughts on how I can troubleshoot why Drill is picking that
> particular
> > node?
> >
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Drill favouring a particular Drillbit

Posted by Jacques Nadeau <ja...@apache.org>.
Adam,

There is actually an option to control how much Drill uses locality versus
distribution.  Not sure if that is influencing you but it could be.  If so,
you can decrease the value to increase the importance of distribution.  The
option is `planner.affinity_factor`.
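
For example, a quick sqlline session to check the current value and lower it
for the session (the ZooKeeper host and the 0.0 value are only placeholders;
tune as needed):

  $ bin/sqlline -u jdbc:drill:zk=<zk-host>:2181
  > SELECT * FROM sys.options WHERE name = 'planner.affinity_factor';
  > ALTER SESSION SET `planner.affinity_factor` = 0.0;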



On Wed, Mar 25, 2015 at 12:00 AM, Adam Gilmore <dr...@gmail.com>
wrote:

> Hi guys,
>
> I'm trying to understand how this could be possible.  I have a Hadoop
> cluster of a name node and two data nodes setup.  All have identical specs
> in terms of CPU/RAM etc.
>
> The two data nodes have a replicated HDFS setup where I'm storing some
> Parquet files.
>
> A Drill cluster (with Zookeeper) is running with Drillbits on all three
> servers.
>
> When I submit a query to *any* of the Drillbits, no matter who the foreman
> is, one particular data node gets picked to do the vast majority of the
> work.
>
> We've even added three more task nodes to the cluster and everything still
> puts a huge load on one particular server.
>
> There is nothing unique about this data node.  HDFS is fully replicated (no
> unreplicated blocks) to the other data node.
>
> I know that Drill tries to get data locality, so I'm wondering if this is
> the cause, but this essentially swamping this data node with 100% CPU usage
> while leaving the others barely doing any work.
>
> As soon as we shut down the Drillbit on this data node, query performance
> increases significantly.
>
> Any thoughts on how I can troubleshoot why Drill is picking that particular
> node?
>

Re: Drill favouring a particular Drillbit

Posted by Adam Gilmore <dr...@gmail.com>.
It fixed the foreman issue perfectly and has significantly increased
performance in our test cases.  We're still struggling a bit with the data
affinity challenge, but it may be unrelated to Drill (as in our environment
the name/data nodes are doing the normal HDFS work as well, so it may be a
contention issue).

The shuffling is probably not the most optimal way to balance load, but
it's better than just picking the first as the foreman each time.

On Thu, Apr 16, 2015 at 10:29 AM, Jacques Nadeau <ja...@apache.org> wrote:

> It doesn't currently have plan caching but a simple implementation probably
> wouldn't be that difficult (assuming you keep it node-level as opposed to
> cluster level).  We merged the auto shuffling per session so let us know
> how that looks.
>
> On Wed, Apr 15, 2015 at 4:35 PM, Adam Gilmore <dr...@gmail.com>
> wrote:
>
> > The workload does involve a fair number of short queries.  Although when
> I
> > say short, I'm talking about querying 2-10 million record Parquet files,
> so
> > they're not extremely short.
> >
> > Does Drill have plan caching built in at this stage?  Might help us
> reduce
> > some of that foreman overhead.
> >
> > On Tue, Apr 14, 2015 at 3:02 AM, Jacques Nadeau <ja...@apache.org>
> > wrote:
> >
> > > Yeah, it seems that way.  We should get your patch merged.  I just
> > reviewed
> > > and lgtm.
> > >
> > > What type of workload are you running?  Unless your workload is
> planning
> > > heavy (e.g. lots of short queries) or does a lot of sorts (the last
> merge
> > > is on the foreman node), work should be reasonably distributed.
> > >
> > > On Sun, Apr 12, 2015 at 10:29 PM, Adam Gilmore <dr...@gmail.com>
> > > wrote:
> > >
> > > > Looks like this definitely is the following bug:
> > > >
> > > > https://issues.apache.org/jira/browse/DRILL-2512
> > > >
> > > > It's a pretty severe performance bottleneck having the foreman doing
> so
> > > > much work.  In our environment, the foreman hits basically 95-100%
> CPU
> > > > while the other drillbits barely do much work.  Means it's nearly
> > > > impossible for us to scale out.
> > > >
> > > > On Wed, Apr 8, 2015 at 3:58 PM, Adam Gilmore <dr...@gmail.com>
> > > > wrote:
> > > >
> > > > > Anyone have any more thoughts on this?  Anywhere I can start trying
> > to
> > > > > troubleshoot?
> > > > >
> > > > > On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <
> dragoncurve@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > >> So there are 5 Parquet files, each ~125mb - not sure what I can
> > > provide
> > > > >> re the block locations?  I believe it's under the HDFS block size
> so
> > > > they
> > > > >> should be stored contiguously.
> > > > >>
> > > > >> I've tried setting the affinity factor to various values (1, 0,
> > etc.)
> > > > but
> > > > >> nothing seems to change that.  It always prefers certain nodes.
> > > > >>
> > > > >> Moreover, we added a stack more nodes and it started picking very
> > > > >> specific nodes as foremen (perhaps 2-3 nodes out of 20 were always
> > > > picked
> > > > >> as foremen).  Therefore, the foremen were being swamped with CPU
> > while
> > > > the
> > > > >> other nodes were doing very little work.
> > > > >>
> > > > >> On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <
> > > > sphillips@maprtech.com
> > > > >> > wrote:
> > > > >>
> > > > >>> Actually, I believe a query submitted through REST interface will
> > > > >>> instantiate a DrillClient, which uses the same
> ZKClusterCoordinator
> > > > that
> > > > >>> sqlline uses, and thus the foreman for the query is not
> necessarily
> > > on
> > > > >>> the
> > > > >>> same drillbit as it was submitted to. But I'm still not sure it's
> > > > related
> > > > >>> to DRILL-2512.
> > > > >>>
> > > > >>> I'll wait for your additional info before speculating further.
> > > > >>>
> > > > >>> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <
> > dragoncurve@gmail.com
> > > >
> > > > >>> wrote:
> > > > >>>
> > > > >>> > We actually setup a separate load balancer for port 8047 (we're
> > > > >>> submitting
> > > > >>> > these queries via the REST API at the moment) so Zookeeper etc.
> > is
> > > > out
> > > > >>> of
> > > > >>> > the equation, thus I doubt we're hitting DRILL-2512.
> > > > >>> >
> > > > >>> > When shutitng down the "troublesome" drillbit, it starts
> > > > parallelizing
> > > > >>> much
> > > > >>> > nicer again.  We even added 10+ nodes to the cluster and as
> long
> > as
> > > > >>> that
> > > > >>> > particular drillbit is shut down, it distributes very nicely.
> > The
> > > > >>> minute
> > > > >>> > we start the drillbit on that node again, it starts swamping it
> > > with
> > > > >>> work.
> > > > >>> >
> > > > >>> > I'll shoot through the JSON profiles and some more information
> on
> > > the
> > > > >>> > dataset etc. later today (Australian time!).
> > > > >>> >
> > > > >>> > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <
> > > > >>> sphillips@maprtech.com>
> > > > >>> > wrote:
> > > > >>> >
> > > > >>> > > I didn't notice at first that Adam said "no matter who the
> > > foreman
> > > > >>> is".
> > > > >>> > >
> > > > >>> > > Another suspicion I have is that our current logic for
> > assigning
> > > > work
> > > > >>> > will
> > > > >>> > > assign to the exact same nodes every time we query a
> particular
> > > > >>> table.
> > > > >>> > > Changing affinity factor may change it, but it will still be
> > the
> > > > same
> > > > >>> > every
> > > > >>> > > time. That is my suspicion, but I am not sure why shutting
> down
> > > the
> > > > >>> > > drillbit would improve performance. I would expect that
> > shutting
> > > > >>> down the
> > > > >>> > > drillbit would result in a different drillbit becoming the
> > > hotspot.
> > > > >>> > >
> > > > >>> > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <
> > > > jacques@apache.org
> > > > >>> >
> > > > >>> > > wrote:
> > > > >>> > >
> > > > >>> > > > On Steven's point, the node that the client connects to is
> > not
> > > > >>> > currently
> > > > >>> > > > randomized.  Given your description of behavior, I'm not
> sure
> > > > that
> > > > >>> > you're
> > > > >>> > > > hitting 2512 or just general undesirable distribution.
> > > > >>> > > >
> > > > >>> > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
> > > > >>> > > sphillips@maprtech.com>
> > > > >>> > > > wrote:
> > > > >>> > > >
> > > > >>> > > > > This is a known issue:
> > > > >>> > > > >
> > > > >>> > > > > https://issues.apache.org/jira/browse/DRILL-2512
> > > > >>> > > > >
> > > > >>> > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> > > > >>> > > > > aengelbrecht@maprtech.com> wrote:
> > > > >>> > > > >
> > > > >>> > > > > > What version of Drill are you running?
> > > > >>> > > > > >
> > > > >>> > > > > > Any hints when looking at the query profiles? Is the
> node
> > > > that
> > > > >>> is
> > > > >>> > > being
> > > > >>> > > > > > hammered the foreman for the queries and most of the
> > major
> > > > >>> > fragments
> > > > >>> > > > are
> > > > >>> > > > > > tied to the foreman?
> > > > >>> > > > > >
> > > > >>> > > > > > —Andries
> > > > >>> > > > > >
> > > > >>> > > > > >
> > > > >>> > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <
> > > > >>> dragoncurve@gmail.com>
> > > > >>> > > > > wrote:
> > > > >>> > > > > >
> > > > >>> > > > > > > Hi guys,
> > > > >>> > > > > > >
> > > > >>> > > > > > > I'm trying to understand how this could be
> possible.  I
> > > > have
> > > > >>> a
> > > > >>> > > Hadoop
> > > > >>> > > > > > > cluster of a name node and two data nodes setup.  All
> > > have
> > > > >>> > > identical
> > > > >>> > > > > > specs
> > > > >>> > > > > > > in terms of CPU/RAM etc.
> > > > >>> > > > > > >
> > > > >>> > > > > > > The two data nodes have a replicated HDFS setup where
> > I'm
> > > > >>> storing
> > > > >>> > > > some
> > > > >>> > > > > > > Parquet files.
> > > > >>> > > > > > >
> > > > >>> > > > > > > A Drill cluster (with Zookeeper) is running with
> > > Drillbits
> > > > >>> on all
> > > > >>> > > > three
> > > > >>> > > > > > > servers.
> > > > >>> > > > > > >
> > > > >>> > > > > > > When I submit a query to *any* of the Drillbits, no
> > > matter
> > > > >>> who
> > > > >>> > the
> > > > >>> > > > > > foreman
> > > > >>> > > > > > > is, one particular data node gets picked to do the
> vast
> > > > >>> majority
> > > > >>> > of
> > > > >>> > > > the
> > > > >>> > > > > > > work.
> > > > >>> > > > > > >
> > > > >>> > > > > > > We've even added three more task nodes to the cluster
> > and
> > > > >>> > > everything
> > > > >>> > > > > > still
> > > > >>> > > > > > > puts a huge load on one particular server.
> > > > >>> > > > > > >
> > > > >>> > > > > > > There is nothing unique about this data node.  HDFS
> is
> > > > fully
> > > > >>> > > > replicated
> > > > >>> > > > > > (no
> > > > >>> > > > > > > unreplicated blocks) to the other data node.
> > > > >>> > > > > > >
> > > > >>> > > > > > > I know that Drill tries to get data locality, so I'm
> > > > >>> wondering if
> > > > >>> > > > this
> > > > >>> > > > > is
> > > > >>> > > > > > > the cause, but this essentially swamping this data
> node
> > > > with
> > > > >>> 100%
> > > > >>> > > CPU
> > > > >>> > > > > > usage
> > > > >>> > > > > > > while leaving the others barely doing any work.
> > > > >>> > > > > > >
> > > > >>> > > > > > > As soon as we shut down the Drillbit on this data
> node,
> > > > query
> > > > >>> > > > > performance
> > > > >>> > > > > > > increases significantly.
> > > > >>> > > > > > >
> > > > >>> > > > > > > Any thoughts on how I can troubleshoot why Drill is
> > > picking
> > > > >>> that
> > > > >>> > > > > > particular
> > > > >>> > > > > > > node?
> > > > >>> > > > > >
> > > > >>> > > > > >
> > > > >>> > > > >
> > > > >>> > > > >
> > > > >>> > > > > --
> > > > >>> > > > >  Steven Phillips
> > > > >>> > > > >  Software Engineer
> > > > >>> > > > >
> > > > >>> > > > >  mapr.com
> > > > >>> > > > >
> > > > >>> > > >
> > > > >>> > >
> > > > >>> > >
> > > > >>> > >
> > > > >>> > > --
> > > > >>> > >  Steven Phillips
> > > > >>> > >  Software Engineer
> > > > >>> > >
> > > > >>> > >  mapr.com
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> --
> > > > >>>  Steven Phillips
> > > > >>>  Software Engineer
> > > > >>>
> > > > >>>  mapr.com
> > > > >>>
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: Drill favouring a particular Drillbit

Posted by Jacques Nadeau <ja...@apache.org>.
It doesn't currently have plan caching, but a simple implementation probably
wouldn't be that difficult (assuming you keep it node-level as opposed to
cluster-level).  We merged the auto shuffling per session, so let us know
how that looks.

On Wed, Apr 15, 2015 at 4:35 PM, Adam Gilmore <dr...@gmail.com> wrote:

> The workload does involve a fair number of short queries.  Although when I
> say short, I'm talking about querying 2-10 million record Parquet files, so
> they're not extremely short.
>
> Does Drill have plan caching built in at this stage?  Might help us reduce
> some of that foreman overhead.
>
> On Tue, Apr 14, 2015 at 3:02 AM, Jacques Nadeau <ja...@apache.org>
> wrote:
>
> > Yeah, it seems that way.  We should get your patch merged.  I just
> reviewed
> > and lgtm.
> >
> > What type of workload are you running?  Unless your workload is planning
> > heavy (e.g. lots of short queries) or does a lot of sorts (the last merge
> > is on the foreman node), work should be reasonably distributed.
> >
> > On Sun, Apr 12, 2015 at 10:29 PM, Adam Gilmore <dr...@gmail.com>
> > wrote:
> >
> > > Looks like this definitely is the following bug:
> > >
> > > https://issues.apache.org/jira/browse/DRILL-2512
> > >
> > > It's a pretty severe performance bottleneck having the foreman doing so
> > > much work.  In our environment, the foreman hits basically 95-100% CPU
> > > while the other drillbits barely do much work.  Means it's nearly
> > > impossible for us to scale out.
> > >
> > > On Wed, Apr 8, 2015 at 3:58 PM, Adam Gilmore <dr...@gmail.com>
> > > wrote:
> > >
> > > > Anyone have any more thoughts on this?  Anywhere I can start trying
> to
> > > > troubleshoot?
> > > >
> > > > On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <dragoncurve@gmail.com
> >
> > > > wrote:
> > > >
> > > >> So there are 5 Parquet files, each ~125mb - not sure what I can
> > provide
> > > >> re the block locations?  I believe it's under the HDFS block size so
> > > they
> > > >> should be stored contiguously.
> > > >>
> > > >> I've tried setting the affinity factor to various values (1, 0,
> etc.)
> > > but
> > > >> nothing seems to change that.  It always prefers certain nodes.
> > > >>
> > > >> Moreover, we added a stack more nodes and it started picking very
> > > >> specific nodes as foremen (perhaps 2-3 nodes out of 20 were always
> > > picked
> > > >> as foremen).  Therefore, the foremen were being swamped with CPU
> while
> > > the
> > > >> other nodes were doing very little work.
> > > >>
> > > >> On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <
> > > sphillips@maprtech.com
> > > >> > wrote:
> > > >>
> > > >>> Actually, I believe a query submitted through REST interface will
> > > >>> instantiate a DrillClient, which uses the same ZKClusterCoordinator
> > > that
> > > >>> sqlline uses, and thus the foreman for the query is not necessarily
> > on
> > > >>> the
> > > >>> same drillbit as it was submitted to. But I'm still not sure it's
> > > related
> > > >>> to DRILL-2512.
> > > >>>
> > > >>> I'll wait for your additional info before speculating further.
> > > >>>
> > > >>> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <
> dragoncurve@gmail.com
> > >
> > > >>> wrote:
> > > >>>
> > > >>> > We actually setup a separate load balancer for port 8047 (we're
> > > >>> submitting
> > > >>> > these queries via the REST API at the moment) so Zookeeper etc.
> is
> > > out
> > > >>> of
> > > >>> > the equation, thus I doubt we're hitting DRILL-2512.
> > > >>> >
> > > >>> > When shutitng down the "troublesome" drillbit, it starts
> > > parallelizing
> > > >>> much
> > > >>> > nicer again.  We even added 10+ nodes to the cluster and as long
> as
> > > >>> that
> > > >>> > particular drillbit is shut down, it distributes very nicely.
> The
> > > >>> minute
> > > >>> > we start the drillbit on that node again, it starts swamping it
> > with
> > > >>> work.
> > > >>> >
> > > >>> > I'll shoot through the JSON profiles and some more information on
> > the
> > > >>> > dataset etc. later today (Australian time!).
> > > >>> >
> > > >>> > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <
> > > >>> sphillips@maprtech.com>
> > > >>> > wrote:
> > > >>> >
> > > >>> > > I didn't notice at first that Adam said "no matter who the
> > foreman
> > > >>> is".
> > > >>> > >
> > > >>> > > Another suspicion I have is that our current logic for
> assigning
> > > work
> > > >>> > will
> > > >>> > > assign to the exact same nodes every time we query a particular
> > > >>> table.
> > > >>> > > Changing affinity factor may change it, but it will still be
> the
> > > same
> > > >>> > every
> > > >>> > > time. That is my suspicion, but I am not sure why shutting down
> > the
> > > >>> > > drillbit would improve performance. I would expect that
> shutting
> > > >>> down the
> > > >>> > > drillbit would result in a different drillbit becoming the
> > hotspot.
> > > >>> > >
> > > >>> > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <
> > > jacques@apache.org
> > > >>> >
> > > >>> > > wrote:
> > > >>> > >
> > > >>> > > > On Steven's point, the node that the client connects to is
> not
> > > >>> > currently
> > > >>> > > > randomized.  Given your description of behavior, I'm not sure
> > > that
> > > >>> > you're
> > > >>> > > > hitting 2512 or just general undesirable distribution.
> > > >>> > > >
> > > >>> > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
> > > >>> > > sphillips@maprtech.com>
> > > >>> > > > wrote:
> > > >>> > > >
> > > >>> > > > > This is a known issue:
> > > >>> > > > >
> > > >>> > > > > https://issues.apache.org/jira/browse/DRILL-2512
> > > >>> > > > >
> > > >>> > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> > > >>> > > > > aengelbrecht@maprtech.com> wrote:
> > > >>> > > > >
> > > >>> > > > > > What version of Drill are you running?
> > > >>> > > > > >
> > > >>> > > > > > Any hints when looking at the query profiles? Is the node
> > > that
> > > >>> is
> > > >>> > > being
> > > >>> > > > > > hammered the foreman for the queries and most of the
> major
> > > >>> > fragments
> > > >>> > > > are
> > > >>> > > > > > tied to the foreman?
> > > >>> > > > > >
> > > >>> > > > > > —Andries
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <
> > > >>> dragoncurve@gmail.com>
> > > >>> > > > > wrote:
> > > >>> > > > > >
> > > >>> > > > > > > Hi guys,
> > > >>> > > > > > >
> > > >>> > > > > > > I'm trying to understand how this could be possible.  I
> > > have
> > > >>> a
> > > >>> > > Hadoop
> > > >>> > > > > > > cluster of a name node and two data nodes setup.  All
> > have
> > > >>> > > identical
> > > >>> > > > > > specs
> > > >>> > > > > > > in terms of CPU/RAM etc.
> > > >>> > > > > > >
> > > >>> > > > > > > The two data nodes have a replicated HDFS setup where
> I'm
> > > >>> storing
> > > >>> > > > some
> > > >>> > > > > > > Parquet files.
> > > >>> > > > > > >
> > > >>> > > > > > > A Drill cluster (with Zookeeper) is running with
> > Drillbits
> > > >>> on all
> > > >>> > > > three
> > > >>> > > > > > > servers.
> > > >>> > > > > > >
> > > >>> > > > > > > When I submit a query to *any* of the Drillbits, no
> > matter
> > > >>> who
> > > >>> > the
> > > >>> > > > > > foreman
> > > >>> > > > > > > is, one particular data node gets picked to do the vast
> > > >>> majority
> > > >>> > of
> > > >>> > > > the
> > > >>> > > > > > > work.
> > > >>> > > > > > >
> > > >>> > > > > > > We've even added three more task nodes to the cluster
> and
> > > >>> > > everything
> > > >>> > > > > > still
> > > >>> > > > > > > puts a huge load on one particular server.
> > > >>> > > > > > >
> > > >>> > > > > > > There is nothing unique about this data node.  HDFS is
> > > fully
> > > >>> > > > replicated
> > > >>> > > > > > (no
> > > >>> > > > > > > unreplicated blocks) to the other data node.
> > > >>> > > > > > >
> > > >>> > > > > > > I know that Drill tries to get data locality, so I'm
> > > >>> wondering if
> > > >>> > > > this
> > > >>> > > > > is
> > > >>> > > > > > > the cause, but this essentially swamping this data node
> > > with
> > > >>> 100%
> > > >>> > > CPU
> > > >>> > > > > > usage
> > > >>> > > > > > > while leaving the others barely doing any work.
> > > >>> > > > > > >
> > > >>> > > > > > > As soon as we shut down the Drillbit on this data node,
> > > query
> > > >>> > > > > performance
> > > >>> > > > > > > increases significantly.
> > > >>> > > > > > >
> > > >>> > > > > > > Any thoughts on how I can troubleshoot why Drill is
> > picking
> > > >>> that
> > > >>> > > > > > particular
> > > >>> > > > > > > node?
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > > --
> > > >>> > > > >  Steven Phillips
> > > >>> > > > >  Software Engineer
> > > >>> > > > >
> > > >>> > > > >  mapr.com
> > > >>> > > > >
> > > >>> > > >
> > > >>> > >
> > > >>> > >
> > > >>> > >
> > > >>> > > --
> > > >>> > >  Steven Phillips
> > > >>> > >  Software Engineer
> > > >>> > >
> > > >>> > >  mapr.com
> > > >>> > >
> > > >>> >
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>>  Steven Phillips
> > > >>>  Software Engineer
> > > >>>
> > > >>>  mapr.com
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
>

Re: Drill favouring a particular Drillbit

Posted by Adam Gilmore <dr...@gmail.com>.
The workload does involve a fair number of short queries.  Although when I
say short, I'm talking about querying 2-10 million-record Parquet files, so
they're not extremely short.

Does Drill have plan caching built in at this stage?  Might help us reduce
some of that foreman overhead.

On Tue, Apr 14, 2015 at 3:02 AM, Jacques Nadeau <ja...@apache.org> wrote:

> Yeah, it seems that way.  We should get your patch merged.  I just reviewed
> and lgtm.
>
> What type of workload are you running?  Unless your workload is planning
> heavy (e.g. lots of short queries) or does a lot of sorts (the last merge
> is on the foreman node), work should be reasonably distributed.
>
> On Sun, Apr 12, 2015 at 10:29 PM, Adam Gilmore <dr...@gmail.com>
> wrote:
>
> > Looks like this definitely is the following bug:
> >
> > https://issues.apache.org/jira/browse/DRILL-2512
> >
> > It's a pretty severe performance bottleneck having the foreman doing so
> > much work.  In our environment, the foreman hits basically 95-100% CPU
> > while the other drillbits barely do much work.  Means it's nearly
> > impossible for us to scale out.
> >
> > On Wed, Apr 8, 2015 at 3:58 PM, Adam Gilmore <dr...@gmail.com>
> > wrote:
> >
> > > Anyone have any more thoughts on this?  Anywhere I can start trying to
> > > troubleshoot?
> > >
> > > On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <dr...@gmail.com>
> > > wrote:
> > >
> > >> So there are 5 Parquet files, each ~125mb - not sure what I can
> provide
> > >> re the block locations?  I believe it's under the HDFS block size so
> > they
> > >> should be stored contiguously.
> > >>
> > >> I've tried setting the affinity factor to various values (1, 0, etc.)
> > but
> > >> nothing seems to change that.  It always prefers certain nodes.
> > >>
> > >> Moreover, we added a stack more nodes and it started picking very
> > >> specific nodes as foremen (perhaps 2-3 nodes out of 20 were always
> > picked
> > >> as foremen).  Therefore, the foremen were being swamped with CPU while
> > the
> > >> other nodes were doing very little work.
> > >>
> > >> On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <
> > sphillips@maprtech.com
> > >> > wrote:
> > >>
> > >>> Actually, I believe a query submitted through REST interface will
> > >>> instantiate a DrillClient, which uses the same ZKClusterCoordinator
> > that
> > >>> sqlline uses, and thus the foreman for the query is not necessarily
> on
> > >>> the
> > >>> same drillbit as it was submitted to. But I'm still not sure it's
> > related
> > >>> to DRILL-2512.
> > >>>
> > >>> I'll wait for your additional info before speculating further.
> > >>>
> > >>> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <dragoncurve@gmail.com
> >
> > >>> wrote:
> > >>>
> > >>> > We actually setup a separate load balancer for port 8047 (we're
> > >>> submitting
> > >>> > these queries via the REST API at the moment) so Zookeeper etc. is
> > out
> > >>> of
> > >>> > the equation, thus I doubt we're hitting DRILL-2512.
> > >>> >
> > >>> > When shutitng down the "troublesome" drillbit, it starts
> > parallelizing
> > >>> much
> > >>> > nicer again.  We even added 10+ nodes to the cluster and as long as
> > >>> that
> > >>> > particular drillbit is shut down, it distributes very nicely.  The
> > >>> minute
> > >>> > we start the drillbit on that node again, it starts swamping it
> with
> > >>> work.
> > >>> >
> > >>> > I'll shoot through the JSON profiles and some more information on
> the
> > >>> > dataset etc. later today (Australian time!).
> > >>> >
> > >>> > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <
> > >>> sphillips@maprtech.com>
> > >>> > wrote:
> > >>> >
> > >>> > > I didn't notice at first that Adam said "no matter who the
> foreman
> > >>> is".
> > >>> > >
> > >>> > > Another suspicion I have is that our current logic for assigning
> > work
> > >>> > will
> > >>> > > assign to the exact same nodes every time we query a particular
> > >>> table.
> > >>> > > Changing affinity factor may change it, but it will still be the
> > same
> > >>> > every
> > >>> > > time. That is my suspicion, but I am not sure why shutting down
> the
> > >>> > > drillbit would improve performance. I would expect that shutting
> > >>> down the
> > >>> > > drillbit would result in a different drillbit becoming the
> hotspot.
> > >>> > >
> > >>> > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <
> > jacques@apache.org
> > >>> >
> > >>> > > wrote:
> > >>> > >
> > >>> > > > On Steven's point, the node that the client connects to is not
> > >>> > currently
> > >>> > > > randomized.  Given your description of behavior, I'm not sure
> > that
> > >>> > you're
> > >>> > > > hitting 2512 or just general undesirable distribution.
> > >>> > > >
> > >>> > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
> > >>> > > sphillips@maprtech.com>
> > >>> > > > wrote:
> > >>> > > >
> > >>> > > > > This is a known issue:
> > >>> > > > >
> > >>> > > > > https://issues.apache.org/jira/browse/DRILL-2512
> > >>> > > > >
> > >>> > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> > >>> > > > > aengelbrecht@maprtech.com> wrote:
> > >>> > > > >
> > >>> > > > > > What version of Drill are you running?
> > >>> > > > > >
> > >>> > > > > > Any hints when looking at the query profiles? Is the node
> > that
> > >>> is
> > >>> > > being
> > >>> > > > > > hammered the foreman for the queries and most of the major
> > >>> > fragments
> > >>> > > > are
> > >>> > > > > > tied to the foreman?
> > >>> > > > > >
> > >>> > > > > > —Andries
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <
> > >>> dragoncurve@gmail.com>
> > >>> > > > > wrote:
> > >>> > > > > >
> > >>> > > > > > > Hi guys,
> > >>> > > > > > >
> > >>> > > > > > > I'm trying to understand how this could be possible.  I
> > have
> > >>> a
> > >>> > > Hadoop
> > >>> > > > > > > cluster of a name node and two data nodes setup.  All
> have
> > >>> > > identical
> > >>> > > > > > specs
> > >>> > > > > > > in terms of CPU/RAM etc.
> > >>> > > > > > >
> > >>> > > > > > > The two data nodes have a replicated HDFS setup where I'm
> > >>> storing
> > >>> > > > some
> > >>> > > > > > > Parquet files.
> > >>> > > > > > >
> > >>> > > > > > > A Drill cluster (with Zookeeper) is running with
> Drillbits
> > >>> on all
> > >>> > > > three
> > >>> > > > > > > servers.
> > >>> > > > > > >
> > >>> > > > > > > When I submit a query to *any* of the Drillbits, no
> matter
> > >>> who
> > >>> > the
> > >>> > > > > > foreman
> > >>> > > > > > > is, one particular data node gets picked to do the vast
> > >>> majority
> > >>> > of
> > >>> > > > the
> > >>> > > > > > > work.
> > >>> > > > > > >
> > >>> > > > > > > We've even added three more task nodes to the cluster and
> > >>> > > everything
> > >>> > > > > > still
> > >>> > > > > > > puts a huge load on one particular server.
> > >>> > > > > > >
> > >>> > > > > > > There is nothing unique about this data node.  HDFS is
> > fully
> > >>> > > > replicated
> > >>> > > > > > (no
> > >>> > > > > > > unreplicated blocks) to the other data node.
> > >>> > > > > > >
> > >>> > > > > > > I know that Drill tries to get data locality, so I'm
> > >>> wondering if
> > >>> > > > this
> > >>> > > > > is
> > >>> > > > > > > the cause, but this essentially swamping this data node
> > with
> > >>> 100%
> > >>> > > CPU
> > >>> > > > > > usage
> > >>> > > > > > > while leaving the others barely doing any work.
> > >>> > > > > > >
> > >>> > > > > > > As soon as we shut down the Drillbit on this data node,
> > query
> > >>> > > > > performance
> > >>> > > > > > > increases significantly.
> > >>> > > > > > >
> > >>> > > > > > > Any thoughts on how I can troubleshoot why Drill is
> picking
> > >>> that
> > >>> > > > > > particular
> > >>> > > > > > > node?
> > >>> > > > > >
> > >>> > > > > >
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > --
> > >>> > > > >  Steven Phillips
> > >>> > > > >  Software Engineer
> > >>> > > > >
> > >>> > > > >  mapr.com
> > >>> > > > >
> > >>> > > >
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> > > --
> > >>> > >  Steven Phillips
> > >>> > >  Software Engineer
> > >>> > >
> > >>> > >  mapr.com
> > >>> > >
> > >>> >
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>>  Steven Phillips
> > >>>  Software Engineer
> > >>>
> > >>>  mapr.com
> > >>>
> > >>
> > >>
> > >
> >
>

Re: Drill favouring a particular Drillbit

Posted by Jacques Nadeau <ja...@apache.org>.
Yeah, it seems that way.  We should get your patch merged.  I just reviewed
and lgtm.

What type of workload are you running?  Unless your workload is
planning-heavy (e.g. lots of short queries) or does a lot of sorts (the last
merge is on the foreman node), work should be reasonably distributed.

On Sun, Apr 12, 2015 at 10:29 PM, Adam Gilmore <dr...@gmail.com>
wrote:

> Looks like this definitely is the following bug:
>
> https://issues.apache.org/jira/browse/DRILL-2512
>
> It's a pretty severe performance bottleneck having the foreman doing so
> much work.  In our environment, the foreman hits basically 95-100% CPU
> while the other drillbits barely do much work.  Means it's nearly
> impossible for us to scale out.
>
> On Wed, Apr 8, 2015 at 3:58 PM, Adam Gilmore <dr...@gmail.com>
> wrote:
>
> > Anyone have any more thoughts on this?  Anywhere I can start trying to
> > troubleshoot?
> >
> > On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <dr...@gmail.com>
> > wrote:
> >
> >> So there are 5 Parquet files, each ~125mb - not sure what I can provide
> >> re the block locations?  I believe it's under the HDFS block size so
> they
> >> should be stored contiguously.
> >>
> >> I've tried setting the affinity factor to various values (1, 0, etc.)
> but
> >> nothing seems to change that.  It always prefers certain nodes.
> >>
> >> Moreover, we added a stack more nodes and it started picking very
> >> specific nodes as foremen (perhaps 2-3 nodes out of 20 were always
> picked
> >> as foremen).  Therefore, the foremen were being swamped with CPU while
> the
> >> other nodes were doing very little work.
> >>
> >> On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <
> sphillips@maprtech.com
> >> > wrote:
> >>
> >>> Actually, I believe a query submitted through REST interface will
> >>> instantiate a DrillClient, which uses the same ZKClusterCoordinator
> that
> >>> sqlline uses, and thus the foreman for the query is not necessarily on
> >>> the
> >>> same drillbit as it was submitted to. But I'm still not sure it's
> related
> >>> to DRILL-2512.
> >>>
> >>> I'll wait for your additional info before speculating further.
> >>>
> >>> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <dr...@gmail.com>
> >>> wrote:
> >>>
> >>> > We actually setup a separate load balancer for port 8047 (we're
> >>> submitting
> >>> > these queries via the REST API at the moment) so Zookeeper etc. is
> out
> >>> of
> >>> > the equation, thus I doubt we're hitting DRILL-2512.
> >>> >
> >>> > When shutitng down the "troublesome" drillbit, it starts
> parallelizing
> >>> much
> >>> > nicer again.  We even added 10+ nodes to the cluster and as long as
> >>> that
> >>> > particular drillbit is shut down, it distributes very nicely.  The
> >>> minute
> >>> > we start the drillbit on that node again, it starts swamping it with
> >>> work.
> >>> >
> >>> > I'll shoot through the JSON profiles and some more information on the
> >>> > dataset etc. later today (Australian time!).
> >>> >
> >>> > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <
> >>> sphillips@maprtech.com>
> >>> > wrote:
> >>> >
> >>> > > I didn't notice at first that Adam said "no matter who the foreman
> >>> is".
> >>> > >
> >>> > > Another suspicion I have is that our current logic for assigning
> work
> >>> > will
> >>> > > assign to the exact same nodes every time we query a particular
> >>> table.
> >>> > > Changing affinity factor may change it, but it will still be the
> same
> >>> > every
> >>> > > time. That is my suspicion, but I am not sure why shutting down the
> >>> > > drillbit would improve performance. I would expect that shutting
> >>> down the
> >>> > > drillbit would result in a different drillbit becoming the hotspot.
> >>> > >
> >>> > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <
> jacques@apache.org
> >>> >
> >>> > > wrote:
> >>> > >
> >>> > > > On Steven's point, the node that the client connects to is not
> >>> > currently
> >>> > > > randomized.  Given your description of behavior, I'm not sure
> that
> >>> > you're
> >>> > > > hitting 2512 or just general undesirable distribution.
> >>> > > >
> >>> > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
> >>> > > sphillips@maprtech.com>
> >>> > > > wrote:
> >>> > > >
> >>> > > > > This is a known issue:
> >>> > > > >
> >>> > > > > https://issues.apache.org/jira/browse/DRILL-2512
> >>> > > > >
> >>> > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> >>> > > > > aengelbrecht@maprtech.com> wrote:
> >>> > > > >
> >>> > > > > > What version of Drill are you running?
> >>> > > > > >
> >>> > > > > > Any hints when looking at the query profiles? Is the node
> that
> >>> is
> >>> > > being
> >>> > > > > > hammered the foreman for the queries and most of the major
> >>> > fragments
> >>> > > > are
> >>> > > > > > tied to the foreman?
> >>> > > > > >
> >>> > > > > > —Andries
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <
> >>> dragoncurve@gmail.com>
> >>> > > > > wrote:
> >>> > > > > >
> >>> > > > > > > Hi guys,
> >>> > > > > > >
> >>> > > > > > > I'm trying to understand how this could be possible.  I
> have
> >>> a
> >>> > > Hadoop
> >>> > > > > > > cluster of a name node and two data nodes setup.  All have
> >>> > > identical
> >>> > > > > > specs
> >>> > > > > > > in terms of CPU/RAM etc.
> >>> > > > > > >
> >>> > > > > > > The two data nodes have a replicated HDFS setup where I'm
> >>> storing
> >>> > > > some
> >>> > > > > > > Parquet files.
> >>> > > > > > >
> >>> > > > > > > A Drill cluster (with Zookeeper) is running with Drillbits
> >>> on all
> >>> > > > three
> >>> > > > > > > servers.
> >>> > > > > > >
> >>> > > > > > > When I submit a query to *any* of the Drillbits, no matter
> >>> who
> >>> > the
> >>> > > > > > foreman
> >>> > > > > > > is, one particular data node gets picked to do the vast
> >>> majority
> >>> > of
> >>> > > > the
> >>> > > > > > > work.
> >>> > > > > > >
> >>> > > > > > > We've even added three more task nodes to the cluster and
> >>> > > everything
> >>> > > > > > still
> >>> > > > > > > puts a huge load on one particular server.
> >>> > > > > > >
> >>> > > > > > > There is nothing unique about this data node.  HDFS is
> fully
> >>> > > > replicated
> >>> > > > > > (no
> >>> > > > > > > unreplicated blocks) to the other data node.
> >>> > > > > > >
> >>> > > > > > > I know that Drill tries to get data locality, so I'm
> >>> wondering if
> >>> > > > this
> >>> > > > > is
> >>> > > > > > > the cause, but this essentially swamping this data node
> with
> >>> 100%
> >>> > > CPU
> >>> > > > > > usage
> >>> > > > > > > while leaving the others barely doing any work.
> >>> > > > > > >
> >>> > > > > > > As soon as we shut down the Drillbit on this data node,
> query
> >>> > > > > performance
> >>> > > > > > > increases significantly.
> >>> > > > > > >
> >>> > > > > > > Any thoughts on how I can troubleshoot why Drill is picking
> >>> that
> >>> > > > > > particular
> >>> > > > > > > node?
> >>> > > > > >
> >>> > > > > >
> >>> > > > >
> >>> > > > >
> >>> > > > > --
> >>> > > > >  Steven Phillips
> >>> > > > >  Software Engineer
> >>> > > > >
> >>> > > > >  mapr.com
> >>> > > > >
> >>> > > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > --
> >>> > >  Steven Phillips
> >>> > >  Software Engineer
> >>> > >
> >>> > >  mapr.com
> >>> > >
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>>  Steven Phillips
> >>>  Software Engineer
> >>>
> >>>  mapr.com
> >>>
> >>
> >>
> >
>

Re: Drill favouring a particular Drillbit

Posted by Adam Gilmore <dr...@gmail.com>.
Looks like this definitely is the following bug:

https://issues.apache.org/jira/browse/DRILL-2512

It's a pretty severe performance bottleneck having the foreman do so
much work.  In our environment, the foreman hits basically 95-100% CPU
while the other drillbits barely do any work.  That makes it nearly
impossible for us to scale out.

On Wed, Apr 8, 2015 at 3:58 PM, Adam Gilmore <dr...@gmail.com> wrote:

> Anyone have any more thoughts on this?  Anywhere I can start trying to
> troubleshoot?
>
> On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <dr...@gmail.com>
> wrote:
>
>> So there are 5 Parquet files, each ~125mb - not sure what I can provide
>> re the block locations?  I believe it's under the HDFS block size so they
>> should be stored contiguously.
>>
>> I've tried setting the affinity factor to various values (1, 0, etc.) but
>> nothing seems to change that.  It always prefers certain nodes.
>>
>> Moreover, we added a stack more nodes and it started picking very
>> specific nodes as foremen (perhaps 2-3 nodes out of 20 were always picked
>> as foremen).  Therefore, the foremen were being swamped with CPU while the
>> other nodes were doing very little work.
>>
>> On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <sphillips@maprtech.com
>> > wrote:
>>
>>> Actually, I believe a query submitted through REST interface will
>>> instantiate a DrillClient, which uses the same ZKClusterCoordinator that
>>> sqlline uses, and thus the foreman for the query is not necessarily on
>>> the
>>> same drillbit as it was submitted to. But I'm still not sure it's related
>>> to DRILL-2512.
>>>
>>> I'll wait for your additional info before speculating further.
>>>
>>> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <dr...@gmail.com>
>>> wrote:
>>>
>>> > We actually setup a separate load balancer for port 8047 (we're
>>> submitting
>>> > these queries via the REST API at the moment) so Zookeeper etc. is out
>>> of
>>> > the equation, thus I doubt we're hitting DRILL-2512.
>>> >
>>> > When shutitng down the "troublesome" drillbit, it starts parallelizing
>>> much
>>> > nicer again.  We even added 10+ nodes to the cluster and as long as
>>> that
>>> > particular drillbit is shut down, it distributes very nicely.  The
>>> minute
>>> > we start the drillbit on that node again, it starts swamping it with
>>> work.
>>> >
>>> > I'll shoot through the JSON profiles and some more information on the
>>> > dataset etc. later today (Australian time!).
>>> >
>>> > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <
>>> sphillips@maprtech.com>
>>> > wrote:
>>> >
>>> > > I didn't notice at first that Adam said "no matter who the foreman
>>> is".
>>> > >
>>> > > Another suspicion I have is that our current logic for assigning work
>>> > will
>>> > > assign to the exact same nodes every time we query a particular
>>> table.
>>> > > Changing affinity factor may change it, but it will still be the same
>>> > every
>>> > > time. That is my suspicion, but I am not sure why shutting down the
>>> > > drillbit would improve performance. I would expect that shutting
>>> down the
>>> > > drillbit would result in a different drillbit becoming the hotspot.
>>> > >
>>> > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <jacques@apache.org
>>> >
>>> > > wrote:
>>> > >
>>> > > > On Steven's point, the node that the client connects to is not
>>> > currently
>>> > > > randomized.  Given your description of behavior, I'm not sure that
>>> > you're
>>> > > > hitting 2512 or just general undesirable distribution.
>>> > > >
>>> > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
>>> > > sphillips@maprtech.com>
>>> > > > wrote:
>>> > > >
>>> > > > > This is a known issue:
>>> > > > >
>>> > > > > https://issues.apache.org/jira/browse/DRILL-2512
>>> > > > >
>>> > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
>>> > > > > aengelbrecht@maprtech.com> wrote:
>>> > > > >
>>> > > > > > What version of Drill are you running?
>>> > > > > >
>>> > > > > > Any hints when looking at the query profiles? Is the node that
>>> is
>>> > > being
>>> > > > > > hammered the foreman for the queries and most of the major
>>> > fragments
>>> > > > are
>>> > > > > > tied to the foreman?
>>> > > > > >
>>> > > > > > —Andries
>>> > > > > >
>>> > > > > >
>>> > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <
>>> dragoncurve@gmail.com>
>>> > > > > wrote:
>>> > > > > >
>>> > > > > > > Hi guys,
>>> > > > > > >
>>> > > > > > > I'm trying to understand how this could be possible.  I have
>>> a
>>> > > Hadoop
>>> > > > > > > cluster of a name node and two data nodes setup.  All have
>>> > > identical
>>> > > > > > specs
>>> > > > > > > in terms of CPU/RAM etc.
>>> > > > > > >
>>> > > > > > > The two data nodes have a replicated HDFS setup where I'm
>>> storing
>>> > > > some
>>> > > > > > > Parquet files.
>>> > > > > > >
>>> > > > > > > A Drill cluster (with Zookeeper) is running with Drillbits
>>> on all
>>> > > > three
>>> > > > > > > servers.
>>> > > > > > >
>>> > > > > > > When I submit a query to *any* of the Drillbits, no matter
>>> who
>>> > the
>>> > > > > > foreman
>>> > > > > > > is, one particular data node gets picked to do the vast
>>> majority
>>> > of
>>> > > > the
>>> > > > > > > work.
>>> > > > > > >
>>> > > > > > > We've even added three more task nodes to the cluster and
>>> > > everything
>>> > > > > > still
>>> > > > > > > puts a huge load on one particular server.
>>> > > > > > >
>>> > > > > > > There is nothing unique about this data node.  HDFS is fully
>>> > > > replicated
>>> > > > > > (no
>>> > > > > > > unreplicated blocks) to the other data node.
>>> > > > > > >
>>> > > > > > > I know that Drill tries to get data locality, so I'm
>>> wondering if
>>> > > > this
>>> > > > > is
>>> > > > > > > the cause, but this essentially swamping this data node with
>>> 100%
>>> > > CPU
>>> > > > > > usage
>>> > > > > > > while leaving the others barely doing any work.
>>> > > > > > >
>>> > > > > > > As soon as we shut down the Drillbit on this data node, query
>>> > > > > performance
>>> > > > > > > increases significantly.
>>> > > > > > >
>>> > > > > > > Any thoughts on how I can troubleshoot why Drill is picking
>>> that
>>> > > > > > particular
>>> > > > > > > node?
>>> > > > > >
>>> > > > > >
>>> > > > >
>>> > > > >
>>> > > > > --
>>> > > > >  Steven Phillips
>>> > > > >  Software Engineer
>>> > > > >
>>> > > > >  mapr.com
>>> > > > >
>>> > > >
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > >  Steven Phillips
>>> > >  Software Engineer
>>> > >
>>> > >  mapr.com
>>> > >
>>> >
>>>
>>>
>>>
>>> --
>>>  Steven Phillips
>>>  Software Engineer
>>>
>>>  mapr.com
>>>
>>
>>
>

Re: Drill favouring a particular Drillbit

Posted by Adam Gilmore <dr...@gmail.com>.
Anyone have any more thoughts on this?  Anywhere I can start trying to
troubleshoot?

On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore <dr...@gmail.com> wrote:

> So there are 5 Parquet files, each ~125mb - not sure what I can provide re
> the block locations?  I believe it's under the HDFS block size so they
> should be stored contiguously.
>
> I've tried setting the affinity factor to various values (1, 0, etc.) but
> nothing seems to change that.  It always prefers certain nodes.
>
> Moreover, we added a stack more nodes and it started picking very specific
> nodes as foremen (perhaps 2-3 nodes out of 20 were always picked as
> foremen).  Therefore, the foremen were being swamped with CPU while the
> other nodes were doing very little work.
>
> On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <sp...@maprtech.com>
> wrote:
>
>> Actually, I believe a query submitted through REST interface will
>> instantiate a DrillClient, which uses the same ZKClusterCoordinator that
>> sqlline uses, and thus the foreman for the query is not necessarily on the
>> same drillbit as it was submitted to. But I'm still not sure it's related
>> to DRILL-2512.
>>
>> I'll wait for your additional info before speculating further.
>>
>> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <dr...@gmail.com>
>> wrote:
>>
>> > We actually setup a separate load balancer for port 8047 (we're
>> submitting
>> > these queries via the REST API at the moment) so Zookeeper etc. is out
>> of
>> > the equation, thus I doubt we're hitting DRILL-2512.
>> >
>> > When shutitng down the "troublesome" drillbit, it starts parallelizing
>> much
>> > nicer again.  We even added 10+ nodes to the cluster and as long as that
>> > particular drillbit is shut down, it distributes very nicely.  The
>> minute
>> > we start the drillbit on that node again, it starts swamping it with
>> work.
>> >
>> > I'll shoot through the JSON profiles and some more information on the
>> > dataset etc. later today (Australian time!).
>> >
>> > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <
>> sphillips@maprtech.com>
>> > wrote:
>> >
>> > > I didn't notice at first that Adam said "no matter who the foreman
>> is".
>> > >
>> > > Another suspicion I have is that our current logic for assigning work
>> > will
>> > > assign to the exact same nodes every time we query a particular table.
>> > > Changing affinity factor may change it, but it will still be the same
>> > every
>> > > time. That is my suspicion, but I am not sure why shutting down the
>> > > drillbit would improve performance. I would expect that shutting down
>> the
>> > > drillbit would result in a different drillbit becoming the hotspot.
>> > >
>> > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <ja...@apache.org>
>> > > wrote:
>> > >
>> > > > On Steven's point, the node that the client connects to is not
>> > currently
>> > > > randomized.  Given your description of behavior, I'm not sure that
>> > you're
>> > > > hitting 2512 or just general undesirable distribution.
>> > > >
>> > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
>> > > sphillips@maprtech.com>
>> > > > wrote:
>> > > >
>> > > > > This is a known issue:
>> > > > >
>> > > > > https://issues.apache.org/jira/browse/DRILL-2512
>> > > > >
>> > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
>> > > > > aengelbrecht@maprtech.com> wrote:
>> > > > >
>> > > > > > What version of Drill are you running?
>> > > > > >
>> > > > > > Any hints when looking at the query profiles? Is the node that
>> is
>> > > being
>> > > > > > hammered the foreman for the queries and most of the major
>> > fragments
>> > > > are
>> > > > > > tied to the foreman?
>> > > > > >
>> > > > > > —Andries
>> > > > > >
>> > > > > >
>> > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <
>> dragoncurve@gmail.com>
>> > > > > wrote:
>> > > > > >
>> > > > > > > Hi guys,
>> > > > > > >
>> > > > > > > I'm trying to understand how this could be possible.  I have a
>> > > Hadoop
>> > > > > > > cluster of a name node and two data nodes setup.  All have
>> > > identical
>> > > > > > specs
>> > > > > > > in terms of CPU/RAM etc.
>> > > > > > >
>> > > > > > > The two data nodes have a replicated HDFS setup where I'm
>> storing
>> > > > some
>> > > > > > > Parquet files.
>> > > > > > >
>> > > > > > > A Drill cluster (with Zookeeper) is running with Drillbits on
>> all
>> > > > three
>> > > > > > > servers.
>> > > > > > >
>> > > > > > > When I submit a query to *any* of the Drillbits, no matter who
>> > the
>> > > > > > foreman
>> > > > > > > is, one particular data node gets picked to do the vast
>> majority
>> > of
>> > > > the
>> > > > > > > work.
>> > > > > > >
>> > > > > > > We've even added three more task nodes to the cluster and
>> > > everything
>> > > > > > still
>> > > > > > > puts a huge load on one particular server.
>> > > > > > >
>> > > > > > > There is nothing unique about this data node.  HDFS is fully
>> > > > replicated
>> > > > > > (no
>> > > > > > > unreplicated blocks) to the other data node.
>> > > > > > >
>> > > > > > > I know that Drill tries to get data locality, so I'm
>> wondering if
>> > > > this
>> > > > > is
>> > > > > > > the cause, but this essentially swamping this data node with
>> 100%
>> > > CPU
>> > > > > > usage
>> > > > > > > while leaving the others barely doing any work.
>> > > > > > >
>> > > > > > > As soon as we shut down the Drillbit on this data node, query
>> > > > > performance
>> > > > > > > increases significantly.
>> > > > > > >
>> > > > > > > Any thoughts on how I can troubleshoot why Drill is picking
>> that
>> > > > > > particular
>> > > > > > > node?
>> > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > >  Steven Phillips
>> > > > >  Software Engineer
>> > > > >
>> > > > >  mapr.com
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > >  Steven Phillips
>> > >  Software Engineer
>> > >
>> > >  mapr.com
>> > >
>> >
>>
>>
>>
>> --
>>  Steven Phillips
>>  Software Engineer
>>
>>  mapr.com
>>
>
>

Re: Drill favouring a particular Drillbit

Posted by Adam Gilmore <dr...@gmail.com>.
So there are 5 Parquet files, each ~125 MB - not sure what I can provide re
the block locations?  I believe each is under the HDFS block size, so they
should be stored contiguously.

I've tried setting the affinity factor to various values (1, 0, etc.) but
nothing seems to change that.  It always prefers certain nodes.

Moreover, we added a stack more nodes and it started picking very specific
nodes as foremen (perhaps 2-3 nodes out of 20 were always picked as
foremen).  Therefore, the foremen were being swamped with CPU while the
other nodes were doing very little work.

On Thu, Mar 26, 2015 at 12:12 PM, Steven Phillips <sp...@maprtech.com>
wrote:

> Actually, I believe a query submitted through REST interface will
> instantiate a DrillClient, which uses the same ZKClusterCoordinator that
> sqlline uses, and thus the foreman for the query is not necessarily on the
> same drillbit as it was submitted to. But I'm still not sure it's related
> to DRILL-2512.
>
> I'll wait for your additional info before speculating further.
>
> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <dr...@gmail.com>
> wrote:
>
> > We actually setup a separate load balancer for port 8047 (we're
> submitting
> > these queries via the REST API at the moment) so Zookeeper etc. is out of
> > the equation, thus I doubt we're hitting DRILL-2512.
> >
> > When shutitng down the "troublesome" drillbit, it starts parallelizing
> much
> > nicer again.  We even added 10+ nodes to the cluster and as long as that
> > particular drillbit is shut down, it distributes very nicely.  The minute
> > we start the drillbit on that node again, it starts swamping it with
> work.
> >
> > I'll shoot through the JSON profiles and some more information on the
> > dataset etc. later today (Australian time!).
> >
> > On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <sphillips@maprtech.com
> >
> > wrote:
> >
> > > I didn't notice at first that Adam said "no matter who the foreman is".
> > >
> > > Another suspicion I have is that our current logic for assigning work
> > will
> > > assign to the exact same nodes every time we query a particular table.
> > > Changing affinity factor may change it, but it will still be the same
> > every
> > > time. That is my suspicion, but I am not sure why shutting down the
> > > drillbit would improve performance. I would expect that shutting down
> the
> > > drillbit would result in a different drillbit becoming the hotspot.
> > >
> > > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <ja...@apache.org>
> > > wrote:
> > >
> > > > On Steven's point, the node that the client connects to is not
> > currently
> > > > randomized.  Given your description of behavior, I'm not sure that
> > you're
> > > > hitting 2512 or just general undesirable distribution.
> > > >
> > > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
> > > sphillips@maprtech.com>
> > > > wrote:
> > > >
> > > > > This is a known issue:
> > > > >
> > > > > https://issues.apache.org/jira/browse/DRILL-2512
> > > > >
> > > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> > > > > aengelbrecht@maprtech.com> wrote:
> > > > >
> > > > > > What version of Drill are you running?
> > > > > >
> > > > > > Any hints when looking at the query profiles? Is the node that is
> > > being
> > > > > > hammered the foreman for the queries and most of the major
> > fragments
> > > > are
> > > > > > tied to the foreman?
> > > > > >
> > > > > > —Andries
> > > > > >
> > > > > >
> > > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <
> dragoncurve@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi guys,
> > > > > > >
> > > > > > > I'm trying to understand how this could be possible.  I have a
> > > Hadoop
> > > > > > > cluster of a name node and two data nodes setup.  All have
> > > identical
> > > > > > specs
> > > > > > > in terms of CPU/RAM etc.
> > > > > > >
> > > > > > > The two data nodes have a replicated HDFS setup where I'm
> storing
> > > > some
> > > > > > > Parquet files.
> > > > > > >
> > > > > > > A Drill cluster (with Zookeeper) is running with Drillbits on
> all
> > > > three
> > > > > > > servers.
> > > > > > >
> > > > > > > When I submit a query to *any* of the Drillbits, no matter who
> > the
> > > > > > foreman
> > > > > > > is, one particular data node gets picked to do the vast
> majority
> > of
> > > > the
> > > > > > > work.
> > > > > > >
> > > > > > > We've even added three more task nodes to the cluster and
> > > everything
> > > > > > still
> > > > > > > puts a huge load on one particular server.
> > > > > > >
> > > > > > > There is nothing unique about this data node.  HDFS is fully
> > > > replicated
> > > > > > (no
> > > > > > > unreplicated blocks) to the other data node.
> > > > > > >
> > > > > > > I know that Drill tries to get data locality, so I'm wondering
> if
> > > > this
> > > > > is
> > > > > > > the cause, but this essentially swamping this data node with
> 100%
> > > CPU
> > > > > > usage
> > > > > > > while leaving the others barely doing any work.
> > > > > > >
> > > > > > > As soon as we shut down the Drillbit on this data node, query
> > > > > performance
> > > > > > > increases significantly.
> > > > > > >
> > > > > > > Any thoughts on how I can troubleshoot why Drill is picking
> that
> > > > > > particular
> > > > > > > node?
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > >  Steven Phillips
> > > > >  Software Engineer
> > > > >
> > > > >  mapr.com
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > >  Steven Phillips
> > >  Software Engineer
> > >
> > >  mapr.com
> > >
> >
>
>
>
> --
>  Steven Phillips
>  Software Engineer
>
>  mapr.com
>

Re: Drill favouring a particular Drillbit

Posted by Steven Phillips <sp...@maprtech.com>.
Actually, I believe a query submitted through the REST interface will
instantiate a DrillClient, which uses the same ZKClusterCoordinator that
sqlline uses, and thus the foreman for the query is not necessarily the
same drillbit it was submitted to. But I'm still not sure it's related
to DRILL-2512.
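
A quick way to see that in practice is to POST the same query at two
different drillbits and then check in the web UI which node ended up as the
foreman for each (the host names, table path and /query.json endpoint below
are placeholders for a typical setup, so adjust as needed):

  for host in node1 node2; do
    # The drillbit we POST to is not necessarily the one that becomes foreman
    curl -s -X POST -H "Content-Type: application/json" \
      "http://${host}:8047/query.json" \
      -d '{"queryType": "SQL", "query": "SELECT COUNT(*) FROM dfs.`/data/parquet`"}'
    echo
  done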

I'll wait for your additional info before speculating further.

On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore <dr...@gmail.com> wrote:

> We actually setup a separate load balancer for port 8047 (we're submitting
> these queries via the REST API at the moment) so Zookeeper etc. is out of
> the equation, thus I doubt we're hitting DRILL-2512.
>
> When shutting down the "troublesome" drillbit, it starts parallelizing much
> nicer again.  We even added 10+ nodes to the cluster and as long as that
> particular drillbit is shut down, it distributes very nicely.  The minute
> we start the drillbit on that node again, it starts swamping it with work.
>
> I'll shoot through the JSON profiles and some more information on the
> dataset etc. later today (Australian time!).
>
> On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <sp...@maprtech.com>
> wrote:
>
> > I didn't notice at first that Adam said "no matter who the foreman is".
> >
> > Another suspicion I have is that our current logic for assigning work
> will
> > assign to the exact same nodes every time we query a particular table.
> > Changing affinity factor may change it, but it will still be the same
> every
> > time. That is my suspicion, but I am not sure why shutting down the
> > drillbit would improve performance. I would expect that shutting down the
> > drillbit would result in a different drillbit becoming the hotspot.
> >
> > On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <ja...@apache.org>
> > wrote:
> >
> > > On Steven's point, the node that the client connects to is not
> currently
> > > randomized.  Given your description of behavior, I'm not sure that
> you're
> > > hitting 2512 or just general undesirable distribution.
> > >
> > > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
> > sphillips@maprtech.com>
> > > wrote:
> > >
> > > > This is a known issue:
> > > >
> > > > https://issues.apache.org/jira/browse/DRILL-2512
> > > >
> > > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> > > > aengelbrecht@maprtech.com> wrote:
> > > >
> > > > > What version of Drill are you running?
> > > > >
> > > > > Any hints when looking at the query profiles? Is the node that is
> > being
> > > > > hammered the foreman for the queries and most of the major
> fragments
> > > are
> > > > > tied to the foreman?
> > > > >
> > > > > —Andries
> > > > >
> > > > >
> > > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <dr...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi guys,
> > > > > >
> > > > > > I'm trying to understand how this could be possible.  I have a
> > Hadoop
> > > > > > cluster of a name node and two data nodes setup.  All have
> > identical
> > > > > specs
> > > > > > in terms of CPU/RAM etc.
> > > > > >
> > > > > > The two data nodes have a replicated HDFS setup where I'm storing
> > > some
> > > > > > Parquet files.
> > > > > >
> > > > > > A Drill cluster (with Zookeeper) is running with Drillbits on all
> > > three
> > > > > > servers.
> > > > > >
> > > > > > When I submit a query to *any* of the Drillbits, no matter who
> the
> > > > > foreman
> > > > > > is, one particular data node gets picked to do the vast majority
> of
> > > the
> > > > > > work.
> > > > > >
> > > > > > We've even added three more task nodes to the cluster and
> > everything
> > > > > still
> > > > > > puts a huge load on one particular server.
> > > > > >
> > > > > > There is nothing unique about this data node.  HDFS is fully
> > > replicated
> > > > > (no
> > > > > > unreplicated blocks) to the other data node.
> > > > > >
> > > > > > I know that Drill tries to get data locality, so I'm wondering if
> > > this
> > > > is
> > > > > > the cause, but this essentially swamping this data node with 100%
> > CPU
> > > > > usage
> > > > > > while leaving the others barely doing any work.
> > > > > >
> > > > > > As soon as we shut down the Drillbit on this data node, query
> > > > performance
> > > > > > increases significantly.
> > > > > >
> > > > > > Any thoughts on how I can troubleshoot why Drill is picking that
> > > > > particular
> > > > > > node?
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >  Steven Phillips
> > > >  Software Engineer
> > > >
> > > >  mapr.com
> > > >
> > >
> >
> >
> >
> > --
> >  Steven Phillips
> >  Software Engineer
> >
> >  mapr.com
> >
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Drill favouring a particular Drillbit

Posted by Adam Gilmore <dr...@gmail.com>.
We actually set up a separate load balancer for port 8047 (we're submitting
these queries via the REST API at the moment), so ZooKeeper etc. is out of
the equation, and thus I doubt we're hitting DRILL-2512.

When shutting down the "troublesome" drillbit, it starts parallelizing much
more nicely again.  We even added 10+ nodes to the cluster and as long as
that particular drillbit is shut down, work distributes very nicely.  The
minute we start the drillbit on that node again, Drill starts swamping it
with work.

I'll shoot through the JSON profiles and some more information on the
dataset etc. later today (Australian time!).
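
In case it helps with pulling the same info on your side, the profiles can
be fetched straight off the web UI port as JSON (the host and query id below
are placeholders, and the .json endpoints are from memory, so adjust as
needed):

  # List recent queries, then fetch the full JSON profile for one of them
  curl -s http://drill-lb:8047/profiles.json
  QUERY_ID="<query id from the Profiles page>"   # placeholder
  curl -s "http://drill-lb:8047/profiles/${QUERY_ID}.json" > profile.json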

On Thu, Mar 26, 2015 at 5:31 AM, Steven Phillips <sp...@maprtech.com>
wrote:

> I didn't notice at first that Adam said "no matter who the foreman is".
>
> Another suspicion I have is that our current logic for assigning work will
> assign to the exact same nodes every time we query a particular table.
> Changing affinity factor may change it, but it will still be the same every
> time. That is my suspicion, but I am not sure why shutting down the
> drillbit would improve performance. I would expect that shutting down the
> drillbit would result in a different drillbit becoming the hotspot.
>
> On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <ja...@apache.org>
> wrote:
>
> > On Steven's point, the node that the client connects to is not currently
> > randomized.  Given your description of behavior, I'm not sure that you're
> > hitting 2512 or just general undesirable distribution.
> >
> > On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <
> sphillips@maprtech.com>
> > wrote:
> >
> > > This is a known issue:
> > >
> > > https://issues.apache.org/jira/browse/DRILL-2512
> > >
> > > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> > > aengelbrecht@maprtech.com> wrote:
> > >
> > > > What version of Drill are you running?
> > > >
> > > > Any hints when looking at the query profiles? Is the node that is
> being
> > > > hammered the foreman for the queries and most of the major fragments
> > are
> > > > tied to the foreman?
> > > >
> > > > —Andries
> > > >
> > > >
> > > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <dr...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi guys,
> > > > >
> > > > > I'm trying to understand how this could be possible.  I have a
> Hadoop
> > > > > cluster of a name node and two data nodes setup.  All have
> identical
> > > > specs
> > > > > in terms of CPU/RAM etc.
> > > > >
> > > > > The two data nodes have a replicated HDFS setup where I'm storing
> > some
> > > > > Parquet files.
> > > > >
> > > > > A Drill cluster (with Zookeeper) is running with Drillbits on all
> > three
> > > > > servers.
> > > > >
> > > > > When I submit a query to *any* of the Drillbits, no matter who the
> > > > foreman
> > > > > is, one particular data node gets picked to do the vast majority of
> > the
> > > > > work.
> > > > >
> > > > > We've even added three more task nodes to the cluster and
> everything
> > > > still
> > > > > puts a huge load on one particular server.
> > > > >
> > > > > There is nothing unique about this data node.  HDFS is fully
> > replicated
> > > > (no
> > > > > unreplicated blocks) to the other data node.
> > > > >
> > > > > I know that Drill tries to get data locality, so I'm wondering if
> > this
> > > is
> > > > > the cause, but this essentially swamping this data node with 100%
> CPU
> > > > usage
> > > > > while leaving the others barely doing any work.
> > > > >
> > > > > As soon as we shut down the Drillbit on this data node, query
> > > performance
> > > > > increases significantly.
> > > > >
> > > > > Any thoughts on how I can troubleshoot why Drill is picking that
> > > > particular
> > > > > node?
> > > >
> > > >
> > >
> > >
> > > --
> > >  Steven Phillips
> > >  Software Engineer
> > >
> > >  mapr.com
> > >
> >
>
>
>
> --
>  Steven Phillips
>  Software Engineer
>
>  mapr.com
>

Re: Drill favouring a particular Drillbit

Posted by Steven Phillips <sp...@maprtech.com>.
I didn't notice at first that Adam said "no matter who the foreman is".

Another suspicion I have is that our current logic for assigning work will
assign to the exact same nodes every time we query a particular table.
Changing the affinity factor may change it, but it will still be the same every
time. That is my suspicion, but I am not sure why shutting down the
drillbit would improve performance. I would expect that shutting down the
drillbit would result in a different drillbit becoming the hotspot.

On Wed, Mar 25, 2015 at 12:16 PM, Jacques Nadeau <ja...@apache.org> wrote:

> On Steven's point, the node that the client connects to is not currently
> randomized.  Given your description of behavior, I'm not sure that you're
> hitting 2512 or just general undesirable distribution.
>
> On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <sp...@maprtech.com>
> wrote:
>
> > This is a known issue:
> >
> > https://issues.apache.org/jira/browse/DRILL-2512
> >
> > On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> > aengelbrecht@maprtech.com> wrote:
> >
> > > What version of Drill are you running?
> > >
> > > Any hints when looking at the query profiles? Is the node that is being
> > > hammered the foreman for the queries and most of the major fragments
> are
> > > tied to the foreman?
> > >
> > > —Andries
> > >
> > >
> > > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <dr...@gmail.com>
> > wrote:
> > >
> > > > Hi guys,
> > > >
> > > > I'm trying to understand how this could be possible.  I have a Hadoop
> > > > cluster of a name node and two data nodes setup.  All have identical
> > > specs
> > > > in terms of CPU/RAM etc.
> > > >
> > > > The two data nodes have a replicated HDFS setup where I'm storing
> some
> > > > Parquet files.
> > > >
> > > > A Drill cluster (with Zookeeper) is running with Drillbits on all
> three
> > > > servers.
> > > >
> > > > When I submit a query to *any* of the Drillbits, no matter who the
> > > foreman
> > > > is, one particular data node gets picked to do the vast majority of
> the
> > > > work.
> > > >
> > > > We've even added three more task nodes to the cluster and everything
> > > still
> > > > puts a huge load on one particular server.
> > > >
> > > > There is nothing unique about this data node.  HDFS is fully
> replicated
> > > (no
> > > > unreplicated blocks) to the other data node.
> > > >
> > > > I know that Drill tries to get data locality, so I'm wondering if
> this
> > is
> > > > the cause, but this essentially swamping this data node with 100% CPU
> > > usage
> > > > while leaving the others barely doing any work.
> > > >
> > > > As soon as we shut down the Drillbit on this data node, query
> > performance
> > > > increases significantly.
> > > >
> > > > Any thoughts on how I can troubleshoot why Drill is picking that
> > > particular
> > > > node?
> > >
> > >
> >
> >
> > --
> >  Steven Phillips
> >  Software Engineer
> >
> >  mapr.com
> >
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Drill favouring a particular Drillbit

Posted by Jacques Nadeau <ja...@apache.org>.
On Steven's point, the node that the client connects to is not currently
randomized.  Given your description of the behavior, I'm not sure whether
you're hitting 2512 or just seeing generally undesirable distribution.
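
One way to spread the load by hand in the meantime is to point different
clients at specific drillbits instead of going through ZooKeeper (connection
string syntax from memory; the hosts and port below are placeholders for the
defaults):

  # Client A connects directly to node1, client B to node2
  bin/sqlline -u "jdbc:drill:drillbit=node1:31010"
  bin/sqlline -u "jdbc:drill:drillbit=node2:31010"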

On Wed, Mar 25, 2015 at 10:18 AM, Steven Phillips <sp...@maprtech.com>
wrote:

> This is a known issue:
>
> https://issues.apache.org/jira/browse/DRILL-2512
>
> On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
>
> > What version of Drill are you running?
> >
> > Any hints when looking at the query profiles? Is the node that is being
> > hammered the foreman for the queries and most of the major fragments are
> > tied to the foreman?
> >
> > —Andries
> >
> >
> > On Mar 25, 2015, at 12:00 AM, Adam Gilmore <dr...@gmail.com>
> wrote:
> >
> > > Hi guys,
> > >
> > > I'm trying to understand how this could be possible.  I have a Hadoop
> > > cluster of a name node and two data nodes setup.  All have identical
> > specs
> > > in terms of CPU/RAM etc.
> > >
> > > The two data nodes have a replicated HDFS setup where I'm storing some
> > > Parquet files.
> > >
> > > A Drill cluster (with Zookeeper) is running with Drillbits on all three
> > > servers.
> > >
> > > When I submit a query to *any* of the Drillbits, no matter who the
> > foreman
> > > is, one particular data node gets picked to do the vast majority of the
> > > work.
> > >
> > > We've even added three more task nodes to the cluster and everything
> > still
> > > puts a huge load on one particular server.
> > >
> > > There is nothing unique about this data node.  HDFS is fully replicated
> > (no
> > > unreplicated blocks) to the other data node.
> > >
> > > I know that Drill tries to get data locality, so I'm wondering if this
> is
> > > the cause, but this essentially swamping this data node with 100% CPU
> > usage
> > > while leaving the others barely doing any work.
> > >
> > > As soon as we shut down the Drillbit on this data node, query
> performance
> > > increases significantly.
> > >
> > > Any thoughts on how I can troubleshoot why Drill is picking that
> > particular
> > > node?
> >
> >
>
>
> --
>  Steven Phillips
>  Software Engineer
>
>  mapr.com
>

Re: Drill favouring a particular Drillbit

Posted by Steven Phillips <sp...@maprtech.com>.
This is a known issue:

https://issues.apache.org/jira/browse/DRILL-2512

On Wed, Mar 25, 2015 at 8:13 AM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> What version of Drill are you running?
>
> Any hints when looking at the query profiles? Is the node that is being
> hammered the foreman for the queries and most of the major fragments are
> tied to the foreman?
>
> —Andries
>
>
> On Mar 25, 2015, at 12:00 AM, Adam Gilmore <dr...@gmail.com> wrote:
>
> > Hi guys,
> >
> > I'm trying to understand how this could be possible.  I have a Hadoop
> > cluster of a name node and two data nodes setup.  All have identical
> specs
> > in terms of CPU/RAM etc.
> >
> > The two data nodes have a replicated HDFS setup where I'm storing some
> > Parquet files.
> >
> > A Drill cluster (with Zookeeper) is running with Drillbits on all three
> > servers.
> >
> > When I submit a query to *any* of the Drillbits, no matter who the
> foreman
> > is, one particular data node gets picked to do the vast majority of the
> > work.
> >
> > We've even added three more task nodes to the cluster and everything
> still
> > puts a huge load on one particular server.
> >
> > There is nothing unique about this data node.  HDFS is fully replicated
> (no
> > unreplicated blocks) to the other data node.
> >
> > I know that Drill tries to get data locality, so I'm wondering if this is
> > the cause, but this essentially swamping this data node with 100% CPU
> usage
> > while leaving the others barely doing any work.
> >
> > As soon as we shut down the Drillbit on this data node, query performance
> > increases significantly.
> >
> > Any thoughts on how I can troubleshoot why Drill is picking that
> particular
> > node?
>
>


-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Drill favouring a particular Drillbit

Posted by Andries Engelbrecht <ae...@maprtech.com>.
What version of Drill are you running?

Any hints when looking at the query profiles? Is the node that is being hammered the foreman for the queries, and are most of the major fragments tied to the foreman?
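
For a quick check, you can pull the JSON profile for one of the slow queries
off the web UI and count how many fragment endpoints land on each host (the
URL pattern and the "address" field name are from memory, so treat this as a
sketch):

  # foreman-host and QUERY_ID are placeholders taken from the Profiles page
  QUERY_ID="<query id from the Profiles page>"
  curl -s "http://foreman-host:8047/profiles/${QUERY_ID}.json" \
    | grep -o '"address"[^,}]*' | sort | uniq -c | sort -rn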

—Andries


On Mar 25, 2015, at 12:00 AM, Adam Gilmore <dr...@gmail.com> wrote:

> Hi guys,
> 
> I'm trying to understand how this could be possible.  I have a Hadoop
> cluster of a name node and two data nodes setup.  All have identical specs
> in terms of CPU/RAM etc.
> 
> The two data nodes have a replicated HDFS setup where I'm storing some
> Parquet files.
> 
> A Drill cluster (with Zookeeper) is running with Drillbits on all three
> servers.
> 
> When I submit a query to *any* of the Drillbits, no matter who the foreman
> is, one particular data node gets picked to do the vast majority of the
> work.
> 
> We've even added three more task nodes to the cluster and everything still
> puts a huge load on one particular server.
> 
> There is nothing unique about this data node.  HDFS is fully replicated (no
> unreplicated blocks) to the other data node.
> 
> I know that Drill tries to get data locality, so I'm wondering if this is
> the cause, but this essentially swamping this data node with 100% CPU usage
> while leaving the others barely doing any work.
> 
> As soon as we shut down the Drillbit on this data node, query performance
> increases significantly.
> 
> Any thoughts on how I can troubleshoot why Drill is picking that particular
> node?