Posted to user@drill.apache.org by scott <tc...@gmail.com> on 2018/08/21 20:48:04 UTC

query performance with unequal drillbits

Hi community,
I am trying to find a way to tune Drill so that weaker drillbits get less
data to work on so that the weak link doesn't drag my performance down. I
have drillbits running on a variety of hardware and sometimes these shared
resources get really slow. It seems that the query plan always evenly
divides the data fragments so that each drillbit gets the same data to chew
on. How do I make it give weaker drillbits less data?

Alternatively, is there a way to limit and queue fragments of the query and
leave them unassigned, then assign to drillbits as their resources become
free, similar to MapReduce?

Thanks for your time,
Scott

Re: query performance with unequal drillbits

Posted by Ted Dunning <te...@gmail.com>.
Paul,

Thanks for the reality side of this. Configuring a system to handle unusual
setups can definitely be a challenge.

Btw, the general term for running several sub-scale workers on each node to
allow more flexibility is "micro-sharding".




Re: query performance with unequal drillbits

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi All,

For those following along who have not tried Ted's idea (running multiple Drillbits per host), note that when running two or more Drillbits per node, the admin is responsible for choosing non-conflicting port numbers.

The port numbers are configured in drill-override.conf. See drill-override-example.conf for more info. By default, drill-override.conf is in $DRILL_HOME/conf, which would seem to imply that you must create a separate copy of the Drill distro for each Drillbit on each node. You'd then start Drill by pointing to the Drillbit-specific distro:

$DRILL_HOME1/bin/drillbit.sh start

For Drillbits 1, 2, 3...

An alternative is to use the site directory feature. You still need a separate site directory per Drillbit, but they can share the Drill distro.

$DRILL_HOME/bin/drillbit.sh start --site $DRILL_SITE1

For a common $DRILL_HOME but separate sites for 1, 2, 3...

Yet another approach is to pass the ports on the command line. The config system is supposed to allow this. I've not personally tested this, so caveat emptor:

$DRILL_HOME/bin/drillbit.sh start -Ddrill.exec.rpc.user.server.port=31110

You could wrap the above in a script so you can share both the Drill distro and config across Drillbits.
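
As a rough illustration of the site-directory approach, the shell sketch below generates one site directory per Drillbit, each with its own ports. The helper names (instance_port, write_site), the port stride of 10, the cluster-id, and the ZooKeeper address are assumptions made for this example; 31010, 31011, and 8047 are Drill's default user, control, and web ports. It has not been tested against a live cluster.

```shell
#!/bin/sh
# Sketch: one site directory per Drillbit on a host, each with distinct ports.
# Assumed, not from the thread: the port stride, the cluster-id, and the
# ZooKeeper address. Adjust all three to match your cluster.

PORT_STRIDE=10   # gap between instances (hypothetical choice)

# Print the port for instance N (1-based), given a base port.
instance_port() {
  base=$1
  n=$2
  echo $(( base + (n - 1) * PORT_STRIDE ))
}

# Write <root>/site<N>/drill-override.conf for instance N and print the directory.
write_site() {
  root=$1
  n=$2
  dir="$root/site$n"
  mkdir -p "$dir"
  cat > "$dir/drill-override.conf" <<EOF
drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "localhost:2181",
  rpc.user.server.port: $(instance_port 31010 "$n"),
  rpc.bit.server.port: $(instance_port 31011 "$n"),
  http.port: $(instance_port 8047 "$n")
}
EOF
  echo "$dir"
}
```

Calling write_site /opt/drill-sites 2, for example, would create /opt/drill-sites/site2 with user port 31020; each Drillbit is then started against its own directory using the --site option shown earlier in this message.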

Thanks,
- Paul

 


Re: query performance with unequal drillbits

Posted by John Omernik <jo...@omernik.com>.
I will +1 Ted's idea. By doing small drillbits, it does take a bit more
overhead, but you also have an ability to scale your Drill cluster size
(especially using the Drillbit shutdown features added recently).




Re: query performance with unequal drillbits

Posted by Ted Dunning <te...@gmail.com>.
Cool


Re: query performance with unequal drillbits

Posted by scott <tc...@gmail.com>.
Thanks Ted and Paul. I've been experimenting with the "hack" method. It
works somewhat, and I guess will have to do.


Re: query performance with unequal drillbits

Posted by Ted Dunning <te...@gmail.com>.
A cheap hack is to use multiple smaller drillbits. Put more drillbits on
the hefty machines and fewer on the weaker ones.

This increases overheads, but it might help you out.
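
One way to picture the hack: pick a single Drillbit size and run as many of that size as each host can hold. The sketch below is only an illustration; the function name drillbits_for_host and the 16 GB per-Drillbit budget are hypothetical, and real sizing would also have to account for CPU and for memory needed by other services.

```shell
#!/bin/sh
# Sketch: how many equal-sized Drillbits fit on a host, by memory alone.
# PER_BIT_GB is a made-up budget, not a recommendation.
PER_BIT_GB=16

# Print the Drillbit count for a host with the given memory in GB.
# Every host runs at least one Drillbit, even if it is below the budget.
drillbits_for_host() {
  host_gb=$1
  count=$(( host_gb / PER_BIT_GB ))
  if [ "$count" -lt 1 ]; then
    count=1
  fi
  echo "$count"
}
```

With these numbers, a 64 GB host would run four Drillbits and an 8 GB host would run one, so the planner's even split sends the larger host roughly four times the work.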




Re: query performance with unequal drillbits

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Scott,

Drillbit symmetry is built deep into Drill's distribution model: the planner assumes Drillbits are equal. Changing this assumption is possible (you cited MapReduce as a system that handles this case), but would require complex code changes:

* Distribute scan blocks based on locality, or include machine capability when attempting to balance reads (weaker machines get fewer reads, say)?
* When determining the number of minor fragments (execution tasks), base this on the total available slots? (With each machine having a number of slots determined by its configuration, say.) This is easier for simple operators (filter, project), but gets trickier for things like sorts and joins.
* Prefer more powerful machines for some operators such as sort? (Sort on machines with the most memory, or a combination of memory and CPU?)
* Exclude weak nodes from being Foreman? (Or, dedicate such nodes to ONLY being Foreman?)

As you can see, the scheduling algorithm for an asymmetric cluster would be very complex and very hard to get right. I suspect that is why Drill went with the much simpler assumption: symmetric nodes.

In fact, to support asymmetry well, Drill would likely need a different parallelizer design: one that, instead of treating the assignment of minor fragments to nodes as a simple slice-and-dice activity, looks at it more like YARN (or Kubernetes) does, as a process of assigning tasks to slots using some kind of best-fit or bin-packing algorithm. Obviously not a trivial change!
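
To make the slot idea concrete, here is a toy sketch of slot-weighted assignment. It is not Drill's planner and not how YARN or Kubernetes schedule; the function name assign_fragments and the host:slots input format are invented for the example. Each fragment goes to the host whose load relative to its slot count would be lowest after the assignment, with ties going to the host listed first.

```shell
#!/bin/sh
# Toy sketch: assign N minor fragments to hosts in proportion to their slots.
# Usage: assign_fragments <fragments> <host:slots> [<host:slots> ...]
# Assumes every slot count is a positive integer.
assign_fragments() {
  fragments=$1
  shift
  printf '%s\n' "$@" | awk -F: -v total="$fragments" '
    { host[NR] = $1; slots[NR] = $2; used[NR] = 0 }
    END {
      for (f = 0; f < total; f++) {
        # Greedy choice: host with the smallest (used + 1) / slots ratio.
        best = 1
        for (i = 2; i <= NR; i++)
          if ((used[i] + 1) / slots[i] < (used[best] + 1) / slots[best])
            best = i
        used[best]++
      }
      # One line per host: name, then number of fragments assigned.
      for (i = 1; i <= NR; i++) print host[i], used[i]
    }'
}
```

For instance, assign_fragments 12 big:8 small:4 gives the 8-slot host eight fragments and the 4-slot host four. A real scheduler would also have to weigh data locality and operator memory needs, which is where the complexity described above comes from.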

For now, the best advice would be to configure all Drillbits to use the same amount of memory and CPU. Use YARN to assign additional non-Drill tasks to larger nodes, while leaving Drill as the only task on weaker nodes.
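
For the memory half of that advice, a minimal drill-env.sh fragment might look like the following, deployed identically on every node. DRILL_HEAP and DRILL_MAX_DIRECT_MEMORY are the variables Drill's drill-env.sh uses for heap and direct memory; the sizes shown are placeholders, not recommendations.

```shell
# drill-env.sh sketch: give every Drillbit the same memory budget.
# The values below are placeholders; size them for your weakest node.
export DRILL_HEAP="4G"
export DRILL_MAX_DIRECT_MEMORY="8G"
```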

Thanks,
- Paul

 
