Posted to dev@drill.apache.org by weijie tong <to...@gmail.com> on 2017/08/20 12:39:01 UTC

Discuss about Drill's schedule policy

Hi all:

  Drill's current scheduling policy seems a little simple. The
SimpleParallelizer assigns endpoints in a round-robin fashion, which ignores
the system's load and other factors. In critical scenarios, some drillbits
suffer frequent full GCs, which block their control RPC. The current
assignment will not exclude these drillbits from the next queries'
assignments, so the problem gets worse.
  I propose adding a ZooKeeper path to hold bad drillbits. The Foreman would
recognize bad drillbits by a wait timeout (timeout of the control response
from intermediate fragments), then update the bad-drillbits path. Subsequent
queries would exclude these drillbits from the assignment list.
  What do you think, or do you have any suggestions? If this sounds OK, I
will file a JIRA and contribute.
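
To make the proposal concrete, here is a rough sketch of what such a
bad-drillbit registry could look like, assuming Apache Curator for the
ZooKeeper access. The path, class, and method names are illustrative only,
not existing Drill code: the Foreman would call markBad() when a control
response times out, and the parallelizer would drop any endpoint returned by
badDrillbits() before assignment.

import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;

/**
 * Hypothetical registry of "bad" drillbits kept under a dedicated ZK path.
 * The Foreman adds an entry when a control RPC to an intermediate fragment
 * times out; planning skips any endpoint listed here.
 */
public class BadDrillbitRegistry {
  private static final String BAD_BITS_PATH = "/drill/bad-drillbits"; // illustrative path

  private final CuratorFramework client;

  public BadDrillbitRegistry(CuratorFramework client) {
    this.client = client;
  }

  /** Record a drillbit as bad, e.g. after a control-response timeout. */
  public void markBad(String hostPort, String reason) throws Exception {
    String node = BAD_BITS_PATH + "/" + hostPort;
    if (client.checkExists().forPath(node) == null) {
      // Ephemeral: the entry disappears if the marking Foreman itself dies.
      client.create()
          .creatingParentsIfNeeded()
          .withMode(CreateMode.EPHEMERAL)
          .forPath(node, reason.getBytes(StandardCharsets.UTF_8));
    }
  }

  /** Endpoints to exclude from the next query's assignment. */
  public Set<String> badDrillbits() throws Exception {
    if (client.checkExists().forPath(BAD_BITS_PATH) == null) {
      return new HashSet<>();
    }
    return new HashSet<>(client.getChildren().forPath(BAD_BITS_PATH));
  }
}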

Re: Discuss about Drill's schedule policy

Posted by Padma Penumarthy <pp...@mapr.com>.
If the control RPC to a drillbit is down, i.e. if a drillbit is not responding,
ZooKeeper should detect that and notify the other drillbits to remove the dead
drillbit from their active list. Once that happens, the next query that comes
in should not even see that drillbit.
We need a way to differentiate drillbits based on their resource (i.e. CPU and memory)
availability/usage and consider that information for planning.
We do not have that capability. We treat all of them the same.

Thanks,
Padma
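
For reference, the notification described here amounts to a watch on the
ZooKeeper membership path. A minimal sketch with Curator's PathChildrenCache
follows; the path and the handler body are illustrative, not Drill's actual
ClusterCoordinator code.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.cache.PathChildrenCache;
import org.apache.curator.framework.recipes.cache.PathChildrenCacheEvent;

/** Hypothetical listener that reacts when a drillbit's ephemeral ZK node vanishes. */
public class DrillbitMembershipWatcher {
  private static final String MEMBERSHIP_PATH = "/drill/drillbits"; // illustrative path

  public PathChildrenCache watch(CuratorFramework client) throws Exception {
    PathChildrenCache cache = new PathChildrenCache(client, MEMBERSHIP_PATH, true);
    cache.getListenable().addListener((c, event) -> {
      if (event.getType() == PathChildrenCacheEvent.Type.CHILD_REMOVED) {
        // The drillbit's session expired or it shut down: drop it from the
        // active endpoint list so new queries are not planned onto it.
        String gone = event.getData().getPath();
        System.out.println("Drillbit left the cluster: " + gone);
      }
    });
    cache.start();
    return cache;
  }
}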


On Aug 20, 2017, at 5:39 AM, weijie tong <to...@gmail.com>> wrote:

HI all:

 Drill's current schedule policy seems a little simple. The
SimpleParallelizer assigns endpoints in round robin model which ignores the
system's load and other factors. To critical scenario, some drillbits are
suffering frequent full GCs which will let their control RPC blocked.
Current assignment will not exclude these drillbits from the next coming
queries's assignment. then the problem will get worse .
 I propose to add a zk path to hold bad drillbits. Forman will recognize
bad drillbits by waiting timeout (timeout of  control response from
intermediate fragments), then update the bad drillbits path. Next coming
queries will exclude these drillbits from the assignment list.
 How do you think about it or any suggests ? If sounds ok ,will file a
JIRA and give some contributes.


Re: Discuss about Drill's schedule policy

Posted by Paul Rogers <pr...@mapr.com>.
Hi Weijie,

Thanks much for the suggestions! It will take a while to digest all of this as Drill’s existing scheduler (for fragments) is quite complex, but it works. I’ll need to map those concepts to Sparrow. We still don’t have query-level scheduling, but perhaps there is something that can be done there; I’ll need to take a look.

Scheduling of fragments is tricky for a number of reasons.

1. Drill has no spilling between fragments. So, the “downstream” fragment must be primed and ready to go by the time that the “upstream” (scanner, say) fragment has data to send. 

2. In particular, the way dispatching works, the addressing from sender to receiver requires that the receiver exist at the time of the send. Drill has no mechanism to wait until the receiver (downstream fragment) is allocated and registered: it must be registered (but not necessarily started) before the sender sends. Specifically, the sender is started with knowledge of the host and identifier of the downstream fragment; major work would be needed to allocate this information after the upstream operator starts.

3. As noted before, Drill is designed for small, fast queries, so latency is a huge concern. Fragments must start as soon as possible; ready to receive data as soon as it is available. Thus, we can’t wait for a scheduler to allocate a fragment based on need.

We can certainly improve fragment scheduling. We start too many minor fragments in some cases. For very large queries, we start too many major fragments, causing CPU overload. Fixing this will be a major project.

All that said, we still need query-level scheduling to prevent overloading the fragment-level scheduler.

Anyway, you’ve given us lots to consider when we can turn our attention to this area.

Thanks,

- Paul

> On Aug 27, 2017, at 7:38 AM, weijie tong <to...@gmail.com> wrote:
> 
> Maybe we need to adjust the MajorFragments execution phase as the
> intermediate MajorFragments are lazy executed now(if the intermediate
> fragments tasks are lazy allocated or not allocated due to resource
> restrict, the down stream running works will couldn't send their data out).
> We should let MajorFragments execute from top to leaf ,then the
> corresponding execution tasks from top to down are all  sure to be
> allocated to do the pipeline works.
> 
> On Sun, 27 Aug 2017 at 7:46 PM weijie tong <to...@gmail.com> wrote:
> 
>> Hi Paul:
>> 
>>   I have read the codes of Sparrow and Spark-Sparrow last few days. It
>> seems Sparrow can match Drill's architecture very well. According to
>> sparrow's spark implementation, every MinorFragment can be treat as a spark
>> task ,a MajorFragment can be treat as a spark taskset.  We will start a
>> SparrowDaemon(process)  on every drillbit's machine. One SparrowDaemon has
>> two roles : a scheduler (which is a frontend to contact with other
>> NodeMonitor to request accepting  tasks  ),a NodeMonitor (which does the
>> real schedule work).  Foreman can utilize the SparrowFrontendClient to
>> submit the MajorFragments (communicate with Sparrow's  Scheduler) ,
>> WorkerBee( maybe other class we need to redefine) needs to implement the
>> BackendService.Iface to response Sparrow's task schedule (request by
>> NodeMonitor).
>> 
>>    Sparrow also support schedule tasks  to specific hosts (meaning
>> Drill's host affinity). It has FIFO , Priority, RoundRobin , NoQueue
>> schedule policys.
>> seems other schedulers will need to be added.
>> 
>>     Btw,  I think Yarn is not a good candidate. As a Yarn container is a
>> JVM process , so we need to start a Drillbit as a container executor, then
>> Yarn schedule the Drillbit not the Fragments which we  really care about.
>> 
>>    Here [1] are a complete scheduler introduction ,hope to help.
>> 
>> 
>> 
>> 
>> [1]
>> https://www.cl.cam.ac.uk/research/srg/netos/camsas/blog/2016-03-09-scheduler-architectures.html
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Thu, Aug 24, 2017 at 12:31 PM, Paul Rogers <pr...@mapr.com> wrote:
>> 
>>> Hi Weijie,
>>> 
>>> Thanks for the link. I’d seen this project a bit earlier, along with
>>> Apollo [1]. Sparrow is quite interesting, but is designed to place tasks
>>> (processes) on available nodes. This is not quite what Drill does: Drill
>>> launches multiple waves of “fragments” to all nodes across the cluster.
>>> 
>>> Other systems take the approach of just-in-time scheduling in which a
>>> fragment starts only when its inputs are available, and terminates (and
>>> releases its resources) after it has processed its last row. While this may
>>> be a very good technique for longer-running tasks (something like
>>> map/reduce or Hive), it introduces too much latency for short-running,
>>> interactive queries.
>>> 
>>> One could argue that Drill needs two levels of scheduling:
>>> 
>>> 1. Schedule queries as a whole.
>>> 2. Schedule tasks (“minor fragments”) within queries.
>>> 
>>> (There is, of course, a third level: scheduling the Drillbits themselves.
>>> Let’s leave that aside for now.)
>>> 
>>> The simplest place to start in Drill is to schedule entire queries, where
>>> each query gets a slice of cluster-wide resources (memory, CPU, etc.) Then,
>>> we can reuse Drill’s existing mechanism to schedule fragments on nodes.
>>> 
>>> The next level of refinement is to select the proper level of
>>> parallelization for a query: a balance between maximizing width, but not
>>> overwhelming the cluster with too many threads. For truly huge queries
>>> (dozens of nested subqueries), it might even make sense to introduce a way
>>> of sharing threads across fragments (something that Hanifi looked into a
>>> while back) or staging queries so that we don’t try to run all stages
>>> simultaneously. These are more advanced topics.
>>> 
>>> A good place to start would be a scheduler; with a model somewhat like
>>> YARNs, that selects queries to run when Drill resources are available; then
>>> to ensure that queries run within those resources.
>>> 
>>> Anyone know of such a schedule we could borrow to use with Drill? Or
>>> maybe we could adopt the core of Sparrow (or whatever) with the algorithm
>>> needed for Drill to avoid the need to invent yet another new scheduler.
>>> 
>>> Thanks,
>>> 
>>> - Paul
>>> 
>>> 
>>> [1]
>>> https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-boutin_0.pdf
>>> 
>>> On Aug 23, 2017, at 7:41 AM, weijie tong <tongweijie178@gmail.com<mailto:
>>> tongweijie178@gmail.com>> wrote:
>>> 
>>> @paul  have you noticed the Sparrow project (
>>> https://github.com/radlab/sparrow ) and related paper mentioned in the
>>> github .  Sparrow is a non-central ,low latency scheduler . This seems
>>> meet
>>> Drill's demand. I think we can first abstract a scheduler interface like
>>> what Spark does , then we can have different scheduler implementations
>>> (central or non-central ,maybe non-central like sparrow be the default one
>>> ).
>>> 
>>> On Mon, Aug 21, 2017 at 11:51 PM, weijie tong <tongweijie178@gmail.com
>>> <ma...@gmail.com>>
>>> wrote:
>>> 
>>> Thanks for all your suggestions.
>>> 
>>> @paul your analysis is impressive . I agree with  your opinion. Current
>>> queue solution can not solve this problem perfectly. Our system is
>>> suffering a  hard time once the cluster is in high load. I will think
>>> about
>>> this more deeply. welcome more ideas or suggestions to  be shared in this
>>> thread,maybe some little improvement .
>>> 
>>> 
>>> On Mon, 21 Aug 2017 at 4:06 AM Paul Rogers <progers@mapr.com<mailto:
>>> progers@mapr.com>> wrote:
>>> 
>>> Hi Weijie,
>>> 
>>> Great analysis. Let’s look at a few more data points.
>>> 
>>> Drill has no central scheduler (this is a feature: it makes the cluster
>>> much easier to manage and has no single point of failure. It was probably
>>> the easiest possible solution while Drill was being built.) Instead of
>>> central control, Drill is based on the assumption of symmetry: all
>>> Drillbits are identical. So, each Foreman, acting independently, should
>>> try
>>> to schedule its load in a way that evenly distributes work across nodes in
>>> the cluster. If all Drillbits do the same, then load should be balanced;
>>> there should be no “hot spots.”
>>> 
>>> Note, for this to work, Drill should either own the cluster, or any other
>>> workload on the cluster should also be evenly distributed.
>>> 
>>> Drill makes another simplification: that the cluster has infinite
>>> resources (or, equivalently, that the admin sized the cluster for peak
>>> load.) That is, as Sudheesh puts it, “Drill is optimistic” Therefore,
>>> Drill
>>> usually runs with no throttling mechanism to limit overall cluster load.
>>> In
>>> real clusters, of course, resources are limited and either a large query
>>> load, or a few large queries, can saturate some or all of the available
>>> resources.
>>> 
>>> Drill has a feature, seldom used, to throttle queries based purely on
>>> number. These ZK-based queues can allow, say, 5 queries to run (each of
>>> which is assumed to be evenly distributed.) In actual fact, the ZK-based
>>> queues recognize that typical workload have many small, and a few large,
>>> queries and so Drill offers the “small query” and “large query” queues.
>>> 
>>> OK, so that’s where we are today. I think I’m not stepping too far out of
>>> line to observe that the above model is just a bit naive. It does not take
>>> into consideration the available cores, memory or disk I/Os. It does not
>>> consider the fact that memory has a hard upper limit and must be managed.
>>> Drill creates one thread for each minor fragment limited by the number of
>>> cores. But, each query can contain dozens or more fragments, resulting in
>>> far, far more threads per query than a node has cores. That is, Drill’s
>>> current scheduling model does not consider that, above a certain level,
>>> adding more threads makes the system slower because of thrashing.
>>> 
>>> You propose a closed-loop, reactive control system (schedule load based
>>> on observed load on each Drillbit.) However, control-system theory tells
>>> us
>>> that such a system is subject to oscillation. All Foremen observe that a
>>> node X is loaded so none send it work. Node X later finishes its work and
>>> becomes underloaded. All Foremen now prefer node X and it swings back to
>>> being overloaded. In fact, Impala tried an open-loop design and there is
>>> some evidence in their documentation that they hit these very problems.
>>> 
>>> So, what else could we do? As we’ve wrestled with these issues, we’ve
>>> come to the understanding that we need an open-loop, predictive solution.
>>> That is a fancy name for what YARN or Mesos does: keep track of available
>>> resources, reserve them for a task, and monitor the task so that it stays
>>> within the resource allocation. Predict load via allocation rather than
>>> reacting to actual load.
>>> 
>>> In Drill, that might mean a scheduler which looks at all incoming queries
>>> and assigns cluster resources to each; queueing the query if necessary
>>> until resources become available. It also means that queries must live
>>> within their resource allocation. (The planner can help by predicting the
>>> likely needed resources. Then, at run time, spill-to-disk and other
>>> mechanisms allow queries to honor the resource limits.)
>>> 
>>> The scheduler-based design is nothing new: it seems to be what Impala
>>> settled on, it is what YARN does for batch jobs, and it is a common
>>> pattern
>>> in other query engines.
>>> 
>>> Back to the RPC issue. With proper scheduling, we limit load on each
>>> Drillbit so that RPC (and ZK heartbeats) can operate correctly. That is,
>>> rather than overloading a node, then attempting to recover, we wish
>>> instead
>>> to manage the load to prevent the overload in the first place.
>>> 
>>> A coming pull request will take a first, small, step: it will allocate
>>> memory to queries based on the limit set by the ZK-based queues. The next
>>> step is to figure out how to limit the number of threads per query. (As
>>> noted above, a single large query can overwhelm the cluster if, say, it
>>> tries to do 100 subqueries with many sorts, joins, etc.) We welcome
>>> suggestions and pointers to how others have solved the problem.
>>> 
>>> We also keep tossing around the idea of introducing that central
>>> scheduler. But, that is quite a bit of work and we've heard that users seem
>>> to have struggles with maintaining the YARN and Impala schedulers, so
>>> we’re
>>> somewhat hesitant to move away from a purely symmetrical configuration.
>>> Suggestions in this area are very welcome.
>>> 
>>> For now, try turning on the ZK queues to limit concurrent queries and
>>> prevent overload. Ensure your cluster is sized for your workload. Ensure
>>> other work on the cluster is also symmetrical and does not compete with
>>> Drill for resources.
>>> 
>>> And, please continue to share your experiences!
>>> 
>>> Thanks,
>>> 
>>> - Paul
>>> 
>>> On Aug 20, 2017, at 5:39 AM, weijie tong <tongweijie178@gmail.com<mailto:
>>> tongweijie178@gmail.com>>
>>> wrote:
>>> 
>>> HI all:
>>> 
>>> Drill's current schedule policy seems a little simple. The
>>> SimpleParallelizer assigns endpoints in round robin model which ignores
>>> the
>>> system's load and other factors. To critical scenario, some drillbits
>>> are
>>> suffering frequent full GCs which will let their control RPC blocked.
>>> Current assignment will not exclude these drillbits from the next coming
>>> queries's assignment. then the problem will get worse .
>>> I propose to add a zk path to hold bad drillbits. Forman will recognize
>>> bad drillbits by waiting timeout (timeout of  control response from
>>> intermediate fragments), then update the bad drillbits path. Next coming
>>> queries will exclude these drillbits from the assignment list.
>>> How do you think about it or any suggests ? If sounds ok ,will file a
>>> JIRA and give some contributes.
>>> 
>>> 
>>> 
>>> 
>> 


Re: Discuss about Drill's schedule policy

Posted by weijie tong <to...@gmail.com>.
Maybe we need to adjust the MajorFragment execution phases, since the
intermediate MajorFragments are lazily executed now (if the intermediate
fragments' tasks are lazily allocated, or not allocated at all due to
resource constraints, the running downstream work will not be able to send
its data out). We should let MajorFragments execute from the top down to the
leaves; then the corresponding execution tasks from top to bottom are all
guaranteed to be allocated to do the pipelined work.
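
A minimal sketch of the ordering proposed above, assuming a hypothetical
MajorFragment tree whose root is the most downstream stage; none of these
types are existing Drill classes.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Illustrative only: launch major fragments root-first (breadth-first),
 *  so every receiver is registered before any upstream sender starts. */
public class RootFirstLauncher {

  interface MajorFragment {
    List<MajorFragment> children();   // upstream (closer-to-leaf) fragments
    void launchMinorFragments();      // allocate and register its minor fragments
  }

  public void launch(MajorFragment root) {
    Deque<MajorFragment> queue = new ArrayDeque<>();
    queue.add(root);
    while (!queue.isEmpty()) {
      MajorFragment f = queue.poll();
      f.launchMinorFragments();       // downstream side is ready to receive ...
      queue.addAll(f.children());     // ... before its senders are started
    }
  }
}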

On Sun, 27 Aug 2017 at 7:46 PM weijie tong <to...@gmail.com> wrote:

> Hi Paul:
>
>    I have read the codes of Sparrow and Spark-Sparrow last few days. It
> seems Sparrow can match Drill's architecture very well. According to
> sparrow's spark implementation, every MinorFragment can be treat as a spark
> task ,a MajorFragment can be treat as a spark taskset.  We will start a
> SparrowDaemon(process)  on every drillbit's machine. One SparrowDaemon has
> two roles : a scheduler (which is a frontend to contact with other
> NodeMonitor to request accepting  tasks  ),a NodeMonitor (which does the
> real schedule work).  Foreman can utilize the SparrowFrontendClient to
> submit the MajorFragments (communicate with Sparrow's  Scheduler) ,
> WorkerBee( maybe other class we need to redefine) needs to implement the
> BackendService.Iface to response Sparrow's task schedule (request by
> NodeMonitor).
>
>     Sparrow also support schedule tasks  to specific hosts (meaning
> Drill's host affinity). It has FIFO , Priority, RoundRobin , NoQueue
> schedule policys.
> seems other schedulers will need to be added.
>
>      Btw,  I think Yarn is not a good candidate. As a Yarn container is a
> JVM process , so we need to start a Drillbit as a container executor, then
> Yarn schedule the Drillbit not the Fragments which we  really care about.
>
>     Here [1] are a complete scheduler introduction ,hope to help.
>
>
>
>
> [1]
> https://www.cl.cam.ac.uk/research/srg/netos/camsas/blog/2016-03-09-scheduler-architectures.html
>
>
>
>
>
>
>
> On Thu, Aug 24, 2017 at 12:31 PM, Paul Rogers <pr...@mapr.com> wrote:
>
>> Hi Weijie,
>>
>> Thanks for the link. I’d seen this project a bit earlier, along with
>> Apollo [1]. Sparrow is quite interesting, but is designed to place tasks
>> (processes) on available nodes. This is not quite what Drill does: Drill
>> launches multiple waves of “fragments” to all nodes across the cluster.
>>
>> Other systems take the approach of just-in-time scheduling in which a
>> fragment starts only when its inputs are available, and terminates (and
>> releases its resources) after it has processed its last row. While this may
>> be a very good technique for longer-running tasks (something like
>> map/reduce or Hive), it introduces too much latency for short-running,
>> interactive queries.
>>
>> One could argue that Drill needs two levels of scheduling:
>>
>> 1. Schedule queries as a whole.
>> 2. Schedule tasks (“minor fragments”) within queries.
>>
>> (There is, of course, a third level: scheduling the Drillbits themselves.
>> Let’s leave that aside for now.)
>>
>> The simplest place to start in Drill is to schedule entire queries, where
>> each query gets a slice of cluster-wide resources (memory, CPU, etc.) Then,
>> we can reuse Drill’s existing mechanism to schedule fragments on nodes.
>>
>> The next level of refinement is to select the proper level of
>> parallelization for a query: a balance between maximizing width, but not
>> overwhelming the cluster with too many threads. For truly huge queries
>> (dozens of nested subqueries), it might even make sense to introduce a way
>> of sharing threads across fragments (something that Hanifi looked into a
>> while back) or staging queries so that we don’t try to run all stages
>> simultaneously. These are more advanced topics.
>>
>> A good place to start would be a scheduler; with a model somewhat like
>> YARNs, that selects queries to run when Drill resources are available; then
>> to ensure that queries run within those resources.
>>
>> Anyone know of such a schedule we could borrow to use with Drill? Or
>> maybe we could adopt the core of Sparrow (or whatever) with the algorithm
>> needed for Drill to avoid the need to invent yet another new scheduler.
>>
>> Thanks,
>>
>> - Paul
>>
>>
>> [1]
>> https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-boutin_0.pdf
>>
>> On Aug 23, 2017, at 7:41 AM, weijie tong <tongweijie178@gmail.com<mailto:
>> tongweijie178@gmail.com>> wrote:
>>
>> @paul  have you noticed the Sparrow project (
>> https://github.com/radlab/sparrow ) and related paper mentioned in the
>> github .  Sparrow is a non-central ,low latency scheduler . This seems
>> meet
>> Drill's demand. I think we can first abstract a scheduler interface like
>> what Spark does , then we can have different scheduler implementations
>> (central or non-central ,maybe non-central like sparrow be the default one
>> ).
>>
>> On Mon, Aug 21, 2017 at 11:51 PM, weijie tong <tongweijie178@gmail.com
>> <ma...@gmail.com>>
>> wrote:
>>
>> Thanks for all your suggestions.
>>
>> @paul your analysis is impressive . I agree with  your opinion. Current
>> queue solution can not solve this problem perfectly. Our system is
>> suffering a  hard time once the cluster is in high load. I will think
>> about
>> this more deeply. welcome more ideas or suggestions to  be shared in this
>> thread,maybe some little improvement .
>>
>>
>> On Mon, 21 Aug 2017 at 4:06 AM Paul Rogers <progers@mapr.com<mailto:
>> progers@mapr.com>> wrote:
>>
>> Hi Weijie,
>>
>> Great analysis. Let’s look at a few more data points.
>>
>> Drill has no central scheduler (this is a feature: it makes the cluster
>> much easier to manage and has no single point of failure. It was probably
>> the easiest possible solution while Drill was being built.) Instead of
>> central control, Drill is based on the assumption of symmetry: all
>> Drillbits are identical. So, each Foreman, acting independently, should
>> try
>> to schedule its load in a way that evenly distributes work across nodes in
>> the cluster. If all Drillbits do the same, then load should be balanced;
>> there should be no “hot spots.”
>>
>> Note, for this to work, Drill should either own the cluster, or any other
>> workload on the cluster should also be evenly distributed.
>>
>> Drill makes another simplification: that the cluster has infinite
>> resources (or, equivalently, that the admin sized the cluster for peak
>> load.) That is, as Sudheesh puts it, “Drill is optimistic” Therefore,
>> Drill
>> usually runs with no throttling mechanism to limit overall cluster load.
>> In
>> real clusters, of course, resources are limited and either a large query
>> load, or a few large queries, can saturate some or all of the available
>> resources.
>>
>> Drill has a feature, seldom used, to throttle queries based purely on
>> number. These ZK-based queues can allow, say, 5 queries to run (each of
>> which is assumed to be evenly distributed.) In actual fact, the ZK-based
>> queues recognize that typical workload have many small, and a few large,
>> queries and so Drill offers the “small query” and “large query” queues.
>>
>> OK, so that’s where we are today. I think I’m not stepping too far out of
>> line to observe that the above model is just a bit naive. It does not take
>> into consideration the available cores, memory or disk I/Os. It does not
>> consider the fact that memory has a hard upper limit and must be managed.
>> Drill creates one thread for each minor fragment limited by the number of
>> cores. But, each query can contain dozens or more fragments, resulting in
>> far, far more threads per query than a node has cores. That is, Drill’s
>> current scheduling model does not consider that, above a certain level,
>> adding more threads makes the system slower because of thrashing.
>>
>> You propose a closed-loop, reactive control system (schedule load based
>> on observed load on each Drillbit.) However, control-system theory tells
>> us
>> that such a system is subject to oscillation. All Foremen observe that a
>> node X is loaded so none send it work. Node X later finishes its work and
>> becomes underloaded. All Foremen now prefer node X and it swings back to
>> being overloaded. In fact, Impala tried an open-loop design and there is
>> some evidence in their documentation that they hit these very problems.
>>
>> So, what else could we do? As we’ve wrestled with these issues, we’ve
>> come to the understanding that we need an open-loop, predictive solution.
>> That is a fancy name for what YARN or Mesos does: keep track of available
>> resources, reserve them for a task, and monitor the task so that it stays
>> within the resource allocation. Predict load via allocation rather than
>> reacting to actual load.
>>
>> In Drill, that might mean a scheduler which looks at all incoming queries
>> and assigns cluster resources to each; queueing the query if necessary
>> until resources become available. It also means that queries must live
>> within their resource allocation. (The planner can help by predicting the
>> likely needed resources. Then, at run time, spill-to-disk and other
>> mechanisms allow queries to honor the resource limits.)
>>
>> The scheduler-based design is nothing new: it seems to be what Impala
>> settled on, it is what YARN does for batch jobs, and it is a common
>> pattern
>> in other query engines.
>>
>> Back to the RPC issue. With proper scheduling, we limit load on each
>> Drillbit so that RPC (and ZK heartbeats) can operate correctly. That is,
>> rather than overloading a node, then attempting to recover, we wish
>> instead
>> to manage the load to prevent the overload in the first place.
>>
>> A coming pull request will take a first, small, step: it will allocate
>> memory to queries based on the limit set by the ZK-based queues. The next
>> step is to figure out how to limit the number of threads per query. (As
>> noted above, a single large query can overwhelm the cluster if, say, it
>> tries to do 100 subqueries with many sorts, joins, etc.) We welcome
>> suggestions and pointers to how others have solved the problem.
>>
>> We also keep tossing around the idea of introducing that central
>> scheduler. But, that is quite a bit of work and we've heard that users seem
>> to have struggles with maintaining the YARN and Impala schedulers, so
>> we’re
>> somewhat hesitant to move away from a purely symmetrical configuration.
>> Suggestions in this area are very welcome.
>>
>> For now, try turning on the ZK queues to limit concurrent queries and
>> prevent overload. Ensure your cluster is sized for your workload. Ensure
>> other work on the cluster is also symmetrical and does not compete with
>> Drill for resources.
>>
>> And, please continue to share your experiences!
>>
>> Thanks,
>>
>> - Paul
>>
>> On Aug 20, 2017, at 5:39 AM, weijie tong <tongweijie178@gmail.com<mailto:
>> tongweijie178@gmail.com>>
>> wrote:
>>
>> HI all:
>>
>> Drill's current schedule policy seems a little simple. The
>> SimpleParallelizer assigns endpoints in round robin model which ignores
>> the
>> system's load and other factors. To critical scenario, some drillbits
>> are
>> suffering frequent full GCs which will let their control RPC blocked.
>> Current assignment will not exclude these drillbits from the next coming
>> queries's assignment. then the problem will get worse .
>> I propose to add a zk path to hold bad drillbits. Forman will recognize
>> bad drillbits by waiting timeout (timeout of  control response from
>> intermediate fragments), then update the bad drillbits path. Next coming
>> queries will exclude these drillbits from the assignment list.
>> How do you think about it or any suggests ? If sounds ok ,will file a
>> JIRA and give some contributes.
>>
>>
>>
>>
>

Re: Discuss about Drill's schedule policy

Posted by weijie tong <to...@gmail.com>.
Hi Paul:

   I have read the code of Sparrow and Spark-Sparrow over the last few days.
It seems Sparrow could match Drill's architecture very well. Following
Sparrow's Spark integration, every MinorFragment can be treated as a Spark
task and a MajorFragment as a Spark taskset. We would start a SparrowDaemon
(process) on every drillbit's machine. A SparrowDaemon has two roles: a
Scheduler (a frontend that contacts the NodeMonitors to request that they
accept tasks) and a NodeMonitor (which does the real scheduling work). The
Foreman could use the SparrowFrontendClient to submit the MajorFragments
(communicating with Sparrow's Scheduler), and the WorkerBee (or maybe
another class we need to define) would implement BackendService.Iface to
respond to Sparrow's task-scheduling requests (issued by the NodeMonitor).

    Sparrow also supports scheduling tasks to specific hosts (matching
Drill's host affinity). It has FIFO, Priority, RoundRobin, and NoQueue
scheduling policies; it seems other schedulers would need to be added.

     Btw, I think YARN is not a good candidate. A YARN container is a JVM
process, so we would need to start a Drillbit as the container executor;
then YARN schedules the Drillbit, not the Fragments we really care about.

    Here [1] is a comprehensive introduction to scheduler architectures;
hope it helps.




[1]
https://www.cl.cam.ac.uk/research/srg/netos/camsas/blog/2016-03-09-scheduler-architectures.html
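
As a rough illustration of the division of roles described above, the sketch
below shows the shape of a Drill-side adapter. All interface and class names
are stand-ins (for Sparrow's Thrift BackendService.Iface and for whatever
Drill class would launch the fragments); they are not Sparrow's or Drill's
real APIs.

import java.util.List;

/**
 * Illustrative shapes only. The idea: the Foreman hands a whole MajorFragment
 * (a "taskset") to the scheduler frontend; each node runs a backend that the
 * node-local scheduler calls to launch individual MinorFragments ("tasks").
 */
public class SparrowStyleIntegration {

  /** Stand-in for a Sparrow-like frontend client used by the Foreman. */
  interface SchedulerFrontend {
    void submitTaskSet(String queryId, List<byte[]> serializedMinorFragments);
  }

  /** Stand-in for the per-node backend the scheduler calls into. */
  interface SchedulerBackend {
    void launchTask(String queryId, int taskId, byte[] serializedMinorFragment);
  }

  /** Drill side of the backend: deserialize the plan and start the fragment. */
  static class DrillbitBackend implements SchedulerBackend {
    @Override
    public void launchTask(String queryId, int taskId, byte[] fragment) {
      // In a real integration this would hand the fragment to the execution
      // engine (the role the email assigns to WorkerBee or a similar class).
      System.out.printf("query %s: starting minor fragment %d (%d bytes)%n",
          queryId, taskId, fragment.length);
    }
  }
}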







On Thu, Aug 24, 2017 at 12:31 PM, Paul Rogers <pr...@mapr.com> wrote:

> Hi Weijie,
>
> Thanks for the link. I’d seen this project a bit earlier, along with
> Apollo [1]. Sparrow is quite interesting, but is designed to place tasks
> (processes) on available nodes. This is not quite what Drill does: Drill
> launches multiple waves of “fragments” to all nodes across the cluster.
>
> Other systems take the approach of just-in-time scheduling in which a
> fragment starts only when its inputs are available, and terminates (and
> releases its resources) after it has processed its last row. While this may
> be a very good technique for longer-running tasks (something like
> map/reduce or Hive), it introduces too much latency for short-running,
> interactive queries.
>
> One could argue that Drill needs two levels of scheduling:
>
> 1. Schedule queries as a whole.
> 2. Schedule tasks (“minor fragments”) within queries.
>
> (There is, of course, a third level: scheduling the Drillbits themselves.
> Let’s leave that aside for now.)
>
> The simplest place to start in Drill is to schedule entire queries, where
> each query gets a slice of cluster-wide resources (memory, CPU, etc.) Then,
> we can reuse Drill’s existing mechanism to schedule fragments on nodes.
>
> The next level of refinement is to select the proper level of
> parallelization for a query: a balance between maximizing width, but not
> overwhelming the cluster with too many threads. For truly huge queries
> (dozens of nested subqueries), it might even make sense to introduce a way
> of sharing threads across fragments (something that Hanifi looked into a
> while back) or staging queries so that we don’t try to run all stages
> simultaneously. These are more advanced topics.
>
> A good place to start would be a scheduler; with a model somewhat like
> YARNs, that selects queries to run when Drill resources are available; then
> to ensure that queries run within those resources.
>
> Anyone know of such a schedule we could borrow to use with Drill? Or maybe
> we could adopt the core of Sparrow (or whatever) with the algorithm needed
> for Drill to avoid the need to invent yet another new scheduler.
>
> Thanks,
>
> - Paul
>
>
> [1] https://www.usenix.org/system/files/conference/osdi14/
> osdi14-paper-boutin_0.pdf
>
> On Aug 23, 2017, at 7:41 AM, weijie tong <tongweijie178@gmail.com<mailto:
> tongweijie178@gmail.com>> wrote:
>
> @paul  have you noticed the Sparrow project (
> https://github.com/radlab/sparrow ) and related paper mentioned in the
> github .  Sparrow is a non-central ,low latency scheduler . This seems meet
> Drill's demand. I think we can first abstract a scheduler interface like
> what Spark does , then we can have different scheduler implementations
> (central or non-central ,maybe non-central like sparrow be the default one
> ).
>
> On Mon, Aug 21, 2017 at 11:51 PM, weijie tong <tongweijie178@gmail.com<
> mailto:tongweijie178@gmail.com>>
> wrote:
>
> Thanks for all your suggestions.
>
> @paul your analysis is impressive . I agree with  your opinion. Current
> queue solution can not solve this problem perfectly. Our system is
> suffering a  hard time once the cluster is in high load. I will think about
> this more deeply. welcome more ideas or suggestions to  be shared in this
> thread,maybe some little improvement .
>
>
> On Mon, 21 Aug 2017 at 4:06 AM Paul Rogers <progers@mapr.com<mailto:proge
> rs@mapr.com>> wrote:
>
> Hi Weijie,
>
> Great analysis. Let’s look at a few more data points.
>
> Drill has no central scheduler (this is a feature: it makes the cluster
> much easier to manage and has no single point of failure. It was probably
> the easiest possible solution while Drill was being built.) Instead of
> central control, Drill is based on the assumption of symmetry: all
> Drillbits are identical. So, each Foreman, acting independently, should try
> to schedule its load in a way that evenly distributes work across nodes in
> the cluster. If all Drillbits do the same, then load should be balanced;
> there should be no “hot spots.”
>
> Note, for this to work, Drill should either own the cluster, or any other
> workload on the cluster should also be evenly distributed.
>
> Drill makes another simplification: that the cluster has infinite
> resources (or, equivalently, that the admin sized the cluster for peak
> load.) That is, as Sudheesh puts it, “Drill is optimistic” Therefore, Drill
> usually runs with no throttling mechanism to limit overall cluster load. In
> real clusters, of course, resources are limited and either a large query
> load, or a few large queries, can saturate some or all of the available
> resources.
>
> Drill has a feature, seldom used, to throttle queries based purely on
> number. These ZK-based queues can allow, say, 5 queries to run (each of
> which is assumed to be evenly distributed.) In actual fact, the ZK-based
> queues recognize that typical workload have many small, and a few large,
> queries and so Drill offers the “small query” and “large query” queues.
>
> OK, so that’s where we are today. I think I’m not stepping too far out of
> line to observe that the above model is just a bit naive. It does not take
> into consideration the available cores, memory or disk I/Os. It does not
> consider the fact that memory has a hard upper limit and must be managed.
> Drill creates one thread for each minor fragment limited by the number of
> cores. But, each query can contain dozens or more fragments, resulting in
> far, far more threads per query than a node has cores. That is, Drill’s
> current scheduling model does not consider that, above a certain level,
> adding more threads makes the system slower because of thrashing.
>
> You propose a closed-loop, reactive control system (schedule load based
> on observed load on each Drillbit.) However, control-system theory tells us
> that such a system is subject to oscillation. All Foremen observe that a
> node X is loaded so none send it work. Node X later finishes its work and
> becomes underloaded. All Foremen now prefer node X and it swings back to
> being overloaded. In fact, Impala tried an open-loop design and there is
> some evidence in their documentation that they hit these very problems.
>
> So, what else could we do? As we’ve wrestled with these issues, we’ve
> come to the understanding that we need an open-loop, predictive solution.
> That is a fancy name for what YARN or Mesos does: keep track of available
> resources, reserve them for a task, and monitor the task so that it stays
> within the resource allocation. Predict load via allocation rather than
> reacting to actual load.
>
> In Drill, that might mean a scheduler which looks at all incoming queries
> and assigns cluster resources to each; queueing the query if necessary
> until resources become available. It also means that queries must live
> within their resource allocation. (The planner can help by predicting the
> likely needed resources. Then, at run time, spill-to-disk and other
> mechanisms allow queries to honor the resource limits.)
>
> The scheduler-based design is nothing new: it seems to be what Impala
> settled on, it is what YARN does for batch jobs, and it is a common pattern
> in other query engines.
>
> Back to the RPC issue. With proper scheduling, we limit load on each
> Drillbit so that RPC (and ZK heartbeats) can operate correctly. That is,
> rather than overloading a node, then attempting to recover, we wish instead
> to manage the load to prevent the overload in the first place.
>
> A coming pull request will take a first, small, step: it will allocate
> memory to queries based on the limit set by the ZK-based queues. The next
> step is to figure out how to limit the number of threads per query. (As
> noted above, a single large query can overwhelm the cluster if, say, it
> tries to do 100 subqueries with many sorts, joins, etc.) We welcome
> suggestions and pointers to how others have solved the problem.
>
> We also keep tossing around the idea of introducing that central
> scheduler. But, that is quite a bit of work and we've heard that users seem
> to have struggles with maintaining the YARN and Impala schedulers, so we’re
> somewhat hesitant to move away from a purely symmetrical configuration.
> Suggestions in this area are very welcome.
>
> For now, try turning on the ZK queues to limit concurrent queries and
> prevent overload. Ensure your cluster is sized for your workload. Ensure
> other work on the cluster is also symmetrical and does not compete with
> Drill for resources.
>
> And, please continue to share your experiences!
>
> Thanks,
>
> - Paul
>
> On Aug 20, 2017, at 5:39 AM, weijie tong <tongweijie178@gmail.com<mailto:
> tongweijie178@gmail.com>>
> wrote:
>
> HI all:
>
> Drill's current schedule policy seems a little simple. The
> SimpleParallelizer assigns endpoints in round robin model which ignores
> the
> system's load and other factors. To critical scenario, some drillbits
> are
> suffering frequent full GCs which will let their control RPC blocked.
> Current assignment will not exclude these drillbits from the next coming
> queries's assignment. then the problem will get worse .
> I propose to add a zk path to hold bad drillbits. Forman will recognize
> bad drillbits by waiting timeout (timeout of  control response from
> intermediate fragments), then update the bad drillbits path. Next coming
> queries will exclude these drillbits from the assignment list.
> How do you think about it or any suggests ? If sounds ok ,will file a
> JIRA and give some contributes.
>
>
>
>

Re: Discuss about Drill's schedule policy

Posted by Paul Rogers <pr...@mapr.com>.
Hi Weijie,

Thanks for the link. I’d seen this project a bit earlier, along with Apollo [1]. Sparrow is quite interesting, but is designed to place tasks (processes) on available nodes. This is not quite what Drill does: Drill launches multiple waves of “fragments” to all nodes across the cluster.

Other systems take the approach of just-in-time scheduling in which a fragment starts only when its inputs are available, and terminates (and releases its resources) after it has processed its last row. While this may be a very good technique for longer-running tasks (something like map/reduce or Hive), it introduces too much latency for short-running, interactive queries.

One could argue that Drill needs two levels of scheduling:

1. Schedule queries as a whole.
2. Schedule tasks (“minor fragments”) within queries.

(There is, of course, a third level: scheduling the Drillbits themselves. Let’s leave that aside for now.)

The simplest place to start in Drill is to schedule entire queries, where each query gets a slice of cluster-wide resources (memory, CPU, etc.) Then, we can reuse Drill’s existing mechanism to schedule fragments on nodes.

The next level of refinement is to select the proper level of parallelization for a query: a balance between maximizing width, but not overwhelming the cluster with too many threads. For truly huge queries (dozens of nested subqueries), it might even make sense to introduce a way of sharing threads across fragments (something that Hanifi looked into a while back) or staging queries so that we don’t try to run all stages simultaneously. These are more advanced topics.

A good place to start would be a scheduler, with a model somewhat like YARN's, that selects queries to run when Drill resources are available, and then ensures that queries run within those resources.

Anyone know of such a scheduler we could borrow to use with Drill? Or maybe we could adopt the core of Sparrow (or whatever) with the algorithm needed for Drill, to avoid the need to invent yet another new scheduler.

Thanks,

- Paul


[1] https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-boutin_0.pdf
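
As a thought experiment, the query-level scheduling described here could
start as plain admission control: a fixed number of query slots, each
carrying an equal share of a cluster-wide memory budget, with queries waiting
until a slot frees up. The sketch below is hypothetical and is not Drill's
ZK-queue implementation.

import java.util.concurrent.Semaphore;

/** Hypothetical query-level admission control: N concurrent queries,
 *  each granted an equal share of a cluster-wide memory budget. */
public class QueryAdmissionController {
  private final Semaphore slots;
  private final long memoryPerQueryBytes;

  public QueryAdmissionController(int maxConcurrentQueries, long clusterMemoryBytes) {
    this.slots = new Semaphore(maxConcurrentQueries, true);   // FIFO fairness
    this.memoryPerQueryBytes = clusterMemoryBytes / maxConcurrentQueries;
  }

  /** Block until a slot is free, then return the memory budget for this query. */
  public long admit() throws InterruptedException {
    slots.acquire();
    return memoryPerQueryBytes;
  }

  /** Called when the query completes or fails. */
  public void release() {
    slots.release();
  }
}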

On Aug 23, 2017, at 7:41 AM, weijie tong <to...@gmail.com>> wrote:

@paul  have you noticed the Sparrow project (
https://github.com/radlab/sparrow ) and related paper mentioned in the
github .  Sparrow is a non-central ,low latency scheduler . This seems meet
Drill's demand. I think we can first abstract a scheduler interface like
what Spark does , then we can have different scheduler implementations
(central or non-central ,maybe non-central like sparrow be the default one
).

On Mon, Aug 21, 2017 at 11:51 PM, weijie tong <to...@gmail.com>>
wrote:

Thanks for all your suggestions.

@paul your analysis is impressive . I agree with  your opinion. Current
queue solution can not solve this problem perfectly. Our system is
suffering a  hard time once the cluster is in high load. I will think about
this more deeply. welcome more ideas or suggestions to  be shared in this
thread,maybe some little improvement .


On Mon, 21 Aug 2017 at 4:06 AM Paul Rogers <pr...@mapr.com>> wrote:

Hi Weijie,

Great analysis. Let’s look at a few more data points.

Drill has no central scheduler (this is a feature: it makes the cluster
much easier to manage and has no single point of failure. It was probably
the easiest possible solution while Drill was being built.) Instead of
central control, Drill is based on the assumption of symmetry: all
Drillbits are identical. So, each Foreman, acting independently, should try
to schedule its load in a way that evenly distributes work across nodes in
the cluster. If all Drillbits do the same, then load should be balanced;
there should be no “hot spots.”

Note, for this to work, Drill should either own the cluster, or any other
workload on the cluster should also be evenly distributed.

Drill makes another simplification: that the cluster has infinite
resources (or, equivalently, that the admin sized the cluster for peak
load.) That is, as Sudheesh puts it, “Drill is optimistic” Therefore, Drill
usually runs with no throttling mechanism to limit overall cluster load. In
real clusters, of course, resources are limited and either a large query
load, or a few large queries, can saturate some or all of the available
resources.

Drill has a feature, seldom used, to throttle queries based purely on
number. These ZK-based queues can allow, say, 5 queries to run (each of
which is assumed to be evenly distributed.) In actual fact, the ZK-based
queues recognize that typical workload have many small, and a few large,
queries and so Drill offers the “small query” and “large query” queues.

OK, so that’s where we are today. I think I’m not stepping too far out of
line to observe that the above model is just a bit naive. It does not take
into consideration the available cores, memory or disk I/Os. It does not
consider the fact that memory has a hard upper limit and must be managed.
Drill creates one thread for each minor fragment limited by the number of
cores. But, each query can contain dozens or more fragments, resulting in
far, far more threads per query than a node has cores. That is, Drill’s
current scheduling model does not consider that, above a certain level,
adding more threads makes the system slower because of thrashing.

You propose a closed-loop, reactive control system (schedule load based
on observed load on each Drillbit.) However, control-system theory tells us
that such a system is subject to oscillation. All Foremen observe that a
node X is loaded so none send it work. Node X later finishes its work and
becomes underloaded. All Foremen now prefer node X and it swings back to
being overloaded. In fact, Impala tried an open-loop design and there is
some evidence in their documentation that they hit these very problems.

So, what else could we do? As we’ve wrestled with these issues, we’ve
come to the understanding that we need an open-loop, predictive solution.
That is a fancy name for what YARN or Mesos does: keep track of available
resources, reserve them for a task, and monitor the task so that it stays
within the resource allocation. Predict load via allocation rather than
reacting to actual load.

In Drill, that might mean a scheduler which looks at all incoming queries
and assigns cluster resources to each; queueing the query if necessary
until resources become available. It also means that queries must live
within their resource allocation. (The planner can help by predicting the
likely needed resources. Then, at run time, spill-to-disk and other
mechanisms allow queries to honor the resource limits.)

The scheduler-based design is nothing new: it seems to be what Impala
settled on, it is what YARN does for batch jobs, and it is a common pattern
in other query engines.

Back to the RPC issue. With proper scheduling, we limit load on each
Drillbit so that RPC (and ZK heartbeats) can operate correctly. That is,
rather than overloading a node, then attempting to recover, we wish instead
to manage the load to prevent the overload in the first place.

A coming pull request will take a first, small, step: it will allocate
memory to queries based on the limit set by the ZK-based queues. The next
step is to figure out how to limit the number of threads per query. (As
noted above, a single large query can overwhelm the cluster if, say, it
tries to do 100 subqueries with many sorts, joins, etc.) We welcome
suggestions and pointers to how others have solved the problem.

We also keep tossing around the idea of introducing that central
scheduler. But, that is quite a bit of work and we've heard that users seem
to have struggles with maintaining the YARN and Impala schedulers, so we’re
somewhat hesitant to move away from a purely symmetrical configuration.
Suggestions in this area are very welcome.

For now, try turning on the ZK queues to limit concurrent queries and
prevent overload. Ensure your cluster is sized for your workload. Ensure
other work on the cluster is also symmetrical and does not compete with
Drill for resources.

And, please continue to share your experiences!

Thanks,

- Paul

On Aug 20, 2017, at 5:39 AM, weijie tong <to...@gmail.com>>
wrote:

HI all:

Drill's current schedule policy seems a little simple. The
SimpleParallelizer assigns endpoints in round robin model which ignores
the
system's load and other factors. To critical scenario, some drillbits
are
suffering frequent full GCs which will let their control RPC blocked.
Current assignment will not exclude these drillbits from the next coming
queries's assignment. then the problem will get worse .
I propose to add a zk path to hold bad drillbits. Forman will recognize
bad drillbits by waiting timeout (timeout of  control response from
intermediate fragments), then update the bad drillbits path. Next coming
queries will exclude these drillbits from the assignment list.
How do you think about it or any suggests ? If sounds ok ,will file a
JIRA and give some contributes.




Re: Discuss about Drill's schedule policy

Posted by weijie tong <to...@gmail.com>.
@paul have you noticed the Sparrow project (
https://github.com/radlab/sparrow ) and the related paper mentioned in the
GitHub repo? Sparrow is a decentralized, low-latency scheduler. This seems to
meet Drill's demand. I think we can first abstract a scheduler interface,
like what Spark does, then we can have different scheduler implementations
(centralized or decentralized, with a decentralized one like Sparrow perhaps
being the default).
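
A minimal sketch of the abstraction suggested here: a pluggable assignment
interface, with the current round-robin behavior as one implementation and a
load-aware or Sparrow-backed scheduler as another. The names are
illustrative, not existing Drill code.

import java.util.ArrayList;
import java.util.List;

/** Illustrative pluggable endpoint-assignment abstraction. */
public class SchedulerAbstraction {

  /** Maps the desired parallelism onto concrete endpoints. */
  interface FragmentScheduler {
    List<String> assign(List<String> activeEndpoints, int width);
  }

  /** Roughly what SimpleParallelizer does today: rotate through endpoints. */
  static class RoundRobinScheduler implements FragmentScheduler {
    @Override
    public List<String> assign(List<String> activeEndpoints, int width) {
      List<String> result = new ArrayList<>(width);
      for (int i = 0; i < width; i++) {
        result.add(activeEndpoints.get(i % activeEndpoints.size()));
      }
      return result;
    }
  }

  // A load-aware or Sparrow-backed implementation would provide a different
  // assign(), leaving the planner code unchanged.
}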

On Mon, Aug 21, 2017 at 11:51 PM, weijie tong <to...@gmail.com>
wrote:

> Thanks for all your suggestions.
>
>  @paul your analysis is impressive . I agree with  your opinion. Current
> queue solution can not solve this problem perfectly. Our system is
> suffering a  hard time once the cluster is in high load. I will think about
> this more deeply. welcome more ideas or suggestions to  be shared in this
> thread,maybe some little improvement .
>
>
> On Mon, 21 Aug 2017 at 4:06 AM Paul Rogers <pr...@mapr.com> wrote:
>
>> Hi Weijie,
>>
>> Great analysis. Let’s look at a few more data points.
>>
>> Drill has no central scheduler (this is a feature: it makes the cluster
>> much easier to manage and has no single point of failure. It was probably
>> the easiest possible solution while Drill was being built.) Instead of
>> central control, Drill is based on the assumption of symmetry: all
>> Drillbits are identical. So, each Foreman, acting independently, should try
>> to schedule its load in a way that evenly distributes work across nodes in
>> the cluster. If all Drillbits do the same, then load should be balanced;
>> there should be no “hot spots.”
>>
>> Note, for this to work, Drill should either own the cluster, or any other
>> workload on the cluster should also be evenly distributed.
>>
>> Drill makes another simplification: that the cluster has infinite
>> resources (or, equivalently, that the admin sized the cluster for peak
>> load.) That is, as Sudheesh puts it, “Drill is optimistic” Therefore, Drill
>> usually runs with no throttling mechanism to limit overall cluster load. In
>> real clusters, of course, resources are limited and either a large query
>> load, or a few large queries, can saturate some or all of the available
>> resources.
>>
>> Drill has a feature, seldom used, to throttle queries based purely on
>> number. These ZK-based queues can allow, say, 5 queries to run (each of
>> which is assumed to be evenly distributed.) In actual fact, the ZK-based
>> queues recognize that typical workload have many small, and a few large,
>> queries and so Drill offers the “small query” and “large query” queues.
>>
>> OK, so that’s where we are today. I think I’m not stepping too far out of
>> line to observe that the above model is just a bit naive. It does not take
>> into consideration the available cores, memory or disk I/Os. It does not
>> consider the fact that memory has a hard upper limit and must be managed.
>> Drill creates one thread for each minor fragment limited by the number of
>> cores. But, each query can contain dozens or more fragments, resulting in
>> far, far more threads per query than a node has cores. That is, Drill’s
>> current scheduling model does not consider that, above a certain level,
>> adding more threads makes the system slower because of thrashing.
>>
>> You propose a closed-loop, reactive control system (schedule load based
>> on observed load on each Drillbit.) However, control-system theory tells us
>> that such a system is subject to oscillation. All Foremen observe that a
>> node X is loaded so none send it work. Node X later finishes its work and
>> becomes underloaded. All Foremen now prefer node X and it swings back to
>> being overloaded. In fact, Impala tried an open-loop design and there is
>> some evidence in their documentation that they hit these very problems.
>>
>> So, what else could we do? As we’ve wrestled with these issues, we’ve
>> come to the understanding that we need an open-loop, predictive solution.
>> That is a fancy name for what YARN or Mesos does: keep track of available
>> resources, reserve them for a task, and monitor the task so that it stays
>> within the resource allocation. Predict load via allocation rather than
>> reacting to actual load.
>>
>> In Drill, that might mean a scheduler which looks at all incoming queries
>> and assigns cluster resources to each; queueing the query if necessary
>> until resources become available. It also means that queries must live
>> within their resource allocation. (The planner can help by predicting the
>> likely needed resources. Then, at run time, spill-to-disk and other
>> mechanisms allow queries to honor the resource limits.)
>>
>> The scheduler-based design is nothing new: it seems to be what Impala
>> settled on, it is what YARN does for batch jobs, and it is a common pattern
>> in other query engines.
>>
>> Back to the RPC issue. With proper scheduling, we limit load on each
>> Drillbit so that RPC (and ZK heartbeats) can operate correctly. That is,
>> rather than overloading a node, then attempting to recover, we wish instead
>> to manage the load to prevent the overload in the first place.
>>
>> A coming pull request will take a first, small, step: it will allocate
>> memory to queries based on the limit set by the ZK-based queues. The next
>> step is to figure out how to limit the number of threads per query. (As
>> noted above, a single large query can overwhelm the cluster if, say, it
>> tries to do 100 subqueries with many sorts, joins, etc.) We welcome
>> suggestions and pointers to how others have solved the problem.
>>
>> We also keep tossing around the idea of introducing that central
>> scheduler. But, that is quite a bit of work and we've heard that users seem
>> to have struggles with maintaining the YARN and Impala schedulers, so we’re
>> somewhat hesitant to move away from a purely symmetrical configuration.
>> Suggestions in this area are very welcome.
>>
>> For now, try turning on the ZK queues to limit concurrent queries and
>> prevent overload. Ensure your cluster is sized for your workload. Ensure
>> other work on the cluster is also symmetrical and does not compete with
>> Drill for resources.
>>
>> And, please continue to share your experiences!
>>
>> Thanks,
>>
>> - Paul
>>
>> > On Aug 20, 2017, at 5:39 AM, weijie tong <to...@gmail.com>
>> wrote:
>> >
>> > HI all:
>> >
>> >  Drill's current schedule policy seems a little simple. The
>> > SimpleParallelizer assigns endpoints in round robin model which ignores
>> the
>> > system's load and other factors. To critical scenario, some drillbits
>> are
>> > suffering frequent full GCs which will let their control RPC blocked.
>> > Current assignment will not exclude these drillbits from the next coming
>> > queries's assignment. then the problem will get worse .
>> >  I propose to add a zk path to hold bad drillbits. Forman will recognize
>> > bad drillbits by waiting timeout (timeout of  control response from
>> > intermediate fragments), then update the bad drillbits path. Next coming
>> > queries will exclude these drillbits from the assignment list.
>> >  How do you think about it or any suggests ? If sounds ok ,will file a
>> > JIRA and give some contributes.
>>
>>

Re: Discuss about Drill's schedule policy

Posted by weijie tong <to...@gmail.com>.
Thanks for all your suggestions.

 @paul your analysis is impressive. I agree with your opinion; the current
queue solution cannot solve this problem perfectly. Our system has a hard
time once the cluster is under high load. I will think about this more
deeply. More ideas or suggestions are welcome in this thread, maybe some
small improvements.


On Mon, 21 Aug 2017 at 4:06 AM Paul Rogers <pr...@mapr.com> wrote:

> Hi Weijie,
>
> Great analysis. Let’s look at a few more data points.
>
> Drill has no central scheduler (this is a feature: it makes the cluster
> much easier to manage and has no single point of failure. It was probably
> the easiest possible solution while Drill was being built.) Instead of
> central control, Drill is based on the assumption of symmetry: all
> Drillbits are identical. So, each Foreman, acting independently, should try
> to schedule its load in a way that evenly distributes work across nodes in
> the cluster. If all Drillbits do the same, then load should be balanced;
> there should be no “hot spots.”
>
> Note, for this to work, Drill should either own the cluster, or any other
> workload on the cluster should also be evenly distributed.
>
> Drill makes another simplification: that the cluster has infinite
> resources (or, equivalently, that the admin sized the cluster for peak
> load.) That is, as Sudheesh puts it, “Drill is optimistic.” Therefore, Drill
> usually runs with no throttling mechanism to limit overall cluster load. In
> real clusters, of course, resources are limited and either a large query
> load, or a few large queries, can saturate some or all of the available
> resources.
>
> Drill has a feature, seldom used, to throttle queries based purely on
> number. These ZK-based queues can allow, say, 5 queries to run (each of
> which is assumed to be evenly distributed.) In actual fact, the ZK-based
> queues recognize that typical workload have many small, and a few large,
> queries and so Drill offers the “small query” and “large query” queues.
>
> OK, so that’s where we are today. I think I’m not stepping too far out of
> line to observe that the above model is just a bit naive. It does not take
> into consideration the available cores, memory or disk I/Os. It does not
> consider the fact that memory has a hard upper limit and must be managed.
> Drill creates one thread for each minor fragment limited by the number of
> cores. But, each query can contain dozens or more fragments, resulting in
> far, far more threads per query than a node has cores. That is, Drill’s
> current scheduling model does not consider that, above a certain level,
> adding more threads makes the system slower because of thrashing.
>
> You propose a closed-loop, reactive control system (schedule load based on
> observed load on each Drillbit.) However, control-system theory tells us
> that such a system is subject to oscillation. All Foremen observe that a
> node X is loaded so none send it work. Node X later finishes its work and
> becomes underloaded. All Foremen now prefer node X and it swings back to
> being overloaded. In fact, Impala tried a closed-loop design and there is
> some evidence in their documentation that they hit these very problems.
>
> So, what else could we do? As we’ve wrestled with these issues, we’ve come
> to the understanding that we need an open-loop, predictive solution. That
> is a fancy name for what YARN or Mesos does: keep track of available
> resources, reserve them for a task, and monitor the task so that it stays
> within the resource allocation. Predict load via allocation rather than
> reacting to actual load.
>
> In Drill, that might mean a scheduler which looks at all incoming queries
> and assigns cluster resources to each; queueing the query if necessary
> until resources become available. It also means that queries must live
> within their resource allocation. (The planner can help by predicting the
> likely needed resources. Then, at run time, spill-to-disk and other
> mechanisms allow queries to honor the resource limits.)
>
> The scheduler-based design is nothing new: it seems to be what Impala
> settled on, it is what YARN does for batch jobs, and it is a common pattern
> in other query engines.
>
> Back to the RPC issue. With proper scheduling, we limit load on each
> Drillbit so that RPC (and ZK heartbeats) can operate correctly. That is,
> rather than overloading a node, then attempting to recover, we wish instead
> to manage the load to prevent the overload in the first place.
>
> A coming pull request will take a first, small, step: it will allocate
> memory to queries based on the limit set by the ZK-based queues. The next
> step is to figure out how to limit the number of threads per query. (As
> noted above, a single large query can overwhelm the cluster if, say, it
> tries to do 100 subqueries with many sorts, joins, etc.) We welcome
> suggestions and pointers to how others have solved the problem.
>
> We also keep tossing around the idea of introducing that central
> scheduler. But, that is quite a bit of work and we’ve heard that users seem
> to struggle with maintaining the YARN and Impala schedulers, so we’re
> somewhat hesitant to move away from a purely symmetrical configuration.
> Suggestions in this area are very welcome.
>
> For now, try turning on the ZK queues to limit concurrent queries and
> prevent overload. Ensure your cluster is sized for your workload. Ensure
> other work on the cluster is also symmetrical and does not compete with
> Drill for resources.
>
> And, please continue to share your experiences!
>
> Thanks,
>
> - Paul
>
> > On Aug 20, 2017, at 5:39 AM, weijie tong <to...@gmail.com>
> wrote:
> >
> > HI all:
> >
> >  Drill's current schedule policy seems a little simple. The
> > SimpleParallelizer assigns endpoints in round robin model which ignores
> the
> > system's load and other factors. To critical scenario, some drillbits are
> > suffering frequent full GCs which will let their control RPC blocked.
> > Current assignment will not exclude these drillbits from the next coming
> > queries's assignment. then the problem will get worse .
> >  I propose to add a zk path to hold bad drillbits. Forman will recognize
> > bad drillbits by waiting timeout (timeout of  control response from
> > intermediate fragments), then update the bad drillbits path. Next coming
> > queries will exclude these drillbits from the assignment list.
> >  How do you think about it or any suggests ? If sounds ok ,will file a
> > JIRA and give some contributes.
>
>

Re: Discuss about Drill's schedule policy

Posted by Paul Rogers <pr...@mapr.com>.
Hi Weijie,

Great analysis. Let’s look at a few more data points.

Drill has no central scheduler (this is a feature: it makes the cluster much easier to manage and has no single point of failure. It was probably the easiest possible solution while Drill was being built.) Instead of central control, Drill is based on the assumption of symmetry: all Drillbits are identical. So, each Foreman, acting independently, should try to schedule its load in a way that evenly distributes work across nodes in the cluster. If all Drillbits do the same, then load should be balanced; there should be no “hot spots.”
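
To make “evenly distributes” concrete: the per-Foreman assignment weijie described boils down to something like the sketch below. This is only an illustration of the round-robin idea, not the actual SimpleParallelizer code, and the names are invented.

  // Illustrative round-robin assignment, as done independently by each Foreman.
  // Not Drill's actual SimpleParallelizer; endpoints are just host names here.
  import java.util.List;

  class RoundRobinAssigner {
    private int next = 0;                              // per-Foreman cursor

    // Assign each minor fragment to the next endpoint in turn.
    String[] assign(List<String> endpoints, int fragmentCount) {
      String[] assignment = new String[fragmentCount];
      for (int i = 0; i < fragmentCount; i++) {
        assignment[i] = endpoints.get(next);
        next = (next + 1) % endpoints.size();
      }
      return assignment;
    }
  }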

Note, for this to work, Drill should either own the cluster, or any other workload on the cluster should also be evenly distributed.

Drill makes another simplification: that the cluster has infinite resources (or, equivalently, that the admin sized the cluster for peak load.) That is, as Sudheesh puts it, “Drill is optimistic.” Therefore, Drill usually runs with no throttling mechanism to limit overall cluster load. In real clusters, of course, resources are limited and either a large query load, or a few large queries, can saturate some or all of the available resources.

Drill has a feature, seldom used, to throttle queries based purely on number. These ZK-based queues can allow, say, 5 queries to run (each of which is assumed to be evenly distributed.) In actual fact, the ZK-based queues recognize that typical workloads have many small, and a few large, queries and so Drill offers the “small query” and “large query” queues.
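
For anyone who wants to experiment, the queues are switched on through system options (exec.queue.enable, exec.queue.small, exec.queue.large and exec.queue.threshold, if I remember the names correctly). Conceptually the effect is just a counting gate per queue, roughly like the local sketch below; the real version coordinates the counts across Foremen through ZooKeeper.

  // Conceptual sketch of the small/large query admission gate. Local only;
  // the ZooKeeper coordination and timeouts are omitted. Limits are examples.
  import java.util.concurrent.Semaphore;

  class QueryQueues {
    private final Semaphore small = new Semaphore(10); // "small query" slots
    private final Semaphore large = new Semaphore(2);  // "large query" slots
    private final double costThreshold = 30_000_000;   // planner-cost cutoff

    void admit(double plannerCost) throws InterruptedException {
      if (plannerCost < costThreshold) {
        small.acquire();   // block until a small-query slot frees up
      } else {
        large.acquire();   // block until a large-query slot frees up
      }
    }

    void release(double plannerCost) {
      (plannerCost < costThreshold ? small : large).release();
    }
  }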

OK, so that’s where we are today. I think I’m not stepping too far out of line to observe that the above model is just a bit naive. It does not take into consideration the available cores, memory or disk I/Os. It does not consider the fact that memory has a hard upper limit and must be managed. Drill creates one thread for each minor fragment limited by the number of cores. But, each query can contain dozens or more fragments, resulting in far, far more threads per query than a node has cores. That is, Drill’s current scheduling model does not consider that, above a certain level, adding more threads makes the system slower because of thrashing.
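
A back-of-the-envelope example of how the thread count runs away (the numbers are made up, though roughly 70% of cores is the default per-node width, if I recall correctly):

  // Rough arithmetic only: threads created on one node by a single query.
  public class ThreadMath {
    public static void main(String[] args) {
      int cores = 24;
      int widthPerNode = (int) (cores * 0.7); // roughly the default max width per node
      int majorFragments = 10;                // scans, exchanges, sorts, joins, ...
      int threadsOnNode = majorFragments * widthPerNode;
      System.out.println(threadsOnNode + " threads vs. " + cores + " cores");
      // Prints "160 threads vs. 24 cores" -- and that is a single query.
    }
  }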

You propose a closed-loop, reactive control system (schedule load based on observed load on each Drillbit.) However, control-system theory tells us that such a system is subject to oscillation. All Foremen observe that a node X is loaded so none send it work. Node X later finishes its work and becomes underloaded. All Foremen now prefer node X and it swings back to being overloaded. In fact, Impala tried a closed-loop design and there is some evidence in their documentation that they hit these very problems.

So, what else could we do? As we’ve wrestled with these issues, we’ve come to the understanding that we need an open-loop, predictive solution. That is a fancy name for what YARN or Mesos does: keep track of available resources, reserve them for a task, and monitor the task so that it stays within the resource allocation. Predict load via allocation rather than reacting to actual load.

In Drill, that might mean a scheduler which looks at all incoming queries and assigns cluster resources to each; queueing the query if necessary until resources become available. It also means that queries must live within their resource allocation. (The planner can help by predicting the likely needed resources. Then, at run time, spill-to-disk and other mechanisms allow queries to honor the resource limits.)
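
To make that concrete, here is a toy version of the admission step: reserve the planner’s memory estimate up front, queue the query when the reservation cannot be satisfied, and release on completion. All names are invented, and this is a single-node view with no failure handling.

  // Toy predictive admission control. Not a proposal for the real API.
  import java.util.ArrayDeque;
  import java.util.Queue;

  class ResourcePool {
    private long freeMemory;
    private final Queue<Runnable> waiting = new ArrayDeque<>();

    ResourcePool(long totalMemory) { this.freeMemory = totalMemory; }

    // Called with the planner's memory estimate for the query.
    synchronized void submit(long estimatedMemory, Runnable runQuery) {
      if (estimatedMemory <= freeMemory) {
        freeMemory -= estimatedMemory;
        runQuery.run();                 // in reality, dispatched asynchronously
      } else {
        waiting.add(() -> submit(estimatedMemory, runQuery));   // park the query
      }
    }

    synchronized void complete(long estimatedMemory) {
      freeMemory += estimatedMemory;
      Runnable next = waiting.poll();
      if (next != null) next.run();     // give the oldest waiting query another try
    }
  }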

The scheduler-based design is nothing new: it seems to be what Impala settled on, it is what YARN does for batch jobs, and it is a common pattern in other query engines.

Back to the RPC issue. With proper scheduling, we limit load on each Drillbit so that RPC (and ZK heartbeats) can operate correctly. That is, rather than overloading a node, then attempting to recover, we wish instead to manage the load to prevent the overload in the first place.

A coming pull request will take a first, small, step: it will allocate memory to queries based on the limit set by the ZK-based queues. The next step is to figure out how to limit the number of threads per query. (As noted above, a single large query can overwhelm the cluster if, say, it tries to do 100 subqueries with many sorts, joins, etc.) We welcome suggestions and pointers to how others have solved the problem.
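
The arithmetic behind that first step is intentionally simple, along these lines (a sketch only; the actual knobs and ratios may differ):

  // Sketch of queue-based memory allocation: each admitted query gets an equal
  // share of the memory reserved for queries on a node. Numbers are examples.
  public class QueueMemoryShare {
    public static void main(String[] args) {
      long nodeDirectMemory = 40L << 30;  // 40 GB of direct memory on the node
      double queryFraction = 0.75;        // leave headroom for other overhead
      int admittedQueries = 5;            // the ZK queue's concurrency limit

      long perQuery = (long) (nodeDirectMemory * queryFraction) / admittedQueries;
      System.out.println("Per-query budget on this node: " + (perQuery >> 30) + " GB");
      // Prints "Per-query budget on this node: 6 GB"
    }
  }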

We also keep tossing around the idea of introducing that central scheduler. But, that is quite a bit of work and we’ve heard that users seem to struggle with maintaining the YARN and Impala schedulers, so we’re somewhat hesitant to move away from a purely symmetrical configuration. Suggestions in this area are very welcome.

For now, try turning on the ZK queues to limit concurrent queries and prevent overload. Ensure your cluster is sized for your workload. Ensure other work on the cluster is also symmetrical and does not compete with Drill for resources.

And, please continue to share your experiences!

Thanks,

- Paul

> On Aug 20, 2017, at 5:39 AM, weijie tong <to...@gmail.com> wrote:
> 
> HI all:
> 
>  Drill's current schedule policy seems a little simple. The
> SimpleParallelizer assigns endpoints in round robin model which ignores the
> system's load and other factors. To critical scenario, some drillbits are
> suffering frequent full GCs which will let their control RPC blocked.
> Current assignment will not exclude these drillbits from the next coming
> queries's assignment. then the problem will get worse .
>  I propose to add a zk path to hold bad drillbits. Forman will recognize
> bad drillbits by waiting timeout (timeout of  control response from
> intermediate fragments), then update the bad drillbits path. Next coming
> queries will exclude these drillbits from the assignment list.
>  How do you think about it or any suggests ? If sounds ok ,will file a
> JIRA and give some contributes.