Posted to issues@drill.apache.org by "weijie.tong (JIRA)" <ji...@apache.org> on 2017/12/22 15:31:00 UTC

[jira] [Commented] (DRILL-5975) Resource utilization

    [ https://issues.apache.org/jira/browse/DRILL-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16301563#comment-16301563 ] 

weijie.tong commented on DRILL-5975:
------------------------------------

[~Paul.Rogers] Sorry for the late response. You may be right about our case: Druid does most of the heavy calculation work, which is why Drill's load is not so high. We will test with Parquet later and report the results.

But I am still skeptical about the current execution model, specifically the shuffle. In a shuffle, fragments at different levels form a producer/consumer model. Today the consumer waits for the producer, wasting the consumer's FragmentExecutor thread whenever the producer cannot output data quickly. If the producer generates data faster than the consumer can take it, we throttle the producer; that is the right choice for safe resource usage under the current implementation, but it still leaves threads idle. If each fragment's thread ran only when its data had actually arrived, CPU usage would stay high. My design tries to decouple the sender fragments from the receiver ones. A minimal sketch of the idea follows.
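
To make the event-driven idea concrete, here is a hypothetical Java sketch (EventDrivenReceiver, onBatchArrived, and the other names are illustrative only, not Drill APIs): no consumer thread blocks waiting on a producer; a short-lived task is submitted to a shared pool only when a batch has actually arrived.

{code:java}
import java.util.concurrent.*;

// Illustrative sketch only: none of these types exist in Drill.
// The point: no thread is parked waiting for a producer; a task is
// submitted to the pool only when a batch has actually arrived.
public class EventDrivenReceiver {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final BlockingQueue<String> arrived = new LinkedBlockingQueue<>();

    // Called by the transport layer when a producer pushes a batch.
    public void onBatchArrived(String batch) {
        arrived.add(batch);
        // Schedule a short-lived task instead of keeping a consumer
        // thread parked for the whole lifetime of the fragment.
        pool.submit(() -> {
            String b = arrived.poll();
            if (b != null) {
                process(b);
            }
        });
    }

    private void process(String batch) {
        System.out.println("processed " + batch);
    }

    public static void main(String[] args) throws Exception {
        EventDrivenReceiver r = new EventDrivenReceiver();
        r.onBatchArrived("batch-1");  // producer event drives consumption
        r.onBatchArrived("batch-2");
        r.pool.shutdown();
        r.pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
{code}

Because the pool is shared across fragments, an idle fragment costs no thread at all, which is exactly what keeps CPU usage high under many concurrent queries.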

I also noticed BigQuery's new [post|https://cloud.google.com/blog/big-data/2016/08/in-memory-query-execution-in-google-bigquery]. From the post, we learn that Dremel has dedicated server nodes that hold intermediate data in memory, and that it defines three roles: Producer, Consumer, and Controller. Logical producerIds and consumerIds represent the senders and receivers, which I guess makes it easier to schedule the running nodes without binding them to a specific host name, as we currently do. The Controller manages the shuffle; it is the key piece, but the post does not describe the details (this is very Google ;) ). I guess the data destined for each Consumer flows through the remote dedicated in-memory server nodes, and the Controller then notifies the Consumer nodes to ingest it. A sketch of the logical-id idea appears below.
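
Below is a hypothetical sketch of how I understand the logical-id idea; ShuffleController, bindConsumer, and resolveConsumer are made-up names for illustration. Because producers address consumers by a stable consumerId, the Controller can rebind an endpoint to a different host without replanning the query:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the idea in the BigQuery post: producers and
// consumers are addressed by stable logical ids, and a Controller
// maintains the id -> host mapping, so a shuffle endpoint can move to
// another host without the sender side ever noticing.
public class ShuffleController {
    private final Map<Integer, String> producerHosts = new HashMap<>();
    private final Map<Integer, String> consumerHosts = new HashMap<>();

    // The Controller (re)binds a logical endpoint to a physical host.
    public void bindProducer(int producerId, String host) {
        producerHosts.put(producerId, host);
    }

    public void bindConsumer(int consumerId, String host) {
        consumerHosts.put(consumerId, host);
    }

    // A sender asks where the data for a consumerId should go right now.
    public String resolveConsumer(int consumerId) {
        return consumerHosts.get(consumerId);
    }

    public static void main(String[] args) {
        ShuffleController c = new ShuffleController();
        c.bindConsumer(7, "node-a:31010");
        System.out.println(c.resolveConsumer(7)); // node-a:31010
        // The consumer moves; producers keep using the same logical id.
        c.bindConsumer(7, "node-b:31010");
        System.out.println(c.resolveConsumer(7)); // node-b:31010
    }
}
{code}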

I hope we can draw inspiration from this to benefit Drill.

> Resource utilization
> --------------------
>
>                 Key: DRILL-5975
>                 URL: https://issues.apache.org/jira/browse/DRILL-5975
>             Project: Apache Drill
>          Issue Type: New Feature
>    Affects Versions: 2.0.0
>            Reporter: weijie.tong
>            Assignee: weijie.tong
>
> h1. Motivation
> Currently the resource utilization ratio of a Drill cluster is not good: most of the cluster's resources are wasted, and we cannot afford many concurrent queries. Once the system accepts more queries, even at a modest CPU load, queries that were originally very fast become slower and slower.
> The reason is that Drill does not supply a scheduler. It simply assumes all nodes have enough computation resources. When a query arrives, Drill schedules the related fragments to random nodes without considering each node's load, so some nodes suffer extra CPU context switches to satisfy the incoming query. The deeper cause is that the runtime minor fragments form a runtime tree whose nodes are spread across different drillbits. The runtime tree is an in-memory pipeline: every node stays alive for the whole lifecycle of the query, successively sending data up to the nodes above it, even though some nodes could finish quickly and exit immediately. What's more, the runtime tree is constructed before execution actually starts, so the scheduling target in Drill becomes the entire set of runtime tree nodes.
> h1. Design
> It would be hard to schedule the runtime tree nodes as a whole, so I try to solve this by breaking up the cascade of runtime nodes. The graph below describes the initial design. !https://raw.githubusercontent.com/wiki/weijietong/drill/images/design.png! [graph link|https://raw.githubusercontent.com/wiki/weijietong/drill/images/design.png]
> Every Drillbit instance will have a RecordBatchManager, which accepts all the RecordBatches written by the senders of the local MinorFragments. The RecordBatchManager holds the RecordBatches in memory first, then spills to disk storage. When the first RecordBatch of a query's MinorFragment sender arrives, it notifies the FragmentScheduler. The FragmentScheduler is instantiated by the Foreman and holds the whole PlanFragment execution graph. It allocates a new corresponding FragmentExecutor to consume the generated RecordBatches. The allocated FragmentExecutor then notifies the corresponding FragmentManager that it is ready to receive data. The FragmentManager then sends the RecordBatches one by one to the corresponding FragmentExecutor's receiver, throttling the data stream just as the current Sender does. A sketch of this flow appears below.
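>
> A minimal, hypothetical Java sketch of the intended flow (the types below are illustrative only, not existing Drill classes): the manager buffers batches in memory first, spills past a threshold, and notifies the scheduler when a sender's first batch arrives.
>
> {code:java}
> import java.util.ArrayDeque;
> import java.util.Deque;
>
> // Illustrative sketch only; these are not existing Drill classes.
> public class RecordBatchManagerSketch {
>     interface FragmentScheduler {
>         // Invoked when the first batch of a sending fragment arrives.
>         void onFirstBatch(int senderFragmentId);
>     }
>
>     private final Deque<byte[]> inMemory = new ArrayDeque<>();
>     private final int memoryLimitBatches;
>     private final FragmentScheduler scheduler;
>     private boolean firstBatchSeen = false;
>
>     RecordBatchManagerSketch(int memoryLimitBatches, FragmentScheduler scheduler) {
>         this.memoryLimitBatches = memoryLimitBatches;
>         this.scheduler = scheduler;
>     }
>
>     // Called by a local sender; buffer in memory first, spill past the limit.
>     public void accept(int senderFragmentId, byte[] batch) {
>         if (!firstBatchSeen) {
>             firstBatchSeen = true;
>             scheduler.onFirstBatch(senderFragmentId); // wake the consumer side
>         }
>         if (inMemory.size() < memoryLimitBatches) {
>             inMemory.add(batch);
>         } else {
>             spillToDisk(batch);
>         }
>     }
>
>     // Called once the allocated FragmentExecutor reports it is ready;
>     // batches are handed over one by one, which throttles the stream.
>     public byte[] nextBatch() {
>         return inMemory.poll();
>     }
>
>     private void spillToDisk(byte[] batch) {
>         // Placeholder for real disk storage.
>         System.out.println("spilled batch of " + batch.length + " bytes");
>     }
>
>     public static void main(String[] args) {
>         RecordBatchManagerSketch m = new RecordBatchManagerSketch(1,
>             id -> System.out.println("notify scheduler: fragment " + id));
>         m.accept(3, new byte[]{1, 2});   // first batch -> scheduler notified
>         m.accept(3, new byte[]{3, 4});   // over the limit -> spilled
>         System.out.println("delivered " + m.nextBatch().length + " bytes");
>     }
> }
> {code}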
> What we can gain from this design is :
> a. A computation leaf node no longer has to wait on the consumer's speed before ending its life and releasing its resources.
> b. The data-sending logic is isolated from the computation nodes and shared by different FragmentManagers.
> c. We can schedule the MajorFragments according to each Drillbit's actual resource capacity at runtime.
> d. Drill's pipelined data processing characteristic is also retained.
> h1. Plan
> This will be a large PR, so I plan to divide it into several small ones:
> a. implement the RecordBatchManager.
> b. implement a simple random FragmentScheduler and the whole event flow.
> c. implement a more sophisticated FragmentScheduler, possibly referencing the Sparrow project.


