You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "Akshay Bhasin (BLOOMBERG/ 731 LEX)" <ab...@bloomberg.net> on 2021/06/01 13:48:03 UTC

Re: Feature/Question

"I believe that any of the 5 drill machines can handle queries completely symmetrically. When a query is received, the planning is done and execution fragments are scheduled on the other nodes."

Thats interesting - I've not come across this. I currently have a lot of history of data partitioned in s3, and I've tried different queries - however the load has NOT been parallelized/distributed. 
Therefore - I had an understanding distributed mode refers to multiple queries being able to run on a single node - but not the other way around.

I was running the queries from drill-localhost instance from one of the nodes. Is there something else I should try to achieve the above behavior ?

From: ted.dunning@gmail.com At: 05/30/21 18:51:03 UTC-4:00To:  Akshay Bhasin (BLOOMBERG/ 731 LEX ) 
Cc:  dev@drill.apache.org
Subject: Re: Feature/Question


Here are some more answers:


On Thu, May 27, 2021 at 10:09 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <ab...@bloomberg.net> wrote:

Hi Ted, 

Yes sure - below are the 2 reasons for it - 

1) If I run 5 drill machines in a cluster, all connected to a single end point at s3, I'll have to use the machines to create the parquet files. Now, there are 2 sub questions here - 

         - I'm not sure if a single drill end point is exposed for me to query  ... a unique cluster ID I can use where all requests will be load balanced ?

I believe that any of the 5 drill machines can handle queries completely symmetrically. When a query is received, the planning is done and execution fragments are scheduled on the other nodes.

As such, you can either build a load balancer in front of the cluster or you can do roughly the same thing using DNS round-robin. It won't make a lot of difference, in practice, though because the load is spread around pretty well even if only one node does all of the planning (at least if your queries involve a lot of work).


         - What if the node goes down ? For instance, on a single node (say A in above example) - one user is running a read query & at the same time I run a create table query ? That would block and congest the node. 

If a node goes down, any query involving it will fail. The loss will be detected in a few seconds and any queries accepted by the cluster during that time may hang up a little bit. Once the failure has been detected, operation will continue without any problems. Clients may or may not retry their queries automatically (I think that most won't).
 

2) This is a minor one - and I could be wrong - I'm not sure drill can write to s3 bucket. I think you can only put/upload files there, you cannot write to it. 

Charles' answer was on the mark here.


Re: Feature/Question

Posted by Ted Dunning <te...@gmail.com>.
Akshay,

One of the major design goals of Drill is to run individual queries in a
parallelized fashion. It does that very well, under the right conditions.

So the first issue is the installation. You have to have a number of
drill-bits that are configured to work together. Your description of "running
the queries from drill-localhost instance" sounds at odds with this.

If you look at this page, there is a good description of how Drill
processes queries:

https://drill.apache.org/architecture/

This page echoes what I said earlier in more detail and with pictures.

To understand how to install Drill in distributed mode, this page can help:

https://drill.apache.org/docs/getting-started/

Look down the table of contents to find "Installing Drill in Distributed
Mode".

Now, there are issues that can prevent Drill from being able to parallelize
queries. In particular, if your data is in a single file the actual reading
of the data is likely to happen in a single thread of execution. If the
data that comes from that reading is not large and especially if your query
does not involve sorting or aggregating the data, Drill might opt not to
parallelize the processing of the data. This can happen even if your data
is spread across many files if Drill can determine that all of the
interesting data is in a single file via information about how the file is
partitioned.

Does this help?




On Tue, Jun 1, 2021 at 6:48 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
abhasin16@bloomberg.net> wrote:

> "I believe that any of the 5 drill machines can handle queries completely
> symmetrically. When a query is received, the planning is done and execution
> fragments are scheduled on the other nodes."
>
> Thats interesting - I've not come across this. I currently have a lot of
> history of data partitioned in s3, and I've tried different queries -
> however the load has NOT been parallelized/distributed.
> Therefore - I had an understanding distributed mode refers to multiple
> queries being able to run on a single node - but not the other way around.
>
> I was running the queries from drill-localhost instance from one of the
> nodes. Is there something else I should try to achieve the above behavior ?
>
> From: ted.dunning@gmail.com At: 05/30/21 18:51:03 UTC-4:00
> To: Akshay Bhasin (BLOOMBERG/ 731 LEX ) <ab...@bloomberg.net>
> Cc: dev@drill.apache.org
> Subject: Re: Feature/Question
>
>
> Here are some more answers:
>
>
> On Thu, May 27, 2021 at 10:09 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
> abhasin16@bloomberg.net> wrote:
>
>> Hi Ted,
>>
>> Yes sure - below are the 2 reasons for it -
>>
>> 1) If I run 5 drill machines in a cluster, all connected to a single end
>> point at s3, I'll have to use the machines to create the parquet files.
>> Now, there are 2 sub questions here -
>>
>> - I'm not sure if a single drill end point is exposed for me to query ...
>> a unique cluster ID I can use where all requests will be load balanced ?
>>
>
> I believe that any of the 5 drill machines can handle queries completely
> symmetrically. When a query is received, the planning is done and execution
> fragments are scheduled on the other nodes.
>
> As such, you can either build a load balancer in front of the cluster or
> you can do roughly the same thing using DNS round-robin. It won't make a
> lot of difference, in practice, though because the load is spread around
> pretty well even if only one node does all of the planning (at least if
> your queries involve a lot of work).
>
>
>> - What if the node goes down ? For instance, on a single node (say A in
>> above example) - one user is running a read query & at the same time I run
>> a create table query ? That would block and congest the node.
>>
>
> If a node goes down, any query involving it will fail. The loss will be
> detected in a few seconds and any queries accepted by the cluster during
> that time may hang up a little bit. Once the failure has been detected,
> operation will continue without any problems. Clients may or may not retry
> their queries automatically (I think that most won't).
>
>
>>
>> 2) This is a minor one - and I could be wrong - I'm not sure drill can
>> write to s3 bucket. I think you can only put/upload files there, you cannot
>> write to it.
>>
>
> Charles' answer was on the mark here.
>
>
>