Posted to user@drill.apache.org by PROJJWAL SAHA <pr...@gmail.com> on 2017/02/23 07:31:27 UTC

Distribution of workload across nodes in a cluster

Hello,

I am running a select * query on a 1 GB CSV file with a 5-node Drill
cluster. The CSV file is stored in another storage cluster within the
enterprise.

In the query profile, I see one major fragment, and within it only one
minor fragment. The hostname for the minor fragment corresponds to one
of the nodes of the cluster.

It seems, therefore, that not all of the cluster's resources are being
utilized. Are there any configuration parameters that can be tweaked to
distribute the workload more effectively across the cluster machines?

Let me know what you think.

Regards,
Projjwal

Re: Distribution of workload across nodes in a cluster

Posted by Paul Rogers <pr...@mapr.com>.
In our test setup, it appears that large PSV (pipe-separated-values) files are divided into chunks for scanning in the usual Hadoop way. Drill seems to create one scan per chunk in the underlying file system (in our case, MFS with 256 MB chunks). I have not tested this particular scenario against HDFS or S3.

CSV files may be special, since Drill may need to read the header to get column names. I would need to play around a bit to check.
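
One way to poke at this is to look at the physical plan and the planner settings; a quick sketch (the workspace and file path below are placeholders, not anything from this thread):

    -- Show the physical plan, including how the scan is broken into fragments
    EXPLAIN PLAN FOR SELECT * FROM dfs.`/data/large.csv`;

    -- Inspect the options that bound scan parallelism
    SELECT * FROM sys.options WHERE name LIKE 'planner.width%';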

- Paul


Re: Distribution of workload across nodes in a cluster

Posted by Andries Engelbrecht <ae...@mapr.com>.
Look at your query profile to see where the time is spent. The text reader typically uses only a single thread per text file, so you want a larger number of files for larger data sets. The time taken also depends on the network between the two clusters; remote reads can be expensive and are typically not ideal if you want to use the data for interactive queries.

There have been improvements in the Parquet reader recently; I will let others comment on that.

Is there no way you can copy the data to the local Drill cluster? Using Drill to read the data remotely and copy it locally as Parquet will greatly speed up future queries.
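
A minimal sketch of that approach (the `remote` plugin name and the file paths are placeholders for whatever storage plugins you have configured):

    -- Write the remote CSV locally as Parquet once...
    ALTER SESSION SET `store.format` = 'parquet';
    CREATE TABLE dfs.tmp.`mydata_parquet` AS
    SELECT * FROM remote.`root`.`mydata.csv`;

    -- ...then run interactive queries against the local copy
    SELECT * FROM dfs.tmp.`mydata_parquet` LIMIT 100;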




--Andries


RE: Distribution of workload across nodes in a cluster

Posted by Chetan Kothari <ch...@oracle.com>.
Thanks, Andries.

Is there no way to utilize all nodes in the cluster if we have a single large file?

Will splitting the single file into multiple files help utilize all nodes in the cluster?

What if our use case requires querying a CSV/TSV file larger than 1 GB?

Even with a 1 GB Parquet file, a select query with LIMIT 100 takes more than 20 minutes.

Regards

Chetan

Re: Distribution of workload across nodes in a cluster

Posted by Andries Engelbrecht <ae...@mapr.com>.
Last I checked, CSV data is read with a single thread per file. To make matters more challenging, Drill will typically scan the whole file (and with a select * you are requesting a full scan of the data anyway).

Try splitting the file into several smaller files (128 MB or 256 MB, or smaller depending on your requirements). Also consider migrating the data to your Drill cluster, or using Parquet. In some use cases you may read the data remotely and then write it locally for repeated access; in that case, split the file into smaller files on the remote cluster and write it locally as Parquet.
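
Once the data is split, a sketch of how you would query it (the directory path and workspace name are illustrative only): point Drill at the directory rather than a single file, and each file in it can get its own scan thread:

    -- A directory of like-formatted files acts as a single table;
    -- each file can be scanned by its own minor fragment
    SELECT * FROM dfs.`/data/mydata_chunks/`;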


--Andries
