Posted to dev@hive.apache.org by Bhavesh Shah <bh...@gmail.com> on 2012/05/08 06:37:49 UTC

Want to improve the performance for execution of Hive Jobs.

Hello all,
I have written Hive JDBC code and created a JAR of it. I am running that
JAR on a 10-node cluster.
But the problem is that even though I am using the 10-node cluster, the
performance is the same as on a single node.

What can I do to improve the performance of Hive jobs? Is there any
configuration setting to apply before submitting Hive jobs to the cluster?
One more thing I want to know: how can we tell whether a job is running on
all the nodes of the cluster?

Please let me know if anyone knows about this.

-- 
Regards,
Bhavesh Shah

Re: Want to improve the performance for execution of Hive Jobs.

Posted by Bejoy Ks <be...@yahoo.com>.
Hi Bhavesh

For the two properties you mentioned:

mapred.map.tasks
The number of map tasks is determined from the input splits and the input format; this property is effectively only a hint.

mapred.reduce.tasks
Your Hive job may not require a reduce task, hence Hive sets the number of reducers to zero.

As for the other parameters, I'm not sure why they are not even reflected in job.xml.
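
If you do need more mappers, the usual lever is the split size rather than mapred.map.tasks. As a quick sketch in the Hive CLI, with an illustrative byte value:

hive> SET mapred.max.split.size;
(prints the effective value the next job will use)
hive> SET mapred.max.split.size=67108864;
(a smaller maximum split produces more input splits, hence more map tasks)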

Regards
Bejoy KS




Re: Want to improve the performance for execution of Hive Jobs.

Posted by Bhavesh Shah <bh...@gmail.com>.
Thanks Bejoy for your reply.
Yes, I saw that a new XML is created for every job. But the values in it
differ from the variables I set.
For example, I set mapred.map.tasks=10 and mapred.reduce.tasks=2, yet in
all the job XMLs the value for map is 1 and for reduce is 0.
The same goes for the other parameters too.
Why is that?




-- 
Regards,
Bhavesh Shah

Re: Want to improve the performance for execution of Hive Jobs.

Posted by Bejoy KS <be...@yahoo.com>.
Hi Bhavesh
On a job level, if you set/override some properties, they won't go into mapred-site.xml. Check the corresponding job.xml to get the values. Also confirm from the task logs that there are no warnings about overriding those properties. If these two are good, then you can confirm that the properties you supplied are actually used for the job.
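
A quick way to cross-check from the Hive CLI itself (the property name below is just an example):

hive> SET mapred.min.split.size;
(prints the value the next job will actually be submitted with; it should
match what you see in that job's job.xml via the JobTracker web UI)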

Disclaimer: I'm not an AWS guy, so I can't comment on the specifics there. My responses relate to generic Hadoop behavior. :)


Regards
Bejoy KS

Sent from handheld, please excuse typos.



Re: Want to improve the performance for execution of Hive Jobs.

Posted by Bhavesh Shah <bh...@gmail.com>.
Hello Bejoy KS,
I did it the same way, by executing "hive -f <filename>" on Amazon EMR.
But when I looked at mapred-site.xml, all the variables I had set in that
file still showed their default values; I didn't see my values anywhere.

And the performance is slow too.
I had tried setting these values on my local cluster and saw some boost in
performance.




-- 
Regards,
Bhavesh Shah

Re: Want to improve the performance for execution of Hive Jobs.

Posted by Bejoy Ks <be...@yahoo.com>.
Hi Bhavesh

      I'm not sure about AWS, but from a quick read, cluster-wide settings like the HDFS block size can be set in hdfs-site.xml through bootstrap actions. Since you are changing the HDFS block size, set the min and max split sizes across the cluster through bootstrap actions as well. The rest of the properties can be set on a per-job level.

Doesn't AWS provide an option to use "hive -f"? If so, just put all the properties required for tuning, followed by the queries (in order), into a file and execute it with "hive -f <file name>".
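
For example, such a file (call it tune.hql; the values are illustrative and the query is a placeholder) could look like:

SET mapred.min.split.size=134217728;
SET mapred.max.split.size=268435456;
SET mapred.compress.map.output=true;

INSERT OVERWRITE TABLE result
SELECT a.id, COUNT(*) FROM src_a a JOIN src_b b ON (a.id = b.id) GROUP BY a.id;

and you would run it with:

hive -f tune.hql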

Regards
Bejoy KS


Re: Want to improve the performance for execution of Hive Jobs.

Posted by Bhavesh Shah <bh...@gmail.com>.
Thanks Bejoy KS for your reply.
I want to ask one thing: if I want to set these parameters on Amazon
Elastic MapReduce, then how can I set variables like the following?
e.g. SET mapred.min.split.size=m;
      SET mapred.max.split.size=m+n;
      set dfs.block.size=134217728
      set mapred.compress.map.output=true
      set io.sort.mb=400  etc....

For all of this, do I need to write a shell script that sets these variables
via /home/hadoop/hive/bin/hive -e 'set .....',
or should I pass all these steps as bootstrap actions?

I found this link describing the predefined bootstrap actions:
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Bootstrap.html#BootstrapPredefined

What should I do in this case?
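
From that page, it looks like it would be something along these lines with the predefined configure-hadoop action (the flags and the block-size value are just my guess from the guide, not tested):

elastic-mapreduce --create \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-h,dfs.block.size=134217728,-m,mapred.min.split.size=134217728"

Is that the right approach?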





-- 
Regards,
Bhavesh Shah

Re: Want to improve the performance for execution of Hive Jobs.

Posted by Bejoy Ks <be...@yahoo.com>.
Hi Bhavesh

     In Sqoop you can optimize performance by using --direct mode for the import and by increasing the number of mappers used for the import. When you increase the number of mappers, you need to ensure that the RDBMS connection pool can handle that many connections gracefully. Also, use an evenly distributed column as --split-by; that will ensure all mappers are roughly equally loaded.
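
For one table, such an import could look like the following sketch (the connection details, table, and column names are placeholders; --direct is left out here since it only applies to connectors that support it, such as MySQL/PostgreSQL):

sqoop import \
  --connect 'jdbc:sqlserver://dbhost:1433;databaseName=mydb' \
  --username hadoop --password secret \
  --table orders --split-by order_id -m 8 \
  --target-dir /user/hadoop/orders
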
   Min and max split sizes can be set at the job level, but there is a chance of a slight loss of data locality if you increase these values. By increasing them you increase the data volume processed per mapper, and hence get fewer mappers; you then need to see whether this actually gets you substantial performance gains. I haven't seen much gain when I tried it on some of my workflows in the past. A better approach would be to increase the HDFS block size itself, if your cluster deals with relatively large files. If you change the HDFS block size, then adjust the min and max split values accordingly.
    You can set the min and max split sizes using the SET command in the Hive CLI itself:
hive> SET mapred.min.split.size=m;
hive> SET mapred.max.split.size=m+n;
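
For instance, for roughly 128 MB splits (these properties take values in bytes; the numbers are illustrative):

hive> SET mapred.min.split.size=134217728;
hive> SET mapred.max.split.size=268435456;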

Regards
Bejoy KS
     



Re: Want to improve the performance for execution of Hive Jobs.

Posted by Bhavesh Shah <bh...@gmail.com>.
Thanks to both of you for your replies.
If I decide to deploy my JAR on Amazon Elastic MapReduce, then:

1) The default block size is 64 MB, so in such a case do I have to set it
to 128 MB... is that right?
2) Amazon EMR already has values for mapred.min.split.size
and mapred.max.split.size, and for the mappers and reducers too. So is
there any need to set those values? If yes, then how do I set them for the
whole cluster? Is it possible to apply all the above parameters to all
nodes via --bootstrap-actions while submitting jobs to Amazon EMR?

Thanks both of you very much.

-- 
Regards,
Bhavesh Shah



Re: Want to improve the performance for execution of Hive Jobs.

Posted by Mapred Learn <ma...@gmail.com>.
Try setting this value to your block size. For a 128 MB block size:

set mapred.min.split.size=134217728

Sent from my iPhone


Re: Want to improve the performance for execution of Hive Jobs.

Posted by Nitin Pawar <ni...@gmail.com>.
I am no expert on Sqoop, so I may be wrong, but importing 30 tables of
0.5M records each (table by table) is a huge operation. I would rather
just dump the data and import it using the Hive CLI (Sqoop is a good
choice too, but I don't know the benchmarks).

If you are doing so many joins, then it is better to be on a Hadoop
cluster instead of a single machine. If you have a 10-node cluster, it
should certainly improve your query performance.

Also, you may want to take a look at the different kinds of joins available
to you (map joins, bucketed map joins, skew joins, etc.), because each
comes with its own optimized approach.
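
For example, when one side of a join is small enough to fit in memory, a map join can be hinted like this (the table names are made up):

SELECT /*+ MAPJOIN(d) */ f.visit_id, d.dept_name
FROM fact_visits f JOIN dim_dept d ON (f.dept_id = d.dept_id);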


The options I mentioned in my previous mail are part of the job conf
submitted to Hadoop. On the Hive CLI we just set them on the command line,
or through a .hiverc file, as sketched below.
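
For example, a $HOME/.hiverc read by the Hive CLI at startup might contain
(values here are illustrative, not recommendations):

set mapred.min.split.size=134217728;
set hive.auto.convert.join=true;
set hive.exec.parallel=true;  -- run independent stages of a query in parallel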





-- 
Nitin Pawar

Re: Want to improve the performance for execution of Hive Jobs.

Posted by Bhavesh Shah <bh...@gmail.com>.
Thanks Nitin for your reply.

In short, my task is:
1) Import the data from MS SQL Server into HDFS using Sqoop.
2) Process the data through Hive and generate the result in one table.
3) Export that result table from Hive back to MS SQL Server.
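
(For reference, a minimal sketch of steps 1 and 3 with Sqoop; the host,
database, credentials, table names, and HDFS paths below are placeholders,
not taken from this thread:)

sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
  --username myuser --password mypass \
  --table source_table \
  --target-dir /user/hive/staging/source_table \
  --num-mappers 4

sqoop export \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
  --username myuser --password mypass \
  --table result_table \
  --export-dir /user/hive/warehouse/result_table \
  --num-mappers 4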

The data I am importing from MS SQL Server is very large (roughly 500,000
rows in one table, and I have 30 such tables). For this I have written a
Hive task that consists only of queries, each of which uses a lot of joins.
As a result the performance on my single local machine is very poor (it
takes about 3 hours to execute completely). I have also observed that a
single query submitted to the Hive CLI takes 10-11 MapReduce jobs to
complete.

Regarding set mapred.min.split.size and set mapred.max.split.size:
should these values be set in a bootstrap action while submitting jobs to
Amazon EMR? And what values should they be set to? I don't know.
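
(For reference, a hedged sketch of how such values could be passed via a
bootstrap action with the EMR command-line tools of that era; treat the
exact script path and flags as assumptions to verify against the EMR
documentation:)

elastic-mapreduce --create --alive --num-instances 10 \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.min.split.size=134217728"
# "-m,key=value" writes the setting into mapred-site.xml on each node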


-- 
Regards,
Bhavesh Shah



Re: Want to improve the performance for execution of Hive Jobs.

Posted by Nitin Pawar <ni...@gmail.com>.
1) Check the JobTracker URL to see how many maps/reducers have been
launched (see the sketch after this list).
2) If you have a large dataset and want it to execute fast, set
mapred.min.split.size and mapred.max.split.size to optimal values so that
more mappers are launched and finish sooner.
3) If you are doing joins, there are different ways to go depending on the
data you have and its size.
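
For example, a sketch of checking the same thing from the shell with the
MR1-era commands (the job ID below is a placeholder):

hadoop job -list                           # jobs currently running on the cluster
hadoop job -status job_201205080001_0001   # map/reduce completion for one job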

It will be helpful if you can let us know your data sizes and query details.



-- 
Nitin Pawar

Re: Want to improve the performance for execution of Hive Jobs.

Posted by Alexis De La Cruz Toledo <al...@gmail.com>.
Hi,
One way to know whether the job is running on the whole cluster is to look
at the Hadoop logs (by default under $HADOOP_HOME/logs). Another way is to
run the query in Hive and watch the web interface of Hadoop:

address: http://server_jobtracker:port_mapreduce
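
(With stock Hadoop 1.x settings the JobTracker web UI typically listens on
port 50030, e.g. http://jobtracker-host:50030/ where the hostname is a
placeholder.)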

Regards.




-- 
Ing. Alexis de la Cruz Toledo.
Av. Instituto Politécnico Nacional No. 2508, Col. San Pedro Zacatenco, México, D.F., 07360
CINVESTAV, DF.