Posted to user@hive.apache.org by sagar nikam <sa...@gmail.com> on 2012/10/30 14:51:29 UTC

problem in Hive performance

Respected sir,

     I am working with a 2.5 GB database whose tables range from only 40 rows
up to about 9 million rows.
Any query against the large tables takes a long time, and I would like the
results faster.

small query-->
=========================================================================
hive> select count(*) from cidade;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201210300724_0003, Tracking URL =
http://localhost:50030/jobdetails.jsp?jobid=job_201210300724_0003
Kill Command = /home/trendwise/Hadoop/hadoop-0.20.2/bin/../bin/hadoop job
-Dmapred.job.tracker=localhost:54311 -kill job_201210300724_0003
2012-10-30 07:37:41,588 Stage-1 map = 0%,  reduce = 0%
2012-10-30 07:37:57,493 Stage-1 map = 100%,  reduce = 0%
2012-10-30 07:38:17,905 Stage-1 map = 100%,  reduce = 33%
2012-10-30 07:38:20,965 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201210300724_0003
OK
5566
Time taken: 50.172 seconds
=================================================================================================================
hdfs-site.xml

<configuration>
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication.
  The actual number of replicas can be specified when the file is created.
  The default is used if replication is not specified at create time.
  </description>
</property>

<property>
  <name>dfs.block.size</name>
  <value>131072</value>
  <description>Default block size for new files, in bytes.
  The actual block size can be specified when the file is created.
  The default is used if a block size is not specified at create time.
  </description>
</property>
</configuration>


Do these settings affect Hive performance?
dfs.replication=3
dfs.block.size=131072

Can I set them from the Hive prompt, e.g.
hive> set dfs.replication=5;
Does the value apply only to that particular session,
or is it better to change it in the .xml file?



Which other settings should I tune to increase performance?



Sagar Nikam
Trendwise Analytics
Bangalore,INDIA

Re: problem in Hive performance

Posted by Bharath Ganesh <bh...@gmail.com>.
Yes, when you set a property via the 'set' command on the Hive CLI, it
lives for the life of that particular client session.
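For example, a session-scoped override looks like the sketch below; the value
reverts to whatever hive-site.xml / hdfs-site.xml specify once the CLI exits:

```
hive> set dfs.replication;
-- prints the value currently in effect
hive> set dfs.replication=2;
-- overrides it for this session only
hive> set dfs.replication;
-- now reports 2 until this CLI session ends
```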

There is no 'golden rule' that increases performance; it all depends on
your installation, data, and query patterns. Based on those, you can
consider join optimizations, partitioning, compression techniques, and
storage formats, if they really make sense for your use case and the
numbers bear that out.
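As one sketch of the partitioning idea: if your large table had a date column,
partitioning by it lets Hive scan only the matching partition's files instead
of all 9 million rows. The table and column names below (vendas_part, dt) are
made up for illustration:

```
-- Hypothetical partitioned table: each dt value gets its own directory,
-- so a query filtering on dt reads only that partition.
CREATE TABLE vendas_part (id INT, cidade_id INT, valor DOUBLE)
PARTITIONED BY (dt STRING)
STORED AS SEQUENCEFILE;

-- Only the dt='2012-10-30' partition is scanned here.
SELECT count(*) FROM vendas_part WHERE dt = '2012-10-30';

-- Let Hive convert joins against small tables into map-side joins.
SET hive.auto.convert.join=true;
```

Whether this helps depends on whether your queries actually filter on the
partition column; partitioning on a column you never filter by only adds
overhead.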

You might want to take a look at some these articles, which can be starting
points for you:
http://kb.tableausoftware.com/articles/knowledgebase/cloudera-hadoop-hive-performance
http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/

Thanks,
Bharath


