
Posted to common-user@hadoop.apache.org by "Gross, Danny" <Da...@spansion.com> on 2009/06/29 23:03:51 UTC

Teragen defaults to 2 maps; terasort defaults to 1 reducer

Hello all,

 

I'm trying to run the hadoop-0.19.1-examples.jar teragen and terasort
programs on a cluster.  I have two problems with these programs:

 

1.	The generated data is not balanced across my cluster.  This is
because the data is generated with 2 maps.

	*	With the command "hadoop jar hadoop-0.19.1-examples.jar
teragen 1000000000 /terasort"  (or any size) per the example doc, I get
2 maps.  With replication set to 2, this tends to place data more
heavily on 2 of my nodes, yet the cluster believes it is balanced.

 

2.	The terasort program runs out of disk space on the reduce
operation.  This is because the program runs with a single reduce task.


	*	When running "hadoop jar hadoop-0.19.1-examples.jar
terasort /terasort /out" per the example doc, I get the appropriate
number of maps, but one reduce.  I've scoured the web and the new Hadoop
book, and I'm just not able to change the number of reducers.  An
example attempt was with the command "hadoop jar
-Dmapred.reduce.tasks=16 hadoop-0.19.1-examples.jar terasort /terasort
/out".

 

Could anyone help shed some light on how to modify the execution of
these programs to more appropriately balance the data, and spread the
reduce load out across my cluster?  

 

Best regards,

 

Danny Gross
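
To put rough numbers on the two symptoms described above, here is a small sketch of the arithmetic. It assumes that teragen writes 100-byte rows (the TeraSort benchmark record size) and that rows are split evenly across tasks; the helper name per_task_gb is made up for illustration.

```shell
#!/bin/sh
# Assumption: each teragen row is 100 bytes (the TeraSort benchmark record size).
ROW_BYTES=100

# per_task_gb ROWS TASKS
# Prints the whole decimal gigabytes (10^9 bytes, rounded down) handled by
# each task, assuming the rows are split evenly. Hypothetical helper.
per_task_gb() {
    rows=$1
    tasks=$2
    echo $(( rows * ROW_BYTES / tasks / 1000000000 ))
}

ROWS=1000000000   # row count from the teragen command in the question

echo "total data:      $(per_task_gb "$ROWS" 1) GB"
echo "per map (2):     $(per_task_gb "$ROWS" 2) GB"
echo "per reduce (1):  $(per_task_gb "$ROWS" 1) GB"
echo "per reduce (16): $(per_task_gb "$ROWS" 16) GB"
```

Under these assumptions the question's run produces about 100 GB, written by two maps at about 50 GB each, and a single reduce task has to receive and write the entire data set, which is consistent with it running out of disk space.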

Re: Teragen defaults to 2 maps; terasort defaults to 1 reducer

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

Both are due to the default number of maps and reduces in Map-Reduce.

Use:
$ bin/hadoop jar hadoop-*-dev-examples.jar teragen -Dmapred.map.tasks=8000 10000000000 /tera/in
$ bin/hadoop jar hadoop-*-dev-examples.jar terasort -Dmapred.reduce.tasks=5300 /tera/in /tera/out

Arun
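
The visible difference between the failing attempt in the question and the commands in this reply is where the -D option sits: here it comes after the program name (teragen, terasort) and before that program's own arguments, whereas the failing attempt put it directly after "hadoop jar", ahead of the jar file. A minimal dry-run sketch of that ordering follows. It only prints the command lines and does not run Hadoop; EXAMPLES_JAR and the helper tera_cmd are hypothetical names, and the task count of 16 is a placeholder to be sized for your own cluster.

```shell
#!/bin/sh
# Jar name taken from the question; adjust to your installation.
EXAMPLES_JAR=${EXAMPLES_JAR:-hadoop-0.19.1-examples.jar}

# tera_cmd PROGRAM PROPERTY=VALUE ARGS...
# Prints a command line with the -D option placed after the program name
# and before the program's own arguments. Hypothetical helper.
tera_cmd() {
    program=$1
    property=$2
    shift 2
    echo "hadoop jar $EXAMPLES_JAR $program -D$property $*"
}

# Dry run: print the commands instead of executing them.
tera_cmd teragen  mapred.map.tasks=16    1000000000 /terasort
tera_cmd terasort mapred.reduce.tasks=16 /terasort /out
```

Once the printed lines look right, they can be pasted into a shell on a machine with the Hadoop client installed.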
