Posted to common-user@hadoop.apache.org by venkataswamy <ve...@gmail.com> on 2012/05/05 23:16:42 UTC

Nested map reduce job

Hi,
   I encountered a strange issue while developing a system. I have data where a
reducer receives about 3 million values. The reducer emits all the
permutations of the values.

Reducer {
    List<values>
    FindPermutations(List<values>)
    foreach (permutation)
        emit(key, permutation)
}
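
In Java (old mapred API) the structure is roughly the following. This is only
a sketch: the Text types and the findPermutations helper are stand-ins for my
real code.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class PermutationReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Buffer every value in memory -- this is the part that stops
        // scaling once a key has millions of values.
        List<String> buffered = new ArrayList<String>();
        while (values.hasNext()) {
            buffered.add(values.next().toString());
        }
        // findPermutations is a hypothetical helper that enumerates all
        // orderings of the buffered values.
        for (String permutation : findPermutations(buffered)) {
            output.collect(key, new Text(permutation));
        }
    }

    private List<String> findPermutations(List<String> values) {
        // Omitted in this sketch: with n values there are n! orderings.
        throw new UnsupportedOperationException("sketch only");
    }
}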


It is feasible to hold the values in memory to calculate permutations if the
number of values is low, say less than 10,000. Otherwise this is not
scalable, even from a computational point of view: n values yield n!
permutations.

I tried writing the values into a file, moving it to HDFS, and starting a new
MapReduce job for the permutations from within the reducer; this distributes
the reducer's load among the available machines. Let me call it a nested
MapReduce job. The parent task waits until the nested job completes and uses
the obtained result to emit the permutations. The parent job's task sits
idle, so the nested job's tasks could run on the same TaskTracker, but the
TaskTracker is not doing that. Is there a way to signal the TaskTracker that
the current task is paused or sitting idle, but should not be terminated?
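
To make the pattern concrete, the launch from inside the reducer looks
roughly like this (a sketch only; the class name, paths, and types here are
hypothetical, not my actual code):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class NestedLaunchReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Stage this key's values on HDFS (path is hypothetical).
        Path staged = new Path("/tmp/perm-input/" + key);
        // ... write the values to `staged` ...

        // Launch the nested job and block until it finishes. While we
        // block here, the parent task holds its slot but does no work.
        JobConf nested = new JobConf(NestedLaunchReducer.class);
        FileInputFormat.setInputPaths(nested, staged);
        FileOutputFormat.setOutputPath(nested, new Path("/tmp/perm-output/" + key));
        JobClient.runJob(nested);

        // ... read the nested job's output and emit the permutations ...
    }
}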

All the available TaskTrackers are running the parent job's tasks, so the
nested MapReduce job never gets resources to start, and we fall into a
deadlock scenario.

I can suspend the parent task after starting the nested job for the
permutations, and it does continue from the same instruction when it resumes.
In simple words, the parent task is not pausing but suspending.

Has anybody gotten into this situation? If you have any thoughts on it,
please post them here.


All your help is appreciated.


Thanks,
Venkat






RE: Nested map reduce job

Posted by Mingxi Wu <Mi...@turn.com>.
You may not need a nested map-reduce job.

All you need to do is use keys to partition the permutation work, and
duplicate the data from the map:

output.collect(new IntWritable(1), value);
output.collect(new IntWritable(2), value);
...
output.collect(new IntWritable(n), value);

Then set your number of reducers to n. When you emit data in the mapper, the
key is set to the target reducer's ID. In each reducer, enumerate only the
permutations with the prefix assigned to that reducer's ID.
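
Concretely, the duplicating mapper can look like this (a minimal sketch with
the old mapred API; the Text value type and the partition count N are my
assumptions for illustration):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DuplicatingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {

    // Number of prefix partitions; pick to match setNumReduceTasks(n).
    private static final int N = 8;

    public void map(LongWritable offset, Text value,
                    OutputCollector<IntWritable, Text> output, Reporter reporter)
            throws IOException {
        // Send a copy of every value to each of the n reducers; reducer i
        // then enumerates only the permutations that begin with prefix i.
        for (int r = 1; r <= N; r++) {
            output.collect(new IntWritable(r), value);
        }
    }
}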

You also need to ensure your records are hashed to the correct reducer. With
the default HashPartitioner this comes down to the key's hashCode(), so
implement something like:

public class Record implements WritableComparable<Record> {

    public void readFields(DataInput in) throws IOException { // xxx
    }

    public void write(DataOutput out) throws IOException { // xxx
    }

    public int compareTo(Record other) { // xxx
        return 0;
    }

    @Override
    public int hashCode() { // routes this record to the intended reducer
        return 0; // xxx
    }
}
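
Alternatively (my suggestion, not something you must do), a custom
Partitioner makes the routing explicit instead of relying on hashCode():

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class RecordPartitioner implements Partitioner<Record, Text> {

    public void configure(JobConf job) {
        // no configuration needed
    }

    public int getPartition(Record key, Text value, int numPartitions) {
        // The same routing the default HashPartitioner does, spelled out.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

Register it with conf.setPartitionerClass(RecordPartitioner.class).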

Hope this helps.

Mingxi



Re: Nested map reduce job

Posted by Shi Yu <sh...@uchicago.edu>.
A quick glance at your problem suggests that you might have a design problem
in your code. In my opinion you should avoid nested Map/Reduce jobs. You
could use chained Map/Reduce jobs (see the sketch below), but a nested or
recursive structure is not recommended. I don't know how you implemented your
nested M/R job; maybe show some code fragment? For the permutation problem,
it might be easier to split the permutation candidates across the mappers,
then sort (discarding duplicate values) at the reducers. A permutation of 3
million values seems huge. Are you sure you want to permute all 3 million
values (what problem requires that permutation), or do you just need to
permute a small set sampled from those 3 million values?
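
By "chained" I mean simply running the jobs back to back, with the first
job's output feeding the second. A minimal driver sketch (the paths, job
names, and class names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // First job: split/collect the permutation candidates.
        Job first = new Job(conf, "collect-candidates");
        first.setJarByClass(ChainedDriver.class);
        FileInputFormat.addInputPath(first, new Path("/input"));
        FileOutputFormat.setOutputPath(first, new Path("/tmp/candidates"));
        if (!first.waitForCompletion(true)) {
            System.exit(1);
        }

        // Second job: permute, reading the first job's output.
        Job second = new Job(conf, "permute");
        second.setJarByClass(ChainedDriver.class);
        FileInputFormat.addInputPath(second, new Path("/tmp/candidates"));
        FileOutputFormat.setOutputPath(second, new Path("/output"));
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}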

Shi
