Posted to common-user@hadoop.apache.org by "Mahajan, Neeraj" <ne...@ebay.com> on 2007/06/21 23:18:47 UTC

How to compose this as a Map reduce problem and solve it with hadoop?

I have text data available in split files. Each file contains, say, n
groups of data lines, and each group needs to be processed as one unit
of data. Later, based on different parameters, I would combine the
results of that earlier processing. Are there any suggestions on how
this can be implemented in the map-reduce framework?
 
To give a more detailed example, say I have two split files.
file 1 -- "F1"
Contents of F1:
Type a line1
Type a line2
Type a line3
Type b line1
Type b line2
 
file 2 -- "F2"
Contents of F2:
Type c line1
Type c line2
Type d line1
Type d line2
Type d line3
 
I have to process all the Type x lines at once and generate some output.
These multiple outputs can later be combined, grouped either by Type or
by some other criterion.
For now, we can assume that all the Type x lines are available in a
single split file.
One easy approach would be to write a map task that emits the Type x as
the key and the line as the value, then have a single reduce task per
Type (can that be done?) and do my processing in the reduce task. I
could then schedule further map-reduce jobs to combine the results, or
just do it in a single process.
The problem with this approach is that all the data is passed as-is
between the map and reduce tasks, and this data is huge!
I am not using HDFS.
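To make the intent concrete, here is a sketch of the grouping step I have in mind, written as plain Java rather than against the Hadoop API. The class name, the processGroup function, and the line-counting logic are all placeholders of mine, not anything the framework provides; a real job would do the same grouping with the map emitting (type, line) pairs and one reduce() call per distinct type.

```java
import java.util.*;

public class GroupByType {

    // Hypothetical per-group processing; here it just counts the lines.
    static String processGroup(String type, List<String> lines) {
        return type + ":" + lines.size();
    }

    // Group input lines of the form "Type x ..." by their type key,
    // then process each group as one unit. This mirrors a map phase
    // that emits key = type, followed by a per-type reduce.
    public static Map<String, String> run(List<String> input) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String line : input) {
            String[] parts = line.split(" ", 3); // "Type", key, rest of line
            groups.computeIfAbsent(parts[1], k -> new ArrayList<>()).add(line);
        }
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            results.put(e.getKey(), processGroup(e.getKey(), e.getValue()));
        }
        return results;
    }

    public static void main(String[] args) {
        // Same data as file F1 in the example above.
        List<String> f1 = Arrays.asList(
            "Type a line1", "Type a line2", "Type a line3",
            "Type b line1", "Type b line2");
        System.out.println(run(f1)); // {a=a:3, b=b:2}
    }
}
```

In Hadoop itself, the number of reduce tasks is set with JobConf.setNumReduceTasks(), and the framework already guarantees that all values for one key reach the same reduce() invocation, so "one reduce per Type" falls out of using Type as the key.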
 
Any thoughts on what approach to take?
 
~ Neeraj