Posted to common-user@hadoop.apache.org by Cornelio Iñigo <co...@gmail.com> on 2010/11/25 23:25:52 UTC

best way to implement a solution

Hi,
 I have a program that analyzes text from a CSV file. It has 9 operators
(functions), and in my normal Java program the main class calls these
functions serially (the second function starts only when the first one
finishes, and so on). My current solution was to put all these functions in
one map function, something like the following:

static class Map extends Mapper<LongWritable, Text, Text, IntWritable>{

                 //declaration of operators  objects
                 Operator1 op1 = new Operator1();
                 ...
                 ...

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

                                // convert the value (a row of the CSV) to a string
                                String line = value.toString();

                              //begin with the process of operators


                                op1.process(line );
                                String[] sentences = op1.getSentences();

                                op2.process(sentences);
                                String[][] tokens = op2.getTokens();

                               op3...
                               op4...

                                // the final result is in a matrix; write out the results
                                for (int k = 0; k < matrix.length; k++) {
                                    for (int j = 0; j < matrix[k].length; j++) {
                                        context.write(x, y);    // results
                                    }
                                }

        }
    }
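
The serial flow described above can also be sketched as a small standalone
Java program, without Hadoop. This is only an illustration under assumptions:
the two operators shown here (splitSentences and tokenize), their logic, and
the class name SerialPipelineSketch are made-up stand-ins, not the real
operators, and the remaining operators are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SerialPipelineSketch {

    // Hypothetical stand-in for the first operator: splits a row into sentences.
    static String[] splitSentences(String line) {
        List<String> sentences = new ArrayList<>();
        for (String part : line.split("[.!?]")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences.toArray(new String[0]);
    }

    // Hypothetical stand-in for the second operator: splits each sentence into tokens.
    static String[][] tokenize(String[] sentences) {
        String[][] tokens = new String[sentences.length][];
        for (int i = 0; i < sentences.length; i++) {
            tokens[i] = sentences[i].split("\\s+");
        }
        return tokens;
    }

    // Runs the operators one after another on a single row, as the map() body does.
    static String[][] processRow(String line) {
        String[] sentences = splitSentences(line);
        return tokenize(sentences);
    }

    public static void main(String[] args) {
        String[][] matrix = processRow("Hadoop is fast. Cascading was faster!");
        // Print one line per sentence, showing its tokens.
        for (String[] row : matrix) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Each operator here takes its input as a parameter and returns its result, so
the chain holds no state between rows. That is one possible design, not
necessarily how the real operators are written.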



When I tested it against the plain Java program, Hadoop reduced the run time
(Java program: 30 minutes, Hadoop: 6 minutes). But when I ran it on a much
bigger CSV and compared it with a Cascading implementation, Cascading was a
lot faster (28 minutes vs. 1 hour and 30 minutes!).

My question is whether it is fine to put all these functions (9 operators) in
a single map.
Is there a better way to do it?


Thanks
-- 
*Cornelio*