Posted to common-user@hadoop.apache.org by Aayush Garg <aa...@gmail.com> on 2008/03/26 17:39:05 UTC

Hadoop: Multiple map reduce or some better way

Hi,
I am developing a simple inverted index program with Hadoop. My map
function has the output:
<word, doc>
and the reducer has:
<word, list(docs)>

Now I want to use one more MapReduce job to remove stop words and scrub words
from this output. Also, in the next stage I would like to have a short summary
associated with every word. How should I design my program from this stage?
I mean, how would I apply multiple MapReduce jobs to this? What would be the
best way to do this?

Thanks,

Regards,
-
Aayush Garg,
Phone: +41 76 482 240

Re: Hadoop: Multiple map reduce or some better way

Posted by Theodore Van Rooy <mu...@gmail.com>.
In my experience the advice above is good... the less reading and writing
that you have to do at each step the better.

While you could do map | reduce | map | reduce as you are proposing,
perhaps you could try several maps in a row, i.e.:

map -> <word, doc> | no reduce -> map -> <word doc, scrubbed word> |
reduce -> <scrubbed word, list(docs)>
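
A rough sketch of that two-pass layout in the old org.apache.hadoop.mapred Java API
(the ScrubMapper class, the intermediate "scrubbed" path and the driver class name
are purely illustrative, and the input/output path setters have moved between JobConf
and FileInputFormat/FileOutputFormat across Hadoop versions):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoPassDriver {
  public static void main(String[] args) throws Exception {
    // Pass 1: map-only "scrub" job; with zero reducers the map output is the job output.
    JobConf scrub = new JobConf(TwoPassDriver.class);
    scrub.setJobName("scrub");
    scrub.setMapperClass(ScrubMapper.class);   // hypothetical mapper that drops stop/scrub words
    scrub.setNumReduceTasks(0);                // the "no reduce" step
    scrub.setOutputKeyClass(Text.class);
    scrub.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(scrub, new Path(args[0]));
    FileOutputFormat.setOutputPath(scrub, new Path("scrubbed"));
    JobClient.runJob(scrub);                   // blocks until this pass finishes

    // Pass 2: the usual map + reduce over the scrubbed records to build <word, list(docs)>.
    JobConf invert = new JobConf(TwoPassDriver.class);
    invert.setJobName("invert");
    invert.setMapperClass(MapClass.class);     // MapClass/Reduce as in the code posted later in the thread
    invert.setReducerClass(Reduce.class);
    invert.setOutputKeyClass(Text.class);
    invert.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(invert, new Path("scrubbed"));
    FileOutputFormat.setOutputPath(invert, new Path(args[1]));
    JobClient.runJob(invert);
  }
}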

Also, if you consider how Hadoop Streaming works, you might just write a
script in Python (or whatever) that does

stdin | map - everything you want to do in one script | reduce - aggregate
results of the previous map script

Because your data set is spread out over some number of blocks, you may be able to
gain more parallelization speedup by simply doing everything you want in one
step and then aggregating it with a reduce.  Though this sidesteps the
MapReduce paradigm of <key, value>, it achieves the benefit of using Hadoop
to handle the distribution of tasks and pieces of the file.



On Wed, Mar 26, 2008 at 12:19 PM, Arun C Murthy <ar...@yahoo-inc.com> wrote:

>
> On Mar 26, 2008, at 11:05 AM, Arun C Murthy wrote:
>
> >
> > On Mar 26, 2008, at 9:39 AM, Aayush Garg wrote:
> >
> >> HI,
> >> I am developing the simple inverted index program frm the hadoop.
> >> My map
> >> function has the output:
> >> <word, doc>
> >> and the reducer has:
> >> <word, list(docs)>
> >>
> >> Now I want to use one more mapreduce to remove stop and scrub
> >> words from
> >> this output. Also in the next stage I would like to have short summay
> >> associated with every word. How should I design my program from
> >> this stage?
> >> I mean how would I apply multiple mapreduce to this? What would be
> >> the
> >> better way to perform this?
> >>
> >
> > In general you are better off with lesser number of Map-Reduce
> > jobs ... lesser i/o works better.
> >
>
> I forgot to add that you can use the apis in JobClient and JobControl
> to chain jobs together ...
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Job+Control
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#JobControl
>
> Arun
>
> > Use the DistributedCache if you can and fix your first Map to not
> > emit the stop words at all. Use the combiner to crunch down amount
> > of intermediate map-outputs etc.
> >
> > Something useful to look at:
> > http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0
> >
> > Arun
> >
> >> Thanks,
> >>
> >> Regards,
> >> -
> >> Aayush Garg,
> >> Phone: +41 76 482 240
> >
>
>


-- 
Theodore Van Rooy
http://greentheo.scroggles.com

Re: Hadoop: Multiple map reduce or some better way

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
On Mar 26, 2008, at 11:05 AM, Arun C Murthy wrote:

>
> On Mar 26, 2008, at 9:39 AM, Aayush Garg wrote:
>
>> HI,
>> I am developing the simple inverted index program frm the hadoop.  
>> My map
>> function has the output:
>> <word, doc>
>> and the reducer has:
>> <word, list(docs)>
>>
>> Now I want to use one more mapreduce to remove stop and scrub  
>> words from
>> this output. Also in the next stage I would like to have short summay
>> associated with every word. How should I design my program from  
>> this stage?
>> I mean how would I apply multiple mapreduce to this? What would be  
>> the
>> better way to perform this?
>>
>
> In general you are better off with lesser number of Map-Reduce  
> jobs ... lesser i/o works better.
>

I forgot to add that you can use the APIs in JobClient and JobControl
to chain jobs together:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Job+Control
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#JobControl
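
For illustration, a minimal chaining sketch with the old-API JobControl classes (the
class name and polling interval are arbitrary, and both JobConf objects are assumed
to be fully configured with mapper, reducer and input/output paths elsewhere):

import java.io.IOException;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainJobs {
  // Runs stage 2 only after stage 1 has completed successfully.
  public static void runChained(JobConf stage1Conf, JobConf stage2Conf)
      throws IOException, InterruptedException {
    Job stage1 = new Job(stage1Conf);
    Job stage2 = new Job(stage2Conf);
    stage2.addDependingJob(stage1);        // declare the dependency between the two jobs

    JobControl control = new JobControl("inverted-index-chain");
    control.addJob(stage1);
    control.addJob(stage2);

    Thread runner = new Thread(control);   // JobControl is a Runnable that submits ready jobs
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);                  // poll until both jobs have finished
    }
    control.stop();
  }
}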

Arun

> Use the DistributedCache if you can and fix your first Map to not  
> emit the stop words at all. Use the combiner to crunch down amount  
> of intermediate map-outputs etc.
>
> Something useful to look at:
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0
>
> Arun
>
>> Thanks,
>>
>> Regards,
>> -
>> Aayush Garg,
>> Phone: +41 76 482 240
>


Re: Hadoop: Multiple map reduce or some better way

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
On Mar 26, 2008, at 9:39 AM, Aayush Garg wrote:

> HI,
> I am developing the simple inverted index program frm the hadoop.  
> My map
> function has the output:
> <word, doc>
> and the reducer has:
> <word, list(docs)>
>
> Now I want to use one more mapreduce to remove stop and scrub words  
> from
> this output. Also in the next stage I would like to have short summay
> associated with every word. How should I design my program from  
> this stage?
> I mean how would I apply multiple mapreduce to this? What would be the
> better way to perform this?
>

In general you are better off with a smaller number of Map-Reduce
jobs ... less i/o works better.

Use the DistributedCache if you can and fix your first Map to not
emit the stop words at all. Use the combiner to crunch down the amount of
intermediate map-outputs, etc.

Something useful to look at:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0
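
A rough sketch of such a mapper, along the lines of the WordCount v2.0 example (the
stop-word file, class name and error handling are illustrative; the driver is assumed
to have called DistributedCache.addCacheFile(...) with the stop-word file):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class StopWordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Set<String> stopWords = new HashSet<String>();
  private final Text word = new Text();
  private final Text doc = new Text();
  private String inputFile;

  public void configure(JobConf job) {
    inputFile = job.get("map.input.file");
    try {
      // Load the cached stop-word file (one word per line) on each task.
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      if (cached != null && cached.length > 0) {
        BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = in.readLine()) != null) {
          stopWords.add(line.trim());
        }
        in.close();
      }
    } catch (IOException e) {
      throw new RuntimeException("Could not read the cached stop-word file", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    doc.set(inputFile);
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      String token = itr.nextToken();
      if (stopWords.contains(token)) {
        continue;                          // never emit stop words at all
      }
      word.set(token);
      output.collect(word, doc);
    }
  }
}

A combiner that merges the per-word doc lists map-side would then be registered on
the job with conf.setCombinerClass(...), cutting down the intermediate data that
reaches the reducers.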

Arun

> Thanks,
>
> Regards,
> -
> Aayush Garg,
> Phone: +41 76 482 240


Re: Hadoop: Multiple map reduce or some better way

Posted by Aayush Garg <aa...@gmail.com>.
Please give your inputs on my problem.

Thanks,


On Sat, Apr 5, 2008 at 1:10 AM, Robert Dempsey <rd...@techcfl.com> wrote:

> Ted,
>
> It appears that Nutch hasn't been updated in a while (in Internet time at
> least). Do you know if it works with the latest versions of Hadoop? Thanks.
>
> - Robert Dempsey (new to the list)
>
>
> On Apr 4, 2008, at 5:36 PM, Ted Dunning wrote:
>
> >
> >
> > See Nutch.  See Nutch run.
> >
> > http://en.wikipedia.org/wiki/Nutch
> > http://lucene.apache.org/nutch/
> >
>


-- 
Aayush Garg,
Phone: +41 76 482 240

Re: Hadoop: Multiple map reduce or some better way

Posted by Robert Dempsey <rd...@techcfl.com>.
Ted,

It appears that Nutch hasn't been updated in a while (in Internet time  
at least). Do you know if it works with the latest versions of Hadoop?  
Thanks.

- Robert Dempsey (new to the list)

On Apr 4, 2008, at 5:36 PM, Ted Dunning wrote:
>
>
> See Nutch.  See Nutch run.
>
> http://en.wikipedia.org/wiki/Nutch
> http://lucene.apache.org/nutch/

Re: Hadoop: Multiple map reduce or some better way

Posted by Ted Dunning <td...@veoh.com>.

See Nutch.  See Nutch run.

http://en.wikipedia.org/wiki/Nutch
http://lucene.apache.org/nutch/



On 4/4/08 1:22 PM, "Aayush Garg" <aa...@gmail.com> wrote:

> Hi,
> 
> I have not used lucene index ever before. I do not get how we build it with
> hadoop Map reduce. Basically what I was looking for like how to implement
> multilevel map/reduce for my mentioned problem.
> 
> 
> On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <ni...@gmail.com> wrote:
> 
>> You can build Lucene indexes using Hadoop Map/Reduce. See the index
>> contrib package in the trunk. Or is it still not something you are
>> looking for?
>> 
>> Regards,
>> Ning
>> 
>> On 4/4/08, Aayush Garg <aa...@gmail.com> wrote:
>>> No, currently my requirement is to solve this problem by apache hadoop.
>> I am
>>> trying to build up this type of inverted index and then measure
>> performance
>>> criteria with respect to others.
>>> 
>>> Thanks,
>>> 
>>> 
>>> On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <td...@veoh.com> wrote:
>>> 
>>>> 
>>>> Are you implementing this for instruction or production?
>>>> 
>>>> If production, why not use Lucene?
>>>> 
>>>> 
>>>> On 4/3/08 6:45 PM, "Aayush Garg" <aa...@gmail.com> wrote:
>>>> 
>>>>> HI  Amar , Theodore, Arun,
>>>>> 
>>>>> Thanks for your reply. Actaully I am new to hadoop so cant figure
>> out
>>>> much.
>>>>> I have written following code for inverted index. This code maps
>> each
>>>> word
>>>>> from the document to its document id.
>>>>> ex: apple file1 file123
>>>>> Main functions of the code are:-
>>>>> 
>>>>> public class HadoopProgram extends Configured implements Tool {
>>>>> public static class MapClass extends MapReduceBase
>>>>>     implements Mapper<LongWritable, Text, Text, Text> {
>>>>> 
>>>>>     private final static IntWritable one = new IntWritable(1);
>>>>>     private Text word = new Text();
>>>>>     private Text doc = new Text();
>>>>>     private long numRecords=0;
>>>>>     private String inputFile;
>>>>> 
>>>>>    public void configure(JobConf job){
>>>>>         System.out.println("Configure function is called");
>>>>>         inputFile = job.get("map.input.file");
>>>>>         System.out.println("In conf the input file is"+inputFile);
>>>>>     }
>>>>> 
>>>>> 
>>>>>     public void map(LongWritable key, Text value,
>>>>>                     OutputCollector<Text, Text> output,
>>>>>                     Reporter reporter) throws IOException {
>>>>>       String line = value.toString();
>>>>>       StringTokenizer itr = new StringTokenizer(line);
>>>>>       doc.set(inputFile);
>>>>>       while (itr.hasMoreTokens()) {
>>>>>         word.set(itr.nextToken());
>>>>>         output.collect(word,doc);
>>>>>       }
>>>>>       if(++numRecords%4==0){
>>>>>        System.out.println("Finished processing of input
>>>> file"+inputFile);
>>>>>      }
>>>>>     }
>>>>>   }
>>>>> 
>>>>>   /**
>>>>>    * A reducer class that just emits the sum of the input values.
>>>>>    */
>>>>>   public static class Reduce extends MapReduceBase
>>>>>     implements Reducer<Text, Text, Text, DocIDs> {
>>>>> 
>>>>>   // This works as K2, V2, K3, V3
>>>>>     public void reduce(Text key, Iterator<Text> values,
>>>>>                        OutputCollector<Text, DocIDs> output,
>>>>>                        Reporter reporter) throws IOException {
>>>>>       int sum = 0;
>>>>>       Text dummy = new Text();
>>>>>       ArrayList<String> IDs = new ArrayList<String>();
>>>>>       String str;
>>>>> 
>>>>>       while (values.hasNext()) {
>>>>>          dummy = values.next();
>>>>>          str = dummy.toString();
>>>>>          IDs.add(str);
>>>>>        }
>>>>>        DocIDs dc = new DocIDs();
>>>>>        dc.setListdocs(IDs);
>>>>>       output.collect(key,dc);
>>>>>     }
>>>>>   }
>>>>> 
>>>>>  public int run(String[] args) throws Exception {
>>>>>   System.out.println("Run function is called");
>>>>>     JobConf conf = new JobConf(getConf(), WordCount.class);
>>>>>     conf.setJobName("wordcount");
>>>>> 
>>>>>     // the keys are words (strings)
>>>>>     conf.setOutputKeyClass(Text.class);
>>>>> 
>>>>>     conf.setOutputValueClass(Text.class);
>>>>> 
>>>>> 
>>>>>     conf.setMapperClass(MapClass.class);
>>>>> 
>>>>>     conf.setReducerClass(Reduce.class);
>>>>> }
>>>>> 
>>>>> 
>>>>> Now I am getting output array from the reducer as:-
>>>>> word \root\test\test123, \root\test12
>>>>> 
>>>>> In the next stage I want to stop 'stop  words',  scrub words etc.
>> and
>>>> like
>>>>> position of the word in the document. How would I apply multiple
>> maps or
>>>>> multilevel map reduce jobs programmatically? I guess I need to make
>>>> another
>>>>> class or add some functions in it? I am not able to figure it out.
>>>>> Any pointers for these type of problems?
>>>>> 
>>>>> Thanks,
>>>>> Aayush
>>>>> 
>>>>> 
>>>>> On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <am...@yahoo-inc.com>
>>>> wrote:
>>>>> 
>>>>>> On Wed, 26 Mar 2008, Aayush Garg wrote:
>>>>>> 
>>>>>>> HI,
>>>>>>> I am developing the simple inverted index program frm the hadoop.
>> My
>>>> map
>>>>>>> function has the output:
>>>>>>> <word, doc>
>>>>>>> and the reducer has:
>>>>>>> <word, list(docs)>
>>>>>>> 
>>>>>>> Now I want to use one more mapreduce to remove stop and scrub
>> words
>>>> from
>>>>>> Use distributed cache as Arun mentioned.
>>>>>>> this output. Also in the next stage I would like to have short
>> summay
>>>>>> Whether to use a separate MR job depends on what exactly you mean
>> by
>>>>>> summary. If its like a window around the current word then you can
>>>>>> possibly do it in one go.
>>>>>> Amar
>>>>>>> associated with every word. How should I design my program from
>> this
>>>>>> stage?
>>>>>>> I mean how would I apply multiple mapreduce to this? What would be
>> the
>>>>>>> better way to perform this?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Regards,
>>>>>>> -
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Aayush Garg,
>>> Phone: +41 76 482 240
>>> 
>> 
> 
> 


Re: Hadoop: Multiple map reduce or some better way

Posted by Nikit Saraf <ni...@gmail.com>.
Hi Aayush

So, have you been able to find a solution for the multi-level Map/Reduce? I am also
stuck on this problem and cannot find a way out. Can you help me?

Thanks


Aayush Garg wrote:
> 
> Hi,
> 
> I have not used lucene index ever before. I do not get how we build it
> with
> hadoop Map reduce. Basically what I was looking for like how to implement
> multilevel map/reduce for my mentioned problem.
> 
> 
> On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <ni...@gmail.com> wrote:
> 
>> You can build Lucene indexes using Hadoop Map/Reduce. See the index
>> contrib package in the trunk. Or is it still not something you are
>> looking for?
>>
>> Regards,
>> Ning
>>
>> On 4/4/08, Aayush Garg <aa...@gmail.com> wrote:
>> > No, currently my requirement is to solve this problem by apache hadoop.
>> I am
>> > trying to build up this type of inverted index and then measure
>> performance
>> > criteria with respect to others.
>> >
>> > Thanks,
>> >
>> >
>> > On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <td...@veoh.com> wrote:
>> >
>> > >
>> > > Are you implementing this for instruction or production?
>> > >
>> > > If production, why not use Lucene?
>> > >
>> > >
>> > > On 4/3/08 6:45 PM, "Aayush Garg" <aa...@gmail.com> wrote:
>> > >
>> > > > HI  Amar , Theodore, Arun,
>> > > >
>> > > > Thanks for your reply. Actaully I am new to hadoop so cant figure
>> out
>> > > much.
>> > > > I have written following code for inverted index. This code maps
>> each
>> > > word
>> > > > from the document to its document id.
>> > > > ex: apple file1 file123
>> > > > Main functions of the code are:-
>> > > >
>> > > > public class HadoopProgram extends Configured implements Tool {
>> > > > public static class MapClass extends MapReduceBase
>> > > >     implements Mapper<LongWritable, Text, Text, Text> {
>> > > >
>> > > >     private final static IntWritable one = new IntWritable(1);
>> > > >     private Text word = new Text();
>> > > >     private Text doc = new Text();
>> > > >     private long numRecords=0;
>> > > >     private String inputFile;
>> > > >
>> > > >    public void configure(JobConf job){
>> > > >         System.out.println("Configure function is called");
>> > > >         inputFile = job.get("map.input.file");
>> > > >         System.out.println("In conf the input file is"+inputFile);
>> > > >     }
>> > > >
>> > > >
>> > > >     public void map(LongWritable key, Text value,
>> > > >                     OutputCollector<Text, Text> output,
>> > > >                     Reporter reporter) throws IOException {
>> > > >       String line = value.toString();
>> > > >       StringTokenizer itr = new StringTokenizer(line);
>> > > >       doc.set(inputFile);
>> > > >       while (itr.hasMoreTokens()) {
>> > > >         word.set(itr.nextToken());
>> > > >         output.collect(word,doc);
>> > > >       }
>> > > >       if(++numRecords%4==0){
>> > > >        System.out.println("Finished processing of input
>> > > file"+inputFile);
>> > > >      }
>> > > >     }
>> > > >   }
>> > > >
>> > > >   /**
>> > > >    * A reducer class that just emits the sum of the input values.
>> > > >    */
>> > > >   public static class Reduce extends MapReduceBase
>> > > >     implements Reducer<Text, Text, Text, DocIDs> {
>> > > >
>> > > >   // This works as K2, V2, K3, V3
>> > > >     public void reduce(Text key, Iterator<Text> values,
>> > > >                        OutputCollector<Text, DocIDs> output,
>> > > >                        Reporter reporter) throws IOException {
>> > > >       int sum = 0;
>> > > >       Text dummy = new Text();
>> > > >       ArrayList<String> IDs = new ArrayList<String>();
>> > > >       String str;
>> > > >
>> > > >       while (values.hasNext()) {
>> > > >          dummy = values.next();
>> > > >          str = dummy.toString();
>> > > >          IDs.add(str);
>> > > >        }
>> > > >        DocIDs dc = new DocIDs();
>> > > >        dc.setListdocs(IDs);
>> > > >       output.collect(key,dc);
>> > > >     }
>> > > >   }
>> > > >
>> > > >  public int run(String[] args) throws Exception {
>> > > >   System.out.println("Run function is called");
>> > > >     JobConf conf = new JobConf(getConf(), WordCount.class);
>> > > >     conf.setJobName("wordcount");
>> > > >
>> > > >     // the keys are words (strings)
>> > > >     conf.setOutputKeyClass(Text.class);
>> > > >
>> > > >     conf.setOutputValueClass(Text.class);
>> > > >
>> > > >
>> > > >     conf.setMapperClass(MapClass.class);
>> > > >
>> > > >     conf.setReducerClass(Reduce.class);
>> > > > }
>> > > >
>> > > >
>> > > > Now I am getting output array from the reducer as:-
>> > > > word \root\test\test123, \root\test12
>> > > >
>> > > > In the next stage I want to stop 'stop  words',  scrub words etc.
>> and
>> > > like
>> > > > position of the word in the document. How would I apply multiple
>> maps or
>> > > > multilevel map reduce jobs programmatically? I guess I need to make
>> > > another
>> > > > class or add some functions in it? I am not able to figure it out.
>> > > > Any pointers for these type of problems?
>> > > >
>> > > > Thanks,
>> > > > Aayush
>> > > >
>> > > >
>> > > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <am...@yahoo-inc.com>
>> > > wrote:
>> > > >
>> > > >> On Wed, 26 Mar 2008, Aayush Garg wrote:
>> > > >>
>> > > >>> HI,
>> > > >>> I am developing the simple inverted index program frm the hadoop.
>> My
>> > > map
>> > > >>> function has the output:
>> > > >>> <word, doc>
>> > > >>> and the reducer has:
>> > > >>> <word, list(docs)>
>> > > >>>
>> > > >>> Now I want to use one more mapreduce to remove stop and scrub
>> words
>> > > from
>> > > >> Use distributed cache as Arun mentioned.
>> > > >>> this output. Also in the next stage I would like to have short
>> summay
>> > > >> Whether to use a separate MR job depends on what exactly you mean
>> by
>> > > >> summary. If its like a window around the current word then you can
>> > > >> possibly do it in one go.
>> > > >> Amar
>> > > >>> associated with every word. How should I design my program from
>> this
>> > > >> stage?
>> > > >>> I mean how would I apply multiple mapreduce to this? What would
>> be
>> the
>> > > >>> better way to perform this?
>> > > >>>
>> > > >>> Thanks,
>> > > >>>
>> > > >>> Regards,
>> > > >>> -
>> > > >>>
>> > > >>>
>> > > >>
>> > >
>> > >
>> >
>> >
>> > --
>> > Aayush Garg,
>> > Phone: +41 76 482 240
>> >
>>
> 
> 
> 
> -- 
> Aayush Garg,
> Phone: +41 76 482 240
> 
> 

-- 
View this message in context: http://old.nabble.com/Hadoop%3A-Multiple-map-reduce-or-some-better-way-tp16309172p34009971.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Hadoop: Multiple map reduce or some better way

Posted by Aayush Garg <aa...@gmail.com>.
Hi,

I have not used a Lucene index before. I do not get how we build it with
Hadoop Map/Reduce. Basically, what I was looking for is how to implement
multilevel map/reduce for my mentioned problem.


On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <ni...@gmail.com> wrote:

> You can build Lucene indexes using Hadoop Map/Reduce. See the index
> contrib package in the trunk. Or is it still not something you are
> looking for?
>
> Regards,
> Ning
>
> On 4/4/08, Aayush Garg <aa...@gmail.com> wrote:
> > No, currently my requirement is to solve this problem by apache hadoop.
> I am
> > trying to build up this type of inverted index and then measure
> performance
> > criteria with respect to others.
> >
> > Thanks,
> >
> >
> > On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <td...@veoh.com> wrote:
> >
> > >
> > > Are you implementing this for instruction or production?
> > >
> > > If production, why not use Lucene?
> > >
> > >
> > > On 4/3/08 6:45 PM, "Aayush Garg" <aa...@gmail.com> wrote:
> > >
> > > > HI  Amar , Theodore, Arun,
> > > >
> > > > Thanks for your reply. Actaully I am new to hadoop so cant figure
> out
> > > much.
> > > > I have written following code for inverted index. This code maps
> each
> > > word
> > > > from the document to its document id.
> > > > ex: apple file1 file123
> > > > Main functions of the code are:-
> > > >
> > > > public class HadoopProgram extends Configured implements Tool {
> > > > public static class MapClass extends MapReduceBase
> > > >     implements Mapper<LongWritable, Text, Text, Text> {
> > > >
> > > >     private final static IntWritable one = new IntWritable(1);
> > > >     private Text word = new Text();
> > > >     private Text doc = new Text();
> > > >     private long numRecords=0;
> > > >     private String inputFile;
> > > >
> > > >    public void configure(JobConf job){
> > > >         System.out.println("Configure function is called");
> > > >         inputFile = job.get("map.input.file");
> > > >         System.out.println("In conf the input file is"+inputFile);
> > > >     }
> > > >
> > > >
> > > >     public void map(LongWritable key, Text value,
> > > >                     OutputCollector<Text, Text> output,
> > > >                     Reporter reporter) throws IOException {
> > > >       String line = value.toString();
> > > >       StringTokenizer itr = new StringTokenizer(line);
> > > >       doc.set(inputFile);
> > > >       while (itr.hasMoreTokens()) {
> > > >         word.set(itr.nextToken());
> > > >         output.collect(word,doc);
> > > >       }
> > > >       if(++numRecords%4==0){
> > > >        System.out.println("Finished processing of input
> > > file"+inputFile);
> > > >      }
> > > >     }
> > > >   }
> > > >
> > > >   /**
> > > >    * A reducer class that just emits the sum of the input values.
> > > >    */
> > > >   public static class Reduce extends MapReduceBase
> > > >     implements Reducer<Text, Text, Text, DocIDs> {
> > > >
> > > >   // This works as K2, V2, K3, V3
> > > >     public void reduce(Text key, Iterator<Text> values,
> > > >                        OutputCollector<Text, DocIDs> output,
> > > >                        Reporter reporter) throws IOException {
> > > >       int sum = 0;
> > > >       Text dummy = new Text();
> > > >       ArrayList<String> IDs = new ArrayList<String>();
> > > >       String str;
> > > >
> > > >       while (values.hasNext()) {
> > > >          dummy = values.next();
> > > >          str = dummy.toString();
> > > >          IDs.add(str);
> > > >        }
> > > >        DocIDs dc = new DocIDs();
> > > >        dc.setListdocs(IDs);
> > > >       output.collect(key,dc);
> > > >     }
> > > >   }
> > > >
> > > >  public int run(String[] args) throws Exception {
> > > >   System.out.println("Run function is called");
> > > >     JobConf conf = new JobConf(getConf(), WordCount.class);
> > > >     conf.setJobName("wordcount");
> > > >
> > > >     // the keys are words (strings)
> > > >     conf.setOutputKeyClass(Text.class);
> > > >
> > > >     conf.setOutputValueClass(Text.class);
> > > >
> > > >
> > > >     conf.setMapperClass(MapClass.class);
> > > >
> > > >     conf.setReducerClass(Reduce.class);
> > > > }
> > > >
> > > >
> > > > Now I am getting output array from the reducer as:-
> > > > word \root\test\test123, \root\test12
> > > >
> > > > In the next stage I want to stop 'stop  words',  scrub words etc.
> and
> > > like
> > > > position of the word in the document. How would I apply multiple
> maps or
> > > > multilevel map reduce jobs programmatically? I guess I need to make
> > > another
> > > > class or add some functions in it? I am not able to figure it out.
> > > > Any pointers for these type of problems?
> > > >
> > > > Thanks,
> > > > Aayush
> > > >
> > > >
> > > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <am...@yahoo-inc.com>
> > > wrote:
> > > >
> > > >> On Wed, 26 Mar 2008, Aayush Garg wrote:
> > > >>
> > > >>> HI,
> > > >>> I am developing the simple inverted index program frm the hadoop.
> My
> > > map
> > > >>> function has the output:
> > > >>> <word, doc>
> > > >>> and the reducer has:
> > > >>> <word, list(docs)>
> > > >>>
> > > >>> Now I want to use one more mapreduce to remove stop and scrub
> words
> > > from
> > > >> Use distributed cache as Arun mentioned.
> > > >>> this output. Also in the next stage I would like to have short
> summay
> > > >> Whether to use a separate MR job depends on what exactly you mean
> by
> > > >> summary. If its like a window around the current word then you can
> > > >> possibly do it in one go.
> > > >> Amar
> > > >>> associated with every word. How should I design my program from
> this
> > > >> stage?
> > > >>> I mean how would I apply multiple mapreduce to this? What would be
> the
> > > >>> better way to perform this?
> > > >>>
> > > >>> Thanks,
> > > >>>
> > > >>> Regards,
> > > >>> -
> > > >>>
> > > >>>
> > > >>
> > >
> > >
> >
> >
> > --
> > Aayush Garg,
> > Phone: +41 76 482 240
> >
>



-- 
Aayush Garg,
Phone: +41 76 482 240

Re: Hadoop: Multiple map reduce or some better way

Posted by Ning Li <ni...@gmail.com>.
You can build Lucene indexes using Hadoop Map/Reduce. See the index
contrib package in the trunk. Or is it still not something you are
looking for?

Regards,
Ning

On 4/4/08, Aayush Garg <aa...@gmail.com> wrote:
> No, currently my requirement is to solve this problem by apache hadoop. I am
> trying to build up this type of inverted index and then measure performance
> criteria with respect to others.
>
> Thanks,
>
>
> On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <td...@veoh.com> wrote:
>
> >
> > Are you implementing this for instruction or production?
> >
> > If production, why not use Lucene?
> >
> >
> > On 4/3/08 6:45 PM, "Aayush Garg" <aa...@gmail.com> wrote:
> >
> > > HI  Amar , Theodore, Arun,
> > >
> > > Thanks for your reply. Actaully I am new to hadoop so cant figure out
> > much.
> > > I have written following code for inverted index. This code maps each
> > word
> > > from the document to its document id.
> > > ex: apple file1 file123
> > > Main functions of the code are:-
> > >
> > > public class HadoopProgram extends Configured implements Tool {
> > > public static class MapClass extends MapReduceBase
> > >     implements Mapper<LongWritable, Text, Text, Text> {
> > >
> > >     private final static IntWritable one = new IntWritable(1);
> > >     private Text word = new Text();
> > >     private Text doc = new Text();
> > >     private long numRecords=0;
> > >     private String inputFile;
> > >
> > >    public void configure(JobConf job){
> > >         System.out.println("Configure function is called");
> > >         inputFile = job.get("map.input.file");
> > >         System.out.println("In conf the input file is"+inputFile);
> > >     }
> > >
> > >
> > >     public void map(LongWritable key, Text value,
> > >                     OutputCollector<Text, Text> output,
> > >                     Reporter reporter) throws IOException {
> > >       String line = value.toString();
> > >       StringTokenizer itr = new StringTokenizer(line);
> > >       doc.set(inputFile);
> > >       while (itr.hasMoreTokens()) {
> > >         word.set(itr.nextToken());
> > >         output.collect(word,doc);
> > >       }
> > >       if(++numRecords%4==0){
> > >        System.out.println("Finished processing of input
> > file"+inputFile);
> > >      }
> > >     }
> > >   }
> > >
> > >   /**
> > >    * A reducer class that just emits the sum of the input values.
> > >    */
> > >   public static class Reduce extends MapReduceBase
> > >     implements Reducer<Text, Text, Text, DocIDs> {
> > >
> > >   // This works as K2, V2, K3, V3
> > >     public void reduce(Text key, Iterator<Text> values,
> > >                        OutputCollector<Text, DocIDs> output,
> > >                        Reporter reporter) throws IOException {
> > >       int sum = 0;
> > >       Text dummy = new Text();
> > >       ArrayList<String> IDs = new ArrayList<String>();
> > >       String str;
> > >
> > >       while (values.hasNext()) {
> > >          dummy = values.next();
> > >          str = dummy.toString();
> > >          IDs.add(str);
> > >        }
> > >        DocIDs dc = new DocIDs();
> > >        dc.setListdocs(IDs);
> > >       output.collect(key,dc);
> > >     }
> > >   }
> > >
> > >  public int run(String[] args) throws Exception {
> > >   System.out.println("Run function is called");
> > >     JobConf conf = new JobConf(getConf(), WordCount.class);
> > >     conf.setJobName("wordcount");
> > >
> > >     // the keys are words (strings)
> > >     conf.setOutputKeyClass(Text.class);
> > >
> > >     conf.setOutputValueClass(Text.class);
> > >
> > >
> > >     conf.setMapperClass(MapClass.class);
> > >
> > >     conf.setReducerClass(Reduce.class);
> > > }
> > >
> > >
> > > Now I am getting output array from the reducer as:-
> > > word \root\test\test123, \root\test12
> > >
> > > In the next stage I want to stop 'stop  words',  scrub words etc. and
> > like
> > > position of the word in the document. How would I apply multiple maps or
> > > multilevel map reduce jobs programmatically? I guess I need to make
> > another
> > > class or add some functions in it? I am not able to figure it out.
> > > Any pointers for these type of problems?
> > >
> > > Thanks,
> > > Aayush
> > >
> > >
> > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <am...@yahoo-inc.com>
> > wrote:
> > >
> > >> On Wed, 26 Mar 2008, Aayush Garg wrote:
> > >>
> > >>> HI,
> > >>> I am developing the simple inverted index program frm the hadoop. My
> > map
> > >>> function has the output:
> > >>> <word, doc>
> > >>> and the reducer has:
> > >>> <word, list(docs)>
> > >>>
> > >>> Now I want to use one more mapreduce to remove stop and scrub words
> > from
> > >> Use distributed cache as Arun mentioned.
> > >>> this output. Also in the next stage I would like to have short summay
> > >> Whether to use a separate MR job depends on what exactly you mean by
> > >> summary. If its like a window around the current word then you can
> > >> possibly do it in one go.
> > >> Amar
> > >>> associated with every word. How should I design my program from this
> > >> stage?
> > >>> I mean how would I apply multiple mapreduce to this? What would be the
> > >>> better way to perform this?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Regards,
> > >>> -
> > >>>
> > >>>
> > >>
> >
> >
>
>
> --
> Aayush Garg,
> Phone: +41 76 482 240
>

Re: Hadoop: Multiple map reduce or some better way

Posted by Aayush Garg <aa...@gmail.com>.
No, currently my requirement is to solve this problem with Apache Hadoop. I am
trying to build up this type of inverted index and then measure performance
criteria with respect to others.

Thanks,


On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <td...@veoh.com> wrote:

>
> Are you implementing this for instruction or production?
>
> If production, why not use Lucene?
>
>
> On 4/3/08 6:45 PM, "Aayush Garg" <aa...@gmail.com> wrote:
>
> > HI  Amar , Theodore, Arun,
> >
> > Thanks for your reply. Actaully I am new to hadoop so cant figure out
> much.
> > I have written following code for inverted index. This code maps each
> word
> > from the document to its document id.
> > ex: apple file1 file123
> > Main functions of the code are:-
> >
> > public class HadoopProgram extends Configured implements Tool {
> > public static class MapClass extends MapReduceBase
> >     implements Mapper<LongWritable, Text, Text, Text> {
> >
> >     private final static IntWritable one = new IntWritable(1);
> >     private Text word = new Text();
> >     private Text doc = new Text();
> >     private long numRecords=0;
> >     private String inputFile;
> >
> >    public void configure(JobConf job){
> >         System.out.println("Configure function is called");
> >         inputFile = job.get("map.input.file");
> >         System.out.println("In conf the input file is"+inputFile);
> >     }
> >
> >
> >     public void map(LongWritable key, Text value,
> >                     OutputCollector<Text, Text> output,
> >                     Reporter reporter) throws IOException {
> >       String line = value.toString();
> >       StringTokenizer itr = new StringTokenizer(line);
> >       doc.set(inputFile);
> >       while (itr.hasMoreTokens()) {
> >         word.set(itr.nextToken());
> >         output.collect(word,doc);
> >       }
> >       if(++numRecords%4==0){
> >        System.out.println("Finished processing of input
> file"+inputFile);
> >      }
> >     }
> >   }
> >
> >   /**
> >    * A reducer class that just emits the sum of the input values.
> >    */
> >   public static class Reduce extends MapReduceBase
> >     implements Reducer<Text, Text, Text, DocIDs> {
> >
> >   // This works as K2, V2, K3, V3
> >     public void reduce(Text key, Iterator<Text> values,
> >                        OutputCollector<Text, DocIDs> output,
> >                        Reporter reporter) throws IOException {
> >       int sum = 0;
> >       Text dummy = new Text();
> >       ArrayList<String> IDs = new ArrayList<String>();
> >       String str;
> >
> >       while (values.hasNext()) {
> >          dummy = values.next();
> >          str = dummy.toString();
> >          IDs.add(str);
> >        }
> >        DocIDs dc = new DocIDs();
> >        dc.setListdocs(IDs);
> >       output.collect(key,dc);
> >     }
> >   }
> >
> >  public int run(String[] args) throws Exception {
> >   System.out.println("Run function is called");
> >     JobConf conf = new JobConf(getConf(), WordCount.class);
> >     conf.setJobName("wordcount");
> >
> >     // the keys are words (strings)
> >     conf.setOutputKeyClass(Text.class);
> >
> >     conf.setOutputValueClass(Text.class);
> >
> >
> >     conf.setMapperClass(MapClass.class);
> >
> >     conf.setReducerClass(Reduce.class);
> > }
> >
> >
> > Now I am getting output array from the reducer as:-
> > word \root\test\test123, \root\test12
> >
> > In the next stage I want to stop 'stop  words',  scrub words etc. and
> like
> > position of the word in the document. How would I apply multiple maps or
> > multilevel map reduce jobs programmatically? I guess I need to make
> another
> > class or add some functions in it? I am not able to figure it out.
> > Any pointers for these type of problems?
> >
> > Thanks,
> > Aayush
> >
> >
> > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <am...@yahoo-inc.com>
> wrote:
> >
> >> On Wed, 26 Mar 2008, Aayush Garg wrote:
> >>
> >>> HI,
> >>> I am developing the simple inverted index program frm the hadoop. My
> map
> >>> function has the output:
> >>> <word, doc>
> >>> and the reducer has:
> >>> <word, list(docs)>
> >>>
> >>> Now I want to use one more mapreduce to remove stop and scrub words
> from
> >> Use distributed cache as Arun mentioned.
> >>> this output. Also in the next stage I would like to have short summay
> >> Whether to use a separate MR job depends on what exactly you mean by
> >> summary. If its like a window around the current word then you can
> >> possibly do it in one go.
> >> Amar
> >>> associated with every word. How should I design my program from this
> >> stage?
> >>> I mean how would I apply multiple mapreduce to this? What would be the
> >>> better way to perform this?
> >>>
> >>> Thanks,
> >>>
> >>> Regards,
> >>> -
> >>>
> >>>
> >>
>
>


-- 
Aayush Garg,
Phone: +41 76 482 240

Re: Hadoop: Multiple map reduce or some better way

Posted by Ted Dunning <td...@veoh.com>.
Are you implementing this for instruction or production?

If production, why not use Lucene?


On 4/3/08 6:45 PM, "Aayush Garg" <aa...@gmail.com> wrote:

> HI  Amar , Theodore, Arun,
> 
> Thanks for your reply. Actaully I am new to hadoop so cant figure out much.
> I have written following code for inverted index. This code maps each word
> from the document to its document id.
> ex: apple file1 file123
> Main functions of the code are:-
> 
> public class HadoopProgram extends Configured implements Tool {
> public static class MapClass extends MapReduceBase
>     implements Mapper<LongWritable, Text, Text, Text> {
> 
>     private final static IntWritable one = new IntWritable(1);
>     private Text word = new Text();
>     private Text doc = new Text();
>     private long numRecords=0;
>     private String inputFile;
> 
>    public void configure(JobConf job){
>         System.out.println("Configure function is called");
>         inputFile = job.get("map.input.file");
>         System.out.println("In conf the input file is"+inputFile);
>     }
> 
> 
>     public void map(LongWritable key, Text value,
>                     OutputCollector<Text, Text> output,
>                     Reporter reporter) throws IOException {
>       String line = value.toString();
>       StringTokenizer itr = new StringTokenizer(line);
>       doc.set(inputFile);
>       while (itr.hasMoreTokens()) {
>         word.set(itr.nextToken());
>         output.collect(word,doc);
>       }
>       if(++numRecords%4==0){
>        System.out.println("Finished processing of input file"+inputFile);
>      }
>     }
>   }
> 
>   /**
>    * A reducer class that just emits the sum of the input values.
>    */
>   public static class Reduce extends MapReduceBase
>     implements Reducer<Text, Text, Text, DocIDs> {
> 
>   // This works as K2, V2, K3, V3
>     public void reduce(Text key, Iterator<Text> values,
>                        OutputCollector<Text, DocIDs> output,
>                        Reporter reporter) throws IOException {
>       int sum = 0;
>       Text dummy = new Text();
>       ArrayList<String> IDs = new ArrayList<String>();
>       String str;
> 
>       while (values.hasNext()) {
>          dummy = values.next();
>          str = dummy.toString();
>          IDs.add(str);
>        }
>        DocIDs dc = new DocIDs();
>        dc.setListdocs(IDs);
>       output.collect(key,dc);
>     }
>   }
> 
>  public int run(String[] args) throws Exception {
>   System.out.println("Run function is called");
>     JobConf conf = new JobConf(getConf(), WordCount.class);
>     conf.setJobName("wordcount");
> 
>     // the keys are words (strings)
>     conf.setOutputKeyClass(Text.class);
> 
>     conf.setOutputValueClass(Text.class);
> 
> 
>     conf.setMapperClass(MapClass.class);
> 
>     conf.setReducerClass(Reduce.class);
> }
> 
> 
> Now I am getting output array from the reducer as:-
> word \root\test\test123, \root\test12
> 
> In the next stage I want to stop 'stop  words',  scrub words etc. and like
> position of the word in the document. How would I apply multiple maps or
> multilevel map reduce jobs programmatically? I guess I need to make another
> class or add some functions in it? I am not able to figure it out.
> Any pointers for these type of problems?
> 
> Thanks,
> Aayush
> 
> 
> On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <am...@yahoo-inc.com> wrote:
> 
>> On Wed, 26 Mar 2008, Aayush Garg wrote:
>> 
>>> HI,
>>> I am developing the simple inverted index program frm the hadoop. My map
>>> function has the output:
>>> <word, doc>
>>> and the reducer has:
>>> <word, list(docs)>
>>> 
>>> Now I want to use one more mapreduce to remove stop and scrub words from
>> Use distributed cache as Arun mentioned.
>>> this output. Also in the next stage I would like to have short summay
>> Whether to use a separate MR job depends on what exactly you mean by
>> summary. If its like a window around the current word then you can
>> possibly do it in one go.
>> Amar
>>> associated with every word. How should I design my program from this
>> stage?
>>> I mean how would I apply multiple mapreduce to this? What would be the
>>> better way to perform this?
>>> 
>>> Thanks,
>>> 
>>> Regards,
>>> -
>>> 
>>> 
>> 


Re: Hadoop: Multiple map reduce or some better way

Posted by Aayush Garg <aa...@gmail.com>.
Hi Amar, Theodore, Arun,

Thanks for your reply. Actually I am new to Hadoop, so I can't figure out much.
I have written the following code for the inverted index. This code maps each word
from the document to its document id,
e.g.: apple file1 file123
The main functions of the code are:

public class HadoopProgram extends Configured implements Tool {
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private Text doc = new Text();
    private long numRecords=0;
    private String inputFile;

   public void configure(JobConf job){
        System.out.println("Configure function is called");
        inputFile = job.get("map.input.file");
        System.out.println("In conf the input file is"+inputFile);
    }


    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      doc.set(inputFile);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word,doc);
      }
      if(++numRecords%4==0){
       System.out.println("Finished processing of input file"+inputFile);
     }
    }
  }

  /**
   * A reducer class that just emits the sum of the input values.
   */
  public static class Reduce extends MapReduceBase
    implements Reducer<Text, Text, Text, DocIDs> {

  // This works as K2, V2, K3, V3
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, DocIDs> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      Text dummy = new Text();
      ArrayList<String> IDs = new ArrayList<String>();
      String str;

      while (values.hasNext()) {
         dummy = values.next();
         str = dummy.toString();
         IDs.add(str);
       }
       DocIDs dc = new DocIDs();
       dc.setListdocs(IDs);
      output.collect(key,dc);
    }
  }

 public int run(String[] args) throws Exception {
  System.out.println("Run function is called");
    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName("wordcount");

    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);

    conf.setOutputValueClass(Text.class);


    conf.setMapperClass(MapClass.class);

    conf.setReducerClass(Reduce.class);
}


Now I am getting output from the reducer as:
word \root\test\test123, \root\test12

In the next stage I want to remove stop words, scrub words, etc., and also keep the
position of the word in the document. How would I apply multiple maps or
multilevel map reduce jobs programmatically? I guess I need to make another
class or add some functions in it? I am not able to figure it out.
Any pointers for this type of problem?

Thanks,
Aayush


On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <am...@yahoo-inc.com> wrote:

> On Wed, 26 Mar 2008, Aayush Garg wrote:
>
> > HI,
> > I am developing the simple inverted index program frm the hadoop. My map
> > function has the output:
> > <word, doc>
> > and the reducer has:
> > <word, list(docs)>
> >
> > Now I want to use one more mapreduce to remove stop and scrub words from
> Use distributed cache as Arun mentioned.
> > this output. Also in the next stage I would like to have short summay
> Whether to use a separate MR job depends on what exactly you mean by
> summary. If its like a window around the current word then you can
> possibly do it in one go.
> Amar
> > associated with every word. How should I design my program from this
> stage?
> > I mean how would I apply multiple mapreduce to this? What would be the
> > better way to perform this?
> >
> > Thanks,
> >
> > Regards,
> > -
> >
> >
>

Re: Hadoop: Multiple map reduce or some better way

Posted by Amar Kamat <am...@yahoo-inc.com>.
On Wed, 26 Mar 2008, Aayush Garg wrote:

> HI,
> I am developing the simple inverted index program frm the hadoop. My map
> function has the output:
> <word, doc>
> and the reducer has:
> <word, list(docs)>
>
> Now I want to use one more mapreduce to remove stop and scrub words from
Use distributed cache as Arun mentioned.
> this output. Also in the next stage I would like to have short summay
Whether to use a separate MR job depends on what exactly you mean by
summary. If it's like a window around the current word, then you can
possibly do it in one go.
Amar
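
Purely as an illustration of that single-pass idea (the window size, the value layout
and the class name are assumptions, not anything specified in this thread), the map
could emit the doc id together with a few words of context around each word:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WindowSummaryMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private static final int WINDOW = 2;   // words of context on each side
  private String inputFile;
  private final Text word = new Text();
  private final Text summary = new Text();

  public void configure(JobConf job) {
    inputFile = job.get("map.input.file");
  }

  public void map(LongWritable key, Text line,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    String[] tokens = line.toString().split("\\s+");
    for (int i = 0; i < tokens.length; i++) {
      if (tokens[i].length() == 0) {
        continue;
      }
      // Build a small window around the current word to serve as its summary.
      StringBuilder window = new StringBuilder();
      for (int j = Math.max(0, i - WINDOW); j <= Math.min(tokens.length - 1, i + WINDOW); j++) {
        if (window.length() > 0) {
          window.append(' ');
        }
        window.append(tokens[j]);
      }
      word.set(tokens[i]);
      summary.set(inputFile + "\t" + window);   // doc id plus context for this occurrence
      output.collect(word, summary);
    }
  }
}

The reduce side would then group these <word, doc + window> pairs per word, just as
the existing job groups <word, doc>.
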
> associated with every word. How should I design my program from this stage?
> I mean how would I apply multiple mapreduce to this? What would be the
> better way to perform this?
>
> Thanks,
>
> Regards,
> -
> Aayush Garg,
> Phone: +41 76 482 240
>