Posted to common-user@hadoop.apache.org by Aayush Garg <aa...@gmail.com> on 2008/04/04 03:45:55 UTC

Re: Hadoop: Multiple map reduce or some better way

Hi Amar, Theodore, Arun,

Thanks for your reply. Actually I am new to Hadoop, so I can't figure out much.
I have written the following code for an inverted index. This code maps each word
in a document to its document ID.
ex: apple file1 file123
The main functions of the code are:

public class HadoopProgram extends Configured implements Tool {
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private Text doc = new Text();
    private long numRecords=0;
    private String inputFile;

   public void configure(JobConf job){
        System.out.println("Configure function is called");
        inputFile = job.get("map.input.file");
        System.out.println("In conf the input file is"+inputFile);
    }


    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      doc.set(inputFile);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word,doc);
      }
      if(++numRecords%4==0){
       System.out.println("Finished processing of input file"+inputFile);
     }
    }
  }

  /**
   * A reducer class that emits, for each word, the list of document IDs it appears in.
   */
  public static class Reduce extends MapReduceBase
    implements Reducer<Text, Text, Text, DocIDs> {

  // This works as K2, V2, K3, V3
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, DocIDs> output,
                       Reporter reporter) throws IOException {
      // collect all document IDs seen for this word
      Text dummy = new Text();
      ArrayList<String> IDs = new ArrayList<String>();
      String str;

      while (values.hasNext()) {
         dummy = values.next();
         str = dummy.toString();
         IDs.add(str);
       }
       DocIDs dc = new DocIDs();
       dc.setListdocs(IDs);
      output.collect(key,dc);
    }
  }

 public int run(String[] args) throws Exception {
  System.out.println("Run function is called");
    JobConf conf = new JobConf(getConf(), HadoopProgram.class);
    conf.setJobName("wordcount");

    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);

    // the map emits Text values, the reduce emits DocIDs values
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputValueClass(DocIDs.class);


    conf.setMapperClass(MapClass.class);

    conf.setReducerClass(Reduce.class);

    // read input/output directories from the command line and submit the job
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    return 0;
  }


Now I am getting output from the reducer like this:
word \root\test\test123, \root\test12
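
The custom DocIDs value class is not shown in the message; to be used as the reduce
output value it has to implement Writable, and TextOutputFormat writes its toString()
into the part files, which is where the comma-separated list above comes from. A
minimal sketch of what such a class might look like (purely illustrative, not the
poster's actual class):

public static class DocIDs implements Writable {
  private ArrayList<String> docs = new ArrayList<String>();

  public void setListdocs(ArrayList<String> docs) { this.docs = docs; }

  public void write(DataOutput out) throws IOException {
    out.writeInt(docs.size());
    for (String d : docs) {
      Text.writeString(out, d);
    }
  }

  public void readFields(DataInput in) throws IOException {
    int n = in.readInt();
    docs = new ArrayList<String>(n);
    for (int i = 0; i < n; i++) {
      docs.add(Text.readString(in));
    }
  }

  public String toString() {             // used by TextOutputFormat when writing values
    StringBuilder sb = new StringBuilder();
    for (String d : docs) {
      if (sb.length() > 0) sb.append(", ");
      sb.append(d);
    }
    return sb.toString();
  }
}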

In the next stage I want to remove stop words, scrub words, etc., and also keep
the position of each word in the document. How would I apply multiple maps or
multilevel map/reduce jobs programmatically (a job-chaining sketch follows below)?
I guess I need to make another class or add some functions to it, but I am not
able to figure it out. Any pointers for this type of problem?

Thanks,
Aayush
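
One straightforward way to run multilevel map/reduce with the JobConf API used above
is to configure and submit one JobConf per stage, pointing each stage's input at the
previous stage's output directory; JobClient.runJob() blocks, so the stages run in
order. Below is a rough sketch only, assuming the usual org.apache.hadoop.mapred,
org.apache.hadoop.fs and org.apache.hadoop.io imports; the intermediate path and the
StopWordMapper class (sketched further down, after the quoted reply) are placeholder
names, not code from this thread.

 public int run(String[] args) throws Exception {
   Path indexDir = new Path("inverted-index-tmp");    // stage 1 output, stage 2 input

   // Stage 1: build the word -> list(docs) index, configured as in the code above
   JobConf indexJob = new JobConf(getConf(), HadoopProgram.class);
   indexJob.setJobName("build-index");
   indexJob.setOutputKeyClass(Text.class);
   indexJob.setMapOutputValueClass(Text.class);
   indexJob.setOutputValueClass(DocIDs.class);
   indexJob.setMapperClass(MapClass.class);
   indexJob.setReducerClass(Reduce.class);
   FileInputFormat.setInputPaths(indexJob, new Path(args[0]));
   FileOutputFormat.setOutputPath(indexJob, indexDir);
   JobClient.runJob(indexJob);                        // waits until stage 1 finishes

   // Stage 2: scrub stop words from stage 1's text output
   // (TextOutputFormat wrote "word <TAB> DocIDs.toString()" lines)
   JobConf scrubJob = new JobConf(getConf(), HadoopProgram.class);
   scrubJob.setJobName("scrub-stop-words");
   scrubJob.setInputFormat(KeyValueTextInputFormat.class);  // key = word, value = doc list
   scrubJob.setOutputKeyClass(Text.class);
   scrubJob.setOutputValueClass(Text.class);
   scrubJob.setMapperClass(StopWordMapper.class);           // placeholder, sketched below
   scrubJob.setReducerClass(IdentityReducer.class);         // org.apache.hadoop.mapred.lib
   FileInputFormat.setInputPaths(scrubJob, indexDir);
   FileOutputFormat.setOutputPath(scrubJob, new Path(args[1]));
   JobClient.runJob(scrubJob);
   return 0;
 }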


On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <am...@yahoo-inc.com> wrote:

> On Wed, 26 Mar 2008, Aayush Garg wrote:
>
> > HI,
> > I am developing a simple inverted index program with Hadoop. My map
> > function has the output:
> > <word, doc>
> > and the reducer has:
> > <word, list(docs)>
> >
> > Now I want to use one more mapreduce to remove stop and scrub words from
> Use distributed cache as Arun mentioned.
> > this output. Also in the next stage I would like to have a short summary
> Whether to use a separate MR job depends on what exactly you mean by
> summary. If it's like a window around the current word then you can
> possibly do it in one go.
> Amar
> > associated with every word. How should I design my program from this
> stage?
> > I mean how would I apply multiple mapreduce to this? What would be the
> > better way to perform this?
> >
> > Thanks,
> >
> > Regards,
> > -
> >
> >
>
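
The distributed cache mentioned in the quoted reply above is
org.apache.hadoop.filecache.DistributedCache: the driver registers a stop-word file
with DistributedCache.addCacheFile(), every task gets a local copy, and the mapper
loads it in configure(). A rough sketch of such a second-stage mapper follows,
assuming KeyValueTextInputFormat (so the map key is the word and the value is the
document list) and the usual java.io, java.util and Hadoop imports; the class name
and file handling are illustrative only.

public static class StopWordMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private final Set<String> stopWords = new HashSet<String>();

  public void configure(JobConf job) {
    try {
      // local copies of the files registered with DistributedCache.addCacheFile(...)
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String w;
      while ((w = in.readLine()) != null) {
        stopWords.add(w.trim());
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load stop-word list", e);
    }
  }

  public void map(Text word, Text docList,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    if (!stopWords.contains(word.toString())) {
      output.collect(word, docList);   // drop entries whose key is a stop word
    }
  }
}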

Re: Hadoop: Multiple map reduce or some better way

Posted by Aayush Garg <aa...@gmail.com>.
Please give me your inputs on my problem.

Thanks,


On Sat, Apr 5, 2008 at 1:10 AM, Robert Dempsey <rd...@techcfl.com> wrote:

> Ted,
>
> It appears that Nutch hasn't been updated in a while (in Internet time at
> least). Do you know if it works with the latest versions of Hadoop? Thanks.
>
> - Robert Dempsey (new to the list)
>
>
> On Apr 4, 2008, at 5:36 PM, Ted Dunning wrote:
>
> >
> >
> > See Nutch.  See Nutch run.
> >
> > http://en.wikipedia.org/wiki/Nutch
> > http://lucene.apache.org/nutch/
> >
>


-- 
Aayush Garg,
Phone: +41 76 482 240

Re: Hadoop: Multiple map reduce or some better way

Posted by Robert Dempsey <rd...@techcfl.com>.
Ted,

It appears that Nutch hasn't been updated in a while (in Internet time  
at least). Do you know if it works with the latest versions of Hadoop?  
Thanks.

- Robert Dempsey (new to the list)

On Apr 4, 2008, at 5:36 PM, Ted Dunning wrote:
>
>
> See Nutch.  See Nutch run.
>
> http://en.wikipedia.org/wiki/Nutch
> http://lucene.apache.org/nutch/

Re: Hadoop: Multiple map reduce or some better way

Posted by Ted Dunning <td...@veoh.com>.

See Nutch.  See Nutch run.

http://en.wikipedia.org/wiki/Nutch
http://lucene.apache.org/nutch/



On 4/4/08 1:22 PM, "Aayush Garg" <aa...@gmail.com> wrote:

> Hi,
> 
> I have not used a Lucene index before. I do not get how we build it with
> Hadoop Map/Reduce. Basically, what I was looking for is how to implement
> multilevel map/reduce for the problem I mentioned.
> 
> 


Re: Hadoop: Multiple map reduce or some better way

Posted by Nikit Saraf <ni...@gmail.com>.
Hi Aayush

So, have you been able to find a solution for the multi-level Map/Reduce? I am
also stuck on this problem and cannot find a way out. Can you help me?

Thanks


Aayush Garg wrote:
> 
> Hi,
> 
> I have not used a Lucene index before. I do not get how we build it with
> Hadoop Map/Reduce. Basically, what I was looking for is how to implement
> multilevel map/reduce for the problem I mentioned.
> 
> 
> -- 
> Aayush Garg,
> Phone: +41 76 482 240
> 
> 

-- 
View this message in context: http://old.nabble.com/Hadoop%3A-Multiple-map-reduce-or-some-better-way-tp16309172p34009971.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Hadoop: Multiple map reduce or some better way

Posted by Aayush Garg <aa...@gmail.com>.
Hi,

I have not used a Lucene index before. I do not get how we build it with
Hadoop Map/Reduce. Basically, what I was looking for is how to implement
multilevel map/reduce for the problem I mentioned.


On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <ni...@gmail.com> wrote:

> You can build Lucene indexes using Hadoop Map/Reduce. See the index
> contrib package in the trunk. Or is it still not something you are
> looking for?
>
> Regards,
> Ning
>



-- 
Aayush Garg,
Phone: +41 76 482 240

Re: Hadoop: Multiple map reduce or some better way

Posted by Ning Li <ni...@gmail.com>.
You can build Lucene indexes using Hadoop Map/Reduce. See the index
contrib package in the trunk. Or is it still not something you are
looking for?

Regards,
Ning
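
For anyone who does not want to depend on the index contrib package, the same idea can
be written by hand: a reducer adds one Lucene Document per word to a task-local index
and copies the finished index into HDFS in close(). The sketch below is not the contrib
API; it assumes a Lucene 2.x-style IndexWriter, the usual Hadoop and Lucene imports, and
made-up paths and field names.

public static class LuceneIndexReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private JobConf job;
  private IndexWriter writer;
  private String localIndexDir;

  public void configure(JobConf job) {
    this.job = job;
    try {
      // one local index directory per reduce task ("mapred.task.id" is the attempt id)
      localIndexDir = "lucene-" + job.get("mapred.task.id");
      writer = new IndexWriter(localIndexDir, new StandardAnalyzer(), true);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output,
                     Reporter reporter) throws IOException {
    StringBuilder docs = new StringBuilder();
    while (values.hasNext()) {
      docs.append(values.next().toString()).append(' ');
    }
    Document d = new Document();
    d.add(new Field("word", key.toString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("docs", docs.toString(), Field.Store.YES, Field.Index.NO));
    writer.addDocument(d);               // nothing is emitted through the collector
  }

  public void close() throws IOException {
    writer.optimize();
    writer.close();
    // copy the finished local index into HDFS (destination path is made up)
    FileSystem fs = FileSystem.get(job);
    fs.copyFromLocalFile(new Path(localIndexDir), new Path("lucene-index/" + localIndexDir));
  }
}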

On 4/4/08, Aayush Garg <aa...@gmail.com> wrote:
> No, currently my requirement is to solve this problem with Apache Hadoop. I am
> trying to build up this type of inverted index and then measure its performance
> against other approaches.
>
> Thanks,
>
>

Re: Hadoop: Multiple map reduce or some better way

Posted by Aayush Garg <aa...@gmail.com>.
No, currently my requirement is to solve this problem with Apache Hadoop. I am
trying to build up this type of inverted index and then measure its performance
against other approaches.

Thanks,


On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <td...@veoh.com> wrote:

>
> Are you implementing this for instruction or production?
>
> If production, why not use Lucene?
>
>


-- 
Aayush Garg,
Phone: +41 76 482 240

Re: Hadoop: Multiple map reduce or some better way

Posted by Ted Dunning <td...@veoh.com>.
Are you implementing this for instruction or production?

If production, why not use Lucene?

