Posted to common-user@hadoop.apache.org by Aayush Garg <aa...@gmail.com> on 2008/04/04 03:45:55 UTC
Re: Hadoop: Multiple map reduce or some better way
Hi Amar, Theodore, Arun,
Thanks for your replies. Actually, I am new to Hadoop, so I can't figure out much on my own.
I have written the following code for an inverted index. It maps each word
in a document to the IDs of the documents that contain it, e.g.:
apple -> file1, file123
The main functions of the code are:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;

public class HadoopProgram extends Configured implements Tool {

  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private Text word = new Text();
    private Text doc = new Text();
    private long numRecords = 0;
    private String inputFile;

    public void configure(JobConf job) {
      // Remember which input file this map task is reading;
      // it becomes the document ID emitted with every word.
      inputFile = job.get("map.input.file");
      System.out.println("In configure, the input file is " + inputFile);
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      doc.set(inputFile);
      // Emit <word, filename> for every token on the line.
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, doc);
      }
      if (++numRecords % 4 == 0) {
        System.out.println("Processed " + numRecords + " records of " + inputFile);
      }
    }
  }
  /**
   * A reducer that collects, for each word, the list of document IDs
   * in which the word appears.
   */
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, DocIDs> {

    // Type parameters correspond to K2, V2, K3, V3.
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, DocIDs> output,
                       Reporter reporter) throws IOException {
      ArrayList<String> ids = new ArrayList<String>();
      // Gather every document ID emitted for this word.
      while (values.hasNext()) {
        ids.add(values.next().toString());
      }
      DocIDs dc = new DocIDs(); // DocIDs is a custom Writable, not shown in the post
      dc.setListdocs(ids);
      output.collect(key, dc);
    }
  }
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), HadoopProgram.class); // was WordCount.class, a copy-paste slip
    conf.setJobName("invertedindex");

    // The map emits <Text, Text> but the reduce emits <Text, DocIDs>,
    // so the two output value classes must be declared separately.
    conf.setOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputValueClass(DocIDs.class);
    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputPath(new Path(args[0]));
    conf.setOutputPath(new Path(args[1]));

    // The original snippet configured the job but never submitted it.
    JobClient.runJob(conf);
    return 0;
  }
}
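The DocIDs class used above is not shown anywhere in the thread. A minimal custom Writable along the following lines would satisfy the reducer's usage (setListdocs plus Hadoop serialization); this is an illustrative sketch, not the poster's actual class:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.io.Writable;

public class DocIDs implements Writable {
  private ArrayList<String> listdocs = new ArrayList<String>();

  public void setListdocs(ArrayList<String> docs) {
    this.listdocs = docs;
  }

  public void write(DataOutput out) throws IOException {
    // Serialize as a count followed by UTF strings.
    out.writeInt(listdocs.size());
    for (String id : listdocs) {
      out.writeUTF(id);
    }
  }

  public void readFields(DataInput in) throws IOException {
    int n = in.readInt();
    listdocs = new ArrayList<String>(n);
    for (int i = 0; i < n; i++) {
      listdocs.add(in.readUTF());
    }
  }

  public String toString() {
    // Controls how the ID list appears in text output files.
    return listdocs.toString();
  }
}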
Now I am getting output from the reducer like:
word \root\test\test123, \root\test12
In the next stage I want to remove stop words, scrub words, etc., and also
record the position of each word in its document. How would I apply multiple
maps, or chain map/reduce jobs programmatically? I guess I need to make
another class or add some functions to this one, but I am not able to figure
it out. Any pointers for this type of problem?
Thanks,
Aayush
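For the chaining question above, a minimal sketch: drive the jobs sequentially from run(), feeding the first job's output directory to the second job as input. FilterMapClass, FilterReduce, and the intermediate path are hypothetical names, not code from the thread:

  public int run(String[] args) throws Exception {
    // Job 1: build the raw inverted index (classes from the listing above).
    JobConf indexJob = new JobConf(getConf(), HadoopProgram.class);
    indexJob.setJobName("build-index");
    indexJob.setOutputKeyClass(Text.class);
    indexJob.setMapOutputValueClass(Text.class);
    indexJob.setOutputValueClass(DocIDs.class);
    indexJob.setMapperClass(MapClass.class);
    indexJob.setReducerClass(Reduce.class);
    indexJob.setInputPath(new Path(args[0]));
    indexJob.setOutputPath(new Path("/tmp/index-stage1")); // intermediate directory (illustrative)
    JobClient.runJob(indexJob); // blocks until job 1 completes

    // Job 2: consume job 1's output; remove stop words, add positions, etc.
    // FilterMapClass and FilterReduce are hypothetical second-stage classes.
    JobConf filterJob = new JobConf(getConf(), HadoopProgram.class);
    filterJob.setJobName("filter-index");
    filterJob.setMapperClass(FilterMapClass.class);
    filterJob.setReducerClass(FilterReduce.class);
    filterJob.setInputPath(new Path("/tmp/index-stage1"));
    filterJob.setOutputPath(new Path(args[1]));
    JobClient.runJob(filterJob); // runs only after job 1 has finished

    return 0;
  }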
On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <am...@yahoo-inc.com> wrote:
> On Wed, 26 Mar 2008, Aayush Garg wrote:
>
> > HI,
> > I am developing a simple inverted index program with Hadoop. My map
> > function has the output:
> > <word, doc>
> > and the reducer has:
> > <word, list(docs)>
> >
> > Now I want to use one more mapreduce to remove stop and scrub words from
> Use distributed cache as Arun mentioned.
> > this output. Also in the next stage I would like to have a short summary
> Whether to use a separate MR job depends on what exactly you mean by
> summary. If it's like a window around the current word then you can
> possibly do it in one go.
> Amar
> > associated with every word. How should I design my program from this
> stage?
> > I mean how would I apply multiple mapreduce to this? What would be the
> > better way to perform this?
> >
> > Thanks,
> >
> > Regards,
> > -
> >
> >
>
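Amar's DistributedCache suggestion, sketched under assumptions (the stop-word file path is hypothetical, and this uses the old org.apache.hadoop.filecache API): publish the file when configuring the job, read it once per task in configure(), and filter tokens in map().

  // Additional imports needed: java.net.URI, java.io.BufferedReader,
  // java.io.FileReader, java.util.HashSet, java.util.Set,
  // org.apache.hadoop.filecache.DistributedCache

  // When setting up the job (path is an assumption; the file must exist on HDFS):
  DistributedCache.addCacheFile(new URI("/user/aayush/stopwords.txt"), conf);

  // In MapClass:
  private Set<String> stopWords = new HashSet<String>();

  public void configure(JobConf job) {
    inputFile = job.get("map.input.file");
    try {
      // Load the cached stop-word file, one word per line.
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String w;
      while ((w = in.readLine()) != null) {
        stopWords.add(w.trim());
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Could not read stop-word file", e);
    }
  }

  // In map(), before output.collect(word, doc):
  //   if (stopWords.contains(token)) continue; // skip stop words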
Re: Hadoop: Multiple map reduce or some better way
Posted by Aayush Garg <aa...@gmail.com>.
Please share your inputs on my problem.
Thanks,
On Sat, Apr 5, 2008 at 1:10 AM, Robert Dempsey <rd...@techcfl.com> wrote:
> Ted,
>
> It appears that Nutch hasn't been updated in a while (in Internet time at
> least). Do you know if it works with the latest versions of Hadoop? Thanks.
>
> - Robert Dempsey (new to the list)
>
>
> On Apr 4, 2008, at 5:36 PM, Ted Dunning wrote:
>
> >
> >
> > See Nutch. See Nutch run.
> >
> > http://en.wikipedia.org/wiki/Nutch
> > http://lucene.apache.org/nutch/
> >
>
--
Aayush Garg,
Phone: +41 76 482 240
Re: Hadoop: Multiple map reduce or some better way
Posted by Robert Dempsey <rd...@techcfl.com>.
Ted,
It appears that Nutch hasn't been updated in a while (in Internet time
at least). Do you know if it works with the latest versions of Hadoop?
Thanks.
- Robert Dempsey (new to the list)
On Apr 4, 2008, at 5:36 PM, Ted Dunning wrote:
>
>
> See Nutch. See Nutch run.
>
> http://en.wikipedia.org/wiki/Nutch
> http://lucene.apache.org/nutch/
Re: Hadoop: Multiple map reduce or some better way
Posted by Ted Dunning <td...@veoh.com>.
See Nutch. See Nutch run.
http://en.wikipedia.org/wiki/Nutch
http://lucene.apache.org/nutch/
On 4/4/08 1:22 PM, "Aayush Garg" <aa...@gmail.com> wrote:
> Hi,
>
> I have never used a Lucene index before. I do not get how we build it with
> Hadoop MapReduce. Basically, what I was looking for is how to implement
> multilevel map/reduce for my problem.
> [earlier quoted messages snipped; they appear in full elsewhere in this thread]
Re: Hadoop: Multiple map reduce or some better way
Posted by Nikit Saraf <ni...@gmail.com>.
Hi Aayush,
So, have you been able to find a solution for the multi-level map/reduce? I am
also stuck on this problem and cannot find a way out. Can you help me?
Thanks
Aayush Garg wrote:
>
> Hi,
>
> I have never used a Lucene index before. I do not get how we build it with
> Hadoop MapReduce. Basically, what I was looking for is how to implement
> multilevel map/reduce for my problem.
> [earlier quoted messages snipped; they appear in full elsewhere in this thread]
--
View this message in context: http://old.nabble.com/Hadoop%3A-Multiple-map-reduce-or-some-better-way-tp16309172p34009971.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Hadoop: Multiple map reduce or some better way
Posted by Aayush Garg <aa...@gmail.com>.
Hi,
I have never used a Lucene index before. I do not get how we build it with
Hadoop MapReduce. Basically, what I was looking for is how to implement
multilevel map/reduce for my problem.
On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <ni...@gmail.com> wrote:
> You can build Lucene indexes using Hadoop Map/Reduce. See the index
> contrib package in the trunk. Or is it still not something you are
> looking for?
>
> Regards,
> Ning
>
> [earlier quoted messages snipped; they appear in full elsewhere in this thread]
--
Aayush Garg,
Phone: +41 76 482 240
Re: Hadoop: Multiple map reduce or some better way
Posted by Ning Li <ni...@gmail.com>.
You can build Lucene indexes using Hadoop Map/Reduce. See the index
contrib package in the trunk. Or is it still not something you are
looking for?
Regards,
Ning
On 4/4/08, Aayush Garg <aa...@gmail.com> wrote:
> No, currently my requirement is to solve this problem with Apache Hadoop. I am
> trying to build up this type of inverted index and then measure its
> performance against other approaches.
>
> Thanks,
>
> [earlier quoted messages snipped; they appear in full elsewhere in this thread]
Re: Hadoop: Multiple map reduce or some better way
Posted by Aayush Garg <aa...@gmail.com>.
No, currently my requirement is to solve this problem with Apache Hadoop. I am
trying to build up this type of inverted index and then measure its
performance against other approaches.
Thanks,
On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <td...@veoh.com> wrote:
>
> Are you implementing this for instruction or production?
>
> If production, why not use Lucene?
>
>
> [earlier quoted messages snipped; they appear in full elsewhere in this thread]
--
Aayush Garg,
Phone: +41 76 482 240
Re: Hadoop: Multiple map reduce or some better way
Posted by Ted Dunning <td...@veoh.com>.
Are you implementing this for instruction or production?
If production, why not use Lucene?
On 4/3/08 6:45 PM, "Aayush Garg" <aa...@gmail.com> wrote:
> [Aayush's original message quoted in full; snipped — see the top of this thread]