Posted to common-user@hadoop.apache.org by Naama Kraus <na...@gmail.com> on 2008/03/06 13:57:29 UTC

Nutch Extensions to MapReduce

Hi,

I've seen in
http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
(slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether
these are part of the Hadoop API or inside Nutch only.

More specifically, I saw in
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
(slide 15) that MapReduce outputs two files, each holding different <key,value>
pairs. I'd be curious to know whether I can achieve that using the standard API.

Thanks, Naama


Re: Nutch Extensions to MapReduce

Posted by Naama Kraus <na...@gmail.com>.
Found the relevant details in the wiki MapReduce tutorial, in particular in
the section "Task Side-Effect Files".

Thanks all for the various inputs, Naama
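
(For reference, the pattern described in that tutorial section looks roughly
like the sketch below, written against the 2008-era org.apache.hadoop.mapred
API. The helper for the work directory has moved around between early Hadoop
releases, so treat getWorkOutputPath and the mapred.work.output.dir property
as version-dependent.)

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Side-effect files go under the task attempt's private work directory
// (${mapred.work.output.dir}); the framework promotes them to the job
// output directory only if the attempt succeeds, so failed or speculative
// attempts never leave partial files behind.
public class SideEffectPaths {
  public static Path sideEffectFile(JobConf job, String name) throws IOException {
    Path workDir = FileOutputFormat.getWorkOutputPath(job);
    return new Path(workDir, name);
  }
}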

On Sun, Mar 9, 2008 at 8:22 AM, Ted Dunning <td...@veoh.com> wrote:

>
> Yes.
>
> Look on the wiki or in the discussion archives for details of how to get
> to the output directory name.
>
>
> On 3/8/08 1:06 PM, "Naama Kraus" <na...@gmail.com> wrote:
>
> > So the configure() method is called when the Reduce task starts, before
> > the actual reduce takes place? Is that so?
> > Same for map?
> >
> > Thanks, Naama
> >
> > On Thu, Mar 6, 2008 at 6:02 PM, Ted Dunning <td...@veoh.com> wrote:
> >
> >>
> >>
> >> This is not difficult to do.  Simply open an extra file in the reducer's
> >> configure() method and close it in the close() method.  Make sure you
> >> make it relative to the MapReduce output directory so that you can take
> >> advantage of all of the machinery that handles lost jobs and such.
> >>
> >> Search the mailing list archives for more details.
> >>
> >>
>
>



Re: Nutch Extensions to MapReduce

Posted by Ted Dunning <td...@veoh.com>.
Yes.

Look on the wiki or in the discussion archives for details of how to get to
the output directory name.


On 3/8/08 1:06 PM, "Naama Kraus" <na...@gmail.com> wrote:

> So the configure() method is called when the Reduce task starts, before the
> actual reduce takes place? Is that so?
> Same for map?
> 
> Thanks, Naama
> 
> On Thu, Mar 6, 2008 at 6:02 PM, Ted Dunning <td...@veoh.com> wrote:
> 
>> 
>> 
>> This is not difficult to do.  Simply open an extra file in the reducer's
>> configure() method and close it in the close() method.  Make sure you make it
>> relative to the MapReduce output directory so that you can take advantage
>> of all of the machinery that handles lost jobs and such.
>> 
>> Search the mailing list archives for more details.
>> 
>> 


Re: Nutch Extensions to MapReduce

Posted by Naama Kraus <na...@gmail.com>.
So the configure() method is called when the Reduce task starts, before the
actual reduce takes place? Is that so?
Same for map?

Thanks, Naama

On Thu, Mar 6, 2008 at 6:02 PM, Ted Dunning <td...@veoh.com> wrote:

>
>
> This is not difficult to do.  Simply open an extra file in the reducer's
> configure() method and close it in the close() method.  Make sure you make it
> relative to the MapReduce output directory so that you can take advantage
> of all of the machinery that handles lost jobs and such.
>
> Search the mailing list archives for more details.
>
>
> On 3/6/08 5:22 AM, "Naama Kraus" <na...@gmail.com> wrote:
>
> > Well, I was not actually thinking of using Nutch.
> > To be concrete, I was interested in whether a MapReduce job could output
> > multiple files, each holding different <key,value> pairs. I got the
> > impression this is done in Nutch from slide 15 of
> > http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> > but maybe I was misunderstanding.
> > Is it Nutch-specific, or achievable using the Hadoop API? Would multiple
> > different reducers do the trick?
> >
> > Thanks for offering to help. I might have more concrete details of what
> > I am trying to implement later on; for now I am basically learning.
> >
> > Naama
> >
> > On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <en...@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> Currently Nutch is a fairly complex application that *uses* Hadoop as a
> >> base for distributed computing and storage. In this regard there is no
> >> part of Nutch that "extends" Hadoop. The core of MapReduce indeed
> >> works with <key,value> pairs, and Nutch uses specific <key,value>
> >> pairs such as <url, CrawlDatum>, etc.
> >>
> >> So, long story short, it depends on what you want to build. If you are
> >> working on something that is not related to Nutch, you do not need it.
> >> You can give further info about your project if you want extended help.
> >>
> >> best wishes.
> >> Enis
> >>
> >> Naama Kraus wrote:
> >>> Hi,
> >>>
> >>> I've seen in
> >>> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
> >>> (slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether
> >>> these are part of the Hadoop API or inside Nutch only.
> >>>
> >>> More specifically, I saw in
> >>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> >>> (slide 15) that MapReduce outputs two files, each holding different
> >>> <key,value> pairs. I'd be curious to know whether I can achieve that
> >>> using the standard API.
> >>>
> >>> Thanks, Naama
> >>>
> >>>
> >>
> >
> >
>
>



about yahoo hadoop deploy

Posted by Leon Liu <lj...@alibaba-inc.com>.
Hi,
   I hear that Yahoo has deployed a 10,000-node Hadoop cluster.
   I would like to know what hardware was used and the detailed Hadoop
configuration parameters, and what types of applications run on it.
   Does anyone know, or is it a secret?
   That Hadoop can be used in a production environment is exciting!

BR/Leon

Re: Nutch Extensions to MapReduce

Posted by Naama Kraus <na...@gmail.com>.
OK, so what I've learned -

One: there is only one reducer type per job.
Two: it sounds like ParseOutputFormat is the reference I was looking for;
I'll go have a look.

And yes, I admit my example was a naive one; it was for demonstration
purposes only.

Thanks a lot for the input,
Naama

On Thu, Mar 6, 2008 at 5:26 PM, Enis Soztutar <en...@gmail.com>
wrote:

> Naama Kraus wrote:
> > OK. Let me try an example:
> >
> > Say my map maps a person's name to a child's name: <p, c>. If a person
> > "Dan" has more than one child, a bunch of <Dan, c>* pairs will be
> > produced, right?
> > Now say I have two different information needs:
> > 1. Get a list of all children's names for each person.
> > 2. Get the number of children of each person.
> >
> > I could run two different MapReduce jobs, with the same map but different
> > reducers:
> > 1. emits <p, lc>* pairs where p is the person and lc is a concatenation of
> > his children's names.
> > 2. emits <p, n>* pairs where p is the person and n is the number of
> > children.
> >
> No, you cannot have more than one type of reducer in one job. But yes, you
> can write more than one file as the result of the reduce phase, which is
> what I wanted to explain by pointing to ParseOutputFormat, which writes
> ParseText and ParseData to different MapFiles at the end of the reduce
> step.  So this is done by implementing OutputFormat + RecordWriter (given
> a resulting record from the reduce, write separate parts of it to
> different files).
> > Does that make any sense by now?
> >
> > Now, my question is whether I can save the two jobs and have a single
> > one which emits both types of pairs - <p, lc>* and <p, n>* - probably in
> > separate files. This way I get one pass over the input files instead of
> > two (or more, if I had more output types ...).
> >
> Actually, for this scenario you do not even need two different files with
> <p, lc>* and <p, n>*.  You can just compute <p, <c1, c2, ..>>, which also
> gives the number of children (the value is a list, for example an
> ArrayWritable, containing the children's names).
>
> > If not, that's also fine, I was just curious :-)
> >
> > Naama
> >
> >
> >
> > On Thu, Mar 6, 2008 at 3:58 PM, Enis Soztutar <en...@gmail.com>
> > wrote:
> >
> >
> >> Let me explain this more technically :)
> >>
> >> An MR job takes <k1, v1> pairs. Each map(k1, v1) may produce
> >> <k2, v2>* pairs. So at the end of the map stage, the output will be of
> >> the form <k2, v2> pairs. The reduce takes <k2, v2*> pairs and emits
> >> <k3, v3>* pairs, where k1, k2, k3, v1, v2, v3 are all types.
> >>
> >> I cannot understand what you meant by
> >>
> >> "if a MapReduce job could output multiple files each holding different
> >> <key,value> pairs"
> >>
> >> The resulting segment directories after a crawl contain
> >> subdirectories (like crawl_generate, content, etc.), but these are
> >> generated one by one by several jobs running sequentially (and sometimes
> >> by the same job; see ParseOutputFormat in Nutch). You can refer further
> >> to the OutputFormat and RecordWriter interfaces for specific needs.
> >>
> >> For each split in the reduce phase a different output file will be
> >> generated, but all the records in the files have the same type. However,
> >> in some cases, using GenericWritable or ObjectWritable you can wrap
> >> different types of keys and values.
> >>
> >> Hope it helps,
> >> Enis
> >>
> >> Naama Kraus wrote:
> >>
> >>> Well, I was not actually thinking of using Nutch.
> >>> To be concrete, I was interested in whether a MapReduce job could
> >>> output multiple files, each holding different <key,value> pairs. I got
> >>> the impression this is done in Nutch from slide 15 of
> >>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> >>> but maybe I was misunderstanding.
> >>> Is it Nutch-specific, or achievable using the Hadoop API? Would
> >>> multiple different reducers do the trick?
> >>>
> >>> Thanks for offering to help. I might have more concrete details of
> >>> what I am trying to implement later on; for now I am basically
> >>> learning.
> >>>
> >>> Naama
> >>>
> >>> On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <enis.soz.nutch@gmail.com>
> >>> wrote:
> >>>
> >>>
> >>>
> >>>> Hi,
> >>>>
> >>>> Currently Nutch is a fairly complex application that *uses* Hadoop as
> >>>> a base for distributed computing and storage. In this regard there is
> >>>> no part of Nutch that "extends" Hadoop. The core of MapReduce indeed
> >>>> works with <key,value> pairs, and Nutch uses specific <key,value>
> >>>> pairs such as <url, CrawlDatum>, etc.
> >>>>
> >>>> So, long story short, it depends on what you want to build. If you are
> >>>> working on something that is not related to Nutch, you do not need it.
> >>>> You can give further info about your project if you want extended
> >>>> help.
> >>>>
> >>>> best wishes.
> >>>> Enis
> >>>>
> >>>> Naama Kraus wrote:
> >>>>
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I've seen in
> >>>>> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
> >>>>> (slide 12) that Nutch has extensions to MapReduce. I wanted to ask
> >>>>> whether these are part of the Hadoop API or inside Nutch only.
> >>>>>
> >>>>> More specifically, I saw in
> >>>>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> >>>>> (slide 15) that MapReduce outputs two files, each holding different
> >>>>> <key,value> pairs. I'd be curious to know whether I can achieve that
> >>>>> using the standard API.
> >>>>
> >>>>
> >>>>> Thanks, Naama
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>>
> >
> >
> >
> >
>




Re: Nutch Extensions to MapReduce

Posted by Enis Soztutar <en...@gmail.com>.
Naama Kraus wrote:
> OK. Let me try an example:
>
> Say my map maps a person's name to a child's name: <p, c>. If a person "Dan"
> has more than one child, a bunch of <Dan, c>* pairs will be produced, right?
> Now say I have two different information needs:
> 1. Get a list of all children's names for each person.
> 2. Get the number of children of each person.
>
> I could run two different MapReduce jobs, with the same map but different
> reducers:
> 1. emits <p, lc>* pairs where p is the person and lc is a concatenation of his
> children's names.
> 2. emits <p, n>* pairs where p is the person and n is the number of children.
No, you cannot have more than one type of reducer in one job. But yes, you
can write more than one file as the result of the reduce phase, which is
what I wanted to explain by pointing to ParseOutputFormat, which writes
ParseText and ParseData to different MapFiles at the end of the reduce
step.  So this is done by implementing OutputFormat + RecordWriter (given a
resulting record from the reduce, write separate parts of it to different
files).
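
As a rough illustration of that OutputFormat + RecordWriter idea (a sketch,
not Nutch's actual ParseOutputFormat: PersonRecord and the file names are
made up for the example, it writes SequenceFiles rather than the MapFiles
Nutch uses, and helpers like getTaskOutputPath are version-dependent in
early Hadoop releases):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

// Hypothetical composite value: both "views" of one reduce output record.
class PersonRecord implements Writable {
  Text childList = new Text();            // e.g. "Alice,Bob"
  IntWritable count = new IntWritable();
  public void write(DataOutput out) throws IOException {
    childList.write(out); count.write(out);
  }
  public void readFields(DataInput in) throws IOException {
    childList.readFields(in); count.readFields(in);
  }
}

public class TwoFileOutputFormat extends FileOutputFormat<Text, PersonRecord> {
  public RecordWriter<Text, PersonRecord> getRecordWriter(
      FileSystem fs, JobConf job, String name, Progressable progress)
      throws IOException {
    // One pair of files per reduce partition.
    final SequenceFile.Writer names = SequenceFile.createWriter(fs, job,
        FileOutputFormat.getTaskOutputPath(job, name + "-names"),
        Text.class, Text.class);
    final SequenceFile.Writer counts = SequenceFile.createWriter(fs, job,
        FileOutputFormat.getTaskOutputPath(job, name + "-counts"),
        Text.class, IntWritable.class);
    return new RecordWriter<Text, PersonRecord>() {
      public void write(Text key, PersonRecord value) throws IOException {
        // Split the record: the name list goes to one file, the count to the other.
        names.append(key, value.childList);
        counts.append(key, value.count);
      }
      public void close(Reporter reporter) throws IOException {
        names.close();
        counts.close();
      }
    };
  }
}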
> Does that make any sense by now?
>
> Now, my question is whether I can save the two jobs and have a single one
> which emits both types of pairs - <p, lc>* and <p, n>* - probably in
> separate files. This way I get one pass over the input files instead of two
> (or more, if I had more output types ...).
>
Actually, for this scenario you do not even need two different files with
<p, lc>* and <p, n>*.  You can just compute <p, <c1, c2, ..>>, which also
gives the number of children (the value is a list, for example an
ArrayWritable, containing the children's names).
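
A minimal sketch of that single-output approach with the old
org.apache.hadoop.mapred API (the class names are illustrative): the reducer
collects the children's names into one ArrayWritable, and the count falls
out as the array's length.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ChildrenReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, ChildrenReducer.TextArrayWritable> {

  // ArrayWritable needs a concrete subclass so readers can deserialize it
  // with the right element type.
  public static class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() { super(Text.class); }
  }

  public void reduce(Text person, Iterator<Text> children,
                     OutputCollector<Text, TextArrayWritable> output,
                     Reporter reporter) throws IOException {
    List<Writable> names = new ArrayList<Writable>();
    while (children.hasNext()) {
      // Copy: the framework reuses the Writable it passes via the iterator.
      names.add(new Text(children.next()));
    }
    TextArrayWritable value = new TextArrayWritable();
    value.set(names.toArray(new Writable[names.size()]));
    // One record carries the full list; the count is just the array length.
    output.collect(person, value);
  }
}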

> If not, that's also fine, I was just curious :-)
>
> Naama
>
>
>
> On Thu, Mar 6, 2008 at 3:58 PM, Enis Soztutar <en...@gmail.com>
> wrote:
>
>   
>> Let me explain this more technically :)
>>
>> An MR job takes <k1, v1> pairs. Each map(k1, v1) may produce
>> <k2, v2>* pairs. So at the end of the map stage, the output will be of
>> the form <k2, v2> pairs. The reduce takes <k2, v2*> pairs and emits
>> <k3, v3>* pairs, where k1, k2, k3, v1, v2, v3 are all types.
>>
>> I cannot understand what you meant by
>>
>> "if a MapReduce job could output multiple files each holding different
>> <key,value> pairs"
>>
>> The resulting segment directories after a crawl contain
>> subdirectories (like crawl_generate, content, etc.), but these are
>> generated one by one by several jobs running sequentially (and sometimes
>> by the same job; see ParseOutputFormat in Nutch). You can refer further
>> to the OutputFormat and RecordWriter interfaces for specific needs.
>>
>> For each split in the reduce phase a different output file will be
>> generated, but all the records in the files have the same type. However,
>> in some cases, using GenericWritable or ObjectWritable you can wrap
>> different types of keys and values.
>>
>> Hope it helps,
>> Enis
>>
>> Naama Kraus wrote:
>>     
>>> Well, I was not actually thinking of using Nutch.
>>> To be concrete, I was interested in whether a MapReduce job could output
>>> multiple files, each holding different <key,value> pairs. I got the
>>> impression this is done in Nutch from slide 15 of
>>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
>>> but maybe I was misunderstanding.
>>> Is it Nutch-specific, or achievable using the Hadoop API? Would multiple
>>> different reducers do the trick?
>>>
>>> Thanks for offering to help. I might have more concrete details of what
>>> I am trying to implement later on; for now I am basically learning.
>>>
>>> Naama
>>>
>>> On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <en...@gmail.com>
>>> wrote:
>>>
>>>
>>>       
>>>> Hi,
>>>>
>>>> Currently Nutch is a fairly complex application that *uses* Hadoop as a
>>>> base for distributed computing and storage. In this regard there is no
>>>> part of Nutch that "extends" Hadoop. The core of MapReduce indeed
>>>> works with <key,value> pairs, and Nutch uses specific <key,value>
>>>> pairs such as <url, CrawlDatum>, etc.
>>>>
>>>> So, long story short, it depends on what you want to build. If you are
>>>> working on something that is not related to Nutch, you do not need it.
>>>> You can give further info about your project if you want extended help.
>>>>
>>>> best wishes.
>>>> Enis
>>>>
>>>> Naama Kraus wrote:
>>>>
>>>>         
>>>>> Hi,
>>>>>
>>>>> I've seen in
>>>>> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
>>>>> (slide 12) that Nutch has extensions to MapReduce. I wanted to ask
>>>>> whether these are part of the Hadoop API or inside Nutch only.
>>>>>
>>>>> More specifically, I saw in
>>>>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
>>>>> (slide 15) that MapReduce outputs two files, each holding different
>>>>> <key,value> pairs. I'd be curious to know whether I can achieve that
>>>>> using the standard API.
>>>>>
>>>>> Thanks, Naama
>>>>>
>>>>>
>>>>>
>>>>>           
>>>
>>>
>>>       
>
>
>
>   

Re: Nutch Extensions to MapReduce

Posted by Naama Kraus <na...@gmail.com>.
OK. Let me try an example:

Say my map maps a person's name to a child's name: <p, c>. If a person "Dan"
has more than one child, a bunch of <Dan, c>* pairs will be produced, right?
Now say I have two different information needs:
1. Get a list of all children's names for each person.
2. Get the number of children of each person.

I could run two different MapReduce jobs, with the same map but different
reducers:
1. emits <p, lc>* pairs where p is the person and lc is a concatenation of his
children's names.
2. emits <p, n>* pairs where p is the person and n is the number of children.

Does that make any sense by now?

Now, my question is whether I can save the two jobs and have a single one
which emits both types of pairs - <p, lc>* and <p, n>* - probably in separate
files. This way I get one pass over the input files instead of two
(or more, if I had more output types ...).

If not, that's also fine, I was just curious :-)

Naama



On Thu, Mar 6, 2008 at 3:58 PM, Enis Soztutar <en...@gmail.com>
wrote:

> Let me explain this more technically :)
>
> An MR job takes <k1, v1> pairs. Each map(k1, v1) may produce
> <k2, v2>* pairs. So at the end of the map stage, the output will be of
> the form <k2, v2> pairs. The reduce takes <k2, v2*> pairs and emits
> <k3, v3>* pairs, where k1, k2, k3, v1, v2, v3 are all types.
>
> I cannot understand what you meant by
>
> "if a MapReduce job could output multiple files each holding different
> <key,value> pairs"
>
> The resulting segment directories after a crawl contain
> subdirectories (like crawl_generate, content, etc.), but these are
> generated one by one by several jobs running sequentially (and sometimes
> by the same job; see ParseOutputFormat in Nutch). You can refer further
> to the OutputFormat and RecordWriter interfaces for specific needs.
>
> For each split in the reduce phase a different output file will be
> generated, but all the records in the files have the same type. However,
> in some cases, using GenericWritable or ObjectWritable you can wrap
> different types of keys and values.
>
> Hope it helps,
> Enis
>
> Naama Kraus wrote:
> > Well, I was not actually thinking of using Nutch.
> > To be concrete, I was interested in whether a MapReduce job could output
> > multiple files, each holding different <key,value> pairs. I got the
> > impression this is done in Nutch from slide 15 of
> > http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> > but maybe I was misunderstanding.
> > Is it Nutch-specific, or achievable using the Hadoop API? Would multiple
> > different reducers do the trick?
> >
> > Thanks for offering to help. I might have more concrete details of what
> > I am trying to implement later on; for now I am basically learning.
> >
> > Naama
> >
> > On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <en...@gmail.com>
> > wrote:
> >
> >
> >> Hi,
> >>
> >> Currently Nutch is a fairly complex application that *uses* Hadoop as a
> >> base for distributed computing and storage. In this regard there is no
> >> part of Nutch that "extends" Hadoop. The core of MapReduce indeed
> >> works with <key,value> pairs, and Nutch uses specific <key,value>
> >> pairs such as <url, CrawlDatum>, etc.
> >>
> >> So, long story short, it depends on what you want to build. If you are
> >> working on something that is not related to Nutch, you do not need it.
> >> You can give further info about your project if you want extended help.
> >>
> >> best wishes.
> >> Enis
> >>
> >> Naama Kraus wrote:
> >>
> >>> Hi,
> >>>
> >>> I've seen in
> >>> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
> >>> (slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether
> >>> these are part of the Hadoop API or inside Nutch only.
> >>>
> >>> More specifically, I saw in
> >>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> >>> (slide 15) that MapReduce outputs two files, each holding different
> >>> <key,value> pairs. I'd be curious to know whether I can achieve that
> >>> using the standard API.
> >>
> >>> Thanks, Naama
> >>>
> >>>
> >>>
> >
> >
> >
> >
>




Re: Nutch Extensions to MapReduce

Posted by Enis Soztutar <en...@gmail.com>.
Let me explain this more technically :)

An MR job takes <k1, v1> pairs. Each map(k1, v1) may produce
<k2, v2>* pairs. So at the end of the map stage, the output will be of
the form <k2, v2> pairs. The reduce takes <k2, v2*> pairs and emits
<k3, v3>* pairs, where k1, k2, k3, v1, v2, v3 are all types.
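
In code, those type parameters show up directly in the old
org.apache.hadoop.mapred interfaces. A bare skeleton (the concrete types
here are illustrative only):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// <k1, v1> = <LongWritable, Text>, <k2, v2> = <Text, IntWritable>,
// <k3, v3> = <Text, IntWritable>; each stage's types are fixed per job.
public class Skeleton {
  public static class MyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable k1, Text v1,
                    OutputCollector<Text, IntWritable> out, Reporter r)
        throws IOException {
      // One map(k1, v1) call may collect zero or more <k2, v2> pairs.
      out.collect(new Text(v1), new IntWritable(1));
    }
  }

  public static class MyReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text k2, Iterator<IntWritable> v2s,
                       OutputCollector<Text, IntWritable> out, Reporter r)
        throws IOException {
      int sum = 0;
      while (v2s.hasNext()) sum += v2s.next().get();
      // The reduce of <k2, v2*> emits <k3, v3>* pairs.
      out.collect(k2, new IntWritable(sum));
    }
  }
}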

I cannot understand what you meant by

"if a MapReduce job could output multiple files each holding different <key,value> pairs"

The resulting segment directories after a crawl contain
subdirectories (like crawl_generate, content, etc.), but these are
generated one by one by several jobs running sequentially (and sometimes
by the same job; see ParseOutputFormat in Nutch). You can refer further
to the OutputFormat and RecordWriter interfaces for specific needs.

For each split in the reduce phase a different output file will be
generated, but all the records in the files have the same type. However,
in some cases, using GenericWritable or ObjectWritable you can wrap
different types of keys and values.
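
A sketch of the GenericWritable approach (the wrapper class is hypothetical;
only the getTypes() override is required by
org.apache.hadoop.io.GenericWritable, whose exact signature varies slightly
across versions):

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Wraps either a Text (a name list) or an IntWritable (a count), so one
// job can emit two kinds of values under a single declared value type.
public class ChildInfoWritable extends GenericWritable {
  @SuppressWarnings("unchecked")
  private static final Class<? extends Writable>[] TYPES =
      new Class[] { Text.class, IntWritable.class };

  protected Class<? extends Writable>[] getTypes() {
    return TYPES;
  }
  // In a reducer:
  //   ChildInfoWritable w = new ChildInfoWritable();
  //   w.set(new Text("Alice,Bob"));     // or: w.set(new IntWritable(2));
  //   output.collect(person, w);
  // Readers call get() and check the instance's concrete type.
}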

Hope it helps,
Enis

Naama Kraus wrote:
> Well, I was not actually thinking of using Nutch.
> To be concrete, I was interested in whether a MapReduce job could output
> multiple files, each holding different <key,value> pairs. I got the
> impression this is done in Nutch from slide 15 of
> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> but maybe I was misunderstanding.
> Is it Nutch-specific, or achievable using the Hadoop API? Would multiple
> different reducers do the trick?
>
> Thanks for offering to help. I might have more concrete details of what
> I am trying to implement later on; for now I am basically learning.
>
> Naama
>
> On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <en...@gmail.com>
> wrote:
>
>   
>> Hi,
>>
>> Currently Nutch is a fairly complex application that *uses* Hadoop as a
>> base for distributed computing and storage. In this regard there is no
>> part of Nutch that "extends" Hadoop. The core of MapReduce indeed
>> works with <key,value> pairs, and Nutch uses specific <key,value>
>> pairs such as <url, CrawlDatum>, etc.
>>
>> So, long story short, it depends on what you want to build. If you are
>> working on something that is not related to Nutch, you do not need it.
>> You can give further info about your project if you want extended help.
>>
>> best wishes.
>> Enis
>>
>> Naama Kraus wrote:
>>     
>>> Hi,
>>>
>>> I've seen in
>>> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
>>> (slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether
>>> these are part of the Hadoop API or inside Nutch only.
>>>
>>> More specifically, I saw in
>>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
>>> (slide 15) that MapReduce outputs two files, each holding different
>>> <key,value> pairs. I'd be curious to know whether I can achieve that
>>> using the standard API.
>>>
>>> Thanks, Naama
>>>
>>>
>>>       
>
>
>
>   

Re: Nutch Extensions to MapReduce

Posted by Ted Dunning <td...@veoh.com>.

This is not difficult to do.  Simply open an extra file in the reducer's
configure() method and close it in the close() method.  Make sure you make it
relative to the MapReduce output directory so that you can take advantage
of all of the machinery that handles lost jobs and such.

Search the mailing list archives for more details.
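
A minimal sketch of that suggestion against the 2008-era
org.apache.hadoop.mapred API (the side file and the counts it records are
illustrative, and the exact helper for the task work directory varies
across early Hadoop versions):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SideFileReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private SequenceFile.Writer side;

  public void configure(JobConf job) {
    try {
      // Open the extra file under the task's work output directory, named
      // per task so different reduce partitions do not collide after the
      // framework promotes successful attempts' files to the job output.
      Path dir = FileOutputFormat.getWorkOutputPath(job);
      Path file = new Path(dir, "counts-" + job.get("mapred.task.id"));
      side = SequenceFile.createWriter(FileSystem.get(job), job, file,
                                       Text.class, IntWritable.class);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    int n = 0;
    StringBuilder names = new StringBuilder();
    while (values.hasNext()) {
      if (n++ > 0) names.append(',');
      names.append(values.next().toString());
    }
    output.collect(key, new Text(names.toString()));  // main output
    side.append(key, new IntWritable(n));             // side output
  }

  public void close() throws IOException {
    side.close();
  }
}

Because the file lives under the task attempt's work directory, a failed or
speculative attempt's copy is simply discarded; only the successful
attempt's file is promoted next to the regular part-NNNNN outputs.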


On 3/6/08 5:22 AM, "Naama Kraus" <na...@gmail.com> wrote:

> Well, I was not actually thinking of using Nutch.
> To be concrete, I was interested in whether a MapReduce job could output
> multiple files, each holding different <key,value> pairs. I got the
> impression this is done in Nutch from slide 15 of
> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> but maybe I was misunderstanding.
> Is it Nutch-specific, or achievable using the Hadoop API? Would multiple
> different reducers do the trick?
>
> Thanks for offering to help. I might have more concrete details of what
> I am trying to implement later on; for now I am basically learning.
> 
> Naama
> 
> On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <en...@gmail.com>
> wrote:
> 
>> Hi,
>> 
>> Currently Nutch is a fairly complex application that *uses* Hadoop as a
>> base for distributed computing and storage. In this regard there is no
>> part of Nutch that "extends" Hadoop. The core of MapReduce indeed
>> works with <key,value> pairs, and Nutch uses specific <key,value>
>> pairs such as <url, CrawlDatum>, etc.
>>
>> So, long story short, it depends on what you want to build. If you are
>> working on something that is not related to Nutch, you do not need it.
>> You can give further info about your project if you want extended help.
>> 
>> best wishes.
>> Enis
>> 
>> Naama Kraus wrote:
>>> Hi,
>>> 
>>> I've seen in
>>> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
>>> (slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether
>>> these are part of the Hadoop API or inside Nutch only.
>>>
>>> More specifically, I saw in
>>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
>>> (slide 15) that MapReduce outputs two files, each holding different
>>> <key,value> pairs. I'd be curious to know whether I can achieve that
>>> using the standard API.
>>> 
>>> Thanks, Naama
>>> 
>>> 
>> 
> 
> 


Re: Nutch Extensions to MapReduce

Posted by Naama Kraus <na...@gmail.com>.
Well, I was not actually thinking of using Nutch.
To be concrete, I was interested in whether a MapReduce job could output
multiple files, each holding different <key,value> pairs. I got the
impression this is done in Nutch from slide 15 of
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
but maybe I was misunderstanding.
Is it Nutch-specific, or achievable using the Hadoop API? Would multiple
different reducers do the trick?

Thanks for offering to help. I might have more concrete details of what
I am trying to implement later on; for now I am basically learning.

Naama

On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <en...@gmail.com>
wrote:

> Hi,
>
> Currently Nutch is a fairly complex application that *uses* Hadoop as a
> base for distributed computing and storage. In this regard there is no
> part of Nutch that "extends" Hadoop. The core of MapReduce indeed
> works with <key,value> pairs, and Nutch uses specific <key,value>
> pairs such as <url, CrawlDatum>, etc.
>
> So, long story short, it depends on what you want to build. If you are
> working on something that is not related to Nutch, you do not need it.
> You can give further info about your project if you want extended help.
>
> best wishes.
> Enis
>
> Naama Kraus wrote:
> > Hi,
> >
> > I've seen in
> > http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
> > (slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether
> > these are part of the Hadoop API or inside Nutch only.
> >
> > More specifically, I saw in
> > http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> > (slide 15) that MapReduce outputs two files, each holding different
> > <key,value> pairs. I'd be curious to know whether I can achieve that
> > using the standard API.
> >
> > Thanks, Naama
> >
> >
>




Re: Nutch Extensions to MapReduce

Posted by Enis Soztutar <en...@gmail.com>.
Hi,

Currently Nutch is a fairly complex application that *uses* Hadoop as a
base for distributed computing and storage. In this regard there is no
part of Nutch that "extends" Hadoop. The core of MapReduce indeed
works with <key,value> pairs, and Nutch uses specific <key,value>
pairs such as <url, CrawlDatum>, etc.

So, long story short, it depends on what you want to build. If you are
working on something that is not related to Nutch, you do not need it.
You can give further info about your project if you want extended help.

best wishes.
Enis

Naama Kraus wrote:
> Hi,
>
> I've seen in
> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
> (slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether
> these are part of the Hadoop API or inside Nutch only.
>
> More specifically, I saw in
> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> (slide 15) that MapReduce outputs two files, each holding different <key,value>
> pairs. I'd be curious to know whether I can achieve that using the standard API.
>
> Thanks, Naama
>
>