Posted to common-user@hadoop.apache.org by Naama Kraus <na...@gmail.com> on 2008/03/06 13:57:29 UTC
Nutch Extensions to MapReduce
Hi,
I've seen in
http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf
(slide 12) that Nutch has extensions to MapReduce. I wanted to ask whether
these are part of the Hadoop API or inside Nutch only.
More specifically, I saw in
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
(slide 15) that MapReduce outputs two files, each holding different <key,value>
pairs. I'd be curious to know if I can achieve that using the standard API.
Thanks, Naama
--
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)
Re: Nutch Extensions to MapReduce
Posted by Naama Kraus <na...@gmail.com>.
Found the relevant details in the wiki MapReduce tutorial, in particular in the
section "Task Side-Effect Files".
Thanks all for the various inputs, Naama
On Sun, Mar 9, 2008 at 8:22 AM, Ted Dunning <td...@veoh.com> wrote:
>
> Yes.
>
> Look on the wiki or in the discussion archives for details of how to get
> to
> the output directory name.
>
>
> On 3/8/08 1:06 PM, "Naama Kraus" <na...@gmail.com> wrote:
>
> > So the configure() method is called when the Reduce task starts, before
> the
> > actual reduce takes place ? Is that so ?
> > Same for map ?
> >
> > Thanks, Naama
> >
> > On Thu, Mar 6, 2008 at 6:02 PM, Ted Dunning <td...@veoh.com> wrote:
> >
> >>
> >>
> >> This is not difficult to do. Simply open an extra file in the reducers
> >> configure method and close it in the close method. Make sure you make
> it
> >> relative to the map reduce output directory so that you can take
> advantage
> >> of all of the machinery that handles lost jobs and such.
> >>
> >> Search the mailing list archives for more details.
> >>
> >>
>
>
--
Re: Nutch Extensions to MapReduce
Posted by Ted Dunning <td...@veoh.com>.
Yes.
Look on the wiki or in the discussion archives for details of how to get to
the output directory name.
On 3/8/08 1:06 PM, "Naama Kraus" <na...@gmail.com> wrote:
> So the configure() method is called when the Reduce task starts, before the
> actual reduce takes place ? Is that so ?
> Same for map ?
>
> Thanks, Naama
>
> On Thu, Mar 6, 2008 at 6:02 PM, Ted Dunning <td...@veoh.com> wrote:
>
>>
>>
>> This is not difficult to do. Simply open an extra file in the reducers
>> configure method and close it in the close method. Make sure you make it
>> relative to the map reduce output directory so that you can take advantage
>> of all of the machinery that handles lost jobs and such.
>>
>> Search the mailing list archives for more details.
>>
>>
Re: Nutch Extensions to MapReduce
Posted by Naama Kraus <na...@gmail.com>.
So the configure() method is called when the reduce task starts, before the
actual reduce takes place? Is that so?
Same for map?
Thanks, Naama
On Thu, Mar 6, 2008 at 6:02 PM, Ted Dunning <td...@veoh.com> wrote:
>
>
> This is not difficult to do. Simply open an extra file in the reducers
> configure method and close it in the close method. Make sure you make it
> relative to the map reduce output directory so that you can take advantage
> of all of the machinery that handles lost jobs and such.
>
> Search the mailing list archives for more details.
>
>
> On 3/6/08 5:22 AM, "Naama Kraus" <na...@gmail.com> wrote:
>
> > Well, I was not actually thinking to use Nutch.
> > To be concrete, I was interested if a MapReduce job could output
> multiple
> > files each holds different <key,value> pairs. I got the impression this
> is
> > done in Nutch from slide 15 of
> > http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> > but maybe I was mis-understanding.
> > Is it Nutch specific or achievable using Hadoop API ? Would multiple
> > different reducers do the trick ?
> >
> > Thanks for offering to help, I might have more concrete details of what
> I am
> > trying to implement later on, now I am basically learning.
> >
> > Naama
> >
> > On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <en...@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> Currently nutch is a fairly complex application that *uses* hadoop as a
> >> base for distributed computing and storage. In this regard there is no
> >> part in nutch that "extends" hadoop. The core of the mapreduce indeed
> >> does work with <key,value> pairs, and nutch uses specific <key,value>
> >> pairs such as <url, CrawlDatum>, etc.
> >>
> >> So long story short, it depends on what you want to build. If you
> >> working on something that is not related to nutch, you do not need it.
> >> You can give further info about your project if you want extended help.
> >>
> >> best wishes.
> >> Enis
> >>
> >> Naama Kraus wrote:
> >>> Hi,
> >>>
> >>> I've seen in
> >>> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf (slide
> >>> 12) that Nutch has extensions to MapReduce. I wanted to ask whether
> >>> these are part of the Hadoop API or inside Nutch only.
> >>>
> >>> More specifically, I saw in
> >>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf (slide
> >>> 15) that MapReduce outputs two files each holds different <key,value>
> >>> pairs. I'd be curious to know if I can achieve that using the standard
> >> API.
> >>>
> >>> Thanks, Naama
> >>>
> >>>
> >>
> >
> >
>
>
--
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)
about yahoo hadoop deploy
Posted by Leon Liu <lj...@alibaba-inc.com>.
hi,
I hear that Yahoo has deployed Hadoop on 10,000 nodes.
I would like to know what hardware is used, the detailed Hadoop
configuration parameters, and what types of applications run on it.
Does anyone know, or is it a secret?
That Hadoop can be used in a production environment is exciting!
BR/Leon
Re: Nutch Extensions to MapReduce
Posted by Naama Kraus <na...@gmail.com>.
OK, so what I've learned -
One: there is only one reducer type per job.
Two: sounds like ParseOutputFormat is the reference I was looking for, I'll
go have a look.
And yes, I admit my example was a naive one, it was for demonstration
purposes only.
Thanks a lot for the input,
Naama
On Thu, Mar 6, 2008 at 5:26 PM, Enis Soztutar <en...@gmail.com>
wrote:
> Naama Kraus wrote:
> > OK. Let me try an example:
> >
> > Say my map maps a person name to a his child name. <p, c>. If a person
> "Dan"
> > has more than 1 child, bunch of <Dan, c>* pairs will be produced, right
> ?
> > Now say I have two different information needs:
> > 1. Get a list of all children names for each person.
> > 2. Get the number of children of each person.
> >
> > I could run two different MapReduce jobs, with same map but different
> > reducers:
> > 1. emits <p, lc>* pairs where p is the person, lc is a concatenation of
> his
> > children names.
> > 2. emits <p,n>* pairs where p is the person, n is the number of
> children.
> >
> No, you cannot have more than one reducer type in one job. But yes, you
> can write more than one file as the
> result of the reduce phase, which is what I wanted to explain by
> pointing to ParseOutputFormat, which writes ParseText and ParseData to
> different MapFiles at the end of the reduce step. This is done by
> implementing OutputFormat + RecordWriter (given a resulting record from
> the reduce, write separate parts of it to different files).
> > Does that make any sense by now ?
> >
> > Now, my question is whether I can save the two jobs and have a single
> one
> > only which emits both two type of pairs - <p, lc>* and <p,n>*. In
> separate
> > files probably. This way I gain one pass on the input files instead of
> two
> > (or more, if I had more output types ...).
> >
> Actually for this scenario you do not even need two different files with
> <p, lc>* and <p,n>*. You can just compute
> <p, <c1,c2, ..>> which also contains the number of the children (The
> value is a List(for example ArrayWritable) containing children names).
>
> > If not, that's also fine, I was just curious :-)
> >
> > Naama
> >
> >
> >
> > On Thu, Mar 6, 2008 at 3:58 PM, Enis Soztutar <en...@gmail.com>
> > wrote:
> >
> >
> >> Let me explain this more technically :)
> >>
> >> An MR job takes <k1, v1> pairs. Each map(k1,v1) may produce
> >> <k2,v2>* pairs. So at the end of the map stage, the output will be of
> >> the form <k2,v2> pairs. The reduce takes <k2, v2*> pairs and emits <k3,
> >> v3>* pairs, where k1,k2,k3,v1,v2,v3 are all types.
> >>
> >> I cannot understand what you meant by
> >>
> >> if a MapReduce job could output multiple files each holds different
> >> <key,value> pairs"
> >>
> >> The resulting segment directories after a crawl contain
> >> subdirectories(like crawl_generate, content, etc), but these are
> >> generated one-by-one in several jobs running sequentially(and sometimes
> >> by the same job, see ParseOutputFormat in nutch). You can refer further
> >> to the OutputFormat and RecordWriter interfaces for specific needs.
> >>
> >> For each split in the reduce phase a different output file will be
> >> generated, but all the records in the files have the same type. However
> >> in some cases using GenericWritable or ObjectWritable, you can wrap
> >> different types of keys and values.
> >>
> >> Hope it helps,
> >> Enis
> >>
> >> Naama Kraus wrote:
> >>
> >>> Well, I was not actually thinking to use Nutch.
> >>> To be concrete, I was interested if a MapReduce job could output
> >>>
> >> multiple
> >>
> >>> files each holds different <key,value> pairs. I got the impression
> this
> >>>
> >> is
> >>
> >>> done in Nutch from slide 15 of
> >>>
> >>>
> >>
> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> >>
> >>> but maybe I was mis-understanding.
> >>> Is it Nutch specific or achievable using Hadoop API ? Would multiple
> >>> different reducers do the trick ?
> >>>
> >>> Thanks for offering to help, I might have more concrete details of
> what
> >>>
> >> I am
> >>
> >>> trying to implement later on, now I am basically learning.
> >>>
> >>> Naama
> >>>
> >>> On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <
> enis.soz.nutch@gmail.com>
> >>> wrote:
> >>>
> >>>
> >>>
> >>>> Hi,
> >>>>
> >>>> Currently nutch is a fairly complex application that *uses* hadoop as
> a
> >>>> base for distributed computing and storage. In this regard there is
> no
> >>>> part in nutch that "extends" hadoop. The core of the mapreduce indeed
> >>>> does work with <key,value> pairs, and nutch uses specific <key,value>
> >>>> pairs such as <url, CrawlDatum>, etc.
> >>>>
> >>>> So long story short, it depends on what you want to build. If you
> >>>> working on something that is not related to nutch, you do not need
> it.
> >>>> You can give further info about your project if you want extended
> help.
> >>>>
> >>>> best wishes.
> >>>> Enis
> >>>>
> >>>> Naama Kraus wrote:
> >>>>
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I've seen in
> >>>>> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf (slide
> >>>>> 12) that Nutch has extensions to MapReduce. I wanted to ask whether
> >>>>> these are part of the Hadoop API or inside Nutch only.
> >>>>>
> >>>>> More specifically, I saw in
> >>>>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf (slide
> >>>>> 15) that MapReduce outputs two files each holds different
> <key,value>
> >>>>> pairs. I'd be curious to know if I can achieve that using the
> standard
> >>>>>
> >>>>>
> >>>> API.
> >>>>
> >>>>
> >>>>> Thanks, Naama
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>>
> >
> >
> >
> >
>
--
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)
Re: Nutch Extensions to MapReduce
Posted by Enis Soztutar <en...@gmail.com>.
Naama Kraus wrote:
> OK. Let me try an example:
>
> Say my map maps a person name to a his child name. <p, c>. If a person "Dan"
> has more than 1 child, bunch of <Dan, c>* pairs will be produced, right ?
> Now say I have two different information needs:
> 1. Get a list of all children names for each person.
> 2. Get the number of children of each person.
>
> I could run two different MapReduce jobs, with same map but different
> reducers:
> 1. emits <p, lc>* pairs where p is the person, lc is a concatenation of his
> children names.
> 2. emits <p,n>* pairs where p is the person, n is the number of children.
>
No, you cannot have more than one reducer type in one job. But yes, you
can write more than one file as the
result of the reduce phase, which is what I wanted to explain by
pointing to ParseOutputFormat, which writes ParseText and ParseData to
different MapFiles at the end of the reduce step. This is done by
implementing OutputFormat + RecordWriter (given a resulting record from
the reduce, write separate parts of it to different files).
> Does that make any sense by now ?
>
> Now, my question is whether I can save the two jobs and have a single one
> only which emits both two type of pairs - <p, lc>* and <p,n>*. In separate
> files probably. This way I gain one pass on the input files instead of two
> (or more, if I had more output types ...).
>
Actually, for this scenario you do not even need two different files with
<p, lc>* and <p,n>*. You can just compute
<p, <c1,c2, ..>>, which also contains the number of children (the
value is a list, for example an ArrayWritable, containing the children's names).
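As an illustration of that one-pass idea, here is a plain-Java sketch (not Hadoop code; the class and method names are made up for this example): a single reduce pass over the grouped <p, [c1, c2, ...]> pairs can fill both the child-list output and the child-count output at once. In a real job, the two parts would be routed to different files by a custom OutputFormat/RecordWriter.

```java
import java.util.*;

// Plain-Java sketch (not the Hadoop API) of the single-job idea: one reduce
// pass over <p, [c1, c2, ...]> produces both outputs at once -- the
// child-list pairs <p, lc> and the child-count pairs <p, n> -- instead of
// re-reading the input in a second job.
public class OnePassSketch {

    // One reduce call: given a person and all of that person's children,
    // emit into both "output files" (modeled here as two maps).
    public static void reduce(String person, List<String> children,
                              Map<String, String> listOut,
                              Map<String, Integer> countOut) {
        listOut.put(person, String.join(",", children)); // <p, lc>
        countOut.put(person, children.size());           // <p, n>
    }

    public static void main(String[] args) {
        // The grouped input, as it would arrive at the reduce stage.
        Map<String, List<String>> grouped = Map.of(
                "Dan", List.of("Avi", "Beth"),
                "Eve", List.of("Carl"));
        Map<String, String> lists = new TreeMap<>();
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((p, cs) -> reduce(p, cs, lists, counts));
        System.out.println(lists);
        System.out.println(counts);
    }
}
```

Note that, as pointed out in the reply, the count output is strictly redundant here: it is just the size of the child list, so emitting only <p, [c1, c2, ...]> already serves both needs.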
> If not, that's also fine, I was just curious :-)
>
> Naama
>
>
>
> On Thu, Mar 6, 2008 at 3:58 PM, Enis Soztutar <en...@gmail.com>
> wrote:
>
>
>> Let me explain this more technically :)
>>
>> An MR job takes <k1, v1> pairs. Each map(k1,v1) may produce
>> <k2,v2>* pairs. So at the end of the map stage, the output will be of
>> the form <k2,v2> pairs. The reduce takes <k2, v2*> pairs and emits <k3,
>> v3>* pairs, where k1,k2,k3,v1,v2,v3 are all types.
>>
>> I cannot understand what you meant by
>>
>> if a MapReduce job could output multiple files each holds different
>> <key,value> pairs"
>>
>> The resulting segment directories after a crawl contain
>> subdirectories(like crawl_generate, content, etc), but these are
>> generated one-by-one in several jobs running sequentially(and sometimes
>> by the same job, see ParseOutputFormat in nutch). You can refer further
>> to the OutputFormat and RecordWriter interfaces for specific needs.
>>
>> For each split in the reduce phase a different output file will be
>> generated, but all the records in the files have the same type. However
>> in some cases using GenericWritable or ObjectWritable, you can wrap
>> different types of keys and values.
>>
>> Hope it helps,
>> Enis
>>
>> Naama Kraus wrote:
>>
>>> Well, I was not actually thinking to use Nutch.
>>> To be concrete, I was interested if a MapReduce job could output
>>>
>> multiple
>>
>>> files each holds different <key,value> pairs. I got the impression this
>>>
>> is
>>
>>> done in Nutch from slide 15 of
>>>
>>>
>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
>>
>>> but maybe I was mis-understanding.
>>> Is it Nutch specific or achievable using Hadoop API ? Would multiple
>>> different reducers do the trick ?
>>>
>>> Thanks for offering to help, I might have more concrete details of what
>>>
>> I am
>>
>>> trying to implement later on, now I am basically learning.
>>>
>>> Naama
>>>
>>> On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <en...@gmail.com>
>>> wrote:
>>>
>>>
>>>
>>>> Hi,
>>>>
>>>> Currently nutch is a fairly complex application that *uses* hadoop as a
>>>> base for distributed computing and storage. In this regard there is no
>>>> part in nutch that "extends" hadoop. The core of the mapreduce indeed
>>>> does work with <key,value> pairs, and nutch uses specific <key,value>
>>>> pairs such as <url, CrawlDatum>, etc.
>>>>
>>>> So long story short, it depends on what you want to build. If you
>>>> working on something that is not related to nutch, you do not need it.
>>>> You can give further info about your project if you want extended help.
>>>>
>>>> best wishes.
>>>> Enis
>>>>
>>>> Naama Kraus wrote:
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> I've seen in
>>>>> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf (slide
>>>>> 12) that Nutch has extensions to MapReduce. I wanted to ask whether
>>>>> these are part of the Hadoop API or inside Nutch only.
>>>>>
>>>>> More specifically, I saw in
>>>>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf (slide
>>>>> 15) that MapReduce outputs two files each holds different <key,value>
>>>>> pairs. I'd be curious to know if I can achieve that using the standard
>>>>>
>>>>>
>>>> API.
>>>>
>>>>
>>>>> Thanks, Naama
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>
>
>
>
Re: Nutch Extensions to MapReduce
Posted by Naama Kraus <na...@gmail.com>.
OK. Let me try an example:
Say my map maps a person's name to a child's name: <p, c>. If a person "Dan"
has more than one child, a bunch of <Dan, c>* pairs will be produced, right?
Now say I have two different information needs:
1. Get a list of all children names for each person.
2. Get the number of children of each person.
I could run two different MapReduce jobs, with the same map but different
reducers:
1. emits <p, lc>* pairs where p is the person, lc is a concatenation of his
children's names.
2. emits <p,n>* pairs where p is the person, n is the number of children.
Does that make any sense by now ?
Now, my question is whether I can save one of the two jobs and have a single
job that emits both types of pairs - <p, lc>* and <p,n>* - probably into
separate files. This way I'd make one pass over the input files instead of two
(or more, if I had more output types ...).
If not, that's also fine, I was just curious :-)
Naama
On Thu, Mar 6, 2008 at 3:58 PM, Enis Soztutar <en...@gmail.com>
wrote:
> Let me explain this more technically :)
>
> An MR job takes <k1, v1> pairs. Each map(k1,v1) may produce
> <k2,v2>* pairs. So at the end of the map stage, the output will be of
> the form <k2,v2> pairs. The reduce takes <k2, v2*> pairs and emits <k3,
> v3>* pairs, where k1,k2,k3,v1,v2,v3 are all types.
>
> I cannot understand what you meant by
>
> if a MapReduce job could output multiple files each holds different
> <key,value> pairs"
>
> The resulting segment directories after a crawl contain
> subdirectories(like crawl_generate, content, etc), but these are
> generated one-by-one in several jobs running sequentially(and sometimes
> by the same job, see ParseOutputFormat in nutch). You can refer further
> to the OutputFormat and RecordWriter interfaces for specific needs.
>
> For each split in the reduce phase a different output file will be
> generated, but all the records in the files have the same type. However
> in some cases using GenericWritable or ObjectWritable, you can wrap
> different types of keys and values.
>
> Hope it helps,
> Enis
>
> Naama Kraus wrote:
> > Well, I was not actually thinking to use Nutch.
> > To be concrete, I was interested if a MapReduce job could output
> multiple
> > files each holds different <key,value> pairs. I got the impression this
> is
> > done in Nutch from slide 15 of
> >
> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> > but maybe I was mis-understanding.
> > Is it Nutch specific or achievable using Hadoop API ? Would multiple
> > different reducers do the trick ?
> >
> > Thanks for offering to help, I might have more concrete details of what
> I am
> > trying to implement later on, now I am basically learning.
> >
> > Naama
> >
> > On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <en...@gmail.com>
> > wrote:
> >
> >
> >> Hi,
> >>
> >> Currently nutch is a fairly complex application that *uses* hadoop as a
> >> base for distributed computing and storage. In this regard there is no
> >> part in nutch that "extends" hadoop. The core of the mapreduce indeed
> >> does work with <key,value> pairs, and nutch uses specific <key,value>
> >> pairs such as <url, CrawlDatum>, etc.
> >>
> >> So long story short, it depends on what you want to build. If you
> >> working on something that is not related to nutch, you do not need it.
> >> You can give further info about your project if you want extended help.
> >>
> >> best wishes.
> >> Enis
> >>
> >> Naama Kraus wrote:
> >>
> >>> Hi,
> >>>
> >>> I've seen in
> >>> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf (slide
> >>> 12) that Nutch has extensions to MapReduce. I wanted to ask whether
> >>> these are part of the Hadoop API or inside Nutch only.
> >>>
> >>> More specifically, I saw in
> >>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf (slide
> >>> 15) that MapReduce outputs two files each holds different <key,value>
> >>> pairs. I'd be curious to know if I can achieve that using the standard
> >>>
> >> API.
> >>
> >>> Thanks, Naama
> >>>
> >>>
> >>>
> >
> >
> >
> >
>
--
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)
Re: Nutch Extensions to MapReduce
Posted by Enis Soztutar <en...@gmail.com>.
Let me explain this more technically :)
An MR job takes <k1, v1> pairs. Each map(k1,v1) may produce
<k2,v2>* pairs. So at the end of the map stage, the output will be of
the form <k2,v2> pairs. The reduce takes <k2, v2*> pairs and emits <k3,
v3>* pairs, where k1,k2,k3,v1,v2,v3 are all types.
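That type flow can be illustrated with a toy in-memory pipeline in plain Java. This is a sketch of the data flow only, not the Hadoop API; the word-count map and reduce are just examples chosen for familiarity.

```java
import java.util.*;

// Toy, in-memory sketch of the MR type signature described above:
// map(k1, v1) -> list of (k2, v2); the framework groups by k2;
// reduce(k2, list of v2) -> (k3, v3). Here k1 = line number, v1 = line
// text, k2 = word, v2 = 1, k3 = word, v3 = count.
public class MrTypesSketch {

    // map: (line number, line text) -> (word, 1) pairs
    public static List<Map.Entry<String, Integer>> map(Integer k1, String v1) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : v1.split("\\s+")) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // reduce: (word, [1, 1, ...]) -> (word, count)
    public static Map.Entry<String, Integer> reduce(String k2, List<Integer> v2s) {
        int sum = 0;
        for (int v : v2s) sum += v;
        return Map.entry(k2, sum);
    }

    public static Map<String, Integer> run(List<String> lines) {
        // shuffle/sort stage: group all (k2, v2) pairs by k2
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (Map.Entry<String, Integer> kv : map(i, lines.get(i))) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        // reduce stage: one reduce call per distinct k2
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> kv = reduce(e.getKey(), e.getValue());
            result.put(kv.getKey(), kv.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b a")));
    }
}
```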
I cannot quite understand what you meant by
"if a MapReduce job could output multiple files each holds different <key,value> pairs"
The resulting segment directories after a crawl contain
subdirectories (like crawl_generate, content, etc.), but these are
generated one by one in several jobs running sequentially (and sometimes
by the same job; see ParseOutputFormat in nutch). You can refer further
to the OutputFormat and RecordWriter interfaces for specific needs.
For each split in the reduce phase a different output file will be
generated, but all the records in the files have the same type. However,
in some cases, using GenericWritable or ObjectWritable, you can wrap
different types of keys and values.
Hope it helps,
Enis
Naama Kraus wrote:
> Well, I was not actually thinking to use Nutch.
> To be concrete, I was interested if a MapReduce job could output multiple
> files each holds different <key,value> pairs. I got the impression this is
> done in Nutch from slide 15 of
> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> but maybe I was mis-understanding.
> Is it Nutch specific or achievable using Hadoop API ? Would multiple
> different reducers do the trick ?
>
> Thanks for offering to help, I might have more concrete details of what I am
> trying to implement later on, now I am basically learning.
>
> Naama
>
> On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <en...@gmail.com>
> wrote:
>
>
>> Hi,
>>
>> Currently nutch is a fairly complex application that *uses* hadoop as a
>> base for distributed computing and storage. In this regard there is no
>> part in nutch that "extends" hadoop. The core of the mapreduce indeed
>> does work with <key,value> pairs, and nutch uses specific <key,value>
>> pairs such as <url, CrawlDatum>, etc.
>>
>> So long story short, it depends on what you want to build. If you
>> working on something that is not related to nutch, you do not need it.
>> You can give further info about your project if you want extended help.
>>
>> best wishes.
>> Enis
>>
>> Naama Kraus wrote:
>>
>>> Hi,
>>>
>>> I've seen in
>>> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf (slide
>>> 12) that Nutch has extensions to MapReduce. I wanted to ask whether
>>> these are part of the Hadoop API or inside Nutch only.
>>>
>>> More specifically, I saw in
>>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf (slide
>>> 15) that MapReduce outputs two files each holds different <key,value>
>>> pairs. I'd be curious to know if I can achieve that using the standard
>>>
>> API.
>>
>>> Thanks, Naama
>>>
>>>
>>>
>
>
>
>
Re: Nutch Extensions to MapReduce
Posted by Ted Dunning <td...@veoh.com>.
This is not difficult to do. Simply open an extra file in the reducer's
configure() method and close it in the close() method. Make sure you make it
relative to the MapReduce output directory so that you can take advantage
of all of the machinery that handles lost jobs and such.
Search the mailing list archives for more details.
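That open-in-configure, write-during-reduce, close-in-close lifecycle can be sketched outside Hadoop with plain Java I/O. The class and method names below deliberately mirror the old Hadoop reducer lifecycle, but none of this is actual Hadoop API, and the temp directory merely stands in for the task's work output directory.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Plain-Java sketch of a reducer that maintains a side-effect file:
// open it once when the task is configured, append during reduce calls,
// close it once when the task closes.
public class SideFileReducerSketch {
    private Writer sideFile;

    // called once before any reduce() call, like Reducer.configure(JobConf)
    public void configure(Path workOutputDir) throws IOException {
        sideFile = Files.newBufferedWriter(workOutputDir.resolve("side-output.txt"));
    }

    // the actual reduce: write the "extra" records to the side file
    public void reduce(String key, List<String> values) throws IOException {
        sideFile.write(key + "\t" + values.size() + "\n");
    }

    // called once after the last reduce() call, like Closeable.close()
    public void close() throws IOException {
        sideFile.close();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("task-work");
        SideFileReducerSketch r = new SideFileReducerSketch();
        r.configure(dir);
        r.reduce("Dan", List.of("Avi", "Beth"));
        r.close();
        System.out.println(Files.readString(dir.resolve("side-output.txt")));
    }
}
```

In a real job, the directory passed to configure() would be the task's work output path taken from the job configuration, so the framework's cleanup of failed or speculative tasks covers the side file too, as Ted suggests.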
On 3/6/08 5:22 AM, "Naama Kraus" <na...@gmail.com> wrote:
> Well, I was not actually thinking to use Nutch.
> To be concrete, I was interested if a MapReduce job could output multiple
> files each holds different <key,value> pairs. I got the impression this is
> done in Nutch from slide 15 of
> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
> but maybe I was mis-understanding.
> Is it Nutch specific or achievable using Hadoop API ? Would multiple
> different reducers do the trick ?
>
> Thanks for offering to help, I might have more concrete details of what I am
> trying to implement later on, now I am basically learning.
>
> Naama
>
> On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <en...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Currently nutch is a fairly complex application that *uses* hadoop as a
>> base for distributed computing and storage. In this regard there is no
>> part in nutch that "extends" hadoop. The core of the mapreduce indeed
>> does work with <key,value> pairs, and nutch uses specific <key,value>
>> pairs such as <url, CrawlDatum>, etc.
>>
>> So long story short, it depends on what you want to build. If you
>> working on something that is not related to nutch, you do not need it.
>> You can give further info about your project if you want extended help.
>>
>> best wishes.
>> Enis
>>
>> Naama Kraus wrote:
>>> Hi,
>>>
>>> I've seen in
>>> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf (slide
>>> 12) that Nutch has extensions to MapReduce. I wanted to ask whether
>>> these are part of the Hadoop API or inside Nutch only.
>>>
>>> More specifically, I saw in
>>> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf (slide
>>> 15) that MapReduce outputs two files each holds different <key,value>
>>> pairs. I'd be curious to know if I can achieve that using the standard
>> API.
>>>
>>> Thanks, Naama
>>>
>>>
>>
>
>
Re: Nutch Extensions to MapReduce
Posted by Naama Kraus <na...@gmail.com>.
Well, I was not actually thinking of using Nutch.
To be concrete, I was interested in whether a MapReduce job could output
multiple files, each holding different <key,value> pairs. I got the impression
this is done in Nutch from slide 15 of
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf
but maybe I was misunderstanding.
Is it Nutch specific, or achievable using the Hadoop API? Would multiple
different reducers do the trick?
Thanks for offering to help, I might have more concrete details of what I am
trying to implement later on, now I am basically learning.
Naama
On Thu, Mar 6, 2008 at 3:13 PM, Enis Soztutar <en...@gmail.com>
wrote:
> Hi,
>
> Currently nutch is a fairly complex application that *uses* hadoop as a
> base for distributed computing and storage. In this regard there is no
> part in nutch that "extends" hadoop. The core of the mapreduce indeed
> does work with <key,value> pairs, and nutch uses specific <key,value>
> pairs such as <url, CrawlDatum>, etc.
>
> So long story short, it depends on what you want to build. If you
> working on something that is not related to nutch, you do not need it.
> You can give further info about your project if you want extended help.
>
> best wishes.
> Enis
>
> Naama Kraus wrote:
> > Hi,
> >
> > I've seen in
> > http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf (slide
> > 12) that Nutch has extensions to MapReduce. I wanted to ask whether
> > these are part of the Hadoop API or inside Nutch only.
> >
> > More specifically, I saw in
> > http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf (slide
> > 15) that MapReduce outputs two files each holds different <key,value>
> > pairs. I'd be curious to know if I can achieve that using the standard
> API.
> >
> > Thanks, Naama
> >
> >
>
--
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)
Re: Nutch Extensions to MapReduce
Posted by Enis Soztutar <en...@gmail.com>.
Hi,
Currently nutch is a fairly complex application that *uses* hadoop as a
base for distributed computing and storage. In this regard there is no
part in nutch that "extends" hadoop. The core of mapreduce indeed
works with <key,value> pairs, and nutch uses specific <key,value>
pairs such as <url, CrawlDatum>, etc.
So, long story short, it depends on what you want to build. If you are
working on something that is not related to nutch, you do not need it.
You can give further info about your project if you want extended help.
best wishes.
Enis
Naama Kraus wrote:
> Hi,
>
> I've seen in
> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf (slide
> 12) that Nutch has extensions to MapReduce. I wanted to ask whether
> these are part of the Hadoop API or inside Nutch only.
>
> More specifically, I saw in
> http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/yahoo-sds.pdf (slide
> 15) that MapReduce outputs two files each holds different <key,value>
> pairs. I'd be curious to know if I can achieve that using the standard API.
>
> Thanks, Naama
>
>