Posted to user@nutch.apache.org by Weilei Zhang <zh...@gmail.com> on 2013/02/07 03:31:04 UTC

performance question: fetcher and parser in separate map/reduce jobs?

Hi,
I have a performance question:
why are the fetcher and parser staged in two separate jobs instead of one?
Intuitively, the parser could be included as part of the fetcher reducer,
couldn't it? That seems like it would be more efficient.
Thanks
-- 
Best Regards
-Weilei

Re: performance question: fetcher and parser in separate map/reduce jobs?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Ken,

So the question is whether the "fetching and parsing" of a parsing
fetcher happens in the map or the reduce phase of the fetch job ;)

Looking at the code, it now appears that the new wiki entry is incorrect in
this regard (I will change this ASAP): outlink processing, and therefore
parsing and fetching, are all executed in the reduce phase. To be honest,
this is easier to identify in the 2.x code, as it gets somewhat hidden in
the 1.x Fetcher code.

BTW please feel free to add your comments below to the wiki entry as I
think they are valuable to the discussion. Thanks for the input Ken.
Lewis

On Wed, Feb 6, 2013 at 8:21 PM, Ken Krugler <kk...@transpac.com> wrote:

> Hi Lewis,
>
> On Feb 6, 2013, at 6:50pm, Lewis John Mcgibbney wrote:
>
> > I've eventually added this to our FAQ's
> >
> >
> http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F
>
> I'm way out of date on the Nutch code base, but I thought that fetching
> happened during the reduce phase (to enable queue processing by domain or
> IP address).
>
> And that multiple threads were spun up to fetch at a higher level of
> parallelization than what you'd get out of configuring Hadoop's # of
> reducers per slave.
>
> In which case if you parse at the same time that you fetch, you'd need #
> threads * (memory & CPU parsing requirements) in addition to the (mostly
> I/O-bound) resources from fetching.
>
> But from the note on the wiki ("In a parsing fetcher, outlinks are
> processed in the mapper") it sounds like when using a parsing fetcher this
> is happening in a map task.
>
> So I'm curious about the current architecture of Nutch.
>
> Thanks,
>
> -- Ken
>
>
> >
> > This should explain for you.
> > Lewis
> >
> > On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang <zh...@gmail.com> wrote:
> >
> >> Hi
> >> I have a performance question:
> >> why fetcher and parser is staged in two separate jobs instead of one?
> >> Intuitively, parser can be included as a part of fetcher reducer,  is
> >> it? This seems to be more efficient.
> >> Thanks
> >> --
> >> Best Regards
> >> -Weilei
> >>
> >
> >
> >
> > --
> > *Lewis*
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>
>


-- 
*Lewis*

Re: performance question: fetcher and parser in separate map/reduce jobs?

Posted by Ken Krugler <kk...@transpac.com>.

Hi Lewis,

On Feb 6, 2013, at 6:50pm, Lewis John Mcgibbney wrote:

> I've eventually added this to our FAQ's
> 
> http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F

I'm way out of date on the Nutch code base, but I thought that fetching happened during the reduce phase (to enable queue processing by domain or IP address).

And that multiple threads were spun up to fetch at a higher level of parallelization than what you'd get out of configuring Hadoop's # of reducers per slave.

In which case if you parse at the same time that you fetch, you'd need # threads * (memory & CPU parsing requirements) in addition to the (mostly I/O-bound) resources from fetching.
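
To make the arithmetic above concrete, here is a minimal sketch of the "# threads * parsing requirements" estimate. The thread count and the per-parse memory figure are made-up assumptions for illustration, not measured Nutch numbers, and the class and method names are hypothetical.

```java
public class ParseMemoryEstimate {

    /**
     * Rough upper bound on the extra memory needed when every fetcher
     * thread may be parsing a document at the same time.
     */
    public static long peakParseBytes(int fetcherThreads, long bytesPerParse) {
        // multiplyExact throws instead of silently overflowing on absurd inputs
        return Math.multiplyExact((long) fetcherThreads, bytesPerParse);
    }

    public static void main(String[] args) {
        int fetcherThreads = 50;                 // assumed threads per fetch task
        long bytesPerParse = 20L * 1024 * 1024;  // assumed 20 MiB per in-flight parse

        long peak = peakParseBytes(fetcherThreads, bytesPerParse);
        System.out.println("Peak parse memory: " + peak / (1024 * 1024) + " MiB");
    }
}
```

With these assumed figures the parse-side memory alone comes to about 1000 MiB per task, on top of whatever the (mostly I/O-bound) fetching needs.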

But from the note on the wiki ("In a parsing fetcher, outlinks are processed in the mapper") it sounds like when using a parsing fetcher this is happening in a map task.

So I'm curious about the current architecture of Nutch.

Thanks,

-- Ken

> 
> This should explain for you.
> Lewis
> 
> On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang <zh...@gmail.com> wrote:
> 
>> Hi
>> I have a performance question:
>> why fetcher and parser is staged in two separate jobs instead of one?
>> Intuitively, parser can be included as a part of fetcher reducer,  is
>> it? This seems to be more efficient.
>> Thanks
>> --
>> Best Regards
>> -Weilei
>> 
> 
> 
> 
> -- 
> *Lewis*

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

RE: performance question: fetcher and parser in separate map/reduce jobs?

Posted by Markus Jelsma <ma...@openindex.io>.

Oh, I'd like to add that the biggest problem is memory, along with the possibility of a parser hanging, consuming resources, timing out everything else and destroying the segment.
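
The hang risk described above is commonly contained by running each parse under a time limit. The following is only a generic sketch of that idea using java.util.concurrent; it is not taken from the Nutch code base, and the class name, method name and timeout values are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseWithTimeout {

    /** Runs the task; returns the fallback if it fails or does not finish in time. */
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck parse must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; this only helps if the parser honours interrupts.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the parse itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String ok = runWithTimeout(() -> "parsed", 1000, "timed out");
        String slow = runWithTimeout(() -> {
            Thread.sleep(5000); // stands in for a parser that hangs
            return "parsed";
        }, 100, "timed out");
        System.out.println(ok + " / " + slow);
    }
}
```

A time limit bounds how long the caller waits, but a parser that ignores interrupts can still hold on to its memory and CPU, which is one reason a separate parse job remains the safer option.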
 
 
-----Original message-----
> From:Weilei Zhang <zh...@gmail.com>
> Sent: Sat 09-Feb-2013 23:40
> To: user@nutch.apache.org
> Subject: Re: performance question: fetcher and parser in separate map/reduce jobs?
> 
> This is indeed helpful. Thanks Lewis.
> 
> On Wed, Feb 6, 2013 at 6:50 PM, Lewis John Mcgibbney
> <le...@gmail.com> wrote:
> > I've eventually added this to our FAQ's
> >
> > http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F
> >
> > This should explain for you.
> > Lewis
> >
> > On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang <zh...@gmail.com> wrote:
> >
> >> Hi
> >> I have a performance question:
> >> why fetcher and parser is staged in two separate jobs instead of one?
> >> Intuitively, parser can be included as a part of fetcher reducer,  is
> >> it? This seems to be more efficient.
> >> Thanks
> >> --
> >> Best Regards
> >> -Weilei
> >>
> >
> >
> >
> > --
> > *Lewis*
> 
> 
> 
> -- 
> Best Regards
> -Weilei
>

RE: performance question: fetcher and parser in separate map/reduce jobs?

Posted by Markus Jelsma <ma...@openindex.io>.

A parsing fetcher does everything in the mapper. Please check the output() method around line 1012 onwards:

http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup

Parsing, signature calculation and outlink processing (using code in ParseOutputFormat) all happen there.

Cheers,
Markus
 
 
-----Original message-----
> From:Weilei Zhang <zh...@gmail.com>
> Sent: Sat 09-Feb-2013 23:40
> To: user@nutch.apache.org
> Subject: Re: performance question: fetcher and parser in separate map/reduce jobs?
> 
> This is indeed helpful. Thanks Lewis.
> 
> On Wed, Feb 6, 2013 at 6:50 PM, Lewis John Mcgibbney
> <le...@gmail.com> wrote:
> > I've eventually added this to our FAQ's
> >
> > http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F
> >
> > This should explain for you.
> > Lewis
> >
> > On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang <zh...@gmail.com> wrote:
> >
> >> Hi
> >> I have a performance question:
> >> why fetcher and parser is staged in two separate jobs instead of one?
> >> Intuitively, parser can be included as a part of fetcher reducer,  is
> >> it? This seems to be more efficient.
> >> Thanks
> >> --
> >> Best Regards
> >> -Weilei
> >>
> >
> >
> >
> > --
> > *Lewis*
> 
> 
> 
> -- 
> Best Regards
> -Weilei
>

Re: performance question: fetcher and parser in separate map/reduce jobs?

Posted by Weilei Zhang <zh...@gmail.com>.

This is indeed helpful. Thanks Lewis.

On Wed, Feb 6, 2013 at 6:50 PM, Lewis John Mcgibbney
<le...@gmail.com> wrote:
> I've eventually added this to our FAQ's
>
> http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F
>
> This should explain for you.
> Lewis
>
> On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang <zh...@gmail.com> wrote:
>
>> Hi
>> I have a performance question:
>> why fetcher and parser is staged in two separate jobs instead of one?
>> Intuitively, parser can be included as a part of fetcher reducer,  is
>> it? This seems to be more efficient.
>> Thanks
>> --
>> Best Regards
>> -Weilei
>>
>
>
>
> --
> *Lewis*



-- 
Best Regards
-Weilei

Re: performance question: fetcher and parser in separate map/reduce jobs?

Posted by Lewis John Mcgibbney <le...@gmail.com>.

I've finally added this to our FAQ:

http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F

This should explain it for you.
Lewis

On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang <zh...@gmail.com> wrote:

> Hi
> I have a performance question:
> why fetcher and parser is staged in two separate jobs instead of one?
> Intuitively, parser can be included as a part of fetcher reducer,  is
> it? This seems to be more efficient.
> Thanks
> --
> Best Regards
> -Weilei
>



-- 
*Lewis*