Posted to user@nutch.apache.org by Azhar Jassal <az...@gmail.com> on 2014/05/25 16:51:07 UTC

Single combined generator and fetch job

Hi

I'm using Nutch 2.2.1

Each of the 4 jobs in the crawl cycle, as explained here, needs to reread the
entire webtable to get started: http://wiki.apache.org/nutch/Nutch2Crawling

This is a serious bottleneck for my use case.

I know that the fetch and parse jobs can be combined via the Nutch config.
This removes the need to run the parse job separately, and therefore
the webtable does not need to be read again.
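
For reference, the combined fetch and parse mode is normally switched on through the `fetcher.parse` property. A minimal sketch, assuming the override is placed in conf/nutch-site.xml and that the shipped default in this release is false:

```xml
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <property>
    <name>fetcher.parse</name>
    <!-- true: the fetcher parses content as it is fetched,
         so a separate parse job is not needed for that batch -->
    <value>true</value>
  </property>
</configuration>
```

With this set, the fetch job does the parsing itself, which saves one pass over the webtable per crawl cycle but not the passes made by the generate and update jobs.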

The page I linked to states that a future development might be combining
the generate and fetch stages so that only one read of the webtable is
required.

Has anyone attempted to do this? Is there a patch out there for a combined
generator and fetch job?

Thanks

Az

Re: Single combined generator and fetch job

Posted by Talat Uyarer <ta...@uyarer.com>.
Hi Azhar,

Can you explain why that is a bottleneck for you?

Thanks

Re: Single combined generator and fetch job

Posted by Julien Nioche <li...@gmail.com>.
Hi

There have been a lot of changes in 2.x recently, notably the use of filtered
scans in GORA, which addresses this issue.
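
To illustrate why a filtered scan helps, here is a toy model in Python. It is only a sketch and not Gora's or Nutch's API: `ToyStore`, `Row`, and the `batch_id` field are hypothetical names, and the assumption is that each job only needs the rows marked for its current batch. What it models is the number of rows handed from the store to the job, which is where the saving would come from.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Row:
    url: str
    batch_id: Optional[str]  # hypothetical marker written by the generate step


class ToyStore:
    """In-memory stand-in for the webtable; counts rows handed to the job."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self.rows: List[Row] = list(rows)
        self.rows_returned = 0

    def scan(self, row_filter: Optional[Callable[[Row], bool]] = None) -> Iterator[Row]:
        for row in self.rows:
            # With a filter, non-matching rows are dropped inside the store
            # and never reach the job.
            if row_filter is not None and not row_filter(row):
                continue
            self.rows_returned += 1
            yield row


def fetch_list_unfiltered(store: ToyStore, batch_id: str) -> List[str]:
    # Every row is returned to the job, which then discards most of them.
    return [r.url for r in store.scan() if r.batch_id == batch_id]


def fetch_list_filtered(store: ToyStore, batch_id: str) -> List[str]:
    # The predicate travels with the scan, so only matching rows come back.
    return [r.url for r in store.scan(lambda r: r.batch_id == batch_id)]
```

Both functions return the same URLs; the difference is `rows_returned`, which equals the table size in the unfiltered case and the batch size in the filtered one.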

Please check out 2.x
(https://svn.apache.org/repos/asf/nutch/branches/2.x/) and give it a try.

See https://issues.apache.org/jira/browse/NUTCH-1674 and
https://issues.apache.org/jira/browse/NUTCH-1714 for details.

Julien

