Posted to dev@nutch.apache.org by Sabah Sajjad Khan <sa...@wayne.edu> on 2016/03/23 18:06:59 UTC

New to Nutch2.x

Hello,


We worked with Nutch 1.x for a project and were able to crawl the way we wanted. Our project now requires us to use Nutch 2.x, and the documentation for it seems to be lacking. We are able to inject but have no idea what to do next. Is it the same as in Nutch 1.x? Any help would be appreciated, as we are students and have been struggling for a good two months now!


Thank you

Re: New to Nutch2.x

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi,

Nutch 2.x is different because it makes use of external data stores
to hold all crawled data.

However, the steps to run a crawl are the same:
 inject
 loop: generate, fetch, parse, update
 invert links, index
Only the arguments passed to run each step may be different.
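
To make the ordering concrete, here is a minimal sketch in Python that only
models the sequence of steps outlined above. The function name crawl_plan and
the step labels are made up for illustration; they are not Nutch commands or
arguments, which differ between 1.x and 2.x and are best taken from the
tutorial and the bin/crawl script.

```python
from typing import List

# Step labels mirror the outline in this message. They are labels only,
# not literal Nutch commands or command-line arguments.
LOOP_STEPS = ["generate", "fetch", "parse", "update"]


def crawl_plan(rounds: int) -> List[str]:
    """Return the ordered step labels for a crawl with `rounds` loop iterations."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    plan = ["inject"]                       # runs once, before the loop
    for _ in range(rounds):                 # repeat the cycle once per round
        plan.extend(LOOP_STEPS)
    plan.extend(["invert links", "index"])  # run once, after the loop
    return plan


if __name__ == "__main__":
    # Print a numbered plan for a two-round crawl.
    for number, step in enumerate(crawl_plan(2), start=1):
        print(f"{number:2d}. {step}")
```

Each label would correspond to one invocation of the matching Nutch step,
with whatever arguments the installed version expects.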

Have a look at:
 https://wiki.apache.org/nutch/Nutch2Tutorial
and
 the bin/crawl script
which is provided for both 1.x and 2.x.
The differences should be obvious.

But may I ask why you do not keep using Nutch 1.x,
which is still maintained and in some respects even better
than 2.x?

Cheers,
Sebastian
