Posted to user@nutch.apache.org by Arcadius Ahouansou <ar...@menelic.com> on 2011/09/16 00:29:14 UTC

Crawling search result pages

Hello.
I am new to Nutch.

I need to use Nutch to index data into Solr.

Let's say I need to crawl some newspaper search pages and index any article
regarding the word "java".

I understand that I would need to point Nutch to the search result page.

1) What I need from Nutch is not to crawl/index the result page which
contains only the summary, but to follow each result link and index the
content of the full article.

2) I would also need Nutch to follow the pagination Next link to the next
set of results and do the same as step 1)

3) Repeat 1) and 2) until there are no results left.


Please, is this something that Nutch can easily do?

Any hints would be much appreciated.

Thanks.

Arcadius.

Re: Crawling search result pages

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

If you want to index every page on example.org with the term 'java' you need 
to crawl every page on example.org. Nutch will follow links and add the links 
to its CrawlDB so you can crawl newly discovered pages in the next crawl cycle 
(read the wiki tutorial on what a cycle is).
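
The cycle idea can be pictured with a toy model. This is only a sketch and
not Nutch code: the class ToyCrawl, the in-memory link map and the URL
strings are all made up for illustration. Each cycle fetches the URLs found
in the previous cycle and records any links not seen before.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of crawl cycles over an in-memory link graph (hypothetical, not Nutch). */
class ToyCrawl {
    /**
     * Runs at most maxCycles cycles starting from seed and returns every
     * URL fetched, in fetch order.
     */
    static List<String> crawl(Map<String, List<String>> links, String seed, int maxCycles) {
        Set<String> crawlDb = new LinkedHashSet<>(); // every URL known so far
        List<String> fetched = new ArrayList<>();
        List<String> toFetch = new ArrayList<>();
        crawlDb.add(seed);
        toFetch.add(seed);
        for (int cycle = 0; cycle < maxCycles && !toFetch.isEmpty(); cycle++) {
            List<String> discovered = new ArrayList<>();
            for (String url : toFetch) {
                fetched.add(url);
                // Record outlinks; only URLs new to the db are queued for the next cycle.
                for (String out : links.getOrDefault(url, List.of())) {
                    if (crawlDb.add(out)) {
                        discovered.add(out);
                    }
                }
            }
            toFetch = discovered;
        }
        return fetched;
    }
}
```

With a seed that is a search result page linking to an article and to the
next result page, one cycle fetches only the seed; the article and the next
result page are fetched in the second cycle, and so on.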

If you really only want to index pages in which one or more terms occur you
must create a custom indexing filter. This indexing filter must then check the
document content for the occurrence of your string(s) and then conditionally
pass or reject the document from being sent to the index.
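
The pass/reject decision itself can be sketched in plain Java. This is a
minimal sketch with no Nutch dependencies; TermGate and its methods are
hypothetical names. In a real plugin the check would presumably live in a
class implementing Nutch's IndexingFilter extension point, and the
"return null to reject" convention used here is an assumption to verify
against the Javadoc of your Nutch version.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: decides whether a document should reach the index. */
class TermGate {
    /** True if content contains at least one term (case-insensitive substring match). */
    static boolean containsAnyTerm(String content, List<String> terms) {
        if (content == null) {
            return false;
        }
        String haystack = content.toLowerCase(Locale.ROOT);
        for (String term : terms) {
            if (haystack.contains(term.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    /** Returns doc unchanged to pass it on, or null to reject it (assumed convention). */
    static <D> D filter(D doc, String content, List<String> terms) {
        return containsAnyTerm(content, terms) ? doc : null;
    }
}
```

Note that a substring match on "java" also accepts text that only mentions
"JavaScript"; matching on word boundaries would be stricter.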

First try to crawl your domain(s) via the tutorials on the wiki. Then you can 
try to build a filter.

Cheers
