
Posted to user@nutch.apache.org by Fadzi Ushewokunze <de...@butterflycluster.com> on 2006/09/03 11:47:16 UTC

searching dynamic pages

Hi,

Is it possible to configure Nutch to crawl a URL like the following?
http://www.butterflycluster.com/index.php?searchword=java&option=com_search&Itemid

I don't want to crawl the _whole_ website. I want my crawl to start from the results returned
by this query.

I have injected this URL, but it doesn't seem to be fetched at all. If I inject the URL http://www.butterflycluster.com, it is crawled, but I don't want this.

In essence, I want to crawl the search results of this website, and I have a lot more that I want to crawl like this.

Any suggestions will be greatly appreciated.

Thanks

RE: searching dynamic pages

Posted by Vishal Shah <vi...@rediff.co.in>.

Hi,

   Did you check your urlfilter files? The default ones exclude URLs
that are dynamic, so you might want to comment out the following line in
your crawl-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
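
To see why that rule blocks the injected URL, here is a rough Python sketch of first-match-wins filtering. It is not Nutch's actual code: the `accepts` helper and the `+^http://www\.butterflycluster\.com/` accept rule are hypothetical, added only for illustration, and the sketch assumes each rule is a `+` or `-` sign followed by a regex that may match anywhere in the URL.

```python
import re

def accepts(rules, url):
    """Hypothetical helper: apply rules in order; the first regex that
    matches decides. '+' accepts, '-' rejects, no match means reject."""
    for rule in rules:
        rule = rule.strip()
        if not rule or rule.startswith("#"):
            continue  # blank lines and comments are ignored
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

URL = ("http://www.butterflycluster.com/index.php"
       "?searchword=java&option=com_search&Itemid")

# The rule quoted above, followed by an assumed (hypothetical) accept rule.
DEFAULT_RULES = [
    "# skip URLs containing certain characters as probable queries, etc.",
    r"-[?*!@=]",
    r"+^http://www\.butterflycluster\.com/",
]

# Same rules with the query-character rule commented out.
EDITED_RULES = [
    "# skip URLs containing certain characters as probable queries, etc.",
    r"# -[?*!@=]",
    r"+^http://www\.butterflycluster\.com/",
]

print(accepts(DEFAULT_RULES, URL))  # the '?' and '=' hit the '-' rule first
print(accepts(EDITED_RULES, URL))   # only the '+' rule is left to match
```

With the rule active, the `?` in the URL matches before any accept rule is reached, which would explain why the injected URL never gets fetched.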

Regards,

-vishal.
