Posted to user@nutch.apache.org by karthik085 <ka...@gmail.com> on 2007/08/11 00:44:14 UTC

wildcard urls

I am using Nutch 0.7.2. I would like to crawl only a certain section of a
website, namely URLs of this form:
http://domain.com/ID1124
http://domain.com/ID22351
http://domain.com/ID546
and so on.
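
To make the intent concrete, here is a minimal Python sketch of the set of URLs I mean. It is not Nutch configuration and says nothing about how Nutch handles patterns. The regular expression is an assumption on my part (that every wanted URL is "ID" followed only by digits), and the names ID_URL and wanted are made up for illustration.

```python
import re

# Assumption: a wanted URL is exactly http://domain.com/ID followed by
# one or more digits, with nothing after the digits.
ID_URL = re.compile(r"http://domain\.com/ID\d+")


def wanted(url: str) -> bool:
    """Return True if the whole URL matches the assumed ID pattern."""
    return ID_URL.fullmatch(url) is not None


if __name__ == "__main__":
    samples = [
        "http://domain.com/ID1124",
        "http://domain.com/ID22351",
        "http://domain.com/ID546",
        "http://domain.com/about",
        "http://domain.com/ID*",
    ]
    for url in samples:
        print(url, wanted(url))
```

Under this assumption, the literal string http://domain.com/ID* is not itself one of the pages I want; it only stands for the whole family of ID URLs.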

I tried feeding in just this one line:
http://domain.com/ID*
(I added it to url.txt and fed that file to Nutch), but that didn't work.
It would be difficult to generate a list of all the IDs from the website and
feed that static list to Nutch.

Does Nutch accept wildcards in URLs? If so, how can I get this working? If
not, are there any workarounds?

My crawl-filter works well: when I passed in just http://domain.com/ID546, I
was able to retrieve that page.
Thanks.