Posted to user@nutch.apache.org by Vinci <vi...@polyu.edu.hk> on 2008/01/30 21:13:48 UTC

Can Nutch use part of the url found for the next crawling?

hi,

I'm having some trouble with a site that does content redirection: Nutch can't
crawl the site itself, but it can crawl the site's RSS feed. Unfortunately,
the links in the feed redirect back to the site. The good news is that the
link I actually want appears inside each feed link as a GET parameter:
http://site/disallowpart?url=the_link_i_want

I see there are things called url-filter and regex-filter. Can either of
them help me extract the_link_i_want?

Thank you.
-- 
View this message in context: http://www.nabble.com/Can-Nutch-use-part-of-the-url-found-for-the-next-crawling--tp15190975p15190975.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Can Nutch use part of the url found for the next crawling?

Posted by Susam Pal <su...@gmail.com>.
crawl-urlfilter.txt and regex-urlfilter.txt are used to include or exclude
certain URLs from the crawl. They do not let you extract one URL from inside
another. You might want to use conf/regex-normalize.xml to do this.
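As a minimal sketch, a rule like the following in conf/regex-normalize.xml
should rewrite the redirect URL into the embedded one. Note that "site" and
"disallowpart" are placeholders taken from your example; you would need to
adjust the pattern to the real host and path:

```xml
<regex-normalize>
  <!-- Hypothetical rule: rewrite http://site/disallowpart?url=X to X.
       $1 refers to the first capture group in the pattern. -->
  <regex>
    <pattern>^http://site/disallowpart\?url=(.*)$</pattern>
    <substitution>$1</substitution>
  </regex>
</regex-normalize>
```

Also make sure the urlnormalizer-regex plugin is enabled in your
plugin.includes property, or the rules will not be applied.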

Regards,
Susam Pal
