Posted to user@nutch.apache.org by Beats <ta...@yahoo.com> on 2009/07/10 15:41:57 UTC
How to allow every URL to be accepted
Hi,
I want every URL that I list in the seed list to be crawled.
Please tell me how to allow every URL.
Thanks in advance.
--
View this message in context: http://www.nabble.com/how-to-allow-every-url-to-b-accepted-tp24427859p24427859.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to allow every URL to be accepted
Posted by lei wang <nu...@gmail.com>.
Change crawl-urlfilter.txt to this:
===========================================
# skip URLs containing certain characters as probable queries, etc.
-[ ]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept everything else
+^.*
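
To see what these rules do before starting a crawl, here is a rough Python sketch of how the filter file is read: rules are tried top to bottom, the first pattern found in the URL decides (+ accepts, - rejects), and a URL that matches nothing is dropped. This is only an approximation and not Nutch's own code; the names parse_rules and accepts and the example.com URLs are made up for illustration, and Python's re module is not identical to Java's regex engine.

```python
import re

# Rules as posted above, with the wrapped "loops" comment joined onto one line.
RULES_TEXT = r"""
# skip URLs containing certain characters as probable queries, etc.
-[ ]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept everything else
+^.*
"""


def parse_rules(text):
    """Turn filter text into a list of (accept, compiled_pattern) pairs.

    Hypothetical helper: blank lines and '#' comments are skipped, and every
    other line must start with '+' or '-'.
    """
    rules = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': %r" % line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched, so the URL is ignored


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in (
        "http://example.com/page.html?id=3",
        "http://example.com/a b",
        "http://example.com/a/b/a/c/a/",
    ):
        print(url, "->", "accepted" if accepts(RULES, url) else "rejected")
```

With the rules as posted, a URL with a query string is accepted, while a URL containing a space or a path segment that keeps repeating is still rejected. If you really want every seed URL through, the two "-" rules would have to go as well. The sketch also rejects a bare line such as "loops" as an invalid rule, which is why the wrapped comment has to stay on one line.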