Posted to user@nutch.apache.org by Beats <ta...@yahoo.com> on 2009/07/10 15:41:57 UTC

How to allow every URL to be accepted

Hi,

I want every URL that I list in my seed list to be crawled.

Please tell me how to allow every URL.


Thanks in advance.


-- 
View this message in context: http://www.nabble.com/how-to-allow-every-url-to-b-accepted-tp24427859p24427859.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: How to allow every URL to be accepted

Posted by lei wang <nu...@gmail.com>.
Change crawl-urlfilter.txt to this:
===========================================
# skip URLs containing certain characters as probable queries, etc.
-[ ]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept every URL (replaces the default MY.DOMAIN.NAME host rule)
+^.*
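
To see which URLs these rules let through, here is a rough Python sketch of how I understand the filter to evaluate them: rules are tried in file order, the first pattern found anywhere in the URL decides, "+" accepts and "-" rejects, and a URL that matches no rule is dropped. The accepts() helper, the RULES list and the example.com URLs are made up for illustration; this is not Nutch code, and Java and Python regex behaviour can differ in edge cases.

```python
import re

# Rules copied from the filter file above, in file order: (sign, pattern).
RULES = [
    ("-", r"[ ]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^.*"),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False


if __name__ == "__main__":
    for url in [
        "http://example.com/",
        "http://example.com/search?q=nutch",
        "http://example.com/a b",
        "http://example.com/x/y/x/y/x/",
    ]:
        print(url, "->", "accepted" if accepts(url) else "rejected")
```

Under these assumptions, the last rule accepts anything that the two "-" rules above it did not already reject, so the rule order matters.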

