Posted to user@nutch.apache.org by raviksingh <ra...@gmail.com> on 2013/03/05 16:22:52 UTC

Continue Nutch Crawling After Exception

I am new to Nutch. I have already configured Nutch with MySQL. I have a few
questions:

1. Currently I am crawling all the domains from my SEED.TXT. If an
exception occurs, the crawl stops and some domains are not crawled, just
because of one domain/webpage. Is there a way to force Nutch to continue
crawling after an exception occurs?

2. I want domains/URLs to be crawled from the DB. Currently I am reading from
the DB and writing to SEED.TXT before starting to crawl. Is there a better way?

3. Is there a way to provide a URLFilter for scanning/restricting a particular
domain/URL programmatically? I have checked org.apache.nutch.net.URLFilter,
but I was unable to make it work.

Please ask if any further details are required.




Re: Continue Nutch Crawling After Exception

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi,

On Tue, Mar 5, 2013 at 7:22 AM, raviksingh <ra...@gmail.com> wrote:

> I am new to Nutch. I have already configured Nutch with MySQL. I have a few
> questions:
>

I would like to start by saying that this (running Nutch on MySQL) is not a
great idea. If you read the archives of this list you will see why.


>
> 1. Currently I am crawling all the domains from my SEED.TXT. If an
> exception occurs, the crawl stops and some domains are not crawled, just
> because of one domain/webpage. Is there a way to force Nutch to continue
> crawling after an exception occurs?
>

What are the exceptions?


>
> 2. I want domains/URLs to be crawled from the DB. Currently I am reading from
> the DB and writing to SEED.TXT before starting to crawl. Is there a better way?
>

Not yet. This has also been discussed pretty thoroughly on this list.


>
> 3. Is there a way to provide a URLFilter for scanning/restricting a particular
> domain/URL programmatically? I have checked org.apache.nutch.net.URLFilter,
> but I was unable to make it work.
>

Please give an example of what you are trying to do here. Are you using the
de facto scripts provided with Nutch, or something else, to run your Nutch
server?
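
To help frame the question: as far as I know, the contract of
URLFilter.filter(String) is to return the URL to keep it and null to drop
it. Below is a minimal standalone Java sketch of a domain allow-list written
under that assumption. The class name DomainAllowFilter and the host
example.org are made up for illustration, and the class deliberately does
not implement the Nutch interface, so it is not a working plugin as written.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

/**
 * Standalone sketch only: does NOT implement org.apache.nutch.net.URLFilter.
 * It mirrors the assumed contract of filter(String):
 * return the URL to keep it, or null to drop it.
 */
public class DomainAllowFilter {

    // Hypothetical allow-list of host names, expected in lower case.
    private final Set<String> allowedHosts;

    public DomainAllowFilter(Set<String> allowedHosts) {
        this.allowedHosts = allowedHosts;
    }

    /** Keeps a URL whose host is an allowed host or a subdomain of one. */
    public String filter(String urlString) {
        try {
            String host = new URI(urlString).getHost();
            if (host == null) {
                return null; // no host component: drop
            }
            host = host.toLowerCase(Locale.ROOT);
            for (String allowed : allowedHosts) {
                if (host.equals(allowed) || host.endsWith("." + allowed)) {
                    return urlString;
                }
            }
            return null; // host not on the allow-list: drop
        } catch (URISyntaxException e) {
            // A malformed URL is dropped instead of aborting the whole run.
            return null;
        }
    }

    public static void main(String[] args) {
        DomainAllowFilter f = new DomainAllowFilter(Set.of("example.org"));
        System.out.println(f.filter("http://www.example.org/page"));
        System.out.println(f.filter("http://other.example.com/"));
    }
}
```

Returning null for a malformed URL, rather than throwing, is a design choice
in this sketch so that one bad URL does not stop everything else. A real
filter would also have to be packaged as a plugin and enabled in the Nutch
configuration, which this sketch does not cover.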
-- 
*Lewis*