Posted to user@nutch.apache.org by raviksingh <ra...@gmail.com> on 2013/03/05 16:22:52 UTC
Continue Nutch Crawling After Exception
I am new to Nutch. I have already configured Nutch with MySQL. I have a few
questions:
1. Currently I am crawling all the domains from my SEED.TXT. If an
exception occurs, the crawl stops and some domains are not crawled, just
because of one domain/webpage. Is there a way to force Nutch to continue
crawling after an exception occurs?
2. I want domains/URLs to be crawled from the DB. Currently I am reading from
the DB and writing to SEED.TXT before starting to crawl. Is there a better way?
3. Is there a way to provide a URLFilter for scanning/restricting a particular
domain/URL programmatically? I have checked org.apache.nutch.net.URLFilter, but
I was unable to make it work.
Please ask if any further details are required.
--
View this message in context: http://lucene.472066.n3.nabble.com/Continue-Nutch-Crawling-After-Exception-tp4044888.html
Re: Continue Nutch Crawling After Exception
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,
On Tue, Mar 5, 2013 at 7:22 AM, raviksingh <ra...@gmail.com> wrote:
> I am new to Nutch. I have already configured Nutch with MySQL. I have a few
> questions:
>
I would like to start by saying that this is not a great idea. If you read
this list you will see why.
>
> 1. Currently I am crawling all the domains from my SEED.TXT. If an
> exception occurs, the crawl stops and some domains are not crawled, just
> because of one domain/webpage. Is there a way to force Nutch to continue
> crawling after an exception occurs?
>
What are the exceptions?
>
> 2. I want domains/URLs to be crawled from the DB. Currently I am reading from
> the DB and writing to SEED.TXT before starting to crawl. Is there a better way?
>
Not yet. This has also been discussed pretty thoroughly.
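For reference, the interim approach described in the question (read URLs from the database, write them to the seed file, then start the crawl) can be sketched roughly as below. This is only an illustration under stated assumptions: write_seed_file is a hypothetical helper and not part of Nutch, the URLs are assumed to come from whatever database query is already in use, and the only Nutch-related fact relied on is that a seed file lists one URL per line.

```python
from pathlib import Path
from typing import Iterable


def write_seed_file(urls: Iterable[str], seed_path: str) -> int:
    """Write one URL per line to seed_path and return the number written.

    Hypothetical helper (not a Nutch API). `urls` can be any iterable of
    strings, for example values fetched from a database cursor.
    """
    seen = set()
    lines = []
    for url in urls:
        url = (url or "").strip()
        # Skip blank/NULL values and duplicates, keeping first-seen order.
        if not url or url in seen:
            continue
        seen.add(url)
        lines.append(url)

    path = Path(seed_path)
    # Create the seed directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

Run before each crawl, this regenerates the seed file from the current database contents; skipping blanks and duplicates keeps a bad row from producing an empty or repeated seed entry.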
>
> 3. Is there a way to provide a URLFilter for scanning/restricting a particular
> domain/URL programmatically? I have checked org.apache.nutch.net.URLFilter, but
> I was unable to make it work.
>
>
Please give an example of what you are trying to do here. Are you using the
de facto scripts provided with Nutch or something else to run your Nutch
server?
--
*Lewis*