Posted to user@nutch.apache.org by "tamanjit.bindra@yahoo.co.in" <ta...@yahoo.co.in> on 2011/06/29 19:15:43 UTC

No more urls to fetch

Hi,
I was going through past threads and found that the problem I am facing has been faced by many others, but in most cases it was either ignored or left unresolved.

I use Nutch 1.1. My crawl has mostly been working fine (though I am still getting the hang of how all the pieces fit together).

I have a particular URL which I generally need to crawl more often than the others (it is a sitemap). So I cleaned up my Solr index for that domain in order to restart (my index had a lot of 404 URLs which were not getting cleaned up), i.e. I deleted all the documents belonging to the domain of the URL I need to fetch.
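The cleanup itself was just a delete-by-query against Solr; roughly something like the following, assuming Solr runs at localhost:8983 and the host field from the Nutch Solr schema (www.example.com is a placeholder for the actual domain):

    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
         --data-binary '<delete><query>host:www.example.com</query></delete>'
    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
         --data-binary '<commit/>'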

I deleted everything from the crawl folder, so everything is fresh.

I start off a crawl with depth = 1, topN = 1000 and 10 threads.
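In Nutch 1.1 that is the one-step crawl command; a sketch of what I run (urls/$folder/urls.txt is my seed list and crawl/$folder my crawl directory):

    bin/nutch crawl urls/$folder/urls.txt -dir crawl/$folder -depth 1 -topN 1000 -threads 10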

It fetched a lot of the site into the index (though not everything). So I repeated the same crawl command another 7-8 times, and the number of docs in the index kept increasing.

But then, this last time when I try running the crawl, it fails at depth 0 with the message

Stopping at depth=0 - no more URLs to fetch.
    No URLs to fetch - check your seed list and URL filters.
    crawl finished:


I cleaned up everything again and started from scratch; crawling started off again, only to fail once more after a few initial successful crawls.

Awaiting your reply.


Re: No more urls to fetch

Posted by "tamanjit.bindra@yahoo.co.in" <ta...@yahoo.co.in>.
I forgot to mention that no changes were made to either crawl-urlfilter.txt or regex-urlfilter.txt between a successful crawl and a crawl that ends with the message "no more URLs to fetch". Here is the output of the failing run:

rootUrlDir = urls/$folder/urls.txt
threads = 10
depth = 1
indexer=lucene
topN = 1500
Injector: starting
Injector: crawlDb: crawl/$folder/crawldb
Injector: urlDir: urls/$folder/urls.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 1500
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl/$folder
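If it helps, I can also dump the crawldb statistics with readdb and post them; a sketch, using the crawldb path from the log above:

    bin/nutch readdb crawl/$folder/crawldb -stats

That should show how many entries are db_unfetched versus db_fetched, i.e. whether the generator really has nothing left that is due for fetching.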


Still awaiting a reply...........


Re: No more urls to fetch

Posted by "tamanjit.bindra@yahoo.co.in" <ta...@yahoo.co.in>.
I am not sure if this is the real solution, but my understanding is that the problem occurs at the time of reading from the crawldb.

I suspect the crawldb files may have been corrupted (I am not sure here). So I deleted the crawldb folder and it worked, though the crawl then starts from scratch, i.e. it re-crawls all the pages it has already crawled.
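Before deleting the crawldb, its entries can also be inspected to check whether they are really corrupted or simply not yet due for re-fetching; a sketch (crawldb_dump is just an output directory name, and the sitemap URL is a placeholder):

    bin/nutch readdb crawl/$folder/crawldb -dump crawldb_dump
    bin/nutch readdb crawl/$folder/crawldb -url http://www.example.com/sitemap.xml

The dumped records include a status and fetch time for each URL, which shows whether the generator is skipping them because they are not yet due.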

Is this the only solution? Could someone more experienced please weigh in?
