Posted to user@nutch.apache.org by Amit Sela <am...@infolinks.com> on 2013/02/26 10:19:27 UTC

Only a small portion of URLs is indexed in Solr at the end of the crawl

Hi all,

I'm running Nutch 1.6 and Solr 3.6.2, crawling with depth 1, topN
1000000, and 'db.update.additions.allowed' set to false.
The idea is to fetch, parse and index only the URLs in the seed list.

I seed ~120K URLs, but in Solr I see only ~20K indexed.

The fetch job counters show:

moved 49,937
robots_denied 1,149
robots_denied_maxcrawldelay 267
hitByTimeLimit 6,072
exception 4,479
notmodified 2
access_denied 4
temp_moved 4,658
success 23,033
notfound 1,658

and the ParserStatus success count is 22844

What happened to the rest of the URLs? They are all active URLs, not some
old list...

Thanks,

Amit.

Re: Only a small portion of URLs is indexed in Solr at the end of the crawl

Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
On 26.02.2013 10:19, Amit Sela wrote:
> Hi all,
>
> I'm running Nutch 1.6 and Solr 3.6.2, crawling with depth 1, topN
> 1000000, and 'db.update.additions.allowed' set to false.
> The idea is to fetch, parse and index only the URLs in the seed list.
>
> I seed ~120K URLs but in solr I see only ~20K indexed.
>
> The fetch job counters show:
>
> moved 49,937 -> redirects, I think (not followed during the fetch; there is a Nutch property, http.redirect.max, that allows this)
> robots_denied 1,149 -> forbidden by the robots.txt of the seed URL's host
> robots_denied_maxcrawldelay 267 -> skipped because the Crawl-delay in the host's robots.txt is longer than Nutch accepts (fetcher.max.crawl.delay)
> hitByTimeLimit 6,072 -> still queued when the fetcher time limit (fetcher.timelimit.mins) ran out, so never fetched
> exception 4,479 -> other errors
> notmodified 2
> access_denied 4 -> login needed
> temp_moved 4,658 -> temporary redirects (not followed either; the same Nutch property applies)
> success 23,033 -> your ~20K, which are indexed
> notfound 1,658 -> 404
By the way, if you crawl with a depth of just 1, you don't need to
specify a topN, because you will always crawl just the seed URLs.
>
> and the ParserStatus success count is 22844
>
> What happened to the rest of the URLs? They are all active URLs, not some
> old list...
>
> Thanks,
>
> Amit.
>
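
As a rough sanity check on the counters quoted above, here is a small Python sketch that sums them and compares the total with the seed count. It is not part of Nutch; the helper name summarize and the 120,000 seed figure (rounded from "~120K") are assumptions for illustration only.

```python
# Fetch job counters copied from the message above.
fetch_counters = {
    "moved": 49937,
    "robots_denied": 1149,
    "robots_denied_maxcrawldelay": 267,
    "hitByTimeLimit": 6072,
    "exception": 4479,
    "notmodified": 2,
    "access_denied": 4,
    "temp_moved": 4658,
    "success": 23033,
    "notfound": 1658,
}

# Assumption: "~120K" in the original post, so this is only approximate.
SEEDED_APPROX = 120_000


def summarize(counters, seeded):
    """Hypothetical helper: total the counters and group the redirects."""
    total = sum(counters.values())
    redirects = counters["moved"] + counters["temp_moved"]
    return {
        "total": total,
        "redirects": redirects,
        "success": counters["success"],
        "unaccounted": seeded - total,
    }


summary = summarize(fetch_counters, SEEDED_APPROX)
for name, value in summary.items():
    print(f"{name:12s} {value:>8,d}")
```

The counters add up to 91,259, of which 54,595 are redirects and 23,033 are successes. Roughly 29K of the ~120K seeds do not appear in the fetch counters at all, so the redirects alone do not explain the whole gap.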


-- 
Stefan Scheffler
Avantgarde Labs GmbH
Löbauer Straße 19, 01099 Dresden
Phone: + 49 (0) 351 21590834
Email: sscheffler@avantgarde-labs.de