Posted to user@nutch.apache.org by Amit Sela <am...@infolinks.com> on 2013/02/28 19:51:06 UTC

Fetching of URLs from seed list ends up with only a small portion of them indexed by Solr

Hi everyone,

I'm running with nutch 1.6 and Solr 3.6.2.
I'm trying to crawl only the seed list (depth 1) and it seems that the
process ends with only ~255 of the URLs indexed in Solr.

Seed list is about 120K.
Fetcher map input is 117K, of which 62K are success and 45K are temp_moved.
Parse shows success of 62K.
CrawlDB after the fetch shows db_redir_perm=56K, db_unfetched=27K
and db_fetched=22K.

And finally IndexerStatus shows 20K documents added.
What am I missing?

Thanks!

my nutch-site.xml includes:
-----------------------------------------
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)i</value>
<name>metatags.names</name>
<value>keywords;Keywords;description;Description</value>
<name>index.parse.md</name>
<value>metatag.keywords,metatag.Keywords,metatag.description,metatag.Description</value>
<name>db.update.additions.allowed</name>
<value>false</value>
<name>generate.count.mode</name>
<value>domain</value>
<name>partition.url.mode</name>
<value>byDomain</value>
<name>file.content.limit</name>
<value>262144</value>
<name>http.content.limit</name>
<value>262144</value>
<name>parse.filter.urls</name>
<value>true</value>
<name>parse.normalize.urls</name>
<value>true</value>
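
(For anyone reproducing this configuration: the pairs above are listed without their wrappers. In an actual nutch-site.xml each pair sits inside a <property> element under the <configuration> root, roughly as in this sketch of two of the settings above:)

<configuration>
  <property>
    <name>generate.count.mode</name>
    <value>domain</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>262144</value>
  </property>
  <!-- ...remaining properties from the list above... -->
</configuration>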

Re: Fetching of URLs from seed list ends up with only a small portion of them indexed by Solr

Posted by Amit Sela <am...@infolinks.com>.
I tried setting http.redirect.max=30 (since I saw there is a bug that prevents
setting -1 to mean unlimited), but it didn't make much difference. It did help a
little: I now get ~28K, but that's still less than half...
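
(A minimal sketch of that override as it would appear in nutch-site.xml, using the value mentioned above; the description text here is paraphrased, not the stock one:)

<property>
  <name>http.redirect.max</name>
  <value>30</value>
  <description>Follow up to 30 redirects at fetch time instead of only
  recording the redirect target in the CrawlDb for a later round.</description>
</property>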

On Sat, Mar 2, 2013 at 9:00 AM, Stefan Scheffler <
sscheffler@avantgarde-labs.de> wrote:

> Hi Amit.
> As I answered you before, there is a config parameter to activate the
> crawling of redirections (db_redir_temp 4,770, db_redir_perm 56,810). You
> have to activate this in nutch-site.xml.
> Please have a look at nutch-default.xml to find out which one it is...
> Only the pages with db_fetched will be indexed.
>
> Regards
> Stefan
>
> On 02.03.2013 01:01, Amit Sela wrote:
>
>  I am using the crawl script that executes Solr indexing with:
>>    $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb
>> $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
>> and then executes Solr dedup:
>>    $bin/nutch solrdedup $SOLRURL
>>
>> I think it has something to do with the CrawlDB job. The job counters
>> show:
>> db_redir_temp 4,770
>> db_redir_perm 56,810
>> db_notmodified 5,343
>> db_unfetched 27,385
>> db_gone  3,741
>> db_fetched 22,065
>>
>>
>> On Thu, Feb 28, 2013 at 10:02 PM, kiran chitturi
>> <ch...@gmail.com> wrote:
>>
>>  This looks odd. From what i know, the successfully parsed documents are
>>> sent to Solr. Did you check the logs for any exceptions ?
>>>
>>> What command are you using to index ?
>>>
>>>
>>> On Thu, Feb 28, 2013 at 1:51 PM, Amit Sela <am...@infolinks.com> wrote:
>>>
>>>  Hi everyone,
>>>>
>>>> I'm running with nutch 1.6 and Solr 3.6.2.
>>>> I'm trying to crawl only the seed list (depth 1) and it seems that the
>>>> process ends with only ~255 of the URLs indexed in Solr.
>>>>
>>>> Seed list is about 120K.
>>>> Fetcher map input is 117K where success is 62K and temp_moved 45K.
>>>> Parse shows success of 62K.
>>>> CrawlDB after the fetch shows db_redir_perm=56K, db_unfetched=27K
>>>> and db_fetched=22K.
>>>>
>>>> And finally IndexerStatus shows 20K documents added.
>>>> What am I missing ?
>>>>
>>>> Thanks!
>>>>
>>>> my nutch-site.xml includes:
>>>> -----------------------------------------
>>>> <name>plugin.includes</name>
>>>>
>>>>
>>>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)i</value>
>>>>
>>>> <name>metatags.names</name>
>>>> <value>keywords;Keywords;description;Description</value>
>>>> <name>index.parse.md</name>
>>>>
>>>>
>>>> <value>metatag.keywords,metatag.Keywords,metatag.description,metatag.Description</value>
>>>>
>>>> <name>db.update.additions.allowed</name>
>>>> <value>false</value>
>>>> <name>generate.count.mode</name>
>>>> <value>domain</value>
>>>> <name>partition.url.mode</name>
>>>> <value>byDomain</value>
>>>> <name>file.content.limit</name>
>>>> <value>262144</value>
>>>> <name>http.content.limit</name>
>>>> <value>262144</value>
>>>> <name>parse.filter.urls</name>
>>>> <value>true</value>
>>>> <name>parse.normalize.urls</name>
>>>> <value>true</value>
>>>>
>>>>
>>>
>>> --
>>> Kiran Chitturi
>>>
>>>
>

Re: Fetching of URLs from seed list ends up with only a small portion of them indexed by Solr

Posted by Stefan Scheffler <ss...@avantgarde-labs.de>.
Hi Amit.
As I answered you before, there is a config parameter to activate the
crawling of redirections (db_redir_temp 4,770, db_redir_perm 56,810).
You have to activate this in nutch-site.xml.
Please have a look at nutch-default.xml to find out which one it is...
Only the pages with db_fetched will be indexed.

Regards
Stefan
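
(Judging from the follow-up earlier in this archive, the parameter in question is http.redirect.max. Its entry in nutch-default.xml looks roughly like the sketch below; the description is paraphrased. With the default of 0 the fetcher does not follow redirects at fetch time but only records the target URL for a later generate/fetch round, so in a depth-1 crawl the ~56K permanently redirected seeds never reach db_fetched and are never indexed.)

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  fetching a page. If 0 or negative, redirects are not followed immediately;
  the redirect target is recorded for a later fetch round.</description>
</property>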

On 02.03.2013 01:01, Amit Sela wrote:
> I am using the crawl script that executes Solr indexing with:
>    $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb
> $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
> and then executes Solr dedup:
>    $bin/nutch solrdedup $SOLRURL
>
> I think it has something to do with the CrawlDB job. The job counters show:
> db_redir_temp 4,770
> db_redir_perm 56,810
> db_notmodified 5,343
> db_unfetched 27,385
> db_gone  3,741
> db_fetched 22,065
>
>
> On Thu, Feb 28, 2013 at 10:02 PM, kiran chitturi
> <ch...@gmail.com> wrote:
>
>> This looks odd. From what i know, the successfully parsed documents are
>> sent to Solr. Did you check the logs for any exceptions ?
>>
>> What command are you using to index ?
>>
>>
>> On Thu, Feb 28, 2013 at 1:51 PM, Amit Sela <am...@infolinks.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I'm running with nutch 1.6 and Solr 3.6.2.
>>> I'm trying to crawl only the seed list (depth 1) and it seems that the
>>> process ends with only ~255 of the URLs indexed in Solr.
>>>
>>> Seed list is about 120K.
>>> Fetcher map input is 117K where success is 62K and temp_moved 45K.
>>> Parse shows success of 62K.
>>> CrawlDB after the fetch shows db_redir_perm=56K, db_unfetched=27K
>>> and db_fetched=22K.
>>>
>>> And finally IndexerStatus shows 20K documents added.
>>> What am I missing ?
>>>
>>> Thanks!
>>>
>>> my nutch-site.xml includes:
>>> -----------------------------------------
>>> <name>plugin.includes</name>
>>>
>>>
>>> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)i</value>
>>> <name>metatags.names</name>
>>> <value>keywords;Keywords;description;Description</value>
>>> <name>index.parse.md</name>
>>>
>>>
>>> <value>metatag.keywords,metatag.Keywords,metatag.description,metatag.Description</value>
>>> <name>db.update.additions.allowed</name>
>>> <value>false</value>
>>> <name>generate.count.mode</name>
>>> <value>domain</value>
>>> <name>partition.url.mode</name>
>>> <value>byDomain</value>
>>> <name>file.content.limit</name>
>>> <value>262144</value>
>>> <name>http.content.limit</name>
>>> <value>262144</value>
>>> <name>parse.filter.urls</name>
>>> <value>true</value>
>>> <name>parse.normalize.urls</name>
>>> <value>true</value>
>>>
>>
>>
>> --
>> Kiran Chitturi
>>


Re: Fetching of URLs from seed list ends up with only a small portion of them indexed by Solr

Posted by Amit Sela <am...@infolinks.com>.
I am using the crawl script that executes Solr indexing with:
  $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb
$CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
and then executes Solr dedup:
  $bin/nutch solrdedup $SOLRURL

I think it has something to do with the CrawlDB job. The job counters show:
db_redir_temp 4,770
db_redir_perm 56,810
db_notmodified 5,343
db_unfetched 27,385
db_gone  3,741
db_fetched 22,065
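
(As a sanity check, these counters add up to the seed list size: 4,770 + 56,810 + 5,343 + 27,385 + 3,741 + 22,065 = 120,114, i.e. roughly the 120K seeds. Only the 22,065 db_fetched entries are candidates for indexing, which is consistent with the ~20K documents IndexerStatus reports, the remainder presumably lost to parse or indexing failures; the 56K permanent redirects account for most of the shortfall.)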


On Thu, Feb 28, 2013 at 10:02 PM, kiran chitturi
<ch...@gmail.com> wrote:

> This looks odd. From what i know, the successfully parsed documents are
> sent to Solr. Did you check the logs for any exceptions ?
>
> What command are you using to index ?
>
>
> On Thu, Feb 28, 2013 at 1:51 PM, Amit Sela <am...@infolinks.com> wrote:
>
> > Hi everyone,
> >
> > I'm running with nutch 1.6 and Solr 3.6.2.
> > I'm trying to crawl only the seed list (depth 1) and it seems that the
> > process ends with only ~255 of the URLs indexed in Solr.
> >
> > Seed list is about 120K.
> > Fetcher map input is 117K where success is 62K and temp_moved 45K.
> > Parse shows success of 62K.
> > CrawlDB after the fetch shows db_redir_perm=56K, db_unfetched=27K
> > and db_fetched=22K.
> >
> > And finally IndexerStatus shows 20K documents added.
> > What am I missing ?
> >
> > Thanks!
> >
> > my nutch-site.xml includes:
> > -----------------------------------------
> > <name>plugin.includes</name>
> >
> >
> > <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)i</value>
> > <name>metatags.names</name>
> > <value>keywords;Keywords;description;Description</value>
> > <name>index.parse.md</name>
> >
> >
> > <value>metatag.keywords,metatag.Keywords,metatag.description,metatag.Description</value>
> > <name>db.update.additions.allowed</name>
> > <value>false</value>
> > <name>generate.count.mode</name>
> > <value>domain</value>
> > <name>partition.url.mode</name>
> > <value>byDomain</value>
> > <name>file.content.limit</name>
> > <value>262144</value>
> > <name>http.content.limit</name>
> > <value>262144</value>
> > <name>parse.filter.urls</name>
> > <value>true</value>
> > <name>parse.normalize.urls</name>
> > <value>true</value>
> >
>
>
>
> --
> Kiran Chitturi
>

Re: Fetching of URLs from seed list ends up with only a small portion of them indexed by Solr

Posted by kiran chitturi <ch...@gmail.com>.
This looks odd. From what I know, the successfully parsed documents are
sent to Solr. Did you check the logs for any exceptions?

What command are you using to index?


On Thu, Feb 28, 2013 at 1:51 PM, Amit Sela <am...@infolinks.com> wrote:

> Hi everyone,
>
> I'm running with nutch 1.6 and Solr 3.6.2.
> I'm trying to crawl only the seed list (depth 1) and it seems that the
> process ends with only ~255 of the URLs indexed in Solr.
>
> Seed list is about 120K.
> Fetcher map input is 117K where success is 62K and temp_moved 45K.
> Parse shows success of 62K.
> CrawlDB after the fetch shows db_redir_perm=56K, db_unfetched=27K
> and db_fetched=22K.
>
> And finally IndexerStatus shows 20K documents added.
> What am I missing ?
>
> Thanks!
>
> my nutch-site.xml includes:
> -----------------------------------------
> <name>plugin.includes</name>
>
> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)i</value>
> <name>metatags.names</name>
> <value>keywords;Keywords;description;Description</value>
> <name>index.parse.md</name>
>
> <value>metatag.keywords,metatag.Keywords,metatag.description,metatag.Description</value>
> <name>db.update.additions.allowed</name>
> <value>false</value>
> <name>generate.count.mode</name>
> <value>domain</value>
> <name>partition.url.mode</name>
> <value>byDomain</value>
> <name>file.content.limit</name>
> <value>262144</value>
> <name>http.content.limit</name>
> <value>262144</value>
> <name>parse.filter.urls</name>
> <value>true</value>
> <name>parse.normalize.urls</name>
> <value>true</value>
>



-- 
Kiran Chitturi