Posted to user@nutch.apache.org by caezar <ca...@gmail.com> on 2009/10/27 15:34:24 UTC

Nutch indexes fewer pages than it fetches

Hi All,

I've got a strange problem: Nutch indexes far fewer URLs than it fetches. For
example, the URL
http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
I assume it fetched successfully, because in the fetch logs it is mentioned only
once:
2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching
http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm

But it was not sent to the indexer during the indexing phase (I'm using a custom
NutchIndexWriter and it logs every page for which its write method is
executed). What could be the possible reason? Is there a way to browse the crawldb
to ensure that the page was really fetched? What else could I check?

Thanks


Re: Nutch indexes fewer pages than it fetches

Posted by "J. Smith" <li...@gmail.com>.
The funny thing is that in my case I don't have any redirects, and somehow the
status is "Status: 1 (db_unfetched)" even though the content is fetched and
successfully parsed.

Anyway thanks for your solution.


caezar wrote:
> 
> If you read the thread up you'll see that thing is about pages with
> redirects.
> 
> I hadn't time to investigate this deeply, so decided just to count that it
> is a nutch issue and modify the nutch code.
> 
> Regarding the changes made - see my diff file attached.
>  http://old.nabble.com/file/p26543086/fetcher.diff fetcher.diff 
> 
> J. Smith wrote:
>> 
>> Yes, please. I'll be very grateful.
>> But also I'm curious why this heppaning... Maybe someone can explain?
>> 
> 
> 


Re: Nutch indexes fewer pages than it fetches

Posted by caezar <ca...@gmail.com>.
If you read up the thread you'll see that the issue is about pages with
redirects.

I didn't have time to investigate this deeply, so I decided to just treat it as
a Nutch issue and modify the Nutch code.

Regarding the changes made, see my attached diff file:
http://old.nabble.com/file/p26543086/fetcher.diff fetcher.diff

J. Smith wrote:
> 
> Yes, please. I'll be very grateful.
> But also I'm curious why this heppaning... Maybe someone can explain?
> 



Re: Nutch indexes fewer pages than it fetches

Posted by "J. Smith" <li...@gmail.com>.
Yes, please. I'd be very grateful.
But I'm also curious why this is happening... Maybe someone can explain?


caezar wrote:
> 
> I've solved this problem by modifying nutch code. If this solution
> acceptable for you I can provide the details
> 
> J. Smith wrote:
>> 
>> Does anybody know how to solve this problem?
>> 
>> 
>> 
> 
> 



Re: Nutch indexes fewer pages than it fetches

Posted by caezar <ca...@gmail.com>.
I've solved this problem by modifying the Nutch code. If this solution is
acceptable to you, I can provide the details.

J. Smith wrote:
> 
> Does anybody know how to solve this problem?
> 
> 
> 



Re: Nutch indexes fewer pages than it fetches

Posted by "J. Smith" <li...@gmail.com>.
Does anybody know how to solve this problem?




Re: Nutch indexes fewer pages than it fetches

Posted by caezar <ca...@gmail.com>.
Status: 5 (db_redir_perm) for the redirect source
and
Status: 2 (db_fetched) for the redirect target
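
For reference, these come from crawldb lookups along these lines (the crawldb path is only an example):

bin/nutch readdb crawl/crawldb -url http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
bin/nutch readdb crawl/crawldb -url http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm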

reinhard schwab wrote:
> 
> what is in the crawl db?
> 
> reinhard@thord:>bin/nutch readdb  <crawldb> -url <url>
> 
> 
> caezar schrieb:
>> No, problem is not solved. Everything happens as you described, but page
>> is
>> not indexed, because of condition:
>>     if (fetchDatum == null || dbDatum == null
>>         || parseText == null || parseData == null) {
>>       return;                                     // only have inlinks
>>     }
>> in IndexerMapReduce code. For this page dbDatum is null, so it is not
>> indexed!
>>
>> reinhard schwab wrote:
>>   
>>> is your problem solved now???
>>>
>>> this can be ok.
>>> new discovered urls will be added to a segment when fetched documents
>>> are parsed and if these urls pass the filters.
>>> they will not have a crawl datum Generate because they are unknown until
>>> they are extracted.
>>>
>>> regards
>>>
>>> caezar schrieb:
>>>     
>>>> I've compared the segments data of the URL which have no redirect and
>>>> was
>>>> indexed correctly, with this "bad" URL, and there is really a
>>>> difference.
>>>> First one have db record in the segment:
>>>> Crawl Generate::
>>>> Version: 7
>>>> Status: 1 (db_unfetched)
>>>> Fetch time: Wed Oct 28 16:01:05 EET 2009
>>>> Modified time: Thu Jan 01 02:00:00 EET 1970
>>>> Retries since fetch: 0
>>>> Retry interval: 2592000 seconds (30 days)
>>>> Score: 1.0
>>>> Signature: null
>>>> Metadata: _ngt_: 1256738472613
>>>>  
>>>> But second one have no such record, which seems pretty fine: it was not
>>>> added to the segment on generate stage, it was added on the fetch
>>>> stage.
>>>> Is
>>>> this a bug in Nutch? Or I'm missing some configuration option?
>>>>
>>>> caezar wrote:
>>>>   
>>>>       
>>>>> I'm pretty sure that I ran both commands before indexing
>>>>>
>>>>> Andrzej Bialecki wrote:
>>>>>     
>>>>>         
>>>>>> caezar wrote:
>>>>>>       
>>>>>>           
>>>>>>> Some more information. Debugging reduce method I've noticed, that
>>>>>>> before
>>>>>>> code
>>>>>>>     if (fetchDatum == null || dbDatum == null
>>>>>>>         || parseText == null || parseData == null) {
>>>>>>>       return;                                     // only have
>>>>>>> inlinks
>>>>>>>     }
>>>>>>> my page has fetchDatum, parseText and parseData not null, but
>>>>>>> dbDatum
>>>>>>> is
>>>>>>> null. Thats why it's skipped :) 
>>>>>>> Any ideas about the reason?
>>>>>>>         
>>>>>>>             
>>>>>> Yes - you should run updatedb with this segment, and also run 
>>>>>> invertlinks with this segment, _before_ trying to index. Otherwise
>>>>>> the 
>>>>>> db status won't be updated properly.
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Best regards,
>>>>>> Andrzej Bialecki     <><
>>>>>>   ___. ___ ___ ___ _ _   __________________________________
>>>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>>>
>>>>>>
>>>>>>
>>>>>>       
>>>>>>           
>>>>>     
>>>>>         
>>>>   
>>>>       
>>>
>>>     
>>
>>   
> 
> 
> 



Re: Nutch indexes fewer pages than it fetches

Posted by reinhard schwab <re...@aon.at>.
what is in the crawl db?

reinhard@thord:>bin/nutch readdb  <crawldb> -url <url>


caezar schrieb:
> No, problem is not solved. Everything happens as you described, but page is
> not indexed, because of condition:
>     if (fetchDatum == null || dbDatum == null
>         || parseText == null || parseData == null) {
>       return;                                     // only have inlinks
>     }
> in IndexerMapReduce code. For this page dbDatum is null, so it is not
> indexed!
>
> reinhard schwab wrote:
>   
>> is your problem solved now???
>>
>> this can be ok.
>> new discovered urls will be added to a segment when fetched documents
>> are parsed and if these urls pass the filters.
>> they will not have a crawl datum Generate because they are unknown until
>> they are extracted.
>>
>> regards
>>
>> caezar schrieb:
>>     
>>> I've compared the segments data of the URL which have no redirect and was
>>> indexed correctly, with this "bad" URL, and there is really a difference.
>>> First one have db record in the segment:
>>> Crawl Generate::
>>> Version: 7
>>> Status: 1 (db_unfetched)
>>> Fetch time: Wed Oct 28 16:01:05 EET 2009
>>> Modified time: Thu Jan 01 02:00:00 EET 1970
>>> Retries since fetch: 0
>>> Retry interval: 2592000 seconds (30 days)
>>> Score: 1.0
>>> Signature: null
>>> Metadata: _ngt_: 1256738472613
>>>  
>>> But second one have no such record, which seems pretty fine: it was not
>>> added to the segment on generate stage, it was added on the fetch stage.
>>> Is
>>> this a bug in Nutch? Or I'm missing some configuration option?
>>>
>>> caezar wrote:
>>>   
>>>       
>>>> I'm pretty sure that I ran both commands before indexing
>>>>
>>>> Andrzej Bialecki wrote:
>>>>     
>>>>         
>>>>> caezar wrote:
>>>>>       
>>>>>           
>>>>>> Some more information. Debugging reduce method I've noticed, that
>>>>>> before
>>>>>> code
>>>>>>     if (fetchDatum == null || dbDatum == null
>>>>>>         || parseText == null || parseData == null) {
>>>>>>       return;                                     // only have inlinks
>>>>>>     }
>>>>>> my page has fetchDatum, parseText and parseData not null, but dbDatum
>>>>>> is
>>>>>> null. Thats why it's skipped :) 
>>>>>> Any ideas about the reason?
>>>>>>         
>>>>>>             
>>>>> Yes - you should run updatedb with this segment, and also run 
>>>>> invertlinks with this segment, _before_ trying to index. Otherwise the 
>>>>> db status won't be updated properly.
>>>>>
>>>>>
>>>>> -- 
>>>>> Best regards,
>>>>> Andrzej Bialecki     <><
>>>>>   ___. ___ ___ ___ _ _   __________________________________
>>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>>
>>>>>
>>>>>
>>>>>       
>>>>>           
>>>>     
>>>>         
>>>   
>>>       
>>
>>     
>
>   


Re: Nutch indexes fewer pages than it fetches

Posted by caezar <ca...@gmail.com>.
No, the problem is not solved. Everything happens as you described, but the page is
not indexed because of the condition:
    if (fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null) {
      return;                                     // only have inlinks
    }
in the IndexerMapReduce code. For this page dbDatum is null, so it is not
indexed!

reinhard schwab wrote:
> 
> is your problem solved now???
> 
> this can be ok.
> new discovered urls will be added to a segment when fetched documents
> are parsed and if these urls pass the filters.
> they will not have a crawl datum Generate because they are unknown until
> they are extracted.
> 
> regards
> 
> caezar schrieb:
>> I've compared the segments data of the URL which have no redirect and was
>> indexed correctly, with this "bad" URL, and there is really a difference.
>> First one have db record in the segment:
>> Crawl Generate::
>> Version: 7
>> Status: 1 (db_unfetched)
>> Fetch time: Wed Oct 28 16:01:05 EET 2009
>> Modified time: Thu Jan 01 02:00:00 EET 1970
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata: _ngt_: 1256738472613
>>  
>> But second one have no such record, which seems pretty fine: it was not
>> added to the segment on generate stage, it was added on the fetch stage.
>> Is
>> this a bug in Nutch? Or I'm missing some configuration option?
>>
>> caezar wrote:
>>   
>>> I'm pretty sure that I ran both commands before indexing
>>>
>>> Andrzej Bialecki wrote:
>>>     
>>>> caezar wrote:
>>>>       
>>>>> Some more information. Debugging reduce method I've noticed, that
>>>>> before
>>>>> code
>>>>>     if (fetchDatum == null || dbDatum == null
>>>>>         || parseText == null || parseData == null) {
>>>>>       return;                                     // only have inlinks
>>>>>     }
>>>>> my page has fetchDatum, parseText and parseData not null, but dbDatum
>>>>> is
>>>>> null. Thats why it's skipped :) 
>>>>> Any ideas about the reason?
>>>>>         
>>>> Yes - you should run updatedb with this segment, and also run 
>>>> invertlinks with this segment, _before_ trying to index. Otherwise the 
>>>> db status won't be updated properly.
>>>>
>>>>
>>>> -- 
>>>> Best regards,
>>>> Andrzej Bialecki     <><
>>>>   ___. ___ ___ ___ _ _   __________________________________
>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>
>>>>
>>>>
>>>>       
>>>     
>>
>>   
> 
> 
> 



Re: Nutch indexes fewer pages than it fetches

Posted by reinhard schwab <re...@aon.at>.
Is your problem solved now???

This can be OK.
Newly discovered URLs will be added to a segment when fetched documents
are parsed, provided these URLs pass the filters.
They will not have a Crawl Generate datum because they are unknown until
they are extracted.
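
A rough way to see how many URLs a segment picked up beyond what was generated for it is to list the segment (the segment path below is only an example):

bin/nutch readseg -list crawl/segments/20091026100146

The generated, fetched and parsed counts it prints give a quick picture of what each segment actually contains.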

regards

caezar schrieb:
> I've compared the segments data of the URL which have no redirect and was
> indexed correctly, with this "bad" URL, and there is really a difference.
> First one have db record in the segment:
> Crawl Generate::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Oct 28 16:01:05 EET 2009
> Modified time: Thu Jan 01 02:00:00 EET 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1256738472613
>  
> But second one have no such record, which seems pretty fine: it was not
> added to the segment on generate stage, it was added on the fetch stage. Is
> this a bug in Nutch? Or I'm missing some configuration option?
>
> caezar wrote:
>   
>> I'm pretty sure that I ran both commands before indexing
>>
>> Andrzej Bialecki wrote:
>>     
>>> caezar wrote:
>>>       
>>>> Some more information. Debugging reduce method I've noticed, that before
>>>> code
>>>>     if (fetchDatum == null || dbDatum == null
>>>>         || parseText == null || parseData == null) {
>>>>       return;                                     // only have inlinks
>>>>     }
>>>> my page has fetchDatum, parseText and parseData not null, but dbDatum is
>>>> null. Thats why it's skipped :) 
>>>> Any ideas about the reason?
>>>>         
>>> Yes - you should run updatedb with this segment, and also run 
>>> invertlinks with this segment, _before_ trying to index. Otherwise the 
>>> db status won't be updated properly.
>>>
>>>
>>> -- 
>>> Best regards,
>>> Andrzej Bialecki     <><
>>>   ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>>
>>>
>>>       
>>     
>
>   


Re: Nutch indexes fewer pages than it fetches

Posted by caezar <ca...@gmail.com>.
I've compared the segment data of a URL which has no redirect and was
indexed correctly with this "bad" URL, and there really is a difference.
The first one has a db record in the segment:
Crawl Generate::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Oct 28 16:01:05 EET 2009
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1256738472613
 
But the second one has no such record, which seems reasonable: it was not
added to the segment at the generate stage, it was added at the fetch stage. Is
this a bug in Nutch? Or am I missing some configuration option?

caezar wrote:
> 
> I'm pretty sure that I ran both commands before indexing
> 
> Andrzej Bialecki wrote:
>> 
>> caezar wrote:
>>> Some more information. Debugging reduce method I've noticed, that before
>>> code
>>>     if (fetchDatum == null || dbDatum == null
>>>         || parseText == null || parseData == null) {
>>>       return;                                     // only have inlinks
>>>     }
>>> my page has fetchDatum, parseText and parseData not null, but dbDatum is
>>> null. Thats why it's skipped :) 
>>> Any ideas about the reason?
>> 
>> Yes - you should run updatedb with this segment, and also run 
>> invertlinks with this segment, _before_ trying to index. Otherwise the 
>> db status won't be updated properly.
>> 
>> 
>> -- 
>> Best regards,
>> Andrzej Bialecki     <><
>>   ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>> 
>> 
>> 
> 
> 



Re: Nutch indexes fewer pages than it fetches

Posted by caezar <ca...@gmail.com>.
I'm pretty sure that I ran both commands before indexing

Andrzej Bialecki wrote:
> 
> caezar wrote:
>> Some more information. Debugging reduce method I've noticed, that before
>> code
>>     if (fetchDatum == null || dbDatum == null
>>         || parseText == null || parseData == null) {
>>       return;                                     // only have inlinks
>>     }
>> my page has fetchDatum, parseText and parseData not null, but dbDatum is
>> null. Thats why it's skipped :) 
>> Any ideas about the reason?
> 
> Yes - you should run updatedb with this segment, and also run 
> invertlinks with this segment, _before_ trying to index. Otherwise the 
> db status won't be updated properly.
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 



Re: Nutch indexes fewer pages than it fetches

Posted by Andrzej Bialecki <ab...@getopt.org>.
caezar wrote:
> Some more information. Debugging reduce method I've noticed, that before code
>     if (fetchDatum == null || dbDatum == null
>         || parseText == null || parseData == null) {
>       return;                                     // only have inlinks
>     }
> my page has fetchDatum, parseText and parseData not null, but dbDatum is
> null. Thats why it's skipped :) 
> Any ideas about the reason?

Yes - you should run updatedb with this segment, and also run 
invertlinks with this segment, _before_ trying to index. Otherwise the 
db status won't be updated properly.
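
For reference, the sequence would be roughly the following; crawl/crawldb, crawl/linkdb and the segment name here are illustrative paths, not taken from this thread:

bin/nutch updatedb crawl/crawldb crawl/segments/20091026100146
bin/nutch invertlinks crawl/linkdb crawl/segments/20091026100146
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/20091026100146

with the last step replaced by whatever indexing job you actually run (a custom NutchIndexWriter in this case).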


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch indexes fewer pages than it fetches

Posted by caezar <ca...@gmail.com>.
Some more information. Debugging the reduce method, I've noticed that before the code
    if (fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null) {
      return;                                     // only have inlinks
    }
my page has fetchDatum, parseText and parseData not null, but dbDatum is
null. That's why it's skipped :)
Any ideas about the reason?

caezar wrote:
> 
> In the IndexerMapReduce.reduce there is a code:
> if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
>                    CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
>           continue;
>         }
> And the status of the redirect target URL is really linked. Thats why it's
> skipped. But what does this status mean?
> 
> reinhard schwab wrote:
>> 
>> hmm i have no idea now.
>> check the reduce method in IndexerMapReduce and add some debug
>> statements there.
>> recompile nutch and try it again.
>> 
>> caezar schrieb:
>>> Thanks, checked, it was parsed. Still no answer why it was not indexed
>>>
>>> reinhard schwab wrote:
>>>   
>>>> yes, its permanently redirected.
>>>> you can check also the segment status of this url
>>>> here is an example
>>>>
>>>> reinhard@thord:>bin/nutch  readseg -get crawl/segments/20091028122455
>>>> "http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37&big=1&seitenid=20"
>>>>
>>>> it will show you whether it is parsed and the extracted outlinks.
>>>> it will show any data related to this url stored in the segment.
>>>>
>>>> regards
>>>>
>>>> caezar schrieb:
>>>>     
>>>>> Thanks, that was really helpful. I've moved forward but still not
>>>>> found
>>>>> the
>>>>> solution.
>>>>> So the status of the initial URL
>>>>> (http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm)
>>>>> is:
>>>>> Status: 5 (db_redir_perm)
>>>>> Metadata: _pst_: moved(12), lastModified=0:
>>>>> http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm
>>>>>
>>>>> So it answers the question, why initial page was not indexed - because
>>>>> it
>>>>> was redirected.
>>>>> Now checking the status of redirect target:
>>>>> Status: 2 (db_fetched)
>>>>>
>>>>> So it was sucessfully fetchet. But, according to indexing log - it
>>>>> still
>>>>> was
>>>>> not sent to indexer!
>>>>>
>>>>>
>>>>>
>>>>> reinhard schwab wrote:
>>>>>   
>>>>>       
>>>>>> what is the db status of this url in your crawl db?
>>>>>> if it is STATUS_DB_NOTMODIFIED,
>>>>>> then it may be the reason.
>>>>>> (you can check it if you dump your crawl db with
>>>>>> reinhard@thord:>bin/nutch readdb  <crawldb> -url <url>
>>>>>>
>>>>>> it has this status, if it is recrawled and the signature does not
>>>>>> change.
>>>>>> the signature is MD5 hash of the content.
>>>>>>
>>>>>> another reason may be that you have some indexing filters.
>>>>>> i dont believe its the reason here.
>>>>>>
>>>>>> regards
>>>>>>
>>>>>>
>>>>>> kevin chen schrieb:
>>>>>>     
>>>>>>         
>>>>>>> I have similar experience.
>>>>>>>
>>>>>>> Reinhard schwab responded a possible fix.  See mail in this group
>>>>>>> from
>>>>>>> Reinhard schwab  at 
>>>>>>> Sun, 25 Oct 2009 10:03:41 +0100  (05:03 EDT)
>>>>>>>
>>>>>>> I haven't have chance to try it out yet.
>>>>>>>  
>>>>>>> On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
>>>>>>>   
>>>>>>>       
>>>>>>>           
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I've got a strange problem, that nutch indexes much less URLs then
>>>>>>>> it
>>>>>>>> fetches. For example URL:
>>>>>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
>>>>>>>> I assume that if fetched sucessfully because in fetch logs it
>>>>>>>> mentioned
>>>>>>>> only
>>>>>>>> once:
>>>>>>>> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher:
>>>>>>>> fetching
>>>>>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>>>>>>>>
>>>>>>>> But it was not sent to the indexer on indexing phase (I'm using
>>>>>>>> custom
>>>>>>>> NutchIndexWriter and it logs every page for witch it's write method
>>>>>>>> executed). What could be possible reason? Is there a way to browse
>>>>>>>> crawldb
>>>>>>>> to ensure that page really fetched? What else could I check?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>     
>>>>>>>>         
>>>>>>>>             
>>>>>>>   
>>>>>>>       
>>>>>>>           
>>>>>>     
>>>>>>         
>>>>>   
>>>>>       
>>>>
>>>>     
>>>
>>>   
>> 
>> 
>> 
> 
> 



Re: Nutch indexes fewer pages than it fetches

Posted by caezar <ca...@gmail.com>.
In IndexerMapReduce.reduce there is this code:
    if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
        CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
      continue;
    }
And the status of the redirect target URL is indeed "linked". That's why it's
skipped. But what does this status mean?
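
(As far as I can tell from the Nutch source, STATUS_LINKED marks entries that were only discovered during fetching/parsing, i.e. outlinks or, as here, redirect targets; they are written into the segment so that updatedb can add them to the crawldb, and they are not fetch records the indexer can use. One way to look at them, with an example segment path:

bin/nutch readseg -dump crawl/segments/20091026100146 segdump -nocontent -noparsetext -noparsedata

and then search the dump for the URL in question.)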

reinhard schwab wrote:
> 
> hmm i have no idea now.
> check the reduce method in IndexerMapReduce and add some debug
> statements there.
> recompile nutch and try it again.
> 
> caezar schrieb:
>> Thanks, checked, it was parsed. Still no answer why it was not indexed
>>
>> reinhard schwab wrote:
>>   
>>> yes, its permanently redirected.
>>> you can check also the segment status of this url
>>> here is an example
>>>
>>> reinhard@thord:>bin/nutch  readseg -get crawl/segments/20091028122455
>>> "http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37&big=1&seitenid=20"
>>>
>>> it will show you whether it is parsed and the extracted outlinks.
>>> it will show any data related to this url stored in the segment.
>>>
>>> regards
>>>
>>> caezar schrieb:
>>>     
>>>> Thanks, that was really helpful. I've moved forward but still not found
>>>> the
>>>> solution.
>>>> So the status of the initial URL
>>>> (http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm)
>>>> is:
>>>> Status: 5 (db_redir_perm)
>>>> Metadata: _pst_: moved(12), lastModified=0:
>>>> http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm
>>>>
>>>> So it answers the question, why initial page was not indexed - because
>>>> it
>>>> was redirected.
>>>> Now checking the status of redirect target:
>>>> Status: 2 (db_fetched)
>>>>
>>>> So it was sucessfully fetchet. But, according to indexing log - it
>>>> still
>>>> was
>>>> not sent to indexer!
>>>>
>>>>
>>>>
>>>> reinhard schwab wrote:
>>>>   
>>>>       
>>>>> what is the db status of this url in your crawl db?
>>>>> if it is STATUS_DB_NOTMODIFIED,
>>>>> then it may be the reason.
>>>>> (you can check it if you dump your crawl db with
>>>>> reinhard@thord:>bin/nutch readdb  <crawldb> -url <url>
>>>>>
>>>>> it has this status, if it is recrawled and the signature does not
>>>>> change.
>>>>> the signature is MD5 hash of the content.
>>>>>
>>>>> another reason may be that you have some indexing filters.
>>>>> i dont believe its the reason here.
>>>>>
>>>>> regards
>>>>>
>>>>>
>>>>> kevin chen schrieb:
>>>>>     
>>>>>         
>>>>>> I have similar experience.
>>>>>>
>>>>>> Reinhard schwab responded a possible fix.  See mail in this group
>>>>>> from
>>>>>> Reinhard schwab  at 
>>>>>> Sun, 25 Oct 2009 10:03:41 +0100  (05:03 EDT)
>>>>>>
>>>>>> I haven't have chance to try it out yet.
>>>>>>  
>>>>>> On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
>>>>>>   
>>>>>>       
>>>>>>           
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I've got a strange problem, that nutch indexes much less URLs then
>>>>>>> it
>>>>>>> fetches. For example URL:
>>>>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
>>>>>>> I assume that if fetched sucessfully because in fetch logs it
>>>>>>> mentioned
>>>>>>> only
>>>>>>> once:
>>>>>>> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher:
>>>>>>> fetching
>>>>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>>>>>>>
>>>>>>> But it was not sent to the indexer on indexing phase (I'm using
>>>>>>> custom
>>>>>>> NutchIndexWriter and it logs every page for witch it's write method
>>>>>>> executed). What could be possible reason? Is there a way to browse
>>>>>>> crawldb
>>>>>>> to ensure that page really fetched? What else could I check?
>>>>>>>
>>>>>>> Thanks
>>>>>>>     
>>>>>>>         
>>>>>>>             
>>>>>>   
>>>>>>       
>>>>>>           
>>>>>     
>>>>>         
>>>>   
>>>>       
>>>
>>>     
>>
>>   
> 
> 
> 



Re: Nutch indexes fewer pages than it fetches

Posted by reinhard schwab <re...@aon.at>.
Hmm, I have no idea now.
Check the reduce method in IndexerMapReduce and add some debug
statements there.
Recompile Nutch and try it again.
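
For example, a debug line of the following sort, placed at the existing null check, would show which input is missing for a given URL (purely illustrative; LOG is assumed to be the class's usual logger):

    if (fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null) {
      // illustrative debug output: report which inputs are present for this key
      LOG.info("skipping " + key + ": fetchDatum=" + (fetchDatum != null)
          + " dbDatum=" + (dbDatum != null)
          + " parseText=" + (parseText != null)
          + " parseData=" + (parseData != null));
      return;                                     // only have inlinks
    }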

caezar schrieb:
> Thanks, checked, it was parsed. Still no answer why it was not indexed
>
> reinhard schwab wrote:
>   
>> yes, its permanently redirected.
>> you can check also the segment status of this url
>> here is an example
>>
>> reinhard@thord:>bin/nutch  readseg -get crawl/segments/20091028122455
>> "http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37&big=1&seitenid=20"
>>
>> it will show you whether it is parsed and the extracted outlinks.
>> it will show any data related to this url stored in the segment.
>>
>> regards
>>
>> caezar schrieb:
>>     
>>> Thanks, that was really helpful. I've moved forward but still not found
>>> the
>>> solution.
>>> So the status of the initial URL
>>> (http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm)
>>> is:
>>> Status: 5 (db_redir_perm)
>>> Metadata: _pst_: moved(12), lastModified=0:
>>> http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm
>>>
>>> So it answers the question, why initial page was not indexed - because it
>>> was redirected.
>>> Now checking the status of redirect target:
>>> Status: 2 (db_fetched)
>>>
>>> So it was sucessfully fetchet. But, according to indexing log - it still
>>> was
>>> not sent to indexer!
>>>
>>>
>>>
>>> reinhard schwab wrote:
>>>   
>>>       
>>>> what is the db status of this url in your crawl db?
>>>> if it is STATUS_DB_NOTMODIFIED,
>>>> then it may be the reason.
>>>> (you can check it if you dump your crawl db with
>>>> reinhard@thord:>bin/nutch readdb  <crawldb> -url <url>
>>>>
>>>> it has this status, if it is recrawled and the signature does not
>>>> change.
>>>> the signature is MD5 hash of the content.
>>>>
>>>> another reason may be that you have some indexing filters.
>>>> i dont believe its the reason here.
>>>>
>>>> regards
>>>>
>>>>
>>>> kevin chen schrieb:
>>>>     
>>>>         
>>>>> I have similar experience.
>>>>>
>>>>> Reinhard schwab responded a possible fix.  See mail in this group from
>>>>> Reinhard schwab  at 
>>>>> Sun, 25 Oct 2009 10:03:41 +0100  (05:03 EDT)
>>>>>
>>>>> I haven't have chance to try it out yet.
>>>>>  
>>>>> On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
>>>>>   
>>>>>       
>>>>>           
>>>>>> Hi All,
>>>>>>
>>>>>> I've got a strange problem, that nutch indexes much less URLs then it
>>>>>> fetches. For example URL:
>>>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
>>>>>> I assume that if fetched sucessfully because in fetch logs it
>>>>>> mentioned
>>>>>> only
>>>>>> once:
>>>>>> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher:
>>>>>> fetching
>>>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>>>>>>
>>>>>> But it was not sent to the indexer on indexing phase (I'm using custom
>>>>>> NutchIndexWriter and it logs every page for witch it's write method
>>>>>> executed). What could be possible reason? Is there a way to browse
>>>>>> crawldb
>>>>>> to ensure that page really fetched? What else could I check?
>>>>>>
>>>>>> Thanks
>>>>>>     
>>>>>>         
>>>>>>             
>>>>>   
>>>>>       
>>>>>           
>>>>     
>>>>         
>>>   
>>>       
>>
>>     
>
>   


Re: Nutch indexes fewer pages than it fetches

Posted by caezar <ca...@gmail.com>.
Thanks; I checked, and it was parsed. Still no answer as to why it was not indexed.

reinhard schwab wrote:
> 
> yes, its permanently redirected.
> you can check also the segment status of this url
> here is an example
> 
> reinhard@thord:>bin/nutch  readseg -get crawl/segments/20091028122455
> "http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37&big=1&seitenid=20"
> 
> it will show you whether it is parsed and the extracted outlinks.
> it will show any data related to this url stored in the segment.
> 
> regards
> 
> caezar schrieb:
>> Thanks, that was really helpful. I've moved forward but still not found
>> the
>> solution.
>> So the status of the initial URL
>> (http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm)
>> is:
>> Status: 5 (db_redir_perm)
>> Metadata: _pst_: moved(12), lastModified=0:
>> http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm
>>
>> So it answers the question, why initial page was not indexed - because it
>> was redirected.
>> Now checking the status of redirect target:
>> Status: 2 (db_fetched)
>>
>> So it was sucessfully fetchet. But, according to indexing log - it still
>> was
>> not sent to indexer!
>>
>>
>>
>> reinhard schwab wrote:
>>   
>>> what is the db status of this url in your crawl db?
>>> if it is STATUS_DB_NOTMODIFIED,
>>> then it may be the reason.
>>> (you can check it if you dump your crawl db with
>>> reinhard@thord:>bin/nutch readdb  <crawldb> -url <url>
>>>
>>> it has this status, if it is recrawled and the signature does not
>>> change.
>>> the signature is MD5 hash of the content.
>>>
>>> another reason may be that you have some indexing filters.
>>> i dont believe its the reason here.
>>>
>>> regards
>>>
>>>
>>> kevin chen schrieb:
>>>     
>>>> I have similar experience.
>>>>
>>>> Reinhard schwab responded a possible fix.  See mail in this group from
>>>> Reinhard schwab  at 
>>>> Sun, 25 Oct 2009 10:03:41 +0100  (05:03 EDT)
>>>>
>>>> I haven't have chance to try it out yet.
>>>>  
>>>> On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
>>>>   
>>>>       
>>>>> Hi All,
>>>>>
>>>>> I've got a strange problem, that nutch indexes much less URLs then it
>>>>> fetches. For example URL:
>>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
>>>>> I assume that if fetched sucessfully because in fetch logs it
>>>>> mentioned
>>>>> only
>>>>> once:
>>>>> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher:
>>>>> fetching
>>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>>>>>
>>>>> But it was not sent to the indexer on indexing phase (I'm using custom
>>>>> NutchIndexWriter and it logs every page for witch it's write method
>>>>> executed). What could be possible reason? Is there a way to browse
>>>>> crawldb
>>>>> to ensure that page really fetched? What else could I check?
>>>>>
>>>>> Thanks
>>>>>     
>>>>>         
>>>>   
>>>>       
>>>
>>>     
>>
>>   
> 
> 
> 



Re: Nutch indexes fewer pages than it fetches

Posted by reinhard schwab <re...@aon.at>.
Yes, it's permanently redirected.
You can also check the segment status of this URL.
Here is an example:

reinhard@thord:>bin/nutch  readseg -get crawl/segments/20091028122455
"http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37&big=1&seitenid=20"

It will show you whether it was parsed, and the extracted outlinks.
It will show any data related to this URL stored in the segment.

regards

caezar schrieb:
> Thanks, that was really helpful. I've moved forward but still not found the
> solution.
> So the status of the initial URL
> (http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm) is:
> Status: 5 (db_redir_perm)
> Metadata: _pst_: moved(12), lastModified=0:
> http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm
>
> So it answers the question, why initial page was not indexed - because it
> was redirected.
> Now checking the status of redirect target:
> Status: 2 (db_fetched)
>
> So it was sucessfully fetchet. But, according to indexing log - it still was
> not sent to indexer!
>
>
>
> reinhard schwab wrote:
>   
>> what is the db status of this url in your crawl db?
>> if it is STATUS_DB_NOTMODIFIED,
>> then it may be the reason.
>> (you can check it if you dump your crawl db with
>> reinhard@thord:>bin/nutch readdb  <crawldb> -url <url>
>>
>> it has this status, if it is recrawled and the signature does not change.
>> the signature is MD5 hash of the content.
>>
>> another reason may be that you have some indexing filters.
>> i dont believe its the reason here.
>>
>> regards
>>
>>
>> kevin chen schrieb:
>>     
>>> I have similar experience.
>>>
>>> Reinhard schwab responded a possible fix.  See mail in this group from
>>> Reinhard schwab  at 
>>> Sun, 25 Oct 2009 10:03:41 +0100  (05:03 EDT)
>>>
>>> I haven't have chance to try it out yet.
>>>  
>>> On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
>>>   
>>>       
>>>> Hi All,
>>>>
>>>> I've got a strange problem, that nutch indexes much less URLs then it
>>>> fetches. For example URL:
>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
>>>> I assume that if fetched sucessfully because in fetch logs it mentioned
>>>> only
>>>> once:
>>>> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching
>>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>>>>
>>>> But it was not sent to the indexer on indexing phase (I'm using custom
>>>> NutchIndexWriter and it logs every page for witch it's write method
>>>> executed). What could be possible reason? Is there a way to browse
>>>> crawldb
>>>> to ensure that page really fetched? What else could I check?
>>>>
>>>> Thanks
>>>>     
>>>>         
>>>   
>>>       
>>
>>     
>
>   


Re: Nutch indexes fewer pages than it fetches

Posted by caezar <ca...@gmail.com>.
Thanks, that was really helpful. I've moved forward but still haven't found the
solution.
So the status of the initial URL
(http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm) is:
Status: 5 (db_redir_perm)
Metadata: _pst_: moved(12), lastModified=0:
http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm

So that answers the question of why the initial page was not indexed: because it
was redirected.
Now checking the status of the redirect target:
Status: 2 (db_fetched)

So it was successfully fetched. But, according to the indexing log, it still was
not sent to the indexer!



reinhard schwab wrote:
> 
> what is the db status of this url in your crawl db?
> if it is STATUS_DB_NOTMODIFIED,
> then it may be the reason.
> (you can check it if you dump your crawl db with
> reinhard@thord:>bin/nutch readdb  <crawldb> -url <url>
> 
> it has this status, if it is recrawled and the signature does not change.
> the signature is MD5 hash of the content.
> 
> another reason may be that you have some indexing filters.
> i dont believe its the reason here.
> 
> regards
> 
> 
> kevin chen schrieb:
>> I have similar experience.
>>
>> Reinhard schwab responded a possible fix.  See mail in this group from
>> Reinhard schwab  at 
>> Sun, 25 Oct 2009 10:03:41 +0100  (05:03 EDT)
>>
>> I haven't have chance to try it out yet.
>>  
>> On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
>>   
>>> Hi All,
>>>
>>> I've got a strange problem, that nutch indexes much less URLs then it
>>> fetches. For example URL:
>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
>>> I assume that if fetched sucessfully because in fetch logs it mentioned
>>> only
>>> once:
>>> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching
>>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>>>
>>> But it was not sent to the indexer on indexing phase (I'm using custom
>>> NutchIndexWriter and it logs every page for witch it's write method
>>> executed). What could be possible reason? Is there a way to browse
>>> crawldb
>>> to ensure that page really fetched? What else could I check?
>>>
>>> Thanks
>>>     
>>
>>
>>   
> 
> 
> 



Re: Nutch indexes fewer pages than it fetches

Posted by reinhard schwab <re...@aon.at>.
What is the db status of this URL in your crawl db?
If it is STATUS_DB_NOTMODIFIED,
then that may be the reason.
(You can check it if you dump your crawl db with
reinhard@thord:>bin/nutch readdb  <crawldb> -url <url> )

It has this status if it is recrawled and the signature does not change.
The signature is an MD5 hash of the content.

Another reason may be that you have some indexing filters;
I don't believe that's the reason here.

regards


kevin chen schrieb:
> I have similar experience.
>
> Reinhard schwab responded a possible fix.  See mail in this group from
> Reinhard schwab  at 
> Sun, 25 Oct 2009 10:03:41 +0100  (05:03 EDT)
>
> I haven't have chance to try it out yet.
>  
> On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
>   
>> Hi All,
>>
>> I've got a strange problem, that nutch indexes much less URLs then it
>> fetches. For example URL:
>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
>> I assume that if fetched sucessfully because in fetch logs it mentioned only
>> once:
>> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching
>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>>
>> But it was not sent to the indexer on indexing phase (I'm using custom
>> NutchIndexWriter and it logs every page for witch it's write method
>> executed). What could be possible reason? Is there a way to browse crawldb
>> to ensure that page really fetched? What else could I check?
>>
>> Thanks
>>     
>
>
>   


Re: Nutch indexes fewer pages than it fetches

Posted by kevin chen <ke...@bdsing.com>.
I have a similar experience.

Reinhard Schwab responded with a possible fix. See the mail in this group from
Reinhard Schwab at
Sun, 25 Oct 2009 10:03:41 +0100  (05:03 EDT).

I haven't had a chance to try it out yet.
 
On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
> Hi All,
> 
> I've got a strange problem, that nutch indexes much less URLs then it
> fetches. For example URL:
> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
> I assume that if fetched sucessfully because in fetch logs it mentioned only
> once:
> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching
> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
> 
> But it was not sent to the indexer on indexing phase (I'm using custom
> NutchIndexWriter and it logs every page for witch it's write method
> executed). What could be possible reason? Is there a way to browse crawldb
> to ensure that page really fetched? What else could I check?
> 
> Thanks


Re: Nutch indexes fewer pages than it fetches

Posted by caezar <ca...@gmail.com>.
Sorry, but how could I do this?

皮皮 wrote:
> 
> check the parse data first, maybe it parse unsuccessful.
> 
> 2009/10/27 caezar <ca...@gmail.com>
> 
>>
>> Hi All,
>>
>> I've got a strange problem, that nutch indexes much less URLs then it
>> fetches. For example URL:
>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
>> I assume that if fetched sucessfully because in fetch logs it mentioned
>> only
>> once:
>> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching
>> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>>
>> But it was not sent to the indexer on indexing phase (I'm using custom
>> NutchIndexWriter and it logs every page for witch it's write method
>> executed). What could be possible reason? Is there a way to browse
>> crawldb
>> to ensure that page really fetched? What else could I check?
>>
>> Thanks
>> --
>> View this message in context:
>> http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26078798.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 



Re: Nutch indexes fewer pages than it fetches

Posted by 皮皮 <pi...@gmail.com>.
Check the parse data first; maybe it parsed unsuccessfully.

2009/10/27 caezar <ca...@gmail.com>

>
> Hi All,
>
> I've got a strange problem, that nutch indexes much less URLs then it
> fetches. For example URL:
> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
> I assume that if fetched sucessfully because in fetch logs it mentioned
> only
> once:
> 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching
> http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
>
> But it was not sent to the indexer on indexing phase (I'm using custom
> NutchIndexWriter and it logs every page for witch it's write method
> executed). What could be possible reason? Is there a way to browse crawldb
> to ensure that page really fetched? What else could I check?
>
> Thanks
> --
> View this message in context:
> http://www.nabble.com/Nutch-indexes-less-pages%2C-then-it-fetches-tp26078798p26078798.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>