Posted to user@nutch.apache.org by Elisabeth Adler <el...@gmail.com> on 2012/03/27 10:28:29 UTC

Different number of parsed pages for crawls with same settings

Hi,

I'm using Nutch 1.3 to crawl dynamic pages (JSPs) and index them into 
Solr. With the same settings, I sometimes get more documents indexed and 
sometimes fewer. There are no errors in the log files. The Solr index and 
Nutch crawl directory are removed before each crawl, so I have a clean 
setup.
What could be the reason for these differences?

Any pointers appreciated,
Elisabeth

Re: Different number of parsed pages for crawls with same settings

Posted by Elisabeth Adler <el...@gmail.com>.
Hi Remi,
Thanks a lot for the quick response. That's what I expect as well - 
pages being temporarily down and therefore not available to crawl and 
index. I'll check the logs and see what I can figure out.
Thanks,
Elisabeth

On 27.03.2012 12:05, remi tassing wrote:
> This happened to me before for a very specific reason and I'm not sure if
> it's the same for you. Some of the websites I was trying to access
> were temporarily down.
>
> I would suggest you compare the logs of the two crawls.
>
> Remi

Re: Different number of parsed pages for crawls with same settings

Posted by remi tassing <ta...@gmail.com>.
This happened to me before for a very specific reason and I'm not sure if
it's the same for you. Some of the websites I was trying to access
were temporarily down.

I would suggest you compare the logs of the two crawls.

Remi
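
One way to compare the logs of two crawls is to extract the URLs each
run tried to fetch and list the ones that appear in only one of them.
The following is a minimal Python sketch, not a Nutch tool. It assumes
that each fetch attempt is logged on a line containing
"fetching <url>"; adjust the pattern if your log format differs. The
function names and file paths are hypothetical.

```python
import re

# Assumption: every fetch attempt is logged as "... fetching <url>".
# Change this pattern if your log lines look different.
FETCH_PATTERN = re.compile(r"fetching (\S+)")


def fetched_urls(log_text):
    """Return the set of URLs found in 'fetching <url>' log lines."""
    return set(FETCH_PATTERN.findall(log_text))


def compare_runs(log_a, log_b):
    """Return (only_in_a, only_in_b): sorted URLs seen in just one log."""
    urls_a = fetched_urls(log_a)
    urls_b = fetched_urls(log_b)
    return sorted(urls_a - urls_b), sorted(urls_b - urls_a)


def compare_log_files(path_a, path_b):
    """Read two log files (hypothetical paths) and compare their URLs."""
    with open(path_a, encoding="utf-8", errors="replace") as file_a:
        log_a = file_a.read()
    with open(path_b, encoding="utf-8", errors="replace") as file_b:
        log_b = file_b.read()
    return compare_runs(log_a, log_b)
```

Calling compare_log_files("run1/hadoop.log", "run2/hadoop.log") (paths
are placeholders) returns two lists. URLs that show up in only one of
them are the first candidates to check for temporary downtime or
timeouts.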
