Posted to user@nutch.apache.org by Elisabeth Adler <el...@gmail.com> on 2012/03/27 10:28:29 UTC
Different number of parsed pages for crawls with same settings
Hi,
I'm using Nutch 1.3 to crawl dynamic pages (JSPs) and index them into
Solr. With the same settings, I sometimes get more documents indexed,
sometimes fewer. There are no errors in the log files. The Solr index and
Nutch crawl directory are removed before each crawl, so I have a clean
setup.
What could be the reason for these differences?
Any pointers appreciated,
Elisabeth
Re: Different number of parsed pages for crawls with same settings
Posted by Elisabeth Adler <el...@gmail.com>.
Hi Remi,
Thanks a lot for the quick response. That's what I expect as well -
pages being temporarily down and therefore not available to crawl and
index. I'll check the logs and see what I can figure out.
Thanks,
Elisabeth
On 27.03.2012 12:05, remi tassing wrote:
> This happened to me before for a very specific reason and I'm not sure if
> it's the same for you. Some of the websites I was trying to access
> were temporarily down.
>
> I would suggest you check the difference between the logs
>
> Remi
Re: Different number of parsed pages for crawls with same settings
Posted by remi tassing <ta...@gmail.com>.
This happened to me before for a very specific reason and I'm not sure if
it's the same for you. Some of the websites I was trying to access
were temporarily down.
I would suggest you compare the logs of the two crawls.
Remi
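One way to compare the logs of two crawls is to extract the URLs each run attempted and the ones that failed, then diff the two sets. The following is only a minimal sketch: it assumes the fetcher writes lines containing "fetching <url>" and "fetch of <url> failed with: <reason>" to hadoop.log, which may not match your Nutch version or log4j settings, and the names summarize and compare are hypothetical helpers, not part of Nutch.

```python
import re
import sys

# Assumed log line shapes; adjust the patterns to match your own hadoop.log.
FETCH_RE = re.compile(r"fetching (\S+)")
FAIL_RE = re.compile(r"fetch of (\S+) failed with: (.*)")


def summarize(lines):
    """Return (set of URLs with a 'fetching' line, dict of URL -> failure reason)."""
    attempted, failed = set(), {}
    for line in lines:
        m = FAIL_RE.search(line)
        if m:
            failed[m.group(1)] = m.group(2).strip()
            continue
        m = FETCH_RE.search(line)
        if m:
            attempted.add(m.group(1))
    return attempted, failed


def compare(lines_a, lines_b):
    """Report URLs attempted in only one crawl, plus the failures seen in each."""
    att_a, fail_a = summarize(lines_a)
    att_b, fail_b = summarize(lines_b)
    return {
        "only_in_a": sorted(att_a - att_b),
        "only_in_b": sorted(att_b - att_a),
        "failed_in_a": fail_a,
        "failed_in_b": fail_b,
    }


# Usage: python compare_logs.py <log of crawl A> <log of crawl B>
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fa, open(sys.argv[2]) as fb:
        report = compare(fa, fb)
    for key, value in report.items():
        print(key, value)
```

URLs that appear in only one run, or that carry a failure reason in one run but not the other, are the first candidates to check for pages that were temporarily down.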