Posted to user@nutch.apache.org by Elisabeth Adler <el...@gmail.com> on 2012/03/23 13:39:38 UTC

Partially parsed pages

Hi,

I am using Nutch 1.3 in conjunction with Solr 3.3.0 to add search
capabilities to an intranet. The content that does get indexed is fine,
but most pages don't seem to be parsed completely: the bottom part of the
page is missing from the content field. Nutch's logs don't show any
exceptions.

I turned off parallel fetching (setting fetcher.threads.per.host and
fetcher.threads.fetch both to 1), which seemed to improve things, but
I still get some pages which are only partially indexed.
http.content.limit and file.content.limit are both set to -1.
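
For reference, the settings described above would look roughly like this as
overrides in nutch-site.xml. This is only a sketch of the configuration as
described in this message (the property names and values are the ones
mentioned above); the surrounding file layout is assumed to be the standard
Nutch configuration format.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Parallel fetching turned off: one fetcher thread overall,
       and one thread per host. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>1</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>

  <!-- -1 disables truncation of downloaded content. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```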

I tested different settings for the fetcher.server.delay property, but 
this seems to only affect how long Nutch waits until the next fetch.

What I think is happening is that the web server can't serve the pages
fast enough (when accessing the pages via a browser, it takes about
5 seconds until the page is rendered completely), so Nutch retrieves only
the part of the page that has been rendered so far. Is there an option
to let Nutch wait a certain amount of time for the page to be completely
loaded before parsing the content?

Has anyone already encountered a similar issue? Any pointers appreciated.
Thanks,
Elisabeth

Re: Partially parsed pages

Posted by Elisabeth Adler <el...@gmail.com>.

Hi,
Found a solution: setting only the fetcher.server.delay property was
not enough, but once I set fetcher.server.min.delay as well, it
seems to produce better results.
Elisabeth
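
As an illustration of the fix described in the reply above, the two delay
properties could be set together in nutch-site.xml along these lines. The
message does not state which values were used, so the 5.0 below is purely an
assumed placeholder to be tuned for the server in question.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Seconds to wait between successive requests to the same server.
       The value is an assumption; the original message gives none. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>

  <!-- Minimum delay between successive requests to the same server.
       Also an assumed placeholder value. -->
  <property>
    <name>fetcher.server.min.delay</name>
    <value>5.0</value>
  </property>
</configuration>
```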
