Posted to user@nutch.apache.org by Elisabeth Adler <el...@gmail.com> on 2012/03/23 13:39:38 UTC
Partially parsed pages
Hi,
I am using Nutch 1.3 in conjunction with Solr 3.3.0 to add search
capabilities to an intranet. The content that does get indexed is fine, but
most pages don't seem to be parsed completely: the bottom part of the page
is missing from the content field. Nutch's logs don't show any exceptions.
I turned off parallel fetching (setting fetcher.threads.per.host and
fetcher.threads.fetch both to 1), which seemed to improve things, but
I still get some pages that are only partially indexed.
http.content.limit and file.content.limit are both set to -1.
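
For reference, the settings described above would look roughly like this in
conf/nutch-site.xml. This is only a sketch: the property names and values are
the ones mentioned in this message, while the file location and the
surrounding layout are assumed from a standard Nutch 1.x installation.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- One fetcher thread overall and one per host, i.e. no parallel fetching -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>1</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>

  <!-- -1 disables truncation of downloaded content -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```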
I tested different settings for the fetcher.server.delay property, but
this seems to only affect how long Nutch waits until the next fetch.
What I think is happening is that the web server can't serve the pages
fast enough (when accessing the pages via a browser it takes about
5 seconds until the page is completely rendered), so Nutch retrieves only
the part of the page that has been rendered so far. Is there an option
to let Nutch wait a certain amount of time for the page to be completely
loaded before parsing the content?
Has anyone encountered a similar issue already? Any pointers appreciated.
Thanks,
Elisabeth
Re: Partially parsed pages
Posted by Elisabeth Adler <el...@gmail.com>.
Hi,
Found a solution: setting only the fetcher.server.delay property was
not enough, but once I set fetcher.server.min.delay as well, it
seems to produce better results.
Elisabeth
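
A minimal sketch of what that combination might look like in
conf/nutch-site.xml. The two property names are the ones named in the
message; the values are placeholders chosen purely for illustration, since
the message does not say which values were used.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Seconds to wait between successive requests to the same server.
       Placeholder value, not taken from the message. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>

  <!-- Minimum delay in seconds between requests to the same server.
       Placeholder value, not taken from the message. -->
  <property>
    <name>fetcher.server.min.delay</name>
    <value>5.0</value>
  </property>
</configuration>
```

The values would need tuning against the actual response times of the
intranet server.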