Posted to user@nutch.apache.org by Filip Stysiak <st...@gmail.com> on 2017/08/10 15:10:55 UTC

dockerized Nutch crawl doesn't end

Hi everyone,

I am developing an app that runs dockerized Nutch 1.x instances, receives
crawl requests through Celery, and indexes the results to Solr 6.6.0. The app
indexes images, using the protocol-selenium plugin to fetch dynamic content.
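For reference, the relevant part of my nutch-site.xml looks roughly like
this (a sketch, not my exact file; the plugin.includes pattern and the
driver value are illustrative):

<!-- enable protocol-selenium instead of protocol-http (illustrative pattern) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<!-- WebDriver used by protocol-selenium to render dynamic pages -->
<property>
  <name>selenium.driver</name>
  <value>firefox</value>
</property>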
However, I noticed that while small crawl tasks are indexed properly, I had
no success with a slightly bigger one: when I asked the dockerized app to
crawl a website that (after 3 iterations of the crawl script) needs to fetch
~5000 links, Nutch in the Docker container simply stops working. The last
entries I see in hadoop.log are from the fetcher; there are no exceptions,
save for one that does not occur when I run the very same crawling task
(successfully) on the host machine.
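For context, the crawl is kicked off with the stock crawl script, roughly
like this (a sketch; the seed dir, crawl dir and Solr URL are placeholders):

# 3 rounds, indexing to Solr after each round
bin/crawl -i -D solr.server.url=http://solr:8983/solr/nutch urls/ crawls/test 3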

The exception (pastebin link to the full stack trace):
org.apache.commons.httpclient.NoHttpResponseException: The server
some.site.web failed to respond
https://pastebin.com/nNs7DP93

I doubt that failing to fetch a couple of links would put Nutch into this
crashed-but-not-really state. I say "not really" because Celery still sees
the task as active, but looking at htop or docker stats makes it quite
obvious that Nutch has ceased to do anything productive. Let me restate that
this does not happen when I run the same task outside of Docker.
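If thread dumps would help, I can grab them from the stuck container,
roughly like this (a sketch; "nutch" is a placeholder for my container name):

# find the hung Nutch/Hadoop JVM inside the container
docker exec -it nutch jps -l
# dump its thread stacks to see where the fetcher threads are parked
docker exec -it nutch jstack <pid>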

Has anyone here stumbled upon anything similar, or has any experience with
running bigger crawls on dockerized Nutch?

Thanks in advance,
Filip