Posted to user@manifoldcf.apache.org by Tom Rees <tr...@chiliad.com> on 2014/04/05 00:28:15 UTC

Two simultaneous web crawls hang

Hi. I am running ManifoldCF 1.4.1 with patch 813, using PostgreSQL 9.3.2
for the database. There is a strange problem with the web crawler: if I
run two simultaneous crawls, they hang fairly quickly and the logfile
shows no activity other than "Idle cleanup thread" messages. However, if
I run a single crawl, it runs for days, either finishing or fetching more
documents indefinitely.

Usually the two sites I crawl are www.fbi.gov and www.cnn.com. The crawls
are vanilla except that I vary the number of connections from 2 to 8 per
crawl, and sometimes I select the option to never delete unreachable
documents. I have also varied the number of crawler threads from 30 to 60,
and I have set the number of database handles to 200. Regardless of these
settings, the crawls always hang.

I looked at the threads after the crawls hung, and it looks like some
threads are waiting forever for a signal. Most of the crawler threads are
waiting for a connector:

      Name: Worker thread '0'
      State: WAITING on org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory$Pool@1932ab0
      Total blocked: 57,189 Total waited: 59,158
      Stack trace:
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:503)
      org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory$Pool.getConnector(RepositoryConnectorFactory.java:591)
      org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.grab(RepositoryConnectorFactory.java:384)
      org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:254)

However, some threads are waiting on a response from a URL fetch:

      Name: Worker thread '24'
      State: BLOCKED on org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ExecuteMethodThread@6a082f owned by: Thread-3289186
      Total blocked: 60,050 Total waited: 62,264
      Stack trace:
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:503)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ExecuteMethodThread.getResponseCode(ThrottledFetcher.java:2511)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.executeFetch(ThrottledFetcher.java:1610)
      org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:724)
      org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:322)

and some threads are waiting on a connection to use to download a URL:

      Name: Worker thread '28'
      State: WAITING on java.lang.Integer@984b34
      Total blocked: 93,074 Total waited: 96,142
      Stack trace:
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:503)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher.getConnection(ThrottledFetcher.java:413)
      org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:714)
      org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:322)

while one was waiting to "finish up":

      Name: Worker thread '32'
      State: WAITING on org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ExecuteMethodThread@3232b0
      Total blocked: 71,716 Total waited: 73,738
      Stack trace:
      java.lang.Object.wait(Native Method)
      java.lang.Thread.join(Thread.java:1260)
      java.lang.Thread.join(Thread.java:1334)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ExecuteMethodThread.finishUp(ThrottledFetcher.java:2629)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.doneFetch(ThrottledFetcher.java:1926)
      org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:804)

It looks like there were also four threads spawned to download the data
from the InputStream:

      Name: Thread-3278771
      State: WAITING on org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottleBin@ec30e9
      Total blocked: 0 Total waited: 1
      Stack trace:
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:503)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottleBin.beginRead(ThrottledFetcher.java:831)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.beginRead(ThrottledFetcher.java:1200)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2133)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:2114)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:2077)
      java.util.zip.CheckedInputStream.read(CheckedInputStream.java:59)
      java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:262)
      java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:254)
      java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:163)
      java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:78)
      java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:90)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ExecuteMethodThread.run(ThrottledFetcher.java:2428)
      locked org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ExecuteMethodThread@144d709


According to jconsole, which I used to get these stack traces, there were
no deadlocked threads. However, as the stack traces show, many of these
threads are blocked in wait() calls.
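
For reference, here is a minimal Java sketch of why a "no deadlocked
threads" report and a hang in wait() can coexist. The JVM's deadlock
detection (ThreadMXBean.findDeadlockedThreads) looks for cycles of threads
waiting to acquire monitors or ownable synchronizers, so a thread parked in
Object.wait() that never receives a notify is not counted. The class and
method names below (WaitVersusDeadlock, startStuckWaiter and so on) are
hypothetical and are not ManifoldCF code. The sketch only reproduces the
symptom in isolation and says nothing about the actual cause of the hang.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical demo class; nothing here is ManifoldCF code.
final class WaitVersusDeadlock {

    // Counts live threads by state, the same states a jconsole thread dump shows.
    static Map<Thread.State, Integer> countByState(ThreadMXBean mx) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Starts a daemon thread that waits on a monitor nobody ever notifies,
    // and returns once that thread has reached the WAITING state.
    static Thread startStuckWaiter(final Object lock) {
        Thread t = new Thread(() -> {
            synchronized (lock) {
                try {
                    while (true) {
                        lock.wait(); // no notify() is ever sent
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // exit quietly
                }
            }
        }, "stuck-waiter");
        t.setDaemon(true);
        t.start();
        try {
            while (t.getState() != Thread.State.WAITING) {
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while starting waiter", e);
        }
        return t;
    }

    // True only if the JVM finds a cycle of threads blocked on each other's locks.
    static boolean jvmReportsDeadlock(ThreadMXBean mx) {
        long[] ids = mx.isSynchronizerUsageSupported()
                ? mx.findDeadlockedThreads()
                : mx.findMonitorDeadlockedThreads();
        return ids != null && ids.length > 0;
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Thread waiter = startStuckWaiter(new Object());
        System.out.println("waiter state: " + waiter.getState());
        System.out.println("deadlock reported: " + jvmReportsDeadlock(mx));
        System.out.println("threads by state: " + countByState(mx));
        waiter.interrupt();
    }
}
```

Run inside the affected JVM (or adapted to a remote JMX connection), the
per-state counts give a quick view of how many threads are WAITING versus
BLOCKED without reading every stack trace by hand.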

Any help you can offer to keep our web crawls from hanging will be greatly
appreciated.

Thank you,
Tom Rees