Posted to user@nutch.apache.org by Paul Rogers <pa...@gmail.com> on 2015/01/09 18:35:05 UTC

Problem with time out on QueueFeeder

Hi Guys

I am using Nutch 1.8 to fetch PDF documents from an HTTP server.  The jobs
have been running OK until recently, when I started getting the following
error:

-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
fetching
http://server1/doccontrol/DC-10%20Incoming%20Correspondence(IAE-US)/15C_221427_IAE_LTR_IAE_0845%20Letter%20from%20Alvarez.pdf
(queue crawl delay=5000ms)
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
QueueFeeder finished: total 4655 records + hit by time limit :1184
-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500
* queue: http://ws0895 >> dropping!
-finishing thread FetcherThread, activeThreads=49
-finishing thread FetcherThread, activeThreads=48
.
.
.
.
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1340)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1376)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1349)

New PDFs are added to the server every night and the whole of the content
is then re-fetched, i.e. the content is growing, so I can understand that a
limit might be reached.

I have searched on the error and it seems that this behaviour should be
governed by the fetcher.timelimit.mins property.
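To make sure I am reading the description correctly, here is my
understanding of the documented semantics sketched as code (this is a
hypothetical helper of my own, not the actual Nutch Fetcher source; the
name computeDeadline is my invention):

```java
// Hypothetical sketch of the documented fetcher.timelimit.mins semantics;
// NOT the actual Nutch Fetcher source.
public class TimeLimitSketch {

    /**
     * Returns the absolute fetch deadline in milliseconds, or -1 when the
     * limit is disabled (the documented meaning of a -1 setting).
     */
    static long computeDeadline(long timelimitMins, long nowMs) {
        if (timelimitMins > 0) {
            // A positive value sets a deadline that many minutes from now.
            return nowMs + timelimitMins * 60L * 1000L;
        }
        return -1L; // -1 disables the time limit
    }

    public static void main(String[] args) {
        // With -1 the deadline should be disabled...
        System.out.println(computeDeadline(-1L, 1_000L));
        // ...with 10 minutes it is 10 * 60 * 1000 ms past "now".
        System.out.println(computeDeadline(10L, 1_000L));
    }
}
```

So with a value of -1 I would expect no deadline to ever be computed.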

I've checked the nutch-default.xml and nutch-site.xml files and can only
find a single entry:

<property>
  <name>fetcher.timelimit.mins</name>
  <value>-1</value>
  <description>This is the number of minutes allocated to the fetching.
  Once this value is reached, any remaining entry from the input URL list
  is skipped and all active queues are emptied. The default value of -1
  deactivates the time limit.
  </description>
</property>
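In case file precedence matters: my understanding is that settings in
conf/nutch-site.xml override conf/nutch-default.xml, so an explicit
override there would look like the fragment below (file path as per a
standard Nutch install; the value is the one I intend to use):

```xml
<!-- conf/nutch-site.xml: explicitly disable the fetch time limit -->
<property>
  <name>fetcher.timelimit.mins</name>
  <value>-1</value>
</property>
```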

Should there not therefore be no time limit at all?

Any suggestions on what else might be causing this problem?

Thanks

P