Posted to user@nutch.apache.org by Chris Schneider <Sc...@TransPac.com> on 2006/02/01 03:49:14 UTC

Re: Recovering from Socket closed

Doug, et. al.,

At 2:48 PM -0800 1/31/06, Chris Schneider wrote:
>Is there a work-around? Based on some other postings, I've increased ipc.client.timeout to 300000 (5 minutes). Does this property also control the timeout for the RPC call you describe above? If so, should I increase this timeout further? Is there a better way for us to avoid getting caught by the RPC timeout you describe? This crawl was only a medium-sized test. We hope to execute a much larger crawl over the next few days.

I was unable to even get the indexing phase started; I would get a timeout right at the beginning. I tried increasing the ipc.client.timeout from 5 minutes to 10 minutes, but that didn't help. In desperation, I increased it to 30 minutes and went to walk the dogs. As it turned out, it apparently took 14 minutes for it to "compute the splits". The job is still running (34% complete). Thus, it does seem like Doug was right about this being the problem.

This is actually the third time we've increased ipc.client.timeout to solve crawl problems. I wonder whether we should just set it to an hour or two until the problem in the code is fixed. I guess I could also look into fixing the problem myself :)
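
For anyone else who needs to bump it, the override just goes in conf/nutch-site.xml; the value is in milliseconds, so 30 minutes works out to 1800000. Something like:

  <property>
    <name>ipc.client.timeout</name>
    <value>1800000</value>
  </property>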

Thanks for the help, Doug!

- Chris
-- 
------------------------
Chris Schneider
TransPac Software, Inc.
Schmed@TransPac.com
------------------------

Re: Recovering from Socket closed

Posted by Chris Schneider <Sc...@TransPac.com>.
Gang,

>Chris Schneider wrote:
>>I was unable to even get the indexing phase started; I would get a 
>>timeout right at the beginning. I tried increasing the 
>>ipc.client.timeout from 5 minutes to 10 minutes, but that didn't 
>>help. In desperation, I increased it to 30 minutes and went to walk 
>>the dogs. As it turned out, it apparently took 14 minutes for it to 
>>"compute the splits". The job is still running (34% complete). 
>>Thus, it does seem like Doug was right about this being the problem.
>
>I have no idea why this takes so long.  We should profile this 
>operation to figure out what's going on, because it shouldn't take 
>anywhere near that long.  It should be easy to write a simple 
>program that constructs a JobConf and InputFormat like those used in 
>this job, and calls getSplits().  Then profile this as a standalone 
>program to see where the time is going.  Probably you don't really 
>want to profile something that takes 14 minutes, so perhaps profile 
>it on a subset of the input.

Everyone should probably be made aware of the strange behavior we see 
during indexing, at least for a relatively large number of large 
segments (topN=500K, depth=20) with a relatively large crawldb (50M 
URLs). Note that this was all performed with ipc.client.timeout set 
to 30 minutes.

After launching the indexing job, the web UI shows all of the 
TaskTrackers, but the numbers in the "Secs since heartbeat" column 
just keep increasing. This goes on for about 10 minutes until the 
JobTracker finally loses all of them (and the tasks they were working 
on), as is shown in its log:

060210 224115 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060210 225151 Lost tracker 'tracker_37064'
060210 225151 Task 'task_m_4ftk58' has been lost.
060210 225151 Task 'task_m_6ww2ri' has been lost.

...(snip)...

060210 225151 Task 'task_r_y6d190' has been lost.
060210 225151 Lost tracker 'tracker_92921'
060210 225151 Task 'task_m_9p24at' has been lost.

...(etc)...

At this point, the web UI is still up, the job shows 0% complete, and 
the TaskTrackers table is empty. Things go on like this for an hour 
or so, during which any rational person would probably want to kill 
the job and start over.

Don't do this! Keep the faith!!!

About an hour later, the JobTracker magically reestablishes its 
connection to the TaskTrackers (which now have new names), as is 
shown in its log:

060210 225151 Task 'task_r_yj3y3o' has been lost.
060210 235403 Adding task 'task_m_k9u9a8' to set for tracker 'tracker_85874'
060210 235404 Adding task 'task_m_pijt4q' to set for tracker 'tracker_61888'

...(etc)...

The web UI also shows that the TaskTrackers are back (with their new names).

There's nothing in the TaskTracker logs during the initial 10 
minutes, then a bunch of exiting and closing messages, until finally 
the TaskTrackers start "Reinitializing local state":

060210 225403 Stopping server on 50050
060210 230102 Server handler 4 on 50050: exiting

...(snip)...

060210 230105 Server handler 7 on 50050: exiting
060210 232024 Server listener on port 50050: exiting
060210 232403 Stopping server on 50040
060210 234902 Server listener on port 50040: exiting
060210 234925 Server connection on port 50040 from 192.168.1.5: exiting

...(snip)...

060210 235009 Server connection on port 50040 from 192.168.1.10: exiting
060210 235013 Client connection to 192.168.1.4:50040: closing
060210 235014 Client connection to 192.168.1.7:50040: closing
060210 235015 Server connection on port 50040 from 192.168.1.7: exiting
060210 235016 Server handler 0 on 50040: exiting

...(snip)...

060210 235024 Server handler 2 on 50040: exiting
060210 235403 Reinitializing local state
060210 235403 Server listener on port 50050: starting
060210 235403 Server handler 0 on 50050: starting

...(etc)...

During the time that the TaskTrackers are lost, neither the master 
nor the slave machines seem to be using much of the CPU or RAM, and 
the DataNode logs are quiet. I suppose the master machine is probably 
I/O bound, but even that seems mysterious to me. It would 
seem particularly inappropriate for the JobTracker to punt the 
TaskTrackers because the master was too busy to listen for their 
heartbeats.

At any rate, once the TaskTrackers go through the "Reinitializing 
local state" thing, the indexing job seems to proceed normally, and 
it eventually completes with no errors.

FYI,

- Chris
-- 
------------------------
Chris Schneider
TransPac Software, Inc.
Schmed@TransPac.com
------------------------

Re: Recovering from Socket closed

Posted by Doug Cutting <cu...@apache.org>.
Andrzej Bialecki wrote:
> I have an idea - you remember the old issue of a MapFile's "index" being 
> corrupted if the Fetcher was interrupted. Random accesses to MapFiles 
> would take ages in that case. Does calculating splits involve random 
> access to the segment's MapFiles?

No, calculating splits just lists directories and then gets the size of 
each file.  So this could point to an NDFS name node performance 
problem, or an RPC performance problem.  Each file size request is an 
RPC, and there could be hundreds or even thousands of input files, but 
even thousands of RPCs shouldn't take 14 minutes.

Doug

Re: Recovering from Socket closed

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> Chris Schneider wrote:
>> I was unable to even get the indexing phase started; I would get a 
>> timeout right at the beginning. I tried increasing the 
>> ipc.client.timeout from 5 minutes to 10 minutes, but that didn't 
>> help. In desperation, I increased it to 30 minutes and went to walk 
>> the dogs. As it turned out, it apparently took 14 minutes for it to 
>> "compute the splits". The job is still running (34% complete). Thus, 
>> it does seem like Doug was right about this being the problem.
>
> I have no idea why this takes so long.  We should profile this 
> operation to figure out what's going on, because it shouldn't take anywhere 
> near that long.  It should be easy to write a simple program that 
> constructs a JobConf and InputFormat like those used in this job, and 
> calls getSplits().  Then profile this as a standalone program to see 
> where the time is going.  Probably you don't really want to profile 
> something that takes 14 minutes, so perhaps profile it on a subset of 
> the input.

I have an idea - you remember the old issue of a MapFile's "index" being 
corrupted if the Fetcher was interrupted. Random accesses to MapFiles 
would take ages in that case. Does calculating splits involve random 
access to the segment's MapFiles?

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Recovering from Socket closed

Posted by Doug Cutting <cu...@apache.org>.
Chris Schneider wrote:
> I was unable to even get the indexing phase started; I would get a timeout right at the beginning. I tried increasing the ipc.client.timeout from 5 minutes to 10 minutes, but that didn't help. In desperation, I increased it to 30 minutes and went to walk the dogs. As it turned out, it apparently took 14 minutes for it to "compute the splits". The job is still running (34% complete). Thus, it does seem like Doug was right about this being the problem.

I have no idea why this takes so long.  We should profile this operation 
to figure out what's going on, because it shouldn't take anywhere near that 
long.  It should be easy to write a simple program that constructs a 
JobConf and InputFormat like those used in this job, and calls 
getSplits().  Then profile this as a standalone program to see where the 
time is going.  Probably you don't really want to profile something that 
takes 14 minutes, so perhaps profile it on a subset of the input.
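
Something along these lines would do as a starting point (only a sketch: it 
assumes a SequenceFile-style input over the segment directories and the plain 
mapred getSplits() signature, so adjust it to whatever the indexing job 
actually configures):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

// Standalone driver: build a JobConf over the same input directories the
// indexing job uses, call getSplits(), and report how long it took.  Run
// it under a profiler (or on a subset of the input) to see where the time
// goes.
public class TimeGetSplits {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf();
    for (String dir : args) {                        // segment dirs to index
      FileInputFormat.addInputPath(job, new Path(dir));
    }
    SequenceFileInputFormat inputFormat = new SequenceFileInputFormat();
    long start = System.currentTimeMillis();
    InputSplit[] splits = inputFormat.getSplits(job, job.getNumMapTasks());
    System.out.println(splits.length + " splits computed in "
        + (System.currentTimeMillis() - start) + " ms");
  }
}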

Doug