Posted to user@nutch.apache.org by Chris Schneider <Sc...@TransPac.com> on 2006/02/01 03:49:14 UTC
Re: Recovering from Socket closed
Doug, et. al.,
At 2:48 PM -0800 1/31/06, Chris Schneider wrote:
>Is there a work-around? Based on some other postings, I've increased ipc.client.timeout to 300000 (5 minutes). Does this property also control the timeout for the RPC call you describe above? If so, should I increase this timeout further? Is there a better way for us to avoid getting caught by the RPC timeout you describe? This crawl was only a medium-sized test. We hope to execute a much larger crawl over the next few days.
I was unable to even get the indexing phase started; I would get a timeout right at the beginning. I tried increasing the ipc.client.timeout from 5 minutes to 10 minutes, but that didn't help. In desperation, I increased it to 30 minutes and went to walk the dogs. As it turned out, it apparently took 14 minutes for it to "compute the splits". The job is still running (34% complete). Thus, it does seem like Doug was right about this being the problem.
This is actually the third time we've increased ipc.client.timeout to solve crawl problems. I wonder whether we should just set it to an hour or two until the problem in the code is fixed. I guess I could also look into fixing the problem myself :)
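For reference, the timeout in question is set in conf/nutch-site.xml. This fragment (an editorial sketch, not from the thread) shows the 30-minute value that finally worked; ipc.client.timeout is the stock property name, and the value is in milliseconds:

```xml
<!-- Fragment for the <configuration> element of conf/nutch-site.xml.
     Overrides the RPC client timeout; value is in milliseconds,
     so 1800000 ms = 30 minutes. -->
<property>
  <name>ipc.client.timeout</name>
  <value>1800000</value>
  <description>RPC client timeout in milliseconds.</description>
</property>
```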
Thanks for the help, Doug!
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
Schmed@TransPac.com
------------------------
Re: Recovering from Socket closed
Posted by Chris Schneider <Sc...@TransPac.com>.
Gang,
>Chris Schneider wrote:
>>I was unable to even get the indexing phase started; I would get a
>>timeout right at the beginning. I tried increasing the
>>ipc.client.timeout from 5 minutes to 10 minutes, but that didn't
>>help. In desperation, I increased it to 30 minutes and went to walk
>>the dogs. As it turned out, it apparently took 14 minutes for it to
>>"compute the splits". The job is still running (34% complete).
>>Thus, it does seem like Doug was right about this being the problem.
>
>I have no idea why this takes so long. We should profile this
>operation to figure out what's going on, because it shouldn't take
>anywhere near that long. It should be easy to write a simple
>program that constructs a JobConf and InputFormat like those used in
>this job, and calls getSplits(). Then profile this as a standalone
>program to see where the time is going. Probably you don't really
>want to profile something that takes 14 minutes, so perhaps profile
>it on a subset of the input.
Everyone should probably be made aware of the strange behavior we see
during indexing, at least for a relatively large number of large
segments (topN=500K, depth=20) with a relatively large crawldb (50M
URLs). Note that this was all performed with ipc.client.timeout set
to 30 minutes.
After launching the indexing job, the web UI shows all of the
TaskTrackers, but the numbers in the "Secs since heartbeat" column
just keep increasing. This goes on for about 10 minutes until the
JobTracker finally loses all of them (and the tasks they were working
on), as is shown in its log:
060210 224115 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060210 225151 Lost tracker 'tracker_37064'
060210 225151 Task 'task_m_4ftk58' has been lost.
060210 225151 Task 'task_m_6ww2ri' has been lost.
...(snip)...
060210 225151 Task 'task_r_y6d190' has been lost.
060210 225151 Lost tracker 'tracker_92921'
060210 225151 Task 'task_m_9p24at' has been lost.
...(etc)...
At this point, the web UI is still up, the job shows 0% complete, and
the TaskTrackers table is empty. It goes on for an hour or so like
this, during which any rational person would probably want to kill
the job and start over.
Don't do this! Keep the faith!!!
About an hour later, the JobTracker magically reestablishes its
connection to the TaskTrackers (which now have new names), as is
shown in its log:
060210 225151 Task 'task_r_yj3y3o' has been lost.
060210 235403 Adding task 'task_m_k9u9a8' to set for tracker 'tracker_85874'
060210 235404 Adding task 'task_m_pijt4q' to set for tracker 'tracker_61888'
...(etc)...
The web UI also shows that the TaskTrackers are back (with their new names).
There's nothing in the TaskTracker logs during the initial 10
minutes, then a bunch of exiting and closing messages, until finally
the TaskTrackers start "Reinitializing local state":
060210 225403 Stopping server on 50050
060210 230102 Server handler 4 on 50050: exiting
...(snip)...
060210 230105 Server handler 7 on 50050: exiting
060210 232024 Server listener on port 50050: exiting
060210 232403 Stopping server on 50040
060210 234902 Server listener on port 50040: exiting
060210 234925 Server connection on port 50040 from 192.168.1.5: exiting
...(snip)...
060210 235009 Server connection on port 50040 from 192.168.1.10: exiting
060210 235013 Client connection to 192.168.1.4:50040: closing
060210 235014 Client connection to 192.168.1.7:50040: closing
060210 235015 Server connection on port 50040 from 192.168.1.7: exiting
060210 235016 Server handler 0 on 50040: exiting
...(snip)...
060210 235024 Server handler 2 on 50040: exiting
060210 235403 Reinitializing local state
060210 235403 Server listener on port 50050: starting
060210 235403 Server handler 0 on 50050: starting
...(etc)...
During the time that the TaskTrackers are lost, neither the master
nor the slave machines seem to be using much of the CPU or RAM, and
the DataNode logs are quiet. I suppose that it's probably I/O bound
on the master machine, but even that seems mysterious to me. It would
seem particularly inappropriate for the JobTracker to punt the
TaskTrackers because the master was too busy to listen for their
heartbeats.
At any rate, once the TaskTrackers go through the "Reinitializing
local state" thing, the indexing job seems to proceed normally, and
it eventually completes with no errors.
FYI,
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
Schmed@TransPac.com
------------------------
Re: Recovering from Socket closed
Posted by Doug Cutting <cu...@apache.org>.
Andrzej Bialecki wrote:
> I have an idea - do you remember the old issue of a MapFile's "index"
> being corrupted if the Fetcher was interrupted? Random accesses to
> MapFiles would take ages in that case. Does calculating splits involve
> random access to the segment's MapFiles?
No, calculating splits just lists directories and then gets the size of
each file. So this could point to an NDFS name node performance
problem, or an RPC performance problem. Each file size request is an
RPC, and there could be hundreds or even thousands of input files, but
even thousands of RPCs shouldn't take 14 minutes.
Doug
Re: Recovering from Socket closed
Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> Chris Schneider wrote:
>> I was unable to even get the indexing phase started; I would get a
>> timeout right at the beginning. I tried increasing the
>> ipc.client.timeout from 5 minutes to 10 minutes, but that didn't
>> help. In desperation, I increased it to 30 minutes and went to walk
>> the dogs. As it turned out, it apparently took 14 minutes for it to
>> "compute the splits". The job is still running (34% complete). Thus,
>> it does seem like Doug was right about this being the problem.
>
> I have no idea why this takes so long. We should profile this
> operation to figure out what's going on, because it shouldn't take anywhere
> near that long. It should be easy to write a simple program that
> constructs a JobConf and InputFormat like those used in this job, and
> calls getSplits(). Then profile this as a standalone program to see
> where the time is going. Probably you don't really want to profile
> something that takes 14 minutes, so perhaps profile it on a subset of
> the input.
I have an idea - do you remember the old issue of a MapFile's "index"
being corrupted if the Fetcher was interrupted? Random accesses to
MapFiles would take ages in that case. Does calculating splits involve
random access to the segment's MapFiles?
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Recovering from Socket closed
Posted by Doug Cutting <cu...@apache.org>.
Chris Schneider wrote:
> I was unable to even get the indexing phase started; I would get a timeout right at the beginning. I tried increasing the ipc.client.timeout from 5 minutes to 10 minutes, but that didn't help. In desperation, I increased it to 30 minutes and went to walk the dogs. As it turned out, it apparently took 14 minutes for it to "compute the splits". The job is still running (34% complete). Thus, it does seem like Doug was right about this being the problem.
I have no idea why this takes so long. We should profile this operation
to figure out what's going on, because it shouldn't take anywhere near that
long. It should be easy to write a simple program that constructs a
JobConf and InputFormat like those used in this job, and calls
getSplits(). Then profile this as a standalone program to see where the
time is going. Probably you don't really want to profile something that
takes 14 minutes, so perhaps profile it on a subset of the input.
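The harness described above might be sketched as follows (an editorial sketch, not code from the thread). The Hadoop-specific calls -- constructing a JobConf and InputFormat and invoking getSplits() -- are represented by a placeholder Runnable, since the real calls depend on the Nutch/Hadoop classpath; substitute the actual getSplits() invocation when profiling for real:

```java
// Minimal standalone timing harness: time one operation in isolation,
// then run it under a profiler once the slow path is confirmed.
public class TimeGetSplits {
    static long timeMillis(Runnable work) {
        long start = System.currentTimeMillis();
        work.run();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        // Placeholder for: new JobConf(...); inputFormat.getSplits(...);
        Runnable fakeGetSplits = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i; // stand-in work
        };
        System.out.println("getSplits took " + timeMillis(fakeGetSplits) + " ms");
    }
}
```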
Doug