Posted to common-user@hadoop.apache.org by James Moore <ja...@gmail.com> on 2008/05/07 22:29:13 UTC

Read timed out, Abandoning block blk_-5476242061384228962

What is this bit of the log trying to tell me, and what sorts of
things should I be looking at to make sure it doesn't happen?

I don't think the network has any basic configuration issues - I can
telnet from the machine creating this log to the destination - telnet
10.252.222.239 50010 works fine when I ssh in to the box with this
error.

2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient:
Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient:
Abandoning block blk_-5476242061384228962
2008-05-07 13:20:31,196 INFO org.apache.hadoop.dfs.DFSClient: Waiting
to find target node: 10.252.222.239:50010

I'm seeing a fair number of these.  My reduces finally complete, but
there are usually a couple at the end that take longer than I think
they should, and they frequently have these sorts of errors.

I'm running 20 machines on ec2 right now, with hadoop version 0.16.4.
-- 
James Moore | james@restphone.com
blog.restphone.com

Re: Read timed out, Abandoning block blk_-5476242061384228962

Posted by Hairong Kuang <ha...@yahoo-inc.com>.
Taking the timeout out entirely is very dangerous: it may cause your
application to hang. Instead, change the timeout parameter to a larger
number. HADOOP-2188 fixed the problem; see
https://issues.apache.org/jira/browse/HADOOP-2188.
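For reference, a minimal sketch of what "a larger number" could look like in hadoop-site.xml. This assumes the 0.16.x property name (James mentions below that the RPC timeout is ipc.client.timeout, in milliseconds, default 60000); HADOOP-2188 later replaces this fixed timeout with a periodic ping, so check the JIRA against your version before relying on it:

```xml
<!-- hadoop-site.xml: raise the RPC timeout from the 60s default.
     Property name assumed for the 0.16.x line; HADOOP-2188 changes
     this behavior in later releases. -->
<property>
  <name>ipc.client.timeout</name>
  <value>120000</value>
</property>
```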

Hairong

On 5/7/08 2:36 PM, "James Moore" <ja...@gmail.com> wrote:

> I noticed that there was a hard-coded timeout value of 6000 (ms) in
> src/java/org/apache/hadoop/dfs/DFSClient.java - as an experiment, I
> took that way down and now I'm not noticing the problem.  (Doesn't
> mean it's not there, I just don't feel the pain...)
> 
> This feels like a terrible solution^H^H^H^H^H^H^H^Hhack though,
> particularly since I haven't yet taken the time to actually understand
> the code.


Re: Read timed out, Abandoning block blk_-5476242061384228962

Posted by James Moore <ja...@gmail.com>.
I noticed that there was a hard-coded timeout value of 6000 (ms) in
src/java/org/apache/hadoop/dfs/DFSClient.java - as an experiment, I
took that way down and now I'm not noticing the problem.  (Doesn't
mean it's not there, I just don't feel the pain...)

This feels like a terrible solution^H^H^H^H^H^H^H^Hhack though,
particularly since I haven't yet taken the time to actually understand
the code.

-- 
James Moore | james@restphone.com
blog.restphone.com

Re: Read timed out, Abandoning block blk_-5476242061384228962

Posted by James Moore <ja...@gmail.com>.
On Sun, May 11, 2008 at 9:45 PM, Dhruba Borthakur <dh...@gmail.com> wrote:
>  How much memory does your machine have and how many files does your
>  HDFS have? One possibility is that the memory pressure of the
>  map-reduce jobs causes more GC runs for the namenode process.

I'm using machines with 16G of memory, and there aren't that many
files yet - about 14,000 total, according to dfs -lsr /.  The admin
web page reports 1.32 TB used, but I'm assuming that's including
replication (set to 4).

19 machines total in the cluster (planned to have 20, but one didn't
come up initially and I haven't bothered to add it), and I'm currently
not running maps/reduces on the master.

And earlier, when I said I'm not seeing the timeouts, I spoke too
soon.  One of my last runs had 3 maps fail, with these sorts of
errors:

Failed to rename output with the exception:
java.net.SocketTimeoutException: timed out waiting for rpc response
	at org.apache.hadoop.ipc.Client.call(Client.java:514)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198)
	at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
	at org.apache.hadoop.dfs.DFSClient.delete(DFSClient.java:426)
	at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:149)
	at org.apache.hadoop.mapred.Task.saveTaskOutput(Task.java:430)
	at org.apache.hadoop.mapred.JobTracker$TaskCommitQueue.run(JobTracker.java:2013)

This is somewhat painful, since these are Nutch fetch maps, meaning
that they're getting data via HTTP from crawled sites.  Looks like
it's not a permanent problem, though - just doing some crawls twice.

The default timeouts for the RPCs seem high enough (ipc.client.timeout
== 60000).  Is there something else I should look at?  Watching 'top'
during a run, it didn't seem like the master ever went below 80%, so
it seems a little unlikely that the machine isn't responding for 60
seconds.

Should I increase the memory for the datanode processes?  It's got
-Xmx2000m right now.  And is there a way to do that with the existing
scripts/configurations?  Obviously I can do it by hand, but I'm not
seeing something like the mapred.child.java.opts setting for the
master.
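A minimal sketch of the hadoop-env.sh route, assuming HADOOP_HEAPSIZE (in MB) is the relevant knob here; in the stock scripts of this era it is cluster-wide rather than per-daemon, so verify against your conf/hadoop-env.sh:

```shell
# conf/hadoop-env.sh -- sketch, assuming HADOOP_HEAPSIZE (in MB) is the
# setting that becomes -Xmx for every daemon started by the bin/ scripts.
# It applies to all daemons at once; there is no per-daemon equivalent of
# mapred.child.java.opts in these scripts, as far as I can tell.
export HADOOP_HEAPSIZE=3000
```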

-- 
James Moore | james@restphone.com
blog.restphone.com

Re: Read timed out, Abandoning block blk_-5476242061384228962

Posted by Dhruba Borthakur <dh...@gmail.com>.
You bring up an interesting point. A big chunk of the code in the
Namenode runs inside a global lock, although there are pieces
(e.g. a portion of the code that chooses datanodes for a newly allocated
block) that execute outside this lock. So it is probably the case
that the namenode does not benefit from more than 4 cores or so (with
the current code).

If you have 8 cores, you can experiment with running map-reduce jobs
on the other 4 cores.

How much memory does your machine have and how many files does your
HDFS have? One possibility is that the memory pressure of the
map-reduce jobs causes more GC runs for the namenode process.

thanks,
dhruba


On Fri, May 9, 2008 at 7:54 PM, James Moore <ja...@gmail.com> wrote:
> On Fri, May 9, 2008 at 12:00 PM, Hairong Kuang <ha...@yahoo-inc.com> wrote:
>  >> I'm using the machine running the namenode to run maps as well.
>  > Please do not run maps on the machine that is running the namenode. This
>  > causes CPU contention and slows down the namenode, making
>  > SocketTimeoutExceptions more likely.
>  >
>  > Hairong
>
>  I've turned off running tasks on the master, and I'm not seeing those errors.
>
>  The behavior was interesting.  On one job, I saw a total of 11 timeout
>  failures (where the map was reported as a failure), but all of them
>  happened in the first few minutes.  After that it worked well and
>  completed correctly.
>
>  I'm wondering if it's worth it, though.  If the number of maps/reduces
>  that the master machine can run is substantially greater than the
>  number of failures due to timeouts, isn't it worth having the master
>  run tasks?  It seems like there's probably a point where the number of
>  machines in the cluster makes having a separate master a requirement,
>  but at 20 8-core machines, it's not clear that dedicating a box to
>  being the master is a win.  (And having a smaller machine dedicated to
>  being the master is cheaper, but annoying.  I'd rather have N
>  identical boxes running the same AMI, etc.)
>
>  To anyone using amazon - definitely upgrade to the new kernels.  I now
>  have very few instances of the 'Exception in
>  createBlockOutputStream' error that started this thread in my logs.
>  (These are different from the 11 timeouts I mentioned above, FYI.)
>
>  The ones that are there all happened in one burst at 03:59:22 this afternoon:
>
>  james@domU-12-31-38-00-D1-B1:~/dev/hadoop$ bin/slaves.sh grep -r
>  'Exception in createBlockOutputStream' ~/dev/hadoop/logs/
>  domU-12-31-38-00-04-51.compute-1.internal:
>  /home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000024_0/syslog:2008-05-09
>  03:59:22,713 INFO org.apache.hadoop.dfs.DFSClient: Exception in
>  createBlockOutputStream java.io.EOFException
>  domU-12-31-38-00-D6-21.compute-1.internal:
>  /home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000048_0/syslog:2008-05-09
>  03:59:22,989 INFO org.apache.hadoop.dfs.DFSClient: Exception in
>  createBlockOutputStream java.io.IOException: Bad connect ack with
>  firstBadLink 10.252.22.111:50010
>  domU-12-31-38-00-D6-21.compute-1.internal:
>  /home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000061_0/syslog:2008-05-09
>  03:59:22,398 INFO org.apache.hadoop.dfs.DFSClient: Exception in
>  createBlockOutputStream java.io.EOFException
>  domU-12-31-38-00-60-D1.compute-1.internal:
>  /home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000017_0/syslog:2008-05-09
>  03:59:22,880 INFO org.apache.hadoop.dfs.DFSClient: Exception in
>  createBlockOutputStream java.io.IOException: Bad connect ack with
>  firstBadLink 10.252.217.203:50010
>  domU-12-31-38-00-CD-41.compute-1.internal:
>  /home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000051_0/syslog:2008-05-09
>  03:59:23,012 INFO org.apache.hadoop.dfs.DFSClient: Exception in
>  createBlockOutputStream java.io.IOException: Bad connect ack with
>  firstBadLink 10.252.34.31:50010
>  domU-12-31-38-00-D5-E1.compute-1.internal:
>  /home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000026_0/syslog:2008-05-09
>  03:59:24,551 INFO org.apache.hadoop.dfs.DFSClient: Exception in
>  createBlockOutputStream java.io.IOException: Bad connect ack with
>  firstBadLink 10.252.15.47:50010
>  domU-12-31-38-00-1D-D1.compute-1.internal:
>  /home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000056_0/syslog:2008-05-09
>  03:59:23,504 INFO org.apache.hadoop.dfs.DFSClient: Exception in
>  createBlockOutputStream java.io.IOException: Bad connect ack with
>  firstBadLink 10.252.11.159:50010
>  domU-12-31-38-00-1D-D1.compute-1.internal:
>  /home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000050_0/syslog:2008-05-09
>  03:59:22,454 INFO org.apache.hadoop.dfs.DFSClient: Exception in
>  createBlockOutputStream java.io.EOFException
>  domU-12-31-38-00-1D-D1.compute-1.internal:
>  /home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000009_0/syslog:2008-05-09
>  03:59:22,944 INFO org.apache.hadoop.dfs.DFSClient: Exception in
>  createBlockOutputStream java.io.EOFException
>  domU-12-31-38-00-D8-81.compute-1.internal:
>  /home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000002_0/syslog:2008-05-09
>  03:59:22,420 INFO org.apache.hadoop.dfs.DFSClient: Exception in
>  createBlockOutputStream java.io.EOFException
>  domU-12-31-38-00-D8-81.compute-1.internal:
>  /home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000072_0/syslog:2008-05-09
>  03:59:22,318 INFO org.apache.hadoop.dfs.DFSClient: Exception in
>  createBlockOutputStream java.io.EOFException
>  domU-12-31-38-00-08-C1.compute-1.internal:
>  /home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000021_0/syslog:2008-05-09
>  03:59:24,150 INFO org.apache.hadoop.dfs.DFSClient: Exception in
>  createBlockOutputStream java.io.IOException: Bad connect ack with
>  firstBadLink 10.252.22.111:50010
>  domU-12-31-38-00-C9-51.compute-1.internal:
>  /home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000045_0/syslog:2008-05-09
>  03:59:24,470 INFO org.apache.hadoop.dfs.DFSClient: Exception in
>  createBlockOutputStream java.io.IOException: Bad connect ack with
>  firstBadLink 10.252.22.111:50010
>  domU-12-31-38-00-C9-51.compute-1.internal:
>  /home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000055_0/syslog:2008-05-09
>  03:59:21,588 INFO org.apache.hadoop.dfs.DFSClient: Exception in
>  createBlockOutputStream java.io.EOFException
>
>
>  --
>  James Moore | james@restphone.com
>  blog.restphone.com
>

Re: Read timed out, Abandoning block blk_-5476242061384228962

Posted by James Moore <ja...@gmail.com>.
On Fri, May 9, 2008 at 12:00 PM, Hairong Kuang <ha...@yahoo-inc.com> wrote:
>> I'm using the machine running the namenode to run maps as well.
> Please do not run maps on the machine that is running the namenode. This
> causes CPU contention and slows down the namenode, making
> SocketTimeoutExceptions more likely.
>
> Hairong

I've turned off running tasks on the master, and I'm not seeing those errors.

The behavior was interesting.  On one job, I saw a total of 11 timeout
failures (where the map was reported as a failure), but all of them
happened in the first few minutes.  After that it worked well and
completed correctly.

I'm wondering if it's worth it, though.  If the number of maps/reduces
that the master machine can run is substantially greater than the
number of failures due to timeouts, isn't it worth having the master
run tasks?  It seems like there's probably a point where the number of
machines in the cluster makes having a separate master a requirement,
but at 20 8-core machines, it's not clear that dedicating a box to
being the master is a win.  (And having a smaller machine dedicated to
being the master is cheaper, but annoying.  I'd rather have N
identical boxes running the same AMI, etc.)

To anyone using amazon - definitely upgrade to the new kernels.  I now
have very few instances of the 'Exception in
createBlockOutputStream' error that started this thread in my logs.
(These are different from the 11 timeouts I mentioned above, FYI.)

The ones that are there all happened in one burst at 03:59:22 this afternoon:

james@domU-12-31-38-00-D1-B1:~/dev/hadoop$ bin/slaves.sh grep -r
'Exception in createBlockOutputStream' ~/dev/hadoop/logs/
domU-12-31-38-00-04-51.compute-1.internal:
/home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000024_0/syslog:2008-05-09
03:59:22,713 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.EOFException
domU-12-31-38-00-D6-21.compute-1.internal:
/home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000048_0/syslog:2008-05-09
03:59:22,989 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with
firstBadLink 10.252.22.111:50010
domU-12-31-38-00-D6-21.compute-1.internal:
/home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000061_0/syslog:2008-05-09
03:59:22,398 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.EOFException
domU-12-31-38-00-60-D1.compute-1.internal:
/home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000017_0/syslog:2008-05-09
03:59:22,880 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with
firstBadLink 10.252.217.203:50010
domU-12-31-38-00-CD-41.compute-1.internal:
/home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000051_0/syslog:2008-05-09
03:59:23,012 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with
firstBadLink 10.252.34.31:50010
domU-12-31-38-00-D5-E1.compute-1.internal:
/home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000026_0/syslog:2008-05-09
03:59:24,551 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with
firstBadLink 10.252.15.47:50010
domU-12-31-38-00-1D-D1.compute-1.internal:
/home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000056_0/syslog:2008-05-09
03:59:23,504 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with
firstBadLink 10.252.11.159:50010
domU-12-31-38-00-1D-D1.compute-1.internal:
/home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000050_0/syslog:2008-05-09
03:59:22,454 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.EOFException
domU-12-31-38-00-1D-D1.compute-1.internal:
/home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000009_0/syslog:2008-05-09
03:59:22,944 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.EOFException
domU-12-31-38-00-D8-81.compute-1.internal:
/home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000002_0/syslog:2008-05-09
03:59:22,420 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.EOFException
domU-12-31-38-00-D8-81.compute-1.internal:
/home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000072_0/syslog:2008-05-09
03:59:22,318 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.EOFException
domU-12-31-38-00-08-C1.compute-1.internal:
/home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000021_0/syslog:2008-05-09
03:59:24,150 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with
firstBadLink 10.252.22.111:50010
domU-12-31-38-00-C9-51.compute-1.internal:
/home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000045_0/syslog:2008-05-09
03:59:24,470 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with
firstBadLink 10.252.22.111:50010
domU-12-31-38-00-C9-51.compute-1.internal:
/home/james/dev/hadoop/logs/userlogs/task_200805082159_0043_r_000055_0/syslog:2008-05-09
03:59:21,588 INFO org.apache.hadoop.dfs.DFSClient: Exception in
createBlockOutputStream java.io.EOFException
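A quick way to tally a burst like this by exception type, as a sketch with standard tools; it assumes the exception class name immediately follows "createBlockOutputStream" on each log line, as in the listing above:

```shell
# Tally 'Exception in createBlockOutputStream' log lines by exception class.
# Sketch: assumes the class name directly follows "createBlockOutputStream";
# a trailing ": message" (as in the IOException lines) is stripped because
# ':' is not in the captured character class.
tally_block_errors() {
  sed -n 's/.*createBlockOutputStream \([A-Za-z0-9.$]*\).*/\1/p' \
    | sort | uniq -c | sort -rn
}

# Demo on lines in the same format as the listing above:
printf '%s\n' \
  'INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException' \
  'INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.252.22.111:50010' \
  'INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException' \
  | tally_block_errors
```

Piping the output of the slaves.sh grep shown above into tally_block_errors gives a per-class count for the whole cluster.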

-- 
James Moore | james@restphone.com
blog.restphone.com

Re: Read timed out, Abandoning block blk_-5476242061384228962

Posted by Hairong Kuang <ha...@yahoo-inc.com>.
> I'm using the machine running the namenode to run maps as well.
Please do not run maps on the machine that is running the namenode. This
causes CPU contention and slows down the namenode, making
SocketTimeoutExceptions more likely.

Hairong 

On 5/9/08 11:24 AM, "James Moore" <ja...@gmail.com> wrote:

> On Wed, May 7, 2008 at 2:45 PM, Chris K Wensel <ch...@wensel.net> wrote:
>> Hi James
>> 
>> Were you able to start all the nodes in the same 'availability zone'? You
>> using the new AMI kernels?
> 
> After I saw your note, I restarted new instances with the new kernels
> (aki-b51cf9dc and ari-b31cf9da) and made sure everything was in the
> same availability zone.
> 
> >> If you are using the contrib/ec2 scripts, you might upgrade (just the
> >> scripts) to
> >> http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.17/src/contrib/ec2/
> 
> I'll take a look at these - I've been doing it by hand.
> 
> Hairong wrote:
>> Taking the timeout out is very dangerous. It may cause your application to
>> hang. You could change the timeout parameter to a larger number.
> 
> Thanks - reducing the timeout did seem like a bad idea.  With the new
> kernels, I'm seeing timeout errors like this:
> 
> java.net.SocketTimeoutException: timed out waiting for rpc response
>         at org.apache.hadoop.ipc.Client.call(Client.java:514)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198)
>         at org.apache.hadoop.dfs.$Proxy5.mkdirs(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at org.apache.hadoop.dfs.$Proxy5.mkdirs(Unknown Source)
>         at org.apache.hadoop.dfs.DFSClient.mkdirs(DFSClient.java:550)
>         at org.apache.hadoop.dfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:184)
>         at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:982)
>         at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:1429)
>         at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.launchTask(TaskTracker.java:1493)
>         at org.apache.hadoop.mapred.TaskTracker.launchTaskForJob(TaskTracker.java:700)
>         at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:693)
>         at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1282)
>         at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:923)
>         at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1318)
>         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2210)
> 
> I'll experiment with increasing the timeout.
> 
> I'm using the machine running the namenode to run maps as well - could
> that be a source of my problem?  The load is fairly high, essentially
> no idle time.  8 cores per machine, so I've got 8 maps running.  I'm
> guessing I'd be better off running 80 smaller machines instead of 20
> larger ones for the same price, but we haven't been approved for more
> than 20 instances yet.  Given that I'm not seeing any idle time, I'm
> assuming that I'm CPU not IO-bound.
> 
> Cpu(s): 89.6%us,  5.7%sy,  0.0%ni,  0.6%id,  0.0%wa,  0.0%hi,  0.1%si,  4.0%st
> Mem:  15736360k total, 14935708k used,   800652k free,   237980k buffers
> Swap:        0k total,        0k used,        0k free,  7545100k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 28955 james     21   0 1308m 750m 9440 S  121  4.9   0:36.61
> /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
> -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
> 28989 james     18   0 1298m 725m 9376 S  120  4.7   0:30.48
> /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
> -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
> 29029 james     18   0 1349m 504m 9376 S  117  3.3   0:24.55
> /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
> -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
> 29059 james     18   0 1301m 313m 9428 S   81  2.0   0:16.51
> /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
> -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
> 25658 james     20   0 1293m 277m 9204 S    8  1.8   0:29.98
> /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
> -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
> 25756 james     19   0 1286m 412m 9204 S    3  2.7   0:30.66
> /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
> -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
> 25688 james     19   0 1286m 332m 9204 S    2  2.2   0:28.69
> /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
> -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
>  1141 james     24   0 2332m 281m 8932 S    1  1.8   3:56.17
> /usr/lib/jvm/java-6-sun-1.6.0.03/bin/java -Xmx2000m
> -Dcom.sun.management.jmxremote
> -Dhadoop.log.dir=/home/james/dev/hadoop/logs
> -Dhadoop.log.file=hadoop-james-jobtracker-domU-1
> 25724 james     19   0 1286m 386m 9204 S    1  2.5   0:28.96
> /usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
> -Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
>   822 james     24   0 2306m  91m 8912 S    0  0.6   3:15.12
> /usr/lib/jvm/java-6-sun-1.6.0.03/bin/java -Xmx2000m
> -Dcom.sun.management.jmxremote
> -Dhadoop.log.dir=/home/james/dev/hadoop/logs
> -Dhadoop.log.file=hadoop-james-namenode-domU-12-
> 
> FYI, I'm using JRuby to do the work in the map tasks.  It's working well so
> far.


Re: Read timed out, Abandoning block blk_-5476242061384228962

Posted by Chris K Wensel <ch...@wensel.net>.
> I'm using the machine running the namenode to run maps as well - could
> that be a source of my problem?  The load is fairly high, essentially
> no idle time.  8 cores per machine, so I've got 8 maps running.  I'm
> guessing I'd be better off running 80 smaller machines instead of 20
> larger ones for the same price, but we haven't been approved for more
> than 20 instances yet.  Given that I'm not seeing any idle time, I'm
> assuming that I'm CPU not IO-bound.


fwiw, I have not found the large or xlarge EC2 instances
proportionally faster with Hadoop, so we run many small instances
more cheaply.

btw, the email notifying you that you have been approved may lag the
actual approval (mine did, by days). It might be worth trying a larger
cluster to see.

Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/

Re: Read timed out, Abandoning block blk_-5476242061384228962

Posted by James Moore <ja...@gmail.com>.
On Wed, May 7, 2008 at 2:45 PM, Chris K Wensel <ch...@wensel.net> wrote:
> Hi James
>
> Were you able to start all the nodes in the same 'availability zone'? You
> using the new AMI kernels?

After I saw your note, I restarted new instances with the new kernels
(aki-b51cf9dc and ari-b31cf9da) and made sure everything was in the
same availability zone.

> If you are using the contrib/ec2 scripts, you might upgrade (just the
> scripts) to
> http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.17/src/contrib/ec2/

I'll take a look at these - I've been doing it by hand.

Hairong wrote:
> Taking the timeout out is very dangerous. It may cause your application to
> hang. You could change the timeout parameter to a larger number.

Thanks - reducing the timeout did seem like a bad idea.  With the new
kernels, I'm seeing timeout errors like this:

java.net.SocketTimeoutException: timed out waiting for rpc response
	at org.apache.hadoop.ipc.Client.call(Client.java:514)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198)
	at org.apache.hadoop.dfs.$Proxy5.mkdirs(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at org.apache.hadoop.dfs.$Proxy5.mkdirs(Unknown Source)
	at org.apache.hadoop.dfs.DFSClient.mkdirs(DFSClient.java:550)
	at org.apache.hadoop.dfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:184)
	at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:982)
	at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:1429)
	at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.launchTask(TaskTracker.java:1493)
	at org.apache.hadoop.mapred.TaskTracker.launchTaskForJob(TaskTracker.java:700)
	at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:693)
	at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1282)
	at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:923)
	at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1318)
	at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2210)

I'll experiment with increasing the timeout.

I'm using the machine running the namenode to run maps as well - could
that be a source of my problem?  The load is fairly high, essentially
no idle time.  8 cores per machine, so I've got 8 maps running.  I'm
guessing I'd be better off running 80 smaller machines instead of 20
larger ones for the same price, but we haven't been approved for more
than 20 instances yet.  Given that I'm not seeing any idle time, I'm
assuming that I'm CPU not IO-bound.

Cpu(s): 89.6%us,  5.7%sy,  0.0%ni,  0.6%id,  0.0%wa,  0.0%hi,  0.1%si,  4.0%st
Mem:  15736360k total, 14935708k used,   800652k free,   237980k buffers
Swap:        0k total,        0k used,        0k free,  7545100k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28955 james     21   0 1308m 750m 9440 S  121  4.9   0:36.61
/usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
-Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
28989 james     18   0 1298m 725m 9376 S  120  4.7   0:30.48
/usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
-Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
29029 james     18   0 1349m 504m 9376 S  117  3.3   0:24.55
/usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
-Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
29059 james     18   0 1301m 313m 9428 S   81  2.0   0:16.51
/usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
-Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
25658 james     20   0 1293m 277m 9204 S    8  1.8   0:29.98
/usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
-Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
25756 james     19   0 1286m 412m 9204 S    3  2.7   0:30.66
/usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
-Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
25688 james     19   0 1286m 332m 9204 S    2  2.2   0:28.69
/usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
-Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
 1141 james     24   0 2332m 281m 8932 S    1  1.8   3:56.17
/usr/lib/jvm/java-6-sun-1.6.0.03/bin/java -Xmx2000m
-Dcom.sun.management.jmxremote
-Dhadoop.log.dir=/home/james/dev/hadoop/logs
-Dhadoop.log.file=hadoop-james-jobtracker-domU-1
25724 james     19   0 1286m 386m 9204 S    1  2.5   0:28.96
/usr/lib/jvm/java-6-sun-1.6.0.03/jre/bin/java
-Djava.library.path=/home/james/dev/hadoop/lib/native/Linux-amd64-64:/home/james/dfsTmp/mapred/local/taskTracker/jobcache/job_2008
  822 james     24   0 2306m  91m 8912 S    0  0.6   3:15.12
/usr/lib/jvm/java-6-sun-1.6.0.03/bin/java -Xmx2000m
-Dcom.sun.management.jmxremote
-Dhadoop.log.dir=/home/james/dev/hadoop/logs
-Dhadoop.log.file=hadoop-james-namenode-domU-12-

FYI, I'm using JRuby to do the work in the map tasks.  It's working well so far.

-- 
James Moore | james@restphone.com
Ruby and Ruby on Rails consulting
blog.restphone.com

Re: Read timed out, Abandoning block blk_-5476242061384228962

Posted by Chris K Wensel <ch...@wensel.net>.
Hi James

Were you able to start all the nodes in the same 'availability zone'?  
You using the new AMI kernels?

If you are using the contrib/ec2 scripts, you might upgrade (just the  
scripts) to
http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.17/src/contrib/ec2/

These support the new kernels and availability zones. My transient  
errors went away when upgrading.

The functional changes are documented here:
http://wiki.apache.org/hadoop/AmazonEC2

fyi, you will need to build your own images (via the create-image  
command) with whatever version of Hadoop you are comfortable with.  
this will also get you a Ganglia install...

ckw

On May 7, 2008, at 1:29 PM, James Moore wrote:

> What is this bit of the log trying to tell me, and what sorts of
> things should I be looking at to make sure it doesn't happen?
>
> I don't think the network has any basic configuration issues - I can
> telnet from the machine creating this log to the destination - telnet
> 10.252.222.239 50010 works fine when I ssh in to the box with this
> error.
>
> 2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient:
> Exception in createBlockOutputStream java.net.SocketTimeoutException:
> Read timed out
> 2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient:
> Abandoning block blk_-5476242061384228962
> 2008-05-07 13:20:31,196 INFO org.apache.hadoop.dfs.DFSClient: Waiting
> to find target node: 10.252.222.239:50010
>
> I'm seeing a fair number of these.  My reduces finally complete, but
> there are usually a couple at the end that take longer than I think
> they should, and they frequently have these sorts of errors.
>
> I'm running 20 machines on ec2 right now, with hadoop version 0.16.4.
> -- 
> James Moore | james@restphone.com
> blog.restphone.com

Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/