Posted to common-user@hadoop.apache.org by stack <st...@duboce.net> on 2008/11/20 22:01:11 UTC

Hung in DFSClient$DFSOutputStream.writeChunk

Over in hbase-space, we trigger a hang in DFSOutputStream.writeChunk.  
Input appreciated.

Here are the two pertinent extracts from the hbase regionserver thread dump:

"IPC Server handler 9 on 60020" daemon prio=10 tid=0x00007fef1c3f0400 
nid=0x7470 waiting for monitor entry 
[0x0000000042d18000..0x0000000042d189f0]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:2486)
    - waiting to lock <0x00007fef38ecc138> (a java.util.LinkedList)
    - locked <0x00007fef38ecbdb8> (a 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
    at 
org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:155)
    at 
org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
    - locked <0x00007fef38ecbdb8> (a 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
    at 
org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:121)
    - locked <0x00007fef38ecbdb8> (a 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
    at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
    at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
    - locked <0x00007fef38ecbdb8> (a 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
    at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:47)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    - locked <0x00007fef38e09fc0> (a 
org.apache.hadoop.fs.FSDataOutputStream)
    at 
org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1016)
    - locked <0x00007fef38e09f30> (a 
org.apache.hadoop.io.SequenceFile$Writer)
    at 
org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:980)
    - locked <0x00007fef38e09f30> (a 
org.apache.hadoop.io.SequenceFile$Writer)
    at org.apache.hadoop.hbase.regionserver.HLog.doWrite(HLog.java:461)
    at org.apache.hadoop.hbase.regionserver.HLog.append(HLog.java:421)
    - locked <0x00007fef29ad9588> (a java.lang.Integer)
    at 
org.apache.hadoop.hbase.regionserver.HRegion.update(HRegion.java:1676)
    at 
org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1439)
    at 
org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1378)
    at 
org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdates(HRegionServer.java:1184)
    at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:622)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)


Here is the trace from the accompanying DataStreamer:


"DataStreamer for file 
/hbase/log_72.34.249.212_1225407466779_60020/hlog.dat.1227075571390 
block blk_-7436808403424765554_553837" daemon prio=10 
tid=0x0000000001c84c00 nid=0x7125 in Object.wait() 
[0x00000000409b3000..0x00000000409b3d70]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at org.apache.hadoop.ipc.Client.call(Client.java:709)
    - locked <0x00007fef39520bb8> (a org.apache.hadoop.ipc.Client$Call)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at org.apache.hadoop.dfs.$Proxy4.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343)
    at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:288)
    at 
org.apache.hadoop.dfs.DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:139)
    at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2185)
    at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
    at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)
    - locked <0x00007fef38ecc138> (a java.util.LinkedList)

Repeated thread dumps show that we never move on from this state.

Looking at where we're stuck in DataStreamer: while still holding the lock on the LinkedList (<0x00007fef38ecc138>) that the IPC handler above is waiting for, it is down in processDatanodeError trying to set up an RPC proxy to the 'least' datanode as its new 'primary':

2178         // Tell the primary datanode to do error recovery
2179         // by stamping appropriate generation stamps.
2180         //
2181         Block newBlock = null;
2182         ClientDatanodeProtocol primary =  null;
2183         try {
2184           // Pick the "least" datanode as the primary datanode to avoid deadlock.
2185           primary = createClientDatanodeProtocolProxy(
2186               Collections.min(Arrays.asList(newnodes)), conf);
2187           newBlock = primary.recoverBlock(block, newnodes);
2188         } catch (IOException e) {


If I read this right, the problem is that since the RPC doesn't time out, we never return from createClientDatanodeProtocolProxy (at the time, a machine had vanished from the network).

Thanks for any insight.
St.Ack


Re: Hung in DFSClient$DFSOutputStream.writeChunk

Posted by stack <st...@duboce.net>.
Trying to get more data on the issue reported below, I tripped over the following case where a datanode died and the DFSClient tries to transition to getting the wanted block from another. The transition attempt is unsuccessful, but what is odd is that we never proceed to the datanode carrying the 3rd replica.
Here's the log:

        at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
        at java.io.DataInputStream.readFully(Unknown Source)
        at java.io.DataInputStream.readLong(Unknown Source)
        at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2080)
2008-11-21 06:15:04,691 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 bad datanode[0] XX.XX.44.56:50010
2008-11-21 06:15:04,692 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 in pipeline XX.XX.44.56:50010, 208.76.45.128:50010, 208.76.45.180:50010: bad datanode 208.76.44.56:50010
2008-11-21 06:26:40,720 INFO org.apache.hadoop.hbase.regionserver.LogRoller: Rolling hlog. Number of entries: 1011
2008-11-21 06:31:20,290 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memcache flush for region enwikilive,0sQnTVIXu7rSkOmC2jvzKV==,1224021040059. Current region memcache size 2.1k
2008-11-21 16:36:40,840 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 failed  because recovery from primary datanode XX.XX.45.128:50010 failed 1 times. Will retry...
2008-11-21 16:36:40,861 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 bad datanode[0] XX.XX.44.56:50010
2008-11-21 16:36:40,861 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 in pipeline XX.XX.44.56:50010, 208.76.45.128:50010, 208.76.45.180:50010: bad datanode 208.76.44.56:50010
2008-11-21 16:36:41,870 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 0 time(s).
2008-11-21 16:36:42,890 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 1 time(s).
2008-11-21 16:36:43,910 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 2 time(s).
2008-11-21 16:36:44,920 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 3 time(s).
2008-11-21 16:36:45,930 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 4 time(s).
2008-11-21 16:36:46,950 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 5 time(s).
2008-11-21 16:36:47,960 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 6 time(s).
2008-11-21 16:36:48,970 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 7 time(s).
2008-11-21 16:36:49,990 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 8 time(s).
2008-11-21 16:36:51,010 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 9 time(s).
2008-11-21 16:36:51,011 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 failed  because recovery from primary datanode XX.XX.45.128:50010 failed 2 times. Will retry...
2008-11-21 16:36:52,030 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 bad datanode[0] XX.XX.44.56:50010
2008-11-21 16:36:52,030 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 in pipeline XX.XX.44.56:50010, 208.76.45.128:50010, 208.76.45.180:50010: bad datanode 208.76.44.56:50010
2008-11-21 16:36:53,050 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 0 time(s).
2008-11-21 16:36:54,070 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 1 time(s).
2008-11-21 16:36:55,080 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 2 time(s).
2008-11-21 16:36:56,090 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 3 time(s).
2008-11-21 16:36:57,100 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 4 time(s).
2008-11-21 16:36:58,120 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 5 time(s).
2008-11-21 16:36:59,140 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 6 time(s).
2008-11-21 16:37:00,150 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 7 time(s).
2008-11-21 16:37:01,170 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 8 time(s).
2008-11-21 16:37:02,180 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 9 time(s).
2008-11-21 16:37:02,181 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 failed  because recovery from primary datanode XX.XX.45.128:50010 failed 3 times. Will retry...
2008-11-21 16:37:03,210 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 bad datanode[0] XX.XX.44.56:50010
2008-11-21 16:37:03,211 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 in pipeline XX.XX.44.56:50010, 208.76.45.128:50010, 208.76.45.180:50010: bad datanode 208.76.44.56:50010
2008-11-21 16:37:04,220 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 0 time(s).
2008-11-21 16:37:05,230 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 1 time(s).
2008-11-21 16:37:06,240 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 2 time(s).
2008-11-21 16:37:07,250 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 3 time(s).
2008-11-21 16:37:08,270 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 4 time(s).
2008-11-21 16:37:09,290 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 5 time(s).
2008-11-21 16:37:10,310 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 6 time(s).
2008-11-21 16:37:11,320 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 7 time(s).
2008-11-21 16:37:12,340 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 8 time(s).
2008-11-21 16:37:13,350 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 9 time(s).
2008-11-21 16:37:13,351 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 failed  because recovery from primary datanode XX.XX.45.128:50010 failed 4 times. Will retry...
2008-11-21 16:37:14,360 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 bad datanode[0] XX.XX.44.56:50010
2008-11-21 16:37:14,361 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 in pipeline XX.XX.44.56:50010, 208.76.45.128:50010, 208.76.45.180:50010: bad datanode 208.76.44.56:50010
2008-11-21 16:37:15,370 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 0 time(s).
2008-11-21 16:37:16,380 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 1 time(s).
2008-11-21 16:37:17,390 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 2 time(s).
2008-11-21 16:37:18,400 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 3 time(s).
2008-11-21 16:37:19,420 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 4 time(s).
2008-11-21 16:37:20,430 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 5 time(s).
2008-11-21 16:37:21,440 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 6 time(s).
2008-11-21 16:37:22,450 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 7 time(s).
2008-11-21 16:37:23,470 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 8 time(s).
2008-11-21 16:37:24,480 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 9 time(s).
2008-11-21 16:37:24,481 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 failed  because recovery from primary datanode XX.XX.45.128:50010 failed 5 times. Will retry...
2008-11-21 16:37:25,490 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 bad datanode[0] XX.XX.44.56:50010
2008-11-21 16:37:25,490 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 in pipeline XX.XX.44.56:50010, 208.76.45.128:50010, 208.76.45.180:50010: bad datanode 208.76.44.56:50010
2008-11-21 16:37:26,500 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 0 time(s).
2008-11-21 16:37:27,510 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 1 time(s).
2008-11-21 16:37:28,520 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 2 time(s).
2008-11-21 16:37:29,530 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 3 time(s).
2008-11-21 16:37:30,540 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 4 time(s).
2008-11-21 16:37:31,550 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 5 time(s).
2008-11-21 16:37:32,560 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 6 time(s).
2008-11-21 16:37:33,570 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 7 time(s).
2008-11-21 16:37:34,580 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 8 time(s).
2008-11-21 16:37:35,590 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /XX.XX.45.128:50020. Already tried 9 time(s).
2008-11-21 16:37:35,591 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_8300709359495898650_29813788 failed  because recovery from primary datanode XX.XX.45.128:50010 failed 6 times. Aborting...
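
What I would have expected here, rather than the client hammering XX.XX.45.128 for six rounds and then aborting, is for it to move its recovery attempts on to the remaining pipeline member (208.76.45.180). A rough sketch of that behaviour follows; the names are hypothetical and this is not what 0.18.2's processDatanodeError actually does:

import java.io.IOException;
import java.util.List;

// Hypothetical sketch: rotate the recovery "primary" through the surviving
// pipeline members instead of retrying a single datanode until we abort.
public class RecoveryFallback {

  /** Stand-in for a per-datanode recovery attempt (e.g. recoverBlock). */
  public interface RecoveryCall {
    boolean recover(String datanode) throws IOException;
  }

  public static boolean recoverWithFallback(List<String> pipeline,
      RecoveryCall call, int attemptsPerNode) {
    for (String dn : pipeline) {          // try each surviving replica holder in turn
      for (int i = 0; i < attemptsPerNode; i++) {
        try {
          if (call.recover(dn)) {
            return true;                  // recovery succeeded on this datanode
          }
        } catch (IOException e) {
          // retry this datanode a few times, then fall through to the next one
        }
      }
    }
    return false;                         // every candidate exhausted; caller aborts
  }
}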

St.Ack

...

On Thu, Nov 20, 2008 at 1:51 PM, stack <st...@duboce.net> wrote:

> stack wrote:
>
>> Over in hbase-space, we trigger a hang in DFSOutputStream.writeChunk.
>>  Input appreciated.
>>
>
> Pardon me.  The above should have read, "...we sometimes trigger....".
>
> The below stack traces are from hadoop-0.18.2.
>
> Other ill-documented instances of the hang can be found over in HBASE-667.
>
> Thanks,
> St.Ack
>
>> [...]
>

Re: Hung in DFSClient$DFSOutputStream.writeChunk

Posted by stack <st...@duboce.net>.
stack wrote:
> Over in hbase-space, we trigger a hang in DFSOutputStream.writeChunk.  
> Input appreciated.

Pardon me.  The above should have read, "...we sometimes trigger....".

The below stack traces are from hadoop-0.18.2.

Other ill-documented instances of the hang can be found over in HBASE-667.

Thanks,
St.Ack


>
> [...]