Posted to user@hbase.apache.org by jt...@ina.fr on 2009/03/04 18:18:10 UTC

Data lost during intensive writes

Hello,

I have been testing HBase for several weeks.
My test cluster is made of 6 low-cost machines (Dell Studio Hybrid, Core 2 Duo 2 GHz, 4 GB RAM, 320 GB HDD).

My configuration files:

hadoop-site.xml :

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/hadoop-tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/hadoop-dfs/data</value>
  <description>Determines where on the local filesystem an DFS data node
  should store its blocks.  If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices.
  Directories that do not exist are ignored.
  </description>
</property>


<property>
  <name>fs.default.name</name>
  <value>hdfs://hephaistos:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
 
<property>
  <name>mapred.job.tracker</name>
  <value>hephaistos:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
 
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

<property>
  <name>dfs.block.size</name>
  <value>8388608</value>
  <description>The hbase standard size for new files.</description>
<!--<value>67108864</value>-->
<!--<description>The default block size for new files.</description>-->
</property>

<property>
   <name>dfs.datanode.max.xcievers</name>
   <value>8192</value>
   <description>Up xcievers (see HADOOP-3831)</description>
</property>
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>
  <description> Specifies the maximum bandwidth that each datanode can utilize for the
   balancing purpose in term of the number of bytes per second. Default is 1048576</description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/home/hadoop/hadoop-mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files.  May be a comma-separated list of
  directories on different devices in order to spread disk i/o.
  Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>home/hadoop/hadoop-mapred/system</value>
  <description>The shared directory where MapReduce stores control files.
  </description>
</property>

<property>
  <name>mapred.temp.dir</name>
  <value>home/hadoop/hadoop-mapred/temp</value>
  <description>A shared directory for temporary files.
  </description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".  
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>5</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>
</configuration>

hbase-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/**
 * Copyright 2007 The Apache Software Foundation
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-->
<configuration>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hephaistos:54310/hbase</value>
    <description>The directory shared by region servers.
    </description>
  </property>
  <property>
    <name>hbase.master</name>
    <value>hephaistos:60000</value>
    <description>The host and port that the HBase master runs at.
    </description>
  </property>
  <property>
    <name>hbase.hregion.memcache.flush.size</name>
    <value>67108864</value>
    <description>
    A HRegion memcache will be flushed to disk if size of the memcache
    exceeds this number of bytes.  Value is checked by a thread that runs
    every hbase.server.thread.wakefrequency.  
    </description>
  </property>  
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>268435456</value>
    <description>
    Maximum HStoreFile size. If any one of a column families' HStoreFiles has
    grown to exceed this value, the hosting HRegion is split in two.
    Default: 256M.
    </description>
  </property>
  <property>
    <name>hbase.io.index.interval</name>
    <value>128</value>
    <description>The interval at which we record offsets in hbase
    store files/mapfiles.  Default for stock mapfiles is 128.  Index
    files are read into memory.  If there are many of them, could prove
    a burden.  If so play with the hadoop io.map.index.skip property and
    skip every nth index member when reading back the index into memory.
    Downside to high index interval is lowered access times.
    </description>
  </property>  
  <property>
    <name>hbase.hstore.blockCache.blockSize</name>
    <value>65536</value>
    <description>The size of each block in the block cache.
    Enable blockcaching on a per column family basis; see the BLOCKCACHE setting
    in HColumnDescriptor.  Blocks are kept in a java Soft Reference cache so are
    let go when high pressure on memory.  Block caching is not enabled by default.
    Default is 16384.
    </description>
  </property>
  <property>
    <name>hbase.regionserver.lease.period</name>
    <value>240000</value>
    <description>HRegion server lease period in milliseconds. Default is
    60 seconds. Clients must report in within this period else they are
    considered dead.</description>
  </property>  
</configuration>

My main application of HBase is to build access indexes for a web archive.
My test archive contains about 160 million objects that I insert into an HBase instance.
Each row contains about a thousand bytes.

During these batch insertions I see some exceptions related to DataXceiver:

Case 1:

On HBase Regionserver:

2009-02-27 04:23:52,185 INFO org.apache.hadoop.hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/hbase/metadata_table/compaction.dir/1476318467/content/mapfiles/260278331337921598/data
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1256)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
	at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)

	at org.apache.hadoop.ipc.Client.call(Client.java:696)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
	at $Proxy1.addBlock(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at $Proxy1.addBlock(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)


On Hadoop Datanode:

2009-02-27 04:22:58,110 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):Got exception while serving blk_5465578316105624003_26301 to /10.1.188.249:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010 remote=/10.1.188.249:48326]
	at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
	at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
	at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
	at java.lang.Thread.run(Thread.java:619)

2009-02-27 04:22:58,110 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010 remote=/10.1.188.249:48326]
	at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
	at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
	at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
	at java.lang.Thread.run(Thread.java:619)

Case 2:

HBase Regionserver:

2009-03-02 09:55:11,929 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_-6496095407839777264_96895java.io.IOException: Bad response 1 for block blk_-6496095407839777264_96895 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6496095407839777264_96895 bad datanode[1] 10.1.188.182:50010
2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6496095407839777264_96895 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_-7585241287138805906_96914java.io.IOException: Bad response 1 for block blk_-7585241287138805906_96914 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-7585241287138805906_96914 bad datanode[1] 10.1.188.182:50010
2009-03-02 09:55:14,363 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-7585241287138805906_96914 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.141:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:14,445 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_8693483996243654850_96912java.io.IOException: Bad response 1 for block blk_8693483996243654850_96912 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8693483996243654850_96912 bad datanode[1] 10.1.188.182:50010
2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8693483996243654850_96912 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:14,923 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_-8939308025013258259_96931java.io.IOException: Bad response 1 for block blk_-8939308025013258259_96931 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8939308025013258259_96931 bad datanode[1] 10.1.188.182:50010
2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8939308025013258259_96931 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_7417692418733608681_96934java.io.IOException: Bad response 1 for block blk_7417692418733608681_96934 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_7417692418733608681_96934 bad datanode[2] 10.1.188.182:50010
2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_7417692418733608681_96934 in pipeline 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_6777180223564108728_96939java.io.IOException: Bad response 1 for block blk_6777180223564108728_96939 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_6777180223564108728_96939 bad datanode[1] 10.1.188.182:50010
2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_6777180223564108728_96939 in pipeline 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_-6352908575431276531_96948java.io.IOException: Bad response 1 for block blk_-6352908575431276531_96948 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6352908575431276531_96948 bad datanode[2] 10.1.188.182:50010
2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6352908575431276531_96948 in pipeline 10.1.188.249:50010, 10.1.188.30:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:15,988 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_SPLIT: metadata_table,r:http://com.over-blog.www/_cdata/img/footer_mid.gif@20070505132942-20070505132942,1235761772185
2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_-1071965721931053111_96956java.io.IOException: Bad response 1 for block blk_-1071965721931053111_96956 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-1071965721931053111_96956 bad datanode[2] 10.1.188.182:50010
2009-03-02 09:55:16,009 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-1071965721931053111_96956 in pipeline 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad datanode 10.1.188.182:50010
2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_1004039574836775403_96959java.io.IOException: Bad response 1 for block blk_1004039574836775403_96959 from datanode 10.1.188.182:50010
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)

2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1004039574836775403_96959 bad datanode[1] 10.1.188.182:50010


Hadoop datanode:

2009-03-02 09:55:10,201 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-5472632607337755080_96875 1 Exception java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at java.io.DataInputStream.readLong(DataInputStream.java:399)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:833)
	at java.lang.Thread.run(Thread.java:619)

2009-03-02 09:55:10,407 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-5472632607337755080_96875 terminating
2009-03-02 09:55:10,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):Exception writing block blk_-5472632607337755080_96875 to mirror 10.1.188.182:50010
java.io.IOException: Broken pipe
	at sun.nio.ch.FileDispatcher.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
	at sun.nio.ch.IOUtil.write(IOUtil.java:75)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
	at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
	at java.io.DataOutputStream.write(DataOutputStream.java:90)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
	at java.lang.Thread.run(Thread.java:619)

2009-03-02 09:55:10,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-5472632607337755080_96875 java.io.IOException: Broken pipe
2009-03-02 09:55:10,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-5472632607337755080_96875 received exception java.io.IOException: Broken pipe
2009-03-02 09:55:10,517 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Broken pipe
	at sun.nio.ch.FileDispatcher.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
	at sun.nio.ch.IOUtil.write(IOUtil.java:75)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
	at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
	at java.io.DataOutputStream.write(DataOutputStream.java:90)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
	at java.lang.Thread.run(Thread.java:619)
2009-03-02 09:55:11,174 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.1.188.249:49063, dest: /10.1.188.249:50010, bytes: 312, op: HDFS_WRITE, cliID: DFSClient_1091437257, srvID: DS-1180278657-127.0.0.1-50010-1235652659245, blockid: blk_5027345212081735473_96878
2009-03-02 09:55:11,177 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_5027345212081735473_96878 terminating
2009-03-02 09:55:11,185 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-3992843464553216223_96885 src: /10.1.188.249:49069 dest: /10.1.188.249:50010
2009-03-02 09:55:11,186 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-3132070329589136987_96885 src: /10.1.188.30:33316 dest: /10.1.188.249:50010
2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_8782629414415941143_96845 java.io.IOException: Connection reset by peer
2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_8782629414415941143_96845 Interrupted.
2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_8782629414415941143_96845 terminating
2009-03-02 09:55:11,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_8782629414415941143_96845 received exception java.io.IOException: Connection reset by peer
2009-03-02 09:55:11,187 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcher.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
	at sun.nio.ch.IOUtil.read(IOUtil.java:206)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
	at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
	at java.io.DataInputStream.read(DataInputStream.java:132)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:251)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
	at java.lang.Thread.run(Thread.java:619)
        etc.............................

I have other exceptions related to DataXceiver problems. These errors don't bring the region server down, but I can see that I lose some records (about 3 million out of 160 million).

As you can see in my conf files, I raised dfs.datanode.max.xcievers to 8192 as suggested in several mails.
And my ulimit -n is 32768.

Do these problems come from my configuration or from my hardware?

Jérôme Thièvre





Re: Data lost during intensive writes

Posted by jt...@ina.fr.
I set the Hadoop log level to DEBUG.
This exception occurs even with few active connections (5 here), so it can't be a problem with the number of Xceiver instances.
Does somebody have an idea of the problem?

Each exception creates a dead socket of this type:

netstat info:

Proto  Recv-Q  Send-Q   Local Address      Foreign Address    State        User    Inode    PID/Program name  Timer
tcp    0       121395   aphrodite:50010    aphrodite:42858    FIN_WAIT1    root    0        -                 probe (55.17/0/0)
tcp    72729   0        aphrodite:42858    aphrodite:50010    ESTABLISHED  hadoop  5888205  13471/java        off (0.00/0/0)

These sockets are not closed until I stop HBase.

Jérôme Thièvre



2009-03-05 23:30:41,848 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-482125953-10.1.188.249-50010-1236075545212, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010 remote=/10.1.188.141:38072]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
        at java.lang.Thread.run(Thread.java:619)
2009-03-05 23:30:41,848 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-482125953-10.1.188.249-50010-1236075545212, infoPort=50075, ipcPort=50020):Number of active connections is: 5
--
2009-03-05 23:41:22,264 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-482125953-10.1.188.249-50010-1236075545212, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010 remote=/10.1.188.141:47006]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
        at java.lang.Thread.run(Thread.java:619)
2009-03-05 23:41:22,264 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-482125953-10.1.188.249-50010-1236075545212, infoPort=50075, ipcPort=50020):Number of active connections is: 4
--
2009-03-05 23:52:55,908 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-482125953-10.1.188.249-50010-1236075545212, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010 remote=/10.1.188.141:40436]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
        at java.lang.Thread.run(Thread.java:619)
2009-03-05 23:52:55,908 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.188.249:50010, storageID=DS-482125953-10.1.188.249-50010-1236075545212, infoPort=50075, ipcPort=50020):Number of active connections is: 6


----- Original message -----
From: jthievre@ina.fr
Date: Wednesday, March 4, 2009 6:18 pm
Subject: Data lost during intensive writes


Re: Data lost during intensive writes

Posted by schubert zhang <zs...@gmail.com>.
Thank you very much Andy. Yes, it is really a difficult issue.
Schubert

On Fri, Mar 27, 2009 at 1:13 AM, Andrew Purtell <ap...@apache.org> wrote:

>
> Hi Schubert,
>
> I set dfs.datanode.max.xcievers=4096 in my config. This was the
> only way I was able to bring > 7000 regions online on 25 nodes
> during cluster restart without DFS errors. Definitely the
> default is too low for HBase. HFile in 0.20 will have material
> impact here, which should help the situation. Also perhaps more
> can/will be done with regards to HBASE-24 to relieve the load on
> the DataNodes:
>
>
> https://issues.apache.org/jira/browse/HBASE-24?focusedCommentId=12613104&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12613104
>
> The root cause of this is HADOOP-3846:
> https://issues.apache.org/jira/browse/HADOOP-3856
>
> I looked at helping out on this issue. There is so much
> reimplementation of such a fundamental component (to Hadoop)
> involved that it's difficult for a part-time volunteer to make
> progress on it. Even if the code can be changed, there is
> follow up shepherding through Core review and release processes
> to consider... I hold out hope that a commercial user of Hadoop
> will have pain in this area and commit sponsored resources to
> address the issue of I/O scalability in DFS. I think when DFS
> was written the expectation was that 10,000 nodes would have
> only a few open files each -- very large mapreduce inputs,
> intermediates, and outputs -- not that 100s of nodes might
> have 1,000s of files open each. In any case, the issue is well
> known.
>
> I have found "dfs.datanode.socket.write.timeout=0" is not
> necessary for HBase 0.19.1 on Hadoop 0.19.1 in my testing.
>
> Best regards,
>
>   -Andy
>
>
> > From: schubert zhang <zs...@gmail.com>
> > Subject: Re: Data lost during intensive writes
> > To: hbase-user@hadoop.apache.org, apurtell@apache.org
> > Date: Thursday, March 26, 2009, 4:58 AM
> >
> > I will set "dfs.datanode.max.xcievers=1024" (default is 256)
> >
> > I am using branch-0.19.
> > Do you think "dfs.datanode.socket.write.timeout=0" is
> > necessary in release-0.19?
> >
> > Schubert
>
>
>
>
>

Re: Data lost during intensive writes

Posted by Andrew Purtell <ap...@apache.org>.
Hi Schubert,

I set dfs.datanode.max.xcievers=4096 in my config. This was the
only way I was able to bring > 7000 regions online on 25 nodes
during cluster restart without DFS errors. Definitely the
default is too low for HBase. HFile in 0.20 will have material
impact here, which should help the situation. Also perhaps more
can/will be done with regards to HBASE-24 to relieve the load on
the DataNodes:

    https://issues.apache.org/jira/browse/HBASE-24?focusedCommentId=12613104&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12613104
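
For concreteness, that is just an ordinary property in each datanode's hadoop-site.xml, set the same way as the properties in Jérôme's config above; the value and description below are only illustrative, and the datanodes must be restarted to pick it up:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
  <description>Upper bound on the number of DataXceiver threads a
  datanode will run for concurrent block reads and writes. The stock
  default (256) is too low for HBase.</description>
</property>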

The root cause of this is HADOOP-3846: https://issues.apache.org/jira/browse/HADOOP-3856

I looked at helping out on this issue. There is so much 
reimplementation of such a fundamental component (to Hadoop)
involved that it's difficult for a part-time volunteer to make
progress on it. Even if the code can be changed, there is 
follow up shepherding through Core review and release processes
to consider... I hold out hope that a commercial user of Hadoop
will have pain in this area and commit sponsored resources to
address the issue of I/O scalability in DFS. I think when DFS
was written the expectation was that 10,000 nodes would have 
only a few open files each -- very large mapreduce inputs,
intermediates, and outputs -- not that 100s of nodes might
have 1,000s of files open each. In any case, the issue is well
known. 

I have found "dfs.datanode.socket.write.timeout=0" is not
necessary for HBase 0.19.1 on Hadoop 0.19.1 in my testing. 
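
If you do decide to try it, it is set like any other property; the key point (as stack notes elsewhere in this thread) is that the DFSClient running inside the regionserver has to see it, so either add it to hbase-site.xml or make sure your hadoop-site.xml is on the HBase classpath. A sketch, value and description illustrative only:

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
  <description>0 disables the DFS socket write timeout (the 480000 ms
  timeout seen in the datanode logs above).</description>
</property>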

Best regards,

   -Andy


> From: schubert zhang <zs...@gmail.com>
> Subject: Re: Data lost during intensive writes
> To: hbase-user@hadoop.apache.org, apurtell@apache.org
> Date: Thursday, March 26, 2009, 4:58 AM
>
> I will set "dfs.datanode.max.xcievers=1024" (default is 256)
> 
> I am using branch-0.19.
> Do you think "dfs.datanode.socket.write.timeout=0" is
> necessary in release-0.19?
> 
> Schubert



      

Re: Data lost during intensive writes

Posted by schubert zhang <zs...@gmail.com>.
Thanks, Andrew. I will set "dfs.datanode.max.xcievers=1024" (default is 256).

I am using branch-0.19.
Do you think "dfs.datanode.socket.write.timeout=0" is necessary in
release-0.19?

Schubert


On Thu, Mar 26, 2009 at 7:57 AM, Andrew Purtell <ap...@apache.org> wrote:

>
> You may need to increase the maximum number of xceivers allowed
> on each of your datanodes.
>
> Best regards,
>
>   - Andy
>
> > From: schubert zhang <zs...@gmail.com>
> > Subject: Re: Data lost during intensive writes
> > To: hbase-user@hadoop.apache.org
> > Date: Wednesday, March 25, 2009, 2:01 AM
> > Hi all,
> > I also meet the same problems/exceptions.
> > I also have a 5+1 machine setup, and the system has been running
> > for about 4 days,
> > and there are 512 regions now. But the two
> > exceptions start to happen earlier.
> >
> > hadoop-0.19
> > hbase-0.19.1 (with patch
> > https://issues.apache.org/jira/browse/HBASE-1008).
> >
> > I want to try to set dfs.datanode.socket.write.timeout=0
> > and watch it later.
> >
> > Schubert
> >
> > On Sat, Mar 7, 2009 at 3:15 AM, stack
> > <st...@duboce.net> wrote:
> >
> > > On Wed, Mar 4, 2009 at 9:18 AM,
> > <jt...@ina.fr> wrote:
> > >
> > > > <property>
> > > >  <name>dfs.replication</name>
> > > >  <value>2</value>
> > > >  <description>Default block replication.
> > > >  The actual number of replications can be
> > specified when the file is
> > > > created.
> > > >  The default is used if replication is not
> > specified in create time.
> > > >  </description>
> > > > </property>
> > > >
> > > > <property>
> > > >  <name>dfs.block.size</name>
> > > >  <value>8388608</value>
> > > >  <description>The hbase standard size for
> > new files.</description>
> > > > <!--<value>67108864</value>-->
> > > > <!--<description>The default block size
> > for new files.</description>-->
> > > > </property>
> > > >
> > >
> > >
> > > The above are non-standard.  A replication of 3 might
> > lessen the incidence
> > > of HDFS errors seen since there will be another
> > replica to go to.   Why
> > > non-standard block size?
> > >
> > > I did not see *dfs.datanode.socket.write.timeout* set
> > to 0.  Is that
> > > because
> > > you are running w/ 0.19.0?  You might try with it
> > especially because in the
> > > below I see complaint about the timeout (but more
> > below on this).
> > >
> > >
> > >
> > > >  <property>
> > > >
> > <name>hbase.hstore.blockCache.blockSize</name>
> > > >    <value>65536</value>
> > > >    <description>The size of each block in
> > the block cache.
> > > >    Enable blockcaching on a per column family
> > basis; see the BLOCKCACHE
> > > > setting
> > > >    in HColumnDescriptor.  Blocks are kept in a
> > java Soft Reference cache
> > > so
> > > > are
> > > >    let go when high pressure on memory.  Block
> > caching is not enabled by
> > > > default.
> > > >    Default is 16384.
> > > >    </description>
> > > >  </property>
> > > >
> > >
> > >
> > > Are you using blockcaching?  If so, 64k was
> > problematic in my testing
> > > (OOMEing).
> > >
> > >
> > >
> > >
> > > > Case 1:
> > > >
> > > > On HBase Regionserver:
> > > >
> > > > 2009-02-27 04:23:52,185 INFO
> > org.apache.hadoop.hdfs.DFSClient:
> > > > org.apache.hadoop.ipc.RemoteException:
> > > >
> > org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException:
> > Not
> > > > replicated
> > > >
> > >
> >
> yet:/hbase/metadata_table/compaction.dir/1476318467/content/mapfiles/260278331337921598/data
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1256)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
> > > >        at
> > sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
> > > >        at
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >        at
> > java.lang.reflect.Method.invoke(Method.java:597)
> > > >        at
> > org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
> > > >        at
> > org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
> > > >
> > > >        at
> > org.apache.hadoop.ipc.Client.call(Client.java:696)
> > > >        at
> > org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
> > > >        at $Proxy1.addBlock(Unknown Source)
> > > >        at
> > sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
> > > >        at
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >        at
> > java.lang.reflect.Method.invoke(Method.java:597)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> > > >        at $Proxy1.addBlock(Unknown Source)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> > > >
> > > >
> > > > On Hadoop Datanode:
> > > >
> > > > 2009-02-27 04:22:58,110 WARN
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(
> > > > 10.1.188.249:50010,
> > > storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > > infoPort=50075, ipcPort=50020):Got exception
> > while serving
> > > > blk_5465578316105624003_26301 to /10.1.188.249:
> > > > java.net.SocketTimeoutException: 480000 millis
> > timeout while waiting for
> > > > channel to be ready for write. ch :
> > > > java.nio.channels.SocketChannel[connected
> > local=/10.1.188.249:50010
> > > remote=/
> > > > 10.1.188.249:48326]
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
> > > >        at java.lang.Thread.run(Thread.java:619)
> > > >
> > > > 2009-02-27 04:22:58,110 ERROR
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(
> > > > 10.1.188.249:50010,
> > > storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > > infoPort=50075, ipcPort=50020):DataXceiver
> > > > java.net.SocketTimeoutException: 480000 millis
> > timeout while waiting for
> > > > channel to be ready for write. ch :
> > > > java.nio.channels.SocketChannel[connected
> > local=/10.1.188.249:50010
> > > remote=/
> > > > 10.1.188.249:48326]
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
> > > >        at java.lang.Thread.run(Thread.java:619)
> > >
> > >
> > > Are you sure the regionserver error matches the
> > datanode error?
> > >
> > > My understanding is that in 0.19.0, DFSClient in
> > regionserver is supposed
> > > to
> > > reestablish timed-out connections.  If that is not
> > happening in your case
> > > --
> > > and we've speculated some that there might be holes
> > in this mechanism -- try
> > > with timeout set to zero (see citation above; be sure
> > the configuration can
> > > be seen by the DFSClient running in hbase by either
> > adding to
> > > hbase-site.xml
> > > or somehow get the hadoop-site.xml into hbase
> > CLASSPATH
> > > (hbase-env.sh#HBASE_CLASSPATH or with a symlink into
> > the HBASE_HOME/conf
> > > dir).
> > >
> > >
> > >
> > > > Case 2:
> > > >
> > > > HBase Regionserver:
> > > >
> > > > 2009-03-02 09:55:11,929 WARN
> > org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception  for
> > block
> > > >
> > blk_-6496095407839777264_96895java.io.IOException: Bad
> > response 1 for
> > > block
> > > > blk_-6496095407839777264_96895 from datanode
> > 10.1.188.182:50010
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:11,930 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_-6496095407839777264_96895
> > bad datanode[1]
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:11,930 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_-6496095407839777264_96895
> > in pipeline
> > > > 10.1.188.249:50010, 10.1.188.182:50010,
> > 10.1.188.203:50010: bad datanode
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,362 WARN
> > org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception  for
> > block
> > > >
> > blk_-7585241287138805906_96914java.io.IOException: Bad
> > response 1 for
> > > block
> > > > blk_-7585241287138805906_96914 from datanode
> > 10.1.188.182:50010
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:14,362 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_-7585241287138805906_96914
> > bad datanode[1]
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,363 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_-7585241287138805906_96914
> > in pipeline
> > > > 10.1.188.249:50010, 10.1.188.182:50010,
> > 10.1.188.141:50010: bad datanode
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,445 WARN
> > org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception  for
> > block
> > > > blk_8693483996243654850_96912java.io.IOException:
> > Bad response 1 for
> > > block
> > > > blk_8693483996243654850_96912 from datanode
> > 10.1.188.182:50010
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:14,446 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_8693483996243654850_96912
> > bad datanode[1]
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,446 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_8693483996243654850_96912
> > in pipeline
> > > > 10.1.188.249:50010, 10.1.188.182:50010,
> > 10.1.188.203:50010: bad datanode
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,923 WARN
> > org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception  for
> > block
> > > >
> > blk_-8939308025013258259_96931java.io.IOException: Bad
> > response 1 for
> > > block
> > > > blk_-8939308025013258259_96931 from datanode
> > 10.1.188.182:50010
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:14,935 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_-8939308025013258259_96931
> > bad datanode[1]
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:14,935 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_-8939308025013258259_96931
> > in pipeline
> > > > 10.1.188.249:50010, 10.1.188.182:50010,
> > 10.1.188.203:50010: bad datanode
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,344 WARN
> > org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception  for
> > block
> > > > blk_7417692418733608681_96934java.io.IOException:
> > Bad response 1 for
> > > block
> > > > blk_7417692418733608681_96934 from datanode
> > 10.1.188.182:50010
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:15,344 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_7417692418733608681_96934
> > bad datanode[2]
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,344 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_7417692418733608681_96934
> > in pipeline
> > > > 10.1.188.249:50010, 10.1.188.203:50010,
> > 10.1.188.182:50010: bad datanode
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,579 WARN
> > org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception  for
> > block
> > > > blk_6777180223564108728_96939java.io.IOException:
> > Bad response 1 for
> > > block
> > > > blk_6777180223564108728_96939 from datanode
> > 10.1.188.182:50010
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:15,579 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_6777180223564108728_96939
> > bad datanode[1]
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,579 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_6777180223564108728_96939
> > in pipeline
> > > > 10.1.188.249:50010, 10.1.188.182:50010,
> > 10.1.188.203:50010: bad datanode
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,930 WARN
> > org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception  for
> > block
> > > >
> > blk_-6352908575431276531_96948java.io.IOException: Bad
> > response 1 for
> > > block
> > > > blk_-6352908575431276531_96948 from datanode
> > 10.1.188.182:50010
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:15,930 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_-6352908575431276531_96948
> > bad datanode[2]
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,930 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_-6352908575431276531_96948
> > in pipeline
> > > > 10.1.188.249:50010, 10.1.188.30:50010,
> > 10.1.188.182:50010: bad datanode
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:15,988 INFO
> > > >
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Worker:
> > > > MSG_REGION_SPLIT: metadata_table,r:
> > > >
> > >
> >
> http://com.over-blog.www/_cdata/img/footer_mid.gif@20070505132942-20070505132942,1235761772185
> > > > 2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream
> > > > ResponseProcessor exception  for block
> > > >
> > blk_-1071965721931053111_96956java.io.IOException: Bad
> > response 1 for
> > > block
> > > > blk_-1071965721931053111_96956 from datanode
> > 10.1.188.182:50010
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:16,008 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_-1071965721931053111_96956
> > bad datanode[2]
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:16,009 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_-1071965721931053111_96956
> > in pipeline
> > > > 10.1.188.249:50010, 10.1.188.203:50010,
> > 10.1.188.182:50010: bad datanode
> > > > 10.1.188.182:50010
> > > > 2009-03-02 09:55:16,073 WARN
> > org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception  for
> > block
> > > > blk_1004039574836775403_96959java.io.IOException:
> > Bad response 1 for
> > > block
> > > > blk_1004039574836775403_96959 from datanode
> > 10.1.188.182:50010
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > > >
> > > > 2009-03-02 09:55:16,073 WARN
> > org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_1004039574836775403_96959
> > bad datanode[1]
> > > > 10.1.188.182:50010
> > > >
> > > >
> > > > Hadoop datanode:
> > > >
> > > > 2009-03-02 09:55:10,201 INFO
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > PacketResponder
> > > > blk_-5472632607337755080_96875 1 Exception
> > java.io.EOFException
> > > >        at
> > java.io.DataInputStream.readFully(DataInputStream.java:180)
> > > >        at
> > java.io.DataInputStream.readLong(DataInputStream.java:399)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:833)
> > > >        at java.lang.Thread.run(Thread.java:619)
> > > >
> > > > 2009-03-02 09:55:10,407 INFO
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > PacketResponder 1 for
> > > block
> > > > blk_-5472632607337755080_96875 terminating
> > > > 2009-03-02 09:55:10,516 INFO
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(
> > > > 10.1.188.249:50010,
> > > storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > > infoPort=50075, ipcPort=50020):Exception writing
> > block
> > > > blk_-5472632607337755080_96875 to mirror
> > 10.1.188.182:50010
> > > > java.io.IOException: Broken pipe
> > > >        at sun.nio.ch.FileDispatcher.write0(Native
> > Method)
> > > >        at
> > sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
> > > >        at
> > sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
> > > >        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
> > > >        at
> > sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> > > >        at
> > >
> > java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> > > >        at
> > java.io.DataOutputStream.write(DataOutputStream.java:90)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> > > >        at java.lang.Thread.run(Thread.java:619)
> > > >
> > > > 2009-03-02 09:55:10,517 INFO
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > Exception in
> > > receiveBlock
> > > > for block blk_-5472632607337755080_96875
> > java.io.IOException: Broken pipe
> > > > 2009-03-02 09:55:10,517 INFO
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > writeBlock
> > > > blk_-5472632607337755080_96875 received exception
> > java.io.IOException:
> > > > Broken pipe
> > > > 2009-03-02 09:55:10,517 ERROR
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(
> > > > 10.1.188.249:50010,
> > > storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > > infoPort=50075, ipcPort=50020):DataXceiver
> > > > java.io.IOException: Broken pipe
> > > >        at sun.nio.ch.FileDispatcher.write0(Native
> > Method)
> > > >        at
> > sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
> > > >        at
> > sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
> > > >        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
> > > >        at
> > sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> > > >        at
> > >
> > java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> > > >        at
> > java.io.DataOutputStream.write(DataOutputStream.java:90)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> > > >        at java.lang.Thread.run(Thread.java:619)
> > > > 2009-03-02 09:55:11,174 INFO
> > > >
> > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
> > src: /
> > > > 10.1.188.249:49063, dest: /10.1.188.249:50010,
> > bytes: 312, op:
> > > HDFS_WRITE,
> > > > cliID: DFSClient_1091437257, srvID:
> > > > DS-1180278657-127.0.0.1-50010-1235652659245,
> > blockid:
> > > > blk_5027345212081735473_96878
> > > > 2009-03-02 09:55:11,177 INFO
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > PacketResponder 2 for
> > > block
> > > > blk_5027345212081735473_96878 terminating
> > > > 2009-03-02 09:55:11,185 INFO
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > Receiving block
> > > > blk_-3992843464553216223_96885 src:
> > /10.1.188.249:49069 dest: /
> > > > 10.1.188.249:50010
> > > > 2009-03-02 09:55:11,186 INFO
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > Receiving block
> > > > blk_-3132070329589136987_96885 src:
> > /10.1.188.30:33316 dest: /
> > > > 10.1.188.249:50010
> > > > 2009-03-02 09:55:11,187 INFO
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > Exception in
> > > receiveBlock
> > > > for block blk_8782629414415941143_96845
> > java.io.IOException: Connection
> > > > reset by peer
> > > > 2009-03-02 09:55:11,187 INFO
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > PacketResponder 0 for
> > > block
> > > > blk_8782629414415941143_96845 Interrupted.
> > > > 2009-03-02 09:55:11,187 INFO
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > PacketResponder 0 for
> > > block
> > > > blk_8782629414415941143_96845 terminating
> > > > 2009-03-02 09:55:11,187 INFO
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > writeBlock
> > > > blk_8782629414415941143_96845 received exception
> > java.io.IOException:
> > > > Connection reset by peer
> > > > 2009-03-02 09:55:11,187 ERROR
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(
> > > > 10.1.188.249:50010,
> > > storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > > infoPort=50075, ipcPort=50020):DataXceiver
> > > > java.io.IOException: Connection reset by peer
> > > >        at sun.nio.ch.FileDispatcher.read0(Native
> > Method)
> > > >        at
> > sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> > > >        at
> > sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
> > > >        at sun.nio.ch.IOUtil.read(IOUtil.java:206)
> > > >        at
> > sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> > > >        at
> > > >
> > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
> > > >        at
> > > >
> > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
> > > >        at
> > java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > > >        at
> > java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > > >        at
> > java.io.DataInputStream.read(DataInputStream.java:132)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:251)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> > > >        at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> > > >        at java.lang.Thread.run(Thread.java:619)
> > > >        etc.............................
> > >
> > >
> > >
> > > This looks like an HDFS issue where it won't move on past the bad
> > > server 182.  On the client side, they are reported as WARN in the
> > > dfsclient but don't make it up to the regionserver, so not much we
> > > can do about it.
> > >
> > >
> > > > I have other exceptions related to DataXceiver problems. These
> > > > errors don't make the region server go down, but I can see that I
> > > > lost some records (about 3.10e6 out of 160.10e6).
> > > >
> > >
> > >
> > > Any regionserver crashes during your upload?  I'd think this is the
> > > more likely reason for data loss; i.e. edits that were in memcache
> > > didn't make it out to the filesystem because there is still no
> > > working flush in hdfs -- hopefully in 0.21 hadoop... see
> > > HADOOP-4379.... (though your scenario 2 above looks like we could
> > > have handed hdfs the data but it dropped it anyway....)
> > >
> > >
> > >
> > > >
> > > > As you can see in my conf files, I upped dfs.datanode.max.xcievers
> > > > to 8192 as suggested in several mails.
> > > > And my ulimit -n is at 32768.
> > >
> > >
> > > Make sure the above is actually in place by looking at the head
> > > of your regionserver log on startup.
> > >
> > >
> > >
> > > > Do these problems come from my configuration, or my hardware?
> > > >
> > >
> > >
> > > Let's do some more back and forth and make sure we have done all we
> > > can as regards the software configuration.  It's probably not
> > > hardware, going by the above.
> > >
> > > Tell us more about your uploading process and your schema.  Did it
> > > all load?  If so, on your 6 servers, how many regions?  How did you
> > > verify how much was loaded?
> > >
> > > St.Ack
> > >
>
>
>
>

Re: Data lost during intensive writes

Posted by schubert zhang <zs...@gmail.com>.
I find that if I set "dfs.datanode.socket.write.timeout=0", Hadoop will always
create a new socket. Is that OK?
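
For reference, a minimal sketch of that setting as an XML property -- assuming,
as suggested further down this thread, that it goes into hbase-site.xml (or
into a hadoop-site.xml visible on the HBase CLASSPATH) so the DFSClient
running inside the regionserver actually sees it:

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
  <description>Sketch only: as I understand it, 0 disables the socket
  write timeout (the 480000 ms timeout that shows up in the logs quoted
  below), so the write side waits instead of throwing
  SocketTimeoutException.  Verify against your hadoop/hbase version.
  </description>
</property>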

On Wed, Mar 25, 2009 at 5:01 PM, schubert zhang <zs...@gmail.com> wrote:

> Hi all,
> I am also hitting the same problems/exceptions.
> I also have a 5+1 machine setup, and the system has been running for about 4 days,
> and there are 512 regions now. But the two
> exceptions started to happen earlier.
>
> hadoop-0.19
> hbase-0.19.1 (with patch
> https://issues.apache.org/jira/browse/HBASE-1008).
>
> I want to try to set dfs.datanode.socket.write.timeout=0 and watch it
> later.
>
> Schubert
>
>
> On Sat, Mar 7, 2009 at 3:15 AM, stack <st...@duboce.net> wrote:
>
>> On Wed, Mar 4, 2009 at 9:18 AM, <jt...@ina.fr> wrote:
>>
>> > <property>
>> >  <name>dfs.replication</name>
>> >  <value>2</value>
>> >  <description>Default block replication.
>> >  The actual number of replications can be specified when the file is
>> > created.
>> >  The default is used if replication is not specified in create time.
>> >  </description>
>> > </property>
>> >
>> > <property>
>> >  <name>dfs.block.size</name>
>> >  <value>8388608</value>
>> >  <description>The hbase standard size for new files.</description>
>> > <!--<value>67108864</value>-->
>> > <!--<description>The default block size for new files.</description>-->
>> > </property>
>> >
>>
>>
>> The above are non-standard.  A replication of 3 might lessen the incidence
>> of HDFS errors seen since there will be another replica to go to.   Why
>> non-standard block size?
>>
>> I did not see *dfs.datanode.socket.write.timeout* set to 0.  Is that
>> because
>> you are running w/ 0.19.0?  You might try with it especially because in
>> the
>> below I see complaint about the timeout (but more below on this).
>>
>>
>>
>> >  <property>
>> >    <name>hbase.hstore.blockCache.blockSize</name>
>> >    <value>65536</value>
>> >    <description>The size of each block in the block cache.
>> >    Enable blockcaching on a per column family basis; see the BLOCKCACHE
>> > setting
>> >    in HColumnDescriptor.  Blocks are kept in a java Soft Reference cache
>> so
>> > are
>> >    let go when high pressure on memory.  Block caching is not enabled by
>> > default.
>> >    Default is 16384.
>> >    </description>
>> >  </property>
>> >
>>
>>
>> Are you using blockcaching?  If so, 64k was problematic in my testing
>> (OOMEing).
>>
>>
>>
>>
>> > Case 1:
>> >
>> > On HBase Regionserver:
>> >
>> > 2009-02-27 04:23:52,185 INFO org.apache.hadoop.hdfs.DFSClient:
>> > org.apache.hadoop.ipc.RemoteException:
>> > org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
>> > replicated
>> >
>> yet:/hbase/metadata_table/compaction.dir/1476318467/content/mapfiles/260278331337921598/data
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1256)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
>> >        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>> >        at
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >        at java.lang.reflect.Method.invoke(Method.java:597)
>> >        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>> >        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
>> >
>> >        at org.apache.hadoop.ipc.Client.call(Client.java:696)
>> >        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>> >        at $Proxy1.addBlock(Unknown Source)
>> >        at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>> >        at
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >        at java.lang.reflect.Method.invoke(Method.java:597)
>> >        at
>> >
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>> >        at
>> >
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>> >        at $Proxy1.addBlock(Unknown Source)
>> >        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
>> >        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
>> >        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
>> >        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>> >
>> >
>> > On Hadoop Datanode:
>> >
>> > 2009-02-27 04:22:58,110 WARN
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>> > 10.1.188.249:50010,
>> storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
>> > infoPort=50075, ipcPort=50020):Got exception while serving
>> > blk_5465578316105624003_26301 to /10.1.188.249:
>> > java.net.SocketTimeoutException: 480000 millis timeout while waiting for
>> > channel to be ready for write. ch :
>> > java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010
>> remote=/
>> > 10.1.188.249:48326]
>> >        at
>> >
>> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
>> >        at
>> >
>> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>> >        at
>> >
>> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
>> >        at java.lang.Thread.run(Thread.java:619)
>> >
>> > 2009-02-27 04:22:58,110 ERROR
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>> > 10.1.188.249:50010,
>> storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
>> > infoPort=50075, ipcPort=50020):DataXceiver
>> > java.net.SocketTimeoutException: 480000 millis timeout while waiting for
>> > channel to be ready for write. ch :
>> > java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010
>> remote=/
>> > 10.1.188.249:48326]
>> >        at
>> >
>> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
>> >        at
>> >
>> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>> >        at
>> >
>> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
>> >        at java.lang.Thread.run(Thread.java:619)
>>
>>
>> Are you sure the regionserver error matches the datanode error?
>>
>> My understanding is that in 0.19.0, DFSClient in regionserver is supposed
>> to
>> reestablish timed-out connections.  If that is not happening in your case
>> --
>> and we've speculated some that there might be holes in this mechanism -- try
>> with timeout set to zero (see citation above; be sure the configuration
>> can
>> be seen by the DFSClient running in hbase by either adding to
>> hbase-site.xml
>> or somehow get the hadoop-site.xml into hbase CLASSPATH
>> (hbase-env.sh#HBASE_CLASSPATH or with a symlink into the HBASE_HOME/conf
>> dir).
>>
>>
>>
>> > Case 2:
>> >
>> > HBase Regionserver:
>> >
>> > 2009-03-02 09:55:11,929 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DFSOutputStream ResponseProcessor exception  for block
>> > blk_-6496095407839777264_96895java.io.IOException: Bad response 1 for
>> block
>> > blk_-6496095407839777264_96895 from datanode 10.1.188.182:50010
>> >        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_-6496095407839777264_96895 bad datanode[1]
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_-6496095407839777264_96895 in pipeline
>> > 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad
>> datanode
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DFSOutputStream ResponseProcessor exception  for block
>> > blk_-7585241287138805906_96914java.io.IOException: Bad response 1 for
>> block
>> > blk_-7585241287138805906_96914 from datanode 10.1.188.182:50010
>> >        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_-7585241287138805906_96914 bad datanode[1]
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:14,363 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_-7585241287138805906_96914 in pipeline
>> > 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.141:50010: bad
>> datanode
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:14,445 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DFSOutputStream ResponseProcessor exception  for block
>> > blk_8693483996243654850_96912java.io.IOException: Bad response 1 for
>> block
>> > blk_8693483996243654850_96912 from datanode 10.1.188.182:50010
>> >        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_8693483996243654850_96912 bad datanode[1]
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_8693483996243654850_96912 in pipeline
>> > 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad
>> datanode
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:14,923 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DFSOutputStream ResponseProcessor exception  for block
>> > blk_-8939308025013258259_96931java.io.IOException: Bad response 1 for
>> block
>> > blk_-8939308025013258259_96931 from datanode 10.1.188.182:50010
>> >        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_-8939308025013258259_96931 bad datanode[1]
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_-8939308025013258259_96931 in pipeline
>> > 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad
>> datanode
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DFSOutputStream ResponseProcessor exception  for block
>> > blk_7417692418733608681_96934java.io.IOException: Bad response 1 for
>> block
>> > blk_7417692418733608681_96934 from datanode 10.1.188.182:50010
>> >        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_7417692418733608681_96934 bad datanode[2]
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_7417692418733608681_96934 in pipeline
>> > 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad
>> datanode
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DFSOutputStream ResponseProcessor exception  for block
>> > blk_6777180223564108728_96939java.io.IOException: Bad response 1 for
>> block
>> > blk_6777180223564108728_96939 from datanode 10.1.188.182:50010
>> >        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_6777180223564108728_96939 bad datanode[1]
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_6777180223564108728_96939 in pipeline
>> > 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad
>> datanode
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DFSOutputStream ResponseProcessor exception  for block
>> > blk_-6352908575431276531_96948java.io.IOException: Bad response 1 for
>> block
>> > blk_-6352908575431276531_96948 from datanode 10.1.188.182:50010
>> >        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_-6352908575431276531_96948 bad datanode[2]
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_-6352908575431276531_96948 in pipeline
>> > 10.1.188.249:50010, 10.1.188.30:50010, 10.1.188.182:50010: bad datanode
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:15,988 INFO
>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Worker:
>> > MSG_REGION_SPLIT: metadata_table,r:
>> >
>> http://com.over-blog.www/_cdata/img/footer_mid.gif@20070505132942-20070505132942,1235761772185
>> > 2009-03-02 09:55:16,008
>> WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream
>> > ResponseProcessor exception  for block
>> > blk_-1071965721931053111_96956java.io.IOException: Bad response 1 for
>> block
>> > blk_-1071965721931053111_96956 from datanode 10.1.188.182:50010
>> >        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_-1071965721931053111_96956 bad datanode[2]
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:16,009 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_-1071965721931053111_96956 in pipeline
>> > 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad
>> datanode
>> > 10.1.188.182:50010
>> > 2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DFSOutputStream ResponseProcessor exception  for block
>> > blk_1004039574836775403_96959java.io.IOException: Bad response 1 for
>> block
>> > blk_1004039574836775403_96959 from datanode 10.1.188.182:50010
>> >        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>> >
>> > 2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_1004039574836775403_96959 bad datanode[1]
>> > 10.1.188.182:50010
>> >
>> >
>> > Hadoop datanode:
>> >
>> > 2009-03-02 09:55:10,201 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
>> > blk_-5472632607337755080_96875 1 Exception java.io.EOFException
>> >        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>> >        at java.io.DataInputStream.readLong(DataInputStream.java:399)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:833)
>> >        at java.lang.Thread.run(Thread.java:619)
>> >
>> > 2009-03-02 09:55:10,407 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for
>> block
>> > blk_-5472632607337755080_96875 terminating
>> > 2009-03-02 09:55:10,516 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>> > 10.1.188.249:50010,
>> storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
>> > infoPort=50075, ipcPort=50020):Exception writing block
>> > blk_-5472632607337755080_96875 to mirror 10.1.188.182:50010
>> > java.io.IOException: Broken pipe
>> >        at sun.nio.ch.FileDispatcher.write0(Native Method)
>> >        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>> >        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
>> >        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
>> >        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>> >        at
>> >
>> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>> >        at
>> >
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>> >        at
>> >
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>> >        at
>> >
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>> >        at
>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>> >        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
>> >        at java.lang.Thread.run(Thread.java:619)
>> >
>> > 2009-03-02 09:55:10,517 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
>> receiveBlock
>> > for block blk_-5472632607337755080_96875 java.io.IOException: Broken
>> pipe
>> > 2009-03-02 09:55:10,517 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>> > blk_-5472632607337755080_96875 received exception java.io.IOException:
>> > Broken pipe
>> > 2009-03-02 09:55:10,517 ERROR
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>> > 10.1.188.249:50010,
>> storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
>> > infoPort=50075, ipcPort=50020):DataXceiver
>> > java.io.IOException: Broken pipe
>> >        at sun.nio.ch.FileDispatcher.write0(Native Method)
>> >        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>> >        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
>> >        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
>> >        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>> >        at
>> >
>> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>> >        at
>> >
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>> >        at
>> >
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>> >        at
>> >
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>> >        at
>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>> >        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
>> >        at java.lang.Thread.run(Thread.java:619)
>> > 2009-03-02 09:55:11,174 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>> > 10.1.188.249:49063, dest: /10.1.188.249:50010, bytes: 312, op:
>> HDFS_WRITE,
>> > cliID: DFSClient_1091437257, srvID:
>> > DS-1180278657-127.0.0.1-50010-1235652659245, blockid:
>> > blk_5027345212081735473_96878
>> > 2009-03-02 09:55:11,177 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
>> block
>> > blk_5027345212081735473_96878 terminating
>> > 2009-03-02 09:55:11,185 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>> > blk_-3992843464553216223_96885 src: /10.1.188.249:49069 dest: /
>> > 10.1.188.249:50010
>> > 2009-03-02 09:55:11,186 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>> > blk_-3132070329589136987_96885 src: /10.1.188.30:33316 dest: /
>> > 10.1.188.249:50010
>> > 2009-03-02 09:55:11,187 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
>> receiveBlock
>> > for block blk_8782629414415941143_96845 java.io.IOException: Connection
>> > reset by peer
>> > 2009-03-02 09:55:11,187 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
>> block
>> > blk_8782629414415941143_96845 Interrupted.
>> > 2009-03-02 09:55:11,187 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
>> block
>> > blk_8782629414415941143_96845 terminating
>> > 2009-03-02 09:55:11,187 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>> > blk_8782629414415941143_96845 received exception java.io.IOException:
>> > Connection reset by peer
>> > 2009-03-02 09:55:11,187 ERROR
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>> > 10.1.188.249:50010,
>> storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
>> > infoPort=50075, ipcPort=50020):DataXceiver
>> > java.io.IOException: Connection reset by peer
>> >        at sun.nio.ch.FileDispatcher.read0(Native Method)
>> >        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>> >        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>> >        at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>> >        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>> >        at
>> >
>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>> >        at
>> >
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>> >        at
>> > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>> >        at
>> > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>> >        at
>> java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>> >        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >        at java.io.DataInputStream.read(DataInputStream.java:132)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:251)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
>> >        at
>> >
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
>> >        at java.lang.Thread.run(Thread.java:619)
>> >        etc.............................
>>
>>
>>
>> This looks like an HDFS issue where it won't move on past the bad server
>> 182.  On client side, they are reported as WARN in the dfsclient but don't
>> make it up to regionserver so not much we can do about it.
>>
>>
>> > I have other exceptions related to DataXceiver problems. These errors
>> > don't make the region server go down, but I can see that I lost some
>> > records (about 3.10e6 out of 160.10e6).
>> >
>>
>>
>> Any regionserver crashes during your upload?  I'd think this is the more
>> likely reason for data loss; i.e. edits that were in memcache didn't make
>> it out to the filesystem because there is still no working flush in hdfs --
>> hopefully in 0.21 hadoop... see HADOOP-4379.... (though your scenario 2
>> above looks like we could have handed hdfs the data but it dropped it
>> anyway....)
>>
>>
>>
>> >
>> > As you can see in my conf files, I upped dfs.datanode.max.xcievers to
>> > 8192 as suggested in several mails.
>> > And my ulimit -n is at 32768.
>>
>>
>> Make sure the above is actually in place by looking at the head of your
>> regionserver log on startup.
>>
>>
>>
>> > Do these problems come from my configuration, or my hardware?
>> >
>>
>>
>> Let's do some more back and forth and make sure we have done all we can
>> as regards the software configuration.  It's probably not hardware, going
>> by the above.
>>
>> Tell us more about your uploading process and your schema.  Did it all
>> load?  If so, on your 6 servers, how many regions?  How did you verify how
>> much was loaded?
>>
>> St.Ack
>>
>
>

Re: Data lost during intensive writes

Posted by schubert zhang <zs...@gmail.com>.
Following is what I had sent to J-D in another email thread; I will check
more of the logs from 3.24-25.


2009-03-23 10:07:57,465 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 9000, call
addBlock(/hbase/log_10.24.1.18_1237686636736_60020/hlog.dat.1237774027436,
DFSClient_629567488) from 10.24.1.18:59685: error:
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
replicated
yet:/hbase/log_10.24.1.18_1237686636736_60020/hlog.dat.1237774027436
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
replicated
yet:/hbase/log_10.24.1.18_1237686636736_60020/hlog.dat.1237774027436
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(Unknown
Source)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(Unknown Source)
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(Unknown Source)
at org.apache.hadoop.ipc.Server$Handler.run(Unknown Source)
2009-03-23 10:07:57,552 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated:10.24.1.12:50010 is added to
blk_8246919716767617786_109126 size 1048576
2009-03-23 10:07:57,552 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated:10.24.1.12:50010 is added to
blk_8246919716767617786_109126 size 1048576
2009-03-23 10:07:57,554 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.allocateBlock:
/hbase/log_10.24.1.16_1237686658208_60020/hlog.dat.1237774044443.
blk_45871727940505900_109126
2009-03-23 10:07:57,688 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated:10.24.1.12:50010 is added to
blk_2378060095065607252_109126 size 1048576
2009-03-23 10:07:57,688 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated:10.24.1.14:50010 is added to
blk_2378060095065607252_109126 size 1048576
2009-03-23 10:07:57,689 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.allocateBlock:
/hbase/log_10.24.1.14_1237686648061_60020/hlog.dat.1237774036841.
blk_8448212226292209521_109126
2009-03-23 10:07:57,869 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 9000, call
addBlock(/hbase/log_10.24.1.18_1237686636736_60020/hlog.dat.1237774027436,
DFSClient_629567488) from 10.24.1.18:59685: error:
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
replicated
yet:/hbase/log_10.24.1.18_1237686636736_60020/hlog.dat.1237774027436
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
replicated
yet:/hbase/log_10.24.1.18_1237686636736_60020/hlog.dat.1237774027436
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(Unknown
Source)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(Unknown Source)
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(Unknown Source)
at org.apache.hadoop.ipc.Server$Handler.run(Unknown Source)
2009-03-23 10:07:57,944 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated:10.24.1.18:50010 is added to
blk_1270075611008480481_109121 size 1048576

I cannot find useful info in the datanode's logs at that point in time, but I
did find something else, for example:

2009-03-23 10:08:09,321 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
10.24.1.20:50010, storageID=DS-2136798339-10.24.1.20-50010-1237686444430,
infoPort=50075, ipcPort=50020):Failed to transfer
blk_-4099352067684877111_109151 to 10.24.1.18:50010 got
java.net.SocketException: Original Exception : java.io.IOException:
Connection reset by peer
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
        at
sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:418)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:519)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(Unknown
Source)
        at
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(Unknown
Source)
        at
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(Unknown Source)
        at
org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(Unknown
Source)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: Connection reset by peer
        ... 8 more

and:

2009-03-23 10:10:17,313 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
10.24.1.20:50010, storageID=DS-2136798339-10.24.1.20-50010-1237686444430,
infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
blk_-6347382571494739349_109326 is valid, and cannot be written to.
        at
org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(Unknown
Source)
        at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(Unknown Source)
        at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(Unknown
Source)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(Unknown
Source)
        at java.lang.Thread.run(Thread.java:619)



On Wed, Mar 25, 2009 at 9:36 PM, stack <st...@duboce.net> wrote:

> On Wed, Mar 25, 2009 at 2:01 AM, schubert zhang <zs...@gmail.com> wrote:
>
>
> > But the two
> > exceptions started to happen earlier.
> >
>
> Which two exceptions, Schubert?
>
>
> hadoop-0.19
> > hbase-0.19.1 (with patch
> > https://issues.apache.org/jira/browse/HBASE-1008).
> >
> > I want to try to set dfs.datanode.socket.write.timeout=0 and watch it
> > later.
>
>
> Later you ask, 'if I set "dfs.datanode.socket.write.timeout=0", Hadoop will
> always create a new socket, is that OK?'  I traced write.timeout and it looks
> like it becomes the socket timeout -- no other special handling seems to be
> done.  Perhaps I am missing something?  To what are you referring?
>
> Thanks,
> St.Ack
>

Re: Data lost during intensive writes

Posted by stack <st...@duboce.net>.
On Wed, Mar 25, 2009 at 2:01 AM, schubert zhang <zs...@gmail.com> wrote:


> But the two
> exceptions started to happen earlier.
>

Which two exceptions, Schubert?


hadoop-0.19
> hbase-0.19.1 (with patch
> https://issues.apache.org/jira/browse/HBASE-1008).
>
> I want to try to set dfs.datanode.socket.write.timeout=0 and watch it
> later.


Later you ask, 'if I set "dfs.datanode.socket.write.timeout=0", Hadoop will
always create a new socket, is that OK?'  I traced write.timeout and it looks
like it becomes the socket timeout -- no other special handling seems to be
done.  Perhaps I am missing something?  To what are you referring?

Thanks,
St.Ack

Re: Data lost during intensive writes

Posted by Andrew Purtell <ap...@apache.org>.
You may need to increase the maximum number of xceivers allowed
on each of your datanodes. 
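
For example, something like the following in hadoop-site.xml on each datanode
(a sketch only; 8192 is the value an earlier poster in this thread reports
using -- pick a value that fits your cluster and restart the datanodes
afterwards):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>8192</value>
  <description>Upper bound on the number of concurrent DataXceiver
  threads a datanode will run; the stock default is much lower and is
  easily exhausted under an intensive HBase write load.</description>
</property>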

Best regards,

   - Andy

> From: schubert zhang <zs...@gmail.com>
> Subject: Re: Data lost during intensive writes
> To: hbase-user@hadoop.apache.org
> Date: Wednesday, March 25, 2009, 2:01 AM
> Hi all,
> I am also hitting the same problems/exceptions.
> I also have a 5+1 machine setup, and the system has been running
> for about 4 days,
> and there are 512 regions now. But the two
> exceptions started to happen earlier.
> 
> hadoop-0.19
> hbase-0.19.1 (with patch
> https://issues.apache.org/jira/browse/HBASE-1008).
> 
> I want to try to set dfs.datanode.socket.write.timeout=0
> and watch it later.
> 
> Schubert
> 
> On Sat, Mar 7, 2009 at 3:15 AM, stack
> <st...@duboce.net> wrote:
> 
> > On Wed, Mar 4, 2009 at 9:18 AM,
> <jt...@ina.fr> wrote:
> >
> > > <property>
> > >  <name>dfs.replication</name>
> > >  <value>2</value>
> > >  <description>Default block replication.
> > >  The actual number of replications can be
> specified when the file is
> > > created.
> > >  The default is used if replication is not
> specified in create time.
> > >  </description>
> > > </property>
> > >
> > > <property>
> > >  <name>dfs.block.size</name>
> > >  <value>8388608</value>
> > >  <description>The hbase standard size for
> new files.</description>
> > > <!--<value>67108864</value>-->
> > > <!--<description>The default block size
> for new files.</description>-->
> > > </property>
> > >
> >
> >
> > The above are non-standard.  A replication of 3 might
> lessen the incidence
> > of HDFS errors seen since there will be another
> replica to go to.   Why
> > non-standard block size?
> >
> > I did not see *dfs.datanode.socket.write.timeout* set
> to 0.  Is that
> > because
> > you are running w/ 0.19.0?  You might try with it
> especially because in the
> > below I see complaint about the timeout (but more
> below on this).
> >
> >
> >
> > >  <property>
> > >   
> <name>hbase.hstore.blockCache.blockSize</name>
> > >    <value>65536</value>
> > >    <description>The size of each block in
> the block cache.
> > >    Enable blockcaching on a per column family
> basis; see the BLOCKCACHE
> > > setting
> > >    in HColumnDescriptor.  Blocks are kept in a
> java Soft Reference cache
> > so
> > > are
> > >    let go when high pressure on memory.  Block
> caching is not enabled by
> > > default.
> > >    Default is 16384.
> > >    </description>
> > >  </property>
> > >
> >
> >
> > Are you using blockcaching?  If so, 64k was
> problematic in my testing
> > (OOMEing).
> >
> >
> >
> >
> > > Case 1:
> > >
> > > On HBase Regionserver:
> > >
> > > 2009-02-27 04:23:52,185 INFO
> org.apache.hadoop.hdfs.DFSClient:
> > > org.apache.hadoop.ipc.RemoteException:
> > >
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException:
> Not
> > > replicated
> > >
> >
> yet:/hbase/metadata_table/compaction.dir/1476318467/content/mapfiles/260278331337921598/data
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1256)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
> > >        at
> sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
> > >        at
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >        at
> java.lang.reflect.Method.invoke(Method.java:597)
> > >        at
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
> > >        at
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
> > >
> > >        at
> org.apache.hadoop.ipc.Client.call(Client.java:696)
> > >        at
> org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
> > >        at $Proxy1.addBlock(Unknown Source)
> > >        at
> sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
> > >        at
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >        at
> java.lang.reflect.Method.invoke(Method.java:597)
> > >        at
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> > >        at
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> > >        at $Proxy1.addBlock(Unknown Source)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> > >
> > >
> > > On Hadoop Datanode:
> > >
> > > 2009-02-27 04:22:58,110 WARN
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(
> > > 10.1.188.249:50010,
> > storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > infoPort=50075, ipcPort=50020):Got exception
> while serving
> > > blk_5465578316105624003_26301 to /10.1.188.249:
> > > java.net.SocketTimeoutException: 480000 millis
> timeout while waiting for
> > > channel to be ready for write. ch :
> > > java.nio.channels.SocketChannel[connected
> local=/10.1.188.249:50010
> > remote=/
> > > 10.1.188.249:48326]
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
> > >        at java.lang.Thread.run(Thread.java:619)
> > >
> > > 2009-02-27 04:22:58,110 ERROR
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(
> > > 10.1.188.249:50010,
> > storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > infoPort=50075, ipcPort=50020):DataXceiver
> > > java.net.SocketTimeoutException: 480000 millis
> timeout while waiting for
> > > channel to be ready for write. ch :
> > > java.nio.channels.SocketChannel[connected
> local=/10.1.188.249:50010
> > remote=/
> > > 10.1.188.249:48326]
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
> > >        at java.lang.Thread.run(Thread.java:619)
> >
> >
> > Are you sure the regionserver error matches the
> datanode error?
> >
> > My understanding is that in 0.19.0, DFSClient in
> regionserver is supposed
> > to
> > reestablish timed-out connections.  If that is not
> happening in your case
> > --
> > and we've speculated some that there might be holes
> > in this mechanism -- try
> > with timeout set to zero (see citation above; be sure
> the configuration can
> > be seen by the DFSClient running in hbase by either
> adding to
> > hbase-site.xml
> > or somehow get the hadoop-site.xml into hbase
> CLASSPATH
> > (hbase-env.sh#HBASE_CLASSPATH or with a symlink into
> the HBASE_HOME/conf
> > dir).
> >
> >
> >
> > > Case 2:
> > >
> > > HBase Regionserver:
> > >
> > > 2009-03-02 09:55:11,929 WARN
> org.apache.hadoop.hdfs.DFSClient:
> > > DFSOutputStream ResponseProcessor exception  for
> block
> > >
> blk_-6496095407839777264_96895java.io.IOException: Bad
> response 1 for
> > block
> > > blk_-6496095407839777264_96895 from datanode
> 10.1.188.182:50010
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > >
> > > 2009-03-02 09:55:11,930 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_-6496095407839777264_96895
> bad datanode[1]
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:11,930 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_-6496095407839777264_96895
> in pipeline
> > > 10.1.188.249:50010, 10.1.188.182:50010,
> 10.1.188.203:50010: bad datanode
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:14,362 WARN
> org.apache.hadoop.hdfs.DFSClient:
> > > DFSOutputStream ResponseProcessor exception  for
> block
> > >
> blk_-7585241287138805906_96914java.io.IOException: Bad
> response 1 for
> > block
> > > blk_-7585241287138805906_96914 from datanode
> 10.1.188.182:50010
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > >
> > > 2009-03-02 09:55:14,362 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_-7585241287138805906_96914
> bad datanode[1]
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:14,363 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_-7585241287138805906_96914
> in pipeline
> > > 10.1.188.249:50010, 10.1.188.182:50010,
> 10.1.188.141:50010: bad datanode
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:14,445 WARN
> org.apache.hadoop.hdfs.DFSClient:
> > > DFSOutputStream ResponseProcessor exception  for
> block
> > > blk_8693483996243654850_96912java.io.IOException:
> Bad response 1 for
> > block
> > > blk_8693483996243654850_96912 from datanode
> 10.1.188.182:50010
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > >
> > > 2009-03-02 09:55:14,446 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_8693483996243654850_96912
> bad datanode[1]
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:14,446 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_8693483996243654850_96912
> in pipeline
> > > 10.1.188.249:50010, 10.1.188.182:50010,
> 10.1.188.203:50010: bad datanode
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:14,923 WARN
> org.apache.hadoop.hdfs.DFSClient:
> > > DFSOutputStream ResponseProcessor exception  for
> block
> > >
> blk_-8939308025013258259_96931java.io.IOException: Bad
> response 1 for
> > block
> > > blk_-8939308025013258259_96931 from datanode
> 10.1.188.182:50010
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > >
> > > 2009-03-02 09:55:14,935 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_-8939308025013258259_96931
> bad datanode[1]
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:14,935 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_-8939308025013258259_96931
> in pipeline
> > > 10.1.188.249:50010, 10.1.188.182:50010,
> 10.1.188.203:50010: bad datanode
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:15,344 WARN
> org.apache.hadoop.hdfs.DFSClient:
> > > DFSOutputStream ResponseProcessor exception  for
> block
> > > blk_7417692418733608681_96934java.io.IOException:
> Bad response 1 for
> > block
> > > blk_7417692418733608681_96934 from datanode
> 10.1.188.182:50010
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > >
> > > 2009-03-02 09:55:15,344 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_7417692418733608681_96934
> bad datanode[2]
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:15,344 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_7417692418733608681_96934
> in pipeline
> > > 10.1.188.249:50010, 10.1.188.203:50010,
> 10.1.188.182:50010: bad datanode
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:15,579 WARN
> org.apache.hadoop.hdfs.DFSClient:
> > > DFSOutputStream ResponseProcessor exception  for
> block
> > > blk_6777180223564108728_96939java.io.IOException:
> Bad response 1 for
> > block
> > > blk_6777180223564108728_96939 from datanode
> 10.1.188.182:50010
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > >
> > > 2009-03-02 09:55:15,579 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_6777180223564108728_96939
> bad datanode[1]
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:15,579 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_6777180223564108728_96939
> in pipeline
> > > 10.1.188.249:50010, 10.1.188.182:50010,
> 10.1.188.203:50010: bad datanode
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:15,930 WARN
> org.apache.hadoop.hdfs.DFSClient:
> > > DFSOutputStream ResponseProcessor exception  for
> block
> > >
> blk_-6352908575431276531_96948java.io.IOException: Bad
> response 1 for
> > block
> > > blk_-6352908575431276531_96948 from datanode
> 10.1.188.182:50010
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > >
> > > 2009-03-02 09:55:15,930 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_-6352908575431276531_96948
> bad datanode[2]
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:15,930 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_-6352908575431276531_96948
> in pipeline
> > > 10.1.188.249:50010, 10.1.188.30:50010,
> 10.1.188.182:50010: bad datanode
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:15,988 INFO
> > >
> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker:
> > > MSG_REGION_SPLIT: metadata_table,r:
> > >
> >
> http://com.over-blog.www/_cdata/img/footer_mid.gif@20070505132942-20070505132942,1235761772185
> > > 2009-03-02 09:55:16,008
> > WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream
> > > ResponseProcessor exception  for block
> > >
> blk_-1071965721931053111_96956java.io.IOException: Bad
> response 1 for
> > block
> > > blk_-1071965721931053111_96956 from datanode
> 10.1.188.182:50010
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > >
> > > 2009-03-02 09:55:16,008 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_-1071965721931053111_96956
> bad datanode[2]
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:16,009 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_-1071965721931053111_96956
> in pipeline
> > > 10.1.188.249:50010, 10.1.188.203:50010,
> 10.1.188.182:50010: bad datanode
> > > 10.1.188.182:50010
> > > 2009-03-02 09:55:16,073 WARN
> org.apache.hadoop.hdfs.DFSClient:
> > > DFSOutputStream ResponseProcessor exception  for
> block
> > > blk_1004039574836775403_96959java.io.IOException:
> Bad response 1 for
> > block
> > > blk_1004039574836775403_96959 from datanode
> 10.1.188.182:50010
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> > >
> > > 2009-03-02 09:55:16,073 WARN
> org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_1004039574836775403_96959
> bad datanode[1]
> > > 10.1.188.182:50010
> > >
> > >
> > > Hadoop datanode:
> > >
> > > 2009-03-02 09:55:10,201 INFO
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> PacketResponder
> > > blk_-5472632607337755080_96875 1 Exception
> java.io.EOFException
> > >        at
> java.io.DataInputStream.readFully(DataInputStream.java:180)
> > >        at
> java.io.DataInputStream.readLong(DataInputStream.java:399)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:833)
> > >        at java.lang.Thread.run(Thread.java:619)
> > >
> > > 2009-03-02 09:55:10,407 INFO
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> PacketResponder 1 for
> > block
> > > blk_-5472632607337755080_96875 terminating
> > > 2009-03-02 09:55:10,516 INFO
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(
> > > 10.1.188.249:50010,
> > storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > infoPort=50075, ipcPort=50020):Exception writing
> block
> > > blk_-5472632607337755080_96875 to mirror
> 10.1.188.182:50010
> > > java.io.IOException: Broken pipe
> > >        at sun.nio.ch.FileDispatcher.write0(Native
> Method)
> > >        at
> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
> > >        at
> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
> > >        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
> > >        at
> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> > >        at
> >
> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> > >        at
> java.io.DataOutputStream.write(DataOutputStream.java:90)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> > >        at java.lang.Thread.run(Thread.java:619)
> > >
> > > 2009-03-02 09:55:10,517 INFO
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> Exception in
> > receiveBlock
> > > for block blk_-5472632607337755080_96875
> java.io.IOException: Broken pipe
> > > 2009-03-02 09:55:10,517 INFO
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> writeBlock
> > > blk_-5472632607337755080_96875 received exception
> java.io.IOException:
> > > Broken pipe
> > > 2009-03-02 09:55:10,517 ERROR
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(
> > > 10.1.188.249:50010,
> > storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > infoPort=50075, ipcPort=50020):DataXceiver
> > > java.io.IOException: Broken pipe
> > >        at sun.nio.ch.FileDispatcher.write0(Native
> Method)
> > >        at
> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
> > >        at
> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
> > >        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
> > >        at
> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> > >        at
> >
> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> > >        at
> java.io.DataOutputStream.write(DataOutputStream.java:90)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> > >        at java.lang.Thread.run(Thread.java:619)
> > > 2009-03-02 09:55:11,174 INFO
> > >
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
> src: /
> > > 10.1.188.249:49063, dest: /10.1.188.249:50010,
> bytes: 312, op:
> > HDFS_WRITE,
> > > cliID: DFSClient_1091437257, srvID:
> > > DS-1180278657-127.0.0.1-50010-1235652659245,
> blockid:
> > > blk_5027345212081735473_96878
> > > 2009-03-02 09:55:11,177 INFO
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> PacketResponder 2 for
> > block
> > > blk_5027345212081735473_96878 terminating
> > > 2009-03-02 09:55:11,185 INFO
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> Receiving block
> > > blk_-3992843464553216223_96885 src:
> /10.1.188.249:49069 dest: /
> > > 10.1.188.249:50010
> > > 2009-03-02 09:55:11,186 INFO
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> Receiving block
> > > blk_-3132070329589136987_96885 src:
> /10.1.188.30:33316 dest: /
> > > 10.1.188.249:50010
> > > 2009-03-02 09:55:11,187 INFO
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> Exception in
> > receiveBlock
> > > for block blk_8782629414415941143_96845
> java.io.IOException: Connection
> > > reset by peer
> > > 2009-03-02 09:55:11,187 INFO
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> PacketResponder 0 for
> > block
> > > blk_8782629414415941143_96845 Interrupted.
> > > 2009-03-02 09:55:11,187 INFO
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> PacketResponder 0 for
> > block
> > > blk_8782629414415941143_96845 terminating
> > > 2009-03-02 09:55:11,187 INFO
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> writeBlock
> > > blk_8782629414415941143_96845 received exception
> java.io.IOException:
> > > Connection reset by peer
> > > 2009-03-02 09:55:11,187 ERROR
> > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(
> > > 10.1.188.249:50010,
> > storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > > infoPort=50075, ipcPort=50020):DataXceiver
> > > java.io.IOException: Connection reset by peer
> > >        at sun.nio.ch.FileDispatcher.read0(Native
> Method)
> > >        at
> sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> > >        at
> sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
> > >        at sun.nio.ch.IOUtil.read(IOUtil.java:206)
> > >        at
> sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
> > >        at
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> > >        at
> > >
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
> > >        at
> > >
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
> > >        at
> java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > >        at
> java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > >        at
> java.io.DataInputStream.read(DataInputStream.java:132)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:251)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> > >        at
> > >
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> > >        at java.lang.Thread.run(Thread.java:619)
> > >        etc.............................
> >
> >
> >
> > This looks like an HDFS issue where it won't move
> on past the bad server
> > 182.  On client side, they are reported as WARN in the
> dfsclient but don't
> > make it up to regionserver so not much we can do about
> it.
> >
> >
> > I have others exceptions related to DataXceivers
> problems. These errors
> > > doesn't make the region server go down, but I
> can see that I lost some
> > > records (about 3.10e6 out of 160.10e6).
> > >
> >
> >
> > Any regionserver crashes during your upload?  I'd
> think this more the
> > reason
> > for dataloss; i.e. edits that were in memcache
> didn't make it out to the
> > filesystem because there is still no working flush in
> hdfs -- hopefully
> > 0.21
> > hadoop... see HADOOP-4379.... (though your scenario 2
> above looks like we
> > could have handed hdfs the data but it dropped it
> anyways....)
> >
> >
> >
> > >
> > > As you can see in my conf files, I up the
> dfs.datanode.max.xcievers to
> > 8192
> > > as suggested from several mails.
> > > And my ulimit -n is at 32768.
> >
> >
> > Make sure you can see that above is for sure in place
> by looking at the
> > head
> > of your regionserver log on startup.
> >
> >
> >
> > > Do these problems come from my configuration, or
> my hardware ?
> > >
> >
> >
> > Lets do some more back and forth and make sure we have
> done all we can
> > regards the software configuration.  Its probably not
> hardware going by the
> > above.
> >
> > Tell us more about your uploading process and your
> schema.  Did all load?
> > If so, on your 6 servers, how many regions?  How did
> you verify how much
> > was
> > loaded?
> >
> > St.Ack
> >


      

Re: Data lost during intensive writes

Posted by schubert zhang <zs...@gmail.com>.
Hi all,
I have also run into these same problems/exceptions.
I also have a 5+1-machine cluster, and the system has been running for about 4 days;
there are 512 regions now. But the two
exceptions started to happen earlier than that.

hadoop-0.19
hbase-0.19.1 (with the patch from https://issues.apache.org/jira/browse/HBASE-1008).

I want to try setting dfs.datanode.socket.write.timeout=0 and will watch how it behaves.

Schubert

On Sat, Mar 7, 2009 at 3:15 AM, stack <st...@duboce.net> wrote:

> On Wed, Mar 4, 2009 at 9:18 AM, <jt...@ina.fr> wrote:
>
> > <property>
> >  <name>dfs.replication</name>
> >  <value>2</value>
> >  <description>Default block replication.
> >  The actual number of replications can be specified when the file is
> > created.
> >  The default is used if replication is not specified in create time.
> >  </description>
> > </property>
> >
> > <property>
> >  <name>dfs.block.size</name>
> >  <value>8388608</value>
> >  <description>The hbase standard size for new files.</description>
> > <!--<value>67108864</value>-->
> > <!--<description>The default block size for new files.</description>-->
> > </property>
> >
>
>
> The above are non-standard.  A replication of 3 might lessen the incidence
> of HDFS errors seen since there will be another replica to go to.   Why
> non-standard block size?
>
> I did not see *dfs.datanode.socket.write.timeout* set to 0.  Is that
> because
> you are running w/ 0.19.0?  You might try with it especially because in the
> below I see complaint about the timeout (but more below on this).
>
>
>
> >  <property>
> >    <name>hbase.hstore.blockCache.blockSize</name>
> >    <value>65536</value>
> >    <description>The size of each block in the block cache.
> >    Enable blockcaching on a per column family basis; see the BLOCKCACHE
> > setting
> >    in HColumnDescriptor.  Blocks are kept in a java Soft Reference cache
> so
> > are
> >    let go when high pressure on memory.  Block caching is not enabled by
> > default.
> >    Default is 16384.
> >    </description>
> >  </property>
> >
>
>
> Are you using blockcaching?  If so, 64k was problematic in my testing
> (OOMEing).
>
>
>
>
> > Case 1:
> >
> > On HBase Regionserver:
> >
> > 2009-02-27 04:23:52,185 INFO org.apache.hadoop.hdfs.DFSClient:
> > org.apache.hadoop.ipc.RemoteException:
> > org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
> > replicated
> >
> yet:/hbase/metadata_table/compaction.dir/1476318467/content/mapfiles/260278331337921598/data
> >        at
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1256)
> >        at
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
> >        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
> >        at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
> >        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
> >
> >        at org.apache.hadoop.ipc.Client.call(Client.java:696)
> >        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
> >        at $Proxy1.addBlock(Unknown Source)
> >        at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
> >        at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >        at
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> >        at
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> >        at $Proxy1.addBlock(Unknown Source)
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> >
> >
> > On Hadoop Datanode:
> >
> > 2009-02-27 04:22:58,110 WARN
> > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > 10.1.188.249:50010,
> storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > infoPort=50075, ipcPort=50020):Got exception while serving
> > blk_5465578316105624003_26301 to /10.1.188.249:
> > java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> > channel to be ready for write. ch :
> > java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010
> remote=/
> > 10.1.188.249:48326]
> >        at
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
> >        at
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >        at
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
> >        at java.lang.Thread.run(Thread.java:619)
> >
> > 2009-02-27 04:22:58,110 ERROR
> > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > 10.1.188.249:50010,
> storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > infoPort=50075, ipcPort=50020):DataXceiver
> > java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> > channel to be ready for write. ch :
> > java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010
> remote=/
> > 10.1.188.249:48326]
> >        at
> >
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
> >        at
> >
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >        at
> >
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
> >        at java.lang.Thread.run(Thread.java:619)
>
>
> Are you sure the regionserver error matches the datanode error?
>
> My understanding is that in 0.19.0, DFSClient in regionserver is supposed
> to
> reestablish timed-out connections.  If that is not happening in your case
> --
> and we've speculated some that there might holes in this mechanism -- try
> with timeout set to zero (see citation above; be sure the configuration can
> be seen by the DFSClient running in hbase by either adding to
> hbase-site.xml
> or somehow get the hadoop-site.xml into hbase CLASSPATH
> (hbase-env.sh#HBASE_CLASSPATH or with a symlink into the HBASE_HOME/conf
> dir).
>
>
>
> > Case 2:
> >
> > HBase Regionserver:
> >
> > 2009-03-02 09:55:11,929 WARN org.apache.hadoop.hdfs.DFSClient:
> > DFSOutputStream ResponseProcessor exception  for block
> > blk_-6496095407839777264_96895java.io.IOException: Bad response 1 for
> block
> > blk_-6496095407839777264_96895 from datanode 10.1.188.182:50010
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> >
> > 2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_-6496095407839777264_96895 bad datanode[1]
> > 10.1.188.182:50010
> > 2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_-6496095407839777264_96895 in pipeline
> > 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode
> > 10.1.188.182:50010
> > 2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient:
> > DFSOutputStream ResponseProcessor exception  for block
> > blk_-7585241287138805906_96914java.io.IOException: Bad response 1 for
> block
> > blk_-7585241287138805906_96914 from datanode 10.1.188.182:50010
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> >
> > 2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_-7585241287138805906_96914 bad datanode[1]
> > 10.1.188.182:50010
> > 2009-03-02 09:55:14,363 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_-7585241287138805906_96914 in pipeline
> > 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.141:50010: bad datanode
> > 10.1.188.182:50010
> > 2009-03-02 09:55:14,445 WARN org.apache.hadoop.hdfs.DFSClient:
> > DFSOutputStream ResponseProcessor exception  for block
> > blk_8693483996243654850_96912java.io.IOException: Bad response 1 for
> block
> > blk_8693483996243654850_96912 from datanode 10.1.188.182:50010
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> >
> > 2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_8693483996243654850_96912 bad datanode[1]
> > 10.1.188.182:50010
> > 2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_8693483996243654850_96912 in pipeline
> > 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode
> > 10.1.188.182:50010
> > 2009-03-02 09:55:14,923 WARN org.apache.hadoop.hdfs.DFSClient:
> > DFSOutputStream ResponseProcessor exception  for block
> > blk_-8939308025013258259_96931java.io.IOException: Bad response 1 for
> block
> > blk_-8939308025013258259_96931 from datanode 10.1.188.182:50010
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> >
> > 2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_-8939308025013258259_96931 bad datanode[1]
> > 10.1.188.182:50010
> > 2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_-8939308025013258259_96931 in pipeline
> > 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode
> > 10.1.188.182:50010
> > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient:
> > DFSOutputStream ResponseProcessor exception  for block
> > blk_7417692418733608681_96934java.io.IOException: Bad response 1 for
> block
> > blk_7417692418733608681_96934 from datanode 10.1.188.182:50010
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> >
> > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_7417692418733608681_96934 bad datanode[2]
> > 10.1.188.182:50010
> > 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_7417692418733608681_96934 in pipeline
> > 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad datanode
> > 10.1.188.182:50010
> > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient:
> > DFSOutputStream ResponseProcessor exception  for block
> > blk_6777180223564108728_96939java.io.IOException: Bad response 1 for
> block
> > blk_6777180223564108728_96939 from datanode 10.1.188.182:50010
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> >
> > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_6777180223564108728_96939 bad datanode[1]
> > 10.1.188.182:50010
> > 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_6777180223564108728_96939 in pipeline
> > 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode
> > 10.1.188.182:50010
> > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient:
> > DFSOutputStream ResponseProcessor exception  for block
> > blk_-6352908575431276531_96948java.io.IOException: Bad response 1 for
> block
> > blk_-6352908575431276531_96948 from datanode 10.1.188.182:50010
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> >
> > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_-6352908575431276531_96948 bad datanode[2]
> > 10.1.188.182:50010
> > 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_-6352908575431276531_96948 in pipeline
> > 10.1.188.249:50010, 10.1.188.30:50010, 10.1.188.182:50010: bad datanode
> > 10.1.188.182:50010
> > 2009-03-02 09:55:15,988 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Worker:
> > MSG_REGION_SPLIT: metadata_table,r:
> >
> http://com.over-blog.www/_cdata/img/footer_mid.gif@20070505132942-20070505132942,1235761772185
> > 2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream
> > ResponseProcessor exception  for block
> > blk_-1071965721931053111_96956java.io.IOException: Bad response 1 for
> block
> > blk_-1071965721931053111_96956 from datanode 10.1.188.182:50010
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> >
> > 2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_-1071965721931053111_96956 bad datanode[2]
> > 10.1.188.182:50010
> > 2009-03-02 09:55:16,009 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_-1071965721931053111_96956 in pipeline
> > 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad datanode
> > 10.1.188.182:50010
> > 2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient:
> > DFSOutputStream ResponseProcessor exception  for block
> > blk_1004039574836775403_96959java.io.IOException: Bad response 1 for
> block
> > blk_1004039574836775403_96959 from datanode 10.1.188.182:50010
> >        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
> >
> > 2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_1004039574836775403_96959 bad datanode[1]
> > 10.1.188.182:50010
> >
> >
> > Hadoop datanode:
> >
> > 2009-03-02 09:55:10,201 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> > blk_-5472632607337755080_96875 1 Exception java.io.EOFException
> >        at java.io.DataInputStream.readFully(DataInputStream.java:180)
> >        at java.io.DataInputStream.readLong(DataInputStream.java:399)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:833)
> >        at java.lang.Thread.run(Thread.java:619)
> >
> > 2009-03-02 09:55:10,407 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for
> block
> > blk_-5472632607337755080_96875 terminating
> > 2009-03-02 09:55:10,516 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > 10.1.188.249:50010,
> storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > infoPort=50075, ipcPort=50020):Exception writing block
> > blk_-5472632607337755080_96875 to mirror 10.1.188.182:50010
> > java.io.IOException: Broken pipe
> >        at sun.nio.ch.FileDispatcher.write0(Native Method)
> >        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
> >        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
> >        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
> >        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
> >        at
> >
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
> >        at
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> >        at
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> >        at
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> >        at
> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> >        at java.io.DataOutputStream.write(DataOutputStream.java:90)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> >        at java.lang.Thread.run(Thread.java:619)
> >
> > 2009-03-02 09:55:10,517 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> receiveBlock
> > for block blk_-5472632607337755080_96875 java.io.IOException: Broken pipe
> > 2009-03-02 09:55:10,517 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> > blk_-5472632607337755080_96875 received exception java.io.IOException:
> > Broken pipe
> > 2009-03-02 09:55:10,517 ERROR
> > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > 10.1.188.249:50010,
> storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > infoPort=50075, ipcPort=50020):DataXceiver
> > java.io.IOException: Broken pipe
> >        at sun.nio.ch.FileDispatcher.write0(Native Method)
> >        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
> >        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
> >        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
> >        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
> >        at
> >
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
> >        at
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> >        at
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> >        at
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> >        at
> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> >        at java.io.DataOutputStream.write(DataOutputStream.java:90)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> >        at java.lang.Thread.run(Thread.java:619)
> > 2009-03-02 09:55:11,174 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > 10.1.188.249:49063, dest: /10.1.188.249:50010, bytes: 312, op:
> HDFS_WRITE,
> > cliID: DFSClient_1091437257, srvID:
> > DS-1180278657-127.0.0.1-50010-1235652659245, blockid:
> > blk_5027345212081735473_96878
> > 2009-03-02 09:55:11,177 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
> block
> > blk_5027345212081735473_96878 terminating
> > 2009-03-02 09:55:11,185 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> > blk_-3992843464553216223_96885 src: /10.1.188.249:49069 dest: /
> > 10.1.188.249:50010
> > 2009-03-02 09:55:11,186 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> > blk_-3132070329589136987_96885 src: /10.1.188.30:33316 dest: /
> > 10.1.188.249:50010
> > 2009-03-02 09:55:11,187 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> receiveBlock
> > for block blk_8782629414415941143_96845 java.io.IOException: Connection
> > reset by peer
> > 2009-03-02 09:55:11,187 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
> block
> > blk_8782629414415941143_96845 Interrupted.
> > 2009-03-02 09:55:11,187 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
> block
> > blk_8782629414415941143_96845 terminating
> > 2009-03-02 09:55:11,187 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> > blk_8782629414415941143_96845 received exception java.io.IOException:
> > Connection reset by peer
> > 2009-03-02 09:55:11,187 ERROR
> > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > 10.1.188.249:50010,
> storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> > infoPort=50075, ipcPort=50020):DataXceiver
> > java.io.IOException: Connection reset by peer
> >        at sun.nio.ch.FileDispatcher.read0(Native Method)
> >        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> >        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
> >        at sun.nio.ch.IOUtil.read(IOUtil.java:206)
> >        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
> >        at
> >
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
> >        at
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
> >        at
> > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
> >        at
> > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
> >        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >        at java.io.DataInputStream.read(DataInputStream.java:132)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:251)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
> >        at
> >
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
> >        at java.lang.Thread.run(Thread.java:619)
> >        etc.............................
>
>
>
> This looks like an HDFS issue where it won't move on past the bad server
> 182.  On client side, they are reported as WARN in the dfsclient but don't
> make it up to regionserver so not much we can do about it.
>
>
> I have others exceptions related to DataXceivers problems. These errors
> > doesn't make the region server go down, but I can see that I lost some
> > records (about 3.10e6 out of 160.10e6).
> >
>
>
> Any regionserver crashes during your upload?  I'd think this more the
> reason
> for dataloss; i.e. edits that were in memcache didn't make it out to the
> filesystem because there is still no working flush in hdfs -- hopefully
> 0.21
> hadoop... see HADOOP-4379.... (though your scenario 2 above looks like we
> could have handed hdfs the data but it dropped it anyways....)
>
>
>
> >
> > As you can see in my conf files, I up the dfs.datanode.max.xcievers to
> 8192
> > as suggested from several mails.
> > And my ulimit -n is at 32768.
>
>
> Make sure you can see that above is for sure in place by looking at the
> head
> of your regionserver log on startup.
>
>
>
> > Do these problems come from my configuration, or my hardware ?
> >
>
>
> Lets do some more back and forth and make sure we have done all we can
> regards the software configuration.  Its probably not hardware going by the
> above.
>
> Tell us more about your uploading process and your schema.  Did all load?
> If so, on your 6 servers, how many regions?  How did you verify how much
> was
> loaded?
>
> St.Ack
>

Re: Data lost during intensive writes

Posted by stack <st...@duboce.net>.
On Wed, Mar 4, 2009 at 9:18 AM, <jt...@ina.fr> wrote:

> <property>
>  <name>dfs.replication</name>
>  <value>2</value>
>  <description>Default block replication.
>  The actual number of replications can be specified when the file is
> created.
>  The default is used if replication is not specified in create time.
>  </description>
> </property>
>
> <property>
>  <name>dfs.block.size</name>
>  <value>8388608</value>
>  <description>The hbase standard size for new files.</description>
> <!--<value>67108864</value>-->
> <!--<description>The default block size for new files.</description>-->
> </property>
>


The above are non-standard.  A replication of 3 might lessen the incidence
of the HDFS errors you are seeing, since there will be another replica to go to.
Why the non-standard block size?
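
For reference, a rough sketch of what reverting those two properties in
hadoop-site.xml would look like; the values are just the stock defaults (the
block size is the one you already have commented out):

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Stock default of three replicas.</description>
</property>

<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
  <description>The default block size for new files.</description>
</property>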

I did not see *dfs.datanode.socket.write.timeout* set to 0.  Is that because
you are running with 0.19.0?  You might try it, especially because in the
logs below I see complaints about that timeout (more on this below).



>  <property>
>    <name>hbase.hstore.blockCache.blockSize</name>
>    <value>65536</value>
>    <description>The size of each block in the block cache.
>    Enable blockcaching on a per column family basis; see the BLOCKCACHE
> setting
>    in HColumnDescriptor.  Blocks are kept in a java Soft Reference cache so
> are
>    let go when high pressure on memory.  Block caching is not enabled by
> default.
>    Default is 16384.
>    </description>
>  </property>
>


Are you using blockcaching?  If so, 64k was problematic in my testing
(OOMEing).
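
If you do want block caching on, one thing to try is dropping the block size
back toward the documented default; a sketch of the override in hbase-site.xml
(16384 is the default mentioned in the description you pasted):

<property>
  <name>hbase.hstore.blockCache.blockSize</name>
  <value>16384</value>
  <description>Block cache block size; 16384 is the documented default.</description>
</property>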




> Case 1:
>
> On HBase Regionserver:
>
> 2009-02-27 04:23:52,185 INFO org.apache.hadoop.hdfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not
> replicated
> yet:/hbase/metadata_table/compaction.dir/1476318467/content/mapfiles/260278331337921598/data
>        at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1256)
>        at
> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
>        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
>
>        at org.apache.hadoop.ipc.Client.call(Client.java:696)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>        at $Proxy1.addBlock(Unknown Source)
>        at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>        at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>        at $Proxy1.addBlock(Unknown Source)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>
>
> On Hadoop Datanode:
>
> 2009-02-27 04:22:58,110 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> infoPort=50075, ipcPort=50020):Got exception while serving
> blk_5465578316105624003_26301 to /10.1.188.249:
> java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010 remote=/
> 10.1.188.249:48326]
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
>        at
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>        at
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
>        at java.lang.Thread.run(Thread.java:619)
>
> 2009-02-27 04:22:58,110 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/10.1.188.249:50010 remote=/
> 10.1.188.249:48326]
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
>        at
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>        at
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
>        at java.lang.Thread.run(Thread.java:619)


Are you sure the regionserver error matches the datanode error?

My understanding is that in 0.19.0, the DFSClient in the regionserver is supposed to
reestablish timed-out connections.  If that is not happening in your case --
and we've speculated that there might be holes in this mechanism -- try
with the timeout set to zero (see the citation above).  Be sure the configuration
can be seen by the DFSClient running in HBase, either by adding it to hbase-site.xml
or by getting the hadoop-site.xml onto the HBase CLASSPATH
(hbase-env.sh#HBASE_CLASSPATH, or a symlink into the HBASE_HOME/conf dir).
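
As a sketch, the override could be added to hbase-site.xml like this so the
DFSClient running inside HBase picks it up (a value of 0 disables the write
timeout):

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
  <description>Disable the datanode socket write timeout (0 means no timeout).</description>
</property>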



> Case 2:
>
> HBase Regionserver:
>
> 2009-03-02 09:55:11,929 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_-6496095407839777264_96895java.io.IOException: Bad response 1 for block
> blk_-6496095407839777264_96895 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-6496095407839777264_96895 bad datanode[1]
> 10.1.188.182:50010
> 2009-03-02 09:55:11,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-6496095407839777264_96895 in pipeline
> 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_-7585241287138805906_96914java.io.IOException: Bad response 1 for block
> blk_-7585241287138805906_96914 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:14,362 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-7585241287138805906_96914 bad datanode[1]
> 10.1.188.182:50010
> 2009-03-02 09:55:14,363 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-7585241287138805906_96914 in pipeline
> 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.141:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:14,445 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_8693483996243654850_96912java.io.IOException: Bad response 1 for block
> blk_8693483996243654850_96912 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_8693483996243654850_96912 bad datanode[1]
> 10.1.188.182:50010
> 2009-03-02 09:55:14,446 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_8693483996243654850_96912 in pipeline
> 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:14,923 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_-8939308025013258259_96931java.io.IOException: Bad response 1 for block
> blk_-8939308025013258259_96931 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-8939308025013258259_96931 bad datanode[1]
> 10.1.188.182:50010
> 2009-03-02 09:55:14,935 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-8939308025013258259_96931 in pipeline
> 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_7417692418733608681_96934java.io.IOException: Bad response 1 for block
> blk_7417692418733608681_96934 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7417692418733608681_96934 bad datanode[2]
> 10.1.188.182:50010
> 2009-03-02 09:55:15,344 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7417692418733608681_96934 in pipeline
> 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_6777180223564108728_96939java.io.IOException: Bad response 1 for block
> blk_6777180223564108728_96939 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_6777180223564108728_96939 bad datanode[1]
> 10.1.188.182:50010
> 2009-03-02 09:55:15,579 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_6777180223564108728_96939 in pipeline
> 10.1.188.249:50010, 10.1.188.182:50010, 10.1.188.203:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_-6352908575431276531_96948java.io.IOException: Bad response 1 for block
> blk_-6352908575431276531_96948 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-6352908575431276531_96948 bad datanode[2]
> 10.1.188.182:50010
> 2009-03-02 09:55:15,930 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-6352908575431276531_96948 in pipeline
> 10.1.188.249:50010, 10.1.188.30:50010, 10.1.188.182:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:15,988 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker:
> MSG_REGION_SPLIT: metadata_table,r:
> http://com.over-blog.www/_cdata/img/footer_mid.gif@20070505132942-20070505132942,1235761772185
> 2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream
> ResponseProcessor exception  for block
> blk_-1071965721931053111_96956java.io.IOException: Bad response 1 for block
> blk_-1071965721931053111_96956 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:16,008 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-1071965721931053111_96956 bad datanode[2]
> 10.1.188.182:50010
> 2009-03-02 09:55:16,009 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-1071965721931053111_96956 in pipeline
> 10.1.188.249:50010, 10.1.188.203:50010, 10.1.188.182:50010: bad datanode
> 10.1.188.182:50010
> 2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_1004039574836775403_96959java.io.IOException: Bad response 1 for block
> blk_1004039574836775403_96959 from datanode 10.1.188.182:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2342)
>
> 2009-03-02 09:55:16,073 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_1004039574836775403_96959 bad datanode[1]
> 10.1.188.182:50010
>
>
> Hadoop datanode:
>
> 2009-03-02 09:55:10,201 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_-5472632607337755080_96875 1 Exception java.io.EOFException
>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>        at java.io.DataInputStream.readLong(DataInputStream.java:399)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:833)
>        at java.lang.Thread.run(Thread.java:619)
>
> 2009-03-02 09:55:10,407 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block
> blk_-5472632607337755080_96875 terminating
> 2009-03-02 09:55:10,516 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> infoPort=50075, ipcPort=50020):Exception writing block
> blk_-5472632607337755080_96875 to mirror 10.1.188.182:50010
> java.io.IOException: Broken pipe
>        at sun.nio.ch.FileDispatcher.write0(Native Method)
>        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
>        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>        at
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
>        at java.lang.Thread.run(Thread.java:619)
>
> 2009-03-02 09:55:10,517 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
> for block blk_-5472632607337755080_96875 java.io.IOException: Broken pipe
> 2009-03-02 09:55:10,517 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_-5472632607337755080_96875 received exception java.io.IOException:
> Broken pipe
> 2009-03-02 09:55:10,517 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Broken pipe
>        at sun.nio.ch.FileDispatcher.write0(Native Method)
>        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
>        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>        at
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:391)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
>        at java.lang.Thread.run(Thread.java:619)
> 2009-03-02 09:55:11,174 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.1.188.249:49063, dest: /10.1.188.249:50010, bytes: 312, op: HDFS_WRITE,
> cliID: DFSClient_1091437257, srvID:
> DS-1180278657-127.0.0.1-50010-1235652659245, blockid:
> blk_5027345212081735473_96878
> 2009-03-02 09:55:11,177 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block
> blk_5027345212081735473_96878 terminating
> 2009-03-02 09:55:11,185 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_-3992843464553216223_96885 src: /10.1.188.249:49069 dest: /
> 10.1.188.249:50010
> 2009-03-02 09:55:11,186 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_-3132070329589136987_96885 src: /10.1.188.30:33316 dest: /
> 10.1.188.249:50010
> 2009-03-02 09:55:11,187 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
> for block blk_8782629414415941143_96845 java.io.IOException: Connection
> reset by peer
> 2009-03-02 09:55:11,187 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block
> blk_8782629414415941143_96845 Interrupted.
> 2009-03-02 09:55:11,187 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block
> blk_8782629414415941143_96845 terminating
> 2009-03-02 09:55:11,187 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_8782629414415941143_96845 received exception java.io.IOException:
> Connection reset by peer
> 2009-03-02 09:55:11,187 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.1.188.249:50010, storageID=DS-1180278657-127.0.0.1-50010-1235652659245,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Connection reset by peer
>        at sun.nio.ch.FileDispatcher.read0(Native Method)
>        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>        at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>        at
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
>        at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>        at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>        at java.io.DataInputStream.read(DataInputStream.java:132)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:251)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:298)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:362)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:514)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:356)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
>        at java.lang.Thread.run(Thread.java:619)
>        etc.............................



This looks like an HDFS issue where it won't move on past the bad server
182.  On the client side, these are reported as WARNs in the DFSClient but don't
make it up to the regionserver, so there is not much we can do about it.


> I have other exceptions related to DataXceiver problems. These errors
> don't make the region server go down, but I can see that I lost some
> records (about 3.10e6 out of 160.10e6).
>


Any regionserver crashes during your upload?  I'd think that is the more likely reason
for data loss; i.e. edits that were in memcache didn't make it out to the
filesystem because there is still no working flush in HDFS -- hopefully in
Hadoop 0.21... see HADOOP-4379 (though your scenario 2 above looks like we
could have handed HDFS the data and it dropped it anyway...).



>
> As you can see in my conf files, I up the dfs.datanode.max.xcievers to 8192
> as suggested from several mails.
> And my ulimit -n is at 32768.


Make sure the above is actually in place by looking at the head
of your regionserver log on startup.



> Do these problems come from my configuration, or my hardware ?
>


Let's do some more back and forth and make sure we have done all we can
as regards the software configuration.  Going by the above, it's probably not
hardware.

Tell us more about your uploading process and your schema.  Did it all load?
If so, how many regions across your 6 servers?  How did you verify how much was
loaded?

St.Ack