Posted to user@hadoop.apache.org by Akmal Abbasov <ak...@icloud.com> on 2015/09/02 18:11:55 UTC

High iowait in idle hbase cluster

Hi, 
I'm seeing strange behaviour in an hbase cluster. It is almost idle, with only <5 puts and gets. 
But the data in hdfs keeps increasing, and the region servers have very high iowait (>100% on a 2-core CPU).
iotop shows that the datanode process is reading and writing all the time.
Any suggestions?

Thanks.

Re: High iowait in idle hbase cluster

Posted by Akmal Abbasov <ak...@icloud.com>.
Which configs are used to tune the run frequency of the block scanner? Or what event triggers it to run?

Thanks.
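
For reference, in the Hadoop 2.5.x block scanner (pre-HDFS-7430) the cadence is set by a single hdfs-site.xml property; a minimal sketch, with the 3-week default shown (the scanner was rewritten in later releases, so verify the semantics against your exact version):

  <!-- hdfs-site.xml: target interval, in hours, at which the datanode
       block scanner re-verifies each stored block (504 h = 3 weeks) -->
  <property>
    <name>dfs.datanode.scan.period.hours</name>
    <value>504</value>
  </property>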

> On 07 Sep 2015, at 15:17, Ted Yu <yu...@gmail.com> wrote:
> 
> W.r.t. upgrading, this thread may be of interest to you:
> 
> http://search-hadoop.com/m/uOzYt48qItawnLv1
> 
> 
> 
> On Sep 7, 2015, at 5:15 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
> 
>> While looking into this problem, I found that I have large dncp_block_verification.log.curr and dncp_block_verification.log.prev files.
>> They are 294G each on the node which has the high iowait, even when the cluster was almost idle.
>> The other nodes have 0 for dncp_block_verification.log.curr and <15G for dncp_block_verification.log.prev.
>> So it looks like https://issues.apache.org/jira/browse/HDFS-6114
>> 
>> Thanks.
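
If this really is HDFS-6114, the commonly cited workaround is to stop the affected datanode, delete the runaway verification logs, and restart; a sketch, assuming a data dir of /hadoop/hdfs/data (substitute your actual dfs.datanode.data.dir):

  # with the datanode stopped:
  ls -lh /hadoop/hdfs/data/current/BP-*/dncp_block_verification.log.*
  rm /hadoop/hdfs/data/current/BP-*/dncp_block_verification.log.*
  # the files are recreated on restart, and should stay small on
  # releases that include the HDFS-6114 fix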
>> 
>>> On 04 Sep 2015, at 11:56, Adrien Mogenet <adrien.mogenet@contentsquare.com> wrote:
>>> 
>>> What is your disk configuration? JBOD? If RAID, possibly a dysfunctional RAID controller, or a constantly-rebuilding array.
>>> 
>>> Do you have any idea which files the read blocks belong to?
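
One way to answer that on this release is fsck's block listing; a sketch, using one of the block ids from the clienttrace lines quoted further down (newer releases also add an hdfs fsck -blockId shortcut):

  # list every file with its block ids, then look up one specific block
  hdfs fsck / -files -blocks | grep blk_1075349331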
>>> 
>>> On 4 September 2015 at 11:02, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>> Hi Adrien,
>>> for the last 24 hours all RS have been up and running. There were no region transitions.
>>> The overall cluster iowait has decreased, but 2 RS still have very high iowait, even though there is no load on the cluster.
>>> My assumption about the high number of HDFS_READ/HDFS_WRITE entries in the RS logs has failed, since all RS have an almost identical number
>>> of HDFS_READ/HDFS_WRITE, while only 2 of them have high iowait.
>>> According to iotop the process doing the most IO is the datanode, and it is reading constantly.
>>> Why would the datanode need to read from disk constantly?
>>> Any ideas?
>>> 
>>> Thanks.
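
To attribute the disk reads more precisely than iotop's interactive view, per-process sampling helps; a sketch (12345 stands in for the DataNode JVM's pid):

  # batch mode, only processes actually doing IO, 5-second samples
  iotop -boP -d 5
  # read/write rates for the datanode pid (from the sysstat package)
  pidstat -d -p 12345 5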
>>> 
>>>> On 03 Sep 2015, at 18:57, Adrien Mogenet <adrien.mogenet@contentsquare.com> wrote:
>>>> 
>>>> Is the uptime of the RS "normal"? No quick, global reboot that could lead to a region-reallocation storm?
>>>> 
>>>> On 3 September 2015 at 18:42, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>> Hi Adrien,
>>>> I've run hdfs fsck and hbase hbck: hdfs is healthy and hbase is consistent.
>>>> I'm using the default replication value, so it is 3.
>>>> There are some under-replicated blocks.
>>>> The HBase master (node 10.10.8.55) is reading constantly from the regionservers. Just today it has sent >150,000 HDFS_READ requests to each regionserver so far, while the hbase cluster is almost idle.
>>>> What could cause this kind of behaviour?
>>>> 
>>>> p.s. each node in the cluster has 2 cores and 4 GB of RAM, just in case.
>>>> 
>>>> Thanks.
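
For reference, the health checks mentioned above; the under-replicated block count appears in the fsck summary:

  hdfs fsck /        # summary lists total blocks, under-replicated blocks, ...
  hbase hbck         # consistency check; add -details for per-region output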
>>>> 
>>>> 
>>>>> On 03 Sep 2015, at 17:46, Adrien Mogenet <adrien.mogenet@contentsquare.com> wrote:
>>>>> 
>>>>> Is your HDFS healthy (fsck /)?
>>>>> 
>>>>> Same for hbase hbck?
>>>>> 
>>>>> What's your replication level?
>>>>> 
>>>>> Can you see constant network use as well?
>>>>> 
>>>>> Anything that might be triggered by the hbase master? (something like a virtually dead RS due to a ZK race condition, etc.)
>>>>> 
>>>>> Your balancer run from 3 weeks ago shouldn't have any effect if you successfully ran a major compaction yesterday.
>>>>> 
>>>>> On 3 September 2015 at 16:32, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>>> I started the HDFS balancer, but then stopped it immediately after learning that it is not a good idea.
>>>>> That was around 3 weeks ago; is it possible that it influenced the cluster behaviour I'm seeing now?
>>>>> Thanks.
>>>>> 
>>>>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>>>> 
>>>>>> Hi Ted,
>>>>>> No, there is no short-circuit read configured.
>>>>>> The datanode logs on 10.10.8.55 are full of the following messages:
>>>>>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>>>>>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>>>>>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>>>>>> There are >100,000 of them just for today. The situation on the other regionservers is similar.
>>>>>> Node 10.10.8.53 is the hbase-master node, and the process on that port is also the hbase master.
>>>>>> So if there is no load on the cluster, why is so much IO happening?
>>>>>> Any thoughts?
>>>>>> Thanks.
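
Rough attribution of those clienttrace lines is a one-liner; a sketch, with the log filename as a placeholder for the real datanode log:

  # count today's HDFS_READ entries, then group them by destination host
  grep 'op: HDFS_READ' hadoop-hdfs-datanode.log | grep -c '^2015-09-03'
  grep 'op: HDFS_READ' hadoop-hdfs-datanode.log \
    | sed 's/.*dest: \/\([0-9.]*\):.*/\1/' | sort | uniq -c | sort -rn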
>>>>>> 
>>>>>>> On 02 Sep 2015, at 21:57, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>> 
>>>>>>> I assume you have enabled short-circuit read.
>>>>>>> 
>>>>>>> Can you capture region server stack trace(s) and pastebin them?
>>>>>>> 
>>>>>>> Thanks
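
For context, short-circuit read is what the hdfs-site.xml settings below enable on the HDFS client (i.e. RegionServer) side; the socket path is an assumed example. Stack traces can be captured with jstack:

  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
  </property>

  jstack <regionserver-pid> > rs-stack.txt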
>>>>>>> 
>>>>>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>>>>> Hi Ted,
>>>>>>> I've checked the time when the addresses were changed, and this strange behaviour started weeks before that.
>>>>>>> 
>>>>>>> yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>>>>>>> any thoughts?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> 
>>>>>>>> On 02 Sep 2015, at 18:45, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> bq. change the ip addresses of the cluster nodes
>>>>>>>> 
>>>>>>>> Did this happen recently? If high iowait was observed after the change (you can look at the ganglia graph), there is a chance that the change was related.
>>>>>>>> 
>>>>>>>> BTW I assume 10.10.8.55 is where your region server resides.
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> 
>>>>>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>>>>>> Hi Ted,
>>>>>>>> sorry, forgot to mention:
>>>>>>>> 
>>>>>>>>> release of hbase / hadoop you're using
>>>>>>>> 
>>>>>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>>>>>> 
>>>>>>>>> were region servers doing compaction ?
>>>>>>>> 
>>>>>>>> I've run major compactions manually earlier today, but it seems they have already completed, judging by the compactionQueueSize.
>>>>>>>> 
>>>>>>>>> have you checked region server logs ?
>>>>>>>> The datanode log is full of this kind of message:
>>>>>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>>>>>> 
>>>>>>>> p.s. we had to change the ip addresses of the cluster nodes; is that relevant?
>>>>>>>> 
>>>>>>>> Thanks.
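
The shell commands behind that check; 'mytable' and rs-host are placeholders (60030 is the default regionserver info port on 0.98):

  # from the hbase shell
  major_compact 'mytable'

  # compactionQueueSize is exposed on the regionserver's /jmx endpoint
  curl -s http://rs-host:60030/jmx | grep -i compactionQueueSize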
>>>>>>>> 
>>>>>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Please provide some more information:
>>>>>>>>> 
>>>>>>>>> release of hbase / hadoop you're using
>>>>>>>>> were region servers doing compaction ?
>>>>>>>>> have you checked region server logs ?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>> I'm seeing strange behaviour in an hbase cluster. It is almost idle, with only <5 puts and gets.
>>>>>>>>> But the data in hdfs keeps increasing, and the region servers have very high iowait (>100% on a 2-core CPU).
>>>>>>>>> iotop shows that the datanode process is reading and writing all the time.
>>>>>>>>> Any suggestions?
>>>>>>>>> 
>>>>>>>>> Thanks.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> 
>>>>> Adrien Mogenet
>>>>> Head of Backend/Infrastructure
>>>>> adrien.mogenet@contentsquare.com
>>>>> (+33)6.59.16.64.22
>>>>> http://www.contentsquare.com
>>>>> 50, avenue Montaigne - 75008 Paris
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> 
>>>> Adrien Mogenet
>>>> Head of Backend/Infrastructure
>>>> adrien.mogenet@contentsquare.com
>>>> (+33)6.59.16.64.22
>>>> http://www.contentsquare.com
>>>> 50, avenue Montaigne - 75008 Paris
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> 
>>> Adrien Mogenet
>>> Head of Backend/Infrastructure
>>> adrien.mogenet@contentsquare.com
>>> (+33)6.59.16.64.22
>>> http://www.contentsquare.com
>>> 50, avenue Montaigne - 75008 Paris
>> 


Re: High iowait in idle hbase cluster

Posted by Ted Yu <yu...@gmail.com>.
W.r.t. upgrading, this thread may be of interest to you:

http://search-hadoop.com/m/uOzYt48qItawnLv1



> On Sep 7, 2015, at 5:15 AM, Akmal Abbasov <ak...@icloud.com> wrote:
> 
> While looking into this problem, I found that I have large dncp_block_verification.log.curr and dncp_block_verification.log.prev files.
> They are 294G each on the node which has the high iowait, even when the cluster was almost idle.
> The other nodes have 0 for dncp_block_verification.log.curr and <15G for dncp_block_verification.log.prev.
> So it looks like https://issues.apache.org/jira/browse/HDFS-6114
> 
> Thanks.
> 
>> On 04 Sep 2015, at 11:56, Adrien Mogenet <ad...@contentsquare.com> wrote:
>> 
>> What is your disk configuration? JBOD? If RAID, possibly a dysfunctional RAID controller, or a constantly-rebuilding array.
>> 
>> Do you have any idea which files the read blocks belong to?
>> 
>>> On 4 September 2015 at 11:02, Akmal Abbasov <ak...@icloud.com> wrote:
>>> Hi Adrien,
>>> for the last 24 hours all RS have been up and running. There were no region transitions.
>>> The overall cluster iowait has decreased, but 2 RS still have very high iowait, even though there is no load on the cluster.
>>> My assumption about the high number of HDFS_READ/HDFS_WRITE entries in the RS logs has failed, since all RS have an almost identical number
>>> of HDFS_READ/HDFS_WRITE, while only 2 of them have high iowait.
>>> According to iotop the process doing the most IO is the datanode, and it is reading constantly.
>>> Why would the datanode need to read from disk constantly?
>>> Any ideas?
>>> 
>>> Thanks.
>>> 
>>>> On 03 Sep 2015, at 18:57, Adrien Mogenet <ad...@contentsquare.com> wrote:
>>>> 
>>>> Is the uptime of the RS "normal"? No quick, global reboot that could lead to a region-reallocation storm?
>>>> 
>>>> On 3 September 2015 at 18:42, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>> Hi Adrien,
>>>>> I've run hdfs fsck and hbase hbck: hdfs is healthy and hbase is consistent.
>>>>> I'm using the default replication value, so it is 3.
>>>>> There are some under-replicated blocks.
>>>>> The HBase master (node 10.10.8.55) is reading constantly from the regionservers. Just today it has sent >150,000 HDFS_READ requests to each regionserver so far, while the hbase cluster is almost idle.
>>>>> What could cause this kind of behaviour?
>>>>> 
>>>>> p.s. each node in the cluster has 2 cores and 4 GB of RAM, just in case.
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> 
>>>>>> On 03 Sep 2015, at 17:46, Adrien Mogenet <ad...@contentsquare.com> wrote:
>>>>>> 
>>>>>> Is your HDFS healthy (fsck /)?
>>>>>> 
>>>>>> Same for hbase hbck?
>>>>>> 
>>>>>> What's your replication level?
>>>>>> 
>>>>>> Can you see constant network use as well?
>>>>>> 
>>>>>> Anything that might be triggered by the hbase master? (something like a virtually dead RS due to a ZK race condition, etc.)
>>>>>> 
>>>>>> Your balancer run from 3 weeks ago shouldn't have any effect if you successfully ran a major compaction yesterday.
>>>>>> 
>>>>>>> On 3 September 2015 at 16:32, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>> I started the HDFS balancer, but then stopped it immediately after learning that it is not a good idea.
>>>>>>> That was around 3 weeks ago; is it possible that it influenced the cluster behaviour I'm seeing now?
>>>>>>> Thanks.
>>>>>>> 
>>>>>>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Ted,
>>>>>>>> No, there is no short-circuit read configured.
>>>>>>>> The datanode logs on 10.10.8.55 are full of the following messages:
>>>>>>>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>>>>>>>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>>>>>>>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>>>>>>>> There are >100,000 of them just for today. The situation on the other regionservers is similar.
>>>>>>>> Node 10.10.8.53 is the hbase-master node, and the process on that port is also the hbase master.
>>>>>>>> So if there is no load on the cluster, why is so much IO happening?
>>>>>>>> Any thoughts?
>>>>>>>> Thanks.
>>>>>>>> 
>>>>>>>>> On 02 Sep 2015, at 21:57, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> I assume you have enabled short-circuit read.
>>>>>>>>> 
>>>>>>>>> Can you capture region server stack trace(s) and pastebin them?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>>>>> Hi Ted,
>>>>>>>>>> I've checked the time when the addresses were changed, and this strange behaviour started weeks before that.
>>>>>>>>>> 
>>>>>>>>>> yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>>>>>>>>>> any thoughts?
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> 
>>>>>>>>>>> On 02 Sep 2015, at 18:45, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> bq. change the ip addresses of the cluster nodes
>>>>>>>>>>> 
>>>>>>>>>>> Did this happen recently? If high iowait was observed after the change (you can look at the ganglia graph), there is a chance that the change was related.
>>>>>>>>>>> 
>>>>>>>>>>> BTW I assume 10.10.8.55 is where your region server resides.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers
>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>>>>>>> Hi Ted,
>>>>>>>>>>>> sorry, forgot to mention:
>>>>>>>>>>>> 
>>>>>>>>>>>>> release of hbase / hadoop you're using
>>>>>>>>>>>> 
>>>>>>>>>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>>>>>>>>>> 
>>>>>>>>>>>>> were region servers doing compaction ?
>>>>>>>>>>>> 
>>>>>>>>>>>> I've run major compactions manually earlier today, but it seems they have already completed, judging by the compactionQueueSize.
>>>>>>>>>>>> 
>>>>>>>>>>>>> have you checked region server logs ?
>>>>>>>>>>>> The datanode log is full of this kind of message:
>>>>>>>>>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>>>>>>>>>> 
>>>>>>>>>>>> p.s. we had to change the ip addresses of the cluster nodes; is that relevant?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Please provide some more information:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> release of hbase / hadoop you're using
>>>>>>>>>>>>> were region servers doing compaction ?
>>>>>>>>>>>>> have you checked region server logs ?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> I'm seeing strange behaviour in an hbase cluster. It is almost idle, with only <5 puts and gets.
>>>>>>>>>>>>>> But the data in hdfs keeps increasing, and the region servers have very high iowait (>100% on a 2-core CPU).
>>>>>>>>>>>>>> iotop shows that the datanode process is reading and writing all the time.
>>>>>>>>>>>>>> Any suggestions?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> 
>>>>>> Adrien Mogenet
>>>>>> Head of Backend/Infrastructure
>>>>>> adrien.mogenet@contentsquare.com
>>>>>> (+33)6.59.16.64.22
>>>>>> http://www.contentsquare.com
>>>>>> 50, avenue Montaigne - 75008 Paris
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> 
>>>> Adrien Mogenet
>>>> Head of Backend/Infrastructure
>>>> adrien.mogenet@contentsquare.com
>>>> (+33)6.59.16.64.22
>>>> http://www.contentsquare.com
>>>> 50, avenue Montaigne - 75008 Paris
>> 
>> 
>> 
>> -- 
>> 
>> Adrien Mogenet
>> Head of Backend/Infrastructure
>> adrien.mogenet@contentsquare.com
>> (+33)6.59.16.64.22
>> http://www.contentsquare.com
>> 50, avenue Montaigne - 75008 Paris
> 

Re: High iowait in idle hbase cluster

Posted by Ted Yu <yu...@gmail.com>.
W.r.t. Upgrade, this thread may be of interest to you:

http://search-hadoop.com/m/uOzYt48qItawnLv1



> On Sep 7, 2015, at 5:15 AM, Akmal Abbasov <ak...@icloud.com> wrote:
> 
> While looking into this problem, I found that I have large dncp_block_verification.log.curr and dncp_block_verification.log.prev files.
> They are 294G each in the node which has high IOWAIT, even when the cluster was almost idle.
> While the others have 0 for dncp_block_verification.log.curr, and <15G for dncp_block_verification.log.prev.
> So it looks like https://issues.apache.org/jira/browse/HDFS-6114
> 
> Thanks.
> 
>> On 04 Sep 2015, at 11:56, Adrien Mogenet <ad...@contentsquare.com> wrote:
>> 
>> What is your disk configuration? JBOD? If RAID, possibly a dysfunctional RAID controller, or a constantly-rebuilding array.
>> 
>> Do you have any idea at which files are linked the read blocks?
>> 
>>> On 4 September 2015 at 11:02, Akmal Abbasov <ak...@icloud.com> wrote:
>>> Hi Adrien,
>>> for the last 24 hours all RS are up and running. There was no region transitions.
>>> The overall cluster iowait has decreased, but still 2 RS have very high iowait, while there is no load on the cluster.
>>> My assumption with the hight number of HDFS_READ/HDFS_WRITE in RS logs have failed, since all RS have almost identical number
>>> of HDFS_READ/HDFS_WRITE, while only 2 of them has high iowait.
>>> According to iotop the process which is doing most IO is datanode, and it is reading constantly.
>>> Why datanode could require reading from disk constantly?
>>> Any ideas?
>>> 
>>> Thanks.
>>> 
>>>> On 03 Sep 2015, at 18:57, Adrien Mogenet <ad...@contentsquare.com> wrote:
>>>> 
>>>> Is the uptime of RS "normal"? No quick and global reboot that could lead into a regiongi-reallocation-storm?
>>>> 
>>>> On 3 September 2015 at 18:42, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>> Hi Adrien,
>>>>> I’ve tried to run hdfs fsck and hbase hbck, and hdfs is healthy, also hbase is consistent.
>>>>> I’m using default value of the replication, so it is 3.
>>>>> There are some under replicated 
>>>>> HBase master(node 10.10.8.55) is reading constantly from regionservers. Only today, it send >150.000 HDFS_READ requests to each regionserver so far, while the hbase cluster is almost idle.
>>>>> What could cause this kind of behaviour?
>>>>> 
>>>>> p.s. each node in the cluster have 2 core, 4 gb ram, just in case.
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> 
>>>>>> On 03 Sep 2015, at 17:46, Adrien Mogenet <ad...@contentsquare.com> wrote:
>>>>>> 
>>>>>> Is your HDFS healthy (fsck /)?
>>>>>> 
>>>>>> Same for hbase hbck?
>>>>>> 
>>>>>> What's your replication level?
>>>>>> 
>>>>>> Can you see constant network use as well?
>>>>>> 
>>>>>> Anything than might be triggered by the hbasemaster? (something like a virtually dead RS, due to ZK race-condition, etc.)
>>>>>> 
>>>>>> Your 3-weeks-ago balancer shouldn't have any effect if you've ran a major compaction, successfully, yesterday.
>>>>>> 
>>>>>>> On 3 September 2015 at 16:32, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>> I’ve started HDFS balancer, but then stopped it immediately after knowing that it is not a good idea.
>>>>>>> but it was around 3 weeks ago, is it possible that it had an influence on the cluster behaviour I’m having now?
>>>>>>> Thanks.
>>>>>>> 
>>>>>>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Ted,
>>>>>>>> No there is no short-circuit read configured.
>>>>>>>> The logs of datanode of the 10.10.8.55 are full of following messages
>>>>>>>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>>>>>>>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>>>>>>>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>>>>>>>> There are >100.000 of them just for today. The situation with other regionservers are similar.
>>>>>>>> Node 10.10.8.53 is hbase-master node, and the process on the port is also hbase-master.
>>>>>>>> So if there is no load on the cluster, why there are so much IO happening?
>>>>>>>> Any thoughts.
>>>>>>>> Thanks.
>>>>>>>> 
>>>>>>>>> On 02 Sep 2015, at 21:57, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> I assume you have enabled short-circuit read.
>>>>>>>>> 
>>>>>>>>> Can you capture region server stack trace(s) and pastebin them ?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>>>>> Hi Ted,
>>>>>>>>>> I’ve checked the time when addresses were changed, and this strange behaviour started weeks before it.
>>>>>>>>>> 
>>>>>>>>>> yes, 10.10.8.55 is region server and 10.10.8.54 is a hbase master.
>>>>>>>>>> any thoughts?
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> 
>>>>>>>>>>> On 02 Sep 2015, at 18:45, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> bq. change the ip addresses of the cluster nodes
>>>>>>>>>>> 
>>>>>>>>>>> Did this happen recently ? If high iowait was observed after the change (you can look at ganglia graph), there is a chance that the change was related.
>>>>>>>>>>> 
>>>>>>>>>>> BTW I assume 10.10.8.55 is where your region server resides.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers
>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>>>>>>> Hi Ted,
>>>>>>>>>>>> sorry forget to mention
>>>>>>>>>>>> 
>>>>>>>>>>>>> release of hbase / hadoop you're using
>>>>>>>>>>>> 
>>>>>>>>>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>>>>>>>>>> 
>>>>>>>>>>>>> were region servers doing compaction ?
>>>>>>>>>>>> 
>>>>>>>>>>>> I’ve run major compactions manually earlier today, but it seems that they already completed, looking at the compactionQueueSize.
>>>>>>>>>>>> 
>>>>>>>>>>>>> have you checked region server logs ?
>>>>>>>>>>>> The logs of datanode is full of this kind of messages
>>>>>>>>>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>>>>>>>>>> 
>>>>>>>>>>>> p.s. we had to change the ip addresses of the cluster nodes, is it relevant?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Please provide some more information:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> release of hbase / hadoop you're using
>>>>>>>>>>>>> were region servers doing compaction ?
>>>>>>>>>>>>> have you checked region server logs ?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> I’m having strange behaviour in hbase cluster. It is almost idle, only <5 puts and gets.
>>>>>>>>>>>>>> But the data in hdfs is increasing, and region servers have very high iowait(>100, in 2 core CPU).
>>>>>>>>>>>>>> iotop shows that datanode process is reading and writing all the time.
>>>>>>>>>>>>>> Any suggestions?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> 
>>>>>> Adrien Mogenet
>>>>>> Head of Backend/Infrastructure
>>>>>> adrien.mogenet@contentsquare.com
>>>>>> (+33)6.59.16.64.22
>>>>>> http://www.contentsquare.com
>>>>>> 50, avenue Montaigne - 75008 Paris
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> 
>>>> Adrien Mogenet
>>>> Head of Backend/Infrastructure
>>>> adrien.mogenet@contentsquare.com
>>>> (+33)6.59.16.64.22
>>>> http://www.contentsquare.com
>>>> 50, avenue Montaigne - 75008 Paris
>> 
>> 
>> 
>> -- 
>> 
>> Adrien Mogenet
>> Head of Backend/Infrastructure
>> adrien.mogenet@contentsquare.com
>> (+33)6.59.16.64.22
>> http://www.contentsquare.com
>> 50, avenue Montaigne - 75008 Paris
> 

Re: High iowait in idle hbase cluster

Posted by Ted Yu <yu...@gmail.com>.
W.r.t. Upgrade, this thread may be of interest to you:

http://search-hadoop.com/m/uOzYt48qItawnLv1



> On Sep 7, 2015, at 5:15 AM, Akmal Abbasov <ak...@icloud.com> wrote:
> 
> While looking into this problem, I found that I have large dncp_block_verification.log.curr and dncp_block_verification.log.prev files.
> They are 294G each on the node which has high IOWAIT, even though the cluster was almost idle.
> The others have 0 for dncp_block_verification.log.curr, and <15G for dncp_block_verification.log.prev.
> So it looks like https://issues.apache.org/jira/browse/HDFS-6114
> 
> Thanks.
> 
>> On 04 Sep 2015, at 11:56, Adrien Mogenet <ad...@contentsquare.com> wrote:
>> 
>> What is your disk configuration? JBOD? If RAID, possibly a dysfunctional RAID controller, or a constantly-rebuilding array.
>> 
>> Do you have any idea which files the read blocks belong to?
>> 
>>> On 4 September 2015 at 11:02, Akmal Abbasov <ak...@icloud.com> wrote:
>>> Hi Adrien,
>>> for the last 24 hours all RS have been up and running. There were no region transitions.
>>> The overall cluster iowait has decreased, but 2 RS still have very high iowait, while there is no load on the cluster.
>>> My assumption about the high number of HDFS_READ/HDFS_WRITE in RS logs has failed, since all RS have an almost identical number
>>> of HDFS_READ/HDFS_WRITE, while only 2 of them have high iowait.
>>> According to iotop, the process doing the most IO is the datanode, and it is reading constantly.
>>> Why would the datanode need to read from disk constantly?
>>> Any ideas?
>>> 
>>> Thanks.
>>> 
>>>> On 03 Sep 2015, at 18:57, Adrien Mogenet <ad...@contentsquare.com> wrote:
>>>> 
>>>> Is the uptime of the RS "normal"? No quick global reboot that could lead to a region-reallocation storm?
>>>> 
>>>> On 3 September 2015 at 18:42, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>> Hi Adrien,
>>>>> I’ve run hdfs fsck and hbase hbck; hdfs is healthy and hbase is consistent.
>>>>> I’m using the default replication value, so it is 3.
>>>>> There are some under-replicated blocks.
>>>>> The HBase master (node 10.10.8.55) is reading constantly from the regionservers. Today alone it has sent >150,000 HDFS_READ requests to each regionserver, while the hbase cluster is almost idle.
>>>>> What could cause this kind of behaviour?
>>>>> 
>>>>> p.s. each node in the cluster has 2 cores, 4 GB RAM, just in case.
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> 
>>>>>> On 03 Sep 2015, at 17:46, Adrien Mogenet <ad...@contentsquare.com> wrote:
>>>>>> 
>>>>>> Is your HDFS healthy (fsck /)?
>>>>>> 
>>>>>> Same for hbase hbck?
>>>>>> 
>>>>>> What's your replication level?
>>>>>> 
>>>>>> Can you see constant network use as well?
>>>>>> 
>>>>>> Anything that might be triggered by the hbase master? (something like a virtually dead RS, due to a ZK race condition, etc.)
>>>>>> 
>>>>>> Your balancer run from 3 weeks ago shouldn't have any effect if you ran a major compaction successfully yesterday.
>>>>>> 
>>>>>>> On 3 September 2015 at 16:32, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>> I started the HDFS balancer, but then stopped it immediately after learning that it is not a good idea.
>>>>>>> But that was around 3 weeks ago; is it possible that it influenced the cluster behaviour I’m seeing now?
>>>>>>> Thanks.
>>>>>>> 
>>>>>>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Ted,
>>>>>>>> No, there is no short-circuit read configured.
>>>>>>>> The datanode logs on 10.10.8.55 are full of the following messages
>>>>>>>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>>>>>>>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>>>>>>>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>>>>>>>> There are >100,000 of them just for today. The situation with the other regionservers is similar.
>>>>>>>> Node 10.10.8.53 is the hbase-master node, and the process on the port is also hbase-master.
>>>>>>>> So if there is no load on the cluster, why is there so much IO happening?
>>>>>>>> Any thoughts?
>>>>>>>> Thanks.
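
One quick way to see who is generating those reads is to aggregate the clienttrace lines by reader host. A rough sketch (the log file name below is an assumption, adjust it to your setup):

  grep 'op: HDFS_READ' hadoop-hdfs-datanode-*.log \
    | awk -F'dest: ' '{ split($2, a, ",")         # a[1] = /ip:port of the reader
                        sub(/:[0-9]+$/, "", a[1]) # drop the ephemeral port
                        reads[a[1]]++ }
                      END { for (h in reads) print reads[h], h }' \
    | sort -rn

If a single client dominates the counts, that is the process to capture stack traces from.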
>>>>>>>> 
>>>>>>>>> On 02 Sep 2015, at 21:57, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> I assume you have enabled short-circuit read.
>>>>>>>>> 
>>>>>>>>> Can you capture region server stack trace(s) and pastebin them ?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>>>>> Hi Ted,
>>>>>>>>>> I’ve checked the time when addresses were changed, and this strange behaviour started weeks before it.
>>>>>>>>>> 
>>>>>>>>>> yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>>>>>>>>>> any thoughts?
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> 
>>>>>>>>>>> On 02 Sep 2015, at 18:45, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> bq. change the ip addresses of the cluster nodes
>>>>>>>>>>> 
>>>>>>>>>>> Did this happen recently ? If high iowait was observed after the change (you can look at ganglia graph), there is a chance that the change was related.
>>>>>>>>>>> 
>>>>>>>>>>> BTW I assume 10.10.8.55 is where your region server resides.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers
>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>>>>>>> Hi Ted,
>>>>>>>>>>>> sorry, forgot to mention
>>>>>>>>>>>> 
>>>>>>>>>>>>> release of hbase / hadoop you're using
>>>>>>>>>>>> 
>>>>>>>>>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>>>>>>>>>> 
>>>>>>>>>>>>> were region servers doing compaction ?
>>>>>>>>>>>> 
>>>>>>>>>>>> I’ve run major compactions manually earlier today, but it seems that they already completed, looking at the compactionQueueSize.
>>>>>>>>>>>> 
>>>>>>>>>>>>> have you checked region server logs ?
>>>>>>>>>>>> The datanode logs are full of this kind of message
>>>>>>>>>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>>>>>>>>>> 
>>>>>>>>>>>> p.s. we had to change the ip addresses of the cluster nodes, is it relevant?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yu...@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Please provide some more information:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> release of hbase / hadoop you're using
>>>>>>>>>>>>> were region servers doing compaction ?
>>>>>>>>>>>>> have you checked region server logs ?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <ak...@icloud.com> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> I’m seeing strange behaviour in the hbase cluster. It is almost idle, only <5 puts and gets.
>>>>>>>>>>>>>> But the data in hdfs is increasing, and the region servers have very high iowait (>100, on a 2-core CPU).
>>>>>>>>>>>>>> iotop shows that the datanode process is reading and writing all the time.
>>>>>>>>>>>>>> Any suggestions?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> 
>>>>>> Adrien Mogenet
>>>>>> Head of Backend/Infrastructure
>>>>>> adrien.mogenet@contentsquare.com
>>>>>> (+33)6.59.16.64.22
>>>>>> http://www.contentsquare.com
>>>>>> 50, avenue Montaigne - 75008 Paris
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> 
>>>> Adrien Mogenet
>>>> Head of Backend/Infrastructure
>>>> adrien.mogenet@contentsquare.com
>>>> (+33)6.59.16.64.22
>>>> http://www.contentsquare.com
>>>> 50, avenue Montaigne - 75008 Paris
>> 
>> 
>> 
>> -- 
>> 
>> Adrien Mogenet
>> Head of Backend/Infrastructure
>> adrien.mogenet@contentsquare.com
>> (+33)6.59.16.64.22
>> http://www.contentsquare.com
>> 50, avenue Montaigne - 75008 Paris
> 

Re: High iowait in idle hbase cluster

Posted by Akmal Abbasov <ak...@icloud.com>.
While looking into this problem, I found that I have large dncp_block_verification.log.curr and dncp_block_verification.log.prev files.
They are 294G each on the node which has high IOWAIT, even though the cluster was almost idle.
The others have 0 for dncp_block_verification.log.curr, and <15G for dncp_block_verification.log.prev.
So it looks like https://issues.apache.org/jira/browse/HDFS-6114
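
For anyone else hitting this: the files belong to the datanode block scanner, and the workaround usually suggested for HDFS-6114 is to remove them while the datanode is down. A rough sketch (the data dir path is an assumption, take yours from dfs.datanode.data.dir in hdfs-site.xml):

  # find the oversized verification logs under the block pool directories
  find /data/hdfs/dn -name 'dncp_block_verification.log.*' -exec ls -lh {} \;

  # HDFS-6114 workaround: stop the datanode, drop the logs, start it again
  hadoop-daemon.sh stop datanode
  rm /data/hdfs/dn/current/BP-*/dncp_block_verification.log.*
  hadoop-daemon.sh start datanode

dfs.datanode.scan.period.hours in hdfs-site.xml controls how often the scanner re-verifies each block (the default is 3 weeks); the unbounded growth of these logs is the bug HDFS-6114 describes.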

Thanks.

> On 04 Sep 2015, at 11:56, Adrien Mogenet <ad...@contentsquare.com> wrote:
> 
> What is your disk configuration? JBOD? If RAID, possibly a dysfunctional RAID controller, or a constantly-rebuilding array.
> 
> Do you have any idea which files the read blocks belong to?
> 
> On 4 September 2015 at 11:02, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
> Hi Adrien,
> for the last 24 hours all RS have been up and running. There were no region transitions.
> The overall cluster iowait has decreased, but 2 RS still have very high iowait, while there is no load on the cluster.
> My assumption about the high number of HDFS_READ/HDFS_WRITE in RS logs has failed, since all RS have an almost identical number
> of HDFS_READ/HDFS_WRITE, while only 2 of them have high iowait.
> According to iotop, the process doing the most IO is the datanode, and it is reading constantly.
> Why would the datanode need to read from disk constantly?
> Any ideas?
> 
> Thanks.
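
A quick way to confirm on a suspect node that the block scanner is what keeps reading is to look at the datanode’s threads (the pgrep pattern is an assumption, adjust it to how your datanode is started):

  # dump the datanode's threads and look for the block scanner
  jstack $(pgrep -f org.apache.hadoop.hdfs.server.datanode.DataNode) \
    | grep -i -A 4 scanner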
> 
>> On 03 Sep 2015, at 18:57, Adrien Mogenet <adrien.mogenet@contentsquare.com> wrote:
>> 
>> Is the uptime of the RS "normal"? No quick global reboot that could lead to a region-reallocation storm?
>> 
>> On 3 September 2015 at 18:42, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>> Hi Adrien,
>> I’ve run hdfs fsck and hbase hbck; hdfs is healthy and hbase is consistent.
>> I’m using the default replication value, so it is 3.
>> There are some under-replicated blocks.
>> The HBase master (node 10.10.8.55) is reading constantly from the regionservers. Today alone it has sent >150,000 HDFS_READ requests to each regionserver, while the hbase cluster is almost idle.
>> What could cause this kind of behaviour?
>> 
>> p.s. each node in the cluster has 2 cores, 4 GB RAM, just in case.
>> 
>> Thanks.
>> 
>> 
>>> On 03 Sep 2015, at 17:46, Adrien Mogenet <adrien.mogenet@contentsquare.com> wrote:
>>> 
>>> Is your HDFS healthy (fsck /)?
>>> 
>>> Same for hbase hbck?
>>> 
>>> What's your replication level?
>>> 
>>> Can you see constant network use as well?
>>> 
>>> Anything that might be triggered by the hbase master? (something like a virtually dead RS, due to a ZK race condition, etc.)
>>> 
>>> Your balancer run from 3 weeks ago shouldn't have any effect if you ran a major compaction successfully yesterday.
>>> 
>>> On 3 September 2015 at 16:32, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>> I started the HDFS balancer, but then stopped it immediately after learning that it is not a good idea.
>>> But that was around 3 weeks ago; is it possible that it influenced the cluster behaviour I’m seeing now?
>>> Thanks.
>>> 
>>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>> 
>>>> Hi Ted,
>>>> No, there is no short-circuit read configured.
>>>> The datanode logs on 10.10.8.55 are full of the following messages
>>>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>>>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>>>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>>>> There are >100,000 of them just for today. The situation with the other regionservers is similar.
>>>> Node 10.10.8.53 is the hbase-master node, and the process on the port is also hbase-master.
>>>> So if there is no load on the cluster, why is there so much IO happening?
>>>> Any thoughts?
>>>> Thanks.
>>>> 
>>>>> On 02 Sep 2015, at 21:57, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>> 
>>>>> I assume you have enabled short-circuit read.
>>>>> 
>>>>> Can you capture region server stack trace(s) and pastebin them ?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>>> Hi Ted,
>>>>> I’ve checked the time when addresses were changed, and this strange behaviour started weeks before it.
>>>>> 
>>>>> yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>>>>> any thoughts?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>>> On 02 Sep 2015, at 18:45, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>> 
>>>>>> bq. change the ip addresses of the cluster nodes
>>>>>> 
>>>>>> Did this happen recently ? If high iowait was observed after the change (you can look at ganglia graph), there is a chance that the change was related.
>>>>>> 
>>>>>> BTW I assume 10.10.8.55 is where your region server resides.
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>>>> Hi Ted,
>>>>>> sorry, forgot to mention
>>>>>> 
>>>>>>> release of hbase / hadoop you're using
>>>>>> 
>>>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>>>> 
>>>>>>> were region servers doing compaction ?
>>>>>> 
>>>>>> I’ve run major compactions manually earlier today, but it seems that they already completed, looking at the compactionQueueSize.
>>>>>> 
>>>>>>> have you checked region server logs ?
>>>>>> The datanode logs are full of this kind of message
>>>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>>>> 
>>>>>> p.s. we had to change the ip addresses of the cluster nodes, is it relevant?
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>> 
>>>>>>> Please provide some more information:
>>>>>>> 
>>>>>>> release of hbase / hadoop you're using
>>>>>>> were region servers doing compaction ?
>>>>>>> have you checked region server logs ?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> 
>>>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>>>>> Hi,
>>>>>>> I’m seeing strange behaviour in the hbase cluster. It is almost idle, only <5 puts and gets.
>>>>>>> But the data in hdfs is increasing, and the region servers have very high iowait (>100, on a 2-core CPU).
>>>>>>> iotop shows that the datanode process is reading and writing all the time.
>>>>>>> Any suggestions?
>>>>>>> 
>>>>>>> Thanks.
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> 
>>> Adrien Mogenet
>>> Head of Backend/Infrastructure
>>> adrien.mogenet@contentsquare.com
>>> (+33)6.59.16.64.22
>>> http://www.contentsquare.com
>>> 50, avenue Montaigne - 75008 Paris
>> 
>> 
>> 
>> 
>> -- 
>> 
>> Adrien Mogenet
>> Head of Backend/Infrastructure
>> adrien.mogenet@contentsquare.com
>> (+33)6.59.16.64.22
>> http://www.contentsquare.com
>> 50, avenue Montaigne - 75008 Paris
> 
> 
> 
> 
> -- 
> 
> Adrien Mogenet
> Head of Backend/Infrastructure
> adrien.mogenet@contentsquare.com
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris


Re: High iowait in idle hbase cluster

Posted by Adrien Mogenet <ad...@contentsquare.com>.
What is your disk configuration? JBOD? If RAID, possibly a dysfunctional
RAID controller, or a constantly-rebuilding array.

Do you have any idea which files the read blocks belong to?
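
You can usually map a block id from those clienttrace lines back to its file with fsck; a rough sketch (the block id is taken from the logs earlier in this thread, and fsck walks the whole namespace, so scope the path if you can):

  # list every file with its blocks once, then look a block up
  hdfs fsck / -files -blocks > /tmp/fsck-blocks.txt
  awk '/^\// { file = $1 }              # remember the current file path
       /blk_1075349331/ { print file }  # block id from the datanode log
      ' /tmp/fsck-blocks.txt

Newer releases also have hdfs fsck -blockId for this.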

On 4 September 2015 at 11:02, Akmal Abbasov <ak...@icloud.com>
wrote:

> Hi Adrien,
> for the last 24 hours all RS have been up and running. There were no
> region transitions.
> The overall cluster iowait has decreased, but 2 RS still have very high
> iowait, while there is no load on the cluster.
> My assumption about the high number of HDFS_READ/HDFS_WRITE in RS logs
> has failed, since all RS have an almost identical number
> of HDFS_READ/HDFS_WRITE, while only 2 of them have high iowait.
> According to iotop, the process doing the most IO is the datanode, and it
> is reading constantly.
> Why would the datanode need to read from disk constantly?
> Any ideas?
>
> Thanks.
>
> On 03 Sep 2015, at 18:57, Adrien Mogenet <ad...@contentsquare.com>
> wrote:
>
> Is the uptime of the RS "normal"? No quick global reboot that could lead
> to a region-reallocation storm?
>
> On 3 September 2015 at 18:42, Akmal Abbasov <ak...@icloud.com>
> wrote:
>
>> Hi Adrien,
>> I’ve run hdfs fsck and hbase hbck; hdfs is healthy and hbase is
>> consistent.
>> I’m using the default replication value, so it is 3.
>> There are some under-replicated blocks.
>> The HBase master (node 10.10.8.55) is reading constantly from the
>> regionservers. Today alone it has sent >150,000 HDFS_READ requests to each
>> regionserver, while the hbase cluster is almost idle.
>> What could cause this kind of behaviour?
>>
>> p.s. each node in the cluster has 2 cores, 4 GB RAM, just in case.
>>
>> Thanks.
>>
>>
>> On 03 Sep 2015, at 17:46, Adrien Mogenet <
>> adrien.mogenet@contentsquare.com> wrote:
>>
>> Is your HDFS healthy (fsck /)?
>>
>> Same for hbase hbck?
>>
>> What's your replication level?
>>
>> Can you see constant network use as well?
>>
>> Anything that might be triggered by the hbase master? (something like a
>> virtually dead RS, due to a ZK race condition, etc.)
>>
>> Your balancer run from 3 weeks ago shouldn't have any effect if you ran a
>> major compaction successfully yesterday.
>>
>> On 3 September 2015 at 16:32, Akmal Abbasov <ak...@icloud.com>
>> wrote:
>>
>>> I started the HDFS balancer, but then stopped it immediately after
>>> learning that it is not a good idea.
>>> But that was around 3 weeks ago; is it possible that it influenced
>>> the cluster behaviour I’m seeing now?
>>> Thanks.
>>>
>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <ak...@icloud.com>
>>> wrote:
>>>
>>> Hi Ted,
>>> No, there is no short-circuit read configured.
>>> The datanode logs on 10.10.8.55 are full of the following messages
>>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>>> There are >100,000 of them just for today. The situation with the other
>>> regionservers is similar.
>>> Node 10.10.8.53 is the hbase-master node, and the process on that port is
>>> also the hbase-master.
>>> So if there is no load on the cluster, why is there so much IO
>>> happening?
>>> Any thoughts?
>>> Thanks.
>>>
>>> On 02 Sep 2015, at 21:57, Ted Yu <yu...@gmail.com> wrote:
>>>
>>> I assume you have enabled short-circuit read.
>>>
>>> Can you capture region server stack trace(s) and pastebin them ?
>>>
>>> Thanks
>>>
>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abbasov@icloud.com
>>> > wrote:
>>>
>>>> Hi Ted,
>>>> I’ve checked the time when addresses were changed, and this strange
>>>> behaviour started weeks before it.
>>>>
>>>> yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>>>> any thoughts?
>>>>
>>>> Thanks
>>>>
>>>> On 02 Sep 2015, at 18:45, Ted Yu <yu...@gmail.com> wrote:
>>>>
>>>> bq. change the ip addresses of the cluster nodes
>>>>
>>>> Did this happen recently ? If high iowait was observed after the change
>>>> (you can look at ganglia graph), there is a chance that the change was
>>>> related.
>>>>
>>>> BTW I assume 10.10.8.55 is where your region server resides.
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com
>>>> > wrote:
>>>>
>>>>> Hi Ted,
>>>>> sorry, forgot to mention
>>>>>
>>>>> release of hbase / hadoop you're using
>>>>>
>>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>>>
>>>>> were region servers doing compaction ?
>>>>>
>>>>> I’ve run major compactions manually earlier today, but it seems they
>>>>> have already completed, judging by the compactionQueueSize.
>>>>>
>>>>> have you checked region server logs ?
>>>>>
>>>>> The datanode logs are full of this kind of message:
>>>>> 2015-09-02 16:37:06,950 INFO
>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>>>>> 10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op:
>>>>> HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID:
>>>>> ee7d0634-89a3-4ada-a8ad-7848217327be, blockid:
>>>>> BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration:
>>>>> 7881815
>>>>>
>>>>> p.s. we had to change the ip addresses of the cluster nodes, is it
>>>>> relevant?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yu...@gmail.com> wrote:
>>>>>
>>>>> Please provide some more information:
>>>>>
>>>>> release of hbase / hadoop you're using
>>>>> were region servers doing compaction ?
>>>>> have you checked region server logs ?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <
>>>>> akmal.abbasov@icloud.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I’m seeing strange behaviour in an hbase cluster. It is almost idle,
>>>>>> with only <5 puts and gets.
>>>>>> But the data in hdfs keeps increasing, and the region servers have very
>>>>>> high iowait (>100, on a 2-core CPU).
>>>>>> iotop shows that the datanode process is reading and writing all the time.
>>>>>> Any suggestions?
>>>>>>
>>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>> --
>>
>> *Adrien Mogenet*
>> Head of Backend/Infrastructure
>> adrien.mogenet@contentsquare.com
>> (+33)6.59.16.64.22
>> http://www.contentsquare.com
>> 50, avenue Montaigne - 75008 Paris
>>
>>
>>
>
>
> --
>
> *Adrien Mogenet*
> Head of Backend/Infrastructure
> adrien.mogenet@contentsquare.com
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris
>
>
>


-- 

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.mogenet@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris

Re: High iowait in idle hbase cluster

Posted by Akmal Abbasov <ak...@icloud.com>.
Hi Adrien,
for the last 24 hours all RS are up and running. There were no region transitions.
The overall cluster iowait has decreased, but still 2 RS have very high iowait, while there is no load on the cluster.
My assumption about the high number of HDFS_READ/HDFS_WRITE in RS logs has failed, since all RS have almost identical numbers
of HDFS_READ/HDFS_WRITE, while only 2 of them have high iowait.
According to iotop the process doing the most IO is the datanode, and it is reading constantly.
Why would the datanode need to read from disk constantly?
Any ideas?
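
In case it helps, this is roughly how I am watching it (a sketch; it
assumes pidstat from sysstat and lsof are installed, and that it runs as
root):

  # which process accumulates the most disk I/O over time
  iotop -oPa -d 5
  # per-thread read/write rates inside the DataNode JVM
  pidstat -dt -p "$(pgrep -f 'datanode.DataNode' | head -1)" 5
  # which block files the DataNode currently holds open
  lsof -p "$(pgrep -f 'datanode.DataNode' | head -1)" | grep blk_ | head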

Thanks.

> On 03 Sep 2015, at 18:57, Adrien Mogenet <ad...@contentsquare.com> wrote:
> 
> Is the uptime of RS "normal"? No quick and global reboot that could lead to a region-reallocation storm?
> 
> On 3 September 2015 at 18:42, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
> Hi Adrien,
> I’ve run hdfs fsck and hbase hbck; hdfs is healthy and hbase is consistent.
> I’m using the default replication value, so it is 3.
> There are some under-replicated blocks.
> The HBase master (node 10.10.8.55) is reading constantly from regionservers. Just today it has sent >150,000 HDFS_READ requests to each regionserver so far, while the hbase cluster is almost idle.
> What could cause this kind of behaviour?
> 
> p.s. each node in the cluster has 2 cores and 4 GB RAM, just in case.
> 
> Thanks.
> 
> 
>> On 03 Sep 2015, at 17:46, Adrien Mogenet <adrien.mogenet@contentsquare.com <ma...@contentsquare.com>> wrote:
>> 
>> Is your HDFS healthy (fsck /)?
>> 
>> Same for hbase hbck?
>> 
>> What's your replication level?
>> 
>> Can you see constant network use as well?
>> 
>> Anything that might be triggered by the hbase master? (something like a virtually dead RS, due to a ZK race condition, etc.)
>> 
>> Your balancer run 3 weeks ago shouldn't have any effect if you ran a major compaction successfully yesterday.
>> 
>> On 3 September 2015 at 16:32, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>> I’ve started the HDFS balancer, but then stopped it immediately after learning that it is not a good idea.
>> But it was around 3 weeks ago; is it possible that it had an influence on the cluster behaviour I’m seeing now?
>> Thanks.
>> 
>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>> 
>>> Hi Ted,
>>> No, there is no short-circuit read configured.
>>> The datanode logs on 10.10.8.55 are full of the following messages:
>>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>>> There are >100,000 of them just for today. The situation with the other regionservers is similar.
>>> Node 10.10.8.53 is the hbase-master node, and the process on that port is also the hbase-master.
>>> So if there is no load on the cluster, why is there so much IO happening?
>>> Any thoughts?
>>> Thanks.
>>> 
>>>> On 02 Sep 2015, at 21:57, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> I assume you have enabled short-circuit read.
>>>> 
>>>> Can you capture region server stack trace(s) and pastebin them ?
>>>> 
>>>> Thanks
>>>> 
>>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>>> Hi Ted,
>>>> I’ve checked the time when addresses were changed, and this strange behaviour started weeks before it.
>>>> 
>>>> yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>>>> any thoughts?
>>>> 
>>>> Thanks
>>>> 
>>>>> On 02 Sep 2015, at 18:45, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>>>> 
>>>>> bq. change the ip addresses of the cluster nodes
>>>>> 
>>>>> Did this happen recently ? If high iowait was observed after the change (you can look at ganglia graph), there is a chance that the change was related.
>>>>> 
>>>>> BTW I assume 10.10.8.55 is where your region server resides.
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>>>> Hi Ted,
>>>>> sorry, forgot to mention
>>>>> 
>>>>>> release of hbase / hadoop you're using
>>>>> 
>>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>>> 
>>>>>> were region servers doing compaction ?
>>>>> 
>>>>> I’ve run major compactions manually earlier today, but it seems they have already completed, judging by the compactionQueueSize.
>>>>> 
>>>>>> have you checked region server logs ?
>>>>> The datanode logs are full of this kind of message:
>>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>>> 
>>>>> p.s. we had to change the ip addresses of the cluster nodes, is it relevant?
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>> Please provide some more information:
>>>>>> 
>>>>>> release of hbase / hadoop you're using
>>>>>> were region servers doing compaction ?
>>>>>> have you checked region server logs ?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>>>>> Hi,
>>>>>> I’m seeing strange behaviour in an hbase cluster. It is almost idle, with only <5 puts and gets.
>>>>>> But the data in hdfs keeps increasing, and the region servers have very high iowait (>100, on a 2-core CPU).
>>>>>> iotop shows that the datanode process is reading and writing all the time.
>>>>>> Any suggestions?
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> 
>> -- 
>> 
>> Adrien Mogenet
>> Head of Backend/Infrastructure
>> adrien.mogenet@contentsquare.com <ma...@contentsquare.com>
>> (+33)6.59.16.64.22 <tel:%28%2B33%296.59.16.64.22>
>> http://www.contentsquare.com <http://www.contentsquare.com/>
>> 50, avenue Montaigne - 75008 Paris
> 
> 
> 
> 
> -- 
> 
> Adrien Mogenet
> Head of Backend/Infrastructure
> adrien.mogenet@contentsquare.com <ma...@contentsquare.com>
> (+33)6.59.16.64.22
> http://www.contentsquare.com <http://www.contentsquare.com/>
> 50, avenue Montaigne - 75008 Paris


Re: High iowait in idle hbase cluster

Posted by Akmal Abbasov <ak...@icloud.com>.
Hi Adrien,
for the last 24 hours all RS are up and running. There was no region transitions.
The overall cluster iowait has decreased, but still 2 RS have very high iowait, while there is no load on the cluster.
My assumption with the hight number of HDFS_READ/HDFS_WRITE in RS logs have failed, since all RS have almost identical number
of HDFS_READ/HDFS_WRITE, while only 2 of them has high iowait.
According to iotop the process which is doing most IO is datanode, and it is reading constantly.
Why datanode could require reading from disk constantly?
Any ideas?

Thanks.

> On 03 Sep 2015, at 18:57, Adrien Mogenet <ad...@contentsquare.com> wrote:
> 
> Is the uptime of RS "normal"? No quick and global reboot that could lead into a regiongi-reallocation-storm?
> 
> On 3 September 2015 at 18:42, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
> Hi Adrien,
> I’ve tried to run hdfs fsck and hbase hbck, and hdfs is healthy, also hbase is consistent.
> I’m using default value of the replication, so it is 3.
> There are some under replicated 
> HBase master(node 10.10.8.55) is reading constantly from regionservers. Only today, it send >150.000 HDFS_READ requests to each regionserver so far, while the hbase cluster is almost idle.
> What could cause this kind of behaviour?
> 
> p.s. each node in the cluster have 2 core, 4 gb ram, just in case.
> 
> Thanks.
> 
> 
>> On 03 Sep 2015, at 17:46, Adrien Mogenet <adrien.mogenet@contentsquare.com <ma...@contentsquare.com>> wrote:
>> 
>> Is your HDFS healthy (fsck /)?
>> 
>> Same for hbase hbck?
>> 
>> What's your replication level?
>> 
>> Can you see constant network use as well?
>> 
>> Anything than might be triggered by the hbasemaster? (something like a virtually dead RS, due to ZK race-condition, etc.)
>> 
>> Your 3-weeks-ago balancer shouldn't have any effect if you've ran a major compaction, successfully, yesterday.
>> 
>> On 3 September 2015 at 16:32, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>> I’ve started HDFS balancer, but then stopped it immediately after knowing that it is not a good idea.
>> but it was around 3 weeks ago, is it possible that it had an influence on the cluster behaviour I’m having now?
>> Thanks.
>> 
>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>> 
>>> Hi Ted,
>>> No there is no short-circuit read configured.
>>> The logs of datanode of the 10.10.8.55 are full of following messages
>>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010 <http://10.10.8.55:50010/>, dest: /10.10.8.53:58622 <http://10.10.8.53:58622/>, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010 <http://10.10.8.55:50010/>, dest: /10.10.8.53:58622 <http://10.10.8.53:58622/>, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010 <http://10.10.8.55:50010/>, dest: /10.10.8.53:58622 <http://10.10.8.53:58622/>, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>>> There are >100.000 of them just for today. The situation with other regionservers are similar.
>>> Node 10.10.8.53 is hbase-master node, and the process on the port is also hbase-master.
>>> So if there is no load on the cluster, why there are so much IO happening?
>>> Any thoughts.
>>> Thanks.
>>> 
>>>> On 02 Sep 2015, at 21:57, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> I assume you have enabled short-circuit read.
>>>> 
>>>> Can you capture region server stack trace(s) and pastebin them ?
>>>> 
>>>> Thanks
>>>> 
>>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>>> Hi Ted,
>>>> I’ve checked the time when addresses were changed, and this strange behaviour started weeks before it.
>>>> 
>>>> yes, 10.10.8.55 is region server and 10.10.8.54 is a hbase master.
>>>> any thoughts?
>>>> 
>>>> Thanks
>>>> 
>>>>> On 02 Sep 2015, at 18:45, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>>>> 
>>>>> bq. change the ip addresses of the cluster nodes
>>>>> 
>>>>> Did this happen recently ? If high iowait was observed after the change (you can look at ganglia graph), there is a chance that the change was related.
>>>>> 
>>>>> BTW I assume 10.10.8.55 <http://10.10.8.55:50010/> is where your region server resides.
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>>>> Hi Ted,
>>>>> sorry forget to mention
>>>>> 
>>>>>> release of hbase / hadoop you're using
>>>>> 
>>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>>> 
>>>>>> were region servers doing compaction ?
>>>>> 
>>>>> I’ve run major compactions manually earlier today, but it seems that they already completed, looking at the compactionQueueSize.
>>>>> 
>>>>>> have you checked region server logs ?
>>>>> The logs of datanode is full of this kind of messages
>>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010 <http://10.10.8.55:50010/>, dest: /10.10.8.54:32959 <http://10.10.8.54:32959/>, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>>> 
>>>>> p.s. we had to change the ip addresses of the cluster nodes, is it relevant?
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>> Please provide some more information:
>>>>>> 
>>>>>> release of hbase / hadoop you're using
>>>>>> were region servers doing compaction ?
>>>>>> have you checked region server logs ?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>>>>> Hi,
>>>>>> I’m having strange behaviour in hbase cluster. It is almost idle, only <5 puts and gets.
>>>>>> But the data in hdfs is increasing, and region servers have very high iowait(>100, in 2 core CPU).
>>>>>> iotop shows that datanode process is reading and writing all the time.
>>>>>> Any suggestions?
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> 
>> -- 
>> 
>> Adrien Mogenet
>> Head of Backend/Infrastructure
>> adrien.mogenet@contentsquare.com <ma...@contentsquare.com>
>> (+33)6.59.16.64.22 <tel:%28%2B33%296.59.16.64.22>
>> http://www.contentsquare.com <http://www.contentsquare.com/>
>> 50, avenue Montaigne - 75008 Paris
> 
> 
> 
> 
> -- 
> 
> Adrien Mogenet
> Head of Backend/Infrastructure
> adrien.mogenet@contentsquare.com <ma...@contentsquare.com>
> (+33)6.59.16.64.22
> http://www.contentsquare.com <http://www.contentsquare.com/>
> 50, avenue Montaigne - 75008 Paris


Re: High iowait in idle hbase cluster

Posted by Akmal Abbasov <ak...@icloud.com>.
Hi Adrien,
for the last 24 hours all RS are up and running. There was no region transitions.
The overall cluster iowait has decreased, but still 2 RS have very high iowait, while there is no load on the cluster.
My assumption with the hight number of HDFS_READ/HDFS_WRITE in RS logs have failed, since all RS have almost identical number
of HDFS_READ/HDFS_WRITE, while only 2 of them has high iowait.
According to iotop the process which is doing most IO is datanode, and it is reading constantly.
Why datanode could require reading from disk constantly?
Any ideas?

Thanks.

> On 03 Sep 2015, at 18:57, Adrien Mogenet <ad...@contentsquare.com> wrote:
> 
> Is the uptime of RS "normal"? No quick and global reboot that could lead into a regiongi-reallocation-storm?
> 
> On 3 September 2015 at 18:42, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
> Hi Adrien,
> I’ve tried to run hdfs fsck and hbase hbck, and hdfs is healthy, also hbase is consistent.
> I’m using default value of the replication, so it is 3.
> There are some under replicated 
> HBase master(node 10.10.8.55) is reading constantly from regionservers. Only today, it send >150.000 HDFS_READ requests to each regionserver so far, while the hbase cluster is almost idle.
> What could cause this kind of behaviour?
> 
> p.s. each node in the cluster have 2 core, 4 gb ram, just in case.
> 
> Thanks.
> 
> 
>> On 03 Sep 2015, at 17:46, Adrien Mogenet <adrien.mogenet@contentsquare.com <ma...@contentsquare.com>> wrote:
>> 
>> Is your HDFS healthy (fsck /)?
>> 
>> Same for hbase hbck?
>> 
>> What's your replication level?
>> 
>> Can you see constant network use as well?
>> 
>> Anything than might be triggered by the hbasemaster? (something like a virtually dead RS, due to ZK race-condition, etc.)
>> 
>> Your 3-weeks-ago balancer shouldn't have any effect if you've ran a major compaction, successfully, yesterday.
>> 
>> On 3 September 2015 at 16:32, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>> I’ve started HDFS balancer, but then stopped it immediately after knowing that it is not a good idea.
>> but it was around 3 weeks ago, is it possible that it had an influence on the cluster behaviour I’m having now?
>> Thanks.
>> 
>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>> 
>>> Hi Ted,
>>> No there is no short-circuit read configured.
>>> The logs of datanode of the 10.10.8.55 are full of following messages
>>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010 <http://10.10.8.55:50010/>, dest: /10.10.8.53:58622 <http://10.10.8.53:58622/>, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010 <http://10.10.8.55:50010/>, dest: /10.10.8.53:58622 <http://10.10.8.53:58622/>, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010 <http://10.10.8.55:50010/>, dest: /10.10.8.53:58622 <http://10.10.8.53:58622/>, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>>> There are >100.000 of them just for today. The situation with other regionservers are similar.
>>> Node 10.10.8.53 is hbase-master node, and the process on the port is also hbase-master.
>>> So if there is no load on the cluster, why there are so much IO happening?
>>> Any thoughts.
>>> Thanks.
>>> 
>>>> On 02 Sep 2015, at 21:57, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> I assume you have enabled short-circuit read.
>>>> 
>>>> Can you capture region server stack trace(s) and pastebin them ?
>>>> 
>>>> Thanks
>>>> 
>>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>>> Hi Ted,
>>>> I’ve checked the time when addresses were changed, and this strange behaviour started weeks before it.
>>>> 
>>>> yes, 10.10.8.55 is region server and 10.10.8.54 is a hbase master.
>>>> any thoughts?
>>>> 
>>>> Thanks
>>>> 
>>>>> On 02 Sep 2015, at 18:45, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>>>> 
>>>>> bq. change the ip addresses of the cluster nodes
>>>>> 
>>>>> Did this happen recently ? If high iowait was observed after the change (you can look at ganglia graph), there is a chance that the change was related.
>>>>> 
>>>>> BTW I assume 10.10.8.55 <http://10.10.8.55:50010/> is where your region server resides.
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>>>> Hi Ted,
>>>>> sorry forget to mention
>>>>> 
>>>>>> release of hbase / hadoop you're using
>>>>> 
>>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>>> 
>>>>>> were region servers doing compaction ?
>>>>> 
>>>>> I’ve run major compactions manually earlier today, but it seems that they already completed, looking at the compactionQueueSize.
>>>>> 
>>>>>> have you checked region server logs ?
>>>>> The logs of datanode is full of this kind of messages
>>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010 <http://10.10.8.55:50010/>, dest: /10.10.8.54:32959 <http://10.10.8.54:32959/>, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>>> 
>>>>> p.s. we had to change the ip addresses of the cluster nodes, is it relevant?
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>> Please provide some more information:
>>>>>> 
>>>>>> release of hbase / hadoop you're using
>>>>>> were region servers doing compaction ?
>>>>>> have you checked region server logs ?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>>>>> Hi,
>>>>>> I’m having strange behaviour in hbase cluster. It is almost idle, only <5 puts and gets.
>>>>>> But the data in hdfs is increasing, and region servers have very high iowait(>100, in 2 core CPU).
>>>>>> iotop shows that datanode process is reading and writing all the time.
>>>>>> Any suggestions?
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> 
>> -- 
>> 
>> Adrien Mogenet
>> Head of Backend/Infrastructure
>> adrien.mogenet@contentsquare.com <ma...@contentsquare.com>
>> (+33)6.59.16.64.22 <tel:%28%2B33%296.59.16.64.22>
>> http://www.contentsquare.com <http://www.contentsquare.com/>
>> 50, avenue Montaigne - 75008 Paris
> 
> 
> 
> 
> -- 
> 
> Adrien Mogenet
> Head of Backend/Infrastructure
> adrien.mogenet@contentsquare.com <ma...@contentsquare.com>
> (+33)6.59.16.64.22
> http://www.contentsquare.com <http://www.contentsquare.com/>
> 50, avenue Montaigne - 75008 Paris


Re: High iowait in idle hbase cluster

Posted by Akmal Abbasov <ak...@icloud.com>.
Hi Adrien,
for the last 24 hours all RS are up and running. There was no region transitions.
The overall cluster iowait has decreased, but still 2 RS have very high iowait, while there is no load on the cluster.
My assumption with the hight number of HDFS_READ/HDFS_WRITE in RS logs have failed, since all RS have almost identical number
of HDFS_READ/HDFS_WRITE, while only 2 of them has high iowait.
According to iotop the process which is doing most IO is datanode, and it is reading constantly.
Why datanode could require reading from disk constantly?
Any ideas?

Thanks.

> On 03 Sep 2015, at 18:57, Adrien Mogenet <ad...@contentsquare.com> wrote:
> 
> Is the uptime of RS "normal"? No quick and global reboot that could lead into a regiongi-reallocation-storm?
> 
> On 3 September 2015 at 18:42, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
> Hi Adrien,
> I’ve tried to run hdfs fsck and hbase hbck, and hdfs is healthy, also hbase is consistent.
> I’m using default value of the replication, so it is 3.
> There are some under replicated 
> HBase master(node 10.10.8.55) is reading constantly from regionservers. Only today, it send >150.000 HDFS_READ requests to each regionserver so far, while the hbase cluster is almost idle.
> What could cause this kind of behaviour?
> 
> p.s. each node in the cluster have 2 core, 4 gb ram, just in case.
> 
> Thanks.
> 
> 
>> On 03 Sep 2015, at 17:46, Adrien Mogenet <adrien.mogenet@contentsquare.com <ma...@contentsquare.com>> wrote:
>> 
>> Is your HDFS healthy (fsck /)?
>> 
>> Same for hbase hbck?
>> 
>> What's your replication level?
>> 
>> Can you see constant network use as well?
>> 
>> Anything than might be triggered by the hbasemaster? (something like a virtually dead RS, due to ZK race-condition, etc.)
>> 
>> Your 3-weeks-ago balancer shouldn't have any effect if you've ran a major compaction, successfully, yesterday.
>> 
>> On 3 September 2015 at 16:32, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>> I’ve started HDFS balancer, but then stopped it immediately after knowing that it is not a good idea.
>> but it was around 3 weeks ago, is it possible that it had an influence on the cluster behaviour I’m having now?
>> Thanks.
>> 
>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>> 
>>> Hi Ted,
>>> No there is no short-circuit read configured.
>>> The logs of datanode of the 10.10.8.55 are full of following messages
>>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010 <http://10.10.8.55:50010/>, dest: /10.10.8.53:58622 <http://10.10.8.53:58622/>, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010 <http://10.10.8.55:50010/>, dest: /10.10.8.53:58622 <http://10.10.8.53:58622/>, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010 <http://10.10.8.55:50010/>, dest: /10.10.8.53:58622 <http://10.10.8.53:58622/>, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>>> There are >100.000 of them just for today. The situation with other regionservers are similar.
>>> Node 10.10.8.53 is hbase-master node, and the process on the port is also hbase-master.
>>> So if there is no load on the cluster, why there are so much IO happening?
>>> Any thoughts.
>>> Thanks.
>>> 
>>>> On 02 Sep 2015, at 21:57, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> I assume you have enabled short-circuit read.
>>>> 
>>>> Can you capture region server stack trace(s) and pastebin them ?
>>>> 
>>>> Thanks
>>>> 
>>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>>> Hi Ted,
>>>> I’ve checked the time when addresses were changed, and this strange behaviour started weeks before it.
>>>> 
>>>> yes, 10.10.8.55 is region server and 10.10.8.54 is a hbase master.
>>>> any thoughts?
>>>> 
>>>> Thanks
>>>> 
>>>>> On 02 Sep 2015, at 18:45, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>>>> 
>>>>> bq. change the ip addresses of the cluster nodes
>>>>> 
>>>>> Did this happen recently ? If high iowait was observed after the change (you can look at ganglia graph), there is a chance that the change was related.
>>>>> 
>>>>> BTW I assume 10.10.8.55 <http://10.10.8.55:50010/> is where your region server resides.
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>>>> Hi Ted,
>>>>> sorry forget to mention
>>>>> 
>>>>>> release of hbase / hadoop you're using
>>>>> 
>>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>>> 
>>>>>> were region servers doing compaction ?
>>>>> 
>>>>> I’ve run major compactions manually earlier today, but it seems that they already completed, looking at the compactionQueueSize.
>>>>> 
>>>>>> have you checked region server logs ?
>>>>> The logs of datanode is full of this kind of messages
>>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010 <http://10.10.8.55:50010/>, dest: /10.10.8.54:32959 <http://10.10.8.54:32959/>, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>>> 
>>>>> p.s. we had to change the ip addresses of the cluster nodes, is it relevant?
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>> Please provide some more information:
>>>>>> 
>>>>>> release of hbase / hadoop you're using
>>>>>> were region servers doing compaction ?
>>>>>> have you checked region server logs ?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>>>>> Hi,
>>>>>> I’m having strange behaviour in hbase cluster. It is almost idle, only <5 puts and gets.
>>>>>> But the data in hdfs is increasing, and region servers have very high iowait(>100, in 2 core CPU).
>>>>>> iotop shows that datanode process is reading and writing all the time.
>>>>>> Any suggestions?
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> 
>> -- 
>> 
>> Adrien Mogenet
>> Head of Backend/Infrastructure
>> adrien.mogenet@contentsquare.com
>> (+33)6.59.16.64.22
>> http://www.contentsquare.com
>> 50, avenue Montaigne - 75008 Paris
> 
> 
> 
> 
> -- 
> 
> Adrien Mogenet
> Head of Backend/Infrastructure
> adrien.mogenet@contentsquare.com
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris


Re: High iowait in idle hbase cluster

Posted by Adrien Mogenet <ad...@contentsquare.com>.
Is the uptime of RS "normal"? No quick and global reboot that could lead
to a region-reallocation storm?
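
Something like this would show both, assuming the daemons were started by
the stock scripts and the master logs under $HBASE_HOME/logs (just a
sketch -- paths and the exact log wording vary by release):

# Uptime of each RegionServer JVM on a node
for pid in $(pgrep -f HRegionServer); do ps -o pid=,etime= -p "$pid"; done

# Rough count of region state transitions the master recorded today
# ("Transitioned" is the usual wording, but check your release's logs)
grep "$(date +%F)" "$HBASE_HOME"/logs/*master*.log | grep -c "Transitioned"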

On 3 September 2015 at 18:42, Akmal Abbasov <ak...@icloud.com>
wrote:

> Hi Adrien,
> I’ve tried to run hdfs fsck and hbase hbck: hdfs is healthy, and hbase is
> consistent.
> I’m using the default replication value, so it is 3.
> There are some under-replicated blocks.
> The HBase master (node 10.10.8.55) is reading constantly from the regionservers.
> Just today it has sent >150.000 HDFS_READ requests to each regionserver so
> far, while the hbase cluster is almost idle.
> What could cause this kind of behaviour?
>
> p.s. each node in the cluster has 2 cores and 4 GB RAM, just in case.
>
> Thanks.
>
>
> On 03 Sep 2015, at 17:46, Adrien Mogenet <ad...@contentsquare.com>
> wrote:
>
> Is your HDFS healthy (fsck /)?
>
> Same for hbase hbck?
>
> What's your replication level?
>
> Can you see constant network use as well?
>
> Anything that might be triggered by the hbasemaster? (something like a
> virtually dead RS, due to a ZK race condition, etc.)
>
> Your balancer run from 3 weeks ago shouldn't have any effect if you
> successfully ran a major compaction yesterday.
>
> On 3 September 2015 at 16:32, Akmal Abbasov <ak...@icloud.com>
> wrote:
>
>> I’ve started HDFS balancer, but then stopped it immediately after knowing
>> that it is not a good idea.
>> but it was around 3 weeks ago, is it possible that it had an influence on
>> the cluster behaviour I’m having now?
>> Thanks.
>>
>> On 03 Sep 2015, at 14:23, Akmal Abbasov <ak...@icloud.com> wrote:
>>
>> Hi Ted,
>> No, there is no short-circuit read configured.
>> The datanode logs on 10.10.8.55 are full of the following messages:
>> 2015-09-03 12:03:56,324 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ,
>> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
>> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
>> BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration:
>> 276448307
>> 2015-09-03 12:03:56,494 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ,
>> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
>> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
>> BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration:
>> 60550244
>> 2015-09-03 12:03:59,561 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ,
>> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
>> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
>> BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration:
>> 755613819
>> There are >100.000 of them just for today; the situation on the other
>> regionservers is similar.
>> Node 10.10.8.53 is the hbase-master node, and the process on that port is
>> also the hbase-master.
>> So if there is no load on the cluster, why is so much IO happening?
>> Any thoughts?
>> Thanks.
>>
>> On 02 Sep 2015, at 21:57, Ted Yu <yu...@gmail.com> wrote:
>>
>> I assume you have enabled short-circuit read.
>>
>> Can you capture region server stack trace(s) and pastebin them ?
>>
>> Thanks
>>
>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <ak...@icloud.com>
>> wrote:
>>
>>> Hi Ted,
>>> I’ve checked the time when addresses were changed, and this strange
>>> behaviour started weeks before it.
>>>
>>> yes, 10.10.8.55 is a region server and 10.10.8.54 is the hbase master.
>>> any thoughts?
>>>
>>> Thanks
>>>
>>> On 02 Sep 2015, at 18:45, Ted Yu <yu...@gmail.com> wrote:
>>>
>>> bq. change the ip addresses of the cluster nodes
>>>
>>> Did this happen recently ? If high iowait was observed after the change
>>> (you can look at ganglia graph), there is a chance that the change was
>>> related.
>>>
>>> BTW I assume 10.10.8.55 is where your region
>>> server resides.
>>>
>>> Cheers
>>>
>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <ak...@icloud.com>
>>> wrote:
>>>
>>>> Hi Ted,
>>>> sorry, forgot to mention
>>>>
>>>> release of hbase / hadoop you're using
>>>>
>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>>
>>>> were region servers doing compaction ?
>>>>
>>>> I’ve run major compactions manually earlier today, but it seems they
>>>> have already completed, judging by the compactionQueueSize.
>>>>
>>>> have you checked region server logs ?
>>>>
>>>> The datanode logs are full of this kind of message:
>>>> 2015-09-02 16:37:06,950 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>>>> 10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op:
>>>> HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID:
>>>> ee7d0634-89a3-4ada-a8ad-7848217327be, blockid:
>>>> BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration:
>>>> 7881815
>>>>
>>>> p.s. we had to change the ip addresses of the cluster nodes, is it
>>>> relevant?
>>>>
>>>> Thanks.
>>>>
>>>> On 02 Sep 2015, at 18:20, Ted Yu <yu...@gmail.com> wrote:
>>>>
>>>> Please provide some more information:
>>>>
>>>> release of hbase / hadoop you're using
>>>> were region servers doing compaction ?
>>>> have you checked region server logs ?
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com
>>>> > wrote:
>>>>
>>>>> Hi,
>>>>> I’m having strange behaviour in hbase cluster. It is almost idle, only
>>>>> <5 puts and gets.
>>>>> But the data in hdfs is increasing, and region servers have very high
>>>>> iowait(>100, in 2 core CPU).
>>>>> iotop shows that datanode process is reading and writing all the time.
>>>>> Any suggestions?
>>>>>
>>>>> Thanks.
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
> --
>
> *Adrien Mogenet*
> Head of Backend/Infrastructure
> adrien.mogenet@contentsquare.com
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris
>
>
>


-- 

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.mogenet@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris


Re: High iowait in idle hbase cluster

Posted by Akmal Abbasov <ak...@icloud.com>.
Hi Adrien,
I’ve tried to run hdfs fsck and hbase hbck: hdfs is healthy, and hbase is consistent.
I’m using the default replication value, so it is 3.
There are some under-replicated blocks.
The HBase master (node 10.10.8.55) is reading constantly from the regionservers. Just today it has sent >150.000 HDFS_READ requests to each regionserver so far, while the hbase cluster is almost idle.
What could cause this kind of behaviour?

p.s. each node in the cluster has 2 cores and 4 GB RAM, just in case.

Thanks.
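
For what it's worth, you can tally those clienttrace lines per client
address roughly like this (just a sketch -- the datanode log path below is
a guess for a given install, and note the duration field in these lines is
in nanoseconds):

# Today's HDFS_READ clienttrace entries, counted per destination (client) IP
grep "$(date +%F)" /var/log/hadoop/*datanode*.log \
  | grep 'op: HDFS_READ' \
  | awk '{for (i = 1; i <= NF; i++) if ($i == "dest:") print $(i+1)}' \
  | cut -d: -f1 \
  | sort | uniq -c | sort -rn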


> On 03 Sep 2015, at 17:46, Adrien Mogenet <ad...@contentsquare.com> wrote:
> 
> Is your HDFS healthy (fsck /)?
> 
> Same for hbase hbck?
> 
> What's your replication level?
> 
> Can you see constant network use as well?
> 
> Anything that might be triggered by the hbasemaster? (something like a virtually dead RS, due to a ZK race condition, etc.)
> 
> Your balancer run from 3 weeks ago shouldn't have any effect if you successfully ran a major compaction yesterday.
> 
> On 3 September 2015 at 16:32, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
> I’ve started HDFS balancer, but then stopped it immediately after knowing that it is not a good idea.
> but it was around 3 weeks ago, is it possible that it had an influence on the cluster behaviour I’m having now?
> Thanks.
> 
>> On 03 Sep 2015, at 14:23, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>> 
>> Hi Ted,
>> No, there is no short-circuit read configured.
>> The datanode logs on 10.10.8.55 are full of the following messages:
>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>> There are >100.000 of them just for today; the situation on the other regionservers is similar.
>> Node 10.10.8.53 is the hbase-master node, and the process on that port is also the hbase-master.
>> So if there is no load on the cluster, why is so much IO happening?
>> Any thoughts?
>> Thanks.
>> 
>>> On 02 Sep 2015, at 21:57, Ted Yu <yuzhihong@gmail.com> wrote:
>>> 
>>> I assume you have enabled short-circuit read.
>>> 
>>> Can you capture region server stack trace(s) and pastebin them ?
>>> 
>>> Thanks
>>> 
>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>> Hi Ted,
>>> I’ve checked the time when addresses were changed, and this strange behaviour started weeks before it.
>>> 
>>> yes, 10.10.8.55 is a region server and 10.10.8.54 is the hbase master.
>>> any thoughts?
>>> 
>>> Thanks
>>> 
>>>> On 02 Sep 2015, at 18:45, Ted Yu <yuzhihong@gmail.com> wrote:
>>>> 
>>>> bq. change the ip addresses of the cluster nodes
>>>> 
>>>> Did this happen recently ? If high iowait was observed after the change (you can look at ganglia graph), there is a chance that the change was related.
>>>> 
>>>> BTW I assume 10.10.8.55 is where your region server resides.
>>>> 
>>>> Cheers
>>>> 
>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>> Hi Ted,
>>>> sorry, forgot to mention
>>>> 
>>>>> release of hbase / hadoop you're using
>>>> 
>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>> 
>>>>> were region servers doing compaction ?
>>>> 
>>>> I’ve run major compactions manually earlier today, but it seems they have already completed, judging by the compactionQueueSize.
>>>> 
>>>>> have you checked region server logs ?
>>>> The datanode logs are full of this kind of message:
>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>> 
>>>> p.s. we had to change the ip addresses of the cluster nodes, is it relevant?
>>>> 
>>>> Thanks.
>>>> 
>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>> 
>>>>> Please provide some more information:
>>>>> 
>>>>> release of hbase / hadoop you're using
>>>>> were region servers doing compaction ?
>>>>> have you checked region server logs ?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>>> Hi,
>>>>> I’m having strange behaviour in hbase cluster. It is almost idle, only <5 puts and gets.
>>>>> But the data in hdfs is increasing, and region servers have very high iowait(>100, in 2 core CPU).
>>>>> iotop shows that datanode process is reading and writing all the time.
>>>>> Any suggestions?
>>>>> 
>>>>> Thanks.
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
> 
> 
> 
> 
> -- 
> 
> Adrien Mogenet
> Head of Backend/Infrastructure
> adrien.mogenet@contentsquare.com
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris



Re: High iowait in idle hbase cluster

Posted by Adrien Mogenet <ad...@contentsquare.com>.
Is your HDFS healthy (fsck /)?

Same for hbase hbck?

What's your replication level?

Can you see constant network use as well?

Anything that might be triggered by the hbasemaster? (something like a
virtually dead RS, due to a ZK race condition, etc.)

Your balancer run from 3 weeks ago shouldn't have any effect if you
successfully ran a major compaction yesterday.
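
Concretely, something along these lines (a sketch -- sar comes from the
sysstat package, which I'm assuming is installed; exact flags vary by
version):

hdfs fsck / | tail -n 20                 # HDFS health summary, incl. under-replicated blocks
hbase hbck                               # HBase consistency report
hdfs getconf -confKey dfs.replication    # effective default replication factor
sar -n DEV 5 3                           # per-interface network throughput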

On 3 September 2015 at 16:32, Akmal Abbasov <ak...@icloud.com>
wrote:

> I’ve started HDFS balancer, but then stopped it immediately after knowing
> that it is not a good idea.
> but it was around 3 weeks ago, is it possible that it had an influence on
> the cluster behaviour I’m having now?
> Thanks.
>
> On 03 Sep 2015, at 14:23, Akmal Abbasov <ak...@icloud.com> wrote:
>
> Hi Ted,
> No, there is no short-circuit read configured.
> The datanode logs on 10.10.8.55 are full of the following messages:
> 2015-09-03 12:03:56,324 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ,
> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
> BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration:
> 276448307
> 2015-09-03 12:03:56,494 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ,
> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
> BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration:
> 60550244
> 2015-09-03 12:03:59,561 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ,
> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
> BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration:
> 755613819
> There are >100.000 of them just for today; the situation on the other
> regionservers is similar.
> Node 10.10.8.53 is the hbase-master node, and the process on that port is
> also the hbase-master.
> So if there is no load on the cluster, why is so much IO happening?
> Any thoughts?
> Thanks.
>
> On 02 Sep 2015, at 21:57, Ted Yu <yu...@gmail.com> wrote:
>
> I assume you have enabled short-circuit read.
>
> Can you capture region server stack trace(s) and pastebin them ?
>
> Thanks
>
> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <ak...@icloud.com>
> wrote:
>
>> Hi Ted,
>> I’ve checked the time when addresses were changed, and this strange
>> behaviour started weeks before it.
>>
>> yes, 10.10.8.55 is a region server and 10.10.8.54 is the hbase master.
>> any thoughts?
>>
>> Thanks
>>
>> On 02 Sep 2015, at 18:45, Ted Yu <yu...@gmail.com> wrote:
>>
>> bq. change the ip addresses of the cluster nodes
>>
>> Did this happen recently ? If high iowait was observed after the change
>> (you can look at ganglia graph), there is a chance that the change was
>> related.
>>
>> BTW I assume 10.10.8.55 is where your region
>> server resides.
>>
>> Cheers
>>
>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <ak...@icloud.com>
>> wrote:
>>
>>> Hi Ted,
>>> sorry, forgot to mention
>>>
>>> release of hbase / hadoop you're using
>>>
>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>
>>> were region servers doing compaction ?
>>>
>>> I’ve run major compactions manually earlier today, but it seems they
>>> have already completed, judging by the compactionQueueSize.
>>>
>>> have you checked region server logs ?
>>>
>>> The datanode logs are full of this kind of message:
>>> 2015-09-02 16:37:06,950 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>>> 10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ,
>>> cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID:
>>> ee7d0634-89a3-4ada-a8ad-7848217327be, blockid:
>>> BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration:
>>> 7881815
>>>
>>> p.s. we had to change the ip addresses of the cluster nodes, is it
>>> relevant?
>>>
>>> Thanks.
>>>
>>> On 02 Sep 2015, at 18:20, Ted Yu <yu...@gmail.com> wrote:
>>>
>>> Please provide some more information:
>>>
>>> release of hbase / hadoop you're using
>>> were region servers doing compaction ?
>>> have you checked region server logs ?
>>>
>>> Thanks
>>>
>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <ak...@icloud.com>
>>> wrote:
>>>
>>>> Hi,
>>>> I’m having strange behaviour in hbase cluster. It is almost idle, only
>>>> <5 puts and gets.
>>>> But the data in hdfs is increasing, and region servers have very high
>>>> iowait(>100, in 2 core CPU).
>>>> iotop shows that datanode process is reading and writing all the time.
>>>> Any suggestions?
>>>>
>>>> Thanks.
>>>
>>>
>>>
>>>
>>
>>
>
>
>


-- 

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.mogenet@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris

Re: High iowait in idle hbase cluster

Posted by Adrien Mogenet <ad...@contentsquare.com>.
Is your HDFS healthy (fsck /)?

Same for hbase hbck?

What's your replication level?

Can you see constant network use as well?

Anything than might be triggered by the hbasemaster? (something like a
virtually dead RS, due to ZK race-condition, etc.)

Your 3-weeks-ago balancer shouldn't have any effect if you've ran a major
compaction, successfully, yesterday.

On 3 September 2015 at 16:32, Akmal Abbasov <ak...@icloud.com>
wrote:

> I’ve started the HDFS balancer, but then stopped it immediately after
> learning that it is not a good idea.
> But that was around 3 weeks ago; is it possible that it had an influence on
> the cluster behaviour I’m seeing now?
> Thanks.
>
> On 03 Sep 2015, at 14:23, Akmal Abbasov <ak...@icloud.com> wrote:
>
> Hi Ted,
> No, there is no short-circuit read configured.
> The datanode logs on 10.10.8.55 are full of the following messages:
> 2015-09-03 12:03:56,324 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ,
> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
> BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration:
> 276448307
> 2015-09-03 12:03:56,494 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ,
> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
> BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration:
> 60550244
> 2015-09-03 12:03:59,561 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ,
> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
> BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration:
> 755613819
> There are >100.000 of them just for today. The situation on the other
> regionservers is similar.
> Node 10.10.8.53 is the hbase-master node, and the process on that port is
> also hbase-master.
> So if there is no load on the cluster, why is there so much IO happening?
> Any thoughts?
> Thanks.
>
> On 02 Sep 2015, at 21:57, Ted Yu <yu...@gmail.com> wrote:
>
> I assume you have enabled short-circuit read.
>
> Can you capture region server stack trace(s) and pastebin them ?
>
> Thanks
>
> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <ak...@icloud.com>
> wrote:
>
>> Hi Ted,
>> I’ve checked the time when addresses were changed, and this strange
>> behaviour started weeks before it.
>>
>> Yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>> any thoughts?
>>
>> Thanks
>>
>> On 02 Sep 2015, at 18:45, Ted Yu <yu...@gmail.com> wrote:
>>
>> bq. change the ip addresses of the cluster nodes
>>
>> Did this happen recently ? If high iowait was observed after the change
>> (you can look at ganglia graph), there is a chance that the change was
>> related.
>>
>> BTW I assume 10.10.8.55 <http://10.10.8.55:50010/> is where your region
>> server resides.
>>
>> Cheers
>>
>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <ak...@icloud.com>
>> wrote:
>>
>>> Hi Ted,
>>> sorry, I forgot to mention
>>>
>>> release of hbase / hadoop you're using
>>>
>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>
>>> were region servers doing compaction ?
>>>
>>> I’ve run major compactions manually earlier today, but it seems that
>>> they already completed, looking at the compactionQueueSize.
>>>
>>> have you checked region server logs ?
>>>
>>> The datanode logs are full of this kind of message:
>>> 2015-09-02 16:37:06,950 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>>> 10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ,
>>> cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID:
>>> ee7d0634-89a3-4ada-a8ad-7848217327be, blockid:
>>> BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration:
>>> 7881815
>>>
>>> p.s. we had to change the ip addresses of the cluster nodes, is it
>>> relevant?
>>>
>>> Thanks.
>>>
>>> On 02 Sep 2015, at 18:20, Ted Yu <yu...@gmail.com> wrote:
>>>
>>> Please provide some more information:
>>>
>>> release of hbase / hadoop you're using
>>> were region servers doing compaction ?
>>> have you checked region server logs ?
>>>
>>> Thanks
>>>
>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <ak...@icloud.com>
>>> wrote:
>>>
>>>> Hi,
>>>> I’m having strange behaviour in the hbase cluster. It is almost idle,
>>>> only <5 puts and gets.
>>>> But the data in hdfs is increasing, and region servers have very high
>>>> iowait (>100 on a 2-core CPU).
>>>> iotop shows that datanode process is reading and writing all the time.
>>>> Any suggestions?
>>>>
>>>> Thanks.
>>>
>>>
>>>
>>>
>>
>>
>
>
>


-- 

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.mogenet@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris

Re: High iowait in idle hbase cluster

Posted by Akmal Abbasov <ak...@icloud.com>.
I’ve started the HDFS balancer, but then stopped it immediately after learning that it is not a good idea.
But that was around 3 weeks ago; is it possible that it had an influence on the cluster behaviour I’m seeing now?
Thanks.
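
One rough way to check whether that short balancer run left blocks unevenly
placed is to compare per-datanode usage (a sketch, assuming HDFS admin access):

  # A wide spread in DFS Used% across datanodes would point at the balancer run
  hdfs dfsadmin -report | grep -E 'Name:|DFS Used%'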

> On 03 Sep 2015, at 14:23, Akmal Abbasov <ak...@icloud.com> wrote:
> 
> Hi Ted,
> No, there is no short-circuit read configured.
> The datanode logs on 10.10.8.55 are full of the following messages:
> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
> There are >100.000 of them just for today. The situation on the other regionservers is similar.
> Node 10.10.8.53 is the hbase-master node, and the process on that port is also hbase-master.
> So if there is no load on the cluster, why is there so much IO happening?
> Any thoughts?
> Thanks.
> 
>> On 02 Sep 2015, at 21:57, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>> 
>> I assume you have enabled short-circuit read.
>> 
>> Can you capture region server stack trace(s) and pastebin them ?
>> 
>> Thanks
>> 
>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>> Hi Ted,
>> I’ve checked the time when addresses were changed, and this strange behaviour started weeks before it.
>> 
>> Yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>> any thoughts?
>> 
>> Thanks
>> 
>>> On 02 Sep 2015, at 18:45, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> bq. change the ip addresses of the cluster nodes
>>> 
>>> Did this happen recently ? If high iowait was observed after the change (you can look at ganglia graph), there is a chance that the change was related.
>>> 
>>> BTW I assume 10.10.8.55 <http://10.10.8.55:50010/> is where your region server resides.
>>> 
>>> Cheers
>>> 
>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>> Hi Ted,
>>> sorry, I forgot to mention
>>> 
>>>> release of hbase / hadoop you're using
>>> 
>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>> 
>>>> were region servers doing compaction ?
>>> 
>>> I’ve run major compactions manually earlier today, but it seems that they already completed, looking at the compactionQueueSize.
>>> 
>>>> have you checked region server logs ?
>>> The datanode logs are full of this kind of message:
>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010 <http://10.10.8.55:50010/>, dest: /10.10.8.54:32959 <http://10.10.8.54:32959/>, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>> 
>>> p.s. we had to change the ip addresses of the cluster nodes, is it relevant?
>>> 
>>> Thanks.
>>> 
>>>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> Please provide some more information:
>>>> 
>>>> release of hbase / hadoop you're using
>>>> were region servers doing compaction ?
>>>> have you checked region server logs ?
>>>> 
>>>> Thanks
>>>> 
>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>>> Hi,
>>>> I’m having strange behaviour in the hbase cluster. It is almost idle, only <5 puts and gets.
>>>> But the data in hdfs is increasing, and region servers have very high iowait (>100 on a 2-core CPU).
>>>> iotop shows that datanode process is reading and writing all the time.
>>>> Any suggestions?
>>>> 
>>>> Thanks.
>>>> 
>>> 
>>> 
>> 
>> 
> 


Re: High iowait in idle hbase cluster

Posted by Akmal Abbasov <ak...@icloud.com>.
Hi Ted,
No, there is no short-circuit read configured.
The datanode logs on 10.10.8.55 are full of the following messages:
2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
There are >100.000 of them just for today. The situation on the other regionservers is similar.
Node 10.10.8.53 is the hbase-master node, and the process on that port is also hbase-master.
So if there is no load on the cluster, why is there so much IO happening?
Any thoughts?
Thanks.
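
For what it's worth, the block ids in those log lines can be traced back to
HDFS files (a sketch; the datanode log path is illustrative, and fsck with
-files is expensive, so narrowing it to /hbase helps):

  # Count today's read ops recorded by this datanode
  grep -c 'op: HDFS_READ' /var/log/hadoop/hadoop-hdfs-datanode-$(hostname).log

  # Find which file owns a logged block; the owning file name is printed on a
  # line above the matching block line (look further up for multi-block files)
  hdfs fsck /hbase -files -blocks | grep -B 5 'blk_1075349331'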

> On 02 Sep 2015, at 21:57, Ted Yu <yu...@gmail.com> wrote:
> 
> I assume you have enabled short-circuit read.
> 
> Can you capture region server stack trace(s) and pastebin them ?
> 
> Thanks
> 
> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
> Hi Ted,
> I’ve checked the time when addresses were changed, and this strange behaviour started weeks before it.
> 
> Yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
> any thoughts?
> 
> Thanks
> 
>> On 02 Sep 2015, at 18:45, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>> 
>> bq. change the ip addresses of the cluster nodes
>> 
>> Did this happen recently ? If high iowait was observed after the change (you can look at ganglia graph), there is a chance that the change was related.
>> 
>> BTW I assume 10.10.8.55 <http://10.10.8.55:50010/> is where your region server resides.
>> 
>> Cheers
>> 
>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>> Hi Ted,
>> sorry, I forgot to mention
>> 
>>> release of hbase / hadoop you're using
>> 
>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>> 
>>> were region servers doing compaction ?
>> 
>> I’ve run major compactions manually earlier today, but it seems that they already completed, looking at the compactionQueueSize.
>> 
>>> have you checked region server logs ?
>> The datanode logs are full of this kind of message:
>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010 <http://10.10.8.55:50010/>, dest: /10.10.8.54:32959 <http://10.10.8.54:32959/>, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>> 
>> p.s. we had to change the ip addresses of the cluster nodes, is it relevant?
>> 
>> Thanks.
>> 
>>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhihong@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Please provide some more information:
>>> 
>>> release of hbase / hadoop you're using
>>> were region servers doing compaction ?
>>> have you checked region server logs ?
>>> 
>>> Thanks
>>> 
>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com <ma...@icloud.com>> wrote:
>>> Hi,
>>> I’m having strange behaviour in the hbase cluster. It is almost idle, only <5 puts and gets.
>>> But the data in hdfs is increasing, and region servers have very high iowait (>100 on a 2-core CPU).
>>> iotop shows that datanode process is reading and writing all the time.
>>> Any suggestions?
>>> 
>>> Thanks.
>>> 
>> 
>> 
> 
> 


Re: High iowait in idle hbase cluster

Posted by Ted Yu <yu...@gmail.com>.
I assume you have enabled short-circuit read.

Can you capture region server stack trace(s) and pastebin them ?

Thanks
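
A quick sketch of capturing those dumps (the pgrep pattern is illustrative;
any way of finding the region server pid works):

  # Take a few spaced-out thread dumps of the region server JVM
  RS_PID=$(pgrep -f HRegionServer)
  for i in 1 2 3; do
    jstack -l "$RS_PID" > "rs-stack-$i.txt"
    sleep 10
  done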

On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <ak...@icloud.com>
wrote:

> Hi Ted,
> I’ve checked the time when addresses were changed, and this strange
> behaviour started weeks before it.
>
> Yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
> any thoughts?
>
> Thanks
>
> On 02 Sep 2015, at 18:45, Ted Yu <yu...@gmail.com> wrote:
>
> bq. change the ip addresses of the cluster nodes
>
> Did this happen recently ? If high iowait was observed after the change
> (you can look at ganglia graph), there is a chance that the change was
> related.
>
> BTW I assume 10.10.8.55 <http://10.10.8.55:50010/> is where your region
> server resides.
>
> Cheers
>
> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <ak...@icloud.com>
> wrote:
>
>> Hi Ted,
>> sorry, I forgot to mention
>>
>> release of hbase / hadoop you're using
>>
>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>
>> were region servers doing compaction ?
>>
>> I’ve run major compactions manually earlier today, but it seems that they
>> already completed, looking at the compactionQueueSize.
>>
>> have you checked region server logs ?
>>
>> The datanode logs are full of this kind of message:
>> 2015-09-02 16:37:06,950 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>> 10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ,
>> cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID:
>> ee7d0634-89a3-4ada-a8ad-7848217327be, blockid:
>> BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration:
>> 7881815
>>
>> p.s. we had to change the ip addresses of the cluster nodes, is it
>> relevant?
>>
>> Thanks.
>>
>> On 02 Sep 2015, at 18:20, Ted Yu <yu...@gmail.com> wrote:
>>
>> Please provide some more information:
>>
>> release of hbase / hadoop you're using
>> were region servers doing compaction ?
>> have you checked region server logs ?
>>
>> Thanks
>>
>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <ak...@icloud.com>
>> wrote:
>>
>>> Hi,
>>> I’m having strange behaviour in the hbase cluster. It is almost idle,
>>> only <5 puts and gets.
>>> But the data in hdfs is increasing, and region servers have very high
>>> iowait (>100 on a 2-core CPU).
>>> iotop shows that datanode process is reading and writing all the time.
>>> Any suggestions?
>>>
>>> Thanks.
>>
>>
>>
>>
>
>

Re: High iowait in idle hbase cluster

Posted by Ted Yu <yu...@gmail.com>.
I assume you have enabled short-circuit read.

Can you capture region server stack trace(s) and pastebin them ?

Thanks

On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <ak...@icloud.com>
wrote:

> Hi Ted,
> I’ve checked the time when addresses were changed, and this strange
> behaviour started weeks before it.
>
> yes, 10.10.8.55 is region server and 10.10.8.54 is a hbase master.
> any thoughts?
>
> Thanks
>
> On 02 Sep 2015, at 18:45, Ted Yu <yu...@gmail.com> wrote:
>
> bq. change the ip addresses of the cluster nodes
>
> Did this happen recently ? If high iowait was observed after the change
> (you can look at ganglia graph), there is a chance that the change was
> related.
>
> BTW I assume 10.10.8.55 <http://10.10.8.55:50010/> is where your region
> server resides.
>
> Cheers
>
> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <ak...@icloud.com>
> wrote:
>
>> Hi Ted,
>> sorry forget to mention
>>
>> release of hbase / hadoop you're using
>>
>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>
>> were region servers doing compaction ?
>>
>> I’ve run major compactions manually earlier today, but it seems that they
>> already completed, looking at the compactionQueueSize.
>>
>> have you checked region server logs ?
>>
>> The logs of datanode is full of this kind of messages
>> 2015-09-02 16:37:06,950 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>> 10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ,
>> cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID:
>> ee7d0634-89a3-4ada-a8ad-7848217327be, blockid:
>> BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration:
>> 7881815
>>
>> p.s. we had to change the ip addresses of the cluster nodes, is it
>> relevant?
>>
>> Thanks.
>>
>> On 02 Sep 2015, at 18:20, Ted Yu <yu...@gmail.com> wrote:
>>
>> Please provide some more information:
>>
>> release of hbase / hadoop you're using
>> were region servers doing compaction ?
>> have you checked region server logs ?
>>
>> Thanks
>>
>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <ak...@icloud.com>
>> wrote:
>>
>>> Hi,
>>> I’m having strange behaviour in hbase cluster. It is almost idle, only
>>> <5 puts and gets.
>>> But the data in hdfs is increasing, and region servers have very high
>>> iowait(>100, in 2 core CPU).
>>> iotop shows that datanode process is reading and writing all the time.
>>> Any suggestions?
>>>
>>> Thanks.
>>
>>
>>
>>
>
>

Re: High iowait in idle hbase cluster

Posted by Akmal Abbasov <ak...@icloud.com>.
Hi Ted,
I’ve checked the time when the addresses were changed, and this strange behaviour started weeks before the change.

yes, 10.10.8.55 is a region server and 10.10.8.54 is the HBase master.
any thoughts?

Thanks

> On 02 Sep 2015, at 18:45, Ted Yu <yu...@gmail.com> wrote:
> 
> bq. change the ip addresses of the cluster nodes
> 
> Did this happen recently? If the high iowait was first observed after the change (you can look at the Ganglia graphs), there is a chance that the change was related.
> 
> BTW I assume 10.10.8.55 is where your region server resides.
> 
> Cheers
> 
> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
> Hi Ted,
> sorry, I forgot to mention:
> 
>> release of hbase / hadoop you're using
> 
> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
> 
>> were region servers doing compaction ?
> 
> I’ve run major compactions manually earlier today, but judging by the compactionQueueSize, it seems they have already completed.
> 
>> have you checked region server logs ?
> The datanode logs are full of this kind of message:
> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
> 
> p.s. we had to change the IP addresses of the cluster nodes; could that be relevant?
> 
> Thanks.
> 
>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhihong@gmail.com> wrote:
>> 
>> Please provide some more information:
>> 
>> release of hbase / hadoop you're using
>> were region servers doing compaction ?
>> have you checked region server logs ?
>> 
>> Thanks
>> 
>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>> Hi,
>> I’m having strange behaviour in hbase cluster. It is almost idle, only <5 puts and gets.
>> But the data in hdfs is increasing, and region servers have very high iowait(>100, in 2 core CPU).
>> iotop shows that datanode process is reading and writing all the time.
>> Any suggestions?
>> 
>> Thanks.
>> 
> 
> 
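
Since the nodes were renumbered, it may also be worth confirming that forward
and reverse DNS still agree on every host; HBase and HDFS are sensitive to
hostname/IP resolution. A quick sanity check with standard Linux tools:

    hostname -f                    # the name this node reports
    getent hosts 10.10.8.55        # reverse: IP -> name
    getent hosts $(hostname -f)    # forward: name -> IP

On each machine the forward and reverse results should point at the same host.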


Re: High iowait in idle hbase cluster

Posted by Ted Yu <yu...@gmail.com>.
bq. change the ip addresses of the cluster nodes

Did this happen recently? If the high iowait was first observed after the
change (you can look at the Ganglia graphs), there is a chance that the
change was related.

BTW I assume 10.10.8.55 is where your region server resides.

Cheers

On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <ak...@icloud.com>
wrote:

> Hi Ted,
> sorry, I forgot to mention:
>
> release of hbase / hadoop you're using
>
> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>
> were region servers doing compaction ?
>
> I’ve run major compactions manually earlier today, but judging by the
> compactionQueueSize, it seems they have already completed.
>
> have you checked region server logs ?
>
> The datanode logs are full of this kind of message:
> 2015-09-02 16:37:06,950 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ,
> cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID:
> ee7d0634-89a3-4ada-a8ad-7848217327be, blockid:
> BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration:
> 7881815
>
> p.s. we had to change the IP addresses of the cluster nodes; could that
> be relevant?
>
> Thanks.
>
> On 02 Sep 2015, at 18:20, Ted Yu <yu...@gmail.com> wrote:
>
> Please provide some more information:
>
> release of hbase / hadoop you're using
> were region servers doing compaction ?
> have you checked region server logs ?
>
> Thanks
>
> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <ak...@icloud.com>
> wrote:
>
>> Hi,
>> I’m having strange behaviour in hbase cluster. It is almost idle, only <5
>> puts and gets.
>> But the data in hdfs is increasing, and region servers have very high
>> iowait(>100, in 2 core CPU).
>> iotop shows that datanode process is reading and writing all the time.
>> Any suggestions?
>>
>> Thanks.
>
>
>
>
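
Besides Ganglia, the sysstat history on the affected nodes can pin down when
the iowait started (a sketch, assuming sysstat is installed and collecting;
the sa file name depends on the day and the distro):

    iostat -x 5                    # live per-disk utilisation and await
    sar -u -f /var/log/sa/sa02     # CPU/iowait history for a given day

Comparing the first day of elevated %iowait with the date of the IP change
would confirm or rule out the correlation.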

Re: High iowait in idle hbase cluster

Posted by Akmal Abbasov <ak...@icloud.com>.
Hi Ted,
sorry, I forgot to mention:

> release of hbase / hadoop you're using
hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1

> were region servers doing compaction ?
I’ve run major compactions manually earlier today, but judging by the compactionQueueSize, it seems they have already completed.

> have you checked region server logs ?
The datanode logs are full of this kind of message:
2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815

p.s. we had to change the IP addresses of the cluster nodes; could that be relevant?

Thanks.

> On 02 Sep 2015, at 18:20, Ted Yu <yu...@gmail.com> wrote:
> 
> Please provide some more information:
> 
> release of hbase / hadoop you're using
> were region servers doing compaction ?
> have you checked region server logs ?
> 
> Thanks
> 
> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
> Hi,
> I’m having strange behaviour in hbase cluster. It is almost idle, only <5 puts and gets.
> But the data in hdfs is increasing, and region servers have very high iowait(>100, in 2 core CPU).
> iotop shows that datanode process is reading and writing all the time.
> Any suggestions?
> 
> Thanks.
> 
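
Those clienttrace lines can be aggregated to show which clients drive the
reads, and a block id can be mapped back to its file (a sketch; the datanode
log file name varies with the installation, and fsck over the whole namespace
can be slow):

    # count HDFS_READ operations per client id
    grep HDFS_READ hadoop-*-datanode-*.log \
      | awk -F'cliID: ' '{split($2, a, ","); print a[1]}' \
      | sort | uniq -c | sort -rn | head

    # map a block id from the log back to its file
    hdfs fsck / -files -blocks | grep blk_1075277914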


Re: High iowait in idle hbase cluster

Posted by Ted Yu <yu...@gmail.com>.
Please provide some more information:

release of hbase / hadoop you're using
were region servers doing compaction ?
have you checked region server logs ?

Thanks

On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <ak...@icloud.com>
wrote:

> Hi,
> I’m having strange behaviour in hbase cluster. It is almost idle, only <5
> puts and gets.
> But the data in hdfs is increasing, and region servers have very high
> iowait(>100, in 2 core CPU).
> iotop shows that datanode process is reading and writing all the time.
> Any suggestions?
>
> Thanks.
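
On the compaction question: with HBase 0.98 the region server metrics are
exposed over its info server, so the compaction queue can be checked directly
(a sketch; 60030 is the default region server info port, adjust if changed):

    curl -s http://10.10.8.55:60030/jmx | grep -i compactionQueue

A persistently non-zero compaction queue would point at compactions as the
source of the background I/O; a zero queue on an idle cluster points
elsewhere, e.g. at the datanode itself.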
