You are viewing a plain text version of this content. The canonical link for it is here.
Posted to hdfs-dev@hadoop.apache.org by Senthil Kumar <se...@gmail.com> on 2016/09/08 12:45:41 UTC

HDFS Balancer Stuck after 10 Minz

Hi All ,  We are in the situation to balance the cluster data since median
reached 98% .. I started balancer as below

Hadoop Version: Hadoop 2.4.1


/apache/hadoop/sbin/start-balancer.sh   -threshold  30


Once i start balancer it goes will for first 8-10 minutes of time..
Balancer was moving so quickly first 10 minutes.. Not sure whats happening
in the cluster after sometime ( say 10 minz ) , balancer is almost stuck .

Log excerpts :

2016-09-08 04:58:15,653 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_-5830766563502877304_1279767737 with size=134217728 from
10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.27:1004

2016-09-08 04:59:14,426 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_2601479900_1104500421142 with size=268435456 from 10.103.84.51:1004 to
10.142.18.27:1004 through 10.103.84.16:1004

2016-09-08 05:01:15,037 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_3073791211_1104972921837 with size=268435456 from 10.103.21.27:1004 to
10.142.21.56:1004 through 10.103.21.42:1004



[05:16]:[hadoop@lvsaishdc3sn0002:~]$ date

Thu Sep  8 05:16:53 GMT+7 2016

[05:16]:[hadoop@lvsaishdc3sn0002:~]$ jps

1003 Balancer

20388 Jps



Last Block Mover Timestamp     : 05:01

Current Timestamp                    : 05:16


Almost 15 minz no blocks moved by balancer ..  What could be the issue here
??  Restart would help us start moving again..



It’s not event passing iteration 1 ..


I found one thread discussing about the same issue:

http://lucene.472066.n3.nabble.com/A-question-about-Balancer-in-HDFS-td4118505.html


Pls suggest here to balance cluster ..


--Senthil

Re: HDFS Balancer Stuck after 10 Minz

Posted by karthik p <ka...@gmail.com>.
Senthil,

           I don't find any delay from the log trace, you could find the
destination node which it takes longer time to receive blocks.

Check out whether blocks are moving within the rack or outside of it, also
identify if any delay from particular node.

Make sure there is no networks issue on the rack/node. Pls, confirm all the
racks has same network bandwidth.


Thanks,
Karthik

On Thu, Sep 8, 2016 at 10:03 PM, Senthil Kumar <se...@gmail.com>
wrote:

> Here is the cluster space details :
>
> DFS Remaining : 32.50 PB
> DFS Used% : 54.24%
> DFS Remaining% : 45.76%
> Block Pool Used : 38.52 PB
> Block Pool Used% : 54.24%
> DataNodes usages : Min % Median % Max % stdev %
> 0.00% 95.59% 99.92% 36.28%
>
> Also i tried with threshold  value 5 & 10 , ended with same result ..
>
>
> "pool-5-thread-84" prio=10 tid=0x00002ae8780a8800 nid=0x3b22 waiting on
> condition [0x00002ae89910f000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000007062bda50> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.
> java:186)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$
> ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>         at java.util.concurrent.LinkedBlockingQueue.take(
> LinkedBlockingQueue.java:442)
>         at java.util.concurrent.ThreadPoolExecutor.getTask(
> ThreadPoolExecutor.java:1068)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1130)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> "pool-5-thread-83" prio=10 tid=0x00002ae878ba3000 nid=0x3b21 waiting on
> condition [0x00002ae89900e000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>
> most of the threads are waiting for blocks ..
>
>
> --Senthil
>
> On Thu, Sep 8, 2016 at 9:47 PM, Senthil Kumar <se...@gmail.com>
> wrote:
>
>> Thanks  Rakesh .
>>
>> "*Perhaps there could be high chance of searching for data blocks which
>> it can move around to balance the cluster*. "
>>
>> I could see below log statement after enabling DEBUG mode..
>>
>> 2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.Client: IPC Client
>> (685788708) connection to nn-host/10.103.108.
>> 201:8020 from hadoop/host@HOST_DOMAIN sending #49230
>> 2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
>> Call: getBlocks took 0ms
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
>> (685788708) connection to nn-host/10.103.108.
>> 201:8020 from hadoop/host@HOST_DOMAIN sending #49231
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
>> (685788708) connection to nn-host/10.103.108.
>> 201:8020 from hadoop/host@HOST_DOMAIN got value #49229
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
>> Call: getBlocks took 1ms
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
>> (685788708) connection to nn-host/10.103.108.
>> 201:8020 from hadoop/host@HOST_DOMAIN got value #49230
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
>> (685788708) connection to nn-host/10.103.108.
>> 201:8020 from hadoop/host@HOST_DOMAIN sending #49232
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
>> Call: getBlocks took 1ms
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
>> (685788708) connection to nn-host/10.103.108.
>> 201:8020 from hadoop/host@HOST_DOMAIN sending #49233
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
>> (685788708) connection to nn-host/10.103.108.
>> 201:8020 from hadoop/host@HOST_DOMAIN got value #49231
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
>> Call: getBlocks took 0ms
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
>> (685788708) connection to nn-host/10.103.108.
>> 201:8020 from hadoop/host@HOST_DOMAIN sending #49234
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
>> (685788708) connection to nn-host/10.103.108.
>> 201:8020 from hadoop/host@HOST_DOMAIN got value #49232
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
>> Call: getBlocks took 0ms
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
>> (685788708) connection to nn-host/10.103.108.
>> 201:8020 from hadoop/host@HOST_DOMAIN sending #49235
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
>> (685788708) connection to nn-host/10.103.108.
>> 201:8020 from hadoop/host@HOST_DOMAIN got value #49233
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
>> Call: getBlocks took 0ms
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
>> (685788708) connection to nn-host/10.103.108.
>> 201:8020 from hadoop/host@HOST_DOMAIN got value #49234
>>
>>
>> Same getBlocks() call repeating !!!
>>
>> --Senthil
>>
>> On Thu, Sep 8, 2016 at 7:46 PM, Rakesh Radhakrishnan <ra...@apache.org>
>> wrote:
>>
>>> Have you taken multiple thread dumps (jstack) and observed the
>>> operations which are performing during this period of time. Perhaps there
>>> could be high chance of searching for data blocks which it can move around
>>> to balance the cluster.
>>>
>>> Could you tell me the used space and available space values. Have you
>>> tried changing the threshold to a lower value, may be 10 or 5 and what
>>> happens with this value. Also, I think there is no log messages during 15
>>> mins time period, any possibility of enabling debug log priority and try to
>>> dig more about the problem.
>>>
>>>
>>> Rakesh
>>>
>>> On Thu, Sep 8, 2016 at 7:44 PM, Rakesh Radhakrishnan <rakeshr@apache.org
>>> > wrote:
>>>
>>>> Have you taken multiple thread dumps (jstack) and observed the
>>>> operations which are performing during this period of time. Perhaps there
>>>> could be high chance of searching for data blocks which it can move around
>>>> to balance the cluster.
>>>>
>>>> Could you tell me the used space and available space values. Have you
>>>> tried changing the threshold to a lower value, may be 10 or 5 and what
>>>> happens with this value. Also, I think there is no log messages during 15
>>>> mins time period, any possibility of enabling debug log priority and try to
>>>> dig more about the problem.
>>>>
>>>> Rakesh
>>>>
>>>> On Thu, Sep 8, 2016 at 6:15 PM, Senthil Kumar <se...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All ,  We are in the situation to balance the cluster data since
>>>>> median
>>>>> reached 98% .. I started balancer as below
>>>>>
>>>>> Hadoop Version: Hadoop 2.4.1
>>>>>
>>>>>
>>>>> /apache/hadoop/sbin/start-balancer.sh   -threshold  30
>>>>>
>>>>>
>>>>> Once i start balancer it goes will for first 8-10 minutes of time..
>>>>> Balancer was moving so quickly first 10 minutes.. Not sure whats
>>>>> happening
>>>>> in the cluster after sometime ( say 10 minz ) , balancer is almost
>>>>> stuck .
>>>>>
>>>>> Log excerpts :
>>>>>
>>>>> 2016-09-08 04:58:15,653 INFO
>>>>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>>>>> blk_-5830766563502877304_1279767737 with size=134217728 from
>>>>> 10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.27:1004
>>>>>
>>>>> 2016-09-08 04:59:14,426 INFO
>>>>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>>>>> blk_2601479900_1104500421142 with size=268435456 from
>>>>> 10.103.84.51:1004 to
>>>>> 10.142.18.27:1004 through 10.103.84.16:1004
>>>>>
>>>>> 2016-09-08 05:01:15,037 INFO
>>>>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>>>>> blk_3073791211_1104972921837 with size=268435456 from
>>>>> 10.103.21.27:1004 to
>>>>> 10.142.21.56:1004 through 10.103.21.42:1004
>>>>>
>>>>>
>>>>>
>>>>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ date
>>>>>
>>>>> Thu Sep  8 05:16:53 GMT+7 2016
>>>>>
>>>>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ jps
>>>>>
>>>>> 1003 Balancer
>>>>>
>>>>> 20388 Jps
>>>>>
>>>>>
>>>>>
>>>>> Last Block Mover Timestamp     : 05:01
>>>>>
>>>>> Current Timestamp                    : 05:16
>>>>>
>>>>>
>>>>> Almost 15 minz no blocks moved by balancer ..  What could be the issue
>>>>> here
>>>>> ??  Restart would help us start moving again..
>>>>>
>>>>>
>>>>>
>>>>> It’s not event passing iteration 1 ..
>>>>>
>>>>>
>>>>> I found one thread discussing about the same issue:
>>>>>
>>>>> http://lucene.472066.n3.nabble.com/A-question-about-Balancer
>>>>> -in-HDFS-td4118505.html
>>>>>
>>>>>
>>>>> Pls suggest here to balance cluster ..
>>>>>
>>>>>
>>>>> --Senthil
>>>>>
>>>>
>>>>
>>>
>>
>


-- 

Thanks and Regards

Karthik P

+91 9940089181

SkypeId - karthik.p01

Re: HDFS Balancer Stuck after 10 Minz

Posted by Senthil Kumar <se...@gmail.com>.
Here is the cluster space details :

DFS Remaining : 32.50 PB
DFS Used% : 54.24%
DFS Remaining% : 45.76%
Block Pool Used : 38.52 PB
Block Pool Used% : 54.24%
DataNodes usages : Min % Median % Max % stdev %
0.00% 95.59% 99.92% 36.28%

Also i tried with threshold  value 5 & 10 , ended with same result ..


"pool-5-thread-84" prio=10 tid=0x00002ae8780a8800 nid=0x3b22 waiting on
condition [0x00002ae89910f000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000007062bda50> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

"pool-5-thread-83" prio=10 tid=0x00002ae878ba3000 nid=0x3b21 waiting on
condition [0x00002ae89900e000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)

most of the threads are waiting for blocks ..


--Senthil

On Thu, Sep 8, 2016 at 9:47 PM, Senthil Kumar <se...@gmail.com>
wrote:

> Thanks  Rakesh .
>
> "*Perhaps there could be high chance of searching for data blocks which
> it can move around to balance the cluster*. "
>
> I could see below log statement after enabling DEBUG mode..
>
> 2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN sending #49230
> 2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
> Call: getBlocks took 0ms
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN sending #49231
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN got value #49229
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
> Call: getBlocks took 1ms
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN got value #49230
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN sending #49232
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
> Call: getBlocks took 1ms
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN sending #49233
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN got value #49231
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
> Call: getBlocks took 0ms
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN sending #49234
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN got value #49232
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
> Call: getBlocks took 0ms
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN sending #49235
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN got value #49233
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
> Call: getBlocks took 0ms
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN got value #49234
>
>
> Same getBlocks() call repeating !!!
>
> --Senthil
>
> On Thu, Sep 8, 2016 at 7:46 PM, Rakesh Radhakrishnan <ra...@apache.org>
> wrote:
>
>> Have you taken multiple thread dumps (jstack) and observed the operations
>> which are performing during this period of time. Perhaps there could be
>> high chance of searching for data blocks which it can move around to
>> balance the cluster.
>>
>> Could you tell me the used space and available space values. Have you
>> tried changing the threshold to a lower value, may be 10 or 5 and what
>> happens with this value. Also, I think there is no log messages during 15
>> mins time period, any possibility of enabling debug log priority and try to
>> dig more about the problem.
>>
>>
>> Rakesh
>>
>> On Thu, Sep 8, 2016 at 7:44 PM, Rakesh Radhakrishnan <ra...@apache.org>
>> wrote:
>>
>>> Have you taken multiple thread dumps (jstack) and observed the
>>> operations which are performing during this period of time. Perhaps there
>>> could be high chance of searching for data blocks which it can move around
>>> to balance the cluster.
>>>
>>> Could you tell me the used space and available space values. Have you
>>> tried changing the threshold to a lower value, may be 10 or 5 and what
>>> happens with this value. Also, I think there is no log messages during 15
>>> mins time period, any possibility of enabling debug log priority and try to
>>> dig more about the problem.
>>>
>>> Rakesh
>>>
>>> On Thu, Sep 8, 2016 at 6:15 PM, Senthil Kumar <se...@gmail.com>
>>> wrote:
>>>
>>>> Hi All ,  We are in the situation to balance the cluster data since
>>>> median
>>>> reached 98% .. I started balancer as below
>>>>
>>>> Hadoop Version: Hadoop 2.4.1
>>>>
>>>>
>>>> /apache/hadoop/sbin/start-balancer.sh   -threshold  30
>>>>
>>>>
>>>> Once i start balancer it goes will for first 8-10 minutes of time..
>>>> Balancer was moving so quickly first 10 minutes.. Not sure whats
>>>> happening
>>>> in the cluster after sometime ( say 10 minz ) , balancer is almost
>>>> stuck .
>>>>
>>>> Log excerpts :
>>>>
>>>> 2016-09-08 04:58:15,653 INFO
>>>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>>>> blk_-5830766563502877304_1279767737 with size=134217728 from
>>>> 10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.27:1004
>>>>
>>>> 2016-09-08 04:59:14,426 INFO
>>>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>>>> blk_2601479900_1104500421142 with size=268435456 from 10.103.84.51:1004
>>>> to
>>>> 10.142.18.27:1004 through 10.103.84.16:1004
>>>>
>>>> 2016-09-08 05:01:15,037 INFO
>>>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>>>> blk_3073791211_1104972921837 with size=268435456 from 10.103.21.27:1004
>>>> to
>>>> 10.142.21.56:1004 through 10.103.21.42:1004
>>>>
>>>>
>>>>
>>>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ date
>>>>
>>>> Thu Sep  8 05:16:53 GMT+7 2016
>>>>
>>>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ jps
>>>>
>>>> 1003 Balancer
>>>>
>>>> 20388 Jps
>>>>
>>>>
>>>>
>>>> Last Block Mover Timestamp     : 05:01
>>>>
>>>> Current Timestamp                    : 05:16
>>>>
>>>>
>>>> Almost 15 minz no blocks moved by balancer ..  What could be the issue
>>>> here
>>>> ??  Restart would help us start moving again..
>>>>
>>>>
>>>>
>>>> It’s not event passing iteration 1 ..
>>>>
>>>>
>>>> I found one thread discussing about the same issue:
>>>>
>>>> http://lucene.472066.n3.nabble.com/A-question-about-Balancer
>>>> -in-HDFS-td4118505.html
>>>>
>>>>
>>>> Pls suggest here to balance cluster ..
>>>>
>>>>
>>>> --Senthil
>>>>
>>>
>>>
>>
>

Re: HDFS Balancer Stuck after 10 Minz

Posted by Senthil Kumar <se...@gmail.com>.
Thanks  Rakesh .

"*Perhaps there could be high chance of searching for data blocks which it
can move around to balance the cluster*. "

I could see below log statement after enabling DEBUG mode..

2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49230
2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49231
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49229
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
Call: getBlocks took 1ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49230
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49232
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
Call: getBlocks took 1ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49233
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49231
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49234
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49232
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49235
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49233
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49234


Same getBlocks() call repeating !!!

--Senthil

On Thu, Sep 8, 2016 at 7:46 PM, Rakesh Radhakrishnan <ra...@apache.org>
wrote:

> Have you taken multiple thread dumps (jstack) and observed the operations
> which are performing during this period of time. Perhaps there could be
> high chance of searching for data blocks which it can move around to
> balance the cluster.
>
> Could you tell me the used space and available space values. Have you
> tried changing the threshold to a lower value, may be 10 or 5 and what
> happens with this value. Also, I think there is no log messages during 15
> mins time period, any possibility of enabling debug log priority and try to
> dig more about the problem.
>
>
> Rakesh
>
> On Thu, Sep 8, 2016 at 7:44 PM, Rakesh Radhakrishnan <ra...@apache.org>
> wrote:
>
>> Have you taken multiple thread dumps (jstack) and observed the operations
>> which are performing during this period of time. Perhaps there could be
>> high chance of searching for data blocks which it can move around to
>> balance the cluster.
>>
>> Could you tell me the used space and available space values. Have you
>> tried changing the threshold to a lower value, may be 10 or 5 and what
>> happens with this value. Also, I think there is no log messages during 15
>> mins time period, any possibility of enabling debug log priority and try to
>> dig more about the problem.
>>
>> Rakesh
>>
>> On Thu, Sep 8, 2016 at 6:15 PM, Senthil Kumar <se...@gmail.com>
>> wrote:
>>
>>> Hi All ,  We are in the situation to balance the cluster data since
>>> median
>>> reached 98% .. I started balancer as below
>>>
>>> Hadoop Version: Hadoop 2.4.1
>>>
>>>
>>> /apache/hadoop/sbin/start-balancer.sh   -threshold  30
>>>
>>>
>>> Once i start balancer it goes will for first 8-10 minutes of time..
>>> Balancer was moving so quickly first 10 minutes.. Not sure whats
>>> happening
>>> in the cluster after sometime ( say 10 minz ) , balancer is almost stuck
>>> .
>>>
>>> Log excerpts :
>>>
>>> 2016-09-08 04:58:15,653 INFO
>>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>>> blk_-5830766563502877304_1279767737 with size=134217728 from
>>> 10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.27:1004
>>>
>>> 2016-09-08 04:59:14,426 INFO
>>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>>> blk_2601479900_1104500421142 with size=268435456 from 10.103.84.51:1004
>>> to
>>> 10.142.18.27:1004 through 10.103.84.16:1004
>>>
>>> 2016-09-08 05:01:15,037 INFO
>>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>>> blk_3073791211_1104972921837 with size=268435456 from 10.103.21.27:1004
>>> to
>>> 10.142.21.56:1004 through 10.103.21.42:1004
>>>
>>>
>>>
>>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ date
>>>
>>> Thu Sep  8 05:16:53 GMT+7 2016
>>>
>>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ jps
>>>
>>> 1003 Balancer
>>>
>>> 20388 Jps
>>>
>>>
>>>
>>> Last Block Mover Timestamp     : 05:01
>>>
>>> Current Timestamp                    : 05:16
>>>
>>>
>>> Almost 15 minz no blocks moved by balancer ..  What could be the issue
>>> here
>>> ??  Restart would help us start moving again..
>>>
>>>
>>>
>>> It’s not event passing iteration 1 ..
>>>
>>>
>>> I found one thread discussing about the same issue:
>>>
>>> http://lucene.472066.n3.nabble.com/A-question-about-Balancer
>>> -in-HDFS-td4118505.html
>>>
>>>
>>> Pls suggest here to balance cluster ..
>>>
>>>
>>> --Senthil
>>>
>>
>>
>

Re: HDFS Balancer Stuck after 10 Minz

Posted by Rakesh Radhakrishnan <ra...@apache.org>.
Have you taken multiple thread dumps (jstack) and observed the operations
which are performing during this period of time. Perhaps there could be
high chance of searching for data blocks which it can move around to
balance the cluster.

Could you tell me the used space and available space values. Have you tried
changing the threshold to a lower value, may be 10 or 5 and what happens
with this value. Also, I think there is no log messages during 15 mins time
period, any possibility of enabling debug log priority and try to dig more
about the problem.


Rakesh

On Thu, Sep 8, 2016 at 7:44 PM, Rakesh Radhakrishnan <ra...@apache.org>
wrote:

> Have you taken multiple thread dumps (jstack) and observed the operations
> which are performing during this period of time. Perhaps there could be
> high chance of searching for data blocks which it can move around to
> balance the cluster.
>
> Could you tell me the used space and available space values. Have you
> tried changing the threshold to a lower value, may be 10 or 5 and what
> happens with this value. Also, I think there is no log messages during 15
> mins time period, any possibility of enabling debug log priority and try to
> dig more about the problem.
>
> Rakesh
>
> On Thu, Sep 8, 2016 at 6:15 PM, Senthil Kumar <se...@gmail.com>
> wrote:
>
>> Hi All ,  We are in the situation to balance the cluster data since median
>> reached 98% .. I started balancer as below
>>
>> Hadoop Version: Hadoop 2.4.1
>>
>>
>> /apache/hadoop/sbin/start-balancer.sh   -threshold  30
>>
>>
>> Once i start balancer it goes will for first 8-10 minutes of time..
>> Balancer was moving so quickly first 10 minutes.. Not sure whats happening
>> in the cluster after sometime ( say 10 minz ) , balancer is almost stuck .
>>
>> Log excerpts :
>>
>> 2016-09-08 04:58:15,653 INFO
>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>> blk_-5830766563502877304_1279767737 with size=134217728 from
>> 10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.27:1004
>>
>> 2016-09-08 04:59:14,426 INFO
>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>> blk_2601479900_1104500421142 with size=268435456 from 10.103.84.51:1004
>> to
>> 10.142.18.27:1004 through 10.103.84.16:1004
>>
>> 2016-09-08 05:01:15,037 INFO
>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>> blk_3073791211_1104972921837 with size=268435456 from 10.103.21.27:1004
>> to
>> 10.142.21.56:1004 through 10.103.21.42:1004
>>
>>
>>
>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ date
>>
>> Thu Sep  8 05:16:53 GMT+7 2016
>>
>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ jps
>>
>> 1003 Balancer
>>
>> 20388 Jps
>>
>>
>>
>> Last Block Mover Timestamp     : 05:01
>>
>> Current Timestamp                    : 05:16
>>
>>
>> Almost 15 minz no blocks moved by balancer ..  What could be the issue
>> here
>> ??  Restart would help us start moving again..
>>
>>
>>
>> It’s not event passing iteration 1 ..
>>
>>
>> I found one thread discussing about the same issue:
>>
>> http://lucene.472066.n3.nabble.com/A-question-about-Balancer
>> -in-HDFS-td4118505.html
>>
>>
>> Pls suggest here to balance cluster ..
>>
>>
>> --Senthil
>>
>
>

Re: HDFS Balancer Stuck after 10 Minz

Posted by Rakesh Radhakrishnan <ra...@apache.org>.
Have you taken multiple thread dumps (jstack) and observed the operations
which are performing during this period of time. Perhaps there could be
high chance of searching for data blocks which it can move around to
balance the cluster.

Could you tell me the used space and available space values. Have you tried
changing the threshold to a lower value, may be 10 or 5 and what happens
with this value. Also, I think there is no log messages during 15 mins time
period, any possibility of enabling debug log priority and try to dig more
about the problem.

Rakesh

On Thu, Sep 8, 2016 at 6:15 PM, Senthil Kumar <se...@gmail.com>
wrote:

> Hi All ,  We are in the situation to balance the cluster data since median
> reached 98% .. I started balancer as below
>
> Hadoop Version: Hadoop 2.4.1
>
>
> /apache/hadoop/sbin/start-balancer.sh   -threshold  30
>
>
> Once i start balancer it goes will for first 8-10 minutes of time..
> Balancer was moving so quickly first 10 minutes.. Not sure whats happening
> in the cluster after sometime ( say 10 minz ) , balancer is almost stuck .
>
> Log excerpts :
>
> 2016-09-08 04:58:15,653 INFO
> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
> blk_-5830766563502877304_1279767737 with size=134217728 from
> 10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.27:1004
>
> 2016-09-08 04:59:14,426 INFO
> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
> blk_2601479900_1104500421142 with size=268435456 from 10.103.84.51:1004 to
> 10.142.18.27:1004 through 10.103.84.16:1004
>
> 2016-09-08 05:01:15,037 INFO
> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
> blk_3073791211_1104972921837 with size=268435456 from 10.103.21.27:1004 to
> 10.142.21.56:1004 through 10.103.21.42:1004
>
>
>
> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ date
>
> Thu Sep  8 05:16:53 GMT+7 2016
>
> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ jps
>
> 1003 Balancer
>
> 20388 Jps
>
>
>
> Last Block Mover Timestamp     : 05:01
>
> Current Timestamp                    : 05:16
>
>
> Almost 15 minz no blocks moved by balancer ..  What could be the issue here
> ??  Restart would help us start moving again..
>
>
>
> It’s not event passing iteration 1 ..
>
>
> I found one thread discussing about the same issue:
>
> http://lucene.472066.n3.nabble.com/A-question-about-
> Balancer-in-HDFS-td4118505.html
>
>
> Pls suggest here to balance cluster ..
>
>
> --Senthil
>