Posted to user@hbase.apache.org by Rod Cope <ro...@openlogic.com> on 2010/03/03 23:22:43 UTC

Neverending DataNode.clienttrace logging after starting Hbase

We have a problem on our Hadoop cluster where a random subset of our
datanode logs are filling up with GBs of clienttrace messages after
starting HBase.  The logs are fine, fsck shows a perfect report, and all is
well until HBase is started.  Without running any MR jobs or using
HBase/HDFS, we've let it go for 15+ hours without seeing any change in
logging behavior.  The datanode logs look like this:

...
2010-03-03 09:50:16,738 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.60.106:50010, dest: /192.168.60.106:47445, bytes: 23710, op:
HDFS_READ, cliID: DFSClient_1676127095, offset: 82432, srvID:
DS-35570150-192.168.60.106-50010-1267000204488, blockid:
blk_3353532406500894348_402571, duration: 93000
2010-03-03 09:50:16,739 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.60.106:50010, dest: /192.168.60.106:47446, bytes: 226458, op:
HDFS_READ, cliID: DFSClient_1676127095, offset: 1326592, srvID:
DS-35570150-192.168.60.106-50010-1267000204488, blockid:
blk_1719430407401031560_402554, duration: 256000
2010-03-03 09:50:16,743 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.60.106:50010, dest: /192.168.60.106:47449, bytes: 5073, op:
HDFS_READ, cliID: DFSClient_1676127095, offset: 16384, srvID:
DS-35570150-192.168.60.106-50010-1267000204488, blockid:
blk_-4376123158716665125_402579, duration: 106000
2010-03-03 09:50:16,743 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.60.106:50010, dest: /192.168.60.106:47450, bytes: 7974, op:
HDFS_READ, cliID: DFSClient_1676127095, offset: 24576, srvID:
DS-35570150-192.168.60.106-50010-1267000204488, blockid:
blk_-4354660273604489760_402576, duration: 118000
2010-03-03 09:50:16,744 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.60.106:50010, dest: /192.168.60.106:47451, bytes: 23710, op:
HDFS_READ, cliID: DFSClient_1676127095, offset: 82432, srvID:
DS-35570150-192.168.60.106-50010-1267000204488, blockid:
blk_3353532406500894348_402571, duration: 88000
2010-03-03 09:50:16,745 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.60.106:50010, dest: /192.168.60.106:47452, bytes: 226458, op:
HDFS_READ, cliID: DFSClient_1676127095, offset: 1326592, srvID:
DS-35570150-192.168.60.106-50010-1267000204488, blockid:
blk_1719430407401031560_402554, duration: 253000
2010-03-03 09:50:16,749 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.60.106:50010, dest: /192.168.60.106:47455, bytes: 5073, op:
HDFS_READ, cliID: DFSClient_1676127095, offset: 16384, srvID:
DS-35570150-192.168.60.106-50010-1267000204488, blockid:
blk_-4376123158716665125_402579, duration: 81000
2010-03-03 09:50:16,749 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.60.106:50010, dest: /192.168.60.106:47456, bytes: 7974, op:
HDFS_READ, cliID: DFSClient_1676127095, offset: 24576, srvID:
DS-35570150-192.168.60.106-50010-1267000204488, blockid:
blk_-4354660273604489760_402576, duration: 76000
2010-03-03 09:50:16,750 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.60.106:50010, dest: /192.168.60.106:47457, bytes: 23710, op:
HDFS_READ, cliID: DFSClient_1676127095, offset: 82432, srvID:
DS-35570150-192.168.60.106-50010-1267000204488, blockid:
blk_3353532406500894348_402571, duration: 91000
2010-03-03 09:50:16,750 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.60.106:50010, dest: /192.168.60.106:47458, bytes: 226458, op:
HDFS_READ, cliID: DFSClient_1676127095, offset: 1326592, srvID:
DS-35570150-192.168.60.106-50010-1267000204488, blockid:
blk_1719430407401031560_402554, duration: 237000
...

The logs (and top and tcptrack) show bursts of activity for perhaps 15
seconds at a time; then the activity stops for 30 seconds or so and starts
all over again.  Frequently, there will be a line or two like this after each burst:
2010-03-03 09:52:35,255 INFO
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
succeeded for blk_8884930029593424313_289042

When we restart the cluster, the problem usually moves to a different random
subset of nodes.  This has happened at least 5 times and there's no way
we've been able to predict which machine(s) will have this weird logging
behavior.

I've noticed that the cliID in the logs is identical across the machines
exhibiting this behavior.  Is there a way to tell which client that is?  As
I said, we don't have any MR jobs running.
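
One way to tell (a sketch, assuming a Linux host with net-tools/lsof
installed): the "dest" field in each clienttrace line is the client's socket,
and here it is on the same host as the datanode, so the client process is
local.  During a burst, listing connections to the datanode port shows which
PID/command is doing the reading:

# Run on a noisy datanode host while a burst is in progress.
sudo netstat -tnp | grep ':50010'
# or, equivalently:
sudo lsof -nP -iTCP:50010
# The owning process (for example a regionserver JVM) is the DFSClient
# behind the cliID above.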

We're on the latest CDH test release (February 9, 0.20.1+169.56), which is
pretty close to Hadoop 0.20.2.  The data in HDFS is almost entirely HBase
(0.20.3), but there are a few other blocks of data.  The problems just
started in the last few days.  We've recently gone through some very heavy
data loading into HBase, but hadn't seen this problem in the past when we've
done the same thing.

The data nodes and regionservers stay up without any errors and I can run
counts from the shell and Hadoop jobs that read every row of my largest
table without issues, but each burst of logging activity is accompanied by a
spike in CPU, network, and disk usage that impacts performance, and I'm
afraid it will cause timeouts or other problems.  We're also getting
datanode log files that run into several GBs a day.

Any ideas?

Thanks,
Rod

Re: Neverending DataNode.clienttrace logging after starting Hbase

Posted by Stack <st...@duboce.net>.
On Thu, Mar 4, 2010 at 6:39 AM, Rod Cope <ro...@openlogic.com> wrote:
> St.Ack,
>
> I could turn off the logging through log4j.properties,


As part of our upcoming move to Hadoop 0.20.2, we should include a note on
how to do this and recommend that folks indeed do it, since it's so
chatty.
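
For reference, a minimal sketch of that log4j change (assuming the stock
Hadoop conf/log4j.properties; the logger name comes straight from the log
lines above).  Raising its level to WARN suppresses the per-read/write INFO
clienttrace lines without touching other DataNode logging:

# Add to conf/log4j.properties on each datanode, then restart the datanode.
log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN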

> but I'm more
> concerned about the fact that the logging never stops or slows down even
> when the system has been idle for 20+ hours.
>

Even on a quiescent system, there are a few threads in HBase running
periodic checks.  For example, in the master, a thread runs each minute
checking the -ROOT- and .META. tables.  At a minimum, that's a scan of
each of those tables, which will show up as clienttrace reads.

You could try correlating requests in the UI with logging in the datanode?
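
As a starting point, something like the following shows whether the
clienttrace bursts line up with a once-a-minute cadence (a sketch; the awk
field matches the log format above, and the log path is an assumption to
adjust for your install):

# Count clienttrace lines per minute in the datanode log.
grep 'clienttrace' /var/log/hadoop/*datanode*.log \
  | awk '{print substr($2, 1, 5)}' \
  | sort | uniq -c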

> Is there a reason why HDFS would be so active when no nodes have been added,
> removed, or changed and "hadoop fsck /" has been showing a perfect report
> since the cluster was started?  Does the fact that restarting the cluster
> causes the logging behavior to move to a different random subset of nodes
> provide any clues?
>

Probably because a different regionserver is now hosting .META., and
its closest replica is now at a different datanode?  Could that be
what's going on?
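
One way to check that theory (a sketch; /hbase assumes the default
hbase.rootdir, so adjust the path to yours):

# List which datanodes hold the blocks backing the .META. table's files.
# If the noisy datanodes match these locations, the once-a-minute .META.
# scans would explain the clienttrace traffic landing there.
hadoop fsck /hbase/.META. -files -blocks -locations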

> Also, this behavior only started recently after doing some heavy HBase
> loads.  And if I stop HBase, this HDFS activity also stops - yet there's no
> strange activity or any exceptions in the regionserver logs.
>

I'd think you'd have seen these logs as soon as you installed an HDFS
that had HADOOP-3062.
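
A quick way to confirm that (a sketch; the log directory and file naming are
assumptions based on a typical CDH layout):

# Count clienttrace lines in current and rotated datanode logs.  Non-zero
# counts in the older files would mean the tracing predates the recent
# heavy load.
grep -c 'clienttrace' /var/log/hadoop/*datanode*.log* 2>/dev/null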


> Are there any other logs I should look in or config I should check?  I'd
> also be happy to shut down the cluster, copy the /hbase directory and
> restart, or similar if you think that might help.
>

Try doing the correlation between UI requests and datanode logging as
suggested above.  Start there, I'd say.

Yours,
St.Ack


Re: Neverending DataNode.clienttrace logging after starting Hbase

Posted by Rod Cope <ro...@openlogic.com>.
St.Ack,

I could turn off the logging through log4j.properties, but I'm more
concerned about the fact that the logging never stops or slows down even
when the system has been idle for 20+ hours.

Is there a reason why HDFS would be so active when no nodes have been added,
removed, or changed and "hadoop fsck /" has been showing a perfect report
since the cluster was started?  Does the fact that restarting the cluster
causes the logging behavior to move to a different random subset of nodes
provide any clues? 

Also, this behavior only started recently after doing some heavy HBase
loads.  And if I stop HBase, this HDFS activity also stops - yet there's no
strange activity or any exceptions in the regionserver logs.

Are there any other logs I should look in or config I should check?  I'd
also be happy to shut down the cluster, copy the /hbase directory and
restart, or similar if you think that might help.

Thanks,
Rod


Re: Neverending DataNode.clienttrace logging after starting Hbase

Posted by Stack <st...@duboce.net>.
This looks like the product of the addition below:

------------------------------------------------------------------------
r687868 | cdouglas | 2008-08-21 14:27:31 -0700 (Thu, 21 Aug 2008) | 3 lines

HADOOP-3062. Add metrics to DataNode and TaskTracker to record network
traffic for HDFS reads/writes and MR shuffling.


Your problem is that the quantity of traces is overwhelming?  If so, we
should make a patch for 0.20 Hadoop to turn off the above tracing when
it's not wanted.

St.Ack


