Posted to user@hbase.apache.org by Павел Мезенцев <pa...@mezentsev.org> on 2014/07/22 12:59:40 UTC

hbase cluster working bad

Hello all!

We are having trouble with HBase.
Our Hadoop cluster has 4 nodes (plus 1 client node).
It runs CDH 4.6 with Cloudera Manager 4.7.
Component versions are:
 - hadoop-hdfs : 2.0.0+1475
 - hadoop-0.20-mapreduce : 2.0.0+1475
 - hbase : 0.94.6+132
The Hadoop and HBase configs are attached.

We have several tables in HBase with a total volume of 2 TB.
We run MapReduce ETL jobs and analytics queries over them.

There are a lot of warnings like:
- The health test result for REGION_SERVER_READ_LATENCY has become bad:
  The moving average of HDFS read latency is 162 millisecond(s) over the
  previous 5 minute(s). Critical threshold: 100.
- The health test result for REGION_SERVER_SYNC_LATENCY has become bad:
  The moving average of HDFS sync latency is 8.2 second(s) over the
  previous 5 minute(s). Critical threshold: 5,000.
- HBase region health: 442 unhealthy regions
- HDFS_DATA_NODES_HEALTHY has become bad
- HBase Region Health Canary is running slowly on the cluster
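For reference, the "moving average ... over the previous 5 minute(s)" wording in these alerts suggests the health check averages recent latency samples and compares the result to the critical threshold. A minimal plain-Java sketch of that logic (the window size and sample values below are made up for illustration, not taken from our cluster or from Cloudera Manager's actual implementation):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class LatencyHealthCheck {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private final double criticalThresholdMs;

    LatencyHealthCheck(int windowSize, double criticalThresholdMs) {
        this.windowSize = windowSize;
        this.criticalThresholdMs = criticalThresholdMs;
    }

    /** Record one latency sample; report whether the moving average
     *  over the last windowSize samples exceeds the critical threshold. */
    boolean recordAndCheck(double latencyMs) {
        window.addLast(latencyMs);
        if (window.size() > windowSize) {
            window.removeFirst();          // keep only the recent window
        }
        double sum = 0;
        for (double v : window) sum += v;
        return (sum / window.size()) > criticalThresholdMs;
    }

    public static void main(String[] args) {
        // REGION_SERVER_READ_LATENCY-style check: critical threshold 100 ms.
        LatencyHealthCheck check = new LatencyHealthCheck(5, 100.0);
        double[] samples = {80, 90, 150, 200, 290};  // made-up read latencies
        for (double s : samples) {
            System.out.println(s + " ms -> bad=" + check.recordAndCheck(s));
        }
    }
}
```

Note that the average of the five made-up samples is 162 ms, i.e. the alert above fires even though the first samples were individually under the threshold.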

MapReduce jobs that make random queries to HBase run very slowly
(a job is only 20% complete after 18 hours, versus 100% in 12 hours on a
comparable cluster).

Please help us find the causes of these alerts and speed up the cluster.
Could you give us some good advice on what we should do?

Cheers,
Mezentsev Pavel

Re: hbase cluster working bad

Posted by Dhaval Shah <pr...@yahoo.co.in>.
We just solved a very similar issue with our cluster (yesterday!). I would suggest you look at 2 things in particular:
- Is the network on your region server saturated? That would prevent connections from being made.
- See if the region server has any RPC handlers available when you get this error. It's possible that all RPC handlers are busy servicing other requests (or stuck due to a combination of load and bad configs).
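For the first check, a quick way to see whether a region server's RPC port even accepts TCP connections within the client timeout is a plain socket probe. This is just a hand-rolled sketch, not an HBase API; 60020 is the default region server RPC port in 0.94, and the host name is whatever region server you want to test:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class RpcProbe {
    /** Try to open a TCP connection to host:port within timeoutMs.
     *  Returns true if the connect succeeds, false on timeout or refusal. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical region server host; pass your own as the first arg.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(canConnect(host, 60020, 3000)
                ? "RPC port reachable" : "RPC port not reachable within 3s");
    }
}
```

If the connect itself stalls, suspect the network or a wedged server process; if it succeeds instantly but requests still time out, the handlers are the more likely bottleneck.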
 

Regards,
Dhaval


Re: hbase cluster working bad

Posted by Esteban Gutierrez <es...@cloudera.com>.
-user@hbase (bcc), +cdh-user

Hello Pavel,

I'm moving your question to the cdh-user@cloudera.org mailing list since
it's more related to a specific Hadoop distribution. However, from the
symptoms it looks like there is some contention (probably in HDFS or
something else) that is causing the region servers to become unresponsive,
which ripples out to the map tasks when they try to fetch data from HBase.

Regards,
Esteban.


--
Cloudera, Inc.



On Tue, Jul 22, 2014 at 4:46 AM, Павел Мезенцев <pa...@mezentsev.org> wrote:

> Jobs, running on this cluster, print exceptions:
>
> java.util.concurrent.ExecutionException: java.net.SocketTimeoutException:
> Call to ds-hadoop-wk01p.tcsbank.ru/10.218.64.11:60020 failed on socket
> timeout exception: java.net.SocketTimeoutException: 60000 millis timeout
> while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.218.64.14:38621
> remote=
> ds-hadoop-wk01p.tcsbank.ru/10.218.64.11:60020]
>
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:188)
>         at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1569)
>         at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1421)
>         at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:739)
>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:708)
>         at
> org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:367)
>         at
> ru.tcsbank.hbase.HBasePersonDao.getUsersBatch(HBasePersonDao.java:306)
>         at
> ru.tcsbank.matching.PersonMatcher.performSolrRequest(PersonMatcher.java:153)
>         at ru.tcsbank.matching.PersonMatcher.search(PersonMatcher.java:135)
>         at
> ru.tcsbank.personmatcher.mr.PersonMatcherJob$MapClass.map(PersonMatcherJob.java:80)
>         at
> ru.tcsbank.personmatcher.mr.PersonMatcherJob$MapClass.map(PersonMatcherJob.java:65)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>         at org.apache.hadoop.mapred.Child.main(Child.java:262)
>
>
>
>
>
> С уважением,
> Мезенцев Павел
>
>
> 2014-07-22 14:59 GMT+04:00 Павел Мезенцев <pa...@mezentsev.org>:
>
> > Hello all!
> >
> > We have a trouble with hbase
> > Our hadoop cluster has 4 nodes (plus 1 client node).
> > There are CHD 4.6 + CM 4.7 hadoop installed
> > Hadoop versions are:
> >  - hadoop-hdfs : 2.0.0+1475
> >  - hadoop-0.20-mapreduce : 2.0.0+1475
> >  - hbase" : 0.94.6+132
> > Hadoop and hBase configs are in attachment
> >
> > We have several tables in hbase with total volume of 2 Tb.
> > We run mapReduce ETL jobs and analytics queries over them.
> >
> > There are a lot of warnings like
> > - *The health test result for REGION_SERVER_READ_LATENCY has become bad:
> > The moving average of HDFS read latency is 162 millisecond(s) over the
> > previous 5 minute(s). Critical threshold: 100*.
> > - *The health test result for REGION_SERVER_SYNC_LATENCY has become bad:
> > The moving average of HDFS sync latency is 8.2 second(s) over the
> previous
> > 5 minute(s). Critical threshold: 5,000*.
> > *- HBase region health: 442 unhealthy regions *
> > *- HDFS_DATA_NODES_HEALTHY has become bad*
> > *- HBase Region Health Canary is running slowly **on the cluster*
> >
> > mapReduce jobs over hBase with random queries to hBase working very
> slowly
> > (job is completed on 20% after 18 hours versus 100% after 12 hours on
> > analogue cluster)
> >
> > Please help use to solve reasons of this alerts and speed up the cluster.
> > Could you give us a good advise, what shall we do?
> >
> > Cheers,
> > Mezentsev Pavel
> >
> >
>

Re: hbase cluster working bad

Posted by Павел Мезенцев <pa...@mezentsev.org>.
Jobs running on this cluster print exceptions like:

java.util.concurrent.ExecutionException: java.net.SocketTimeoutException:
Call to ds-hadoop-wk01p.tcsbank.ru/10.218.64.11:60020 failed on socket
timeout exception: java.net.SocketTimeoutException: 60000 millis timeout
while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.218.64.14:38621 remote=
ds-hadoop-wk01p.tcsbank.ru/10.218.64.11:60020]

	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:188)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1569)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1421)
	at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:739)
	at org.apache.hadoop.hbase.client.HTable.get(HTable.java:708)
	at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:367)
	at ru.tcsbank.hbase.HBasePersonDao.getUsersBatch(HBasePersonDao.java:306)
	at ru.tcsbank.matching.PersonMatcher.performSolrRequest(PersonMatcher.java:153)
	at ru.tcsbank.matching.PersonMatcher.search(PersonMatcher.java:135)
	at ru.tcsbank.personmatcher.mr.PersonMatcherJob$MapClass.map(PersonMatcherJob.java:80)
	at ru.tcsbank.personmatcher.mr.PersonMatcherJob$MapClass.map(PersonMatcherJob.java:65)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
	at org.apache.hadoop.mapred.Child.main(Child.java:262)
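The 60000 millis in this trace matches the default hbase.rpc.timeout of 60 seconds. As a stopgap while the underlying contention is investigated, the client-side timeout can be raised in hbase-site.xml; the 120000 value here is only an illustration, not a recommendation:

```xml
<property>
  <name>hbase.rpc.timeout</name>
  <!-- default is 60000 ms; raised here only to ride out slow region servers -->
  <value>120000</value>
</property>
```

This only hides the symptom (slow tasks will wait longer instead of failing), so the latency alerts above still need a root cause.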





Best regards,
Pavel Mezentsev

