Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2011/05/04 21:58:39 UTC

[Hadoop Wiki] Update of "Hbase/Troubleshooting" by DougMeil

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hbase/Troubleshooting" page has been changed by DougMeil.
The comment on this change is: After meeting with Stack, updates to page (removing some things that are really old).  More to come..
http://wiki.apache.org/hadoop/Hbase/Troubleshooting?action=diff&rev1=47&rev2=48

--------------------------------------------------

  == Contents ==
   1. [[#A1|Problem: Master initializes, but Region Servers do not]]
   1. [[#A2|Problem: Created Root Directory for HBase through Hadoop DFS]]
-  1. [[#A3|Problem: Replay of hlog required, forcing regionserver restart]]
-  1. [[#A4|Problem: On migration, no files in root directory]]
+  1. [[#A3|Problem: On startup, Master says that you need to run the hbase migrations script]]
-  1. [[#A5|Problem: "xceiverCount 258 exceeds the limit of concurrent xcievers 256"]]
+  1. [[#A4|Problem: "xceiverCount 258 exceeds the limit of concurrent xcievers 256"]]
-  1. [[#A6|Problem: "No live nodes contain current block"]]
+  1. [[#A5|Problem: "No live nodes contain current block"]]
-  1. [[#A7|Problem: DFS instability and/or regionserver lease timeouts]]
+  1. [[#A6|Problem: DFS instability and/or regionserver lease timeouts]]
-  1. [[#A8|Problem: Instability on Amazon EC2]]
+  1. [[#A7|Problem: Instability on Amazon EC2]]
-  1. [[#A9|Problem: Zookeeper SessionExpired events]]
+  1. [[#A8|Problem: Zookeeper SessionExpired events]]
-  1. [[#A10|Problem: Scanners keep getting timeouts]]
-  1. [[#A11|Problem: Client says no such table but it exists]]
-  1. [[#A12|Problem: Could not find my address: xyz in list of ZooKeeper quorum servers]]
+  1. [[#A9|Problem: Could not find my address: xyz in list of ZooKeeper quorum servers]]
-  1. [[#A13|Problem: Long client pauses under high load; or deadlock if using THBase]]
-  1. [[#A14|Problem: Zookeeper does not seem to work on Amazon EC2]]
+  1. [[#A10|Problem: Zookeeper does not seem to work on Amazon EC2]]
-  1. [[#A15|Problem: General operating environment issues -- zookeeper session timeouts, regionservers shutting down, etc.]]
+  1. [[#A11|Problem: General operating environment issues -- zookeeper session timeouts, regionservers shutting down, etc.]]
-  1. [[#A16|Problem: Scanner performance is low]]
+  1. [[#A12|Problem: Scanner performance is low]]
-  1. [[#A17|Problem: My shell or client application throws lots of scary exceptions during normal operation]]
+  1. [[#A13|Problem: My shell or client application throws lots of scary exceptions during normal operation]]
-  1. [[#A18|Problem: The HBase or Hadoop daemons crash after some days of uptime with no errors logged]]
-  1. [[#A19|Problem: Running a Scan or a MapReduce job over a full table fails with "xceiverCount xx exceeds..." or OutOfMemoryErrors in the HDFS datanodes]]
+  1. [[#A14|Problem: Running a Scan or a MapReduce job over a full table fails with "xceiverCount xx exceeds..." or OutOfMemoryErrors in the HDFS datanodes]]
-  1. [[#A20|Problem: System instability, and the presence of "java.lang.OutOfMemoryError: unable to create new native thread" exceptions in HDFS datanode logs or that of any system daemon]]
+  1. [[#A15|Problem: System instability, and the presence of "java.lang.OutOfMemoryError: unable to create new native thread" exceptions in HDFS datanode logs or that of any system daemon]]
  
  <<Anchor(1)>>
  
  == 1. Problem: Master initializes, but Region Servers do not ==
+ 
-  * Master's log contains repeated instances of the following block:
-   . ~-INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:60020. Already tried 1 time(s).<<BR>> INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:60020. Already tried 2 time(s).<<BR>> -~
-   . ~-..<<BR>> INFO org.apache.hadoop.ipc.RPC: Server at /127.0.0.1:60020 not available yet, Zzzzz...-~
-  * Region Servers' logs contains repeated instances of the following block:
-   . ~-INFO org.apache.hadoop.ipc.Client: Retrying connect to server: masternode/192.168.100.50:60000. Already tried 1 time(s).<<BR>> INFO org.apache.hadoop.ipc.Client: Retrying connect to server: masternode/192.168.100.50:60000. Already tried 2 time(s).<<BR>> -~
-   . ~-..<<BR>> INFO org.apache.hadoop.ipc.RPC: Server at masternode/192.168.100.50:60000 not available yet, Zzzzz...-~
-  * Note that the Master believes the Region Servers have the IP of 127.0.0.1 - which is localhost and resolves to the master's own localhost.
+  * The Master believes the Region Servers have the IP 127.0.0.1 - the loopback address, which resolves back to the Master's own machine.
  
  === Causes ===
   * The Region Servers are erroneously informing the Master that their IP address is 127.0.0.1.
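  One common culprit (an assumption about your particular setup, not the only possibility) is an /etc/hosts entry on the region server that maps its own hostname to a loopback address; the hostnames below are placeholders:
  {{{
  # /etc/hosts on the region server (example only)
  # The second line makes the host resolve its own name to loopback,
  # which is then what gets reported to the Master.
  127.0.0.1       localhost
  127.0.1.1       regionserver1.example.com   regionserver1
  }}}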
@@ -66, +56 @@

  === Resolution ===
   * Make sure the HBase root directory either does not exist or was initialized by a previous run of HBase. A sure-fire solution is to use Hadoop dfs to delete the HBase root and let HBase create and initialize the directory itself.
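  For example, assuming an HBase root of /hbase on HDFS (check '''hbase.rootdir''' in hbase-site.xml first; this is destructive and removes all HBase data):
  {{{
  # inspect, then recursively delete the HBase root so HBase can recreate it
  hadoop fs -ls /hbase
  hadoop fs -rmr /hbase
  }}}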
  
+ 
  <<Anchor(3)>>
  
- == 3. Problem: Replay of hlog required, forcing regionserver restart ==
-  * Under a heavy write load, some regions servers will go down with the following exception:
+ == 3. Problem: On startup, Master says that you need to run the hbase migrations script ==
+  * On startup, the Master says that you need to run the hbase migrations script. Upon running it, the migrations script reports that there are no files in the root directory.
  
- {{{
- WARN org.apache.hadoop.dfs.DFSClient: Exception while reading from blk_xxxxxxxxxxxxxxx of /hbase/some_repository from IP_address:50010: java.io.IOException: Premeture EOF from inputStream
- then later
- ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening scanner (fsOk: true)
- java.io.IOException: HStoreScanner failed constructionat org.apache.hadoop.hbase.regionserver.StoreFileScanner.<init>(StoreFileScanner.java:69)
-        at org.apache.hadoop.hbase.regionserver.HStoreScanner.<init>(HStoreScanner.java:68)
-        at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1896)
-        ...
- Caused by: java.net.SocketTimeoutException: timed out waiting for rpc response
-        at org.apache.hadoop.ipc.Client.call(Client.java:559)
-        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
-        at org.apache.hadoop.dfs.$Proxy1.getFileInfo(Unknown Source)
- }}}
  === Causes ===
-  * RPC timeouts may happen because of a IO contention which blocks processes during file swapping.
+  * HBase expects the root directory either to not exist or to have already been initialized by a previous run of HBase. If you create the directory for HBase yourself using Hadoop DFS, this error will occur.
  
  === Resolution ===
-  * Configure your system to avoid swapping. Set vm.swappiness to 0. (http://kerneltrap.org/node/1044)
+  * Make sure the HBase root directory either does not exist or was initialized by a previous run of HBase. A sure-fire solution is to use Hadoop dfs to delete the HBase root and let HBase create and initialize the directory itself.
  
  <<Anchor(4)>>
  
+ == 4. Problem: "xceiverCount 258 exceeds the limit of concurrent xcievers 256" ==
+  * See the Troubleshooting section in the HBase book http://hbase.apache.org/book.html#trouble
- == 4. Problem: On migration, no files in root directory ==
-  * On Startup, Master says that you need to run the hbase migrations script. Upon running that, the hbase migrations script says no files in root directory.
- 
- === Causes ===
-  * HBase expects the root directory to either not exist, or to have already been initialized by hbase running a previous time. If you create a new directory for HBase using Hadoop DFS, this error will occur.
- 
- === Resolution ===
-  * Make sure the HBase root directory does not currently exist or has been initialized by a previous run of HBase. Sure fire solution is to just use Hadoop dfs to delete the HBase root and let HBase create and initialize the directory itself.
  
  <<Anchor(5)>>
  
- == 5. Problem: "xceiverCount 258 exceeds the limit of concurrent xcievers 256" ==
-  * See an exception with above message in logs, usually the datanode logs.
+ == 5. Problem: "No live nodes contain current block" ==
+  * See the Troubleshooting section in the HBase book http://hbase.apache.org/book.html#trouble
  
- === Causes ===
-  * An upper bound on connections was added in Hadoop (HADOOP-3633/HADOOP-3859).
- 
- === Resolution ===
-  * Up the maximum by setting '''dfs.datanode.max.xcievers''' (sic).  See [[http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200810.mbox/<20...@talk.nabble.com>|message from jean-adrien]] for some background. Values of 2048 or 4096 are common.
-  * This may be a symptom of having an unrealistically high number of regions in a table and/or an underpowered cluster; see [[#A19|the discussion of this below]]
  
  <<Anchor(6)>>
  
- == 6. Problem: "No live nodes contain current block" ==
-  * See an exception with above message in logs.
- 
- === Causes ===
-  * Insufficient file descriptors available at the OS level for DFS DataNodes
-  * Patch for HDFS-127 is not present (Should not be an issue for HBase >= 0.20.0 as a private Hadoop jar is shipped with the client side fix applied)
-  * Slow datanodes are marked as down by DFSClient; eventually all replicas are marked as 'bad' (HADOOP-3831).
- 
- === Resolution ===
-  * Increase the file descriptor limit of the user account under which the DFS DataNode processes are operating. On most Linux systems, adding the following lines to /etc/security/limits.conf will increase the file descriptor limit from the default of 1024 to 32768. Substitute the actual user name for {{{<user>}}}.
-   . {{{
- <user>          soft    nofile          32768
- <user>          hard    nofile          32768
- }}}
-  * RedHat based distributions also may have a maximum total open files across the whole system, so you will also need to edit /etc/sysctl.conf to include the line:
-   . {{{
- fs.file-max = 32768
- }}}
-   . Run the commands {{{sysctl -p /etc/sysctl.conf}}} and {{{service network restart}}} to make the change immediately effective.
- 
- <<Anchor(7)>>
- 
- == 7. Problem: DFS instability and/or regionserver lease timeouts ==
+ == 6. Problem: DFS instability and/or regionserver lease timeouts ==
   * HBase regionserver leases expire during start up
   * HBase daemons cannot find block locations in HDFS during start up or other periods of load
   * HBase regionserver restarts after being unable to report to master:
@@ -161, +105 @@

   * For Java SE 6, some users have had success with {{{ -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:ParallelGCThreads=8 }}} (one way to apply such flags is sketched after this list).
   * See HBase [[PerformanceTuning|Performance Tuning]] for more on JVM GC tuning.
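  A sketch of applying such flags (verify they suit your JVM version and heap size) via HBASE_OPTS in conf/hbase-env.sh:
  {{{
  # conf/hbase-env.sh -- example only
  export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:ParallelGCThreads=8"
  }}}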
  
- <<Anchor(8)>>
+ <<Anchor(7)>>
  
- == 8. Problem: Instability on Amazon EC2 ==
+ == 7. Problem: Instability on Amazon EC2 ==
   * Various problems suggesting overload on Amazon EC2 deployments: scanner timeouts, problems locating HDFS blocks, missed heartbeats, "We slept xxx ms, ten times longer than scheduled" messages, and so on.
   * These problems continue after following the other relevant advice on this page.
   * Or, you are trying to use Small or Medium instance types. (Do not.)
@@ -176, +120 @@

   * Use X-Large (c1.xlarge) instances
   * Consider splitting storage and computational function over disjoint instance sets.
  
- <<Anchor(9)>>
+ <<Anchor(8)>>
  
- == 9. Problem: ZooKeeper SessionExpired events ==
+ == 8. Problem: ZooKeeper SessionExpired events ==
   * Master or Region Servers shutting down with messages like the following in the logs:
  
  {{{
@@ -222, +166 @@

   * If this is happening during an upload which only happens once (like initially loading all your data into HBase), consider [[http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk|importing into HFiles directly]].
   * HBase ships with some GC tuning, for more information see [[PerformanceTuning|Performance Tuning]].
  
+ 
+ <<Anchor(9)>>
+ 
+ == 9. Problem: Could not find my address: xyz in list of ZooKeeper quorum servers ==
+  * A ZooKeeper server was not able to start and throws this error, where xyz is the name of your server.
+ 
+ === Causes ===
+  * This is a name lookup problem. HBase tries to start a ZK server on some machine, but that machine cannot find itself in the '''hbase.zookeeper.quorum''' configuration.
+ 
+ === Resolution ===
+  * Use the hostname presented in the error message instead of the value you used. If you have a DNS server, you can set '''hbase.zookeeper.dns.interface''' and '''hbase.zookeeper.dns.nameserver''' in hbase-site.xml to control which interface and nameserver are used to resolve the correct FQDN.
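  A minimal hbase-site.xml sketch; the interface name and nameserver address below are placeholders for your environment:
  {{{
  <property>
    <name>hbase.zookeeper.dns.interface</name>
    <value>eth0</value>
  </property>
  <property>
    <name>hbase.zookeeper.dns.nameserver</name>
    <value>192.168.1.1</value>
  </property>
  }}}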
+ 
  <<Anchor(10)>>
  
- == 10. Problem: Scanners keep getting timeouts ==
-  * Client receives org.apache.hadoop.hbase.UnknownScannerException or timeouts even if the region server lease is really high. Fixed in HBase 0.20.0
- 
- === Causes ===
-  * The client, by default, fetches 30 rows when issuing the first next(). All the 29 other calls simply return rows from local memory. So if it takes 3 minutes to process a row and the timeout is set to 30 minutes, it is still not enough to cover 30 * 3 = 90 minutes.
- 
- === Resolution ===
-  * Set hbase.client.scanner.caching in hbase-site.xml at a very low value like 1 or use HTable.setScannerCaching(1).
- 
- <<Anchor(11)>>
- 
- == 11. Problem: Client says no such table but it exists ==
-  * Client can't find region in table, says no such table.
- 
- === Causes ===
-  * Just deleted a large table
- 
- === Resolution ===
-  * Run major compaction on the .META. table.  In the shell, type '''tool''' to learn how to run a major compaction from the shell.
- 
- <<Anchor(12)>>
- 
- == 12. Problem: Could not find my address: xyz in list of ZooKeeper quorum servers ==
-  * A Zookeeper server wasn't able to start, throws that error. xyz is the name of your server.
- 
- === Causes ===
-  * This is a name lookup problem. HBase tries to start a ZK server on some machine but that machine isn't able to find itself in the '''hbase.zookeeper.quorum configuration'''.
- 
- === Resolution ===
-  * Use the hostname presented in the error message instead of the value you used. If you have a DNS server, you can set '''hbase.zookeeper.dns.interface''' and '''hbase.zookeeper.dns.nameserver''' in hbase-site.xml to make sure it resolves to the correct FQDN.
- 
- <<Anchor(13)>>
- 
- == 13. Problem: Long client pauses under high load; or deadlock if using transactional HBase (THBase) ==
-  * Under high load, some client operations take a long time; waiting appears uneven
-  * If using THBase, apparent deadlocks: for example, in thread dumps IPC Server handlers are blocked in org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegion.updateIndex()
- 
- === Causes ===
-  * The default number of regionserver RPC handlers is insufficient.
- 
- === Resolution ===
-  * Increase the value of "hbase.regionserver.handler.count" in hbase-site.xml. The default is 10. Try 100.
- 
- <<Anchor(14)>>
- 
- == 14. Problem: Zookeeper does not seem to work on Amazon EC2 ==
+ == 10. Problem: Zookeeper does not seem to work on Amazon EC2 ==
   * HBase does not start when deployed as Amazon EC2 instances.
   * Exceptions like the below appear in the master and/or region server logs:
  
@@ -286, +197 @@

  === Resolution ===
   * Use the internal EC2 host names when configuring the Zookeeper quorum peer list.
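  For example (the instance names below are made up), the quorum list in hbase-site.xml would use the internal names rather than the public ec2-*.compute.amazonaws.com names:
  {{{
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>ip-10-1-2-3.ec2.internal,ip-10-1-2-4.ec2.internal,ip-10-1-2-5.ec2.internal</value>
  </property>
  }}}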
  
- <<Anchor(15)>>
+ <<Anchor(11)>>
  
- == 15. Problem: General operating environment issues -- zookeeper session timeouts, regionservers shutting down, etc ==
+ == 11. Problem: General operating environment issues -- zookeeper session timeouts, regionservers shutting down, etc ==
  === Causes ===
   . Various.
  
  === Resolution ===
  See the [[http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting|ZooKeeper Operating Environment Troubleshooting]] page.  It has suggestions and tools for checking disk and networking performance, i.e. the operating environment your ZooKeeper and HBase are running in.  ZooKeeper is the cluster's "canary": it will be the first to notice issues, so making sure it is happy is the short-cut to a humming cluster.
  
- <<Anchor(16)>>
+ <<Anchor(12)>>
  
- == 16. Problem: Scanner performance is low ==
+ == 12. Problem: Scanner performance is low ==
  === Causes ===
  Default scanner caching (prefetching) is set to 1. The default is low because if a job takes too long processing a batch of rows, the scanner can time out, which causes unhappy jobs/people/emails.
  
@@ -305, +216 @@

   * Increase the amount of prefetching on the scanner, to 10, or 100, or 1000, as appropriate for your workload: [[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#scannerCaching|HTable.scannerCaching]]
   * This change can be accomplished globally by setting the hbase.client.scanner.caching property in hbase-site.xml to the desired value.
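  A sketch of the hbase-site.xml route just mentioned; the value 100 is only an example, pick what your per-row processing time can tolerate:
  {{{
  <property>
    <name>hbase.client.scanner.caching</name>
    <value>100</value>
  </property>
  }}}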
  
- <<Anchor(17)>>
+ <<Anchor(13)>>
  
- == 17. Problem: My shell or client application throws lots of scary exceptions during normal operation ==
+ == 13. Problem: My shell or client application throws lots of scary exceptions during normal operation ==
  === Causes ===
  Since 0.20.0 the default log level for org.apache.hadoop.hbase.* is DEBUG.
  
  === Resolution ===
  On your clients, edit $HBASE_HOME/conf/log4j.properties and change this: {{{log4j.logger.org.apache.hadoop.hbase=DEBUG}}} to this: {{{log4j.logger.org.apache.hadoop.hbase=INFO}}}, or even {{{log4j.logger.org.apache.hadoop.hbase=WARN}}}.
  
+ 
- <<Anchor(18)>>
+ <<Anchor(14)>>
  
- == 18. Problem: The HBase or Hadoop daemons crash after some days of uptime with no errors logged ==
- === Causes ===
- HBase and Hadoop have stability issues on certain versions of the JVM that can cause this issue. In particular, Sun Java 1.6.0_18 is known to be buggy. The current recommended version for production usage is 1.6.0_16.
- 
- === Resolution ===
- Downgrade your JVM to 1.6.0_16.
- 
- <<Anchor(19)>>
- 
- == 19. Problem: Running a Scan or a MapReduce job over a full table fails with "xceiverCount xx exceeds..." or OutOfMemoryErrors in the HDFS datanodes ==
+ == 14. Problem: Running a Scan or a MapReduce job over a full table fails with "xceiverCount xx exceeds..." or OutOfMemoryErrors in the HDFS datanodes ==
  
  === Causes ===
- HBase keeps a number of files per region open on HDFS. When you have a large number of regions in a single table, this means that HDFS needs to keep a large number of files open on HDFS, which can cause you to run into the [[#A5|"xceiverCount xx exceeds..."]] issue, or conversely ``OutOfMemoryErrors due to raising the '''dfs.datanode.max.xceivers''' setting too high to escape this issue.
- 
- Each region in HBase corresponds to 0 to N store files in HDFS. The '''dfs.datanode.max.xcievers''' setting controls the maximum number of handler threads that are allowed per HDFS datanode. Each store file consumes at least one thread on the datanode. Once a store file is opened, it stays open until a compaction or splitting is needed; this can result in the dfs.datanode.max.xceivers limit being reached surprisingly quickly if you have a lot of regions in a single table.
- 
- This problem is generally a symptom of an underpowered cluster. 
+ This problem is generally a symptom of a misconfigured or underpowered cluster.
  
  === Resolution ===
+  * See the Troubleshooting section in the HBase book http://hbase.apache.org/book.html#trouble on xceivers configuration.
+  * See the configuration section in the HBase book http://hbase.apache.org/book.html for '''hbase.hregion.max.filesize''' (a sketch of this setting follows this list).
- 
- This can be resolved in the following ways:
-  * Increase the maximum file size per region; this is the '''hbase.hregion.max.filesize''' in hbase-site.xml, and it defaults to 268435456 (256 MB). Keep in mind that this will only apply to future region splits, and will not result in existing regions being merged.
-  * Mess with configuration that effects RAM -- i.e. thread stack sizes or, dependent on what your query path looks like, shrink size given over to block cache (will slow your reads though)
   * Add machines to your cluster.
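  As referenced in the '''hbase.hregion.max.filesize''' item above, a sketch of raising the maximum region size in hbase-site.xml (1 GB here is only an example; the right value depends on your data, and the change applies only to future region splits):
  {{{
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>1073741824</value>
  </property>
  }}}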
  
- <<Anchor(20)>>
+ <<Anchor(15)>>
  
- == 20. Problem: System instability, and the presence of "java.lang.OutOfMemoryError: unable to create new native thread in exceptions" HDFS datanode logs or that of any system daemon ==
+ == 15. Problem: System instability, and the presence of "java.lang.OutOfMemoryError: unable to create new native thread" exceptions in HDFS datanode logs or that of any system daemon ==
  
  === Causes ===
  
@@ -351, +248 @@

   
  === Resolution ===
  
+ See the HBase book http://hbase.apache.org/book.html on nproc configuration.
- Set this to 16K or higher. We recommend at least the configured number of DataNode xceivers plus 1K. Add the following lines to /etc/security/limits.conf. Substitute the actual user name for {{{<user>}}}.
-   . {{{
- <user>          soft    nproc          32768
- <user>          hard    nproc          32768
- }}}