Posted to user@hbase.apache.org by Jeff Whiting <je...@qualtrics.com> on 2012/11/01 16:14:52 UTC

Re: Struggling with Region Servers Running out of Memory

No fat rows.  We have kept the default HBase client limit of 10 MB, and most values are quite small (< 5 KB).

We haven't tried raising the memory limit; we can try raising it on one of the servers and see how it 
does.  However, looking at the graphs, I don't think it will help...but it is worth a try.
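
If we do try it, the plan would be to bump the heap on just one region server via hbase-env.sh and 
restart it.  Roughly (variable names as in a stock hbase-env.sh, which may sit elsewhere under CDH; 
the 14 GB figure is just an example for the experiment):

    # hbase-env.sh on the one server being tested
    # export HBASE_HEAPSIZE=14000   # heap for HBase daemons on this host, in MB
    # or, to touch only the region server JVM:
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms14g -Xmx14g"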

~Jeff


On 10/30/2012 10:45 PM, ramkrishna vasudevan wrote:
> Are you writing fat cells?
>
> Did you try raising the heap size and seeing if it still crashes?
>
> Regards
> Ram
>
> On Wed, Oct 31, 2012 at 6:10 AM, Jeff Whiting <jeffw@qualtrics.com> wrote:
>
>     I'm looking at ganglia, so the numbers are somewhat approximate (this is for a server that
>     just crashed about a half hour ago due to running out of memory):
>
>     Store files are hovering just below 1k.  Over the last 24 hours it has varied by about 100
>     files (I'm looking at hbase.regionserver.storefiles).
>
>     Block cache count is about 24k, varying by about 2k.  Our block cache free goes between 0.7G
>     and 0.4G.  It looks like we have almost 3G free after restarting a region server.
>
>     The evicted block count went from 210k to 320k over a 24-hour period.  Hit ratio is close to
>     100% (the graph isn't very detailed, so I'm guessing it is around 98-99%).
>
>     Block cache size stays at about 2GB.
>
>     ~Jeff
>
>
>
>     On 10/30/2012 6:21 PM, Jeff Whiting wrote:
>
>         We have no coprocessors.  We are running replication from this cluster to another one.
>
>         What is the best way to see how many store files we have, or to check on the block cache?
>
>         ~Jeff
>
>         On 10/30/2012 12:43 AM, ramkrishna vasudevan wrote:
>
>             Hi
>
>             Are you using any coprocessors? Can you see how many store files are
>             created?
>
>             The number of blocks getting cached will give you an idea too.
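>
>             A rough way to eyeball both (a sketch; the exact fields vary a bit between
>             versions): the shell's detailed status dumps per-region store file counts,
>             and the region server web UI on port 60030 shows the block cache counters.
>
>                 $ hbase shell
>                 hbase(main):001:0> status 'detailed'
>
>             The same numbers are also in the regionserver metrics if you graph them
>             (hbase.regionserver.storefiles, blockCacheCount, blockCacheHitRatio).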
>
>             Regards
>             Ram
>
>             On Tue, Oct 30, 2012 at 4:25 AM, Jeff Whiting <jeffw@qualtrics.com> wrote:
>
>                 We have 6 region servers, each given 10 GB of memory for HBase.  Each region
>                 server has an average of about 100 regions, and across the cluster we are
>                 averaging about 100 requests/second with a pretty even read/write load.  We
>                 are running CDH4 (0.92.1-cdh4.0.1, rUnknown).
>
>                 Looking over our load and our requests, I feel that 10 GB of memory should
>                 be enough to handle it and that we shouldn't really be pushing the memory
>                 limits.
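>
>                 (Back of the envelope, and assuming I have the 0.92 defaults right:
>                 memstores are capped at hbase.regionserver.global.memstore.upperLimit =
>                 0.4 of heap, so ~4 GB, and the block cache at hfile.block.cache.size =
>                 0.25, so ~2.5 GB, which should still leave a few GB of headroom for
>                 everything else.)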
>
>                 However, what we are seeing is that memory usage goes up slowly until the
>                 region server starts sputtering due to garbage collection pauses, and it
>                 eventually gets timed out by ZooKeeper and killed.
>
>                 We'll see aborts like this in the log:
>                 2012-10-29 08:10:52,132 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>                 ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547:
>                 Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException:
>                 Server REPORT rejected; currently processing ds5.h1.ut1.qprod.net,60020,1351233245547
>                 as dead server
>                 2012-10-29 08:10:52,250 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>                 RegionServer abort: loaded coprocessors are: []
>                 2012-10-29 08:10:52,392 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>                 ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547:
>                 regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf
>                 regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf received
>                 expired from ZooKeeper, aborting
>                 2012-10-29 08:10:52,401 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>                 RegionServer abort: loaded coprocessors are: []
>
>                 Which are "caused" by:
>                 2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: We
>                 slept 29014ms instead of 3000ms, this is likely due to a long garbage
>                 collecting pause and it's usually bad, see
>                 http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>                 2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper: We
>                 slept 28121ms instead of 3000ms, this is likely due to a long garbage
>                 collecting pause and it's usually bad, see
>                 http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>                 2012-10-29 08:09:13,261 WARN org.apache.hadoop.hbase.util.Sleeper: We
>                 slept 31124ms instead of 3000ms, this is likely due to a long garbage
>                 collecting pause and it's usually bad, see
>                 http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>                 2012-10-29 08:09:45,536 WARN org.apache.hadoop.hbase.util.Sleeper: We
>                 slept 32209ms instead of 3000ms, this is likely due to a long garbage
>                 collecting pause and it's usually bad, see
>                 http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>                 2012-10-29 08:10:18,103 WARN org.apache.hadoop.hbase.util.Sleeper: We
>                 slept 32557ms instead of 3000ms, this is likely due to a long garbage
>                 collecting pause and it's usually bad, see
>                 http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>                 2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: We
>                 slept 33741ms instead of 3000ms, this is likely due to a long garbage
>                 collecting pause and it's usually bad, see
>                 http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>
>
>                 We'll also see a bunch of responseTooSlow and operationTooSlow as GC kicks
>                 in and really kills the region server's performance.
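>
>                 To see what the collector is actually doing around those pauses, GC
>                 logging can be turned on with the usual JVM flags in hbase-env.sh (a
>                 sketch; the log path here is only an example):
>
>                     export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
>                       -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
>                       -Xloggc:/var/log/hbase/gc-regionserver.log"
>
>                 A concurrent mode failure or promotion failure in that log right before
>                 one of the Sleeper warnings would point at the old generation filling up.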
>
>
>                 We have the JVM metrics going out to ganglia, and looking at
>                 jvm.RegionServer.metrics.memHeapUsedM you can see that it goes up over
>                 time until the server eventually runs out of memory.  I can also see in
>                 hmaster:60010/master-status that usedHeapMB just goes up, and I can make
>                 a pretty educated guess as to which server will go down next.  It takes
>                 several days to a week of continuous running (after restarting a region
>                 server) before we have a potential problem.
>
>                 Our next one to go will probably be ds6 and jmap -heap shows:
>                 concurrent mark-sweep generation:
>                     capacity = 10398531584 (9916.8125MB)
>                     used     = 9036165000 (8617.558479309082MB)
>                     free     = 1362366584 (1299.254020690918MB)
>                     86.89847145248619% used
>
>                 So we are using 86% of the 10 GB heap allocated to the concurrent
>                 mark-sweep generation.  Looking at ds6 in the web interface, which has
>                 information about its running tasks and RPCs, it doesn't show any
>                 compactions or other background tasks happening, nor any active RPC calls
>                 longer than 0 seconds (it seems to be handling the requests just fine).
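>
>                 (For what it's worth, the same trend can be watched live with jstat; the
>                 O column from -gcutil is old generation occupancy as a percentage, and
>                 the pid and 10-second interval here are just examples:
>
>                     $ jstat -gcutil <regionserver-pid> 10000
>
>                 Watching whether O keeps climbing even after CMS cycles complete is a
>                 quick cross-check on the jmap numbers.)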
>
>                 At this point I feel somewhat lost as to how to debug the problem.  I'm
>                 not sure what to do next to figure out what is going on.  Any suggestions
>                 as to what to look for, or how to debug where the memory is being used?
>                 I can generate heap dumps via jmap (although it effectively kills the
>                 region server), but I don't really know what to look for to see where the
>                 memory is going.  I also have JMX set up on each region server and can
>                 connect to it that way.
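>
>                 For reference, the standard invocations I have in mind (the output path
>                 is just an example, and both commands pause the JVM, which is why the
>                 dump effectively kills the region server):
>
>                     # quick class histogram: which classes are holding the memory
>                     $ jmap -histo:live <regionserver-pid> | head -n 40
>
>                     # full heap dump for offline analysis in MAT or jhat
>                     $ jmap -dump:live,format=b,file=/tmp/rs-heap.hprof <regionserver-pid>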
>
>                 Thanks,
>                 ~Jeff
>
>                 -- 
>                 Jeff Whiting
>                 Qualtrics Senior Software Engineer
>                 jeffw@qualtrics.com
>
>
>
>
>     -- 
>     Jeff Whiting
>     Qualtrics Senior Software Engineer
>     jeffw@qualtrics.com
>
>

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com