Posted to user@hbase.apache.org by Jeff Whiting <je...@qualtrics.com> on 2012/11/01 16:14:52 UTC
Re: Struggling with Region Servers Running out of Memory
No fat rows. We have kept the default HBase client limit of 10 MB, and most values are quite small (< 5k).
We haven't tried raising the memory limit, but we can raise it on one of the servers and see how it
does. Looking at the graphs I don't think it will help, but it is worth a try.
~Jeff
On 10/30/2012 10:45 PM, ramkrishna vasudevan wrote:
> Are you writing fat cells?
>
> Did you try raising the heap size to see if it is still crashing?
>
> Regards
> Ram
>
> On Wed, Oct 31, 2012 at 6:10 AM, Jeff Whiting <jeffw@qualtrics.com> wrote:
>
> I'm looking at Ganglia, so the numbers are somewhat approximate (this is for a server that
> just crashed about half an hour ago due to running out of memory):
>
> Store files are hovering just below 1k. Over the last 24 hours it has varied by about 100
> files (I'm looking at hbase.regionserver.storefiles).
>
> Block cache count is about 24k, varying by about 2k. Our block cache free goes between 0.7G and
> 0.4G. It looks like we have almost 3G free after restarting a region server.
>
> The evicted block count went from 210k to 320k over a 24 hour period. Hit ratio is close to
> 100% (the graph isn't very detailed, so I'm guessing it is around 98-99%).
>
> Block cache size stays at about 2GB.
>
> ~Jeff
>
>
>
> On 10/30/2012 6:21 PM, Jeff Whiting wrote:
>
> We have no coprocessors. We are running replication from this cluster to another one.
>
> What is the best way to see how many store files we have, or to check on the block cache?
>
> ~Jeff
>
> On 10/30/2012 12:43 AM, ramkrishna vasudevan wrote:
>
> Hi
>
> Are you using any coprocessors? Can you see how many store files are
> created?
>
> The number of blocks getting cached will give you an idea too.
>
> Regards
> Ram
>
> On Tue, Oct 30, 2012 at 4:25 AM, Jeff Whiting <jeffw@qualtrics.com> wrote:
>
> We have 6 region servers, each given 10G of memory for HBase. Each region server
> has an average of about 100 regions, and across the cluster we are averaging
> about 100 requests / second with a pretty even read / write load. We are
> running cdh4 (0.92.1-cdh4.0.1, rUnknown).
>
> Looking over our load and our requests, I feel that the 10GB of memory
> should be enough to handle the load and that we shouldn't really be pushing
> the memory limits.
>
> However, what we are seeing is that our memory usage goes up slowly until
> the region server starts sputtering due to GC issues, and it will
> eventually get timed out by ZooKeeper and be killed.
>
> We'll see aborts like this in the log:
> 2012-10-29 08:10:52,132 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547:
> Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException:
> Server REPORT rejected; currently processing ds5.h1.ut1.qprod.net,60020,1351233245547
> as dead server
> 2012-10-29 08:10:52,250 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> RegionServer abort: loaded coprocessors are: []
> 2012-10-29 08:10:52,392 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547:
> regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf
> regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf
> received expired from ZooKeeper, aborting
> 2012-10-29 08:10:52,401 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> RegionServer abort: loaded coprocessors are: []
>
> Which are "caused" by:
> 2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 29014ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 28121ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:09:13,261 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 31124ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:09:45,536 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 32209ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:10:18,103 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 32557ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 33741ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>
>
> We'll also see a bunch of responseTooSlow and operationTooSlow as GC kicks
> in and really kills the region server's performance.
>
>
> We have the jvm metrics kicking out to ganglia and looking at
> jvm.RegionServer.metrics.memHeapUsedM you can see that it will go up
> over time and eventually run out of memory. I can also see in
> hmaster:60010/master-status that the usedHeapMB just goes up and I can make
> a pretty educated guess as to what server will go down next. It will take
> several days to a week of continuous running (after restarting a region
> server) before we have a potential problem.
>
> Our next one to go will probably be ds6 and jmap -heap shows:
> concurrent mark-sweep generation:
> capacity = 10398531584 (9916.8125MB)
> used = 9036165000 (8617.558479309082MB)
> free = 1362366584 (1299.254020690918MB)
> 86.89847145248619% used
>
> So we are using 86% of the 10GB heap allocated to the concurrent mark-sweep
> generation. Looking at ds6 in the web interface, the section with
> information about its non-RPC tasks doesn't show
> any compactions or any background tasks happening. Nor are there any active
> RPC calls that are longer than 0 seconds (it seems to be handling the
> requests just fine).
>
> At this point I feel somewhat lost as to how to debug the problem. I'm not
> sure what to do next to figure out what is going on. Any suggestions as to
> what to look for, or how to debug where the memory is being used? I can generate
> heap dumps via jmap (although it effectively kills the region server), but I
> don't really know what to look for to see where the memory is going. I also
> have JMX set up on each region server and can connect to it that way.
>
> Thanks,
> ~Jeff
>
> --
> Jeff Whiting
> Qualtrics Senior Software Engineer
> jeffw@qualtrics.com
>
--
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com
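
For anyone following the thread: the Sleeper warnings quoted above can be tallied with a short script to see how far each pause overshot the expected sleep. This is only a rough sketch in Python. The function names and the 30000 ms threshold in the usage note are illustrative assumptions, not HBase tooling or a value stated in this thread. It assumes each warning sits on a single line, as it would in the region server log rather than in the wrapped email text.

```python
import re

# Matches the Sleeper warning text quoted in this thread, e.g.
# "We slept 29014ms instead of 3000ms".
SLEPT_RE = re.compile(r"We\s+slept\s+(\d+)ms\s+instead\s+of\s+(\d+)ms")


def sleeper_overshoots(log_text):
    """Return (slept_ms, expected_ms, overshoot_ms) for each Sleeper warning."""
    results = []
    for match in SLEPT_RE.finditer(log_text):
        slept = int(match.group(1))
        expected = int(match.group(2))
        results.append((slept, expected, slept - expected))
    return results


def pauses_over(log_text, threshold_ms):
    """Return the slept durations (ms) that exceed threshold_ms.

    threshold_ms is whatever limit you care about (for example your own
    ZooKeeper session timeout); no default is assumed here.
    """
    return [slept for slept, _, _ in sleeper_overshoots(log_text)
            if slept > threshold_ms]


if __name__ == "__main__":
    # Two of the warnings from the thread, joined onto single lines.
    sample = (
        "2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: "
        "We slept 29014ms instead of 3000ms, this is likely due to a long "
        "garbage collecting pause\n"
        "2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: "
        "We slept 33741ms instead of 3000ms, this is likely due to a long "
        "garbage collecting pause\n"
    )
    for slept, expected, over in sleeper_overshoots(sample):
        print(f"slept {slept} ms, expected {expected} ms, overshoot {over} ms")
```

Calling pauses_over(log_text, 30000) on a full log would list only the pauses longer than 30 seconds. Substitute the session timeout actually configured on your cluster.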