Posted to user@hbase.apache.org by ramkrishna vasudevan <ra...@gmail.com> on 2019/03/28 04:07:40 UTC

Re: Debugging High I/O Wait

Hi Srinidhi

As you said, the cache and WAL files are on the RS SSD drives. Do the cache
and the WAL files reside on separate SSDs or on the same SSD?

Are there also writes happening while these reads happen from the bucket
cache? Is your LRU cache big enough to hold all the index blocks?
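
For concreteness, a minimal hbase-site.xml sketch of the combined-cache layout
being discussed -- the on-heap L1 LRU cache plus an L2 file-mode BucketCache
backed by a file on the local SSD. The path and sizes are placeholders, not
values from this cluster:

    <!-- L1: fraction of the regionserver heap given to the on-heap LRU block cache -->
    <property>
      <name>hfile.block.cache.size</name>
      <value>0.3</value>
    </property>

    <!-- L2: file-mode BucketCache backed by a file on the local SSD (placeholder path) -->
    <property>
      <name>hbase.bucketcache.ioengine</name>
      <value>file:/mnt/ssd1/hbase/bucketcache.data</value>
    </property>

    <!-- BucketCache capacity in MB (placeholder value) -->
    <property>
      <name>hbase.bucketcache.size</name>
      <value>32768</value>
    </property>

Note that the BucketCache file competes for bandwidth on that SSD with
anything else placed there (WALs, application logs), which is what the
questions above are getting at.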

Regards
Ram


On Thu, Mar 28, 2019 at 12:27 AM Srinidhi Muppalla <sr...@trulia.com>
wrote:

> Hello,
>
> We've noticed an issue in our HBase cluster where one of the
> region-servers has a spike in I/O wait associated with a spike in Load for
> that node. As a result, our request times to the cluster increase
> dramatically. Initially, we suspected that we were experiencing
> hotspotting, but even after temporarily blocking requests to the
> highest-volume regions on that region-server, the issue persisted. Moreover, when
> looking at request counts to the regions on the region-server from the
> HBase UI, they were not particularly high, and our own application-level
> metrics on the requests we were making were not very high either. From
> looking at a thread dump of the region-server, it appears that our get and
> scan requests are getting stuck when trying to read blocks from our
> bucket cache, leaving the threads in a 'runnable' state. For context, we are
> running HBase 1.3.0 on an S3-backed cluster on EMR, and our bucket
> cache is running in file mode. Our region-servers all have SSDs. We have a
> combined cache with the standard L1 LRU cache and the L2 file-mode bucket
> cache. Our bucket cache utilization is less than 50% of the allocated space.
>
> We suspect that part of the issue is disk space utilization on the
> region-server, since our maximum disk utilization also increased when this
> happened. What can we do to minimize disk space utilization? The
> actual HFiles are on S3 -- only the cache, application logs, and write
> ahead logs are on the region-servers. Other than disk space
> utilization, what factors could cause high I/O wait in HBase, and is there
> anything we can do to minimize it? (See the diagnostic sketch after this
> quoted message.)
>
> Right now, the only thing that works is terminating and recreating the
> cluster (which we can do safely because it's S3 backed).
>
> Thanks!
> Srinidhi
>
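
A quick way to narrow down where the wait is coming from is to check whether
the SSD holding the BucketCache file, logs, and WALs is the device that is
actually saturated. A minimal sketch with standard Linux tools (device paths
and mount points below are hypothetical):

    # per-device utilization, queue depth, and latency, refreshed every 5 seconds
    iostat -x 5

    # show only the processes/threads currently doing disk I/O
    sudo iotop -o

    # see what is consuming space on the regionserver's local volume (placeholder paths)
    df -h /mnt/ssd1
    du -sh /mnt/ssd1/hbase/bucketcache.data /var/log/hbase

If one device sits near 100% utilization while requests stall, contention
between BucketCache reads and WAL/log writes on that shared SSD is a likely
culprit.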

Re: Debugging High I/O Wait

Posted by Srinidhi Muppalla <sr...@trulia.com>.
They reside on the same SSD. Is it advisable to have a separate volume for the WALs?

There are writes happening while reads are happening from the Bucket cache. 

I believe our LRU cache is big enough to hold all the index blocks. I don't have the exact numbers from when the cluster last had the issue, but on our currently healthy cluster each region server has 2.2 GB dedicated to the LRU cache and, on average, a total of ~20 MB for all its indexes.
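
If useful, one way to double-check those numbers on a live regionserver is its
JMX servlet. A small sketch -- the port (16030 is the default regionserver
info port) and the exact metric names vary by version and deployment:

    # dump the regionserver's server metrics as JSON and pick out the block cache counters
    curl -s 'http://<regionserver-host>:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Server' \
      | grep -i blockCache

The regionserver web UI also shows block cache statistics.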

Thanks,
Srinidhi
