Posted to user@hbase.apache.org by Viral Bajaria <vi...@gmail.com> on 2013/07/08 12:04:56 UTC

optimizing block cache requests + eviction

Hi,

TL;DR:
Trying to make a case for a smarter block eviction strategy that does not
evict remote blocks as readily, and for smarter block requests.

This question comes after I debugged an issue I was having with random
regionservers hitting high load averages. I initially thought the problem
was hardware related (i.e. a bad disk or network) since the I/O wait was
too high, but it turned out to be a combination of things.

I figured that with SCR (short-circuit read) ON, the datanode should almost
never show a high number of block requests from the local regionservers. So
my starting point for debugging was the datanode, since it was doing a ton of
I/O. The clienttrace logs helped me figure out which RS nodes were making
block requests. I hacked up a script to report which blocks are being
requested and how many times per minute (sketch below). I found that some
blocks were being requested 10+ times in a minute and over 2000 times in an
hour from the same regionserver. This was causing the server to do 40+ MB/s
on reads alone. That was on the higher side; the average was closer to 100 or
so requests per hour.
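
The logic of the script is roughly the following (written out in Java here
for clarity; my actual version was a quick shell hack, and the clienttrace
field names like "dest:" and "blockid:" are assumptions from my logs that may
differ across Hadoop versions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Count HDFS_READ requests per (minute, requesting host, block id) from a
// datanode clienttrace log and print the hot blocks.
public class BlockRequestCounter {
    // timestamp truncated to the minute, e.g. "2013-07-08 22:58"
    private static final Pattern MINUTE =
        Pattern.compile("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2})");
    // for HDFS_READ lines the receiving end ("dest") should be the client (RS)
    private static final Pattern DEST = Pattern.compile("dest: /([\\d.]+):");
    private static final Pattern BLOCK = Pattern.compile("blockid: (blk_-?\\d+)");

    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            if (!line.contains("HDFS_READ")) continue; // only count reads
            Matcher m = MINUTE.matcher(line);
            Matcher d = DEST.matcher(line);
            Matcher b = BLOCK.matcher(line);
            if (m.find() && d.find() && b.find()) {
                String key = m.group(1) + " " + d.group(1) + " " + b.group(1);
                Integer c = counts.get(key);
                counts.put(key, c == null ? 1 : c + 1);
            }
        }
        in.close();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= 10) { // blocks requested 10+ times within one minute
                System.out.println(e.getValue() + "\t" + e.getKey());
            }
        }
    }
}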

Now, why did I end up in this situation? It happened because I added servers
to the cluster and rebalanced it. At the same time I added some drives and
also removed the offending server in my setup. This caused some of the data
to no longer be co-located with the regionservers serving it. Given that
major_compaction was disabled and would not have run for a while (at least on
some tables), these block requests would not go away. One of my regionservers
was totally overwhelmed. I made the situation worse when I removed the server
that was under heavy load, assuming it was a hardware problem with the box,
without doing a deep dive (doh!). Given that regionservers will be added in
the future, I expect block locality to go down until major_compaction runs.
Nodes going down can also cause this problem. So I started thinking of
possible solutions, but first some observations.

*Observations/Comments*
- The surprising part was that the regionservers were making so many
requests for the same block within the same minute (let alone hour). Could
this happen because the original request took a few seconds and so the
regionserver re-requested? I didn't see any block fetch errors in the
regionserver logs.
- Even stranger: my heap size was 11G, and while this was happening the
used heap was at 2-4G. I would have expected the heap to grow higher than
that, since the blockCache should be using at least 40% of the available
heap space.
- Another strange thing I observed was that the block was being requested
from the same datanode every single time.

*Possible Solutions/Changes*
- Would it make sense to give remote blocks higher priority over local
blocks that can be read via SCR, and not let them get evicted if there is a
tie in which block to evict?
- Should we throttle the number of outgoing requests for a block? I am not
sure if my firewall caused some issue, but I wouldn't expect multiple block
fetch requests in the same minute. I did see a few RST packets getting
dropped at the firewall, but I wasn't able to trace the problem to that.
- We have 3 replicas available; shouldn't we request from another datanode
if one might take a long time? The amount of time it took to read a block
went up when the box was under heavy load, yet the re-requests kept going to
the same datanode. Is this something that is available on the DFSClient, and
can we exploit it?
- Is it possible to migrate a region to a server which has a higher number
of its blocks available locally? We don't need to make this automatic, but we
could provide a command that could be invoked manually to assign a region to
a specific regionserver (see the sketch below). Thoughts?
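
For the last point: if I am reading the 0.94 client API right, HBaseAdmin#move
(and the shell's "move" command) may already cover the manual-assignment half
of this. Roughly the sketch below; the region and server names are
placeholders, and I have not verified how much locality actually improves
afterwards:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: manually move a region to a chosen regionserver.
public class MoveRegion {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            // encoded region name (the hash suffix of the region name), placeholder value
            byte[] encodedRegionName = Bytes.toBytes("1f2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e");
            // destination in host,port,startcode form; passing null lets the master pick
            byte[] destServer = Bytes.toBytes("rs-with-local-blocks.example.com,60020,1373241600000");
            admin.move(encodedRegionName, destServer);
        } finally {
            admin.close();
        }
    }
}

What would still be missing is something that picks the destination server
based on where the region's blocks actually live.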

Thanks,
Viral

Re: optimizing block cache requests + eviction

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Meta blocks are at the end:
http://hbase.apache.org/book.html#d2617e12979. A way to tell would be
by logging from the HBase side, but then I guess it's hard to reconcile
with which file we're actually reading from...

Regarding your second question, you are asking if we cache HDFS
blocks? We don't, since we don't even know about HDFS blocks. The
BlockReader seeks into the file and returns whatever data is asked for.
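
Put differently, from HBase's side an HFile block read is just a positioned
read at some (offset, length) in the store file, and whether those bytes sit
in a local or a remote HDFS block is entirely the DFSClient's problem.
Roughly like this (illustrative only, not actual HBase code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative: read one HFile block's worth of bytes with a positioned read.
// Nothing here knows or cares where the HDFS block replicas live.
public class PreadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path storeFile = new Path(args[0]);     // path to an HFile
        long offset = Long.parseLong(args[1]);  // block offset inside the HFile
        int length = Integer.parseInt(args[2]); // on-disk block size
        FSDataInputStream in = fs.open(storeFile);
        try {
            byte[] buf = new byte[length];
            in.readFully(offset, buf, 0, length); // pread; the DFSClient picks the replica
        } finally {
            in.close();
        }
    }
}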

J-D

On Mon, Jul 8, 2013 at 4:45 PM, Viral Bajaria <vi...@gmail.com> wrote:
> Good question. When I looked at the logs, it's not clear from them whether
> it's reading a meta or data block. Is there any kind of log line that
> indicates that? Given that it says it's reading from a startOffset, I would
> assume this is a data block.
>
> A question that comes to mind: is this read seeking to that position
> directly, or is it going to cache the block? It looks like it is not caching
> the block if it's reading directly from a given offset. Or am I wrong?
>
> Following is a sample line that I used while debugging:
> 2013-07-08 22:58:55,221 DEBUG org.apache.hadoop.hdfs.DFSClient: New
> BlockReaderLocal for file
> /mnt/data/current/subdir34/subdir26/blk_-448970697931783518 of size
> 67108864 startOffset 13006577 length 54102287 short circuit checksum true
>
On Mon, Jul 8, 2013 at 4:37 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>
>> Do you know if it's a data or meta block?

Re: optimizing block cache requests + eviction

Posted by Viral Bajaria <vi...@gmail.com>.
We haven't disabled the block cache, so I doubt that's the problem.

On Mon, Jul 8, 2013 at 4:50 PM, Varun Sharma <va...@pinterest.com> wrote:

> FYI, if you disable your block cache, you will ask for "Index" blocks on
> every single request. So such a high rate of requests is plausible for Index
> blocks even when your requests are totally random over your data.
>
> Varun
>

Re: optimizing block cache requests + eviction

Posted by Varun Sharma <va...@pinterest.com>.
FYI, if you disable your block cache, you will ask for "Index" blocks on
every single request. So such a high rate of requests is plausible for Index
blocks even when your requests are totally random over your data.

Varun


On Mon, Jul 8, 2013 at 4:45 PM, Viral Bajaria <vi...@gmail.com> wrote:

> Good question. When I looked at the logs, it's not clear from them whether
> it's reading a meta or data block. Is there any kind of log line that
> indicates that? Given that it says it's reading from a startOffset, I would
> assume this is a data block.
>
> A question that comes to mind: is this read seeking to that position
> directly, or is it going to cache the block? It looks like it is not caching
> the block if it's reading directly from a given offset. Or am I wrong?
>
> Following is a sample line that I used while debugging:
> 2013-07-08 22:58:55,221 DEBUG org.apache.hadoop.hdfs.DFSClient: New
> BlockReaderLocal for file
> /mnt/data/current/subdir34/subdir26/blk_-448970697931783518 of size
> 67108864 startOffset 13006577 length 54102287 short circuit checksum true
>
> On Mon, Jul 8, 2013 at 4:37 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
> > Do you know if it's a data or meta block?
>

Re: optimizing block cache requests + eviction

Posted by Viral Bajaria <vi...@gmail.com>.
Good question. When I looked at the logs, it's not clear from them whether
it's reading a meta or data block. Is there any kind of log line that
indicates that? Given that it says it's reading from a startOffset, I would
assume this is a data block.

A question that comes to mind: is this read seeking to that position
directly, or is it going to cache the block? It looks like it is not caching
the block if it's reading directly from a given offset. Or am I wrong?

Following is a sample line that I used while debugging:
2013-07-08 22:58:55,221 DEBUG org.apache.hadoop.hdfs.DFSClient: New
BlockReaderLocal for file
/mnt/data/current/subdir34/subdir26/blk_-448970697931783518 of size
67108864 startOffset 13006577 length 54102287 short circuit checksum true

On Mon, Jul 8, 2013 at 4:37 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> Do you know if it's a data or meta block?

Re: optimizing block cache requests + eviction

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Do you know if it's a data or meta block?

J-D

On Mon, Jul 8, 2013 at 4:28 PM, Viral Bajaria <vi...@gmail.com> wrote:
> I was able to reproduce the same regionserver asking for the same local
> block over 300 times within the same 2-minute window by running one of my
> heavy workloads.
>
> Let me try to gather some stack dumps. I agree that jstack crashing the
> JVM is concerning, but there is nothing in the errors to explain why it
> happened. I will keep that conversation out of here.
>
> As an addendum, I am using asynchbase as my client. I am not sure whether
> the arrival of multiple requests for rowkeys that could be in the same
> non-cached block causes HBase to queue up a non-cached block read via SCR
> for each of them; since the box is under load, these pile up and make the
> problem worse.
>
> Thanks,
> Viral
>
> On Mon, Jul 8, 2013 at 3:53 PM, Andrew Purtell <ap...@apache.org> wrote:
>
>> but unless the behavior you see is the _same_ regionserver asking for the
>> _same_ block many times consecutively, it's probably workload related.
>>

Re: optimizing block cache requests + eviction

Posted by Viral Bajaria <vi...@gmail.com>.
I was able to reproduce the same regionserver asking for the same local
block over 300 times within the same 2-minute window by running one of my
heavy workloads.

Let me try to gather some stack dumps. I agree that jstack crashing the
JVM is concerning, but there is nothing in the errors to explain why it
happened. I will keep that conversation out of here.

As an addendum, I am using asynchbase as my client. I am not sure whether
the arrival of multiple requests for rowkeys that could be in the same
non-cached block causes HBase to queue up a non-cached block read via SCR
for each of them; since the box is under load, these pile up and make the
problem worse.

Thanks,
Viral

On Mon, Jul 8, 2013 at 3:53 PM, Andrew Purtell <ap...@apache.org> wrote:

> but unless the behavior you see is the _same_ regionserver asking for the
> _same_ block many times consecutively, it's probably workload related.
>

Re: optimizing block cache requests + eviction

Posted by Andrew Purtell <ap...@apache.org>.
On Mon, Jul 8, 2013 at 12:22 PM, Viral Bajaria <vi...@gmail.com> wrote:

> - I tried taking a stack trace using jstack, but after the dump it crashed
> the regionserver. I also did not take the dump on the offending
> regionserver; rather, I took it on the regionservers that were making the
> block requests. I will take a stack trace on the offending server. Is there
> any other tool besides jstack? I don't want to crash my regionserver.
>

jstack crashing the JVM is concerning but maybe off topic.

As an alternative, you can poll the regionserver for a stack dump on its
infoport, e.g. http://<host>:60010/stacks
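
Something like this throwaway poller (host, port and interval are
placeholders, 60010 is just the default infoport) will capture a few samples
without risking jstack:

import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.net.URL;

// Fetch the regionserver's /stacks page a few times and save each sample.
public class StacksPoller {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://rs-host.example.com:60010/stacks");
        for (int i = 0; i < 10; i++) {
            BufferedReader in =
                new BufferedReader(new InputStreamReader(url.openStream()));
            FileWriter out = new FileWriter("stacks-" + i + ".txt");
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.write('\n');
            }
            out.close();
            in.close();
            Thread.sleep(5000); // sample every 5 seconds
        }
    }
}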

but unless the behavior you see is the _same_ regionserver asking for the
_same_ block many times consecutively, it's probably workload related.


>  To answer Vladimir's points:
> - The data access pattern definitely turns out to be uniform over a period
> of time.
> - I just did a sweep of my code base and found that there are a few places
> where Scanners are using the block cache. I will disable that and see how
> it goes.
>

Please let us know how it goes.


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: optimizing block cache requests + eviction

Posted by Viral Bajaria <vi...@gmail.com>.
Thanks guys for going through that never-ending email! I will create the
JIRA for block cache eviction and the regionserver assignment command. Ted
already pointed to the JIRA which tries to go to a different datanode if the
primary is busy (I will add comments to that one).

To answer Andrew's questions:

- I am using HBase 0.94.4
- I tried taking a stack trace using jstack, but after the dump it crashed
the regionserver. I also did not take the dump on the offending
regionserver; rather, I took it on the regionservers that were making the
block requests. I will take a stack trace on the offending server. Is there
any other tool besides jstack? I don't want to crash my regionserver.
- The HBase clients' workload is fairly random, and I write to a table every
4-5 seconds. I have varying workloads for different tables, but I do a lot
of batching on the client side and group similar rowkeys together before
doing a GET/PUT. For example, in the best case I end up doing ~100 puts per
second to a region, and in the worst case ~5K puts per second. Again, the
workload is fairly random. Currently the clients for the table which had the
most data have been disabled, and yet I still see the heavy loads.

To answer Vladimir's points:
- The data access pattern definitely turns out to be uniform over a period
of time.
- I just did a sweep of my code base and found that there are a few places
where Scanners are using the block cache. I will disable that (sketch below)
and see how it goes.
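
For reference, the change on my side amounts to the sketch below (against the
standard 0.94 client API as far as I know; table and family names are
placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Keep large scans from churning the block cache by not caching the blocks
// they touch.
public class ScanWithoutBlockCache {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table");
        try {
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("d"));
            scan.setCaching(500);       // rows per RPC, unrelated to the block cache
            scan.setCacheBlocks(false); // the important bit: don't pollute the LRU block cache
            ResultScanner scanner = table.getScanner(scan);
            try {
                while (scanner.next() != null) {
                    // process rows...
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}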

Thanks,
Viral

Re: optimizing block cache requests + eviction

Posted by Andrew Purtell <ap...@apache.org>.
> Would it make sense to give remote blocks higher priority over local
> blocks that can be read via SCR, and not let them get evicted if there is a
> tie in which block to evict?

That sounds like a reasonable idea. As are the others.

But first, could this be a bug?

What version of HBase? Were you able to take and save stack traces from the
offending RegionServers while they were engaged in this behavior? (If not,
maybe next time?) What were the HBase clients doing at the time that might
correlate? Is there a set of steps that can be distilled that tends
to trigger what you are seeing?


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

RE: optimizing block cache requests + eviction

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
Viral,

From what you described here I can conclude that either:

1. your data access pattern is random and uniform (no data locality at all), or
2. you trash the block cache with scan operations that have block cache enabled,

or both.

But the idea of treating local and remote blocks differently is very good. +1.
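
To make that +1 a bit more concrete, the tie-break could be as simple as the
sketch below (purely illustrative; CachedBlockInfo and the comparator are
made-up names for this email, not HBase classes):

import java.util.Comparator;

// Made-up cache entry metadata for the sketch.
class CachedBlockInfo {
    long lastAccessTime;  // LRU-style recency
    boolean localViaScr;  // true if the block can be cheaply re-read via short-circuit read

    CachedBlockInfo(long lastAccessTime, boolean localViaScr) {
        this.lastAccessTime = lastAccessTime;
        this.localViaScr = localViaScr;
    }
}

// Evict least-recently-used first; on a recency tie, evict the local block
// (cheap to re-read via SCR) before the remote one.
class LocalityAwareEvictionOrder implements Comparator<CachedBlockInfo> {
    public int compare(CachedBlockInfo a, CachedBlockInfo b) {
        if (a.lastAccessTime != b.lastAccessTime) {
            return a.lastAccessTime < b.lastAccessTime ? -1 : 1; // older access evicted first
        }
        if (a.localViaScr == b.localViaScr) {
            return 0;
        }
        return a.localViaScr ? -1 : 1; // local block goes first, remote is kept longer
    }
}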

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com


Re: optimizing block cache requests + eviction

Posted by Ted Yu <yu...@gmail.com>.
For your third suggestion (requesting a block from another datanode when the
primary is slow), take a look at:

HBASE-7509 Enable RS to query a secondary datanode in parallel, if the
primary takes too long

Cheers
