Posted to dev@hbase.apache.org by Bryan Beaudreault <bb...@hubspot.com.INVALID> on 2021/12/01 13:04:52 UTC

[DISCUSS] StoreFileScanner parallel seek -- productionize or drop?

hbase.storescanner.parallel.seek.enable was added a few years ago in
https://issues.apache.org/jira/browse/HBASE-7495, but still defaults to
disabled. The description says "Enables StoreFileScanner parallel-seeking
in StoreScanner, a feature which can reduce response latency under special
conditions".

It's not very clear what "special conditions" means. Reading through the
entire comment history on that issue seems to indicate it can help when you
have "high random read, low cache hit rate, many store files".

We have a bunch of clusters with this shape, and in fact we use SSDs for
all storage, so I figured this might help a lot. I tried setting this to
true on one RegionServer of one of our highest-QPS clusters, hoping I'd see
some clear improvement. This very simple test was pretty much a wash, so I
need to do more methodical testing.

One thing did become clear from the test, though: is the default thread pool
size of 10 good enough for my use-case? I have no way of knowing, since I
can't find any logging or metrics around thread pool saturation. What I ended
up doing was repeatedly refreshing the /dump endpoint of the RS, and I noticed
that there were sometimes 1-5 tasks queued for the RS_PARALLEL_SEEK executor.
That suggests I should probably scale up the thread pool, but use-cases change
over time, so this doesn't seem like a great way to determine that.

Task queuing seems not great for a feature that is aimed at reducing
latencies. I wonder if we should consider some changes to make this easier
to deploy in production. Here are some ideas I had:

   - Can we generate a better default value for the thread pool size, maybe
   based on the number of RS handler threads or some other heuristic?
   - Should we consider eliminating queuing for this feature? Instead, if
   the threadpool is saturated, run the seek in-line in the current thread
   (i.e. revert to normal behavior). This would be more similar to how hedged
   reads work in HDFS.
   - Can we expose a metric or logging to help operators know when to scale
   up the thread pool? If we implemented the 2nd option above, we could expose
   a "seeksInCurrentThread" counter to track this, again similar to how hedged
   reads report on saturation. A rough sketch of these ideas follows this list.
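
To make the last two bullets concrete, below is a rough Java sketch of the kind
of executor I have in mind: no queue in front of the pool, a rejection handler
that falls back to seeking on the calling thread, and a counter so operators
can see how often that fallback happens. The sizing heuristic, the class, and
the "seeksInCurrentThread" name are all illustrations, not existing HBase code.

    import java.util.concurrent.SynchronousQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.LongAdder;

    // Sketch only: a seek pool that never queues. When all threads are busy,
    // the rejection handler runs the seek in-line on the submitting thread
    // and bumps a counter that could be exported as a metric.
    final class NoQueueSeekPool {
      final LongAdder seeksInCurrentThread = new LongAdder();

      ThreadPoolExecutor build(int handlerCount) {
        // Hypothetical heuristic: derive the pool size from the RS handler
        // count instead of a flat default of 10.
        int threads = Math.max(10, handlerCount / 4);
        return new ThreadPoolExecutor(
            threads, threads,
            60, TimeUnit.SECONDS,
            new SynchronousQueue<>(),            // never queue a seek behind another
            (task, pool) -> {                    // pool saturated: revert to in-line seek
              seeksInCurrentThread.increment();  // operators can watch/alert on this
              task.run();
            });
      }
    }

This is roughly how the HDFS hedged read pool behaves when it is saturated,
which is where the counter idea comes from.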

But with all of this said, I wonder if anyone is running this in production
and has any updated guidance on when to use this? Does it still make sense
given the last 8 years of development in HBase? Would it ever make sense to
make it enabled by default?

Re: [DISCUSS] StoreFileScanner parallel seek -- productionize or drop?

Posted by Bryan Beaudreault <bb...@hubspot.com.INVALID>.
The cluster I tested on has about 65k store files, with about 315 per
RegionServer. Unfortunately I don't have a great idea of how many of those
were being actively queried during the testing period (30m or so, during
peak traffic).

Are you saying that you've seen parallel seek actually help in this case,
or is this on your list of things to investigate tuning?

On Wed, Dec 1, 2021 at 1:33 PM Clay B. <cw...@clayb.net> wrote:

> Hi Bryan,
>
> I've seen significant performance degradation when tenants have had around
> 100 store files and wanted to look into this tuning to support some
> use-cases (such as DateTieredCompaction). Processing time is a very coarse
> but key metric I've used as a trigger for deeper schema investigation.
>
> May I ask how many store files your regions had in your tests?
>
> -Clay

Re: [DISCUSS] StoreFileScanner parallel seek -- productionize or drop?

Posted by "Clay B." <cw...@clayb.net>.
Hi Bryan,

I've seen significant performance degradation when tenants have had around
100 store files and wanted to look into this tuning to support some
use-cases (such as DateTieredCompaction). Processing time is a very coarse
but key metric I've used as a trigger for deeper schema investigation.

May I ask how many store files your regions had in your tests?

-Clay

On Wed, 1 Dec 2021, Andrew Purtell wrote:

> Unless the potential payoff is significant (yes, this might be hard to
> guess) I would vote for dropping a complex and incomplete (IMHO)
> disabled-by-default 'feature' that is, I would estimate, rarely used if at
> all, probably not at all.

Re: [DISCUSS] StoreFileScanner parallel seek -- productionize or drop?

Posted by Andrew Purtell <ap...@apache.org>.
Unless the potential payoff is significant (yes, this might be hard to
guess) I would vote for dropping a complex and incomplete (IMHO)
disabled-by-default 'feature' that is, I would estimate, rarely used if at
all, probably not at all.


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk