Posted to user@hbase.apache.org by Karthick Ram <ka...@gmail.com> on 2018/12/20 06:23:54 UTC
Re: Extremely high CPU usage after upgrading to Hbase 1.4.4

Hi Srinidhi,
We are also facing a similar problem in HBase 1.4.8 and are curious to know
what you did to resolve the issue. Did you revert to a previous version?
I've seen a Jira regarding this issue
(https://issues.apache.org/jira/browse/HBASE-21620).

@Ted Yu <yu...@gmail.com> any update on this?

On Tue, Sep 11, 2018 at 4:11 AM Srinidhi Muppalla <sr...@trulia.com>
wrote:

> It is during a period when the number of client operations was relatively
> low. It wasn’t zero, but it was definitely off peak hours.
>
> On 9/10/18, 12:16 PM, "Ted Yu" <yu...@gmail.com> wrote:
>
>     In the previous stack trace you sent, the shortCompactions and
>     longCompactions threads were not active.
>
>     Was the stack trace captured during a period when the number of client
>     operations was low?
>
>     If not, can you capture a stack trace during off-peak hours?
>
>     Cheers
>
>     On Mon, Sep 10, 2018 at 12:08 PM Srinidhi Muppalla <srinidhim@trulia.com>
>     wrote:
>
>     > Hi Ted,
>     >
>     > The highest number of filters used is 10, but the average is generally
>     > close to 1. Is it possible that the CPU usage spike has to do with HBase
>     > internal maintenance operations? Post-upgrade, the spike isn’t
>     > correlated with the frequency of the reads/writes we are making, because
>     > the high CPU usage persisted when the number of operations went down.
>     >
>     > Thank you,
>     > Srinidhi
>     >
>     > On 9/8/18, 9:44 AM, "Ted Yu" <yu...@gmail.com> wrote:
>     >
>     >     Srinidhi:
>     >     Do you know the average / highest number of ColumnPrefixFilters in
>     >     the FilterList?
>     >
>     >     Thanks
>     >
>     >     On Fri, Sep 7, 2018 at 10:00 PM Ted Yu <yu...@gmail.com> wrote:
>     >
>     >     > Thanks for the detailed background information.
>     >     >
>     >     > I assume your code de-duplicates the filters contained in
>     >     > FilterListWithOR.
>     >     >
>     >     > I took a look at the JIRAs which touched
>     >     > hbase-client/src/main/java/org/apache/hadoop/hbase/filter in
>     >     > branch-1.4. There have been a few patches (some very big) since the
>     >     > release of 1.3.0, so it is not obvious at first glance which one(s)
>     >     > might be related.
>     >     >
>     >     > I noticed ColumnPrefixFilter.getNextCellHint (and
>     >     > KeyValueUtil.createFirstOnRow) appearing many times in the stack
>     >     > trace.
>     >     >
>     >     > I plan to dig more in this area.
>     >     >
>     >     > Cheers
>     >     >
>     >     > On Fri, Sep 7, 2018 at 11:30 AM Srinidhi Muppalla <srinidhim@trulia.com>
>     >     > wrote:
>     >     >
>     >     >> Sure thing. For our table schema, each row represents one user and
>     >     >> the row key is that user’s unique id in our system. We currently
>     >     >> only use one column family in the table. Each column qualifier
>     >     >> represents an item that has been surfaced to that user, plus
>     >     >> additional information to differentiate the way the item was
>     >     >> surfaced. Without getting into too many specifics, the qualifier
>     >     >> follows the rough format of:
>     >     >>
>     >     >> “Channel-itemId-distinguisher”.
>     >     >>
>     >     >> The channel is the channel through which the item was previously
>     >     >> surfaced to the user. The itemId is the unique id of the item that
>     >     >> has been surfaced to the user. The distinguisher is some attribute
>     >     >> of how that item was surfaced to the user.
>     >     >>
>     >     >> When we run a scan, we currently only ever run it on one row at a
>     >     >> time. Scan was chosen over ‘get’ because (from our understanding)
>     >     >> the performance difference is negligible, and down the road using
>     >     >> scan would allow us some more flexibility.
>     >     >>
>     >     >> The filter list that is constructed for the scan uses a
>     >     >> ColumnPrefixFilter, as you mentioned. When a user is being
>     >     >> communicated to on a particular channel, we have a list of items
>     >     >> that we potentially want to surface for that user. So, we construct
>     >     >> a prefix list from the channel and each of the item ids, in the form
>     >     >> “channel-itemId”. Then we run a scan on that row with that filter
>     >     >> list using “WithOr” to get all of the matching channel-itemId
>     >     >> combinations currently in that row/column family in the table. This
>     >     >> way we know which of the items we want to surface to that user on
>     >     >> that channel have already been surfaced on that channel. The reason
>     >     >> we query using a prefix filter is so that we don’t need to know the
>     >     >> ‘distinguisher’ part of the record when writing the actual query,
>     >     >> because the distinguisher is only relevant in certain circumstances.
>     >     >>
>     >     >> Let me know if this is the information about our query pattern that
>     >     >> you were looking for and if there is anything I can clarify or add.
>     >     >>
>     >     >> Thanks,
>     >     >> Srinidhi
>     >     >>
>     >     >> On 9/6/18, 12:24 PM, "Ted Yu" <yu...@gmail.com> wrote:
>     >     >>
>     >     >>     From the stack trace, ColumnPrefixFilter is used during the scan.
>     >     >>
>     >     >>     Can you illustrate how the various filters are formed through
>     >     >>     FilterListWithOR? It would be easier for other people to
>     >     >>     reproduce the problem given your query pattern.
>     >     >>
>     >     >>     Cheers
>     >     >>
>     >     >>     On Thu, Sep 6, 2018 at 11:43 AM Srinidhi Muppalla <srinidhim@trulia.com>
>     >     >>     wrote:
>     >     >>
>     >     >>     > Hi Vlad,
>     >     >>     >
>     >     >>     > Thank you for the suggestion. I recreated the issue and attached
>     >     >>     > the stack traces I took. Let me know if there’s any other info I
>     >     >>     > can provide. We narrowed the issue down to occurring when
>     >     >>     > upgrading from 1.3.0 to any 1.4.x version.
>     >     >>     >
>     >     >>     > Thanks,
>     >     >>     > Srinidhi
>     >     >>     >
>     >     >>     > On 9/4/18, 8:19 PM, "Vladimir Rodionov" <vladrodionov@gmail.com>
>     >     >>     > wrote:
>     >     >>     >
>     >     >>     >     Hi, Srinidhi
>     >     >>     >
>     >     >>     >     The next time you see this issue, take a jstack of a
>     >     >>     >     region server (RS) several times in a row. Without stack
>     >     >>     >     traces it is hard to tell what was going on with your
>     >     >>     >     cluster after the upgrade.
>     >     >>     >
>     >     >>     >     -Vlad
>     >     >>     >
>     >     >>     >
>     >     >>     >
>     >     >>     >     On Tue, Sep 4, 2018 at 3:50 PM Srinidhi Muppalla <srinidhim@trulia.com>
>     >     >>     >     wrote:
>     >     >>     >
>     >     >>     >     > Hello all,
>     >     >>     >     >
>     >     >>     >     > We are currently running HBase 1.3.0 on an EMR cluster
>     >     >>     >     > running EMR 5.5.0. Recently, we attempted to upgrade our
>     >     >>     >     > cluster to HBase 1.4.4 (along with upgrading our EMR
>     >     >>     >     > cluster to 5.16). After upgrading, the CPU usage for all
>     >     >>     >     > of our region servers spiked up to 90%. The load_one for
>     >     >>     >     > all of our servers spiked from roughly 1-2 to 10 threads.
>     >     >>     >     > After upgrading, the number of operations to the cluster
>     >     >>     >     > hasn’t increased. After giving the cluster a few hours,
>     >     >>     >     > we had to revert the upgrade. From the logs, we are
>     >     >>     >     > unable to tell what is occupying the CPU resources. Is
>     >     >>     >     > this a known issue with 1.4.4? Any guidance or ideas for
>     >     >>     >     > debugging the cause would be greatly appreciated. What
>     >     >>     >     > are the best steps for debugging CPU usage?
>     >     >>     >     >
>     >     >>     >     > Thank you,
>     >     >>     >     > Srinidhi
>     >     >>     >     >
>     >     >>     >
>     >     >>     >
>     >     >>     >
>     >     >>
>     >     >>
>     >     >>
>     >
>     >
>     >
>
>
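Vlad's advice in the quoted thread was to take several jstack dumps in a row, and Ted then noticed ColumnPrefixFilter.getNextCellHint recurring in them. One rough way to spot such recurring frames is to tally the "at ..." lines across the dumps. The sketch below is stdlib-only Java; the class name StackFrameTally and the sample dump text in main are made up for illustration and are not the traces attached to the thread.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: counts how often each method frame appears in thread-dump text. */
final class StackFrameTally {

    /** Tallies every "at pkg.Class.method(...)" line; the "(File.java:NN)" suffix is dropped. */
    static Map<String, Integer> countFrames(String dumpText) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : dumpText.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("at ")) {
                continue; // thread headers, lock info, blank lines
            }
            String frame = trimmed.substring(3);
            int paren = frame.indexOf('(');
            if (paren >= 0) {
                frame = frame.substring(0, paren);
            }
            counts.merge(frame, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample text in the general shape of jstack output.
        String sample =
              "\"handler-0\" RUNNABLE\n"
            + "\tat org.apache.hadoop.hbase.filter.ColumnPrefixFilter.getNextCellHint(ColumnPrefixFilter.java)\n"
            + "\"handler-1\" RUNNABLE\n"
            + "\tat org.apache.hadoop.hbase.filter.ColumnPrefixFilter.getNextCellHint(ColumnPrefixFilter.java)\n"
            + "\tat org.apache.hadoop.hbase.KeyValueUtil.createFirstOnRow(KeyValueUtil.java)\n";
        countFrames(sample).forEach((frame, n) -> System.out.println(n + "  " + frame));
    }
}
```

Concatenating the text of several dumps taken a few seconds apart and passing it to countFrames gives a crude picture of where threads spend their time. It is not a substitute for a profiler.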
>
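The query pattern Srinidhi describes in the quoted thread (one row, a list of “channel-itemId” prefixes combined with OR) can be modelled without a cluster. The following is a minimal stdlib-only Java sketch. It is not the poster's code and it does not call the HBase API: the class name PrefixOrSketch, its method names, and the sample qualifiers are all hypothetical. In the HBase client the same shape would presumably be a FilterList with Operator.MUST_PASS_ONE holding one ColumnPrefixFilter per prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical stdlib-only model of the "prefix list with OR" query; no HBase calls. */
final class PrefixOrSketch {

    /** Builds de-duplicated "channel-itemId" prefixes, preserving input order. */
    static List<String> buildPrefixes(String channel, List<String> itemIds) {
        Set<String> unique = new LinkedHashSet<>();
        for (String itemId : itemIds) {
            unique.add(channel + "-" + itemId);
        }
        return new ArrayList<>(unique);
    }

    /** Returns the qualifiers of one row that start with at least one prefix (OR semantics). */
    static List<String> matchAny(SortedSet<String> rowQualifiers, List<String> prefixes) {
        List<String> matched = new ArrayList<>();
        for (String qualifier : rowQualifiers) {
            for (String prefix : prefixes) {
                if (qualifier.startsWith(prefix)) {
                    matched.add(qualifier);
                    break; // one matching prefix is enough
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Invented qualifiers in the "Channel-itemId-distinguisher" format.
        SortedSet<String> row = new TreeSet<>(
            Arrays.asList("email-42-a", "email-42-b", "email-9-a", "push-42-a"));
        List<String> prefixes = buildPrefixes("email", Arrays.asList("42", "7", "42"));
        System.out.println(matchAny(row, prefixes));
    }
}
```

Because matching is by raw prefix, a prefix such as email-4 would also match email-42-a, so a real prefix list may need a trailing separator. Whether the original application does that is not stated in the thread.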

-- 
Karthick R
about.me/karthick.r