Posted to user@hbase.apache.org by Tom Brown <to...@gmail.com> on 2012/09/10 19:32:48 UTC

Tracking down coprocessor pauses

Hi,

We have our system set up such that all interaction is done through
co-processors. We update the database via a co-processor (it has the
appropriate logic for dealing with concurrent access to rows), and we
also query/aggregate via a co-processor (since we don't want to send
all the data over the network).

This generally works very well. However, sometimes one of the region
servers will "pause". This doesn't appear to be a GC pause, since the
server still serves up the UI and adds occasional LRU-related messages
to the log. The only thing I've found is that when I check the server
that's causing the problem (easy to tell, since all the "working"
servers have a low load and the problem server has a higher load), I
can see a number of execCoprocessor requests that have been executing
for much longer than they should.

I want to know more about the specifics of those requests. Is there
an API I can use to track my coprocessor requests in more detail? Is
there a way to hook into the UI so I can provide my own list of
running processes? Or would I have to write all of that myself?

I am using HBase 0.92.1, but will be upgrading to 0.94.1 soon.

Thanks in advance!

--Tom

Re: Tracking down coprocessor pauses

Posted by Andrew Purtell <ap...@apache.org>.
Inline

On Wed, Sep 12, 2012 at 10:40 AM, Tom Brown <to...@gmail.com> wrote:
> I have captured some logs from what is happening during one of these pauses.
>
> http://pastebin.com/K162Einz
>
> Can someone help me figure out what's actually going on from these logs?
>
> --- My interpretation of the logs ---
>
> As you can see at the start of the logs, my coprocessor for updating
> the data is executing rapidly until 10:17:06.
>
> At that time the coprocessor for querying is invoked. This query
> should take only moments to return, but doesn't return until 10:44:52.

Here it would be helpful to get a stacktrace from the regionserver
where the CP is executing, to see where the RPC threads servicing the
CP invocations are hung up.
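
A minimal sketch of one way to capture that thread dump from inside
the JVM (running jstack against the regionserver PID gives the same
information); the class and method names here are illustrative:

  import java.util.Map;

  public class ThreadDumpSketch {
    /** Print every thread's name, state and stack to stderr. */
    public static void dumpAllStacks() {
      for (Map.Entry<Thread, StackTraceElement[]> e :
          Thread.getAllStackTraces().entrySet()) {
        Thread t = e.getKey();
        System.err.println(t.getName() + " (state=" + t.getState() + ")");
        for (StackTraceElement frame : e.getValue()) {
          System.err.println("    at " + frame);
        }
      }
    }
  }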

>
> At 10:18:53 there appear to be some compaction-related messages
> (though they don't appear to be the cause, since they happen over a
> minute after the server stops functioning).
>
> It appears to run compaction until 10:42:25. The next two minutes
> contain just LRU eviction messages.
>
> At 10:44:52, the query from earlier appears to complete, after having
> summarized only 863 rows. A few other queued requests are attempted,
> but fail with exceptions (ClosedChannelException).
>
> Eventually the exceptions are being thrown from "openScanner", which
> really doesn't sound good to me.

The ClosedChannelExceptions appear to come from RPC service threads
that are now unstuck and processing queued-up CP invocations, but the
client has already given up, so they can't write results back and they
error out.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

Re: Tracking down coprocessor pauses

Posted by Tom Brown <to...@gmail.com>.
I have captured some logs from what is happening during one of these pauses.

http://pastebin.com/K162Einz

Can someone help me figure out what's actually going on from these logs?

--- My interpretation of the logs ---

As you can see at the start of the logs, my coprocessor for updating
the data is executing rapidly until 10:17:06.

At that time the coprocessor for querying is invoked. This query
should take only moments to return, but doesn't return until 10:44:52.

At 10:18:53 there appear to be some compaction-related messages
(though they don't appear to be the cause, since they happen over a
minute after the server stops functioning).

It appears to run compaction until 10:42:25. The next two minutes
contain just LRU eviction messages.

At 10:44:52, the query from earlier appears to complete, after having
summarized only 863 rows. A few other queued requests are attempted,
but fail with exceptions (ClosedChannelException).

Eventually the exceptions are being thrown from "openScanner", which
really doesn't sound good to me.


--Tom


On Mon, Sep 10, 2012 at 11:32 AM, Tom Brown <to...@gmail.com> wrote:
> Hi,
>
> We have our system set up such that all interaction is done through
> co-processors. We update the database via a co-processor (it has the
> appropriate logic for dealing with concurrent access to rows), and we
> also query/aggregate via a co-processor (since we don't want to send
> all the data over the network).
>
> This generally works very well. However, sometimes one of the region
> servers will "pause". This doesn't appear to be a GC pause, since the
> server still serves up the UI and adds occasional LRU-related messages
> to the log. The only thing I've found is that when I check the server
> that's causing the problem (easy to tell, since all the "working"
> servers have a low load and the problem server has a higher load), I
> can see a number of execCoprocessor requests that have been executing
> for much longer than they should.
>
> I want to know more about the specifics of those requests. Is there
> an API I can use to track my coprocessor requests in more detail? Is
> there a way to hook into the UI so I can provide my own list of
> running processes? Or would I have to write all of that myself?
>
> I am using HBase 0.92.1, but will be upgrading to 0.94.1 soon.
>
> Thanks in advance!
>
> --Tom

Re: Tracking down coprocessor pauses

Posted by Andrew Purtell <ap...@apache.org>.
On Mon, Sep 10, 2012 at 10:32 AM, Tom Brown <to...@gmail.com> wrote:
> I want to know more about the specifics of those requests. Is there
> an API I can use to track my coprocessor requests in more detail? Is
> there a way to hook into the UI so I can provide my own list of
> running processes? Or would I have to write all of that myself?
>
> I am using HBase 0.92.1, but will be upgrading to 0.94.1 soon.

I haven't actually done this, so YMMV, but you should be able to get a
reference to the TaskMonitor singleton
(http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/monitoring/TaskMonitor.html)
via the static method TaskMonitor.get() and then create and update the
state of MonitoredTasks
(http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/monitoring/MonitoredTask.html)
for your coprocessor's internal functions.
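
A minimal sketch of what that could look like inside a coprocessor,
assuming the TaskMonitor/MonitoredTask API linked above (the endpoint
method name, status strings and helper are illustrative):

  import org.apache.hadoop.hbase.monitoring.MonitoredTask;
  import org.apache.hadoop.hbase.monitoring.TaskMonitor;

  public class MonitoredAggregationSketch {
    // Hypothetical coprocessor entry point; substitute your real endpoint method.
    public long aggregate(byte[] startRow, byte[] stopRow) {
      MonitoredTask status =
          TaskMonitor.get().createStatus("Aggregating one time window");
      try {
        status.setStatus("Scanning and summarizing rows");
        long result = doScanAndAggregate(startRow, stopRow, status);
        status.markComplete("Aggregation finished");
        return result;
      } catch (RuntimeException e) {
        status.abort("Aggregation failed: " + e.getMessage());
        throw e;
      }
    }

    private long doScanAndAggregate(byte[] startRow, byte[] stopRow,
        MonitoredTask status) {
      // ... scan the region, calling status.setStatus("Summarized N rows")
      // periodically so long-running requests show progress ...
      return 0L;
    }
  }

Tasks created this way should then show up in the regionserver status
page's task list, which would also cover the "hook into the UI" part
of the question.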

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

Re: Tracking down coprocessor pauses

Posted by Tom Brown <to...@gmail.com>.
Michael,

We are using HBase to track the usage of our service. Specifically,
each client sends an update when they start a task, at regular
intervals during the task, and an update when they finish the task
(and then presumably they start another, continuing the cycle). Each
user has various attributes (which version of our software they're
using, their location, which task they're working on, etc.), and we
want to be able to see stats in aggregate and to drill down into
various areas (similar to OLAP; incidentally, we chose HBase because
none of the OLAP systems seemed to accept real-time updates).

The row key is a compound of: [Attribute1 Attribute2 ... AttributeN].

Each row has roughly 10 cells, all of which represent counters. Some
require simple incrementing, but others require fancier bitwise
operations to increment properly (we use HyperLogLog to estimate a
unique count).

The rows are stored with a 15-second granularity (everything from
0:00-0:15 is stored in one row, everything from 0:15-0:30 is in the
next, etc). The data is formatted such that you can get the
aggregation for a larger time period by combining all of the rows that
comprise that time frame. For the counter cells, this uses straight
addition. For the unique counters, bitwise operations are required.
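
A rough sketch of that combine step, assuming 8-byte long counter
cells and a HyperLogLog sketch stored as one register per byte, where
a union is the element-wise maximum of the registers (the cell layout
is an assumption, not the poster's actual format):

  import org.apache.hadoop.hbase.util.Bytes;

  public class WindowMergeSketch {
    /** Combine two plain counter cells by addition. */
    public static byte[] mergeCounter(byte[] a, byte[] b) {
      return Bytes.toBytes(Bytes.toLong(a) + Bytes.toLong(b));
    }

    /** Combine two unique-count sketches (one HyperLogLog register per
     *  byte) by taking the per-register maximum. */
    public static byte[] mergeUnique(byte[] a, byte[] b) {
      byte[] merged = new byte[a.length];
      for (int i = 0; i < a.length; i++) {
        merged[i] = (byte) Math.max(a[i] & 0xff, b[i] & 0xff);
      }
      return merged;
    }
  }

Aggregating a larger time period is then just a fold of these two
functions over every 15-second row in the requested range.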

The most frequently requested data has only one or two relevant
attributes. For example, we commonly want to see the stats of our
system broken out just by task. Of course, that makes writes a little
more difficult. When we have thousands of users working on the same
kind of task, we receive a lot of concurrent updates to the row with
[attribute=TheTask]. HBase supports atomic increments, but not atomic
bitwise operations, so we had to implement our own locking solution.

There seemed to be a lot of problems with row-level locks, so we
decided to do the locking in the one place we could guarantee it: a
coprocessor. Within the coprocessor is logic to coalesce multiple
updates to the same row into a single HBase update. When performing
aggregations, a requested time period might summarize thousands of
rows into a single summary row. We thought that sending the entire set
over the network was overkill, especially since the aggregation
operations are fairly simple (addition and some bitwise calculations),
so the co-processor also contains code to perform aggregations.
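
A stripped-down sketch of the per-row serialization idea, independent
of the coprocessor API and with all names hypothetical: every updater
for a given row synchronizes on the same lock object, so the
read-modify-write (including the bitwise merges) cannot interleave:

  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.ConcurrentMap;

  public class RowUpdateSerializer {
    // One lock object per row key (hex-encoded).
    private final ConcurrentMap<String, Object> rowLocks =
        new ConcurrentHashMap<String, Object>();

    public void applyUpdate(byte[] row, long[] deltas) {
      String key = toHex(row);
      Object lock = rowLocks.get(key);
      if (lock == null) {
        Object candidate = new Object();
        Object existing = rowLocks.putIfAbsent(key, candidate);
        lock = (existing != null) ? existing : candidate;
      }
      synchronized (lock) {
        // Read the row's current cells from the region, apply the
        // increments / bitwise merges in memory, and write the result
        // back as a single Put.
        readModifyWrite(row, deltas);
      }
    }

    // Placeholder for the HBase-specific Get + merge + Put.
    private void readModifyWrite(byte[] row, long[] deltas) { }

    private String toHex(byte[] row) {
      StringBuilder sb = new StringBuilder(row.length * 2);
      for (byte b : row) {
        sb.append(String.format("%02x", b & 0xff));
      }
      return sb.toString();
    }
  }

This only serializes updaters within a single regionserver, which is
enough here because any given row is served by exactly one region at a
time.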

I'm interested in improving the design, so any suggestions will be appreciated.

Thanks in advance,

--Tom

On Mon, Sep 10, 2012 at 12:45 PM, Michael Segel
<mi...@hotmail.com> wrote:
>
> On Sep 10, 2012, at 12:32 PM, Tom Brown <to...@gmail.com> wrote:
>
>> We have our system set up such that all interaction is done through
>> co-processors. We update the database via a co-processor (it has the
>> appropriate logic for dealing with concurrent access to rows), and we
>> also query/aggregate via a co-processor (since we don't want to send
>> all the data over the network).
>
> Could you expand on this? On the surface, this doesn't sound like a very good idea.
>

Re: Tracking down coprocessor pauses

Posted by Michael Segel <mi...@hotmail.com>.
On Sep 10, 2012, at 12:32 PM, Tom Brown <to...@gmail.com> wrote:

> We have our system set up such that all interaction is done through
> co-processors. We update the database via a co-processor (it has the
> appropriate logic for dealing with concurrent access to rows), and we
> also query/aggregate via a co-processor (since we don't want to send
> all the data over the network).

Could you expand on this? On the surface, this doesn't sound like a very good idea.