Posted to user@phoenix.apache.org by "Riesland, Zack" <Za...@sensus.com> on 2016/08/29 16:41:25 UTC

Help w/ table that suddenly keeps timing out

Our cluster recently had some issues related to network outages*.

When all the dust settled, HBase eventually "healed" itself, and almost everything is back to working well, with a couple of exceptions.

In particular, we have one table where almost every (Phoenix) query times out, which was never the case before. At around 400 million rows, it's very small compared to most of our other tables.

I have tried with a raw JDBC connection in Java code as well as with Aqua Data Studio, both of which usually work fine.

The specific failure is that after 15 minutes (the set timeout), I get a one-line error that says: “Error 1102 (XCL02): Cannot get all table regions”

When I look at the GUI tools (like http://<my server>:16010/master-status#storeStats) it shows '1' under "offline regions" for that table (it has 33 total regions). Almost all the other tables show '0'.

Can anyone help me troubleshoot this?

Are there Phoenix tables I can clear out that may be confused?

This isn’t an issue with the schema or skew or anything. The same table with the same data was lightning fast before these HBase issues.

I know there is a CLI tool for fixing HBase issues. I'm wondering whether that "offline region" is the cause of these timeouts.

If not, how can I figure it out?

Thanks!

* FWIW, what happened was that DNS stopped working for a while, so HBase started referring to all the region servers by IP address, which somewhat worked, until the region servers restarted. Then they were hosed until a bit of manual intervention.

Re: Help w/ table that suddenly keeps timing out

Posted by Ankit Singhal <an...@gmail.com>.

Yes, Ted is right: "Error 1102 (XCL02): Cannot get all table regions"
happens when Phoenix is not able to get the locations of all regions.
Assigning that offline region should help.
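
For client code that wants to tell this failure apart from an ordinary slow
query, here is a minimal sketch. It assumes the Phoenix JDBC driver exposes
the values from the error text (1102 and XCL02) through the standard
getErrorCode() and getSQLState() methods; the class and method names are
made up for illustration.

```java
import java.sql.SQLException;

final class PhoenixRegionErrors {
    // Values copied from the error text in this thread: "Error 1102 (XCL02)".
    static final int REGION_LOOKUP_ERROR_CODE = 1102;
    static final String REGION_LOOKUP_SQL_STATE = "XCL02";

    /**
     * Returns true if the given exception, or any exception chained to it,
     * carries vendor code 1102 or SQLState XCL02.
     */
    static boolean isRegionLookupFailure(SQLException e) {
        // SQLException is Iterable: it walks the exception itself,
        // its causes, and any chained "next" exceptions.
        for (Throwable t : e) {
            if (t instanceof SQLException) {
                SQLException s = (SQLException) t;
                if (s.getErrorCode() == REGION_LOOKUP_ERROR_CODE
                        || REGION_LOOKUP_SQL_STATE.equals(s.getSQLState())) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A caller could check isRegionLookupFailure(e) in the catch block around the
query and, when it returns true, go look at region assignment instead of
raising the query timeout.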

On Mon, Aug 29, 2016 at 10:22 PM, Ted Yu <yu...@gmail.com> wrote:

> I searched for "Cannot get all table regions" in hbase repo - no hit.
> Seems to be Phoenix error.
>
> Anyway, the cause could be due to the 1 offline region for this table.
> Can you retrieve the encoded region name and search for it in the master
> log ?
>
> Feel free to pastebin snippets of master / region server logs if needed
> (with proper redaction).
>
> See if the following shell command works:
>
>   hbase> assign 'REGIONNAME'
>   hbase> assign 'ENCODED_REGIONNAME'
>
> Cheers

Re: Help w/ table that suddenly keeps timing out

Posted by Ted Yu <yu...@gmail.com>.

I searched for "Cannot get all table regions" in the HBase repo - no hit.
It seems to be a Phoenix error.

Anyway, the cause could be the 1 offline region for this table.
Can you retrieve the encoded region name and search for it in the master
log?
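
If it helps, the lookup can be scripted. This is a rough sketch only: it
assumes the full region name has the usual form
"<table>,<start key>,<region id>.<encoded name>." (note the trailing dot),
and that you pass in the path to your own master log, which varies by
install. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class RegionLogSearch {
    /**
     * Extracts the encoded name from a full region name, i.e. the text
     * between the last two dots. Assumes the name ends with a dot.
     */
    static String encodedName(String fullRegionName) {
        if (!fullRegionName.endsWith(".")) {
            throw new IllegalArgumentException("expected a trailing '.': " + fullRegionName);
        }
        int end = fullRegionName.length() - 1;                // index of the trailing dot
        int start = fullRegionName.lastIndexOf('.', end - 1); // dot before the encoded name
        if (start < 0) {
            throw new IllegalArgumentException("no encoded name found: " + fullRegionName);
        }
        return fullRegionName.substring(start + 1, end);
    }

    /** Returns every line of the given log file that mentions the encoded name. */
    static List<String> linesMentioning(Path log, String encodedName) throws IOException {
        try (Stream<String> lines = Files.lines(log)) {
            return lines.filter(line -> line.contains(encodedName))
                        .collect(Collectors.toList());
        }
    }
}
```

For example, encodedName applied to a made-up region name such as
"MY_TABLE,,1472485285000.0123456789abcdef0123456789abcdef." gives the
32-character string between the last two dots, which is what to search for.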

Feel free to pastebin snippets of master / region server logs if needed
(with proper redaction).

See if one of the following shell commands works (pass either the full
region name or the encoded region name):

  hbase> assign 'REGIONNAME'
  hbase> assign 'ENCODED_REGIONNAME'

Cheers
