Posted to user@hbase.apache.org by Tianying Chang <ti...@ebaysf.com> on 2013/01/25 19:54:21 UTC

AsynchBase client holds stale dead region server for long time even after the META has already been update.

Hi

One machine in our cluster crashed. After 3 minutes, the master detected it and reassigned its regions to other region servers. The regions were back online on the other region servers within one minute. But the asynchbase client kept pointing at the old, dead region server for 50 minutes, which caused data loss. We had to restart the AsynchBase client, and that fixed the problem.

It seems there is a bug in the AsynchBase client code. Has anyone else seen this? If I want to open a bug for AsynchBase, should I use the HBase JIRA, or is there a dedicated one for AsynchBase? I cannot seem to find a dedicated AsynchBase JIRA.

Thanks
Tian-Ying

Re: AsynchBase client holds stale dead region server for long time even after the META has already been update.

Posted by Marcos Ortiz <ml...@uci.cu>.
Great to hear, Ishan.
We faced a similar error here.
We will test this with the fix that you propose.
Best wishes
On 01/29/2013 12:43 PM, ishan chhabra wrote:
> Hi Tsuna,
> As Shrijeet mentioned, we (@Rocketfuel) were experiencing this bug 
> internally when doing cluster restarts. After some trial and error, I 
> was able to create a set of steps to reproduce this bug in a 
> controlled fashion on our test cluster. Further, using heap dumps and 
> added debug messages, this looks like the cause and fix: 
> https://github.com/OpenTSDB/asynchbase/pull/48. I have tested this 
> repeatedly on the test cluster and things are looking fine. Please 
> have a look and see if this makes sense and if the fix is a correct one.
>
> Cheers,
> Ishan

-- 
Marcos Ortiz Valmaseda,
Product Manager && Data Scientist at DATEC
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>

Re: AsynchBase client holds stale dead region server for long time even after the META has already been update.

Posted by ishan chhabra <is...@gmail.com>.
Hi Tsuna, 
As Shrijeet mentioned, we (@Rocketfuel) were experiencing this bug 
internally when doing cluster restarts. After some trial and error, I was 
able to create a set of steps to reproduce this bug in a controlled fashion 
on our test cluster. Further, using heap dumps and added debug messages, 
this looks like the cause and fix: 
https://github.com/OpenTSDB/asynchbase/pull/48. I have tested this 
repeatedly on the test cluster and things are looking fine. Please have a 
look and see if this makes sense and if the fix is a correct one. 

Cheers,
Ishan

On Friday, 25 January 2013 22:53:17 UTC-8, tsuna wrote:
>
> On Fri, Jan 25, 2013 at 5:28 PM, Tianying Chang <tic...@ebaysf.com> wrote: 
> > Thanks for the information! We have seen this a couple of times recently. 
> > Last week, it went on for a very long time (40+ minutes before we restarted). 
> > I will follow up on that discussion thread. Thanks a lot!! 
>
> This is bug number 1. I haven't been able to track it down as I've 
> never been able to reproduce it in a controlled fashion :( 
> https://github.com/OpenTSDB/asynchbase/issues/1 
>
> I also spent hours manually walking references of heap dumps 
> and checking state to see if anything was wrong but I haven't 
> found anything, not even a clue. 
>
> -- 
> Benoit "tsuna" Sigoure 
>

Re: AsynchBase client holds stale dead region server for long time even after the META has already been update.

Posted by tsuna <ts...@gmail.com>.
On Fri, Jan 25, 2013 at 5:28 PM, Tianying Chang <ti...@ebaysf.com> wrote:
> Thanks for the information! We have seen this a couple of times recently. Last week, it went on for a very long time (40+ minutes before we restarted). I will follow up on that discussion thread. Thanks a lot!!

This is bug number 1. I haven't been able to track it down as I've
never been able to reproduce it in a controlled fashion :(
https://github.com/OpenTSDB/asynchbase/issues/1

I also spent hours manually walking references of heap dumps
and checking state to see if anything was wrong but I haven't
found anything, not even a clue.

-- 
Benoit "tsuna" Sigoure

RE: AsynchBase client holds stale dead region server for long time even after the META has already been update.

Posted by Tianying Chang <ti...@ebaysf.com>.
Thanks Shrijeet

Thanks for the information! We have seen this a couple of times recently. Last week, it went on for a very long time (40+ minutes before we restarted). I will follow up on that discussion thread. Thanks a lot!!

Tian-Ying 


-----Original Message-----
From: Shrijeet Paliwal [mailto:shrijeet.paliwal@gmail.com] 
Sent: Friday, January 25, 2013 11:02 AM
To: Ted Yu
Cc: Async HBase; user@hbase.apache.org
Subject: Re: AsynchBase client holds stale dead region server for long time even after the META has already been update.

This has been raised earlier
https://groups.google.com/d/topic/asynchbase/xE2lYE6CbmQ/discussion , https://groups.google.com/d/topic/asynchbase/nfLTwjdqq9M/discussion . It does look like a bug but a hard one to reproduce.

We have been seeing this in our production environment; efforts are underway to reproduce it in a testing environment.

--
Shrijeet


Re: AsynchBase client holds stale dead region server for long time even after the META has already been update.

Posted by Shrijeet Paliwal <sh...@gmail.com>.
This has been raised earlier
https://groups.google.com/d/topic/asynchbase/xE2lYE6CbmQ/discussion ,
https://groups.google.com/d/topic/asynchbase/nfLTwjdqq9M/discussion . It
does look like a bug but a hard one to reproduce.

We have been seeing this in our production environment; efforts are underway to
reproduce it in a testing environment.

--
Shrijeet


On Fri, Jan 25, 2013 at 10:58 AM, Ted Yu <yu...@gmail.com> wrote:

> Tianying:
> I moved user@ to Cc.
>
> There is a google group for asynchbase.
> Please subscribe to that group.
>
> Can you clarify the version of asynchbase you're using?
>
> Cheers

RE: AsynchBase client holds stale dead region server for long time even after the META has already been update.

Posted by Tianying Chang <ti...@ebaysf.com>.
Thanks Marcos. Can I file a bug there? Or on the Google Group?

-----Original Message-----
From: Marcos Ortiz [mailto:mlortiz@uci.cu] 
Sent: Friday, January 25, 2013 1:30 PM
To: user@hbase.apache.org
Cc: Tianying Chang; Async HBase
Subject: Re: AsynchBase client holds stale dead region server for long time even after the META has already been update.

Regards, Tianying
AsynchBase is an open source project from StumbleUpon.
You can find it on GitHub:
https://github.com/stumbleupon/asynchbase

Best wishes
On 01/25/2013 02:12 PM, Tianying Chang wrote:
> Ted
>
> it is 1.3.1
>
> Thanks
> Tian-Ying


--
Marcos Ortiz Valmaseda,
Technical Product Manager at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>

Re: AsynchBase client holds stale dead region server for long time even after the META has already been update.

Posted by Marcos Ortiz <ml...@uci.cu>.
Regards, Tianying
AsynchBase is an open source project from StumbleUpon.
You can find it on GitHub:
https://github.com/stumbleupon/asynchbase

Best wishes
On 01/25/2013 02:12 PM, Tianying Chang wrote:
> Ted
>
> it is 1.3.1
>
> Thanks
> Tian-Ying


-- 
Marcos Ortiz Valmaseda,
Technical Product Manager at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>

RE: AsynchBase client holds stale dead region server for long time even after the META has already been update.

Posted by Tianying Chang <ti...@ebaysf.com>.
Ted

It is 1.3.1.

Thanks
Tian-Ying 
________________________________________
From: Ted Yu [yuzhihong@gmail.com]
Sent: Friday, January 25, 2013 10:58 AM
To: Async HBase
Cc: user@hbase.apache.org
Subject: Re: AsynchBase client holds stale dead region server for long time even after the META has already been update.

Tianying:
I moved user@ to Cc.

There is a google group for asynchbase.
Please subscribe to that group.

Can you clarify the version of asynchbase you're using?

Cheers


Re: AsynchBase client holds stale dead region server for long time even after the META has already been update.

Posted by Ted Yu <yu...@gmail.com>.
Tianying:
I moved user@ to Cc.

There is a google group for asynchbase.
Please subscribe to that group.

Can you clarify the version of asynchbase you're using?

Cheers

On Fri, Jan 25, 2013 at 10:54 AM, Tianying Chang <ti...@ebaysf.com> wrote:

> Hi
>
> One machine crashed in our cluster. After 3 minutes, the master detect it
> and re-assign the regions to other region servers. The regions are back
> online on other RS within one minute. But the asynchbase client still hold
> old dead regionserver for 50 minutes and cause data loss. We have to
> restart the AsynchBase client and that fixed the problem.
>
> It seems there is a bug in AsyncBase client code. Has anyone else seen
> this? If I want to open a bug for Asynchbase, should I use Hbase jira? or
> is there a dedicated one for Asynchbase? I seems cannot find dedicated
> AsynchBase jira.
>
> Thanks
> Tian-Ying
>