You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hbase.apache.org by st...@duboce.net on 2010/12/16 01:14:27 UTC

Review Request: hbase-3362 If .META. offline between OPENING and OPENED, then wrong server location in .META. is possible

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1298/
-----------------------------------------------------------

Review request for hbase and Jonathan Gray.


Summary
-------

M src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
 Removed stale comments and TODOs.

 Added a 'version' datamenber, the znode edit version which we keep across open process.

 Refactored the setting of OPENING out into a method that is used in multiple places 
 now rather than repeat code.  Did this in new tickleOpening method.

 Added new PostOpenDeployTasksThread which we run to do the postOpenDeployTasks.
 While its running we update OPENING state if its running a while.


This addresses bug hbase-3362.
    http://issues.apache.org/jira/browse/hbase-3362


Diffs
-----

  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1049707 

Diff: http://review.cloudera.org/r/1298/diff


Testing
-------

Ran it on my cluster. Seems to work as the old code did.


Thanks,

stack

Re: Review Request: hbase-3362 If .META. offline between OPENING and OPENED, then wrong server location in .META. is possible

Posted by Jonathan Gray <jg...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1298/#review2103
-----------------------------------------------------------

Ship it!


it's getting pretty crazy but this looks good.

it's unfortunate we have all these extra node transitioning methods inside this class.  this pattern of doing node transitions and tracking expected version is very common and we'll probably have more of it so we should look at doing some kind of generic abstraction for that pattern soon.

+1 for commit, thanks for the changes


trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
<http://review.cloudera.org/r/1298/#comment6561>

    typo 'initalizes' but good comment



trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
<http://review.cloudera.org/r/1298/#comment6562>

    interesting thing is... we only use this progressable if we do a log replay.  in that case, a region open is not really idempotent as we treat it here.
    
    outside scope of this jira but something to think about.


- Jonathan


On 2010-12-16 17:01:04, stack wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://review.cloudera.org/r/1298/
> -----------------------------------------------------------
> 
> (Updated 2010-12-16 17:01:04)
> 
> 
> Review request for hbase and Jonathan Gray.
> 
> 
> Summary
> -------
> 
> M src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
>  Removed stale comments and TODOs.
> 
>  Added a 'version' datamenber, the znode edit version which we keep across open process.
> 
>  Refactored the setting of OPENING out into a method that is used in multiple places 
>  now rather than repeat code.  Did this in new tickleOpening method.
> 
>  Added new PostOpenDeployTasksThread which we run to do the postOpenDeployTasks.
>  While its running we update OPENING state if its running a while.
> 
> 
> This addresses bug hbase-3362.
>     http://issues.apache.org/jira/browse/hbase-3362
> 
> 
> Diffs
> -----
> 
>   trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1050086 
> 
> Diff: http://review.cloudera.org/r/1298/diff
> 
> 
> Testing
> -------
> 
> Ran it on my cluster. Seems to work as the old code did.
> 
> 
> Thanks,
> 
> stack
> 
>

Re: Review Request: hbase-3362 If .META. offline between OPENING and OPENED, then wrong server location in .META. is possible

Posted by st...@duboce.net.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1298/
-----------------------------------------------------------

(Updated 2010-12-16 17:01:04.757304)


Review request for hbase and Jonathan Gray.


Changes
-------

I implemented Jon's suggestions and then some.  Not pretty but works in my local and cluster testing.


Summary
-------

M src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
 Removed stale comments and TODOs.

 Added a 'version' datamenber, the znode edit version which we keep across open process.

 Refactored the setting of OPENING out into a method that is used in multiple places 
 now rather than repeat code.  Did this in new tickleOpening method.

 Added new PostOpenDeployTasksThread which we run to do the postOpenDeployTasks.
 While its running we update OPENING state if its running a while.


This addresses bug hbase-3362.
    http://issues.apache.org/jira/browse/hbase-3362


Diffs (updated)
-----

  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1050086 

Diff: http://review.cloudera.org/r/1298/diff


Testing
-------

Ran it on my cluster. Seems to work as the old code did.


Thanks,

stack

Re: Review Request: hbase-3362 If .META. offline between OPENING and OPENED, then wrong server location in .META. is possible

Posted by st...@duboce.net.


> On 2010-12-16 00:14:36, Jonathan Gray wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java, line 173
> > <http://review.cloudera.org/r/1298/diff/1/?file=18309#file18309line173>
> >
> >     This is a busy wait loop?
> >     
> >     Should we add a wait/notify on something passed to the thread and w/ a timeout of the period?
> >     
> >     And then we should probably also have some kind of max timeout.  Even if minutes, there could be weird cluster state where the RS misses META availability but someone else might handle it properly, so max timeout might be good?
> 
> stack wrote:
>     I need to add a small sleep.  I'd rather do this than wait/notify.  t.isAlive should be enough.  Regards max timeout, I should add check if server is stopped ... and for max timeout, what you think?  Ten minutes?  Then abort?
> 
> Jonathan Gray wrote:
>     I was thinking 5 minutes.
>     
>     How long you going to sleep for?  That seems like an unideal way to do this.  I would prefer wait/notify and have timeout on wait be this 1/3 period, but small sleep could work.  If really small, we're in busy loop again.  If too big, we increase how long we have to wait.  This is on critical path of every single region open.
>     
>     If we go down path of threads doing work, I don't see why we don't want to use wait/notify to let the blocked thread know when it's done.

5 minute is not enough.  IIRC, it was > 5 minutes before the region came back online.  Let me see.

I want to avoid mother thread depending on daughter thread signaling it to stop... seems redundant when I'm watching the daughter with the isAlive already.

The sleep would be short.  1ms or so.  Normally we'd not trip into the sleep.  The operation will have compeleted before we have chance to sleep.  It'd only sleep when no progress can be made.

I'll add wait/notify for you to get this patch cleared past review, np.


- stack


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1298/#review2083
-----------------------------------------------------------


On 2010-12-15 16:14:27, stack wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://review.cloudera.org/r/1298/
> -----------------------------------------------------------
> 
> (Updated 2010-12-15 16:14:27)
> 
> 
> Review request for hbase and Jonathan Gray.
> 
> 
> Summary
> -------
> 
> M src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
>  Removed stale comments and TODOs.
> 
>  Added a 'version' datamenber, the znode edit version which we keep across open process.
> 
>  Refactored the setting of OPENING out into a method that is used in multiple places 
>  now rather than repeat code.  Did this in new tickleOpening method.
> 
>  Added new PostOpenDeployTasksThread which we run to do the postOpenDeployTasks.
>  While its running we update OPENING state if its running a while.
> 
> 
> This addresses bug hbase-3362.
>     http://issues.apache.org/jira/browse/hbase-3362
> 
> 
> Diffs
> -----
> 
>   trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1049707 
> 
> Diff: http://review.cloudera.org/r/1298/diff
> 
> 
> Testing
> -------
> 
> Ran it on my cluster. Seems to work as the old code did.
> 
> 
> Thanks,
> 
> stack
> 
>

Re: Review Request: hbase-3362 If .META. offline between OPENING and OPENED, then wrong server location in .META. is possible

Posted by Jonathan Gray <jg...@apache.org>.


> On 2010-12-16 00:14:36, Jonathan Gray wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java, line 96
> > <http://review.cloudera.org/r/1298/diff/1/?file=18309#file18309line96>
> >
> >     We were not previously but we should probably log this condition
> 
> stack wrote:
>     We do in the method?

yeah you're right, thought that was only on the exception.  nvm.


> On 2010-12-16 00:14:36, Jonathan Gray wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java, line 173
> > <http://review.cloudera.org/r/1298/diff/1/?file=18309#file18309line173>
> >
> >     This is a busy wait loop?
> >     
> >     Should we add a wait/notify on something passed to the thread and w/ a timeout of the period?
> >     
> >     And then we should probably also have some kind of max timeout.  Even if minutes, there could be weird cluster state where the RS misses META availability but someone else might handle it properly, so max timeout might be good?
> 
> stack wrote:
>     I need to add a small sleep.  I'd rather do this than wait/notify.  t.isAlive should be enough.  Regards max timeout, I should add check if server is stopped ... and for max timeout, what you think?  Ten minutes?  Then abort?

I was thinking 5 minutes.

How long you going to sleep for?  That seems like an unideal way to do this.  I would prefer wait/notify and have timeout on wait be this 1/3 period, but small sleep could work.  If really small, we're in busy loop again.  If too big, we increase how long we have to wait.  This is on critical path of every single region open.

If we go down path of threads doing work, I don't see why we don't want to use wait/notify to let the blocked thread know when it's done.


- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1298/#review2083
-----------------------------------------------------------


On 2010-12-15 16:14:27, stack wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://review.cloudera.org/r/1298/
> -----------------------------------------------------------
> 
> (Updated 2010-12-15 16:14:27)
> 
> 
> Review request for hbase and Jonathan Gray.
> 
> 
> Summary
> -------
> 
> M src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
>  Removed stale comments and TODOs.
> 
>  Added a 'version' datamenber, the znode edit version which we keep across open process.
> 
>  Refactored the setting of OPENING out into a method that is used in multiple places 
>  now rather than repeat code.  Did this in new tickleOpening method.
> 
>  Added new PostOpenDeployTasksThread which we run to do the postOpenDeployTasks.
>  While its running we update OPENING state if its running a while.
> 
> 
> This addresses bug hbase-3362.
>     http://issues.apache.org/jira/browse/hbase-3362
> 
> 
> Diffs
> -----
> 
>   trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1049707 
> 
> Diff: http://review.cloudera.org/r/1298/diff
> 
> 
> Testing
> -------
> 
> Ran it on my cluster. Seems to work as the old code did.
> 
> 
> Thanks,
> 
> stack
> 
>

Re: Review Request: hbase-3362 If .META. offline between OPENING and OPENED, then wrong server location in .META. is possible

Posted by st...@duboce.net.


> On 2010-12-16 00:14:36, Jonathan Gray wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java, line 96
> > <http://review.cloudera.org/r/1298/diff/1/?file=18309#file18309line96>
> >
> >     We were not previously but we should probably log this condition

We do in the method?


> On 2010-12-16 00:14:36, Jonathan Gray wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java, line 173
> > <http://review.cloudera.org/r/1298/diff/1/?file=18309#file18309line173>
> >
> >     This is a busy wait loop?
> >     
> >     Should we add a wait/notify on something passed to the thread and w/ a timeout of the period?
> >     
> >     And then we should probably also have some kind of max timeout.  Even if minutes, there could be weird cluster state where the RS misses META availability but someone else might handle it properly, so max timeout might be good?

I need to add a small sleep.  I'd rather do this than wait/notify.  t.isAlive should be enough.  Regards max timeout, I should add check if server is stopped ... and for max timeout, what you think?  Ten minutes?  Then abort?


> On 2010-12-16 00:14:36, Jonathan Gray wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java, line 238
> > <http://review.cloudera.org/r/1298/diff/1/?file=18309#file18309line238>
> >
> >     maybe this should be warn.  i think i'd want to see it and also logging of stack trace (i don't see logging of it elsewhere)

For sure.


- stack


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1298/#review2083
-----------------------------------------------------------


On 2010-12-15 16:14:27, stack wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://review.cloudera.org/r/1298/
> -----------------------------------------------------------
> 
> (Updated 2010-12-15 16:14:27)
> 
> 
> Review request for hbase and Jonathan Gray.
> 
> 
> Summary
> -------
> 
> M src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
>  Removed stale comments and TODOs.
> 
>  Added a 'version' datamenber, the znode edit version which we keep across open process.
> 
>  Refactored the setting of OPENING out into a method that is used in multiple places 
>  now rather than repeat code.  Did this in new tickleOpening method.
> 
>  Added new PostOpenDeployTasksThread which we run to do the postOpenDeployTasks.
>  While its running we update OPENING state if its running a while.
> 
> 
> This addresses bug hbase-3362.
>     http://issues.apache.org/jira/browse/hbase-3362
> 
> 
> Diffs
> -----
> 
>   trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1049707 
> 
> Diff: http://review.cloudera.org/r/1298/diff
> 
> 
> Testing
> -------
> 
> Ran it on my cluster. Seems to work as the old code did.
> 
> 
> Thanks,
> 
> stack
> 
>

Re: Review Request: hbase-3362 If .META. offline between OPENING and OPENED, then wrong server location in .META. is possible

Posted by Jonathan Gray <jg...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1298/#review2083
-----------------------------------------------------------

Ship it!


a few small comments.  i think the loop should change as described in my comment (busy loop w/ call to currentTimeMillis as i read it).  otherwise +1, good stuff.  we need some tickle util class soon :)


trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
<http://review.cloudera.org/r/1298/#comment6529>

    "on this server" should probably be left in comment to be clear what this is checking



trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
<http://review.cloudera.org/r/1298/#comment6530>

    We were not previously but we should probably log this condition



trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
<http://review.cloudera.org/r/1298/#comment6531>

    This is a busy wait loop?
    
    Should we add a wait/notify on something passed to the thread and w/ a timeout of the period?
    
    And then we should probably also have some kind of max timeout.  Even if minutes, there could be weird cluster state where the RS misses META availability but someone else might handle it properly, so max timeout might be good?



trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
<http://review.cloudera.org/r/1298/#comment6533>

    whitespace



trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
<http://review.cloudera.org/r/1298/#comment6532>

    maybe this should be warn.  i think i'd want to see it and also logging of stack trace (i don't see logging of it elsewhere)


- Jonathan


On 2010-12-15 16:14:27, stack wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://review.cloudera.org/r/1298/
> -----------------------------------------------------------
> 
> (Updated 2010-12-15 16:14:27)
> 
> 
> Review request for hbase and Jonathan Gray.
> 
> 
> Summary
> -------
> 
> M src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
>  Removed stale comments and TODOs.
> 
>  Added a 'version' datamenber, the znode edit version which we keep across open process.
> 
>  Refactored the setting of OPENING out into a method that is used in multiple places 
>  now rather than repeat code.  Did this in new tickleOpening method.
> 
>  Added new PostOpenDeployTasksThread which we run to do the postOpenDeployTasks.
>  While its running we update OPENING state if its running a while.
> 
> 
> This addresses bug hbase-3362.
>     http://issues.apache.org/jira/browse/hbase-3362
> 
> 
> Diffs
> -----
> 
>   trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1049707 
> 
> Diff: http://review.cloudera.org/r/1298/diff
> 
> 
> Testing
> -------
> 
> Ran it on my cluster. Seems to work as the old code did.
> 
> 
> Thanks,
> 
> stack
> 
>