
Posted to issues@hbase.apache.org by "Lars George (JIRA)" <ji...@apache.org> on 2011/05/16 15:17:47 UTC

[jira] [Created] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

Scheduled tasks in distributed log splitting not in sync with ZK
----------------------------------------------------------------

Key: HBASE-3890
URL: https://issues.apache.org/jira/browse/HBASE-3890
Project: HBase
Issue Type: Bug
Components: regionserver
Affects Versions: 0.92.0
Reporter: Lars George
Fix For: 0.92.0

This is a continuation of HBASE-3889.

Something else must be slightly off here. Although the splitlogs znode is now empty, the master is still stuck here:

{noformat}
Doing distributed log split in hdfs://localhost:8020/hbase/.logs/10.0.0.65,60020,1305406356765
- Waiting for distributed tasks to finish. scheduled=2 done=1 error=0 4380s

Master startup
- Splitting logs after master startup 4388s
{noformat}

There seems to be a mismatch between what is in ZK and what the TaskBatch holds. In my case it could be related to the task already being in ZK after many faulty restarts caused by the NPE. Maybe it was added only once (since tasks are keyed by path, and the path is unique on my machine), but the reference count was incremented twice? Now that the real task is done, the done counter has been incremented, but it will never match the scheduled count.

The code could also check whether ZK is actually depleted and, if so, treat the remaining scheduled task as bogus. This of course only treats the symptom, not the root cause of this condition.
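
To make the suspected mismatch concrete, here is a minimal, self-contained sketch of the hypothesis above. It is not HBase's actual TaskBatch or SplitLogManager code: the class BatchCountersSketch, its methods, and the path string are all hypothetical, and the set merely stands in for the task znodes in ZK. It assumes the counter is incremented on every enqueue request, even when the path is already present.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the master's batch bookkeeping; not HBase's actual TaskBatch. */
final class BatchCountersSketch {
    int scheduled;
    int done;
    int error;

    // Stand-in for the task znodes still present in ZK, keyed by log path.
    final Set<String> pendingPaths = new HashSet<>();

    /** Suspected behavior: the counter goes up on every request, even for a path already present. */
    void enqueue(String path) {
        pendingPaths.add(path); // a set keeps a single entry per path
        scheduled++;            // but the counter is incremented regardless
    }

    /** A completed task leaves the pending set and is counted once. */
    void markDone(String path) {
        if (pendingPaths.remove(path)) {
            done++;
        }
    }

    /** Wait condition implied by the status line: every scheduled task is accounted for. */
    boolean countersMatch() {
        return done + error == scheduled;
    }

    /** Symptom-level check suggested above: nothing is left in ZK, so stop waiting. */
    boolean countersMatchOrZkDepleted() {
        return countersMatch() || pendingPaths.isEmpty();
    }

    public static void main(String[] args) {
        BatchCountersSketch batch = new BatchCountersSketch();
        String path = "example-log-path"; // hypothetical task key
        batch.enqueue(path);
        batch.enqueue(path); // same path again, e.g. left over from an earlier restart
        batch.markDone(path);
        System.out.println("scheduled=" + batch.scheduled + " done=" + batch.done
            + " error=" + batch.error);
        System.out.println("countersMatch=" + batch.countersMatch());
        System.out.println("countersMatchOrZkDepleted=" + batch.countersMatchOrZkDepleted());
    }
}
```

Under these assumptions the model ends up with scheduled=2 and done=1 while the pending set is empty, which matches the status line above. The depletion check lets the wait finish, but as noted it only hides the double count.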

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-3890:
--------------------------

    Fix Version/s:     (was: 0.92.0)
                   0.94.0

Lars agrees to punt this one.


[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

Posted by "Prakash Khemani (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034149#comment-13034149 ] 

Prakash Khemani commented on HBASE-3890:
----------------------------------------

With the bug you identified in HBASE-3889, this behavior is expected. The SplitLogManager puts up a task; a SplitLogWorker picks it up and never completes it because of the bug. The manager resubmits the task, and another worker picks it up, again never completing it. The manager resubmits at most hbase.splitlog.max.resubmit (default = 3) times, after which the task hangs.
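
The resubmit cap described above can be sketched roughly as follows. This is an illustration only, not the SplitLogManager implementation: the class ResubmitCapSketch and its methods are hypothetical, and the only detail taken from the comment is the limit of 3 resubmits.

```java
/** Hypothetical model of one split task and the manager's resubmit cap; not HBase code. */
final class ResubmitCapSketch {
    private final int maxResubmit; // plays the role of hbase.splitlog.max.resubmit
    private int resubmits;
    private boolean completed;

    ResubmitCapSketch(int maxResubmit) {
        this.maxResubmit = maxResubmit;
    }

    /** A worker failed to finish the task; returns true if the manager puts it up again. */
    boolean onWorkerFailed() {
        if (completed || resubmits >= maxResubmit) {
            return false; // cap reached (or task already done): no further resubmission
        }
        resubmits++;
        return true;
    }

    /** A worker finished the task. */
    void onWorkerCompleted() {
        completed = true;
    }

    /** The task is neither done nor eligible for another resubmission. */
    boolean isHung() {
        return !completed && resubmits >= maxResubmit;
    }

    int resubmits() {
        return resubmits;
    }

    public static void main(String[] args) {
        ResubmitCapSketch task = new ResubmitCapSketch(3);
        // Every worker that picks the task up fails, as with the HBASE-3889 bug.
        while (task.onWorkerFailed()) {
            System.out.println("resubmitted, count=" + task.resubmits());
        }
        System.out.println("hung=" + task.isHung());
    }
}
```

In this model a task that no worker can complete is resubmitted three times and then stays pending, so anything waiting on it waits indefinitely.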




[jira] [Updated] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

Posted by "Lars Hofhansl (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-3890:
---------------------------------

    Fix Version/s:     (was: 0.94.0)
                   0.96.0

Punting for 0.94 as well.
                

[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

Posted by "Lars George (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034291#comment-13034291 ] 

Lars George commented on HBASE-3890:
------------------------------------

Hi Prakash, thanks for the input! I think this is unrelated, though, as it happened after the patch and a restart. The messages should all have been the replay of the recovered logs. They do not show up because the first few make them drop out of the TaskMonitor due to the reuse.

I am not sure where or how this goes wrong, or whether it does at all. But I got those "leaked" log hints, and the UI did not show any running tasks as it should have. So something is amiss, but I still need to check what is wrong.


[jira] [Commented] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

Posted by "Lars George (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034015#comment-13034015 ] 

Lars George commented on HBASE-3890:
------------------------------------

I restarted the cluster (the pseudo-distributed local instance), and it then kicked into gear replaying the logs, picking up from where it had been stuck in limbo.


[jira] [Updated] (HBASE-3890) Scheduled tasks in distributed log splitting not in sync with ZK

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3890:
-------------------------

    Fix Version/s: 0.92.0
