Posted to dev@accumulo.apache.org by "Keith Turner (Created) (JIRA)" <ji...@apache.org> on 2012/03/06 23:15:57 UTC

[jira] [Created] (ACCUMULO-449) Failed log copy is not restarted

Failed log copy is not restarted
--------------------------------

                 Key: ACCUMULO-449
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-449
             Project: Accumulo
          Issue Type: Bug
          Components: logger, master
            Reporter: Keith Turner
            Assignee: Eric Newton
             Fix For: 1.4.0


I shut a single-node instance down uncleanly.  When I restarted it, the logger did not have enough memory to perform the log sort; it got an OOME and died.  I edited accumulo-env.sh and gave the logger process more memory, then restarted the logger process.  However, the log recovery never restarted.

The master was continually printing messages like the following.

{noformat}
06 17:07:16,609 [master.CoordinateRecoveryTask] DEBUG: Copying 65c48045-88c1-48e4-93d3-4865a9a86050 from xxx.xxx.xxx.xxx:11224 (for 1210.306000 seconds) 0.0
{noformat}

After 20m I restarted the master and then log recovery proceeded.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ACCUMULO-449) Failed log copy is not restarted

Posted by "Eric Newton (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ACCUMULO-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223861#comment-13223861 ] 

Eric Newton commented on ACCUMULO-449:
--------------------------------------

It does restart, but it takes a long time to time out (an hour?!?).  We need to use an API to get the status from the logger; using HDFS to communicate is too much of a kludge.
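
To make the idea concrete, here is a rough sketch of the kind of stall check the master could run once it can poll copy progress from the logger. This is not Accumulo code: the class name, the method names, and the notion of a polled progress value are all assumptions made up for illustration, and the timeout is whatever the caller picks.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch (not part of Accumulo): flags a log copy as stalled
 * when its reported progress has not changed for a configured timeout.
 */
final class StalledCopyWatchdog {
    private final long stallTimeoutNanos;
    private double lastProgress = -1.0; // sentinel: no sample seen yet
    private long lastChangeNanos;

    StalledCopyWatchdog(long stallTimeout, TimeUnit unit, long nowNanos) {
        this.stallTimeoutNanos = unit.toNanos(stallTimeout);
        this.lastChangeNanos = nowNanos;
    }

    /**
     * Records one progress sample taken at nowNanos.
     * Returns true if progress has been unchanged for at least the timeout,
     * meaning the caller should consider restarting the copy.
     */
    boolean update(double progress, long nowNanos) {
        if (progress != lastProgress) {
            // Progress moved (or this is the first sample): reset the clock.
            lastProgress = progress;
            lastChangeNanos = nowNanos;
            return false;
        }
        return nowNanos - lastChangeNanos >= stallTimeoutNanos;
    }
}
```

The caller would feed it a sample on every poll (using something like System.nanoTime() for the clock) and restart the copy when update returns true, rather than waiting for a long fixed timeout.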

                

[jira] [Updated] (ACCUMULO-449) Failed log copy is not restarted

Posted by "Keith Turner (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ACCUMULO-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner updated ACCUMULO-449:
----------------------------------

    Affects Version/s: 1.4.0
                       1.3.5
        Fix Version/s:     (was: 1.4.0)
                       1.4.1

The master should probably notice that the logger lost its ZooKeeper lock.
                
> Failed log copy is not restarted
> --------------------------------
>
>                 Key: ACCUMULO-449
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-449
>             Project: Accumulo
>          Issue Type: Bug
>          Components: logger, master
>    Affects Versions: 1.3.5, 1.4.0
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>              Labels: 14_qa_bug
>             Fix For: 1.4.1

[jira] [Commented] (ACCUMULO-449) Failed log copy is not restarted

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ACCUMULO-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223749#comment-13223749 ] 

Keith Turner commented on ACCUMULO-449:
---------------------------------------

Originally the logger had a max of 128m of heap.  I think I copied accumulo-env.sh.512MBBstandalone-native-example.  I upped the logger heap to 512m max.
                