Posted to issues@hbase.apache.org by "Devaraj Das (JIRA)" <ji...@apache.org> on 2012/08/24 03:35:41 UTC

[jira] [Created] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Devaraj Das created HBASE-6649:
----------------------------------

             Summary: [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails
                 Key: HBASE-6649
                 URL: https://issues.apache.org/jira/browse/HBASE-6649
             Project: HBase
          Issue Type: Bug
            Reporter: Devaraj Das


Have seen it fail twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7

A brief look at the logs hints at a pattern: in both failed test runs, a region server (RS) crashed while the test was running.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455143#comment-13455143 ] 

Lars Hofhansl commented on HBASE-6649:
--------------------------------------

Just failed again: https://builds.apache.org/job/PreCommit-HBASE-Build/2852//testReport/
                

[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443651#comment-13443651 ] 

Ted Yu commented on HBASE-6649:
-------------------------------

Failed tests:   queueFailover(org.apache.hadoop.hbase.replication.TestReplication): Waited too much time for queueFailover replication. Waited 40364ms.
                

[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448514#comment-13448514 ] 

Devaraj Das commented on HBASE-6649:
------------------------------------

Yeah, although I should submit a patch for trunk as well.
                

[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459923#comment-13459923 ] 

Hudson commented on HBASE-6649:
-------------------------------

Integrated in HBase-0.94 #476 (See [https://builds.apache.org/job/HBase-0.94/476/])
    HBASE-6847  HBASE-6649 broke replication (Devaraj Das via JD) (Revision 1388160)

     Result = FAILURE
jdcryans : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>            Priority: Blocker
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling-1.patch, 6649-fix-io-exception-handling-1-trunk.patch, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-6649:
-------------------------------

    Attachment:     (was: 6649-fix-io-exception-handling.patch)
    

[jira] [Resolved] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-6649.
---------------------------------------

    Resolution: Fixed

Re-closing; I opened HBASE-6847.
                

[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455031#comment-13455031 ] 

Hudson commented on HBASE-6649:
-------------------------------

Integrated in HBase-0.94-security #52 (See [https://builds.apache.org/job/HBase-0.94-security/52/])
    HBASE-6649 [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1] (Revision 1381289)

     Result = SUCCESS
stack : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

                

[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452621#comment-13452621 ] 

Devaraj Das commented on HBASE-6649:
------------------------------------

The problem happened with a recovered log file (another RS was trying to replicate the files of a previously crashed RS).

The problem is that the method (readAllEntriesToReplicateOrNextFile) reads some rows but then loses them when an exception is eventually thrown. Look for the lines containing the string {noformat}vesta.apache.org%2C57779%2C1345217521341.1345217601487{noformat} in the file http://bit.ly/RDdmPg. You will see a bunch of lines like:
{noformat}
java.io.EOFException: hdfs://localhost:60044/user/hudson/hbase/.oldlogs/vesta.apache.org%2C57779%2C1345217521341.1345217601487, entryStart=40929, pos=40960, end=40960, edit=3
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.addFileInfoToException(SequenceFileLogReader.java:252)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:208)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:427)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:306)
{noformat}

Unless I have missed something, the problem seems to be caused by the second call to reader.next in readAllEntriesToReplicateOrNextFile failing (please let me know if you need more details).
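
To illustrate the failure mode described above, here is a minimal, self-contained Java sketch. It is not the HBase code and not necessarily what the attached patches do: EntryReader, truncatedAfter, readAllLossy and readAllKeepingPartial are hypothetical names, and log entries are modelled as plain strings. It only shows how an EOFException from a later next() call can discard entries that were already read, and one possible way to keep them.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class ReplicationReadSketch {

    /** Hypothetical stand-in for a log reader; NOT the HBase HLog.Reader API. */
    interface EntryReader {
        /** Returns the next entry, or null at a clean end of file. */
        String next() throws IOException;
    }

    /** Serves the given entries, then throws EOFException as if the file tail were truncated. */
    static EntryReader truncatedAfter(List<String> entries) {
        Iterator<String> it = entries.iterator();
        return () -> {
            if (it.hasNext()) {
                return it.next();
            }
            throw new EOFException("simulated partial entry at end of file");
        };
    }

    /** Lossy variant: the exception escapes, so the caller never sees the entries read so far. */
    static List<String> readAllLossy(EntryReader reader) throws IOException {
        List<String> batch = new ArrayList<>();
        String entry;
        while ((entry = reader.next()) != null) {
            batch.add(entry);
        }
        return batch;
    }

    /** Keeps whatever was read before the EOF so the caller can still ship it. */
    static List<String> readAllKeepingPartial(EntryReader reader) throws IOException {
        List<String> batch = new ArrayList<>();
        try {
            String entry;
            while ((entry = reader.next()) != null) {
                batch.add(entry);
            }
        } catch (EOFException eof) {
            // Assumption for this sketch: a truncated tail marks the end of readable data.
        }
        return batch;
    }
}
```

With a reader that serves two rows and then hits a truncated tail, readAllLossy propagates the exception and the two rows are never returned, while readAllKeepingPartial returns both rows.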
                

[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-6649:
-------------------------------

    Attachment: HBase-0.92 #502 test - queueFailover [Jenkins].html
                HBase-0.92 #495 test - queueFailover [Jenkins].html

Uploading the two outputs that I had saved (the links in the jira description aren't valid any more). The worrisome part for me is that in both cases the replication seems to be incomplete, although the test waited for a fair bit of time. The fact that one RS from each cluster crashes is expected in this test; the test checks that replication succeeds even in this situation.
                

[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459977#comment-13459977 ] 

Hudson commented on HBASE-6649:
-------------------------------

Integrated in HBase-0.92 #583 (See [https://builds.apache.org/job/HBase-0.92/583/])
    HBASE-6847  HBASE-6649 broke replication (Devaraj Das via JD) (Revision 1388159)
Fixing the CHANGES.txt after 0.92.2's release and adding HBASE-6649 (Revision 1388157)

     Result = SUCCESS
jdcryans : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

jdcryans : 
Files : 
* /hbase/branches/0.92/CHANGES.txt

                

[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459733#comment-13459733 ] 

Lars Hofhansl commented on HBASE-6649:
--------------------------------------

+1 on last patch.
                

[jira] [Comment Edited] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443515#comment-13443515 ] 

Ted Yu edited comment on HBASE-6649 at 8/29/12 8:43 AM:
--------------------------------------------------------

From https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92-security/118/testReport/org.apache.hadoop.hbase.replication/TestReplication/queueFailover/:
{code}
2012-08-28 17:29:54,404 DEBUG [main-EventThread] master.AssignmentManager(2911): based on AM, current region=.META.,,1.1028785192 is on server=juno.apache.org,43891,1346174923071 server being checked: juno.apache.org,55977,1346174923023
2012-08-28 17:29:54,405 DEBUG [RegionServer:1;juno.apache.org,43891,1346174923071-EventThread] zookeeper.ZooKeeperWatcher(266): regionserver:43891-0x1396e4723930005 Received ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, path=/1/rs
2012-08-28 17:29:54,406 DEBUG [main-EventThread] master.ServerManager(394): Added=juno.apache.org,55977,1346174923023 to dead servers, submitted shutdown handler to be executed, root=false, meta=false
2012-08-28 17:29:54,406 DEBUG [main-EventThread] zookeeper.ZooKeeperWatcher(266): master:55418-0x1396e4723930003 Received ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, path=/1/rs
2012-08-28 17:29:54,406 DEBUG [RegionServer:1;juno.apache.org,43891,1346174923071-EventThread] zookeeper.ZKUtil(229): regionserver:43891-0x1396e4723930005 Set watcher on existing znode /1/rs/juno.apache.org,43891,1346174923071
2012-08-28 17:29:54,407 INFO  [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] handler.ServerShutdownHandler(175): Splitting logs for juno.apache.org,55977,1346174923023
...
2012-08-28 17:29:54,407 DEBUG [main-EventThread] zookeeper.ZKUtil(229): master:55418-0x1396e4723930003 Set watcher on existing znode /1/rs/juno.apache.org,43891,1346174923071
2012-08-28 17:29:54,410 DEBUG [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.MasterFileSystem(267): Renamed region directory: hdfs://localhost:59869/user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023-splitting
2012-08-28 17:29:54,410 INFO  [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.SplitLogManager(894): dead splitlog worker juno.apache.org,55977,1346174923023
2012-08-28 17:29:54,413 DEBUG [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.SplitLogManager(246): Scheduling batch of logs to split
2012-08-28 17:29:54,414 INFO  [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.SplitLogManager(248): started splitting logs in [hdfs://localhost:59869/user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023-splitting]
...
2012-08-28 17:29:55,000 ERROR [IPC Server handler 7 on 59869] security.UserGroupInformation(1124): PriviledgedActionException as:jenkins.hfs.0 cause:java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
2012-08-28 17:29:55,004 FATAL [RegionServer:0;juno.apache.org,55977,1346174923023.logRoller] regionserver.HRegionServer(1537): ABORTING region server juno.apache.org,55977,1346174923023: IOE in log roller
java.io.IOException: cannot get log writer
	at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:715)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:662)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:594)
	at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: java.io.FileNotFoundException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:106)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:712)
	... 4 more
Caused by: java.io.FileNotFoundException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3251)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
	at org.apache.hadoop.hdfs.DistributedFileSystem.createNonRecursive(DistributedFileSystem.java:198)
	at org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:601)
	at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:442)
	at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:87)
	... 5 more
Caused by: org.apache.hadoop.ipc.RemoteException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1167)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1241)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1188)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

	at org.apache.hadoop.ipc.Client.call(Client.java:1070)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
	at $Proxy8.create(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at $Proxy8.create(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3248)
	... 13 more
{code}
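It is clear that log splitting (the splitLog() call on the master) raced with the log roller (on the region server): the master renamed the region server's log directory with a -splitting suffix at 17:29:54,410, and the roller's attempt to create a new writer in the original directory failed about half a second later.
In run() of the log roller: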
{code}
      } catch (IOException ex) {
{code}
One option is to distinguish FileNotFoundException from other IOExceptions and exit.
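
A rough sketch of that option follows, as self-contained Java rather than a patch against LogRoller. Roller, Outcome, rollOnce and causedByMissingFile are hypothetical names introduced only for illustration. Because the trace above shows the FileNotFoundException wrapped inside an IOException ("cannot get log writer"), the sketch walks the cause chain instead of relying on a separate catch clause, which would not match the outer exception.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

final class LogRollerSketch {

    /** Hypothetical outcome labels, used only to make the branches observable. */
    enum Outcome { ROLLED, EXIT_QUIETLY, ABORT }

    /** Hypothetical stand-in for the roll step; NOT the HBase HLog API. */
    interface Roller {
        void rollWriter() throws IOException;
    }

    /** True if the throwable or any of its causes is a FileNotFoundException. */
    static boolean causedByMissingFile(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof FileNotFoundException) {
                return true;
            }
        }
        return false;
    }

    /** Runs one roll attempt and classifies the result. */
    static Outcome rollOnce(Roller roller) {
        try {
            roller.rollWriter();
            return Outcome.ROLLED;
        } catch (IOException ex) {
            if (causedByMissingFile(ex)) {
                // Assumption: the log directory was moved away for splitting,
                // so the roller stops instead of treating this as a generic failure.
                return Outcome.EXIT_QUIETLY;
            }
            return Outcome.ABORT;
        }
    }
}
```

Whether exiting is the right reaction in the real LogRoller is a design question for the issue; the sketch only shows how the two kinds of IOException could be told apart.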
                
      was (Author: yuzhihong@gmail.com):
    From https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92-security/lastCompletedBuild/testReport/org.apache.hadoop.hbase.replication/TestReplication/queueFailover/:
{code}
2012-08-28 17:29:54,404 DEBUG [main-EventThread] master.AssignmentManager(2911): based on AM, current region=.META.,,1.1028785192 is on server=juno.apache.org,43891,1346174923071 server being checked: juno.apache.org,55977,1346174923023
2012-08-28 17:29:54,405 DEBUG [RegionServer:1;juno.apache.org,43891,1346174923071-EventThread] zookeeper.ZooKeeperWatcher(266): regionserver:43891-0x1396e4723930005 Received ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, path=/1/rs
2012-08-28 17:29:54,406 DEBUG [main-EventThread] master.ServerManager(394): Added=juno.apache.org,55977,1346174923023 to dead servers, submitted shutdown handler to be executed, root=false, meta=false
2012-08-28 17:29:54,406 DEBUG [main-EventThread] zookeeper.ZooKeeperWatcher(266): master:55418-0x1396e4723930003 Received ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, path=/1/rs
2012-08-28 17:29:54,406 DEBUG [RegionServer:1;juno.apache.org,43891,1346174923071-EventThread] zookeeper.ZKUtil(229): regionserver:43891-0x1396e4723930005 Set watcher on existing znode /1/rs/juno.apache.org,43891,1346174923071
2012-08-28 17:29:54,407 INFO  [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] handler.ServerShutdownHandler(175): Splitting logs for juno.apache.org,55977,1346174923023
...
2012-08-28 17:29:54,407 DEBUG [main-EventThread] zookeeper.ZKUtil(229): master:55418-0x1396e4723930003 Set watcher on existing znode /1/rs/juno.apache.org,43891,1346174923071
2012-08-28 17:29:54,410 DEBUG [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.MasterFileSystem(267): Renamed region directory: hdfs://localhost:59869/user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023-splitting
2012-08-28 17:29:54,410 INFO  [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.SplitLogManager(894): dead splitlog worker juno.apache.org,55977,1346174923023
2012-08-28 17:29:54,413 DEBUG [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.SplitLogManager(246): Scheduling batch of logs to split
2012-08-28 17:29:54,414 INFO  [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.SplitLogManager(248): started splitting logs in [hdfs://localhost:59869/user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023-splitting]
...
2012-08-28 17:29:55,000 ERROR [IPC Server handler 7 on 59869] security.UserGroupInformation(1124): PriviledgedActionException as:jenkins.hfs.0 cause:java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
2012-08-28 17:29:55,004 FATAL [RegionServer:0;juno.apache.org,55977,1346174923023.logRoller] regionserver.HRegionServer(1537): ABORTING region server juno.apache.org,55977,1346174923023: IOE in log roller
java.io.IOException: cannot get log writer
	at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:715)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:662)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:594)
	at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: java.io.FileNotFoundException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:106)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:712)
	... 4 more
Caused by: java.io.FileNotFoundException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3251)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
	at org.apache.hadoop.hdfs.DistributedFileSystem.createNonRecursive(DistributedFileSystem.java:198)
	at org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:601)
	at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:442)
	at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:87)
	... 5 more
Caused by: org.apache.hadoop.ipc.RemoteException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1167)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1241)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1188)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

	at org.apache.hadoop.ipc.Client.call(Client.java:1070)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
	at $Proxy8.create(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at $Proxy8.create(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3248)
	... 13 more
{code}
It is clear that log splitting (splitLog() call on master) raced with log roller (on region server).
In run() of log roller:
{code}
      } catch (IOException ex) {
{code}
One option is to distinguish FileNotFoundException from other IOE's and exit.
                  
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails
> ------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458839#comment-13458839 ] 

Jean-Daniel Cryans commented on HBASE-6649:
-------------------------------------------

bq. This would be a dataloss issue without the fix.
bq. I have seen dataloss issues (via the unit test) without this patch..

FWIW, if there was indeed dataloss caused by this, it would have happened when recovering logs. During normal operation that exception is retried until we're able to read the file.

bq. could you please try this patch out in your cluster.

It's not exactly a test cluster, more like prod-ish, so I'll put it on only one machine. I assume it might take the whole day to hit the condition.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443515#comment-13443515 ] 

Ted Yu commented on HBASE-6649:
-------------------------------

From https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92-security/lastCompletedBuild/testReport/org.apache.hadoop.hbase.replication/TestReplication/queueFailover/:
{code}
2012-08-28 17:29:54,404 DEBUG [main-EventThread] master.AssignmentManager(2911): based on AM, current region=.META.,,1.1028785192 is on server=juno.apache.org,43891,1346174923071 server being checked: juno.apache.org,55977,1346174923023
2012-08-28 17:29:54,405 DEBUG [RegionServer:1;juno.apache.org,43891,1346174923071-EventThread] zookeeper.ZooKeeperWatcher(266): regionserver:43891-0x1396e4723930005 Received ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, path=/1/rs
2012-08-28 17:29:54,406 DEBUG [main-EventThread] master.ServerManager(394): Added=juno.apache.org,55977,1346174923023 to dead servers, submitted shutdown handler to be executed, root=false, meta=false
2012-08-28 17:29:54,406 DEBUG [main-EventThread] zookeeper.ZooKeeperWatcher(266): master:55418-0x1396e4723930003 Received ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, path=/1/rs
2012-08-28 17:29:54,406 DEBUG [RegionServer:1;juno.apache.org,43891,1346174923071-EventThread] zookeeper.ZKUtil(229): regionserver:43891-0x1396e4723930005 Set watcher on existing znode /1/rs/juno.apache.org,43891,1346174923071
2012-08-28 17:29:54,407 INFO  [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] handler.ServerShutdownHandler(175): Splitting logs for juno.apache.org,55977,1346174923023
...
2012-08-28 17:29:54,407 DEBUG [main-EventThread] zookeeper.ZKUtil(229): master:55418-0x1396e4723930003 Set watcher on existing znode /1/rs/juno.apache.org,43891,1346174923071
2012-08-28 17:29:54,410 DEBUG [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.MasterFileSystem(267): Renamed region directory: hdfs://localhost:59869/user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023-splitting
2012-08-28 17:29:54,410 INFO  [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.SplitLogManager(894): dead splitlog worker juno.apache.org,55977,1346174923023
2012-08-28 17:29:54,413 DEBUG [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.SplitLogManager(246): Scheduling batch of logs to split
2012-08-28 17:29:54,414 INFO  [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.SplitLogManager(248): started splitting logs in [hdfs://localhost:59869/user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023-splitting]
...
2012-08-28 17:29:55,000 ERROR [IPC Server handler 7 on 59869] security.UserGroupInformation(1124): PriviledgedActionException as:jenkins.hfs.0 cause:java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
2012-08-28 17:29:55,004 FATAL [RegionServer:0;juno.apache.org,55977,1346174923023.logRoller] regionserver.HRegionServer(1537): ABORTING region server juno.apache.org,55977,1346174923023: IOE in log roller
java.io.IOException: cannot get log writer
	at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:715)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:662)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:594)
	at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: java.io.FileNotFoundException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:106)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:712)
	... 4 more
Caused by: java.io.FileNotFoundException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3251)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
	at org.apache.hadoop.hdfs.DistributedFileSystem.createNonRecursive(DistributedFileSystem.java:198)
	at org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:601)
	at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:442)
	at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:87)
	... 5 more
Caused by: org.apache.hadoop.ipc.RemoteException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1167)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1241)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1188)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

	at org.apache.hadoop.ipc.Client.call(Client.java:1070)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
	at $Proxy8.create(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at $Proxy8.create(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3248)
	... 13 more
{code}
It is clear that log splitting (splitLog() call on master) raced with log roller (on region server).
In run() of log roller:
{code}
      } catch (IOException ex) {
{code}
One option is to distinguish FileNotFoundException from other IOE's and exit.
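
A rough sketch of that option follows. It reads "exit" as "stop the roller quietly instead of aborting the region server", which is an interpretation rather than something the trace settles. The names LogRollFailureSketch, Roll, Action and attemptRoll are made up for illustration and are not HBase code. Since the trace shows the FileNotFoundException nested as a cause inside plain IOExceptions, the sketch walks the cause chain instead of relying on a separate catch clause.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

/** Illustrative only: none of these names come from the HBase code base. */
class LogRollFailureSketch {

    enum Action { CONTINUE, EXIT_ROLLER, ABORT_SERVER }

    /** Hypothetical stand-in for the roll call (HLog.rollWriter in the trace). */
    interface Roll {
        void run() throws IOException;
    }

    /** Walks a bounded number of causes looking for a FileNotFoundException. */
    static boolean hasMissingFileCause(Throwable t) {
        int depth = 0;
        for (Throwable c = t; c != null && depth < 16; c = c.getCause(), depth++) {
            if (c instanceof FileNotFoundException) {
                return true;
            }
        }
        return false;
    }

    /** Maps the result of one roll attempt to what the roller thread would do next. */
    static Action attemptRoll(Roll roll) {
        try {
            roll.run();
            return Action.CONTINUE;
        } catch (IOException ex) {
            // A missing log directory suggests the master has already started splitting.
            return hasMissingFileCause(ex) ? Action.EXIT_ROLLER : Action.ABORT_SERVER;
        }
    }

    public static void main(String[] args) {
        IOException nested = new IOException("cannot get log writer",
            new IOException(new FileNotFoundException("Parent directory doesn't exist")));
        System.out.println(attemptRoll(() -> { throw nested; }));
        System.out.println(attemptRoll(() -> { throw new IOException("disk full"); }));
        System.out.println(attemptRoll(() -> { }));
    }
}
```

Whether exiting quietly is actually safe depends on what else the region server does once its log directory has been renamed; the sketch only shows how the two failure kinds could be told apart.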
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails
> ------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459917#comment-13459917 ] 

Hudson commented on HBASE-6649:
-------------------------------

Integrated in HBase-TRUNK #3360 (See [https://builds.apache.org/job/HBase-TRUNK/3360/])
    HBASE-6847  HBASE-6649 broke replication (Devaraj Das via JD) (Revision 1388161)

     Result = FAILURE
jdcryans : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>            Priority: Blocker
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling-1.patch, 6649-fix-io-exception-handling-1-trunk.patch, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454189#comment-13454189 ] 

Jean-Daniel Cryans commented on HBASE-6649:
-------------------------------------------

What I meant is that the reader gets this 10 times:

{noformat}
java.io.EOFException: hdfs://localhost:60044/user/hudson/hbase/.oldlogs/vesta.apache.org%2C57779%2C1345217521341.1345217601487, entryStart=40929, pos=40960, end=40960, edit=3
{noformat}

So if I'm reading this correctly it's able to read the file and got 3 edits but gets an EOF. Is something half written? Then it gives up on the file:

{noformat}
2012-08-17 15:33:50,099 INFO  [ReplicationExecutor-0.replicationSource,2-vesta.apache.org,57779,1345217521341] regionserver.ReplicationSourceManager(352): Done with the recovered queue 2-vesta.apache.org,57779,1345217521341
{noformat}

And there's data loss.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446610#comment-13446610 ] 

stack commented on HBASE-6649:
------------------------------

I think we disable it in 0.92 and perhaps in 0.94 for 0.94.2 (unless someone fixes it in the meantime).  We leave this issue as critical on trunk.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails
> ------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>         Attachments: 6649-2.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458274#comment-13458274 ] 

Devaraj Das commented on HBASE-6649:
------------------------------------

Okay a plausible explanation - 
1. ReplicationSource.readAllEntriesToReplicateOrNextFile throws an IOException (which causes the log "Break on IOE:" to print), but ignores the exception.
2. When readAllEntriesToReplicateOrNextFile returns, the reader's file-pointer position is queried and 'this.position' is set to that (the reader's file-pointer is possibly pointing to gibberish)
3. Eventually, readAllEntriesToReplicateOrNextFile gets called again, and this time this.reader.next inside it throws an IndexOutOfBoundsException because it read gibberish (looking at the code of DataInputStream.java, one of the cases where IndexOutOfBoundsException is thrown is when the length passed to readFully is less than 0).
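
The readFully behaviour mentioned in step 3 can be reproduced in isolation. The snippet below is a minimal sketch, assuming a simplified length-prefixed record layout; this is not the actual log file format, and NegativeLengthSketch, readRecord and outcome are hypothetical names. It only shows that a length decoded from garbage bytes can be negative and that readFully then fails with IndexOutOfBoundsException, while a truncated record fails with EOFException instead.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

class NegativeLengthSketch {

    /** Reads one length-prefixed record; a simplified stand-in for a log entry. */
    static byte[] readRecord(DataInputStream in) throws IOException {
        int len = in.readInt();                  // garbage bytes can decode to a negative int
        byte[] buf = new byte[Math.max(len, 0)];
        in.readFully(buf, 0, len);               // a negative len is rejected here
        return buf;
    }

    /** Returns "ok", or the simple class name of whatever exception the read raised. */
    static String outcome(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            readRecord(in);
            return "ok";
        } catch (IOException | RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // A well-formed record: length 1, payload 42.
        System.out.println(outcome(new byte[] {0, 0, 0, 1, 42}));
        // Four 0xFF bytes decode to length -1.
        System.out.println(outcome(new byte[] {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF}));
        // Length 5 but only one payload byte present.
        System.out.println(outcome(new byte[] {0, 0, 0, 5, 1}));
    }
}
```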

The fix I can think of is to reset the reader's 'position' to the last valid position (upon return from the method readAllEntriesToReplicateOrNextFile).

Thoughts on the above? Does the analysis make sense?
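
To make the proposed fix concrete, here is a minimal sketch of "only ever keep a position that points at a record boundary". It assumes a toy length-prefixed format held in a ByteBuffer instead of a real log reader; LastValidPositionSketch, readBatch and sampleLog are hypothetical names and none of this is taken from ReplicationSource.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class LastValidPositionSketch {

    /**
     * Reads length-prefixed records from startPos and returns the offset just past
     * the last complete record. The buffer is left positioned at that offset.
     */
    static int readBatch(ByteBuffer log, int startPos, List<byte[]> out) {
        log.position(startPos);
        int lastGood = startPos;
        try {
            while (log.hasRemaining()) {
                int len = log.getInt();
                if (len < 0 || len > log.remaining()) {
                    break;                     // corrupt or half-written record
                }
                byte[] payload = new byte[len];
                log.get(payload);
                out.add(payload);
                lastGood = log.position();     // advance only after a full record
            }
        } catch (BufferUnderflowException e) {
            // Fewer than four bytes were left for the length prefix.
        }
        // Without this rewind the buffer position could sit in the middle of a record.
        log.position(lastGood);
        return lastGood;
    }

    /** Builds a toy log: two full records, then a third that is complete or cut short. */
    static ByteBuffer sampleLog(boolean thirdRecordComplete) {
        ByteBuffer log = ByteBuffer.allocate(64);
        log.putInt(2).put(new byte[] {1, 2});
        log.putInt(3).put(new byte[] {3, 4, 5});
        log.putInt(5).put(thirdRecordComplete ? new byte[] {6, 7, 8, 9, 10} : new byte[] {6, 7});
        log.flip();
        return log;
    }

    public static void main(String[] args) {
        List<byte[]> edits = new ArrayList<>();
        int pos = readBatch(sampleLog(false), 0, edits);
        System.out.println(edits.size() + " edits read, resume at " + pos);

        // Once the writer has finished the third record, resuming from pos picks it up.
        List<byte[]> more = new ArrayList<>();
        int next = readBatch(sampleLog(true), pos, more);
        System.out.println(more.size() + " more edit(s), resume at " + next);
    }
}
```

The design point is that the saved offset is updated only after a whole record has been consumed, so a failure part-way through a record cannot leave the next read starting on gibberish.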
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-6649:
-------------------------------

    Attachment: 6649-fix-io-exception-handling.patch

Attaching a more complete fix (for 0.94)

[~jdcryans], could you please try this patch out in your cluster.

The more I think about it, the more I am beginning to believe that setting the position so that it always points to a valid location is the fix here...

[~lhofhansl] I have seen dataloss issues (via the unit test) without this patch..
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling.patch, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448924#comment-13448924 ] 

Lars Hofhansl commented on HBASE-6649:
--------------------------------------

Patch looks good to me.
(As Ted points out, there might be other issues as well)
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-1.patch, 6649-2.txt, 6649-trunk.patch, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452581#comment-13452581 ] 

Jean-Daniel Cryans commented on HBASE-6649:
-------------------------------------------

bq. This is because of multiple calls to reader.next within readAllEntriesToReplicateOrNextFile. If the second call (within the while loop) throws an exception (like EOFException), it basically destroys the work done up until then. Therefore, some rows would never be replicated.

The position in the log is updated in ZK only once the edits are replicated; hence, even if you fail on the second or hundredth edit, the next region server that will be in charge of that log will pick up where the previous RS was (even if that means re-reading some edits).
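
The ordering described here (ship first, persist the position afterwards) can be sketched as follows. This is a toy model under stated assumptions: an in-memory list stands in for the log, an int stands in for the position kept in ZK, and CheckpointAfterShipSketch, Sink and replicateOnce are hypothetical names, not HBase APIs.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class CheckpointAfterShipSketch {

    /** Hypothetical stand-in for shipping a batch of edits to the peer cluster. */
    interface Sink {
        void ship(List<String> batch) throws IOException;
    }

    /**
     * Ships up to batchSize edits starting at savedPosition and returns the position
     * that should be persisted afterwards (savedPosition models the offset kept in ZK).
     */
    static int replicateOnce(List<String> log, int savedPosition, int batchSize, Sink sink) {
        int end = Math.min(log.size(), savedPosition + batchSize);
        List<String> batch = log.subList(savedPosition, end);
        try {
            sink.ship(batch);
            return end;               // persist only after the edits were shipped
        } catch (IOException e) {
            return savedPosition;     // nothing persisted; the next owner re-reads from here
        }
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("e1", "e2", "e3", "e4");
        int pos = replicateOnce(log, 0, 3, batch -> { throw new IOException("peer down"); });
        System.out.println("after failure: " + pos);
        pos = replicateOnce(log, pos, 3, batch -> { });
        System.out.println("after success: " + pos);
    }
}
```

Under this ordering a failure can cause edits to be read or shipped more than once, but the persisted position never moves past edits that were not shipped.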
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458404#comment-13458404 ] 

Lars Hofhansl commented on HBASE-6649:
--------------------------------------

I say we revert from 0.94.2 and retry in 0.94.3.

Although from DD's comment:
bq. If the second call (within the while loop) throws an exception (like EOFException), it basically destroys the work done up until then. Therefore, some rows would never be replicated.

This would be a dataloss issue without the fix.

I find that a bit confusing. Since J-D saw the ignored exception in the test cluster eventually on all machines, it seems there was data lost in all versions before 0.94.2? That seems very unlikely.

                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Comment Edited] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458131#comment-13458131 ] 

Lars Hofhansl edited comment on HBASE-6649 at 9/19/12 7:38 AM:
---------------------------------------------------------------

You mean fix or rollback (the change)?
                
      was (Author: lhofhansl):
    You fix or rollback (the change)?
                  
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-6649:
-------------------------------

    Attachment: 6649-fix-io-exception-handling-1.patch

Attaching a patch with the 'position' fix.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling-1.patch, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448988#comment-13448988 ] 

stack commented on HBASE-6649:
------------------------------

J-D is on vacation.  Let me commit this.  I will add the log message Ted suggests, though my sense is that it's overkill; let's see.  Would suggest a new issue for the other 'parts', DD.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-1.patch, 6649-2.txt, 6649-trunk.patch, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448117#comment-13448117 ] 

Hadoop QA commented on HBASE-6649:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12543752/6649-1.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2779//console

This message is automatically generated.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: 6649-1.patch, 6649-2.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448942#comment-13448942 ] 

Ted Yu commented on HBASE-6649:
-------------------------------

@J-D:
What do you think ?

nit:
{code}
+      } catch (IOException ie) {
+        break;
{code}
A log statement is desirable before break.
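
For illustration only, a minimal self-contained sketch of the pattern under discussion: a read loop that logs and then breaks on an IOException, so the entries read before the failure are kept rather than discarded. The names EntryReader, BatchReader and readBatch are hypothetical stand-ins and are not the HBase API or the contents of the attached patch.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical stand-in for a log reader; not the HBase API. */
interface EntryReader {
    /** Returns the next entry, or null at a clean end of file. */
    String next() throws IOException;
}

final class BatchReader {
    private static final Logger LOG = Logger.getLogger(BatchReader.class.getName());

    /**
     * Reads up to maxEntries entries. An IOException raised mid-batch ends
     * the batch early but keeps the entries collected so far.
     */
    static List<String> readBatch(EntryReader reader, int maxEntries) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < maxEntries) {
            String entry;
            try {
                entry = reader.next();
            } catch (IOException ioe) {
                // Log before breaking so the early exit is visible in the logs.
                LOG.fine("Break on IOE: " + ioe + ", entries kept=" + batch.size());
                break;
            }
            if (entry == null) {
                break; // clean end of file
            }
            batch.add(entry);
        }
        return batch;
    }
}
```

With a reader that returns two entries and then throws, readBatch returns those two entries instead of propagating the exception and losing them.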
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-1.patch, 6649-2.txt, 6649-trunk.patch, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448914#comment-13448914 ] 

Ted Yu commented on HBASE-6649:
-------------------------------

target/surefire-reports/org.apache.hadoop.hbase.replication.TestReplication.txt was 0 length.
There was no JVM left from TestReplication by the time I got back to the computer.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: 6649-1.patch, 6649-2.txt, 6649-trunk.patch, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-6649:
--------------------------

    Attachment: 6649-2.txt

Without patch, TestReplication#queueFailover failed on 4th iteration.

With patch v2, 6 iterations passed.

Running 100 more iterations.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails
> ------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>         Attachments: 6649-2.txt
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469983#comment-13469983 ] 

Hudson commented on HBASE-6649:
-------------------------------

Integrated in HBase-0.92-security #143 (See [https://builds.apache.org/job/HBase-0.92-security/143/])
    HBASE-6847  HBASE-6649 broke replication (Devaraj Das via JD) (Revision 1388159)
Fixing the CHANGES.txt after 0.92.2's release and adding HBASE-6649 (Revision 1388157)
HBASE-6649 [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1] (Revision 1381291)

     Result = FAILURE
jdcryans : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

jdcryans : 
Files : 
* /hbase/branches/0.92/CHANGES.txt

stack : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>            Priority: Blocker
>             Fix For: 0.92.3, 0.94.2, 0.96.0
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling-1.patch, 6649-fix-io-exception-handling-1-trunk.patch, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469872#comment-13469872 ] 

Hudson commented on HBASE-6649:
-------------------------------

Integrated in HBase-0.94-security-on-Hadoop-23 #8 (See [https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/8/])
    HBASE-6847  HBASE-6649 broke replication (Devaraj Das via JD) (Revision 1388160)
HBASE-6649 [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1] (Revision 1381289)

     Result = FAILURE
jdcryans : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

stack : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>            Priority: Blocker
>             Fix For: 0.92.3, 0.94.2, 0.96.0
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling-1.patch, 6649-fix-io-exception-handling-1-trunk.patch, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459137#comment-13459137 ] 

Jean-Daniel Cryans commented on HBASE-6649:
-------------------------------------------

The server that has the patch did a "Break on IOE" twice, and it seems to work:

{noformat}
2012-09-19 21:26:50,104 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication va1r6s44%2C10304%2C1348088378534.1348089931722 at 21992487
2012-09-19 21:26:50,110 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Break on IOE: hdfs://va1r5s41:10101/va1-backup/.logs/va1r6s44,10304,1348088378534/va1r6s44%2C10304%2C1348088378534.1348089931722, entryStart=21993911, pos=22058496, end=22058496, edit=5
2012-09-19 21:26:50,110 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:783007 and seenEntries:5 and size: 64585
2012-09-19 21:26:50,110 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating 5
2012-09-19 21:26:50,119 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #va1r6s44%2C10304%2C1348088378534.1348089931722 for position 21993911 in hdfs://va1r5s41:10101/va1-backup/.logs/va1r6s44,10304,1348088378534/va1r6s44%2C10304%2C1348088378534.1348089931722
2012-09-19 21:26:50,129 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Removing 0 logs in the list: []
2012-09-19 21:26:50,129 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicated in total: 145502
2012-09-19 21:26:50,129 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication va1r6s44%2C10304%2C1348088378534.1348089931722 at 21993911
{noformat}

One thing I saw that this patch breaks is the size in "currentNbOperations:783007 and seenEntries:5 and size: 64585", because it relies on this.position being the position at the beginning. I often see that number at 0 while there are edits to replicate. It's minor, since in HBASE-6804 I'm removing that log message altogether, but within the context of this jira we may want to either remove the size or keep track of the position at the beginning of the loop.
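
As a rough sketch of the second option (remembering the position at the beginning of the loop), assuming a simplified model in which the position field advances as each entry is read: the class PositionTracker and its methods are hypothetical and are not taken from ReplicationSource. For reference, the logged size of 64585 equals pos minus entryStart in the excerpt above (22058496 - 21993911), which is consistent with the size being measured from an already-advanced position.

```java
/** Hypothetical model of a reader whose position advances inside the read loop. */
final class PositionTracker {
    // Offset just past the last fully read entry; updated inside the loop.
    private long position;

    PositionTracker(long start) {
        this.position = start;
    }

    /**
     * Consumes entries ending at the given offsets and returns the number of
     * bytes covered by this batch, measured from where the batch started.
     */
    long readBatch(long[] entryEndOffsets) {
        // Snapshot taken before the loop, so the size does not depend on
        // the field that the loop itself keeps moving forward.
        final long startPosition = this.position;
        for (long end : entryEndOffsets) {
            this.position = end;
        }
        return this.position - startPosition;
    }

    long position() {
        return position;
    }
}
```

Starting at offset 21992487 and reading entries that end at 21993000 and 21993911, readBatch reports 1424 bytes, and an empty batch reports 0 without disturbing the position.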
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446602#comment-13446602 ] 

stack commented on HBASE-6649:
------------------------------

Should we disable this flapping test until it's figured out?
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails
> ------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>         Attachments: 6649-2.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448907#comment-13448907 ] 

Devaraj Das commented on HBASE-6649:
------------------------------------

[~zhihyu@ebaysf.com] This patch fixes a specific problem with replication missing rows, which in my observations leads to somewhat frequent TestReplication.queueFailover failures. On trunk, do you know which test hangs? There are probably more issues to fix in the replication area, and we should have follow-up JIRAs (and this JIRA is part-1 :)).
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: 6649-1.patch, 6649-2.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-6649:
-------------------------------

    Fix Version/s: 0.92.3
         Assignee: Devaraj Das
           Status: Patch Available  (was: Open)
    
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: 6649-1.patch, 6649-2.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-6649:
--------------------------

    Attachment: 6649.txt

Reducing replication.sleep.before.failover
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails
> ------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>         Attachments: 6649.txt
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458352#comment-13458352 ] 

Lars Hofhansl commented on HBASE-6649:
--------------------------------------

Should we pull HBASE-6719 into this?
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459758#comment-13459758 ] 

Jean-Daniel Cryans commented on HBASE-6649:
-------------------------------------------

I'm going to create a new jira first (should have done that when I found that problem) and post the patches there with a small nit fixed.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>            Priority: Blocker
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling-1.patch, 6649-fix-io-exception-handling-1-trunk.patch, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446631#comment-13446631 ] 

Devaraj Das commented on HBASE-6649:
------------------------------------

So far it seems like an HDFS issue (somehow there are a couple of missing rows in the replicated data). I will post some concrete comments in a day or two.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails
> ------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>         Attachments: 6649-2.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454181#comment-13454181 ] 

Devaraj Das commented on HBASE-6649:
------------------------------------

bq. Oh I see what you mean. Very good find! I wonder what's that gibberish at the end of the file.

Thanks! Are you referring to the log file? I see the following at the end (no gibberish):

{noformat}
2012-08-17 15:35:01,161 DEBUG [RegionServer:1;vesta.apache.org,40480,1345217521368-EventThread.replicationSource,2] regionserver.ReplicationSource(474): Opening log for replication vesta.apache.org%2C40480%2C1345217521368.1345217648386 at 258
2012-08-17 15:35:01,164 DEBUG [RegionServer:1;vesta.apache.org,40480,1345217521368-EventThread.replicationSource,2] regionserver.ReplicationSource(429): currentNbOperations:13022 and seenEntries:0 and size: 0
2012-08-17 15:35:01,164 DEBUG [RegionServer:1;vesta.apache.org,40480,1345217521368-EventThread.replicationSource,2] regionserver.ReplicationSource(549): Nothing to replicate, sleeping 100 times 10
{noformat}
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458850#comment-13458850 ] 

Devaraj Das commented on HBASE-6649:
------------------------------------

Thanks, JD
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-6649:
-------------------------------

    Attachment: 6649-1.patch

After spending some time debugging (using the failure at http://bit.ly/RDdmPg as the case to debug), it seems to me that the problem is due to the way exceptions are handled in ReplicationSource.java. Basically, replication would fail for all entries involved in a particular call to ReplicationSource.readAllEntriesToReplicateOrNextFile, even if the exception was thrown only for the trailing entry (or entries). This is because there are multiple calls to reader.next within readAllEntriesToReplicateOrNextFile. If the second call (within the while loop) throws an exception (like EOFException), it basically discards the work done up until then. Therefore, some rows would never be replicated.

The patch attached here changes the exception handling so that if an exception is thrown by the second call, the method just returns (thereby allowing the present call to readAllEntriesToReplicateOrNextFile to proceed normally). The following call to readAllEntriesToReplicateOrNextFile would then actually throw the exception.

With this patch, I stopped noticing the failures similar to http://bit.ly/RDdmPg. 
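
The handling described above can be illustrated with a minimal, self-contained sketch. It is not the code from the attached patch: EntryReader, readBatch and failingAfter are hypothetical names invented for illustration, and entries are plain strings rather than WAL entries. The only behavior it tries to show is that a failure on the first read propagates, while a failure on a later read ends the batch and keeps what was already read.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PartialReadSketch {

    /** Hypothetical stand-in for the log reader; not the HBase reader API. */
    interface EntryReader {
        /** Returns the next entry, or null when nothing more is available. */
        String next() throws IOException;
    }

    /**
     * Reads up to maxEntries entries. An exception on the first read
     * propagates to the caller; an IOException on a later read ends the
     * batch, so the entries already read are kept and can be shipped.
     */
    static List<String> readBatch(EntryReader reader, int maxEntries) throws IOException {
        List<String> batch = new ArrayList<>();
        String entry = reader.next(); // first read: let exceptions propagate
        while (entry != null) {
            batch.add(entry);
            if (batch.size() >= maxEntries) {
                break;
            }
            try {
                entry = reader.next(); // later reads: keep what we have so far
            } catch (IOException ioe) {
                System.out.println("Break on IOE: " + ioe);
                break;
            }
        }
        return batch;
    }

    /** Test reader (assumption): yields the entries, then throws EOFException on every further read. */
    static EntryReader failingAfter(List<String> entries) {
        Iterator<String> it = entries.iterator();
        return () -> {
            if (it.hasNext()) {
                return it.next();
            }
            throw new EOFException("truncated tail");
        };
    }

    public static void main(String[] args) throws IOException {
        EntryReader reader = failingAfter(List.of("row1", "row2", "row3"));
        // First call: the three entries are returned despite the trailing failure.
        System.out.println(readBatch(reader, 100));
        // Second call: the same failure now hits the first read and propagates.
        try {
            readBatch(reader, 100);
        } catch (EOFException e) {
            System.out.println("rethrown on next call: " + e.getMessage());
        }
    }
}
```

In this sketch the caller still sees the exception on its next attempt, so the failure is deferred rather than hidden.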

However, I do see some other failures that I am still debugging (and that's why I renamed this issue to Part-1!)
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: 6649-1.patch, 6649-2.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458192#comment-13458192 ] 

Jean-Daniel Cryans commented on HBASE-6649:
-------------------------------------------

But now that I think about it, it may crap out when coming back to read even on a recovered file. The data will all make it to the other cluster but that source will never be fully cleaned up.

Which leads me to think that this is a bug in DFSClient. It's expecting something it's not getting.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449147#comment-13449147 ] 

Hudson commented on HBASE-6649:
-------------------------------

Integrated in HBase-0.92 #557 (See [https://builds.apache.org/job/HBase-0.92/557/])
    HBASE-6649 [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1] (Revision 1381291)

     Result = SUCCESS
stack : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-6649:
-------------------------------

    Attachment: 6649-0.92.patch
                6649-trunk.patch

Don't mind adding a few comments around the exception handling..
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Himanshu Vashishtha (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448987#comment-13448987 ] 

Himanshu Vashishtha commented on HBASE-6649:
--------------------------------------------

lgtm. 
The exception will be re-thrown in the next try, so +0 on adding a log statement before break.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-1.patch, 6649-2.txt, 6649-trunk.patch, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl updated HBASE-6649:
---------------------------------

    Fix Version/s: 0.94.2
                   0.96.0

I'd also like this in 0.94. The 0.92 patch will probably just apply cleanly. If not, I'll make one.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-1.patch, 6649-2.txt, 6649-trunk.patch, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Reopened] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl reopened HBASE-6649:
----------------------------------

    
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-6649:
-------------------------

    Priority: Blocker  (was: Major)
    
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>            Priority: Blocker
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling-1.patch, 6649-fix-io-exception-handling-1-trunk.patch, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458121#comment-13458121 ] 

Jean-Daniel Cryans commented on HBASE-6649:
-------------------------------------------

We applied this patch on a cluster that replicates, and just about all the nodes stopped replicating after some time. This is what I see in the logs:

{noformat}
2012-09-17 20:04:08,111 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication va1r3s24%2C10304%2C1347911704238.1347911706318 at 78617132
2012-09-17 20:04:08,120 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Break on IOE: hdfs://va1r5s41:10101/va1-backup/.logs/va1r3s24,10304,1347911704238/va1r3s24%2C10304%2C1347911704238.1347911706318, entryStart=78641557, pos=78771200, end=78771200, edit=84
2012-09-17 20:04:08,120 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:164529 and seenEntries:84 and size: 154068
2012-09-17 20:04:08,120 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating 84
2012-09-17 20:04:08,146 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #va1r3s24%2C10304%2C1347911704238.1347911706318 for position 78771200 in hdfs://va1r5s41:10101/va1-backup/.logs/va1r3s24,10304,1347911704238/va1r3s24%2C10304%2C1347911704238.1347911706318
2012-09-17 20:04:08,158 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Removing 0 logs in the list: []
2012-09-17 20:04:08,158 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicated in total: 93234
2012-09-17 20:04:08,158 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication va1r3s24%2C10304%2C1347911704238.1347911706318 at 78771200
2012-09-17 20:04:08,163 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unexpected exception in ReplicationSource, currentPath=hdfs://va1r5s41:10101/va1-backup/.logs/va1r3s24,10304,1347911704238/va1r3s24%2C10304%2C1347911704238.1347911706318
java.lang.IndexOutOfBoundsException
        at java.io.DataInputStream.readFully(DataInputStream.java:175)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2001)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1901)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1947)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:235)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:394)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:307)
{noformat}

The file is still in HDFS and it's about double the size we see up there, so it wasn't the end of the file. Looking at other nodes, we always get "Break on IOE" before getting the exception that kills replication. This is why I think that this patch is the issue. Somehow reading up to the end is reading too far.

We need to fix or backport.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443643#comment-13443643 ] 

Devaraj Das commented on HBASE-6649:
------------------------------------

bq. Test failed on the 4th of the 100 iterations.
What failure did you see?
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails
> ------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>         Attachments: 6649-2.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460080#comment-13460080 ] 

Hudson commented on HBASE-6649:
-------------------------------

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #184 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/184/])
    HBASE-6847  HBASE-6649 broke replication (Devaraj Das via JD) (Revision 1388161)

     Result = FAILURE
jdcryans : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>            Priority: Blocker
>             Fix For: 0.92.3, 0.94.2, 0.96.0
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling-1.patch, 6649-fix-io-exception-handling-1-trunk.patch, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446613#comment-13446613 ] 

Ted Yu commented on HBASE-6649:
-------------------------------

I saw that comment too. 
On my laptop the loading took about 20 seconds. 
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails
> ------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>         Attachments: 6649-2.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460750#comment-13460750 ] 

Hudson commented on HBASE-6649:
-------------------------------

Integrated in HBase-0.94-security #53 (See [https://builds.apache.org/job/HBase-0.94-security/53/])
    HBASE-6847  HBASE-6649 broke replication (Devaraj Das via JD) (Revision 1388160)

     Result = SUCCESS
jdcryans : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>            Priority: Blocker
>             Fix For: 0.92.3, 0.94.2, 0.96.0
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling-1.patch, 6649-fix-io-exception-handling-1-trunk.patch, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458378#comment-13458378 ] 

Jean-Daniel Cryans commented on HBASE-6649:
-------------------------------------------

bq. The patch only catches and ignores IOE (as opposed to all exceptions)

What it does do is permit reading up to the end of the file.

bq. [Not sure which hadoop version you are on, but there is no chance you are hitting HDFS-1108, right?]

We are on CDH3u3, didn't change when we applied the patch.

bq. Okay a plausible explanation -

It's plausible, but unless we really understand what that "gibberish" is at the end of the file, we can't truly make a fix. I don't know why that IOE is thrown, but normally we just silently finish reading from the file. There is some special case here.

bq. Should we pull HBASE-6719 into this?

I think they're separate issues.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458222#comment-13458222 ] 

Devaraj Das commented on HBASE-6649:
------------------------------------

Yeah, [~jdcryans] not sure how one could get an IndexOutOfBoundsException. I can't see how the patch would make it surface either .. The patch only catches and ignores IOE (as opposed to *all* exceptions).. But yeah give me another hour please. Let me dig some more.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446608#comment-13446608 ] 

Lars Hofhansl commented on HBASE-6649:
--------------------------------------

A flapping test is almost worse than a failing test. It adds to the runtime, but does not add confidence to the test run.

There are some scary comments in there as well:
{code}
    // Takes about 20 secs to run the full loading, kill around the middle
    Thread killer1 = killARegionServer(utility1, 7500, rsToKill1);
    Thread killer2 = killARegionServer(utility2, 10000, rsToKill2);
{code}

On what machine does it take 20s?
I'd say we disable it for now.
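
One way to take wall-clock timing out of a test like this is to trigger the kill from load progress instead of from a fixed sleep. The following is only a minimal sketch under stated assumptions: `ProgressTriggeredKillSketch` and `run` are hypothetical names, the "load" is a plain counter, and the "kill" is a recorded value rather than a real region server shutdown. It is not the TestReplication code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration: nothing here is an HBase API.
class ProgressTriggeredKillSketch {
    /** Loads totalRows rows and returns the row count at which the simulated kill fired. */
    static int run(int totalRows) throws InterruptedException {
        if (totalRows < 2) {
            throw new IllegalArgumentException("need at least 2 rows to have a midpoint");
        }
        CountDownLatch halfway = new CountDownLatch(1); // loader -> killer
        CountDownLatch killed = new CountDownLatch(1);  // killer -> loader
        AtomicInteger loaded = new AtomicInteger();
        AtomicInteger killedAt = new AtomicInteger(-1);

        Thread killer = new Thread(() -> {
            try {
                halfway.await();            // wait for progress, not for a timer
                killedAt.set(loaded.get()); // stand-in for stopping a region server
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                killed.countDown();         // always release the loader
            }
        });
        killer.start();

        for (int i = 0; i < totalRows; i++) {
            loaded.incrementAndGet();       // stand-in for one put
            if (i + 1 == totalRows / 2) {
                halfway.countDown();        // signal the midpoint
                killed.await();             // pause loading until the kill has happened
            }
        }
        killer.join();
        return killedAt.get();
    }
}
```

With this shape the kill lands at the midpoint of the load on any machine, at the cost of pausing the loader briefly, which a real failover test may or may not want.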
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails
> ------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>         Attachments: 6649-2.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458134#comment-13458134 ] 

Devaraj Das commented on HBASE-6649:
------------------------------------

Looking at the logs/patch more closely.. Will get back soon.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-6649:
-------------------------------

    Attachment: 6649-fix-io-exception-handling-1-trunk.patch

Same patch, for trunk.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling-1.patch, 6649-fix-io-exception-handling-1-trunk.patch, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448512#comment-13448512 ] 

stack commented on HBASE-6649:
------------------------------

This patch makes sense to me.  We replicate everything up to the exception, and the next time in we should hit the IOE again.  Want me to commit this, DD?
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: 6649-1.patch, 6649-2.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-6649:
--------------------------

    Attachment:     (was: 6649.txt)
    
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails
> ------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458185#comment-13458185 ] 

Jean-Daniel Cryans commented on HBASE-6649:
-------------------------------------------

[~devaraj] I'm still trying to figure out exactly how we get the IndexOutOfBoundsException. (I'd say the file didn't get new data and we started reading exactly at the end, and the DFSClient doesn't like that? Or is it missing something at the end?) But if it's a case of reading the tail of a recovered log, then we *could* add a check like this:

{code}
      try {
        entry = this.reader.next(entriesArray[currentNbEntries]);
      } catch (IOException ie) {
        if (queueRecovered) {
          LOG.debug("Break on IOE: " + ie.getMessage());
          break;
        } else {
          throw ie;
        }
      }
{code}
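
For readers who want to try the idea outside HBase, the check above can be approximated by the following self-contained sketch. Everything in it is an assumption for illustration: `EntryReader`, `TailReadSketch`, `readerFailingAt` and `readBatch` are hypothetical names, and the reader is an in-memory list that throws an IOException at a chosen index. It is not the ReplicationSource code or the real log reader API.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the log reader; not the real HBase reader interface.
interface EntryReader {
    /** Returns the next entry, or null at a clean end of file. */
    String next() throws IOException;
}

class TailReadSketch {
    /** Simulated reader that throws an IOException when it reaches index failAt. */
    static EntryReader readerFailingAt(List<String> entries, int failAt) {
        return new EntryReader() {
            private int pos = 0;

            @Override
            public String next() throws IOException {
                if (pos == failAt) {
                    throw new IOException("truncated record at index " + pos);
                }
                return pos < entries.size() ? entries.get(pos++) : null;
            }
        };
    }

    /** Reads entries until end of file; tolerates an IOException only for a recovered queue. */
    static List<String> readBatch(EntryReader reader, boolean queueRecovered) throws IOException {
        List<String> batch = new ArrayList<>();
        while (true) {
            String entry;
            try {
                entry = reader.next();
            } catch (IOException ie) {
                if (queueRecovered) {
                    break;      // recovered queue: keep what was read so far
                }
                throw ie;       // live queue: let the caller see the failure
            }
            if (entry == null) {
                break;          // clean end of file
            }
            batch.add(entry);
        }
        return batch;
    }
}
```

With entries `a`, `b`, `c` and a failure at index 2, `readBatch(..., true)` returns `a` and `b`, while `readBatch(..., false)` propagates the IOException.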
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-6649:
-------------------------------

    Attachment: 6649-fix-io-exception-handling.patch

This patch demonstrates what I commented on earlier. Please have a look. I could make a method that wraps the getPosition() and next() calls, but I wanted to check whether folks agree with the fix first.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446607#comment-13446607 ] 

Ted Yu commented on HBASE-6649:
-------------------------------

I think we should disable this test. 
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails
> ------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>         Attachments: 6649-2.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443631#comment-13443631 ] 

Ted Yu commented on HBASE-6649:
-------------------------------

Test failed on the 4th of the 100 iterations.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails
> ------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>         Attachments: 6649-2.txt
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454347#comment-13454347 ] 

Devaraj Das commented on HBASE-6649:
------------------------------------

This log file belongs to a crashed RS, and yes, it seems like the last record wasn't completely written to the file before the RS crashed. That should be fine, i.e., no data loss should happen: in the queueFailover test, the client would have got exceptions from the flushCommits call, it would have retried the batch of 'put's, and the corresponding records would have ended up in another RS.
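
To illustrate what a half-written last record looks like to a reader, here is a minimal sketch that uses a simplified length-prefixed format. This is an assumption made for illustration and is not the actual HLog file format; `TruncatedLogSketch`, `write` and `readComplete` are hypothetical names. Only standard `java.io` calls are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: a toy log of [int length][payload bytes] records.
class TruncatedLogSketch {
    /** Serializes each record as a 4-byte length followed by its UTF-8 payload. */
    static byte[] write(List<String> records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (String r : records) {
            byte[] payload = r.getBytes(StandardCharsets.UTF_8);
            out.writeInt(payload.length);
            out.write(payload);
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Returns every fully written record and stops at a partial trailing record. */
    static List<String> readComplete(byte[] log) throws IOException {
        List<String> complete = new ArrayList<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(log));
        try {
            while (true) {
                int len = in.readInt();        // EOFException at end of data
                byte[] payload = new byte[len];
                in.readFully(payload);         // EOFException if the payload is cut short
                complete.add(new String(payload, StandardCharsets.UTF_8));
            }
        } catch (EOFException eof) {
            // Either a clean end of file or a partially written last record.
        }
        return complete;
    }
}
```

Cutting the last few bytes off a three-record log makes `readComplete` return only the first two records, which mirrors the situation described above: the partial record is dropped by the reader, and the data is expected to arrive through the client's retry instead.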
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459163#comment-13459163 ] 

Devaraj Das commented on HBASE-6649:
------------------------------------

Good to know, JD. I'll submit a patch with the logging addressed in a bit.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458251#comment-13458251 ] 

Devaraj Das commented on HBASE-6649:
------------------------------------

Has there been any change in your cluster environment (Hadoop version, etc.), such as a different version of the DFS client that could cause the issue to surface? [Not sure which Hadoop version you are on, but there is no chance you are hitting HDFS-1108, right?]
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448900#comment-13448900 ] 

Ted Yu commented on HBASE-6649:
-------------------------------

When I ran DD's patch in trunk, TestReplication still hung.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: 6649-1.patch, 6649-2.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449035#comment-13449035 ] 

Hudson commented on HBASE-6649:
-------------------------------

Integrated in HBase-0.94 #450 (See [https://builds.apache.org/job/HBase-0.94/450/])
    HBASE-6649 [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1] (Revision 1381289)

     Result = FAILURE
stack : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-6649:
-------------------------------

    Summary: [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]  (was: [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails)
    
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>         Attachments: 6649-2.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449069#comment-13449069 ] 

Hudson commented on HBASE-6649:
-------------------------------

Integrated in HBase-TRUNK #3307 (See [https://builds.apache.org/job/HBase-TRUNK/3307/])
    HBASE-6649 [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1] (Revision 1381287)

     Result = FAILURE
stack : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HBASE-6649:
-------------------------------

    Attachment: 6649-trunk.patch

Patch for trunk
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.92.3
>
>         Attachments: 6649-1.patch, 6649-2.txt, 6649-trunk.patch, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458135#comment-13458135 ] 

Jean-Daniel Cryans commented on HBASE-6649:
-------------------------------------------

[~lhofhansl] I'm trying to figure out what the problem is first, although if we're in a hurry we can just roll back. (Not backport, doh!)
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458131#comment-13458131 ] 

Lars Hofhansl commented on HBASE-6649:
--------------------------------------

You fix or rollback (the change)?
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-6649:
-------------------------

      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Committed to trunk, 0.92, and 0.94.  Thanks for the reviews, lads, and to DD for the patch.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454157#comment-13454157 ] 

Jean-Daniel Cryans commented on HBASE-6649:
-------------------------------------------

Oh, I see what you mean. Very good find! I wonder what that gibberish at the end of the file is.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Commented] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459754#comment-13459754 ] 

Lars Hofhansl commented on HBASE-6649:
--------------------------------------

J-D, any objections to committing this?
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>            Priority: Blocker
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-fix-io-exception-handling-1.patch, 6649-fix-io-exception-handling-1-trunk.patch, 6649-fix-io-exception-handling.patch, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.


[jira] [Updated] (HBASE-6649) [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-6649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-6649:
-------------------------

    Attachment: 6649.txt

Here is what I applied.  It includes Ted's suggested logging.  I applied this same patch to 0.94 and 0.92 with -p1.
                
> [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-6649
>                 URL: https://issues.apache.org/jira/browse/HBASE-6649
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6649-0.92.patch, 6649-1.patch, 6649-2.txt, 6649-trunk.patch, 6649-trunk.patch, 6649.txt, HBase-0.92 #495 test - queueFailover [Jenkins].html, HBase-0.92 #502 test - queueFailover [Jenkins].html
>
>
> Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 .. 
> Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.
