Posted to dev@accumulo.apache.org by "Keith Turner (Created) (JIRA)" <ji...@apache.org> on 2012/02/27 20:02:46 UTC

[jira] [Created] (ACCUMULO-427) Data lost when tablets moving around frequently

Data lost when tablets moving around frequently
-----------------------------------------------

                 Key: ACCUMULO-427
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-427
             Project: Accumulo
          Issue Type: Bug
          Components: tserver
         Environment: 10 node cluster running random walk test w/ agitation
            Reporter: Keith Turner
            Assignee: Keith Turner
            Priority: Blocker
             Fix For: 1.4.0


The shard random walk test failed while verifying its new index.  This test has two tables: a document table and a sharded index table used to find documents.  The test has a node that rebuilds the index from the document table and then verifies that the new and old indexes are the same.  This verification failed.  The failures were all related to data loss in one tablet in the new index table.  The lost data was read from two tablets in the document table.  None of the lost data appeared in any write-ahead logs.  The tablet that lost data was being moved around very frequently during the time of the data loss.  All of the evidence points to a bug in the batch writer or in the tablet server code related to writing data.
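
The verification step described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the names `build_index` and `missing_entries`, the byte-sum sharding rule, and the in-memory dictionaries are all hypothetical, and the source does not say how the random walk test actually shards or compares its indexes.

```python
def build_index(documents, num_shards):
    """Build a toy sharded index: (shard, term) -> set of document ids.

    The sharding rule (sum of the id's bytes modulo num_shards) is an
    assumption made for this sketch, not the rule used by the real test.
    """
    index = {}
    for doc_id, text in documents.items():
        shard = sum(doc_id.encode()) % num_shards
        for term in text.split():
            index.setdefault((shard, term), set()).add(doc_id)
    return index


def missing_entries(old_index, new_index):
    """Return the entries present in the old index but absent from the new one."""
    missing = {}
    for key, doc_ids in old_index.items():
        lost = doc_ids - new_index.get(key, set())
        if lost:
            missing[key] = lost
    return missing


if __name__ == "__main__":
    docs = {"d1": "apple pear", "d2": "apple"}
    old = build_index(docs, num_shards=4)
    # Simulate a rebuilt index that silently lost one entry.
    new = {key: set(ids) for key, ids in old.items()}
    new.pop(next(iter(new)))
    print(missing_entries(old, new))
```

A non-empty result corresponds to the kind of failure reported here: entries that the old index has and the rebuilt index lacks.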

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (ACCUMULO-427) Data lost when tablets moving around frequently

Posted by "Keith Turner (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner resolved ACCUMULO-427.
-----------------------------------

    Resolution: Fixed
    
> Data lost when tablets moving around frequently
> -----------------------------------------------
>
>                 Key: ACCUMULO-427
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-427
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>         Environment: 10 node cluster running random walk test w/ agitation
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>            Priority: Blocker
>              Labels: 14_qa_bug
>             Fix For: 1.4.0
>
>
> The shard random walk test failed while verifying its new index.  This test has two tables: a document table and a sharded index table used to find documents.  The test has a node that rebuilds the index from the document table and then verifies that the new and old indexes are the same.  This verification failed.  The failures were all related to data loss in one tablet in the new index table.  The lost data was read from two tablets in the document table.  None of the lost data appeared in any write-ahead logs.  The tablet that lost data was being moved around very frequently during the time of the data loss.  All of the evidence points to a bug in the batch writer or in the tablet server code related to writing data.


[jira] [Commented] (ACCUMULO-427) Data lost when tablets moving around frequently

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217403#comment-13217403 ] 

Keith Turner commented on ACCUMULO-427:
---------------------------------------

This bug was triggered by ACCUMULO-329
                

[jira] [Updated] (ACCUMULO-427) Data lost when tablets moving around frequently

Posted by "Keith Turner (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner updated ACCUMULO-427:
----------------------------------

    Description: [unchanged apart from one corrected sentence] "The tablet that lost data was being moved around very frequently during the time of the data loss."  (was: "The tablet that last data was being moved around very frequently during the time of the data loss.")
    

[jira] [Commented] (ACCUMULO-427) Data lost when tablets moving around frequently

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217555#comment-13217555 ] 

Keith Turner commented on ACCUMULO-427:
---------------------------------------

Looking at the code, the following sequence of events will result in data loss. 

 * Client starts update session 
 * Tablet A is unloaded
 * Client sends mutation batch 1 to tablet A, which fails
 * Tablet A is loaded
 * Client sends mutation batch 2 to tablet A, which succeeds
 * Tablet A is unloaded
 * Client sends mutation batch 3 to tablet A, which fails

In the above sequence, the current code forgets the failure of batch 1.  It assumes that batches 1 and 2 succeeded and that only batch 3 failed, so batch 1 is never retried and its data is lost.
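
The sequence above can be replayed against a toy model to show how a forgotten failure turns into lost data. This is a minimal sketch, not Accumulo's code: the `Tablet` and `UpdateSession` classes and the `forget_on_success` flag are hypothetical, and the assumption that a later success wipes the earlier failure record is only one way to reproduce the symptom described here.

```python
from dataclasses import dataclass, field


@dataclass
class Tablet:
    """Toy tablet that accepts writes only while it is loaded."""
    loaded: bool = True
    rows: list = field(default_factory=list)


class UpdateSession:
    """Hypothetical update session that records which batches failed.

    With forget_on_success=True, a successful batch clears the record of
    earlier failures. That rule is an assumption made for illustration;
    it is not taken from Accumulo's real bookkeeping.
    """

    def __init__(self, tablet, forget_on_success):
        self.tablet = tablet
        self.forget_on_success = forget_on_success
        self.failed = []  # batch ids the client will be told to retry

    def send(self, batch_id, mutations):
        if not self.tablet.loaded:
            self.failed.append(batch_id)  # write rejected: tablet is unloaded
            return
        self.tablet.rows.extend(mutations)
        if self.forget_on_success:
            self.failed.clear()  # modeled bug: earlier failures are dropped


def run(forget_on_success):
    """Replay the sequence of events; return (failed batches, stored rows)."""
    tablet = Tablet()
    session = UpdateSession(tablet, forget_on_success)  # client starts update session
    tablet.loaded = False    # tablet A is unloaded
    session.send(1, ["m1"])  # batch 1 fails
    tablet.loaded = True     # tablet A is loaded
    session.send(2, ["m2"])  # batch 2 succeeds
    tablet.loaded = False    # tablet A is unloaded
    session.send(3, ["m3"])  # batch 3 fails
    return session.failed, tablet.rows


if __name__ == "__main__":
    print("forgetful session:", run(forget_on_success=True))
    print("accumulating session:", run(forget_on_success=False))
```

In the forgetful run only batch 3 is reported as failed, so a client that retries just the reported batches never resends batch 1. Keeping every failure for the life of the session reports both batch 1 and batch 3.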

I think this bug was caused by changes made in 1.4, and does not exist in 1.3.
                