Posted to dev@accumulo.apache.org by "Eric Newton (Commented) (JIRA)" <ji...@apache.org> on 2012/02/09 15:09:59 UTC

[jira] [Commented] (ACCUMULO-356) merge failed to complete

    [ https://issues.apache.org/jira/browse/ACCUMULO-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204541#comment-13204541 ] 

Eric Newton commented on ACCUMULO-356:
--------------------------------------

There was an inconsistent read of the METADATA table due to a tablet split occurring during the merge.  Operations such as merge and deleteTable will need to perform a consistent scan of the metadata table, ensuring that all necessary tablets are offline and that the end-row/prev-row entries match, before writing to the METADATA table.
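
A minimal sketch of the kind of check described above; TabletMeta, its fields, and safeToMerge are illustrative names, not the actual Accumulo classes:

{code}
// Sketch of a consistent pre-merge check over a metadata snapshot.
// TabletMeta and safeToMerge are illustrative, not Accumulo API.
import java.util.List;

public class ConsistentMergeCheck {

  static class TabletMeta {
    String endRow;   // null for the last tablet in the range
    String prevRow;  // decoded ~tab:~pr value, null when absent
    boolean online;  // true when a loc: entry is present
  }

  // Callers would re-scan until two consecutive snapshots agree,
  // then run this over the stable snapshot before mutating METADATA.
  static boolean safeToMerge(List<TabletMeta> tablets) {
    String expectedPrev = null;
    boolean first = true;
    for (TabletMeta t : tablets) {
      if (t.online) {
        return false; // every tablet in the range must be offline
      }
      // each prev-row must match the previous tablet's end-row;
      // a mismatch means a split landed in the middle of the scan
      if (!first && (t.prevRow == null || !t.prevRow.equals(expectedPrev))) {
        return false;
      }
      expectedPrev = t.endRow;
      first = false;
    }
    return true;
  }
}
{code}

The point of re-scanning until two snapshots agree is that any single scan can race a split, as the logs below demonstrate.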

> merge failed to complete
> ------------------------
>
>                 Key: ACCUMULO-356
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-356
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master
>         Environment: random walk test of "All" without "Security" on a 10-node test cluster
>            Reporter: Eric Newton
>            Assignee: Eric Newton
>             Fix For: 1.4.0
>
>
> Noticed the random walk test was generally hung.  There was a tablet offline.  Listed the Fate operations:
> {noformat}
> $ ./bin/accumulo org.apache.accumulo.server.fate.Admin print
> txid: 59c0403614dc0c39  status: IN_PROGRESS         op: RenameTable      locked: []              locking: [W:cz]          top: RenameTable
> txid: 37539f8d61548764  status: IN_PROGRESS         op: ChangeTableState  locked: []              locking: [W:cz]          top: ChangeTableState
> txid: 02f8323a3136e60d  status: IN_PROGRESS         op: TableRangeOp     locked: []              locking: [W:cz]          top: TableRangeOp
> txid: 044015732e97eec1  status: IN_PROGRESS         op: CompactRange     locked: []              locking: [R:cz]          top: CompactRange
> txid: 6ce9dd63f9d51448  status: IN_PROGRESS         op: CompactRange     locked: []              locking: [R:cz]          top: CompactRange
> txid: 417cb9b60e44ecd9  status: IN_PROGRESS         op: TableRangeOp     locked: []              locking: [W:cz]          top: TableRangeOp
> txid: 5e7c5284a4677d6c  status: IN_PROGRESS         op: DeleteTable      locked: []              locking: [W:cz]          top: DeleteTable
> txid: 6633d3d841d66995  status: IN_PROGRESS         op: TableRangeOp     locked: [W:cz]          locking: []              top: TableRangeOpWait
> {noformat}
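> In this listing, "locked" shows the locks a transaction actually holds and "locking" shows the entry it has queued but not yet been granted; only 6633d3d841d66995 holds the write lock on table cz. A minimal in-memory sketch of those queue semantics (the real DistributedReadWriteLock is ZooKeeper-backed; this just models the ordering):
> {code}
> // Toy model of queue-ordered read/write lock semantics: a writer
> // holds the lock only at the head of the queue, readers share the
> // lock when only readers are ahead of them.
> import java.util.ArrayDeque;
> import java.util.Deque;
>
> class QueuedRwLock {
>   enum Type { READ, WRITE }
>
>   static class Entry {
>     final String txid; final Type type;
>     Entry(String txid, Type type) { this.txid = txid; this.type = type; }
>   }
>
>   private final Deque<Entry> queue = new ArrayDeque<>();
>
>   void enqueue(String txid, Type type) { queue.addLast(new Entry(txid, type)); }
>
>   // "locked" in the Fate listing corresponds to this returning true;
>   // "locking" corresponds to an entry still waiting in the queue.
>   boolean holdsLock(String txid) {
>     boolean onlyReadsAhead = true;
>     for (Entry e : queue) {
>       if (e.txid.equals(txid)) {
>         return e.type == Type.READ ? onlyReadsAhead : e == queue.peekFirst();
>       }
>       if (e.type == Type.WRITE) onlyReadsAhead = false;
>     }
>     return false;
>   }
> }
> {code}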
> Quick check of master logs finds that Fate tx 6633d3d841d66995 did not complete.  Here's what the master saw for this transaction:
> The transaction is seeded, and a write lock is put into the queue, but there are other operations going on:
> {noformat}
> 30 20:52:29,387 [zookeeper.DistributedReadWriteLock] INFO : Added lock entry 4 userData 6633d3d841d66995 lockType WRITE
> {noformat}
> For example, there was a read lock for transaction 56fada3359bfb4c7:
> {noformat}
> 30 20:52:29,391 [tableOps.Utils] INFO : table cz (56fada3359bfb4c7) unlocked for read
> {noformat}
> And there was yet another write lock ahead of us, for transaction 62c3cecaf387398e:
> {noformat}
> 30 20:52:29,486 [tableOps.Utils] INFO : table cz (62c3cecaf387398e) locked for write operation: ONLINE
> 30 20:52:29,488 [tables.TableManager] DEBUG: Transitioning state for table cz from OFFLINE to ONLINE
> 30 20:52:29,491 [tables.TableManager] DEBUG: State transition to ONLINE @ WatchedEvent state:SyncConnected type:NodeDataChanged path:/accumulo/88cd0f63-a36a-4218-86b1-9ba1d2cccf08/tables/cz/state
> 30 20:52:29,491 [master.EventCoordinator] INFO : Table state in zookeeper changed for cz to ONLINE
> 30 20:52:29,494 [tableOps.Utils] INFO : table cz (62c3cecaf387398e) unlocked for write
> {noformat}
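> That ONLINE transition is propagated through a watch on the table's state node in ZooKeeper (the NodeDataChanged event in the log above). A stripped-down version of that watch pattern, using the plain ZooKeeper API rather than the actual TableManager code:
> {code}
> // Generic ZooKeeper data-watch pattern for a table state node.
> // The path comes from the log above; the rest is a sketch.
> import org.apache.zookeeper.WatchedEvent;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
>
> class TableStateWatcher implements Watcher {
>   private final ZooKeeper zk;
>   private final String statePath; // e.g. .../tables/cz/state
>
>   TableStateWatcher(ZooKeeper zk, String statePath) {
>     this.zk = zk;
>     this.statePath = statePath;
>   }
>
>   @Override
>   public void process(WatchedEvent event) {
>     if (event.getType() == Event.EventType.NodeDataChanged) {
>       try {
>         // re-read the new state and re-register the watch in one call
>         byte[] state = zk.getData(statePath, this, null);
>         System.out.println("table state changed to " + new String(state));
>       } catch (Exception e) {
>         e.printStackTrace();
>       }
>     }
>   }
> }
> {code}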
> So now our table is in the online state... but the tablets may not be (yet). Our Fate operation gets the write lock and starts to execute:
> {noformat}
> 30 20:52:29,494 [tableOps.ChangeTableState] DEBUG: Changed table state cz ONLINE
> 30 20:52:29,494 [master.EventCoordinator] INFO : Set table state of cz to ONLINE
> 30 20:52:29,495 [tableOps.Utils] INFO : table cz (6633d3d841d66995) locked for write operation: MERGE
> {noformat}
> Hmm... this seems strange... we went from STARTED directly to MERGING: no OFFLINE, no CHOPPING.  Those are very strange MergeStats, too.  I'm not 100% sure that these merge stats go with this operation, but the timing suggests they do (need to fix that log message).
> {noformat}
> 30 20:52:29,500 [master.EventCoordinator] INFO : Merge state of cz;7c9131203e34514c;39ceaf91dd1db6ef set to STARTED
> 30 20:52:29,516 [state.MergeStats] INFO : 0 are hosted, total 0
> 30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are chopped
> 30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are chopped, 0 are offline
> 30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are unassigned
> 30 20:52:29,535 [master.EventCoordinator] INFO : Merge state of cz;7c9131203e34514c;39ceaf91dd1db6ef set to MERGING
> {noformat}
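> One plausible reading of the jump straight to MERGING: the inconsistent metadata scan returned total = 0, so guards of the form "chopped == total" and "unassigned == total" pass vacuously, with no tablet actually chopped or offline. A hedged reconstruction of that failure mode (not the actual MergeStats code):
> {code}
> // Reconstruction of a vacuously-true merge guard. The real
> // MergeStats logic differs in detail, but with total == 0 any
> // "all tablets are X" condition holds trivially.
> enum MergeState { STARTED, MERGING }
>
> class MergeGuard {
>   static MergeState next(int total, int chopped, int unassigned) {
>     if (chopped == total && unassigned == total) {
>       return MergeState.MERGING; // 0 == 0 && 0 == 0: true
>     }
>     return MergeState.STARTED;
>   }
>
>   public static void main(String[] args) {
>     // the counts from the log above: total 0, 0 chopped, 0 unassigned
>     System.out.println(MergeGuard.next(0, 0, 0)); // prints MERGING
>   }
> }
> {code}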
> Looks like the tablets were offline, too, because the main master loop is trying to assign them:
> {noformat}
> 30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet cz;08498ad358a45bb0;050754dd3cc35c16=192.168.117.4:9997[134d7425fc8959e]
> 30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet cz;050754dd3cc35c16<=192.168.117.5:9997[334c843791cb78b]
> 30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet cz;137d382d52b8fbd9;08498ad358a45bb0=192.168.117.6:9997[134d7425fc8959f]
> 30 20:52:29,615 [master.EventCoordinator] INFO : tablet cz;08498ad358a45bb0;050754dd3cc35c16 was loaded
> 30 20:52:29,616 [master.EventCoordinator] INFO : tablet cz;137d382d52b8fbd9;08498ad358a45bb0 was loaded
> {noformat}
> And, just to make this more interesting, a tablet splits shortly after being loaded:
> {noformat}
> 30 20:52:29,603 [state.MergeStats] INFO : 1 tablets are unassigned
> 30 20:52:29,688 [master.EventCoordinator] INFO : tablet cz<;137d382d52b8fbd9 was unloaded
> 30 20:52:30,081 [master.EventCoordinator] INFO : 192.168.117.9:9997 reported split cz;08498ad358a45bb0;050754dd3cc35c16, cz;137d382d52b8fbd9;08498ad358a45bb0
> {noformat}
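> A split is two separate metadata updates, roughly: write the entry for one half, then rewrite the prev-row of the other. A scan that lands between them sees a broken chain, or misses tablets entirely. A toy simulation of that window, using a plain map in place of the !METADATA table:
> {code}
> // Toy race simulation: metadata modeled as end-row -> prev-row.
> // Reading between the two writes of a split sees a broken chain.
> import java.util.Map;
> import java.util.TreeMap;
>
> class SplitRace {
>   public static void main(String[] args) {
>     TreeMap<String,String> meta = new TreeMap<>();
>     meta.put("m", null);  // tablet (-inf, m], no prev-row
>     meta.put("~", "m");   // tablet (m, +inf), prev-row m
>
>     // first half of a split of (-inf, m] at "f": repoint the prev-row
>     meta.put("m", "f");
>     System.out.println("mid-split consistent? " + consistent(meta));   // false
>
>     // second half: add the new lower tablet, completing the split
>     meta.put("f", null);
>     System.out.println("after-split consistent? " + consistent(meta)); // true
>   }
>
>   static boolean consistent(TreeMap<String,String> meta) {
>     String expectedPrev = null;
>     for (Map.Entry<String,String> e : meta.entrySet()) {
>       String prev = e.getValue();
>       boolean ok = (expectedPrev == null) ? prev == null : expectedPrev.equals(prev);
>       if (!ok) return false;
>       expectedPrev = e.getKey();
>     }
>     return true;
>   }
> }
> {code}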
> Master takes some of the tablets offline:
> {noformat}
> 30 20:52:30,083 [master.EventCoordinator] INFO : tablet cz;050754dd3cc35c16< was unloaded
> 30 20:52:30,083 [master.EventCoordinator] INFO : tablet cz;08498ad358a45bb0;050754dd3cc35c16 was unloaded
> 30 20:52:30,084 [master.EventCoordinator] INFO : tablet cz;137d382d52b8fbd9;08498ad358a45bb0 was unloaded
> {noformat}
> But not all of them:
> {noformat}
> 30 20:52:30,445 [master.EventCoordinator] INFO : tablet cz;050754dd3cc35c16< was loaded
> {noformat}
> The !METADATA table status:
> {noformat}
> !METADATA> scan -b cz; -e d
> cz;050754dd3cc35c16 file:/b-00006eb/I00006ec.rf []    50848,0
> cz;050754dd3cc35c16 file:/t-00006p2/F0000qra.rf []    2936,660
> cz;050754dd3cc35c16 file:/t-00006p2/F00023p9.rf []    3176,720
> cz;050754dd3cc35c16 last:334c843791cb78b []    192.168.117.5:9997
> cz;050754dd3cc35c16 loc:334c843791cb78b []    192.168.117.5:9997
> cz;050754dd3cc35c16 srv:dir []    /t-00006p2
> cz;050754dd3cc35c16 srv:flush []    0
> cz;050754dd3cc35c16 srv:lock []    tservers/192.168.117.5:9997/zlock-0000000000$334c843791cb78b
> cz;050754dd3cc35c16 srv:time []    M1327993277526
> cz;050754dd3cc35c16 ~tab:~pr []    \x00
> cz;08498ad358a45bb0 file:/b-00006eb/I00006ec.rf []    47216,0
> cz;08498ad358a45bb0 file:/t-00006p3/F0000qj9.rf []    1883,400
> cz;08498ad358a45bb0 file:/t-00006p3/F00022u1.rf []    1717,360
> cz;08498ad358a45bb0 last:134d7425fc8959e []    192.168.117.4:9997
> cz;08498ad358a45bb0 loc:134d7425fc8959e []    192.168.117.4:9997
> cz;08498ad358a45bb0 srv:dir []    /t-00006p3
> cz;08498ad358a45bb0 srv:flush []    0
> cz;08498ad358a45bb0 srv:lock []    tservers/192.168.117.4:9997/zlock-0000000000$134d7425fc8959e
> cz;08498ad358a45bb0 srv:time []    M1327993277506
> cz;08498ad358a45bb0 ~tab:~pr []    \x01050754dd3cc35c16
> cz;137d382d52b8fbd9 file:/b-00006eb/I00006ec.rf []    118041,0
> cz;137d382d52b8fbd9 file:/t-00006os/F0000quc.rf []    6017,1430
> cz;137d382d52b8fbd9 file:/t-00006os/F00023gh.rf []    5409,1270
> cz;137d382d52b8fbd9 last:134d7425fc8959f []    192.168.117.6:9997
> cz;137d382d52b8fbd9 loc:134d7425fc8959f []    192.168.117.6:9997
> cz;137d382d52b8fbd9 srv:dir []    /t-00006os
> cz;137d382d52b8fbd9 srv:flush []    0
> cz;137d382d52b8fbd9 srv:lock []    tservers/192.168.117.6:9997/zlock-0000000000$134d7425fc8959f
> cz;137d382d52b8fbd9 srv:time []    M1327993277515
> cz;137d382d52b8fbd9 ~tab:~pr []    \x0108498ad358a45bb0
> cz< chopped:chopped []    chopped
> cz< file:/b-00006eb/I00006ec.rf []    1210190,0
> cz< srv:compact []    1
> cz< srv:dir []    /default_tablet
> cz< srv:lock []    tservers/192.168.117.10:9997/zlock-0000000000$334c843791cb78f
> cz< srv:time []    M0
> cz< ~tab:~pr []    \x01137d382d52b8fbd9
> {noformat}
> Four tablets, consistent splits; three are online, and the offline tablet is chopped.  The offline tablet doesn't have a last location entry, which is strange.
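> For what it's worth, the prev-row chain in the scan above does check out: the leading \x00/\x01 byte of each ~tab:~pr value acts as a null/present flag, and each decoded prev-row matches the previous tablet's end row. A quick standalone check of just those four entries (the decoding is simplified for illustration):
> {code}
> // Chain check over the four tablets from the !METADATA scan above.
> class PrevRowChain {
>   public static void main(String[] args) {
>     // { end row (null for the last tablet cz<), decoded prev-row }
>     String[][] tablets = {
>       { "050754dd3cc35c16", null },                 // ~tab:~pr was \x00
>       { "08498ad358a45bb0", "050754dd3cc35c16" },
>       { "137d382d52b8fbd9", "08498ad358a45bb0" },
>       { null, "137d382d52b8fbd9" },                 // the cz< tablet
>     };
>     String expectedPrev = null;
>     for (String[] t : tablets) {
>       boolean ok = (expectedPrev == null) ? t[1] == null
>                                           : expectedPrev.equals(t[1]);
>       System.out.println("end row " + t[0] + " ok=" + ok);
>       expectedPrev = t[0];
>     }
>   }
> }
> {code}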

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira