Posted to dev@accumulo.apache.org by "Eric Newton (Created) (JIRA)" <ji...@apache.org> on 2012/02/29 14:15:56 UTC

[jira] [Created] (ACCUMULO-436) tablet merge stuck

tablet merge stuck
------------------

                 Key: ACCUMULO-436
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-436
             Project: Accumulo
          Issue Type: Bug
          Components: master
         Environment: randomwalk with agitation on 10-node test cluster
            Reporter: Eric Newton
            Assignee: Eric Newton
             Fix For: 1.4.0


After 14 hours of randomwalk, a merge operation appeared to be stuck.


The garbage collector was stuck, and some tablets were offline:
||\# Online Tablet Servers||\# Total Tablet Servers||Loggers||Last GC||\# Tablets||\# Unassigned Tablets||Entries||Ingest||Query||Hold Time||OS Load||
|10|10|10|*Running 2/29/12 12:14 PM*|299|*4*|277.50M|311|5.53K|—|0.50|


Garbage collector could not get a consistent !METADATA table scan:
{noformat}
29 13:04:10,808 [util.TabletIterator] INFO : Resetting !METADATA scanner to [24q;5f83b8f927c41c9d%00; : [] 9223372036854775807 false,~ : [] 9223372036854775807 false)
29 13:04:11,071 [util.TabletIterator] INFO : Metadata inconsistency : 1419e44259517c51 != 5f83b8f927c41c9d metadataKey = 24q< ~tab:~pr [] 724883 false
{noformat}

Table (id 24q) had a merge in progress:
{noformat}
./bin/accumulo org.apache.accumulo.server.fate.Admin print
txid: 7bea12fa46c40a72  status: IN_PROGRESS         op: BulkImport       locked: []              locking: [R:24q]         top: BulkImport
txid: 08db6105a25c0788  status: IN_PROGRESS         op: CloneTable       locked: []              locking: [R:24q]         top: CloneTable
txid: 5f798db1cab5fdea  status: IN_PROGRESS         op: BulkImport       locked: []              locking: [R:24q]         top: BulkImport
txid: 6aa9a8a9b36a4f4d  status: IN_PROGRESS         op: TableRangeOp     locked: []              locking: [W:24q]         top: TableRangeOp
txid: 5c6e82e235ec3855  status: IN_PROGRESS         op: TableRangeOp     locked: []              locking: [W:24q]         top: TableRangeOp
txid: 653a9293ba9f1cdc  status: IN_PROGRESS         op: RenameTable      locked: []              locking: [W:24q]         top: RenameTable
txid: 651c62eb37136b6e  status: IN_PROGRESS         op: TableRangeOp     locked: [W:24q]         locking: []              top: TableRangeOpWait
{noformat}
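The lock columns above can be read as a FIFO read/write lock queue on table 24q: one merge transaction (651c62eb37136b6e) holds the write lock, and every other operation, including the two queued writers, waits behind it. As an illustrative sketch only (this is not Accumulo's actual FATE code; the class and method names are invented for the example), the queueing behavior looks roughly like this:

```python
from collections import deque

class TableLockQueue:
    """Minimal fair FIFO read/write table lock: one holder set at a time,
    everyone else queues behind it, and a queued writer blocks later readers."""

    def __init__(self):
        self.queue = deque()   # pending (txid, mode) requests, FIFO order
        self.held = []         # txids currently holding the lock
        self.held_mode = None  # 'R' or 'W' while the lock is held

    def request(self, txid, mode):
        self.queue.append((txid, mode))
        self._grant()

    def release(self, txid):
        self.held.remove(txid)
        if not self.held:
            self.held_mode = None
        self._grant()

    def _grant(self):
        # Grant strictly in FIFO order: readers share, writers are exclusive.
        while self.queue:
            txid, mode = self.queue[0]
            if self.held_mode is None:
                self.held_mode = mode
            elif mode == 'W' or self.held_mode == 'W':
                break  # writer waits for everyone; everyone waits for a writer
            self.held.append(txid)
            self.queue.popleft()

locks = TableLockQueue()
locks.request('651c62eb37136b6e', 'W')  # merge holds W:24q ("locked: [W:24q]")
locks.request('7bea12fa46c40a72', 'R')  # BulkImport queues ("locking: [R:24q]")
locks.request('08db6105a25c0788', 'R')  # CloneTable queues behind it too
```

With the merge stuck, nothing behind it in the queue can make progress, which is why every other transaction on 24q shows `locking:` rather than `locked:`.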

Scan of table 24q:
{noformat}
scan -b 24q; -e 24q<
24q;073b220b74a75059 loc:135396fb191d4b6 []    192.168.117.6:9997
24q;073b220b74a75059 srv:compact []    3
24q;073b220b74a75059 srv:dir []    /t-00031y0
24q;073b220b74a75059 srv:lock []    tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;073b220b74a75059 srv:time []    M0
24q;073b220b74a75059 ~tab:~pr []    \x00
24q;1419e44259517c51 loc:235396fb184b5cd []    192.168.117.12:9997
24q;1419e44259517c51 srv:compact []    3
24q;1419e44259517c51 srv:dir []    /t-00031y1
24q;1419e44259517c51 srv:lock []    tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;1419e44259517c51 srv:time []    M0
24q;1419e44259517c51 ~tab:~pr []    \x01073b220b74a75059
24q;51fc3e7faea2b7e9 chopped:chopped []    chopped
24q;51fc3e7faea2b7e9 srv:compact []    3
24q;51fc3e7faea2b7e9 srv:dir []    /t-00031y2
24q;51fc3e7faea2b7e9 srv:lock []    tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;51fc3e7faea2b7e9 srv:time []    M0
24q;51fc3e7faea2b7e9 ~tab:~pr []    \x011419e44259517c51
24q;5e65b844f2c7f868 chopped:chopped []    chopped
24q;5e65b844f2c7f868 srv:compact []    3
24q;5e65b844f2c7f868 srv:dir []    /t-00031e1
24q;5e65b844f2c7f868 srv:lock []    tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;5e65b844f2c7f868 srv:time []    M0
24q;5e65b844f2c7f868 ~tab:~pr []    \x0151fc3e7faea2b7e9
24q;5f83b8f927c41c9d chopped:chopped []    chopped
24q;5f83b8f927c41c9d srv:compact []    3
24q;5f83b8f927c41c9d srv:dir []    /t-000329w
24q;5f83b8f927c41c9d srv:lock []    tservers/192.168.117.6:9997/zlock-0000000002$135396fb191c4f3
24q;5f83b8f927c41c9d srv:time []    M0
24q;5f83b8f927c41c9d ~tab:~pr []    \x015e65b844f2c7f868
24q< chopped:chopped []    chopped
24q< srv:compact []    3
24q< srv:dir []    /default_tablet
24q< srv:lock []    tservers/192.168.117.6:9997/zlock-0000000002$135396fb191c4f3
24q< srv:time []    M0
24q< ~tab:~pr []    \x011419e44259517c51
{noformat}
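The `~tab:~pr` entries in the scan above form a linked list: each tablet's prev-row value must equal the end row of the tablet before it. Here the default tablet (`24q<`) points at `1419e44259517c51` while its actual predecessor ends at `5f83b8f927c41c9d`, which is exactly the mismatch the garbage collector logged. A small sketch of that chain check (illustrative Python, not the actual TabletIterator code), using the end-row/prev-row pairs from the scan:

```python
# (end_row, prev_row) pairs taken from the ~tab:~pr column in the scan above.
# \x00 (no prev row) is represented as None.
tablets = [
    ("073b220b74a75059", None),
    ("1419e44259517c51", "073b220b74a75059"),
    ("51fc3e7faea2b7e9", "1419e44259517c51"),
    ("5e65b844f2c7f868", "51fc3e7faea2b7e9"),
    ("5f83b8f927c41c9d", "5e65b844f2c7f868"),
    ("<", "1419e44259517c51"),  # default tablet: stale prev row
]

def find_inconsistencies(tablets):
    """Return (expected_prev, found_prev, end_row) triples wherever the
    prev-row chain is broken, mirroring the consistency check that the
    GC's TabletIterator performs while scanning !METADATA."""
    problems = []
    for (prev_end, _), (end, prev_row) in zip(tablets, tablets[1:]):
        if prev_row != prev_end:
            problems.append((prev_end, prev_row, end))
    return problems
```

Running `find_inconsistencies(tablets)` flags the single broken link at the default tablet: it found `1419e44259517c51` where `5f83b8f927c41c9d` was expected, matching the "Metadata inconsistency" log line.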

Master Logs
{noformat}
29 13:11:49,903 [state.MergeStats] INFO : Computing next merge state for 24q;6badf28df1d8ece7;37f3488aa92ac056 which is presently MERGING isDelete : false
29 13:11:49,903 [state.MergeStats] INFO : 4 tablets are unassigned 24q;6badf28df1d8ece7;37f3488aa92ac056
{noformat}

The final consistency check is failing because the delete is partially complete.  The delete step is not idempotent: partial execution leaves the Repo in a state from which it cannot continue after restart.
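The failure mode can be sketched as follows (illustrative Python under the assumption that a Repo step may be re-executed from the top after a master restart; this is not Accumulo's actual Java code). The destructive step must tolerate finding that part of its work is already done:

```python
def delete_range(metadata, rows):
    """Hypothetical destructive step of a merge: remove the merged-away
    metadata rows. Because the master can die mid-step and re-run the Repo
    on restart, this must succeed even when some rows are already gone."""
    for row in rows:
        # dict.pop with a default makes the removal idempotent: repeating
        # the step after a partial first run is a no-op for missing rows,
        # rather than an error that wedges the operation.
        metadata.pop(row, None)
    return metadata
```

A step written to *assert* that every row still exists before deleting it would pass on the first run, but fail forever once a crash leaves the deletion half done, which is the stuck state described above.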


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Resolved] (ACCUMULO-436) tablet merge stuck

Posted by "Keith Turner (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner resolved ACCUMULO-436.
-----------------------------------

    Resolution: Fixed
    


[jira] [Commented] (ACCUMULO-436) tablet merge stuck

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219326#comment-13219326 ] 

Keith Turner commented on ACCUMULO-436:
---------------------------------------

Looking at the code, I noticed that getHighTablet() may throw an exception or read a tablet from the next table in the case where the high tablet does not exist.  This is a case that could occur if deleteTablets() is run twice.
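The hazard Keith describes comes from the metadata keyspace being one sorted sequence of `<tableid>;<endrow>` rows (with `<tableid><` as each table's last tablet): if a table's high tablet row has been deleted, a forward scan falls through into the next table's rows. A sketch of the defensive check (illustrative Python with made-up data; not the real getHighTablet() implementation):

```python
import bisect

# Sorted metadata rows. Table 24q's default tablet row ("24q<") has already
# been deleted, so a scan for 24q's high tablet can land in table 24r.
rows = ["24q;073b220b74a75059", "24q;1419e44259517c51", "24r;aaaa", "24r<"]

def get_high_tablet(rows, table_id, start_row):
    """Scan forward from start_row, then verify the row found still belongs
    to table_id rather than to the next table in the keyspace."""
    i = bisect.bisect_left(rows, start_row)
    if i == len(rows):
        return None
    row = rows[i]
    # Guard against running off the end of this table's keyspace.
    if not (row.startswith(table_id + ";") or row == table_id + "<"):
        raise RuntimeError("high tablet for %s not found; scan reached %s"
                           % (table_id, row))
    return row
```

Without the prefix guard, a second run of deleteTablets() would happily treat `24r;aaaa` as 24q's high tablet, which is the cross-table read the comment warns about.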
                


[jira] [Issue Comment Edited] (ACCUMULO-436) tablet merge stuck

Posted by "Eric Newton (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219320#comment-13219320 ] 

Eric Newton edited comment on ACCUMULO-436 at 2/29/12 4:26 PM:
---------------------------------------------------------------

Good catch.  Zookeeper is probably the simplest option for now. We should use Fate/Repo to perform the last part of the merge in 1.4.1 or 1.5.

                
      was (Author: ecn):
    Good catch.  Zookeeper is probably the simplest option for now. We should use Fate/Repo to perform the last part of the merge in 1.4 or 1.5.

                  


[jira] [Commented] (ACCUMULO-436) tablet merge stuck

Posted by "Eric Newton (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219320#comment-13219320 ] 

Eric Newton commented on ACCUMULO-436:
--------------------------------------

Good catch.  Zookeeper is probably the simplest option for now. We should use Fate/Repo to perform the last part of the merge in 1.4 or 1.5.

                


[jira] [Updated] (ACCUMULO-436) tablet merge stuck

Posted by "Eric Newton (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Newton updated ACCUMULO-436:
---------------------------------

    Description: 
The final consistency check is failing because the merge is partially complete.  The final step is not idempotent: partial execution leaves the Repo in a state from which it cannot continue after a restart.
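To illustrate the idempotency problem described above, here is a hedged, in-memory sketch (not Accumulo's actual API; all names are hypothetical) of a FATE-style step that records its intent in a durable marker before mutating metadata. With that marker, re-running the delete after a crash is harmless, and "did a merge start?" no longer has to be inferred from half-updated metadata:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a Repo step is safe to re-run only if every partial
// execution leaves enough state behind to decide what work remains. The bug
// pattern is deleting metadata rows first and updating the prev-row pointer
// last, so a crash in between strands the operation.
class MergeStepSketch {
    // Simulated !METADATA: tablet end rows for one table.
    static List<String> tablets = new ArrayList<>();
    // Durable "merge in progress" marker (e.g. a node in ZooKeeper).
    static boolean mergeMarker = false;

    static void startMerge() {
        mergeMarker = true;            // record intent BEFORE mutating metadata
    }

    static void deleteTablets(List<String> toDelete) {
        // Idempotent: removing an already-removed tablet is a no-op, and the
        // marker (not the metadata itself) says whether we must continue.
        tablets.removeAll(toDelete);
    }

    static boolean mergeStarted() {
        return mergeMarker;            // survives a crash at any point above
    }

    static void finishMerge() {
        mergeMarker = false;           // clear intent only after all work is done
    }
}
```

The key property: calling deleteTablets() twice with the same arguments yields the same state as calling it once, so the master can blindly re-drive the step after a restart.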



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (ACCUMULO-436) tablet merge stuck

Posted by "Keith Turner (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ACCUMULO-436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219302#comment-13219302 ] 

Keith Turner commented on ACCUMULO-436:
---------------------------------------

I think a further change needs to be made.  The new mergeStarted() function looks at the prevRow to determine whether a merge was started.  The deleteTablets() function deletes entries from !METADATA and then modifies prevRow.  So if deleteTablets() is terminated between deleting the entries and updating prevRow, then I think mergeStarted() and verifyMergeConsistency() will both return false indefinitely.  I thought of switching the order of operations, but that is tricky because the high tablet could be deleted (which would cause the getHighTablet() method to throw an exception).

I am thinking a solution is to put a marker in ZooKeeper when the merge starts, and to continue the merge whenever that marker is present.  I thought of putting a marker in the !METADATA table instead, but the way deleteTablets() works, there is no good place to put it.
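A minimal sketch of this marker idea, with an in-memory set standing in for ZooKeeper (real code would create, check, and delete a persistent znode via the ZooKeeper client; the path and method names here are hypothetical). The marker is written before deleteTablets() touches !METADATA, so deciding whether a merge started no longer depends on the prevRow column:

```java
import java.util.HashSet;
import java.util.Set;

// Hedged sketch of the ZooKeeper-marker proposal. The set below stands in
// for ZooKeeper; in real code these would be create()/exists()/delete()
// calls on a persistent znode.
class MergeMarkerSketch {
    static final Set<String> zkNodes = new HashSet<>();  // stand-in for ZooKeeper

    static String markerPath(String tableId) {
        return "/accumulo/merges/" + tableId;            // hypothetical path
    }

    static void recordMergeStart(String tableId) {
        zkNodes.add(markerPath(tableId));                // create marker node
    }

    static boolean mergeStarted(String tableId) {
        return zkNodes.contains(markerPath(tableId));    // exists() check
    }

    static void recordMergeDone(String tableId) {
        zkNodes.remove(markerPath(tableId));             // delete marker node
    }
}
```

Because the marker lives outside the rows that deleteTablets() mutates, the check is immune to any intermediate metadata state, and writing the marker twice is harmless.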
                
