Posted to dev@accumulo.apache.org by "Keith Turner (Resolved) (JIRA)" <ji...@apache.org> on 2012/02/01 21:42:58 UTC

[jira] [Resolved] (ACCUMULO-315) Hole in metadata table occurred during random walk test

     [ https://issues.apache.org/jira/browse/ACCUMULO-315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner resolved ACCUMULO-315.
-----------------------------------

    Resolution: Fixed
    
> Hole in metadata table occurred during random walk test
> -------------------------------------------------------
>
>                 Key: ACCUMULO-315
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-315
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master, tserver
>         Environment: Running 1.4.0 SNAPSHOT on 10 node cluster.
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>            Priority: Critical
>             Fix For: 1.4.0
>
>
> While running the random walk test, a hole in the metadata table occurred.  A client tried to delete the table with the hole and the fate operation got stuck.  The master logs continually showed the following.
> {noformat}
> 14 00:02:11,273 [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: 4ct locationState: 4ct;4d2d3be2823b0bf4;27b693c626c2d4ef@(null,xxx.xxx.xxx.xxx:9997[134d7425fc503e1],null)
> {noformat}
> The metadata table contained the following.  Tablet 4ct;4d2d3be2823b0bf4 had a location.
> {noformat}
> 4ct;262249211a62cd6f ~tab:~pr []    \x011819e56edae21302
> 4ct;27b693c626c2d4ef ~tab:~pr []    \x01262249211a62cd6f
> 4ct;43422047c78fa52b ~tab:~pr []    \x0141ea825af0f262d9
> 4ct;4d2d3be2823b0bf4 ~tab:~pr []    \x0127b693c626c2d4ef
> 4ct;4f89df61392bb311 ~tab:~pr []    \x014d2d3be2823b0bf4
> {noformat}
> Found the following events on a tablet server.
> {noformat}
> #the tablet server events below are caused by the delete range operation
> 13 21:36:04,287 [tabletserver.Tablet] TABLET_HIST: 4ct;4d2d3be2823b0bf4;262249211a62cd6f split 4ct;27b693c626c2d4ef;262249211a62cd6f 4ct;4d2d3be2823b0bf4;27b693c626c2d4ef
> 13 21:36:04,369 [tabletserver.Tablet] TABLET_HIST: 4ct;4d2d3be2823b0bf4;27b693c626c2d4ef split 4ct;41ea825af0f262d9;27b693c626c2d4ef 4ct;4d2d3be2823b0bf4;41ea825af0f262d9
> 13 21:36:04,370 [tabletserver.Tablet] TABLET_HIST: 4ct;4d2d3be2823b0bf4;41ea825af0f262d9 opened
> 13 21:36:06,141 [tabletserver.Tablet] TABLET_HIST: 4ct;4d2d3be2823b0bf4;41ea825af0f262d9 closed
> 13 21:36:06,142 [tabletserver.Tablet] DEBUG: Files for low split 4ct;43422047c78fa52b;41ea825af0f262d9  [/t-0001cdi/F0001bmw.rf, /t-0001cdi/F0001bn1.rf]
> 13 21:36:06,142 [tabletserver.Tablet] DEBUG: Files for high split 4ct;4d2d3be2823b0bf4;43422047c78fa52b  [/t-0001cdi/A0001cef.rf, /t-0001cdi/F0001bmw.rf, /t-0001cdi/F0001bn1.rf]
> #split from other random walker
> 13 21:36:06,351 [tabletserver.Tablet] TABLET_HIST: 4ct;4d2d3be2823b0bf4;41ea825af0f262d9 split 4ct;43422047c78fa52b;41ea825af0f262d9 4ct;4d2d3be2823b0bf4;43422047c78fa52b
> {noformat}
> The following events occurred on the master and overlap in time with the split on the tablet server.
> {noformat}
> 13 21:36:06,312 [master.EventCoordinator] INFO : Merge state of 4ct;41ea825af0f262d9;27b693c626c2d4ef set to MERGING
> 13 21:36:06,312 [master.Master] DEBUG: Deleting tablets for 4ct;41ea825af0f262d9;27b693c626c2d4ef
> 13 21:36:06,316 [master.Master] DEBUG: Found following tablet 4ct;4d2d3be2823b0bf4;43422047c78fa52b
> 13 21:36:06,317 [master.Master] DEBUG: Making file deletion entries for 4ct;41ea825af0f262d9;27b693c626c2d4ef
> 13 21:36:06,325 [master.Master] DEBUG: Removing metadata table entries in range [4ct;27b693c626c2d4ef%00; : [] 9223372036854775807 false,4ct;41ea825af0f262d9%00; : [] 9223372036854775807 false)
> 13 21:36:06,331 [master.Master] DEBUG: Updating prevRow of 4ct;4d2d3be2823b0bf4;43422047c78fa52b to 27b693c626c2d4ef
> {noformat}
> After many hours of debugging, Eric and I figured out what was going on.  Two random walkers were running the concurrent test.  One client initiated a delete range on table id 4ct for the range 27b693c626c2d4ef to 41ea825af0f262d9.  While this delete range operation was occurring, another client added the split point 43422047c78fa52b.  The master read the metadata table while the split was occurring and got inconsistent/incomplete information about which tablets related to the delete range operation were online.  It assumed the required tablets were offline when they were not.  The log messages above show that the split and the master's update of the prevRow overlap in time.
> We think the best solution is to ensure that scans of the metadata table for merges and delete range are consistent with respect to end row and prev end row matching.  Tablets cannot be considered individually; the portion of the metadata table under consideration must form a proper sorted linked list.
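
The "proper sorted linked list" condition described above can be illustrated with a minimal sketch. This is not the fix that was committed and not Accumulo's API: the class TabletChainCheck, its Tablet type, and the method names are hypothetical, and each metadata row is reduced to an (end row, prev end row) pair with the \x01 prefix of the ~tab:~pr value stripped. The sample data is the five rows from the metadata listing in the report, which may be only part of the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical helper for illustration only; not part of Accumulo. */
final class TabletChainCheck {

    /** One metadata row reduced to its end row and prev end row. */
    static final class Tablet {
        final String endRow;
        final String prevEndRow;

        Tablet(String endRow, String prevEndRow) {
            this.endRow = endRow;
            this.prevEndRow = prevEndRow;
        }
    }

    /**
     * Walks tablets that are already sorted by end row and returns the end
     * rows of those whose prevEndRow differs from the end row of the tablet
     * immediately before them. An empty result means the scanned range
     * forms an unbroken chain.
     */
    static List<String> findBreaks(List<Tablet> sortedByEndRow) {
        List<String> breaks = new ArrayList<>();
        for (int i = 1; i < sortedByEndRow.size(); i++) {
            String expectedPrev = sortedByEndRow.get(i - 1).endRow;
            Tablet current = sortedByEndRow.get(i);
            if (!Objects.equals(expectedPrev, current.prevEndRow)) {
                breaks.add(current.endRow);
            }
        }
        return breaks;
    }

    /** The five rows of table 4ct shown in the metadata listing of the report. */
    static List<Tablet> reportedRows() {
        List<Tablet> rows = new ArrayList<>();
        rows.add(new Tablet("262249211a62cd6f", "1819e56edae21302"));
        rows.add(new Tablet("27b693c626c2d4ef", "262249211a62cd6f"));
        rows.add(new Tablet("43422047c78fa52b", "41ea825af0f262d9"));
        rows.add(new Tablet("4d2d3be2823b0bf4", "27b693c626c2d4ef"));
        rows.add(new Tablet("4f89df61392bb311", "4d2d3be2823b0bf4"));
        return rows;
    }
}
```

Calling TabletChainCheck.findBreaks(TabletChainCheck.reportedRows()) returns the end rows 43422047c78fa52b and 4d2d3be2823b0bf4. The first points at a prev end row (41ea825af0f262d9) that no listed tablet ends at, and the second points back at 27b693c626c2d4ef instead of its immediate predecessor. A scan used for merge or delete range could, under this sketch's assumptions, be retried until such a check comes back empty.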

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira