Posted to notifications@accumulo.apache.org by "Keith Turner (JIRA)" <ji...@apache.org> on 2013/05/23 21:09:20 UTC

[jira] [Commented] (ACCUMULO-1453) Track tablet migrations and failed loads

    [ https://issues.apache.org/jira/browse/ACCUMULO-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665507#comment-13665507 ] 

Keith Turner commented on ACCUMULO-1453:
----------------------------------------

One of the usual suspects for failed loads is write-ahead log (walog) recovery problems.  The RFile problem mentioned in the issue description seems trickier to isolate.  If a tablet has a problematic file, the tablet will likely load successfully; the failure will occur later, when a scan of the RFile is attempted.  When a tablet server fails, how do you know which tablet(s) caused the problem?

Another approach to solving this issue may be to identify what can cause tablet server failure and try to defend against those causes.  One possible cause is a key/value in an RFile that exceeds available memory.  This would be easy to defend against by making Accumulo refuse to load key/values that are too large.  Another possible cause is an iterator that runs amok and consumes all memory.  This is harder to defend against; ACCUMULO-1188 is one approach.
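
To make the "refuse to load key/values that are too large" idea concrete, here is a rough sketch.  It is not Accumulo code: the class name KeyValueSizeGuard, its methods, and the limit are all hypothetical.  It assumes the reader learns the key and value lengths before allocating buffers for them, so the check can run before any large allocation happens.

```java
/**
 * Hypothetical guard, not part of Accumulo. It rejects entries whose
 * declared size exceeds a configured limit, so the caller can bail out
 * before allocating memory for the key and value.
 */
public class KeyValueSizeGuard {
    private final long maxEntryBytes;

    public KeyValueSizeGuard(long maxEntryBytes) {
        if (maxEntryBytes <= 0) {
            throw new IllegalArgumentException("maxEntryBytes must be positive");
        }
        this.maxEntryBytes = maxEntryBytes;
    }

    /** Returns true if an entry with these declared lengths is small enough to load. */
    public boolean accepts(int keyLength, int valueLength) {
        if (keyLength < 0 || valueLength < 0) {
            return false; // a negative length suggests a corrupt file
        }
        // Widen to long so the sum of two large ints cannot overflow.
        return (long) keyLength + (long) valueLength <= maxEntryBytes;
    }

    /** Throws instead of returning false, for callers that want to fail the read. */
    public void check(int keyLength, int valueLength) {
        if (!accepts(keyLength, valueLength)) {
            throw new IllegalStateException("Refusing to load key/value of declared size "
                + keyLength + "+" + valueLength + " bytes (limit " + maxEntryBytes + ")");
        }
    }
}
```

A reader would call check(keyLength, valueLength) right after decoding the lengths and before reading the bytes.  Where the limit would come from (a table property, a fraction of heap, etc.) is left open here.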


                
> Track tablet migrations and failed loads
> ----------------------------------------
>
>                 Key: ACCUMULO-1453
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1453
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: master, tserver
>    Affects Versions: 1.5.0, 1.4.3
>            Reporter: Mike Drob
>
> If a bad RFile or Tablet somehow gets in the system and brings down a tserver, then as the master migrates it to other servers it will likely cause cascading failures.
> It might be a good idea to keep track of how many consecutive failures to load there are for a given tablet, and either warn or refuse to host the tablet if this value exceeds a given threshold.
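
A minimal sketch of the consecutive-failure counter described in the issue might look like the following.  Everything here is an assumption for illustration: the class TabletLoadFailureTracker, its method names, and the use of a String tablet identifier are hypothetical and are not taken from the Accumulo master code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical tracker, not part of Accumulo. It counts consecutive
 * failed loads per tablet and reports when the count exceeds a threshold.
 */
public class TabletLoadFailureTracker {
    private final int threshold;
    private final ConcurrentMap<String, Integer> failures = new ConcurrentHashMap<>();

    public TabletLoadFailureTracker(int threshold) {
        if (threshold < 0) {
            throw new IllegalArgumentException("threshold must not be negative");
        }
        this.threshold = threshold;
    }

    /** Records one failed load and returns the new consecutive-failure count. */
    public int recordFailure(String tabletId) {
        return failures.merge(tabletId, 1, Integer::sum);
    }

    /** A successful load breaks the streak, so the count is dropped. */
    public void recordSuccess(String tabletId) {
        failures.remove(tabletId);
    }

    public int consecutiveFailures(String tabletId) {
        return failures.getOrDefault(tabletId, 0);
    }

    /** True once the consecutive-failure count exceeds the threshold. */
    public boolean exceedsThreshold(String tabletId) {
        return consecutiveFailures(tabletId) > threshold;
    }
}
```

Whether exceedsThreshold should lead to a warning or to refusing to host the tablet is the policy question the issue raises; the sketch only tracks the count.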
