Posted to notifications@accumulo.apache.org by "Christopher Tubbs (JIRA)" <ji...@apache.org> on 2019/04/16 19:28:00 UTC

[jira] [Resolved] (ACCUMULO-4851) WAL recovery directory should be deleted before running LogSorter

     [ https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Tubbs resolved ACCUMULO-4851.
-----------------------------------------
    Resolution: Cannot Reproduce

Many WAL improvements have been made in the 1.9.x releases, so this is likely OBE (overtaken by events). However, please open a new issue at https://github.com/apache/accumulo/issues if it continues to be a problem.

> WAL recovery directory should be deleted before running LogSorter
> -----------------------------------------------------------------
>
>                 Key: ACCUMULO-4851
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4851
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>
> Noticed this one on a user's 1.7-ish system.
> A number of tablets (~9) were unassigned and reported on the Monitor as having failed to load. Digging into the exception, we could see the tablet load failed due to a FileNotFoundException:
> {noformat}
> 2018-04-09 19:57:08,475 [tserver.TabletServer] WARN : exception trying to assign tablet xk;... /accumulo/tables/xk/t-00pyzd0
> java.lang.RuntimeException: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>         at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:640)
>         at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:449)
>         at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2156)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
>         at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>         at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:480)
>         at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3012)
>         at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:590)
>         ... 9 more
> Caused by: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>         at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1446)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1438)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1454)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1823)
>         at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:456)
>         at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:429)
>         at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)
>         at org.apache.accumulo.tserver.log.MultiReader.<init>(MultiReader.java:113)
>         at org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:105)
>         at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:478)
>         ... 11 more
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : failed to open tablet xk;... reporting failure to master
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : rescheduling tablet load in 600.00 seconds
> {noformat}
> Upon further investigation of the recovery directory in HDFS for this WAL, we find the following:
> {noformat}
> $ hdfs dfs -ls -R /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/
> -rwxr--r--   3 accumulo hdfs          0 2018-04-06 22:12 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed
> -rwxr--r--   3 accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/finished
> drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000
> -rw-r--r--   3 accumulo hdfs    8040761 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000/data
> -rw-r--r--   3 accumulo hdfs        642 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000/index
> drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001
> -rw-r--r--   3 accumulo hdfs    8540196 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001/data
> -rw-r--r--   3 accumulo hdfs        524 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001/index
> drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002
> -rw-r--r--   3 accumulo hdfs    8150879 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002/data
> -rw-r--r--   3 accumulo hdfs        584 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002/index
> drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003
> -rw-r--r--   3 accumulo hdfs    8438021 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003/data
> -rw-r--r--   3 accumulo hdfs        630 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003/index
> drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004
> -rw-r--r--   3 accumulo hdfs    4956770 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004/data
> -rw-r--r--   3 accumulo hdfs        408 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004/index
> {noformat}
> The strange thing here is that we have both finished and failed markers for this WAL's recovery directory. Given the timestamps, it appears that TServer1 tried to do recovery and failed for some reason, and then TServer2 came along and successfully completed the LogSort.
> However, when the merged read of the sorted files came along, it treated the failed marker as a sorted chunk (hence the attempt to open failed/data in the stack trace above) and failed as a result.
> I think the simple solution would be to whack the recovery directory if it exists before running the LogSorter.
> Obligatory: I don't know if branches in Apache are verbatim to the fork I'm looking at. Identifying all relevant branches is a necessary step here.
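
For illustration only, the cleanup proposed in the issue could look roughly like the sketch below. It is not Accumulo code: it uses java.nio on a local temporary directory as a stand-in for HDFS, and the class and helper names (RecoveryDirSketch, deleteIfExists, sortedChunks) are hypothetical. The second helper is an assumption that goes beyond the issue's proposal; it only shows which entries in the listing above are sorted-chunk directories as opposed to zero-length marker files.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Sketch only: the local filesystem stands in for HDFS, and all names are hypothetical.
class RecoveryDirSketch {

    /** Recursively deletes a leftover recovery directory; returns false if none existed. */
    static boolean deleteIfExists(Path recoveryDir) throws IOException {
        if (!Files.exists(recoveryDir)) {
            return false;
        }
        Files.walkFileTree(recoveryDir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file); // marker files and chunk data/index files
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(dir); // directory is empty by the time we get here
                return FileVisitResult.CONTINUE;
            }
        });
        return true;
    }

    /** Lists only child directories (the part-r-* chunks), skipping plain marker files. */
    static List<String> sortedChunks(Path recoveryDir) throws IOException {
        List<String> chunks = new ArrayList<>();
        try (Stream<Path> children = Files.list(recoveryDir)) {
            children.filter(Files::isDirectory)
                    .forEach(p -> chunks.add(p.getFileName().toString()));
        }
        Collections.sort(chunks);
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Recreate the inconsistent state from the issue: chunks plus both markers.
        Path dir = Files.createTempDirectory("recovery-sketch");
        for (String part : new String[] {"part-r-00000", "part-r-00001"}) {
            Files.createDirectories(dir.resolve(part));
            Files.createFile(dir.resolve(part).resolve("data"));
            Files.createFile(dir.resolve(part).resolve("index"));
        }
        Files.createFile(dir.resolve("failed"));
        Files.createFile(dir.resolve("finished"));

        System.out.println("chunk directories: " + sortedChunks(dir));

        // Proposed fix: clear any earlier attempt before the sort starts writing.
        System.out.println("deleted stale directory: " + deleteIfExists(dir));
        System.out.println("still exists: " + Files.exists(dir));
    }
}
```

In a real deployment the equivalent calls would go through the Hadoop filesystem API rather than java.nio, and whether deleting is safe while another tablet server might still be sorting the same WAL is a question this sketch does not address.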



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)