Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2020/08/19 20:28:28 UTC

[GitHub] [accumulo] EdColeman opened a new issue #1689: Tserver in bad state may be writing corrupted files during compactions.

EdColeman opened a new issue #1689:
URL: https://github.com/apache/accumulo/issues/1689

The symptom is that the system is encountering an IOException (incorrect data check) where the zlib decompressor cannot uncompress a file. The file would have been created during a previous compaction.

The tserver seems to be in a bad state. If the condition(s) can't be corrected, it would be preferable to stop the server rather than let it write corrupt files.

On processing the corrupted files:

The IOException is thrown at line 348 or line 385 of Compactor.compactLocalityGroup (exceptions have occurred at both):
https://github.com/apache/accumulo/blob/7a2d12eaf785f924f555733a54f40828cdb2414f/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java#L348
or
https://github.com/apache/accumulo/blob/7a2d12eaf785f924f555733a54f40828cdb2414f/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Compactor.java#L385

The file(s) can be partially processed with rfile-info. You can examine the file metadata, but if you do something that reads the entire file (like -keyStats) then it fails with an exception reported in line 815 of RFile LocalityGroupReader.
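
To illustrate why this kind of corruption only surfaces when the whole stream is read, here is a minimal sketch that uses only `java.util.zip`. It is not Accumulo's RFile code path; the class and method names are hypothetical. It compresses a buffer, damages the trailing checksum, and shows that the damage is detected only on decompression.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration; not part of Accumulo.
public class ZlibCheckSketch {

  // Compress with the zlib wrapper, which appends an Adler-32 checksum.
  public static byte[] compress(byte[] input) {
    Deflater deflater = new Deflater();
    deflater.setInput(input);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[1024];
    while (!deflater.finished()) {
      int n = deflater.deflate(buf);
      out.write(buf, 0, n);
    }
    deflater.end();
    return out.toByteArray();
  }

  // Returns true only if the entire stream inflates and passes its checksum.
  public static boolean readsCleanly(byte[] compressed) {
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    byte[] buf = new byte[1024];
    try {
      while (!inflater.finished()) {
        int n = inflater.inflate(buf);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          return false; // stream ended early or cannot be decoded
        }
      }
      return true;
    } catch (DataFormatException e) {
      return false; // corruption is reported here, at read time
    } finally {
      inflater.end();
    }
  }

  public static void main(String[] args) {
    byte[] good = compress("example block".getBytes(StandardCharsets.UTF_8));
    byte[] bad = good.clone();
    bad[bad.length - 1] ^= 0x01; // damage the trailing checksum
    System.out.println("good reads cleanly: " + readsCleanly(good));
    System.out.println("bad reads cleanly:  " + readsCleanly(bad));
  }
}
```

Nothing fails when the damaged bytes are written; the writer has no way to notice. That matches the delay described above between the compaction that produced the file and the later compaction that tripped over it.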

It looks like the tserver was in an unhealthy state when the files were compacted. Potential contributors:

1) The dynamic class loader kept rebuilding continually. The AccumuloReloadingVFSClassLoader run loop just keeps executing - the message at line 83 appears frequently in the log.
https://github.com/apache/accumulo/blob/7a2d12eaf785f924f555733a54f40828cdb2414f/start/src/main/java/org/apache/accumulo/start/classloader/vfs/AccumuloReloadingVFSClassLoader.java#L83

2) The tablet servers are throwing a null pointer exception that is being logged by https://github.com/apache/accumulo/blob/7a2d12eaf785f924f555733a54f40828cdb2414f/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServerResourceManager.java#L452

There may be other issues, but these stand out in the logs.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] ctubbsii commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-694603973

@EdColeman Thanks for the code in #1709. I get the gist of what you're proposing as a solution (to delete its own lock and self-terminate). But I still don't understand the problem from the above description. You said there's a `NullPointerException` being thrown, but it's not possible for it to be thrown in the code you updated... that's just where it is finally caught and handled... so it must be thrown somewhere in `LargestFirstMemoryManager.getMemoryManagementActions(...)` or something it calls, but without a stack trace it's not clear from the above description where that could be.

I'm suspicious of Tables.exists returning null (presumably, from line 180 in `LargestFirstMemoryManager`, looking at the 1.10 code), instead of an empty list of children. And, in any case, I don't see how the list can be empty in normal circumstances, because there should always be *some* children there. If there are abnormal circumstances, then I don't think your proposed fix is really going to solve much... because you're just fighting the symptom. If symptoms from ZooKeeper are **that** abnormal, other code is likely going to hit that same broken code path.

I think what we need is a stack trace, in order to try to track down the actual underlying cause, and the specific code that threw the exception.


[GitHub] [accumulo] EdColeman commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696160463


   ZooKeeper "should" not be reloading - I believe the jars that change are custom iterators.



[GitHub] [accumulo] ctubbsii commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-694666753


   Ugh. Classloaders. If you can't trust your class loader, you can't trust anything. I don't know how we can reasonably protect against class loader problems... because those can cause problems *anywhere* in our code, including any code we write to try to guard against problems.
   
   Another possible issue is the attempt to catch `Throwable` and handle it... which could include unrecoverable `Error`s. We may be causing more problems than we are solving by catching things like `OutOfMemoryError` and trying to do something to handle it. IMO, we should never catch `Throwable`, unless we're going to immediately halt the JVM. I can't help but wonder if there was an earlier `Error` thrown prior to the NPE.



[GitHub] [accumulo] ctubbsii commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696407230

I'm pretty confident in saying that I don't think this is an Accumulo memory manager bug at all. Perhaps this is a hardware bug (it seems like some sort of memory/storage corruption).

Perhaps it can be mitigated within the classloader, through detection of rapid change events? However, given the direction we're headed regarding the classloader changes (see mailing list for discussion about that), I'm not sure what changes to the existing classloader, if any, would be worth it at this time.
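
One way to picture "detection of rapid change events" is a sliding-window counter. This is a minimal sketch under assumptions: the class name, the threshold, and the window are hypothetical, and nothing here is taken from the existing classloader code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper; not part of Accumulo.
public class ReloadRateMonitor {
  private final int maxEvents;
  private final long windowMillis;
  private final Deque<Long> timestamps = new ArrayDeque<>();

  public ReloadRateMonitor(int maxEvents, long windowMillis) {
    this.maxEvents = maxEvents;
    this.windowMillis = windowMillis;
  }

  // Records one reload at nowMillis and returns true when more than
  // maxEvents reloads fall inside the trailing window.
  public synchronized boolean recordAndCheck(long nowMillis) {
    timestamps.addLast(nowMillis);
    // Drop events that are older than the window.
    while (!timestamps.isEmpty() && nowMillis - timestamps.peekFirst() > windowMillis) {
      timestamps.removeFirst();
    }
    return timestamps.size() > maxEvents;
  }
}
```

A reload loop could call `recordAndCheck(System.currentTimeMillis())` before each rebuild and, when it returns true, log loudly or stop reloading. What the right reaction is remains an open design question.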

Related issue: we should really stop trying to catch `Throwable`... and limit ourselves to catching `Exception`, at most (preferably more narrow than that, if possible). In this code, and in lots of other places, it looks like we could have caught a `Throwable` (which includes `Error`s), assumed we could handle it, and possibly only logged (or failed to log) and moved on, when in fact, we have no chance of recovery (such as the case with `Error`s like `OutOfMemoryError`). According [to Oracle's troubleshooting guide](https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/memleaks002.html), an `OutOfMemoryError` may be thrown when there's not enough native memory to load a class... which is a situation that seems like it could cause the reloading behavior you saw, as well as other memory corruption issues, if we didn't just exit immediately and instead tried to "handle" it and continue after hitting an `OutOfMemoryError`.
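
A minimal sketch of the distinction, assuming a simplified loop: `Exception` is treated as recoverable, while an `Error` is caught only to hand it to a fatal handler and stop. The interfaces and names are hypothetical and are not the `TabletServerResourceManager` code.

```java
// Hypothetical illustration; not part of Accumulo.
public class MemoryManagerLoopSketch {

  public interface Step {
    void run() throws Exception;
  }

  public interface FatalHandler {
    void onFatal(Throwable t);
  }

  // Runs the step up to maxIterations times and returns the number of
  // recoverable failures seen. An Error ends the loop immediately.
  public static int runLoop(Step step, int maxIterations, FatalHandler fatal) {
    int failures = 0;
    for (int i = 0; i < maxIterations; i++) {
      try {
        step.run();
      } catch (Exception e) {
        failures++; // recoverable: log it and keep going
      } catch (Error e) {
        // Not recoverable. A real server would halt here, for example with
        // Runtime.getRuntime().halt(1); the handler keeps the sketch testable.
        fatal.onFatal(e);
        return failures;
      }
    }
    return failures;
  }
}
```

The point of the second `catch` is that it never tries to continue: the only thing done with an `Error` is to report it and stop.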

Were there low-memory issues on the machine(s) where you saw this occur?


[GitHub] [accumulo] dlmarion commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

dlmarion commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696755414


   [1] is what is invoked when the TabletServer loses its lock. I think we
   should create three issues for this:
   
   1. Remove / Replace the catching of Throwable in the code base
   2. Addition of [2] in server components that handle the case where
   exceptions are thrown in critical threads
   3. A solution to this particular problem, which may involve a little of
   both, but can be easily backported to 1.10.
   
   [1]
   https://github.com/apache/accumulo/blob/530e3dec5361244118857789054e12a53d96d34b/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L651
   [2]
   https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Thread.UncaughtExceptionHandler.html
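
A minimal sketch of how the `Thread.UncaughtExceptionHandler` from [2] could be attached to a critical thread. The class and method names are hypothetical; in a real server the handler would presumably halt the process rather than record the cause.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration; not part of Accumulo.
public class CriticalThreadSketch {

  // Starts a daemon thread whose uncaught failures are passed to onDeath.
  public static Thread startCritical(String name, Runnable task,
      Thread.UncaughtExceptionHandler onDeath) {
    Thread t = new Thread(task, name);
    t.setDaemon(true);
    t.setUncaughtExceptionHandler(onDeath);
    t.start();
    return t;
  }

  // Demo helper: runs the task on a critical thread and returns whatever
  // killed it, or null if nothing was reported within the timeout.
  public static Throwable captureDeath(Runnable task, long timeoutMillis) {
    CountDownLatch done = new CountDownLatch(1);
    AtomicReference<Throwable> cause = new AtomicReference<>();
    startCritical("memory-manager", task, (thread, err) -> {
      // A real handler might log and then call Runtime.getRuntime().halt(1).
      cause.set(err);
      done.countDown();
    });
    try {
      done.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return cause.get();
  }
}
```

The handler only fires when the exception escapes the thread's `run()` method, so it does not help while a `catch (Throwable)` inside the loop swallows the failure first.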
   
   On Tue, Sep 22, 2020 at 10:04 AM EdColeman <no...@github.com> wrote:
   
   > Is there any recommendation on how to kill the tserver?
   


[GitHub] [accumulo] ctubbsii edited a comment on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

ctubbsii edited a comment on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696185641


   Is the VFS reloading common (with these kinds of errors)?


[GitHub] [accumulo] EdColeman commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-694859786


   Sample stack trace:
   
   For version 1.9.3
   
   ```
   [time] [tserver.TableServerResourceManager] TabletServerResourceManager ERROR: Memory manager failed null
   java.lang.NullPointerException
       at org.apache.accumulo.core.client.impl.Tables.exists(Tables.java:148)
       at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.tableExists(LargestFirstMemoryManager.java:153)
       at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.getMemoryManagementActions(LargestFirstMemoryManager.java:180)
       at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework.manageMemory(TabletServerResourceManager.java:440)
       at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework.access$400(TabletServerResourceManager.java:349)
       at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework.$2.run(TabletServerResourceManager.java:377)
       at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
       at java.lang.Thread.run(Thread.java:748)
   
   ```



[GitHub] [accumulo] EdColeman commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-694500736


   (Using 1.10 code) when the tserver gets into a bad state, it looks like zooCache may be returning null in the Tables.exists() check (Tables - line 147).  
   
   In TabletServerResourceManager, line 451 has a catch of `Throwable` with just a log statement.  The code is in a continuous loop and I believe the code after the error is correctly guarded, but the loop will never end.
   
   I don't think that killing the runnable would work - the tserver might never notice it lost the memory manager thread.  
   
   I think ZooKeeper is available - how bad would it be if, on catching the exception, it just deleted the tablet server lock and thereby killed the server?  That would be preferable to writing corrupt data, but maybe there are other "recoverable errors"?
   
   



[GitHub] [accumulo] EdColeman commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-694880436


   Timeline - maybe this helps to provide context. An abbreviated timeline of the messages logged:
   ```
   16:34:20,453 - tserver started and looks to be normal assignments and no errors.
   16:41:04,036 - Rebuilding dynamic classloader using files...
   16:41:05,045 - [same]
   16:41:05,114 - [same]
   16:47:36,354 - [same]
   16:47:36,414 - [same] 452 times over the next 18 hrs.
   ...
   16:51:02,398 - WARN loadTablet message from a master that does not hold the master lock
   ...
   16:53:22,105 - ERROR: memory manager failed null.
     these continued for the next 18 hrs - 2300 times.
   
   next day
   10:05:29,983 - tserver dies with a segfault - Problematic frame: J 8944 java.util.HashSet.add(Ljava/lang/Object;)Z (20 bytes)
   
   
   The only ERROR log messages reported:
   
   16:51:02,398 - bad master lock / loadTablet request.
   16:53:22,105 - First memory manager failed.
   19:18:01,855 - vfs.AccumuloReloadingVFSClassLoader ERROR: Invalid URI escape sequence
   
   ```
   



[GitHub] [accumulo] ctubbsii commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-694503117


   @EdColeman Can you provide an example patch of what changes you mean to propose? I'm having a hard time following the narrative.



[GitHub] [accumulo] ctubbsii commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696185641


   Is the VFS reloading common?



[GitHub] [accumulo] EdColeman commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-679360430


   I removed the 1.10 tag so that it does not impede the 1.10 release.
   
   This occurred with Accumulo 1.9.3.  It seems likely that this issue has been around for a while and is not new to 1.9.  There may be mitigations that limit the issues with (re)loading custom iterators (e.g., deploying jars locally to the tserver).  Because this may be difficult to reproduce, developing fixes may be more complicated or have other (as yet unexplored) consequences.  Resolution of this issue does not need to block 1.10.  If a solution is found to be feasible then it could be made available as a patch. 
   
   Resolving this for a future 2.x version may be more practical. 



[GitHub] [accumulo] ctubbsii commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-718248255


   Another related JIRA issue: https://issues.apache.org/jira/browse/ACCUMULO-2495



[GitHub] [accumulo] EdColeman commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-694639463

It has occurred more than once; however, I'm not sure how to approach recreating it in a test environment. It seems the trigger is that the dynamic classloader starts to thrash, continually reloading - I then suspect that this is consuming memory or other resources until things fail. One thing that sets off the classloader is when new jars are deployed in hdfs for the system, but then it's only one or two servers that get into this state. On another occasion, it was on a server restart.

I'd like to treat the classloader as a separate issue. The classloader issue may be a trigger, but the fact that the tserver can get into this state and write corrupt files probably should be addressed. Even if that one trigger is eliminated, maybe there are others?

The stacktrace points to the NullPointer coming from the Tables.exists() call - I'll work on getting a more detailed representation of it.

It's clear that the tserver is unhealthy, and I agree, I am skeptical of Tables.exists returning a null, but if things are that bad, then I was trying to see if there was an approach that, philosophically, would pull the plug to eliminate or at least greatly reduce the blast footprint.

The thing is that eventually the tserver does seem to die with a segfault - but only after a lengthy time (days) and the potential for corrupt files being written and causing data loss the whole time the tserver is in this state. The bad files are only uncovered during the next compaction, so there is quite a delay between when the damage occurred and when Accumulo reports an issue other than in the tserver logs. To find the problem, it is necessary to take the file reported as corrupt (it shows up in the UI error log) and then grep across all of the tserver log files looking for the compaction plan log message that created said file.
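
The log search described above can be sketched as a small helper. This is only an illustration: the class name is hypothetical, no particular log message format is assumed, and it simply returns every line that mentions the reported file name so the compaction message can be picked out by eye.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; not part of Accumulo.
public class CompactionLogSearch {

  // Returns the lines that mention the given file name.
  public static List<String> matchingLines(List<String> lines, String rfileName) {
    List<String> hits = new ArrayList<>();
    for (String line : lines) {
      if (line.contains(rfileName)) {
        hits.add(line);
      }
    }
    return hits;
  }

  // Scans every regular file directly under logDir, prefixing each hit
  // with the name of the log file it came from.
  public static List<String> search(Path logDir, String rfileName) throws IOException {
    List<Path> files;
    try (Stream<Path> listing = Files.list(logDir)) {
      files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
    }
    List<String> hits = new ArrayList<>();
    for (Path file : files) {
      for (String line : matchingLines(Files.readAllLines(file), rfileName)) {
        hits.add(file.getFileName() + ": " + line);
      }
    }
    return hits;
  }
}
```

`Files.readAllLines` assumes UTF-8 and loads each file into memory, so for large or rotated logs a streaming read, or plain `grep`, is likely the more practical choice.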

Other than the corrupted files, the tserver seems to be operating normally - but that might not be a correct assumption - there are no other indications of errors - but that might be different than operating as intended. The delay for discovery does not help in answering this question.

As far as 2.x - 1) again, not sure how to reliably trigger this. 2) with 2.x there could be an entirely different approach.

There is an open issue - https://github.com/apache/accumulo/issues/946 that could be implemented to achieve this. My thought was that if the tserver knew that it needed a memory manager thread and that thread dies, then the tserver could take action - to either re-spawn a new process or terminate itself. If the tserver was aware, then throwing an exception that kills the thread would be appropriate. As it is now, I think that if the memory manager thread dies, the tserver would not know and that seemed like a sub-optimal condition, so I was exploring if it could be appropriate to kill the server directly - and removing the lock was the most direct way that I thought of.
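
The "tserver is aware of its critical thread" idea could look roughly like the watchdog below. It is a sketch under assumptions: the names, the respawn limit, and the polling model are hypothetical and are not taken from issue #946.

```java
import java.util.function.Supplier;

// Hypothetical illustration; not part of Accumulo.
public class ThreadWatchdogSketch {

  public enum Action { NONE, RESPAWNED, GIVE_UP }

  private final Supplier<Thread> factory;
  private final int maxRespawns;
  private Thread current;
  private int respawns;

  // The factory must return a new, unstarted thread on every call.
  public ThreadWatchdogSketch(Supplier<Thread> factory, int maxRespawns) {
    this.factory = factory;
    this.maxRespawns = maxRespawns;
    this.current = factory.get();
    this.current.start();
  }

  // Called periodically by the owning server. Restarts a dead thread until
  // the respawn budget is used up, then tells the caller to give up.
  public synchronized Action check() {
    if (current.isAlive()) {
      return Action.NONE;
    }
    if (respawns >= maxRespawns) {
      return Action.GIVE_UP; // the owner would terminate the server here
    }
    respawns++;
    current = factory.get();
    current.start();
    return Action.RESPAWNED;
  }

  public synchronized Thread current() {
    return current;
  }
}
```

On `GIVE_UP` the owner would take the "terminate itself" branch, for example by releasing its lock or halting. Whether a respawn is ever safe in a JVM that is already misbehaving is exactly the question raised in this thread.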

Implementing critical thread monitoring and recovery would be more comprehensive and probably a better approach - but the changes would be significant enough that I would not call them a bug fix. Being that this is happening on 1.9.3 and was not addressed in 1.10, I think a bug fix is appropriate and necessary. Being that this has the potential for data loss, having a fix in 1.10.1 would likely be suitable for people to backport and patch - and I suspect the desire for that is more immediate than could be achieved by starting with 2.x as the target.


[GitHub] [accumulo] EdColeman commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696323167


   Yes - each time the dynamic classloader starts rebuilding continuously.  After a time, the memory manager then starts throwing the null pointer.
   
   Another thing pointed out today: sometimes the path passed to the classloader is invalid. I've found cases where it's truncated, or has "invalid characters" (it's been plus signs, exclamation points and pound signs) - however, in the most recent case, none of the invalid paths show up until well after the reload death-cycle has started.
   



[GitHub] [accumulo] EdColeman commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-694516735


   Concept in https://github.com/apache/accumulo/pull/1709  (Not even sure if it will compile - I know the code is incorrect, but should be enough for the gist)



[GitHub] [accumulo] ctubbsii commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696180404


   > The errors listed in the summary are my summary descriptions - the actual errors were in the stack trace.
   
   Somehow my grep failed me. I see the message now in `TabletServerResourceManager`. :smiley_cat:
   



[GitHub] [accumulo] EdColeman commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696422426


   Looking at the log messages for server.GarbageCollectionLogger the collection times for ConcurrentMarkAndSweep are 0.11 and it's reporting plenty of free memory.



[GitHub] [accumulo] ctubbsii commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-695053928


   Thanks, that helps a lot.
   
   > 16:51:02,398 - WARN loadTablet message from a master that does not hold the master lock
   
   Hmm. Did a master get into a bad state first and fail over to a secondary? This seems weird. I wonder if the master was having issues also.
   
   > 19:18:01,855 - vfs.AccumuloReloadingVFSClassLoader ERROR: Invalid URI escape sequence
   
   What's this? This looks like it was triggered by a bad VFS classloader configuration. Maybe that'd be a good place to guard against. It'd be very hard to protect code against a bad classloader... because it could be doing bad things anywhere. But, it should be possible to protect the classloader from bad configuration / user error.
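
A minimal sketch of guarding against bad classpath entries before they reach the classloader. It uses `java.net.URI` as a stand-in for whatever parser the VFS layer really uses, and it assumes every entry is meant to be a plain absolute URI. Real configuration values may include patterns or local paths that this check would wrongly reject. The class name and the sample entries are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of Accumulo.
public class ClasspathUriCheck {

  // Returns the entries that do not parse as absolute URIs.
  public static List<String> invalidEntries(List<String> entries) {
    List<String> bad = new ArrayList<>();
    for (String entry : entries) {
      try {
        URI uri = new URI(entry.trim());
        if (!uri.isAbsolute()) {
          bad.add(entry); // no scheme, possibly a truncated value
        }
      } catch (URISyntaxException e) {
        bad.add(entry); // for example a malformed % escape
      }
    }
    return bad;
  }

  public static void main(String[] args) {
    List<String> entries = Arrays.asList(
        "hdfs://namenode:8020/accumulo/lib/iterators.jar",
        "hdfs://namenode:8020/accumulo/lib/iter%zz.jar");
    System.out.println("rejected: " + invalidEntries(entries));
  }
}
```

Rejecting a bad entry up front, and keeping the previous classpath, would at least turn a repeated runtime error into a single clear configuration message.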
   
   > 16:53:22,105 - ERROR: memory manager failed null.
   > 16:53:22,105 - First memory manager failed.
   
   I don't recognize either of these two messages, and cannot find anything in the 1.9.3 code that looks like these messages. Are you sure you aren't seeing this error in some fork of Accumulo with custom code in these areas?
   
   > ```
   > [time] [tserver.TableServerResourceManager] TabletServerResourceManager ERROR: Memory manager failed null
   > java.lang.NullPointerException
   >     at org.apache.accumulo.core.client.impl.Tables.exists(Tables.java:148)
   >     at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.tableExists(LargestFirstMemoryManager.java:153)
   > ```
   
   This certainly looks like ZooKeeper returned null for the list of children on that entry. However, I can see no way that would be possible unless the ZooKeeper client threads were somehow also in a bad state. If that's the case, then while you saw this error occur in the memory manager... it could actually occur pretty much anywhere. I wonder: were the ZooKeeper jars being loaded via the reloading VFS classloader? If so, that might be a ready explanation for how the ZooKeeper code failed here as a consequence of a bug or failure condition in that classloader, and indicate that the underlying issue is that classloader, and that this stack trace might be a red herring.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] ctubbsii commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696407230








[GitHub] [accumulo] ctubbsii edited a comment on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

ctubbsii edited a comment on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696185641


   Is the VFS reloading common (with these kinds of errors)?



[GitHub] [accumulo] dlmarion commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

dlmarion commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-744734358


   #1818 handles the case where exceptions are thrown in threads. I can look at the places where we are still catching Throwable.



[GitHub] [accumulo] EdColeman commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696184454


   The `WARN loadTablet message from a master that does not hold the master lock` does not seem common with occurrences of this situation. At least for the other occurrence that I have logs for, the error does not appear to be in the log.



[GitHub] [accumulo] ctubbsii commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-694604569


   Also, is this issue reproducible or was it a one-off thing you're trying to track down? It'd be great if it were reproducible on 2.x, so we can try to fix it there first.



[GitHub] [accumulo] EdColeman commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696743454


   Is there any recommendation on how to kill the tserver?



[GitHub] [accumulo] EdColeman commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696159392


   The errors listed in the summary are my summary descriptions - the actual errors were in the stack trace.  
   
   The Invalid URI error occurred hours after the memory manager failed, so it does not seem to be the trigger.
   
   I do not know if the master lock / fail over is common to each occurrence - I noted it here in case it was significant, but I cannot say it was common.



[GitHub] [accumulo] dlmarion closed issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

dlmarion closed issue #1689:
URL: https://github.com/apache/accumulo/issues/1689


   



[GitHub] [accumulo] ctubbsii commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-718234478


   I ran into this old JIRA issue that seems to be the same issue: https://issues.apache.org/jira/browse/ACCUMULO-1708



[GitHub] [accumulo] EdColeman commented on issue #1689: Tserver in bad state may be writing corrupted files during compactions.

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #1689:
URL: https://github.com/apache/accumulo/issues/1689#issuecomment-696434005

Overall, there may be a difference in philosophy or focus. I do not disagree with the premises of anything that you wrote - except I think I'm coming at it from a different angle.

I'm aware of the classloader rework - that's one reason I was less focused on the "how it got that way" and am trying to address "no matter how we got here, can we at least stop writing corrupt files".

Whatever the root cause, we should try to protect ourselves, and if we can't recover, then at least stop corrupting data. It seems very likely as this has unfolded that things are pointing to something external to Accumulo. But, I think it's a bug that Accumulo keeps working (and working incorrectly). It should be a given that the hardware works - but it is impossible to provide that guarantee - things go wrong.

I agree that catching Throwable may not always be appropriate - in this case it is not - so, for this one case, is there an acceptable solution? I've proposed one way.

A second, and more general way could be to leverage `Thread.UncaughtExceptionHandler` - using that, the tserver could create the threads and assign a handler that would do essentially the same thing - stop the tserver, either by deleting the lock or whatever the preferred mechanism is, if the underlying "critical" thread dies. Then we don't need to guard against unexpected exceptions - we let them kill the thread - and then decide to either kill the tserver or maybe spawn a new thread - if that could be determined to be appropriate and safe.
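
A minimal sketch of the handler idea, under stated assumptions: `CriticalThreads`, `FatalAction`, and `newCriticalThread` are hypothetical names used for illustration, not existing Accumulo classes, and the "stop the tserver" step is left as a callback because the preferred mechanism is still an open question. Only `Thread.setUncaughtExceptionHandler` is the real JDK API.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CriticalThreads {

  /** Hypothetical hook: what to do when a critical thread dies (e.g. release the lock and halt). */
  interface FatalAction {
    void onFatal(String threadName, Throwable cause);
  }

  /** Creates a thread whose uncaught exceptions trigger the given action. */
  static Thread newCriticalThread(String name, Runnable task, FatalAction action) {
    Thread t = new Thread(task, name);
    // The handler runs in the dying thread, after the exception has escaped run().
    t.setUncaughtExceptionHandler((thread, cause) -> action.onFatal(thread.getName(), cause));
    return t;
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicReference<String> seen = new AtomicReference<>();
    Thread t = newCriticalThread("memory-manager",
        () -> {
          throw new NullPointerException("simulated failure");
        },
        (threadName, cause) -> seen.set(threadName + " died: " + cause));
    t.start();
    t.join();
    // A real server would stop here (for example via Runtime.getRuntime().halt(1))
    // instead of printing; this sketch only records that the handler fired.
    System.out.println(seen.get());
  }
}
```

With this shape, the task code does not need a blanket `catch (Throwable)`: an unexpected exception ends the thread, and the decision to stop the process or start a replacement thread lives in one place.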

1) So, in general - for cases where we are catching `Throwable` and that is causing issues - would it be better if we stopped the tserver?

2) If it is determined that we want to stop, is deleting the lock acceptable, or is there a preferred alternate method?

While catching and swallowing Throwable in general is a bigger issue, for this one case, where we can show that it is not appropriate, can we fix that and then examine the larger issue as time allows or when other cases are identified?
