Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2022/11/18 21:51:14 UTC

[GitHub] [accumulo] dlmarion opened a new issue, #3087: Client in endless loop when ZooKeeper unavailable

dlmarion opened a new issue, #3087:
URL: https://github.com/apache/accumulo/issues/3087

   User running 1.10.2 reporting that the Accumulo client gets stuck in an endless loop when ZooKeeper is unavailable.  They'd like to be able to break out of the loop.
   
   ```
   Saw (possibly) transient exception communicating with ZooKeeper, will retry
    org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>
    	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2256)
    	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:394)
    	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:366)
    	at org.apache.accumulo.fate.zookeeper.ZooCache$ZooRunnable.retry(ZooCache.java:260)
    	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:423)
    	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:352)
    	at org.apache.accumulo.core.client.ZooKeeperInstance.getInstanceID(ZooKeeperInstance.java:207)
    	at org.apache.accumulo.core.client.impl.Tables.getTableMap(Tables.java:134)
    	at org.apache.accumulo.core.client.impl.Tables.getTableMap(Tables.java:122)
    	at org.apache.accumulo.core.client.impl.Tables.getNameToIdMap(Tables.java:105)
    	at org.apache.accumulo.core.client.impl.Tables._getTableId(Tables.java:79)
    	at org.apache.accumulo.core.client.impl.Tables.getTableId(Tables.java:71)
    	at org.apache.accumulo.core.client.impl.ConnectorImpl.getTableId(ConnectorImpl.java:88)
    	at org.apache.accumulo.core.client.impl.ConnectorImpl.createBatchWriter(ConnectorImpl.java:143)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] dlmarion commented on issue #3087: Client in endless loop when ZooKeeper unavailable

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.

dlmarion commented on issue #3087:
URL: https://github.com/apache/accumulo/issues/3087#issuecomment-1398788327

   Closing for now, can be re-opened if needed.


[GitHub] [accumulo] EdColeman commented on issue #3087: Client in endless loop when ZooKeeper unavailable

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #3087:
URL: https://github.com/apache/accumulo/issues/3087#issuecomment-1320570149

   Without ZooKeeper how is the cluster running?


[GitHub] [accumulo] ctubbsii commented on issue #3087: Client in endless loop when ZooKeeper unavailable

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #3087:
URL: https://github.com/apache/accumulo/issues/3087#issuecomment-1323020472

   I was responding to the general behavior described in the original post and first few comments. If all that is being requested here is a timeout option for a particular API endpoint, that's probably doable, as long as we're careful to not break existing expected default behavior.
   
   As a note about possible implementation, I will add that adding timeouts to APIs is tricky when the internal components are independently blocking. Having a monitor thread outside the call to the API generally seems more reliable when you want to be sure that it doesn't get stuck in some arbitrary code block that's blocking. Maybe it makes sense to just do that by running stuff in a separate thread, as part of a future asynchronous API instead of extending the existing synchronous APIs with timeout parameters.
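
One way to picture the "monitor outside the call" idea is a small wrapper that runs the blocking call on a worker thread and bounds how long the caller waits. This is only a sketch: `TimedCall` and `callWithTimeout` are hypothetical names, not part of Accumulo. The wrapper can only hand control back to the caller; whether the worker itself stops depends on the blocked code honoring interrupts.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not an Accumulo API): bounds how long the caller waits. */
final class TimedCall {
  private TimedCall() {}

  /**
   * Runs {@code call} on {@code pool} and waits at most {@code timeout}.
   * On timeout the task is asked to stop via interruption and the
   * TimeoutException is rethrown so the caller can react.
   */
  static <T> T callWithTimeout(ExecutorService pool, Callable<T> call,
      long timeout, TimeUnit unit)
      throws TimeoutException, ExecutionException, InterruptedException {
    Future<T> future = pool.submit(call);
    try {
      return future.get(timeout, unit);
    } catch (TimeoutException e) {
      // Only a request: a task that ignores interrupts keeps running.
      future.cancel(true);
      throw e;
    }
  }
}
```

A caller could wrap a blocking client call in the `Callable` and treat `TimeoutException` as the signal to fall back to some other path.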


[GitHub] [accumulo] jwomeara commented on issue #3087: Client in endless loop when ZooKeeper unavailable

Posted by GitBox <gi...@apache.org>.

jwomeara commented on issue #3087:
URL: https://github.com/apache/accumulo/issues/3087#issuecomment-1330891703

   I think I'm going to just revisit this once we are building against accumulo 2.x.  For now, I am going to work around this by adding accumulo health information to the audit service so that it can mark itself as unhealthy if we are hung on accumulo calls.  This will trigger a restart of the service.
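
A health check along these lines could be sketched as a registry of in-flight calls that reports unhealthy once any call has been outstanding longer than a threshold. `HungCallMonitor` and its methods are hypothetical names chosen for illustration; how the audit service exposes its health or gets restarted is outside the sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: tracks in-flight calls so a health check can spot hung ones. */
final class HungCallMonitor {
  private final Map<Long, Long> startNanosById = new ConcurrentHashMap<>();
  private final AtomicLong nextId = new AtomicLong();
  private final long maxNanos;

  HungCallMonitor(long maxMillis) {
    this.maxNanos = maxMillis * 1_000_000L;
  }

  /** Call just before the blocking operation; returns a token for end(). */
  long begin() {
    return begin(System.nanoTime());
  }

  /** Same as begin(), with an explicit clock reading (useful in tests). */
  long begin(long nowNanos) {
    long id = nextId.incrementAndGet();
    startNanosById.put(id, nowNanos);
    return id;
  }

  /** Call in a finally block once the operation returns or throws. */
  void end(long id) {
    startNanosById.remove(id);
  }

  boolean isHealthy() {
    return isHealthy(System.nanoTime());
  }

  /** Unhealthy as soon as any in-flight call is older than the threshold. */
  boolean isHealthy(long nowNanos) {
    for (long start : startNanosById.values()) {
      if (nowNanos - start > maxNanos) {
        return false;
      }
    }
    return true;
  }
}
```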


[GitHub] [accumulo] dlmarion commented on issue #3087: Client in endless loop when ZooKeeper unavailable

Posted by GitBox <gi...@apache.org>.

dlmarion commented on issue #3087:
URL: https://github.com/apache/accumulo/issues/3087#issuecomment-1320566966

   I looked at the 2.1 code and I think the same issue might exist.


[GitHub] [accumulo] EdColeman commented on issue #3087: Client in endless loop when ZooKeeper unavailable

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #3087:
URL: https://github.com/apache/accumulo/issues/3087#issuecomment-1320580324

   Not sure if this is related and can be handled in the same fashion (or with the same code), but if you have a shell open and shut down the cluster, you need to find and kill the shell pid.
   
   Not really an operational scenario, but it does happen if you are testing with multiple tabs / windows open.
   
   Also not sure if this is really a bug in 1.10, but might be more appropriate for 3.x and maybe 2.1.1


[GitHub] [accumulo] jwomeara commented on issue #3087: Client in endless loop when ZooKeeper unavailable

Posted by GitBox <gi...@apache.org>.

jwomeara commented on issue #3087:
URL: https://github.com/apache/accumulo/issues/3087#issuecomment-1322609064

   As I see it, my options are to either code around accumulo by running this call in a separate thread, or update Accumulo to handle this use case.  I would prefer to update Accumulo as that seems cleaner to me.  Adding the retry/timeout parameters to BatchWriterConfig would ensure that both the Connector API and default behavior remain unchanged.


[GitHub] [accumulo] jwomeara commented on issue #3087: Client in endless loop when ZooKeeper unavailable

Posted by GitBox <gi...@apache.org>.

jwomeara commented on issue #3087:
URL: https://github.com/apache/accumulo/issues/3087#issuecomment-1327660048

   Hmmm, so I tried updating my code to run the accumulo API calls (instance.getConnector, and connector.createBatchWriter) in their own threads so that I can cancel them after a configurable amount of time.  It seems like those accumulo API calls are not responding to the interrupt/cancel being called on the thread because the threads will remain living until zookeeper is brought back online, at which point the calls fail.  Might not be able to solve this problem solely by running the calls in a separate thread.
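
One plausible explanation, offered as an assumption rather than a reading of the Accumulo source, is a retry loop that catches the interrupt and keeps looping. The sketch below uses hypothetical stand-ins (`InterruptDemo`, a `serviceUp` flag in place of ZooKeeper) to contrast a loop that swallows `InterruptedException` with one that exits on it.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustration only; these loops are stand-ins, not Accumulo's retry code. */
final class InterruptDemo {
  private InterruptDemo() {}

  /** Retries until the service is up and swallows interrupts along the way. */
  static void retryIgnoringInterrupts(AtomicBoolean serviceUp) {
    while (!serviceUp.get()) {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        // Swallowed: sleep() cleared the flag, so the loop simply continues.
      }
    }
  }

  /** Same loop, but it restores the interrupt flag and gives up. */
  static void retryHonoringInterrupts(AtomicBoolean serviceUp) {
    while (!serviceUp.get()) {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  /** Starts the loop, interrupts it, and reports whether the thread ended. */
  static boolean stopsWhenInterrupted(Runnable loop) throws InterruptedException {
    Thread worker = new Thread(loop);
    worker.setDaemon(true); // a stuck worker must not keep the JVM alive
    worker.start();
    Thread.sleep(100); // give the loop time to start
    worker.interrupt();
    worker.join(500);
    return !worker.isAlive();
  }
}
```

With the first loop the worker stays alive after the interrupt until `serviceUp` becomes true, which matches the behavior described above; with the second it ends promptly.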


[GitHub] [accumulo] jwomeara commented on issue #3087: Client in endless loop when ZooKeeper unavailable

Posted by GitBox <gi...@apache.org>.

jwomeara commented on issue #3087:
URL: https://github.com/apache/accumulo/issues/3087#issuecomment-1320573284

   It might not be there!  And if so, accumulo shouldn't be sticking around waiting for it indefinitely!


[GitHub] [accumulo] ctubbsii commented on issue #3087: Client in endless loop when ZooKeeper unavailable

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #3087:
URL: https://github.com/apache/accumulo/issues/3087#issuecomment-1320924082

   Because components start at different times, and can be flaky, I'm pretty sure this is working as intended, and not a bug at all. We rely on retrying when the connection to ZooKeeper is flaky, because the problem is usually transient, and system reliability overall is improved when we do this.


[GitHub] [accumulo] ctubbsii commented on issue #3087: Client in endless loop when ZooKeeper unavailable

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #3087:
URL: https://github.com/apache/accumulo/issues/3087#issuecomment-1320924169

   This has been Accumulo's intended behavior for as long as I can remember.


[GitHub] [accumulo] dlmarion closed issue #3087: Client in endless loop when ZooKeeper unavailable

Posted by "dlmarion (via GitHub)" <gi...@apache.org>.

dlmarion closed issue #3087: Client in endless loop when ZooKeeper unavailable
URL: https://github.com/apache/accumulo/issues/3087


[GitHub] [accumulo] ctubbsii commented on issue #3087: Client in endless loop when ZooKeeper unavailable

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #3087:
URL: https://github.com/apache/accumulo/issues/3087#issuecomment-1329589939

   @jwomeara What about using the new 2.x client API? If we add anything in a newer version, it would be to that API, not the deprecated one, anyway. I'm guessing it's probably the same, but it's hard to say, since there were so many changes in 2.x to the client code.


[GitHub] [accumulo] jwomeara commented on issue #3087: Client in endless loop when ZooKeeper unavailable

Posted by GitBox <gi...@apache.org>.

jwomeara commented on issue #3087:
URL: https://github.com/apache/accumulo/issues/3087#issuecomment-1322596544

I'm not asking for any default behavior to be changed. If the way it's written today works for 99% of the use cases, then I wouldn't argue for changing that.

The problem I'm having is that in the event that there is a networking issue, or my zookeeper instances are (temporarily) unavailable, my code will get stuck in the createBatchWriter call indefinitely, removing my ability to react to the situation. Giving me a way to take back control, either by specifying a certain number of retries or a timeout, would be useful.

In the specific use case I'm dealing with, I have a limited number of rabbitmq consumer threads that read audit messages from a queue and then write those messages to accumulo using a batch writer. I am getting into a situation where all of my consumer threads are getting locked up waiting on createBatchWriter to return, and even when the zookeeper instances return, the call still hangs. If I were able to configure accumulo to throw an exception instead, then I would be able to react and possibly write my audit messages somewhere else, like HDFS, or even attempt creating a new batch writer. The way it is now, I am locking up all of my threads and backing up rabbitmq, with my only recourse being to roll the audit service.

If adding a timeout option is a no-go, then I will have to resort to creating the batch writer in a separate thread that I am able to kill myself after a certain amount of time. That seems sloppy to me, but I don't think I have any other options without making a change to accumulo.
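
To illustrate the behavior being requested, here is roughly what a bounded retry looks like in isolation. `BoundedRetry` is a hypothetical helper, not an existing Accumulo option: it retries a fixed number of times and then rethrows the last failure, so the caller regains control and can choose a fallback.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not an Accumulo API): retry a bounded number of times. */
final class BoundedRetry {
  private BoundedRetry() {}

  /**
   * Calls {@code op} up to {@code maxAttempts} times, pausing between
   * attempts. Rethrows the last failure once the attempts are used up.
   */
  static <T> T retry(Callable<T> op, int maxAttempts, long pauseMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (InterruptedException e) {
        // Do not retry through an interrupt; restore the flag and stop.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        last = e;
      }
      if (attempt < maxAttempts) {
        Thread.sleep(pauseMillis);
      }
    }
    throw last; // non-null here because maxAttempts >= 1
  }
}
```

In the audit-service scenario, the caller would catch the final exception and, for example, write the message to another sink or try to create a new writer.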
