You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2022/02/09 01:04:16 UTC

[GitHub] [accumulo] ctubbsii opened a new issue #2474: Broken or Flaky test: FateIT

ctubbsii opened a new issue #2474:
URL: https://github.com/apache/accumulo/issues/2474


   **Test name(s)**
    - org.apache.accumulo.test.fate.zookeeper.FateIT
   
   **Describe the failure observed**
   
   <details>
   <summary>Click here to expand</summary>
   
   ```
   [ERROR] Tests run: 6, Failures: 0, Errors: 6, Skipped: 0, Time elapsed: 20.824 s <<< FAILURE! - in org.apache.accumulo.test.fate.zookeeper.FateIT
   [ERROR] org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus  Time elapsed: 3.815 s  <<< ERROR!
   org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/2f292b02-56cd-4f7d-aa3e-8a8b31187557/fate/tx_699cb260ffb1234d
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
   	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.lambda$getData$0(ZooReader.java:65)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoopMutator(ZooReader.java:165)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:144)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:65)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.getTxStatus(FateIT.java:193)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus(FateIT.java:156)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
   	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   	at java.base/java.lang.Thread.run(Thread.java:829)
   
   [ERROR] org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus  Time elapsed: 3.176 s  <<< ERROR!
   org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/2f292b02-56cd-4f7d-aa3e-8a8b31187557/fate/tx_74b386e44516c806
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
   	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.lambda$getData$0(ZooReader.java:65)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoopMutator(ZooReader.java:165)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:144)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:65)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.getTxStatus(FateIT.java:193)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus(FateIT.java:156)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
   	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   	at java.base/java.lang.Thread.run(Thread.java:829)
   
   [ERROR] org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus  Time elapsed: 3.184 s  <<< ERROR!
   org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/2f292b02-56cd-4f7d-aa3e-8a8b31187557/fate/tx_317d2c0c773fb67f
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
   	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.lambda$getData$0(ZooReader.java:65)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoopMutator(ZooReader.java:165)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:144)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:65)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.getTxStatus(FateIT.java:193)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus(FateIT.java:156)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
   	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   	at java.base/java.lang.Thread.run(Thread.java:829)
   
   [ERROR] org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus  Time elapsed: 3.176 s  <<< ERROR!
   org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/2f292b02-56cd-4f7d-aa3e-8a8b31187557/fate/tx_1146fc5b1d3886ef
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
   	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.lambda$getData$0(ZooReader.java:65)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoopMutator(ZooReader.java:165)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:144)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:65)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.getTxStatus(FateIT.java:193)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus(FateIT.java:156)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
   	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   	at java.base/java.lang.Thread.run(Thread.java:829)
   
   [ERROR] org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus  Time elapsed: 3.161 s  <<< ERROR!
   org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/2f292b02-56cd-4f7d-aa3e-8a8b31187557/fate/tx_47fd229d568ce869
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
   	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.lambda$getData$0(ZooReader.java:65)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoopMutator(ZooReader.java:165)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:144)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:65)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.getTxStatus(FateIT.java:193)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus(FateIT.java:156)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
   	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   	at java.base/java.lang.Thread.run(Thread.java:829)
   
   [ERROR] org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus  Time elapsed: 3.159 s  <<< ERROR!
   org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/2f292b02-56cd-4f7d-aa3e-8a8b31187557/fate/tx_25f4481f2a0b9e6d
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
   	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.lambda$getData$0(ZooReader.java:65)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoopMutator(ZooReader.java:165)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:144)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:65)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.getTxStatus(FateIT.java:193)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus(FateIT.java:156)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
   	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
   	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   	at java.base/java.lang.Thread.run(Thread.java:829)
   ```
   
   </details>
   
   **Testing Environment:**
   ```
   Apache Maven 3.8.1 (05c21c65bdfed0f71a2f2ada8b84da59348c4c5d)
   Maven home: /var/lib/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.8.1
   Java version: 11.0.14, vendor: Red Hat, Inc., runtime: /usr/lib/jvm/java-11-openjdk-11.0.14.0.9-2.fc35.x86_64
   Default locale: en_US, platform encoding: UTF-8
   OS name: "linux", version: "5.16.5-200.fc35.x86_64", arch: "amd64", family: "unix"
   ```
   
   **What have you tried already?**
   Nothing
   
   **Additional context**
   The last changes here were for #2462 and #2470. This is not a known flaky test, in general, so I suspect those changes are related (but it's not certain).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] dlmarion commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1057405029


   That's interesting. We have discussed using `sync` previously with ZooKeeper, but I don't think we are actually using it anywhere in Accumulo. Presumably you only need to do the third sync call from above, on `txdir`, before checking the status? This might not be an issue with `Fate` because it all runs in the `Manager` process and the FaTE transaction runner threads are likely using a shared `ZooReaderWriter` from the `ServerContext`. This seems very similar to using the Java `volatile` modifier.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] dlmarion commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1035285544


   FateIT uses ZooKeeperTestingServer, which should use PortUtils to get a free port for the server, It does [not](https://github.com/apache/accumulo/blob/main/test/src/main/java/org/apache/accumulo/test/zookeeper/ZooKeeperTestingServer.java#L103) currently do that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1036212393


   [FateIT-logs.tar.gz](https://github.com/apache/accumulo/files/8048922/FateIT-logs.tar.gz)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii closed issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
ctubbsii closed issue #2474:
URL: https://github.com/apache/accumulo/issues/2474


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] dlmarion commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1060695879


   I think if we inserted a `sync` [here](https://github.com/apache/accumulo/blob/main/test/src/main/java/org/apache/accumulo/test/fate/zookeeper/FateIT.java#L152), that might solve the problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] dlmarion commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1060736248


   So, I changed the test method to:
   
   ```
     private static TStatus getTxStatus(ZooReaderWriter zrw, long txid)
         throws KeeperException, InterruptedException {
       String txdir = String.format("%s%s/tx_%016x", ZK_ROOT, Constants.ZFATE, txid);
       zrw.getZooKeeper().sync(txdir, null, null);
       return TStatus.valueOf(new String(zrw.getData(txdir), UTF_8));
     }
   ```
   
   Subsequent tests both succeeded and failed. Looking at the `sync` javadoc, it's asynchronous. :-(


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] dlmarion commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1058101737


   > The code that is being tested is for metrics - timing issues may not matter in this case and the metrics code could be modified to ignore / report zero if the node cannot be found. The value / node / client should naturally re-sync on the next metrics collection cycle.
   
     I'm not sure what you mean by this, FateIT.testTransactionStatus() is not testing metrics. It's testing that the status of the transaction is what we expect as it moves through the Fate lifecycle.
     
     I believe that there is only one ZK server being used with ZooKeeperTestingServer, so the issue here is that the Fate transaction runner thread sets the state on the node and the test thread is not seeing the change. I was under the impression that sync would help when you had multiple servers, which I don't think is the case here. I do agree that maybe looping instead of sync might work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] EdColeman commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
EdColeman commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1058066415


   The code that is being tested is for metrics - timing issues may not matter in this case and the metrics code could be modified to ignore / report zero if the node cannot be found.  The value / node / client should naturally re-sync on the next metrics collection cycle.
   
   For the test code - adding a loop and waiting for the node to exist  might be sufficient - and if not might be pointing to a logic error / bad assumption on the way that zookeeper is being used.
   
   I'm cautious of adding a sync in this case - it might be masking other issues, and the concern would be that there could also be problems with locks - which should be much more sensitive to having repeatable / consistent state across clients and connections. And in those cases I think that requiring "exact" state across the cluster might point to a more fragile design rather than accepting that it is a distributed system and "things happen when they happen, deal with it (tm)" approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1060713495


   I'm pretty sure we'd have to do it before each call to get the status, since the status is updated in a separate thread by a different ZK client. However, we don't need to have all the lines I had. I was told that the path parameter wasn't really used, but from my experimentation, it seemed that the path did need to exist at least. So, using one sync call to the "/" path should suffice inside the get status method.
   
   > I could be mis-remembering - but fateTransactionStatus is for metrics. (added 1.10?)
   
   It might be used for that, but in this test, it's being used to check for expected progress through the FaTE lifecycle.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1036206841


   Confirmed that this is still a problem after the changes in #2485:
   
   ```
   org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/3d3f19b8-4ff8-47f0-b33b-a87c3d0804a2/fate/tx_731d81b1c8ef162b
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
   	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.lambda$getData$0(ZooReader.java:65)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoopMutator(ZooReader.java:165)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:144)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:65)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.getTxStatus(FateIT.java:193)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus(FateIT.java:156)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii removed a comment on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
ctubbsii removed a comment on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1036212393


   [FateIT-logs.tar.gz](https://github.com/apache/accumulo/files/8048922/FateIT-logs.tar.gz)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] dlmarion commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1036442794


   @milleruntime - not at the moment. I have been working on the ExternalCompaction ITs. With Keith's help, I think I have the solution for that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1061791145


   > So, I changed the test method to:
   > 
   > ```
   >   private static TStatus getTxStatus(ZooReaderWriter zrw, long txid)
   >       throws KeeperException, InterruptedException {
   >     String txdir = String.format("%s%s/tx_%016x", ZK_ROOT, Constants.ZFATE, txid);
   >     zrw.getZooKeeper().sync(txdir, null, null);
   >     return TStatus.valueOf(new String(zrw.getData(txdir), UTF_8));
   >   }
   > ```
   > 
   > Subsequent tests both succeeded and failed. Looking at the `sync` javadoc, it's asynchronous. :-(
   
   Right, so, we gotta create a callback to sync the sync :smiley_cat:
   I forgot to mention that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] dlmarion commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1035305783


   So, the question is whether or not `SO_REUSEADDR` (used in PortUtils) would allow 2 different processes to bind to the same port (I think the answer is yes, but I'm not 100% sure). If so, then the current implementation of `ZooKeeperTestingServer.getPort` could cause it to bind to another port that was already allocated using PortUtils. Is it possible that we have two (or more) ZK servers listening on the same  port?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] milleruntime commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
milleruntime commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1036428337


   @dlmarion are you looking at this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] EdColeman commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
EdColeman commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1058117645


   I could be mis-remembering - but fateTransactionStatus is for metrics. (added 1.10?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii edited a comment on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
ctubbsii edited a comment on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1036206841


   Confirmed that this is still a problem after the changes in #2485:
   
   ```
   org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/3d3f19b8-4ff8-47f0-b33b-a87c3d0804a2/fate/tx_731d81b1c8ef162b
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
   	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.lambda$getData$0(ZooReader.java:65)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoopMutator(ZooReader.java:165)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:144)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:65)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.getTxStatus(FateIT.java:193)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus(FateIT.java:156)
   ```
   
   Here are the failsafe logs for this test: [FateIT-logs.tar.gz](https://github.com/apache/accumulo/files/8048922/FateIT-logs.tar.gz)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1058117116


   > The code that is being tested is for metrics - timing issues may not matter in this case and the metrics code could be modified to ignore / report zero if the node cannot be found. The value / node / client should naturally re-sync on the next metrics collection cycle.
   
   I think you're referring to #2406 , which may be flaky for a similar reason as this one? The code being tested here that I've seen failing is testing that the expected state changes happen as a FaTE operation makes progress.
   
   > For the test code - adding a loop and waiting for the node to exist might be sufficient - and if not might be pointing to a logic error / bad assumption on the way that zookeeper is being used.
   
   A loop in the test code means it could wait up to the ZK timeout (30 seconds) before it sees the update. An explicit sync will make it more stable. I think it probably makes sense to sync when checking the expected state.
   
   > I'm cautious of adding a sync in this case - it might be masking other issues, and the concern would be that there could also be problems with locks - which should be much more sensitive to having repeatable / consistent state across clients and connections. And in those cases I think that requiring "exact" state across the cluster might point to a more fragile design rather than accepting that it is a distributed system and "things happen when they happen, deal with it (tm)" approach.
   
   I share that concern, and is why I haven't applied this fix yet. However, I'm inclined to advocate for having a sync (asking questions from ZK devs about its usage first) for the test if it looks like this is specifically a test problem. If there are other places in our code that also would benefit from a sync, then those might be pre-existing issues, which can be solved later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1057398477


   I found that this test becomes extremely stable when I do:
   
   ```diff
   @@ -197,5 +197,8 @@ public class FateIT {
      private static TStatus getTxStatus(ZooReaderWriter zrw, long txid)
          throws KeeperException, InterruptedException {
        String txdir = String.format("%s%s/tx_%016x", ZK_ROOT, Constants.ZFATE, txid);
   +    zrw.getZooKeeper().sync(ZK_ROOT, null, null);
   +    zrw.getZooKeeper().sync(ZK_ROOT + Constants.ZFATE, null, null);
   +    zrw.getZooKeeper().sync(txdir, null, null);
        return TStatus.valueOf(new String(zrw.getData(txdir), UTF_8));
      }
   ```
   
   This seems to fix the test. However, I don't know if it's enough to address any potential issues when actually running Accumulo. In other words, I don't know if this is merely a test bug that is fixed by this change, or if it's a problem with Accumulo itself.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1035356361


   I'm going to reopen this test, because I think the modification in #2485 is not likely going to fix this. The error is that the FaTE transaction node simply does not exist in ZK. There is no problem with actually communicating with ZK on that port or any other ZK-specific issue indicated in the logs of this test that failed. `SO_REUSEADDR` would not have allowed multiple ZKs to listen on the same socket. It only allows binding to recently used ports that haven't been cleaned up yet. And, the probability of selecting a port in use would have been very low, and the test has failed multiple times. So, I'm pretty confident that's not going to fix this, although it is an improvement to the test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1057420566


   > That's interesting. We have discussed using `sync` previously with ZooKeeper, but I don't think we are actually using it anywhere in Accumulo. Presumably you only need to do the third sync call from above, on `txdir`, before checking the status?
   
   I tried only doing the last one. It didn't exist yet (or the client wasn't aware of its existence), so it had nothing to sync on. I had to sync the parents so knowledge of the children were updated before I could sync the child nodes. It's possible I only needed to sync the parent, though... if the sync is effectively recursive... I'm not sure exactly how it works, only that some aggressive syncing did solve the test failure.
   
   > This might not be an issue with `Fate` because it all runs in the `Manager` process and the FaTE transaction runner threads are likely using a shared `ZooReaderWriter` from the `ServerContext`. This seems very similar to using the Java `volatile` modifier.
   
   1. ZooReaderWriter is merely a utility wrapper that adds methods to a ZooKeeper composed within it. It has no state itself, so whether it is shared or not has no influence on the matter. Furthermore, it should use the same ZooKeeper session provided by the ZooSession singleton, so it doesn't matter if you construct a new ZRW or reuse it.
   2. Similarly, the ZooKeeper composed within it is itself contained within a ZooSession singleton, and can be replaced if the old ZooKeeper session dies. So, there's no guarantee that you reuse the same ZooKeeper object.
   3. And finally, sync is sometimes necessary because you can be talking to a different ZooKeeper server, which ZooKeeper objects can transparently switch between, and not necessarily because of a different ZooKeeper client instance being used. I would imagine that it would stay sync'd for actions performed by the same ZooKeeper client instance, even if it transparently switches to another server, but I don't know what guarantees there are for that.
   
   In any case, even if sync isn't necessary for item 3 in this list (assuming the same ZK object stays sync'd when it switches servers transparently), which I'm not sure about, we don't have guarantees over item 2... because ZooSession can change the ZooKeeper object used.
   
   These are part of the things I'm trying to investigate in #2298 when I encountered this. However, progress is slow on that because of the complications involving the deprecated Instance/Connector objects and their requirements to use the singleton ZooKeeper sessions, and we can't remove those until 3.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] dlmarion commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1058120902


   FWIW - I have no issue with a sync() call in test code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii edited a comment on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
ctubbsii edited a comment on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1057398477


   I found that this test becomes extremely stable when I do:
   
   ```diff
   @@ -197,5 +197,8 @@ public class FateIT {
      private static TStatus getTxStatus(ZooReaderWriter zrw, long txid)
          throws KeeperException, InterruptedException {
   +    zrw.getZooKeeper().sync(ZK_ROOT, null, null);
   +    zrw.getZooKeeper().sync(ZK_ROOT + Constants.ZFATE, null, null);
        String txdir = String.format("%s%s/tx_%016x", ZK_ROOT, Constants.ZFATE, txid);
   +    zrw.getZooKeeper().sync(txdir, null, null);
        return TStatus.valueOf(new String(zrw.getData(txdir), UTF_8));
      }
   ```
   
   This seems to fix the test. However, I don't know if it's enough to address any potential issues when actually running Accumulo. In other words, I don't know if this is merely a test bug that is fixed by this change, or if it's a problem with Accumulo itself.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1042016224


   I saw this fail again 3 times after less than 4 seconds each, before succeeding on a fourth attempt. The error was the same every time:
   
   ```
   [ERROR] org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus  Time elapsed: 3.175 s  <<< ERROR!
   org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/11bc8f82-60ac-4a63-bc6b-f417dc345bc4/fate/tx_445178e030cf8c96
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
   	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
   	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.lambda$getData$0(ZooReader.java:65)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoopMutator(ZooReader.java:165)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:144)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:131)
   	at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:65)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.getTxStatus(FateIT.java:200)
   	at org.apache.accumulo.test.fate.zookeeper.FateIT.testTransactionStatus(FateIT.java:163)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] dlmarion commented on issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2474:
URL: https://github.com/apache/accumulo/issues/2474#issuecomment-1060718396


   > I'm pretty sure we'd have to do it before each call to get the status
   
    Agreed, I just saw it fail when checking the status the 3rd time, which means it found the node the first two times, but not the third.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] dlmarion closed issue #2474: Broken or Flaky test: FateIT

Posted by GitBox <gi...@apache.org>.
dlmarion closed issue #2474:
URL: https://github.com/apache/accumulo/issues/2474


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org