You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "hasnain-db (via GitHub)" <gi...@apache.org> on 2023/10/31 04:30:30 UTC

[PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]

hasnain-db opened a new pull request, #43596:
URL: https://github.com/apache/spark/pull/43596

   ### What changes were proposed in this pull request?
   
   Improve a few timing related constraints:
   
   * Wait 10s instead of 5 for a reload to happen when under high load. This should not delay the test in the average case as it checks every 100ms for an event to happen.
   * In certain cases we run *too fast* so the new file we create has the same timestamp as the old file, and thus we never reload. Add a sleep there so the modification times are different. This was accidentally reverted in https://github.com/apache/spark/pull/43249/commits/b7dac1f96ec45a300963e4da1dc4fc1173470da7#diff-de96db6b61e9f48fb9bd8b781f4367e60a48b3886dfe03d4cf16b47ef6c26d0a
   
   ### Why are the changes needed?
   
   These changes are needed to make the test more reliable and less flaky under load.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   
   ### How was this patch tested?
   
   Ran this test in parallel on a machine under high load. Previously under those conditions I would repeatedly get high rates of failure (80%+) and now it does not fail.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on code in PR #43596:
URL: https://github.com/apache/spark/pull/43596#discussion_r1379825659


##########
common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:
##########
@@ -161,14 +161,17 @@ public void testReload() throws Exception {
       // At this point we haven't reloaded, just the initial load
       assertEquals(0, tm.reloadCount);
 
+      // Wait so that the file modification time is different
+      Thread.sleep((tm.getReloadInterval() + 200));

Review Comment:
   Why do we need this ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on PR #43596:
URL: https://github.com/apache/spark/pull/43596#issuecomment-1791900605

   Merged to master.
   Thanks for fixing this @hasnain-db !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on code in PR #43596:
URL: https://github.com/apache/spark/pull/43596#discussion_r1379825659


##########
common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:
##########
@@ -161,14 +161,17 @@ public void testReload() throws Exception {
       // At this point we haven't reloaded, just the initial load
       assertEquals(0, tm.reloadCount);
 
+      // Wait so that the file modification time is different
+      Thread.sleep((tm.getReloadInterval() + 200));

Review Comment:
   ~Why do we need this ?~ I can see how it can have an impact - would suggest making the duration slightly higher (1s ?), in case there is some timing issue.



##########
common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:
##########
@@ -161,14 +161,17 @@ public void testReload() throws Exception {
       // At this point we haven't reloaded, just the initial load
       assertEquals(0, tm.reloadCount);
 
+      // Wait so that the file modification time is different
+      Thread.sleep((tm.getReloadInterval() + 200));

Review Comment:
   ~Why do we need this ?~ I can see how it can have an impact - a super nit: would suggest making the duration slightly higher (1s ?), in case there is some timing issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on code in PR #43596:
URL: https://github.com/apache/spark/pull/43596#discussion_r1379834361


##########
common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:
##########
@@ -161,14 +161,17 @@ public void testReload() throws Exception {
       // At this point we haven't reloaded, just the initial load
       assertEquals(0, tm.reloadCount);
 
+      // Wait so that the file modification time is different
+      Thread.sleep((tm.getReloadInterval() + 200));
+
       // Add another cert
       Map<String, X509Certificate> certs = new HashMap<String, X509Certificate>();
       certs.put("cert1", cert1);
       certs.put("cert2", cert2);
       createTrustStore(trustStore, "password", certs);
 
-      // Wait up to 5s until we reload
-      waitForReloadCount(tm, 1, 50);
+      // Wait up to 10s until we reload
+      waitForReloadCount(tm, 1, 100);

Review Comment:
   Did you observe failures here ? Given the reload should have happened above, what would result in potentially such a high delay here ?
   (Same for the cases below)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm closed pull request #43596: [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky
URL: https://github.com/apache/spark/pull/43596


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on code in PR #43596:
URL: https://github.com/apache/spark/pull/43596#discussion_r1379825659


##########
common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:
##########
@@ -161,14 +161,17 @@ public void testReload() throws Exception {
       // At this point we haven't reloaded, just the initial load
       assertEquals(0, tm.reloadCount);
 
+      // Wait so that the file modification time is different
+      Thread.sleep((tm.getReloadInterval() + 200));

Review Comment:
   ~Why do we need this ?~ I can see how it can have an impact.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on code in PR #43596:
URL: https://github.com/apache/spark/pull/43596#discussion_r1379834361


##########
common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:
##########
@@ -161,14 +161,17 @@ public void testReload() throws Exception {
       // At this point we haven't reloaded, just the initial load
       assertEquals(0, tm.reloadCount);
 
+      // Wait so that the file modification time is different
+      Thread.sleep((tm.getReloadInterval() + 200));
+
       // Add another cert
       Map<String, X509Certificate> certs = new HashMap<String, X509Certificate>();
       certs.put("cert1", cert1);
       certs.put("cert2", cert2);
       createTrustStore(trustStore, "password", certs);
 
-      // Wait up to 5s until we reload
-      waitForReloadCount(tm, 1, 50);
+      // Wait up to 10s until we reload
+      waitForReloadCount(tm, 1, 100);

Review Comment:
   Did you observe failures here ? Given the reload should have happened above, what would result in potentially such a high delay here ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]

Posted by "hasnain-db (via GitHub)" <gi...@apache.org>.
hasnain-db commented on code in PR #43596:
URL: https://github.com/apache/spark/pull/43596#discussion_r1380349554


##########
common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:
##########
@@ -161,14 +161,17 @@ public void testReload() throws Exception {
       // At this point we haven't reloaded, just the initial load
       assertEquals(0, tm.reloadCount);
 
+      // Wait so that the file modification time is different
+      Thread.sleep((tm.getReloadInterval() + 200));

Review Comment:
   thanks, fixing.
   
   also, for future people seeing this: the reason this is needed is that in certain scenarios this can run too fast, so the modification time of the original trust store and the one we create right after this is the same - in which case we do not reload the file, which fails this test.



##########
common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:
##########
@@ -161,14 +161,17 @@ public void testReload() throws Exception {
       // At this point we haven't reloaded, just the initial load
       assertEquals(0, tm.reloadCount);
 
+      // Wait so that the file modification time is different
+      Thread.sleep((tm.getReloadInterval() + 200));
+
       // Add another cert
       Map<String, X509Certificate> certs = new HashMap<String, X509Certificate>();
       certs.put("cert1", cert1);
       certs.put("cert2", cert2);
       createTrustStore(trustStore, "password", certs);
 
-      // Wait up to 5s until we reload
-      waitForReloadCount(tm, 1, 50);
+      // Wait up to 10s until we reload
+      waitForReloadCount(tm, 1, 100);

Review Comment:
   yeah, on a heavily loaded system (~40 load average on ~40 cores) this can take longer than 5s



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]

Posted by "hasnain-db (via GitHub)" <gi...@apache.org>.
hasnain-db commented on PR #43596:
URL: https://github.com/apache/spark/pull/43596#issuecomment-1786430142

   cc @mridulm this flakiness is mostly in our internal environment (haven't seen it too often in the sbt build) but I figured it makes sense to put here in case things change in the future. Let me know if you'd prefer we avoid this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on code in PR #43596:
URL: https://github.com/apache/spark/pull/43596#discussion_r1381165616


##########
common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:
##########
@@ -161,14 +161,17 @@ public void testReload() throws Exception {
       // At this point we haven't reloaded, just the initial load
       assertEquals(0, tm.reloadCount);
 
+      // Wait so that the file modification time is different
+      Thread.sleep((tm.getReloadInterval() + 200));
+
       // Add another cert
       Map<String, X509Certificate> certs = new HashMap<String, X509Certificate>();
       certs.put("cert1", cert1);
       certs.put("cert2", cert2);
       createTrustStore(trustStore, "password", certs);
 
-      // Wait up to 5s until we reload
-      waitForReloadCount(tm, 1, 50);
+      // Wait up to 10s until we reload
+      waitForReloadCount(tm, 1, 100);

Review Comment:
   That should do it :-)
   I would expect a lot more tests (outside of this effort) to be impacted with machines that loaded !



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org