You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/22 02:55:07 UTC

[GitHub] [spark] LuciferYang opened a new pull request, #37962: [SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios

LuciferYang opened a new pull request, #37962:
URL: https://github.com/apache/spark/pull/37962

   ### What changes were proposed in this pull request?
   After SPARK-17321, `YarnShuffleService` will persist data to local shuffle state db/reload data from local shuffle state db only when Yarn NodeManager start with `YarnConfiguration#NM_RECOVERY_ENABLED = true`.
   
   `YarnShuffleIntegrationSuite` not set `YarnConfiguration#NM_RECOVERY_ENABLED` and the default value of the configuration is false,  so `YarnShuffleIntegrationSuite` will neither trigger data persistence to the db nor verify the reload of data.
   
   This pr aims to let `YarnShuffleIntegrationSuite` restart the verification of registeredExecFile reload scenarios, to achieve this goal, this pr make the following changes:
   
   1. Add a new un-document configuration `spark.yarn.shuffle.testing` to `YarnShuffleService`, and Initialize `_recoveryPath` when `_recoveryPath == null && spark.yarn.shuffle.testing == true`.
   
   2. Only set `spark.yarn.shuffle.testing = true` in `YarnShuffleIntegrationSuite`, and add assertions to check `registeredExecFile` is not null to ensure that registeredExecFile reload scenarios will be verified.
   
   ### Why are the changes needed?
   Fix registeredExecFile reload  test scenarios.
   
   Why not test by configuring `YarnConfiguration#NM_RECOVERY_ENABLED` as true?
   
   This configuration has been tried
   
   **Hadoop 3.3.4**
   
   ```
   build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite -Phadoop-3
   ```
   
   ```
   2022-09-10T11:44:42.1710230Z Cause: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.org.iq80.leveldb.DBException
   2022-09-10T11:44:42.1715234Z at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
   2022-09-10T11:44:42.1719347Z at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
   2022-09-10T11:44:42.1723090Z at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
   2022-09-10T11:44:42.1726759Z at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
   2022-09-10T11:44:42.1731028Z at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313)
   2022-09-10T11:44:42.1735424Z at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370)
   2022-09-10T11:44:42.1740303Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
   2022-09-10T11:44:42.1745576Z at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597)
   2022-09-10T11:44:42.1828858Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
   2022-09-10T11:44:42.1829712Z at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
   2022-09-10T11:44:42.1830633Z at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:327)
   2022-09-10T11:44:42.1831431Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
   2022-09-10T11:44:42.1832279Z at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:112)
   ```
   
   **Hadoop 2.7.4**
   
   ```
   build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite -Phadoop-2
   ```
   
   ```
   YarnShuffleIntegrationWithLevelDBBackendSuite:
   org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite *** ABORTED ***
     java.lang.IllegalArgumentException: Cannot support recovery with an ephemeral server port. Check the setting of yarn.nodemanager.address
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:395)
     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
     at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272)
     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
     at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceStart(MiniYARNCluster.java:560)
     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
     at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
     at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278)
     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
     ...
   Run completed in 3 seconds, 992 milliseconds.
   Total number of tests run: 0
   Suites: completed 1, aborted 1
   Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
   *** 1 SUITE ABORTED ***
   ```
   
   From the above test, we need to use a fixed port to enable Yarn NodeManager recovery, but this is difficult to be guaranteed in UT, so this pr try a workaround way.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass GitHub Actions
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #37962: [SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #37962:
URL: https://github.com/apache/spark/pull/37962#discussion_r977145784


##########
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:
##########
@@ -222,6 +227,10 @@ protected void serviceInit(Configuration externalConf) throws Exception {
 
     boolean stopOnFailure = _conf.getBoolean(STOP_ON_FAILURE_KEY, DEFAULT_STOP_ON_FAILURE);
 
+    if (_recoveryPath == null && _conf.getBoolean(INTEGRATION_TESTING, false)) {
+      _recoveryPath = new Path(Files.createTempDir().toURI());

Review Comment:
   There is a little difference here. 
   `com.google.common.io.Files.createTempDir` is used instead due to  SPARK-39102 is not in Spark 3.3, so there is no `createTempDir` in `JavaUtils`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #37962: [SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37962:
URL: https://github.com/apache/spark/pull/37962#issuecomment-1254782580

   thanks @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #37962: [SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #37962: [SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios 
URL: https://github.com/apache/spark/pull/37962


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #37962: [SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #37962:
URL: https://github.com/apache/spark/pull/37962#discussion_r977145784


##########
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:
##########
@@ -222,6 +227,10 @@ protected void serviceInit(Configuration externalConf) throws Exception {
 
     boolean stopOnFailure = _conf.getBoolean(STOP_ON_FAILURE_KEY, DEFAULT_STOP_ON_FAILURE);
 
+    if (_recoveryPath == null && _conf.getBoolean(INTEGRATION_TESTING, false)) {
+      _recoveryPath = new Path(Files.createTempDir().toURI());

Review Comment:
   There is a little difference here. 
   `com.google.common.io.Files.createTempDir` is used instead due to  SPARK-39102 is not in Spark 3.3, so there is no `createTempDir` method in `JavaUtils`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #37962: [SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #37962:
URL: https://github.com/apache/spark/pull/37962#issuecomment-1254772221

   The test passed. Merged to branch-3.3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org