You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2020/09/02 19:51:55 UTC

[GitHub] [systemds] kev-inn opened a new pull request #1047: [SYSTEMDS-2651] Add faster multiple federated workers startup

kev-inn opened a new pull request #1047:
URL: https://github.com/apache/systemds/pull/1047


   Adds a new function to start multiple federated workers. The function first starts multiple processes and then waits for all of them to be ready for a connection by pinging (by sending a `CLEAR`) them.
   
   This method cannot be used for threads (running federated workers), because they share static variables, therefore their startup interferes with each other. Switching purely to processes should definitely be considered.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] kev-inn commented on pull request #1047: [SYSTEMDS-2651] Add faster multiple federated workers startup

Posted by GitBox <gi...@apache.org>.
kev-inn commented on pull request #1047:
URL: https://github.com/apache/systemds/pull/1047#issuecomment-688257975


   > But the issue is that it is hard / impossible to debug the processes while it is possible when inside the same JVM using threads, and in that context if we really want to make it smart then we would need to change the system such that it does not produce these static variables and objects that potentially also lead to more bugs in other parts of the program.
   
   Yes, that is the one major downside. Intellij supports attaching to the debugger to process, but I am not sure if that process can be automated. [`ThreadLocal`](https://docs.oracle.com/javase/8/docs/api/java/lang/ThreadLocal.html) would be another option for using threads, but I believe that would lead to performance loss, even though we only get debug improvements.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1047: [SYSTEMDS-2651] Add faster multiple federated workers startup

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1047:
URL: https://github.com/apache/systemds/pull/1047#issuecomment-687735766


   agree! also really like the PR!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] kev-inn edited a comment on pull request #1047: [SYSTEMDS-2651] Add faster multiple federated workers startup

Posted by GitBox <gi...@apache.org>.
kev-inn edited a comment on pull request #1047:
URL: https://github.com/apache/systemds/pull/1047#issuecomment-688257975


   > But the issue is that it is hard / impossible to debug the processes while it is possible when inside the same JVM using threads, and in that context if we really want to make it smart then we would need to change the system such that it does not produce these static variables and objects that potentially also lead to more bugs in other parts of the program.
   
   Yes, that is the one major downside. Intellij supports attaching to the debugger to process, but I am not sure if that operation can be automated. [`ThreadLocal`](https://docs.oracle.com/javase/8/docs/api/java/lang/ThreadLocal.html) would be another option for using threads, but I believe that would lead to performance loss, even though we only get debug improvements.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1047: [SYSTEMDS-2651] Add faster multiple federated workers startup

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1047:
URL: https://github.com/apache/systemds/pull/1047#issuecomment-700683655


   PR, on hold since we haven't found the perfect setup, and therefore the optimization is invalid, since it disables debugging.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] corepointer commented on a change in pull request #1047: [SYSTEMDS-2651] Add faster multiple federated workers startup

Posted by GitBox <gi...@apache.org>.
corepointer commented on a change in pull request #1047:
URL: https://github.com/apache/systemds/pull/1047#discussion_r519942448



##########
File path: src/test/java/org/apache/sysds/test/AutomatedTestBase.java
##########
@@ -1380,80 +1386,116 @@ public static int getRandomAvailablePort() {
 			return availableSocket.getLocalPort();
 		}
 		catch(IOException e) {
-			// If no port was found just use 9999
+			// If no port was found just use 9990
 			return 9990;
 		}

Review comment:
       If you get an IOException upon creating a socket, there are probably more severe issues pending than finding a nice port number :fearful: 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] kev-inn commented on pull request #1047: [SYSTEMDS-2651] Add faster multiple federated workers startup

Posted by GitBox <gi...@apache.org>.
kev-inn commented on pull request #1047:
URL: https://github.com/apache/systemds/pull/1047#issuecomment-723118768


   I fixed the problem with debugging, please read the last commit message to get a view of the current state.
   I am not sure if the enhancement (keeping workers alive) should be put into this PR or a new one, since I can imagine that there will be a few trade-offs which require discussion.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on a change in pull request #1047: [SYSTEMDS-2651] Add faster multiple federated workers startup

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on a change in pull request #1047:
URL: https://github.com/apache/systemds/pull/1047#discussion_r519916551



##########
File path: src/test/java/org/apache/sysds/test/AutomatedTestBase.java
##########
@@ -1380,80 +1386,116 @@ public static int getRandomAvailablePort() {
 			return availableSocket.getLocalPort();
 		}
 		catch(IOException e) {
-			// If no port was found just use 9999
+			// If no port was found just use 9990
 			return 9990;
 		}

Review comment:
       i think i did this, because i experienced weird behavior in cases where i searched for a random available port recursively, and i did not want a potential infinite loop in our tests startup.
   But i do agree that searching for a valid port would be better, than just returning this default.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] sebwrede commented on a change in pull request #1047: [SYSTEMDS-2651] Add faster multiple federated workers startup

Posted by GitBox <gi...@apache.org>.
sebwrede commented on a change in pull request #1047:
URL: https://github.com/apache/systemds/pull/1047#discussion_r492015888



##########
File path: src/test/java/org/apache/sysds/test/functions/privacy/FederatedL2SVMTest.java
##########
@@ -314,7 +314,7 @@ private void federatedL2SVM(Types.ExecMode execMode, Map<String, PrivacyConstrai
 		if(rtplatform == Types.ExecMode.SPARK) {
 			DMLScript.USE_LOCAL_SPARK_CONFIG = true;
 		}
-		Process t1 = null, t2 = null;
+		Process[] ts = new Process[0];

Review comment:
       This has been changed to use threads since the use of processes failed in GitHub Actions.

##########
File path: src/test/java/org/apache/sysds/test/AutomatedTestBase.java
##########
@@ -1283,44 +1288,89 @@ public static int getRandomAvailablePort() {
 			return availableSocket.getLocalPort();
 		}
 		catch(IOException e) {
-			// If no port was found just use 9999
+			// If no port was found just use 9990
 			return 9990;
 		}
 	}
 
+	private static void waitForFederatedWorker(int port) {
+		long start = System.currentTimeMillis();
+		while(System.currentTimeMillis() - start <= FED_WORKER_CONNECTION_WAIT) {
+			FederatedResponse response = null;
+			try {
+				response = executeFederatedOperation(new InetSocketAddress("localhost", port),
+					new FederatedRequest(FederatedRequest.RequestType.CLEAR)).get();
+			}
+			catch(Exception e) {
+				// failed, try again if fed worker wait has not passed yet
+			}
+			if(response != null) {
+				if(!(response.isSuccessful()))
+					LOG.error("Ping failed: " + response.getErrorMessage());
+				else
+					break;
+			}
+		}
+	}

Review comment:
       If the worker does not respond and the time limit has been reached then the execution would just continue as usual from here. This means that it will fail later when it tries to send a request to the worker. This will make it harder to debug compared to handling it when trying to create the worker. 

##########
File path: src/test/java/org/apache/sysds/test/functions/privacy/FederatedWorkerHandlerTest.java
##########
@@ -199,7 +199,7 @@ public void federatedSum(Types.ExecMode execMode, PrivacyLevel privacyLevel, Cla
 
 		assert(checkedPrivacyConstraintsContains(privacyLevel));
 
-		TestUtils.shutdownThread(t);
+		TestUtils.shutdownProcess(t);

Review comment:
       This has also been changed to threads. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1047: [SYSTEMDS-2651] Add faster multiple federated workers startup

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1047:
URL: https://github.com/apache/systemds/pull/1047#issuecomment-727259891


   Im closing this PR, because i have just added a faster startup to the federated tests, by adding optional startup time to the threads spawned in a federated tests. This renders this PR obsolite.
   
   If this conclusion is wrong feel free to reopen it @kev-inn .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] sebwrede commented on a change in pull request #1047: [SYSTEMDS-2651] Add faster multiple federated workers startup

Posted by GitBox <gi...@apache.org>.
sebwrede commented on a change in pull request #1047:
URL: https://github.com/apache/systemds/pull/1047#discussion_r492015888



##########
File path: src/test/java/org/apache/sysds/test/functions/privacy/FederatedL2SVMTest.java
##########
@@ -314,7 +314,7 @@ private void federatedL2SVM(Types.ExecMode execMode, Map<String, PrivacyConstrai
 		if(rtplatform == Types.ExecMode.SPARK) {
 			DMLScript.USE_LOCAL_SPARK_CONFIG = true;
 		}
-		Process t1 = null, t2 = null;
+		Process[] ts = new Process[0];

Review comment:
       This has been changed to use threads since the use of processes failed in GitHub Actions.

##########
File path: src/test/java/org/apache/sysds/test/AutomatedTestBase.java
##########
@@ -1283,44 +1288,89 @@ public static int getRandomAvailablePort() {
 			return availableSocket.getLocalPort();
 		}
 		catch(IOException e) {
-			// If no port was found just use 9999
+			// If no port was found just use 9990
 			return 9990;
 		}
 	}
 
+	private static void waitForFederatedWorker(int port) {
+		long start = System.currentTimeMillis();
+		while(System.currentTimeMillis() - start <= FED_WORKER_CONNECTION_WAIT) {
+			FederatedResponse response = null;
+			try {
+				response = executeFederatedOperation(new InetSocketAddress("localhost", port),
+					new FederatedRequest(FederatedRequest.RequestType.CLEAR)).get();
+			}
+			catch(Exception e) {
+				// failed, try again if fed worker wait has not passed yet
+			}
+			if(response != null) {
+				if(!(response.isSuccessful()))
+					LOG.error("Ping failed: " + response.getErrorMessage());
+				else
+					break;
+			}
+		}
+	}

Review comment:
       If the worker does not respond and the time limit has been reached then the execution would just continue as usual from here. This means that it will fail later when it tries to send a request to the worker. This will make it harder to debug compared to handling it when trying to create the worker. 

##########
File path: src/test/java/org/apache/sysds/test/functions/privacy/FederatedWorkerHandlerTest.java
##########
@@ -199,7 +199,7 @@ public void federatedSum(Types.ExecMode execMode, PrivacyLevel privacyLevel, Cla
 
 		assert(checkedPrivacyConstraintsContains(privacyLevel));
 
-		TestUtils.shutdownThread(t);
+		TestUtils.shutdownProcess(t);

Review comment:
       This has also been changed to threads. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard closed pull request #1047: [SYSTEMDS-2651] Add faster multiple federated workers startup

Posted by GitBox <gi...@apache.org>.
Baunsgaard closed pull request #1047:
URL: https://github.com/apache/systemds/pull/1047


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] corepointer commented on a change in pull request #1047: [SYSTEMDS-2651] Add faster multiple federated workers startup

Posted by GitBox <gi...@apache.org>.
corepointer commented on a change in pull request #1047:
URL: https://github.com/apache/systemds/pull/1047#discussion_r519797269



##########
File path: src/test/java/org/apache/sysds/test/AutomatedTestBase.java
##########
@@ -1380,80 +1386,116 @@ public static int getRandomAvailablePort() {
 			return availableSocket.getLocalPort();
 		}
 		catch(IOException e) {
-			// If no port was found just use 9999
+			// If no port was found just use 9990
 			return 9990;
 		}

Review comment:
       Isn't this hiding trouble? Can you explain when this behavior is ever desired to ignore the exception and just return a port? I know, this is actually already being used and the change is just about the comment, but I didn't review the original change, so I speak up now :blush: 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1047: [SYSTEMDS-2651] Add faster multiple federated workers startup

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1047:
URL: https://github.com/apache/systemds/pull/1047#issuecomment-687736854


   But the issue is that it is hard / impossible to debug the processes while it is possible when inside the same JVM using threads, and in that context if we really want to make it smart then we would need to change the system such that it does not produce these static variables and objects that potentially also lead to more bugs in other parts of the program.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org