You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2021/06/08 14:24:23 UTC

[GitHub] [systemds] Baunsgaard opened a new pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Baunsgaard opened a new pull request #1303:
URL: https://github.com/apache/systemds/pull/1303


   This PR connects the workload tree with the compression instruction, 
   this is done by constructing a global singleton-Hash-Map that can share objects via integer Ids,
   this id is then appended to the compression instruction, and allow the parsing of the Workload tree from Hop level to
   be accessed at Lop level.
   
   I would like some comments if this is okay, also i am concerned about spark implementation,
   since i would have to parse and send the WTree along with the spark instructions to workers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1303:
URL: https://github.com/apache/systemds/pull/1303#issuecomment-856969411


   I could use some feedback on this one, @mboehm7 .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1303:
URL: https://github.com/apache/systemds/pull/1303#issuecomment-857614373


   > I noticed that you are transmitting the whole tree, via an integer ID, to the instruction runtime. I am not well aware of the intention, but can we not follow the conventional path: optimize the tree/DAG in the optimizer, extract information, and transmit via the instructions strings of the corresponding instructions; and/or do the same during recompilation? That way you can stick to the supposed separation of compiler and runtime, and can easily scale to distributed runtimes such as Spark and federated.
   
   Excellent question.
   
   The instruction strings then would have to be able to contain nested structures, since the workload tree provide means of separating different loops into sub tree nodes in the structure. If a string instruction containing this should be constructed it means that the string ends up with different lengths depending on how many sub loops are executed.
   
   Providing an id to the structure and storing the tree somewhere else seams more scalable and flexible to future modifications where new types of instructions are added to the WTree.
   
   Also note that currently we look up our matrix objects the same way via an id provided to the instruction string, (just with another back-end that we have worked with much much longer). The reason why I'm not using that back-end is because it require specific data types contained, like Data, MatrixValue, MatrixBlock, FrameBlock, etc.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] sebwrede commented on a change in pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
sebwrede commented on a change in pull request #1303:
URL: https://github.com/apache/systemds/pull/1303#discussion_r648435211



##########
File path: src/main/java/org/apache/sysds/runtime/compress/CompressedMatrixBlockFactory.java
##########
@@ -88,6 +89,10 @@ private CompressedMatrixBlockFactory(MatrixBlock mb, int k, CompressionSettings
 		return compress(mb, k, new CompressionSettingsBuilder().create());
 	}
 
+	public static Pair<MatrixBlock, CompressionStatistics> compress(MatrixBlock mb, int k, WTreeRoot root){
+		return compress(mb, k, new CompressionSettingsBuilder().create());
+	}

Review comment:
       You are not using root in this method. Is this on purpose? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] phaniarnab commented on pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
phaniarnab commented on pull request #1303:
URL: https://github.com/apache/systemds/pull/1303#issuecomment-857598607


   I noticed that you are transmitting the whole tree, via an integer ID, to the instruction runtime. I am not well aware of the intention, but can we not follow the conventional path: optimize the tree/DAG in the optimizer, extract information, and transmit via the instructions strings of the corresponding instructions; and/or do the same during recompilation? That way you can stick to the supposed separation of compiler and runtime, and can easily scale to distributed runtimes such as Spark and federated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1303:
URL: https://github.com/apache/systemds/pull/1303#issuecomment-858550427


   > Well, if you cannot extract the information at compile time and cannot build the tree during runtime, then I agree that the best solution is to put the information somewhere that is not in the instruction string, like you have done in this PR.
   > 
   > Another comment: If you change the `_cops` field, `addCompressedOp`, and `getCompressOps` to something like `_highlightedOps`, `addHighlightedOp`, and `getHighlightedOps` then the class would already be generalized enough to be used for other workload tree representations, such as the one I need for the federated executions.
   
   should be changed now, But it is not enough to make it applicable to federated, for that you need to define 
   
   1. A new IPA
   2. A new Analyzer
   
   also if you decide to do so, we need to move this workload package to some more general location. perhaps a new package called workload in the runtime folder?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] sebwrede commented on pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
sebwrede commented on pull request #1303:
URL: https://github.com/apache/systemds/pull/1303#issuecomment-857824122


   I think the way the WTree is added to the instructions seems reasonable if the CompressionCPInstruction needs the entire tree, but is this really the case? Isn't it possible to add the information about compression contained in the tree and add it to the different instructions, so instead of putting all the information into a single CompressionCPInstruction string (which would have variable length), the information should be placed in the relevant instruction strings so that each of the strings contains only the compression information needed for its instruction execution (resulting in an instruction string of fixed length). 
   Maybe this is because I do not fully understand what you mean by "means of separating different loops into sub tree nodes in the structure". Couldn't the nested structures be represented by adding to the different instructions instead of adding all of it to the same instruction? 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard closed pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
Baunsgaard closed pull request #1303:
URL: https://github.com/apache/systemds/pull/1303


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1303:
URL: https://github.com/apache/systemds/pull/1303#issuecomment-864897214


   Thanks for the comments, @phaniarnab  @mboehm7  @sebwrede 
   i have modified accordingly.
   
   Closing PR because of continued work in PR : #1320 , will merge after the Release


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on a change in pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on a change in pull request #1303:
URL: https://github.com/apache/systemds/pull/1303#discussion_r648590327



##########
File path: src/main/java/org/apache/sysds/runtime/compress/CompressedMatrixBlockFactory.java
##########
@@ -88,6 +89,10 @@ private CompressedMatrixBlockFactory(MatrixBlock mb, int k, CompressionSettings
 		return compress(mb, k, new CompressionSettingsBuilder().create());
 	}
 
+	public static Pair<MatrixBlock, CompressionStatistics> compress(MatrixBlock mb, int k, WTreeRoot root){
+		return compress(mb, k, new CompressionSettingsBuilder().create());
+	}

Review comment:
       yup, this PR simply wire up the tree to allow the compression instruction to use it, my WIP PR is the actual implementation.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] sebwrede commented on pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
sebwrede commented on pull request #1303:
URL: https://github.com/apache/systemds/pull/1303#issuecomment-858560828


   > But it is not enough to make it applicable to federated
   
   It is only the tree-structure I intent to reuse. The only purpose is to use it as a representation of execution plans. 
   
   > perhaps a new package
   
   Yes, that would be a good idea. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] mboehm7 commented on pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
mboehm7 commented on pull request #1303:
URL: https://github.com/apache/systemds/pull/1303#issuecomment-857005894


   thanks - I'll make a detailed pass Friday this week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] mboehm7 commented on pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
mboehm7 commented on pull request #1303:
URL: https://github.com/apache/systemds/pull/1303#issuecomment-859754787


   LGTM. Overall, I think this map is fine - generating a unique token, storing the WTree globally under this token, and compiling the token into the compression instruction for later access is (in my opinion) the best option here. We could further make the WTree part of the normal explain output - that way it's not much different from our dictionaries of functions that are shown in the explain after compilation and then called from multiple different places.
   
   Some minor additional comments:
   * The rework of the WTree seem to have lost the begin/end line information (for the explain output). Especially in large scripts like GLM with >1000 lines and many removed statement blocks during compilation, this information is crucial for effective debugging.
   * The global map for token lookups could be hardened for concurrent access by different threads (e.g., JMLC) and be made part of the cleanup logic to prevent resource leaks in programmatic APIs. Furthermore, there is no need to create singleton instances of this map - instead it could be a simple class with a static member variable and static methods (which would avoid unnecessary indirections).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] sebwrede commented on pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
sebwrede commented on pull request #1303:
URL: https://github.com/apache/systemds/pull/1303#issuecomment-858435730


   Well, if you cannot extract the information at compile time and cannot build the tree during runtime, then I agree that the best solution is to put the information somewhere that is not in the instruction string, like you have done in this PR. 
   
   Another comment: If you change the `_cops` field, `addCompressedOp`, and `getCompressOps` to something like `_highlightedOps`, `addHighlightedOp`, and `getHighlightedOps` then the class would already be generalized enough to be used for other workload tree representations, such as the one I need for the federated executions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1303:
URL: https://github.com/apache/systemds/pull/1303#issuecomment-858012541


   > I think the way the WTree is added to the instructions seems reasonable if the CompressionCPInstruction needs the entire tree, but is this really the case?
   
   Yes, but that is assuming that i know at compile time what information is needed in the runtime of an arbitrary matrix.
   currently we don't have access to this information at compile time, and we probably should not.
   
   > Isn't it possible to add the information about compression contained in the tree and add it to the different instructions, so instead of putting all the information into a single CompressionCPInstruction string (which would have variable length), the information should be placed in the relevant instruction strings so that each of the strings contains only the compression information needed for its instruction execution (resulting in an instruction string of fixed length).
   
   Since the Compression instruction is one instruction i don't see how i should create multiple instruction strings? Maybe I'm missing the question.
   
   > Maybe this is because I do not fully understand what you mean by "means of separating different loops into sub tree nodes in the structure". Couldn't the nested structures be represented by adding to the different instructions instead of adding all of it to the same instruction?
   
   To clarify the loops:
   
   Lets say i read in a matrix 
   
   ```
   X = read("...")
   ```
   
   This matrix is used in a nested for loop at different locations
   
   ```
   d = mean(X)
   for(a in ...){
     ...
     c = max(X)
     Xm = X + 2
     for(b in ...){
       ...
       B = t(Xm) %*% X
       ...
   }}
   ```
   
   for optimizing compression for workflow i need to know
   
   1. what functions are called on X and how many times(at least an approximation), mean once, max in forloop, compressed multiplication in forloop forloop.
   2. what derived compressed matrices is constructed  such as Xm (Still compressed).
   3. what operations is performed on the derived matrices
   4. how many decompressing operations there is
   
   all this can be encapsulated in the "AWTree" object, and is hard to add to the instruction string since this string then would have to support all this complexity. 
   
   The information cannot just be extracted at compile time because i need access to the actual data that is being compressed to optimize for the given instructions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard commented on pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1303:
URL: https://github.com/apache/systemds/pull/1303#issuecomment-864897214


   Thanks for the comments, @phaniarnab  @mboehm7  @sebwrede 
   i have modified accordingly.
   
   Closing PR because of continued work in PR : #1320 , will merge after the Release


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [systemds] Baunsgaard closed pull request #1303: [SYSTEMDS-2994+2991] CLA Workload Analyzer and Workload Representation

Posted by GitBox <gi...@apache.org>.
Baunsgaard closed pull request #1303:
URL: https://github.com/apache/systemds/pull/1303


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org