Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2021/12/23 23:28:51 UTC

[GitHub] [systemds] DavidSandruTU commented on pull request #1481: [SYSTEMDS-2832] Refactoring of old performance benchmarks.

DavidSandruTU commented on pull request #1481:
URL: https://github.com/apache/systemds/pull/1481#issuecomment-1000563066


   > Hi, @DavidSandruTU!
   
   Thanks for the feedback, @Baunsgaard!
   
   > This is a good start, I have only minor things, but each takes a bit of time. Just reach out if there are questions. Also, once you have the limit argument, please try to run it with just the smallest size of each dataset, and post some of the numbers in the comments :), especially if you find some of them interesting or weird.
   
   The following tests were performed on an older Intel i5-6500 processor.
   
   Most algorithms execute in a couple of seconds, but here are some outliers (all times in seconds):
   
   StratifiedStatistics on temp/stratstats/A_10k/data: 39.069242381
   
   PCA seems especially heavy-hitting, but I think that is in line with the algorithm(?):
   
   PCA on temp/dimensionreduction/pcaData5k_2k_dense: 285.998737102
   
   ALS generally seems somewhat heavy-hitting (~20-30 seconds), but a more interesting outlier is the direct-solve algorithm on sparse data. In theory, with optimizations, it should perform better than the same algorithm on dense data.
   
   ALS-DS algorithm on temp/als/X10k_1k_sparse: 132.269487990
   
   > 1. Make all scripts executable from the same directory starting point.
   >    I suggest 'scripts/perftest', and write an error if you are not in a directory named that.
   
   Alright, here I want to make sure I've understood everything correctly. If I remember right, a single perftest folder with all necessary logic is copied onto the cluster and executed. I think this is why Matthias changed the data generation scripts in a previous commit to execute the DMLs in "./datagen/..." instead of "../datagen/...".
   
   What should I do here? Should I assume the datagen scripts are in the perftest folder or outside of it? Do you want me to add a copy of the datagen scripts to the perftest folder?
   Also, I'd then check that the scripts are run from the current directory 'perftest', as I think there would be no 'scripts' parent directory. Am I right in that assumption?
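   
   Either way, a minimal sketch of the directory guard, assuming the check sits at the top of each script:
   
   ```bash
   #!/bin/bash
   # Abort early if the script was not started from a directory named 'perftest'.
   if [[ "$(basename "$PWD")" != "perftest" ]]; then
      echo "Error: please run this script from the 'perftest' directory." >&2
      exit 1
   fi
   ```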
   
   > 2. Suggestion: make all the code available for scaling experiments in all cases, but add an argument that specifies max memory. For instance, 'genClusteringData.sh 80' indicating 80MB max, and make this consistent across scripts.
   
   ✅
   Please check my code and let me know if you like my approach, e.g. in genBinomialData.sh and runAllBinomial.sh.
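   
   To sketch the approach (the helper name and exact sizes are illustrative, not the PR code):
   
   ```bash
   #!/bin/bash
   # Usage: ./genBinomialData.sh [MAXMEM], where MAXMEM is the data-size limit in MB.
   MAXMEM=${1:-80}
   
   # gen_data is a hypothetical stand-in for the actual datagen invocation.
   gen_data() { echo "generating a $1 x $2 matrix ..."; }
   
   # 10,000 x 1,000 dense doubles is roughly 80MB; each step scales by 10x.
   if [ "$MAXMEM" -ge 80 ];   then gen_data 10000   1000; fi
   if [ "$MAXMEM" -ge 800 ];  then gen_data 100000  1000; fi
   if [ "$MAXMEM" -ge 8000 ]; then gen_data 1000000 1000; fi
   ```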
   
   > 3. Go through your changes and change all the &>> to &> and >> to >. I am a bit confused about the '&' part of it, since I don't know what it does in the context of '>'. If you can explain it, that would be nice; otherwise remove the '&' from the piping to files. (This comment is unrelated to adding '&' at the end of the lines, since I know that parallelizes the instruction.)
   
   ✅
   I think the "&" is so that error output is also put into the file, not only the standard out. I kept it that way from the original scripts, but never confirmed it myself. 
   Most outputs are changed to now overwrite, not append, except for some special cases:
   - The federated algorithm often accesses a file multiple times, e.g. once when creating and once when killing workers. Here I append instead of overwrite.
   - Similar with results/times.txt, where all algorithms append their performance times.
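   
   For reference, a quick demo of the four redirection variants (standard Bash behavior):
   
   ```bash
   ls /nonexistent >  out.log    # stdout only, overwrite; the error still prints to the terminal
   ls /nonexistent >> out.log    # stdout only, append
   ls /nonexistent &> out.log    # stdout AND stderr, overwrite; same as: > out.log 2>&1
   ls /nonexistent &>> out.log   # stdout AND stderr, append
   ```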
   
   > 4. Don't change permissions on log4j files and DML files.
   >    You have done so on:
   > 
   > * scripts/perftest/conf/log4j.properties
   > * scripts/perftest/fed/data/splitAndMakeFederated.dml
   > * scripts/perftest/scripts/GLM-predict.dml
   > * scripts/perftest/scripts/alsCG.dml
   
   ✅
   Today I learned: Git commits file permissions.
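   
   Standard Git commands are enough to spot and revert the mode-only changes:
   
   ```bash
   # List files whose mode changed (e.g. "mode change 100644 => 100755").
   git diff --summary
   
   # Clear the executable bit in the index without touching file content.
   git update-index --chmod=-x scripts/perftest/conf/log4j.properties
   ```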
   
   > 5. Make some of the smallest data generation run in parallel. This is simple to add with a '&' at the end of the line in sh, and then a 'wait' at the end of the scripts.
   
   ✅
   Some scripts have commands that split data into test and training sets, and these depend on the previous data generation command. I've added multiple waits on specific PIDs for that; let me know if you're happy with it, e.g. in genBinomialData.sh.
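   
   The pattern looks roughly like this (script names are placeholders for the actual generation steps):
   
   ```bash
   #!/bin/bash
   # Run the independent generation steps in parallel and remember their PIDs.
   ./genDataSmall.sh & PID_SMALL=$!
   ./genDataLarge.sh & PID_LARGE=$!
   
   # The train/test split depends only on the small dataset, so wait for that PID alone.
   wait $PID_SMALL
   ./splitTrainTest.sh &
   
   # Make sure everything has finished before the script exits.
   wait
   ```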
   
   > 6. Run all... should run all; don't comment anything out, but add an argument to stop execution based on the max memory argument.
   
   Do you want me to uncomment the matrix multiply and transform benchmarks too, or was that just meant in the context of the max memory argument?
   
   > 7. Make all the loops go through all datasets, subject to the memory constraints. Example: 'scripts/perftest/runAllMultinominal.sh'
   
   ✅
   This is related to points 2 and 6, and is implemented; the pattern is sketched below.
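   
   Roughly (the runner name and paths are placeholders):
   
   ```bash
   #!/bin/bash
   MAXMEM=${1:-80}
   # Walk the dataset sizes from small to large and stop once the limit is exceeded.
   for MB in 80 800 8000 80000; do
      if [ "$MB" -gt "$MAXMEM" ]; then break; fi
      ./runMultinomial.sh "temp/multinomial/X_${MB}MB" &> "results/multinomial_${MB}.out"
   done
   ```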
   
   > 8. If you make the commands multi-line with '\', then make sure you indent the following lines. Example in 'scripts/perftest/runKmeans.sh'.
   
   ✅
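   Continuation lines after a '\' are now indented, along these lines (file and arguments illustrative):
   
   ```bash
   systemds scripts/Kmeans.dml \
      -stats \
      -nvargs X=data/X.csv k=4
   ```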
   
   > 9. Remove commented-out code that obviously is not used and should not be used. Example in 'scripts/perftest/runStratStats.sh'.
   
   The calls to the old algorithms? I thought they might be helpful for comparing the output with the new built-ins, but I'll remove them. ✅
   
   > 10. Add an empty newline at the end of files. Example: 'scripts/perftest/scripts/stratstats.dml'.
   
   ✅
   
   
   > 
   > > ```
   > > * I'm assuming the federated tests are to be run locally only, not on a cluster.
   > > ```
   > 
   > True, they are running with localhost workers. We want to add the federated things, but it requires a bit of setup and remote machines, so it's not obvious for a perf benchmark.
   
   I've moved/copied some of the bash logic for the federated tests into the fed folder, as I've written new performance tests for als-predict and als-DS. These new perftests were originally not part of the federated tests, which is why I added the "original" versions to the fed folder, to be executed by the fed tests alone.
   
   > > ```
   > > * The stratified statistics benchmark needs a heap size of at least 10 GB. This setting is exported only when it's run locally via "export SYSTEMDS_STANDALONE_OPTS="-Xmx10g -Xms10g -Xmn2000m""
   > > ```
   > Change this, so that it is set via a settings file or other means.
   
   I've made a file "env-variables", which is sourced once in runAll.sh. The problem here is that if the federated test is run with that config (from runAll.sh), each worker also gets that heap size (e.g. 10 GB), which crashes the test in my case. I've therefore added that "export" to the file in commented-out form.
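   
   The file ends up looking roughly like this:
   
   ```bash
   # env-variables: sourced once from runAll.sh via:  source ./env-variables
   # Commented out by default, because each federated worker would inherit the
   # 10GB heap as well, which crashed the federated tests on my machine.
   # export SYSTEMDS_STANDALONE_OPTS="-Xmx10g -Xms10g -Xmn2000m"
   ```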
   
   
   
   > > ```
   > >   * It does not stop when run with logistic regression data.
   > > ```
   > 
   > Hmm, this might be a bug; if you want to, please report it as such in JIRA with a way to reproduce it.
   > 
   > https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12335852
   
   Will do.
   
   
   
   Also, Merry Christmas! 🎅
   
   
   

