Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2020/11/06 14:31:59 UTC
[GitHub] [systemds] Baunsgaard opened a new pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison
Baunsgaard opened a new pull request #1095:
URL: https://github.com/apache/systemds/pull/1095
This commit initializes the performance test suite with a micro-benchmark
comparing the performance of MKL and the default matrix multiplication.
The construction allows easy execution on different hardware platforms
to check FP-OPS and to compare the achieved throughput with the theoretical
throughput from the hardware specification.
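As a rough illustration of the FP-OPS check described above, the achieved throughput of a dense matrix multiplication can be estimated from its dimensions and runtime. This is only a sketch: it assumes the conventional 2*m*k*n operation count, and the function name and example values are hypothetical and not part of the test suite.

```python
def achieved_gflops(m: int, k: int, n: int, reps: int, seconds: float) -> float:
    """Estimate achieved GFLOP/s for `reps` dense (m x k) @ (k x n) products.

    Assumes the conventional count of 2*m*k*n floating point operations
    (one multiply and one add per inner-product term).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2.0 * m * k * n * reps
    return flops / seconds / 1e9


# Hypothetical example: 3 products of 1000 x 1000 matrices in 0.5 seconds.
print(f"{achieved_gflops(1000, 1000, 1000, 3, 0.5):.1f} GFLOP/s")  # 12.0 GFLOP/s
```

Comparing this number with the peak from the hardware specification gives the fraction of the machine that the kernel actually uses.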
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [systemds] mboehm7 commented on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison
Posted by GitBox <gi...@apache.org>.
mboehm7 commented on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723300314
Never mind, I read the results as ms, not seconds. Then I suspect what you see is the vector width of AVX512.
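To make the vector-width argument concrete, here is a minimal sketch of how theoretical peak double-precision throughput scales with the vector register width. The number of FMA units per core and the sustained clock are machine-specific; the values below are hypothetical inputs, not measurements of any CPU in this thread.

```python
def peak_gflops(cores: int, ghz: float, vector_bits: int, fma_units: int) -> float:
    """Theoretical peak double-precision GFLOP/s.

    lanes     = vector_bits / 64        (doubles per vector register)
    per cycle = lanes * fma_units * 2   (an FMA counts as multiply + add)
    """
    lanes = vector_bits // 64
    flops_per_cycle = lanes * fma_units * 2
    return cores * ghz * flops_per_cycle


# Hypothetical inputs: same core count and clock, only the vector width differs.
avx2 = peak_gflops(cores=8, ghz=2.0, vector_bits=256, fma_units=2)
avx512 = peak_gflops(cores=8, ghz=2.0, vector_bits=512, fma_units=2)
print(avx2, avx512, avx512 / avx2)  # 256.0 512.0 2.0
```

With everything else equal, doubling the vector width doubles the peak, which is the kind of gap a natively vectorized BLAS can exploit.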
[GitHub] [systemds] Baunsgaard commented on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison
Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723306282
On an AMD EPYC 7302 16-Core Processor:
```bash
Total elapsed time: 9.860 sec.
1 ba+* 8.542 3
Total elapsed time: 9.754 sec.
1 ba+* 8.410 3
Total elapsed time: 9.542 sec.
1 ba+* 8.217 3
Total elapsed time: 9.568 sec.
1 ba+* 8.349 3
Total elapsed time: 9.738 sec.
1 ba+* 8.330 3
244244.54 msec task-clock # 23.411 CPUs utilized ( +- 0.52% )
615634864321 cycles # 2.521 GHz ( +- 0.13% ) (33.32%)
1099917247652 instructions # 1.79 insn per cycle
Total elapsed time: 4.087 sec.
1 ba+* 2.758 3
Total elapsed time: 4.067 sec.
1 ba+* 2.701 3
Total elapsed time: 3.880 sec.
1 ba+* 2.656 3
Total elapsed time: 4.617 sec.
1 ba+* 3.300 3
Total elapsed time: 4.175 sec.
1 ba+* 2.841 3
50723.62 msec task-clock # 9.997 CPUs utilized ( +- 1.29% )
94475106229 cycles # 1.863 GHz ( +- 1.02% ) (33.37%)
214603521520 instructions # 2.27 insn per cycle
Total elapsed time: 4.084 sec.
1 ba+* 2.683 3
Total elapsed time: 4.100 sec.
1 ba+* 2.701 3
Total elapsed time: 4.030 sec.
1 ba+* 2.670 3
Total elapsed time: 3.828 sec.
1 ba+* 2.651 3
Total elapsed time: 3.976 sec.
1 ba+* 2.660 3
88602.91 msec task-clock # 17.962 CPUs utilized ( +- 0.33% )
165132494019 cycles # 1.864 GHz ( +- 0.68% ) (33.30%)
251221873378 instructions # 1.52 insn per cycle
```
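The per-run `ba+*` times can be averaged per group to compare configurations. The following is a sketch only: it assumes the heavy-hitter columns are rank, opcode, time in seconds, and count, and that each perf `task-clock` line closes one group of runs. The function name is hypothetical.

```python
import re
from statistics import mean

# Matches a heavy-hitter line such as "1 ba+* 8.542 3":
# rank, opcode, time in seconds, invocation count (column meaning is assumed).
HITTER = re.compile(r"^\s*\d+\s+ba\+\*\s+(\d+\.\d+)\s+\d+\s*$")


def mean_matmult_times(log: str) -> list[float]:
    """Return the mean ba+* time per group; a task-clock line closes a group."""
    means, current = [], []
    for line in log.splitlines():
        match = HITTER.match(line)
        if match:
            current.append(float(match.group(1)))
        elif "task-clock" in line and current:
            means.append(mean(current))
            current = []
    if current:  # trailing group without a perf summary
        means.append(mean(current))
    return means


# First two runs of the first two groups above, shortened for illustration.
sample = """\
Total elapsed time: 9.860 sec.
1 ba+* 8.542 3
Total elapsed time: 9.754 sec.
1 ba+* 8.410 3
244244.54 msec task-clock # 23.411 CPUs utilized ( +- 0.52% )
Total elapsed time: 4.087 sec.
1 ba+* 2.758 3
Total elapsed time: 4.067 sec.
1 ba+* 2.701 3
50723.62 msec task-clock # 9.997 CPUs utilized ( +- 1.29% )
"""
first_group, second_group = mean_matmult_times(sample)
print(f"speedup: {first_group / second_group:.2f}x")
```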
On a two-socket machine with 2x 28-core Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz:
```bash
Total elapsed time: 5.208 sec.
1 ba+* 4.210 3
Total elapsed time: 5.189 sec.
1 ba+* 4.234 3
Total elapsed time: 5.215 sec.
1 ba+* 4.255 3
Total elapsed time: 5.352 sec.
1 ba+* 4.333 3
Total elapsed time: 5.485 sec.
1 ba+* 4.321 3
403205.65 msec task-clock # 68.960 CPUs utilized ( +- 2.07% )
1045022715778 cycles # 2.592 GHz ( +- 2.16% ) (30.72%)
1111149539772 instructions # 1.06 insn per cycle ( +- 0.08% ) (38.41%)
Total elapsed time: 3.651 sec.
1 ba+* 2.165 3
Total elapsed time: 2.687 sec.
1 ba+* 1.535 3
Total elapsed time: 3.269 sec.
1 ba+* 1.924 3
Total elapsed time: 3.148 sec.
1 ba+* 1.818 3
Total elapsed time: 2.899 sec.
1 ba+* 1.547 3
131080.09 msec task-clock # 35.417 CPUs utilized ( +- 7.67% )
327562943703 cycles # 2.499 GHz ( +- 8.49% ) (30.69%)
176246317254 instructions # 0.54 insn per cycle ( +- 3.37% ) (38.37%)
Total elapsed time: 3.132 sec.
1 ba+* 1.490 3
Total elapsed time: 2.763 sec.
1 ba+* 1.505 3
Total elapsed time: 3.474 sec.
1 ba+* 1.524 3
Total elapsed time: 2.930 sec.
1 ba+* 1.516 3
Total elapsed time: 3.184 sec.
1 ba+* 1.662 3
153409.48 msec task-clock # 41.661 CPUs utilized ( +- 7.83% )
397930300980 cycles # 2.594 GHz ( +- 9.09% ) (30.66%)
250592134329 instructions # 0.63 insn per cycle ( +- 1.03% ) (38.33%)
```
[GitHub] [systemds] mboehm7 commented on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison
Posted by GitBox <gi...@apache.org>.
mboehm7 commented on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723317125
One difference that might affect performance here is the associativity of the L2 cache - the Intel 6238R setup is 16-way, the AMD EPYC 7302 is 8-way, and your i9-9980HK is 4-way. This might matter because our dense mm cache blocking largely optimizes for common L2 sizes and behavior.
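For readers unfamiliar with cache blocking, the idea is to multiply tile by tile so that the working set of each step stays in cache. The following is a generic loop-tiling sketch, not the SystemDS implementation; the function name and the `block` parameter are hypothetical.

```python
def matmul_blocked(a, b, block=64):
    """Dense matrix multiply with square loop tiling (cache blocking).

    a is m x k, b is k x n, both lists of lists. `block` is the tile edge;
    in a tuned kernel it would be chosen so the working tiles fit in cache.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for k0 in range(0, k, block):
            for j0 in range(0, n, block):
                # Multiply one tile of a with one tile of b.
                for i in range(i0, min(i0 + block, m)):
                    row_c, row_a = c[i], a[i]
                    for kk in range(k0, min(k0 + block, k)):
                        aik, row_b = row_a[kk], b[kk]
                        for j in range(j0, min(j0 + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
print(matmul_blocked(a, b, block=2))  # [[58.0, 64.0], [139.0, 154.0]]
```

A block size tuned for one cache size and associativity can cause more conflict misses on a cache with fewer ways, which is one plausible reason the same kernel behaves differently across the three machines.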
[GitHub] [systemds] Baunsgaard commented on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison
Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723297664
On my laptop I get the following,
of which the first 3 are normal, then 3 MKL, and then 3 OpenBLAS.
If anyone would like to review, I would like measurements from your machine as well.
It should be as easy as running the script scripts/perftest/runAll.sh
```bash
Total elapsed time: 21.939 sec.
1 ba+* 21.214 3
Total elapsed time: 23.765 sec.
1 ba+* 22.940 3
Total elapsed time: 24.867 sec.
1 ba+* 24.023 3
Total elapsed time: 26.128 sec.
1 ba+* 25.239 3
Total elapsed time: 25.375 sec.
1 ba+* 24.495 3
342.403,88 msec task-clock # 13,748 CPUs utilized ( +- 3,22% )
911.071.333.279 cycles # 2,661 GHz ( +- 0,60% ) (30,77%)
1.099.760.123.229 instructions # 1,21 insn per cycle ( +- 0,03% ) (38,47%)
Total elapsed time: 4.513 sec.
1 ba+* 3.616 3
Total elapsed time: 4.465 sec.
1 ba+* 3.579 3
Total elapsed time: 4.542 sec.
1 ba+* 3.594 3
Total elapsed time: 4.808 sec.
1 ba+* 3.966 3
Total elapsed time: 4.493 sec.
1 ba+* 3.611 3
34.108,67 msec task-clock # 6,727 CPUs utilized ( +- 1,57% )
85.121.161.235 cycles # 2,496 GHz ( +- 1,27% ) (30,78%)
202.393.564.071 instructions # 2,38 insn per cycle ( +- 0,59% ) (38,46%)
Total elapsed time: 5.259 sec.
1 ba+* 4.333 3
Total elapsed time: 5.359 sec.
1 ba+* 4.471 3
Total elapsed time: 5.086 sec.
1 ba+* 4.175 3
Total elapsed time: 5.277 sec.
1 ba+* 4.316 3
Total elapsed time: 5.495 sec.
1 ba+* 4.493 3
71.032,45 msec task-clock # 12,188 CPUs utilized ( +- 0,96% )
158.323.022.360 cycles # 2,229 GHz ( +- 1,12% ) (30,79%)
232.735.293.658 instructions # 1,47 insn per cycle ( +- 1,33% ) (38,47%)
```
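The columns that perf prints after `#` can be recomputed from the raw counters, which is handy when comparing runs by hand. A minimal sketch, assuming the laptop output above uses `.` as thousands separator and `,` as decimal mark; both helper names are hypothetical.

```python
def parse_number(text: str) -> float:
    """Parse a number printed with '.' as thousands separator and ',' as decimal mark."""
    return float(text.replace(".", "").replace(",", "."))


def perf_derived(task_clock_ms: float, cycles: float, instructions: float) -> dict:
    """Recompute the clock rate and instructions per cycle from raw counters."""
    cpu_seconds = task_clock_ms / 1000.0
    return {
        "ghz": cycles / cpu_seconds / 1e9,
        "insn_per_cycle": instructions / cycles,
    }


# Counters of the first group in the laptop results.
derived = perf_derived(
    task_clock_ms=parse_number("342.403,88"),
    cycles=parse_number("911.071.333.279"),
    instructions=parse_number("1.099.760.123.229"),
)
print(f"{derived['ghz']:.3f} GHz, {derived['insn_per_cycle']:.2f} insn per cycle")
# prints "2.661 GHz, 1.21 insn per cycle"
```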
[GitHub] [systemds] Baunsgaard commented on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison
Posted by GitBox <gi...@apache.org>.
Baunsgaard commented on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723307897
> I would recommend running it with larger matrices and/or many more operations to give the JIT a chance for tiered (cold, warm, hot) just-in-time compilation into native code.
I did try larger matrices, but the JIT compilation times did not increase.
Since a 5k x 5k matrix multiplication is this fast (the CPUs take 1 to 8 seconds to do 3 reps), is it big enough if we do more than 3 repetitions? Or would you rather suggest using larger matrices and fewer repetitions? We could also do both.
[GitHub] [systemds] mboehm7 commented on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison
Posted by GitBox <gi...@apache.org>.
mboehm7 commented on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723299484
I would recommend running it with larger matrices and/or many more operations to give the JIT a chance for tiered (cold, warm, hot) just-in-time compilation into native code.
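The general pattern behind this recommendation is to run some untimed warm-up repetitions before measuring. The sketch below shows only the harness shape in Python; CPython itself does not JIT-compile in this way, and the function name and parameters are hypothetical, not part of the perftest scripts.

```python
import time


def benchmark(fn, warmup: int = 3, reps: int = 10) -> float:
    """Return the mean wall-clock seconds per call, excluding warm-up calls."""
    for _ in range(warmup):
        fn()  # untimed: gives a JIT-compiled runtime time to optimize hot code
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - start) / reps


calls = []
mean_s = benchmark(lambda: calls.append(sum(range(10_000))), warmup=2, reps=5)
print(f"{len(calls)} calls, {mean_s:.6f} s per timed call")
```

Whether extra warm-up matters depends on how long a single operation runs; for multi-second operations most compilation is likely to happen within the first repetition.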
[GitHub] [systemds] mboehm7 commented on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison
Posted by GitBox <gi...@apache.org>.
mboehm7 commented on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723311181
No, your experiments are fine - I misread it as milliseconds. So JIT will already happen during the first matrix multiplication.
For me the take-away is that we should have a look into the influence of the higher turbo-frequency, different cache sizes and other characteristics of your laptop to see if there is something we can do about it.
[GitHub] [systemds] mboehm7 edited a comment on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison
Posted by GitBox <gi...@apache.org>.
mboehm7 edited a comment on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723300314
Never mind, I read the results as ms, not seconds. Then I suspect what you see is the vector width of AVX512.
[GitHub] [systemds] Baunsgaard closed pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison
Posted by GitBox <gi...@apache.org>.
Baunsgaard closed pull request #1095:
URL: https://github.com/apache/systemds/pull/1095
[GitHub] [systemds] mboehm7 edited a comment on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison
Posted by GitBox <gi...@apache.org>.
mboehm7 edited a comment on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723311181
No, your experiments are fine - I misread it as milliseconds without looking at the dimensions. So JIT will already happen during the first matrix multiplication.
For me the take-away is that we should have a look into the influence of the higher turbo-frequency, different cache sizes and other characteristics of your laptop to see if there is something we can do about it.