
Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2020/11/06 14:31:59 UTC

[GitHub] [systemds] Baunsgaard opened a new pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison

Baunsgaard opened a new pull request #1095:
URL: https://github.com/apache/systemds/pull/1095


   This commit initializes the performance test suite with a micro benchmark
   comparing the performance of MKL and the default matrix multiplication.
   The construction allows easy execution on different hardware platforms
   to measure FP-OPS and compare the achieved throughput against the
   theoretical peak in the hardware specification.
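
As a rough illustration of the comparison described above, the following sketch computes the achieved throughput of a dense matrix multiplication and relates it to a theoretical peak. The matrix size (5000 x 5000), the 3 repetitions, and the 2.758 s timing are taken from later comments in this thread. The peak formula, the 3.0 GHz clock, the 16 FLOPs per cycle per core (two 256-bit FMA units operating on doubles), and all function names are assumptions for illustration and are not part of the perftest suite.

```python
def matmul_flops(n, reps):
    """Floating-point operations for `reps` dense (n x n) * (n x n) products."""
    # Each of the n^3 inner-product terms costs one multiply and one add.
    return 2 * n**3 * reps


def achieved_gflops(n, reps, seconds):
    """Measured throughput in GFLOP/s for the given wall-clock time."""
    return matmul_flops(n, reps) / seconds / 1e9


def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOP/s; all three inputs come from the spec sheet."""
    return cores * clock_ghz * flops_per_cycle


# Assumed inputs: 5000 x 5000 matrices, 3 repetitions, 2.758 s for ba+*.
achieved = achieved_gflops(n=5000, reps=3, seconds=2.758)

# Assumed hardware figures (hypothetical, check the vendor specification).
peak = peak_gflops(cores=16, clock_ghz=3.0, flops_per_cycle=16)

print(f"achieved {achieved:.1f} GFLOP/s, peak {peak:.1f} GFLOP/s, "
      f"efficiency {achieved / peak:.1%}")
```

The efficiency figure is only as good as the assumed clock and FLOPs-per-cycle values, which differ between base and turbo frequency and between vector instruction sets.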


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [systemds] mboehm7 commented on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison

Posted by GitBox <gi...@apache.org>.

mboehm7 commented on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723300314


   Never mind, I read the results as ms, not seconds. Then I suspect what you see is the vector width of AVX512.
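
One rough way to check this hypothesis against the perf counters reported elsewhere in this thread is to estimate how many floating-point operations are performed per retired instruction. The sketch below assumes 5000 x 5000 dense inputs and 3 multiplications per run, that the instruction counts are per-run averages, and that the three groups appear in the order default, MKL, OpenBLAS as stated for the laptop results. The counts also include JVM startup and other non-matmul work, so the ratios are only indicative.

```python
# Assumed workload: three dense 5000 x 5000 matrix multiplications per run.
FLOPS_PER_RUN = 2 * 5000**3 * 3

# Instruction counts copied from the perf output of the Intel Xeon Gold 6238R
# runs; the group labels are an assumption based on the stated run order.
instructions = {
    "default": 1_111_149_539_772,
    "mkl": 176_246_317_254,
    "openblas": 250_592_134_329,
}


def flops_per_instruction(instruction_count):
    """Average floating-point operations per retired instruction."""
    return FLOPS_PER_RUN / instruction_count


ratios = {name: flops_per_instruction(count)
          for name, count in instructions.items()}

for name, ratio in ratios.items():
    print(f"{name:9s} {ratio:5.2f} FLOPs per instruction")
```

A ratio well above 1 is consistent with wide SIMD/FMA instructions doing most of the work, while a ratio below 1 suggests mostly scalar code plus loop overhead.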



[GitHub] [systemds] Baunsgaard commented on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison

Posted by GitBox <gi...@apache.org>.

Baunsgaard commented on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723306282


   On an AMD EPYC 7302 16-Core Processor:
   
   ```bash
   Total elapsed time:		9.860 sec.
    1  ba+*           8.542      3
   Total elapsed time:		9.754 sec.
    1  ba+*           8.410      3
   Total elapsed time:		9.542 sec.
    1  ba+*           8.217      3
   Total elapsed time:		9.568 sec.
    1  ba+*           8.349      3
   Total elapsed time:		9.738 sec.
    1  ba+*           8.330      3
            244244.54 msec task-clock                #   23.411 CPUs utilized            ( +-  0.52% )
         615634864321      cycles                    #    2.521 GHz                      ( +-  0.13% )  (33.32%)
        1099917247652      instructions              #    1.79  insn per cycle         
   Total elapsed time:		4.087 sec.
    1  ba+*           2.758      3
   Total elapsed time:		4.067 sec.
    1  ba+*           2.701      3
   Total elapsed time:		3.880 sec.
    1  ba+*           2.656      3
   Total elapsed time:		4.617 sec.
    1  ba+*           3.300      3
   Total elapsed time:		4.175 sec.
    1  ba+*           2.841      3
             50723.62 msec task-clock                #    9.997 CPUs utilized            ( +-  1.29% )
          94475106229      cycles                    #    1.863 GHz                      ( +-  1.02% )  (33.37%)
         214603521520      instructions              #    2.27  insn per cycle         
   Total elapsed time:		4.084 sec.
    1  ba+*           2.683      3
   Total elapsed time:		4.100 sec.
    1  ba+*           2.701      3
   Total elapsed time:		4.030 sec.
    1  ba+*           2.670      3
   Total elapsed time:		3.828 sec.
    1  ba+*           2.651      3
   Total elapsed time:		3.976 sec.
    1  ba+*           2.660      3
             88602.91 msec task-clock                #   17.962 CPUs utilized            ( +-  0.33% )
         165132494019      cycles                    #    1.864 GHz                      ( +-  0.68% )  (33.30%)
         251221873378      instructions              #    1.52  insn per cycle         
   
   ```
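
To compare the groups in output like the above, the `ba+*` timings can be averaged per group and turned into speedups. The sketch below is a minimal example under the assumption that the three groups are, in order, the default implementation, MKL, and OpenBLAS (the order stated for the laptop results); the regular expression and the helper name `ba_times` are hypothetical and not part of the perftest scripts.

```python
import re
from statistics import mean

# Matches lines such as " 1  ba+*           8.542      3"
# (rank, opcode, time in seconds, invocation count).
BA_LINE = re.compile(r"^\s*\d+\s+ba\+\*\s+([0-9.]+)\s+\d+\s*$")


def ba_times(log_text):
    """Return all ba+* times (in seconds) found in a block of output."""
    matches = (BA_LINE.match(line) for line in log_text.splitlines())
    return [float(m.group(1)) for m in matches if m]


# ba+* lines copied from the AMD EPYC 7302 output, five runs per group.
default_log = """
Total elapsed time: 9.860 sec.
 1  ba+*           8.542      3
 1  ba+*           8.410      3
 1  ba+*           8.217      3
 1  ba+*           8.349      3
 1  ba+*           8.330      3
"""
mkl_log = """
 1  ba+*           2.758      3
 1  ba+*           2.701      3
 1  ba+*           2.656      3
 1  ba+*           3.300      3
 1  ba+*           2.841      3
"""
openblas_log = """
 1  ba+*           2.683      3
 1  ba+*           2.701      3
 1  ba+*           2.670      3
 1  ba+*           2.651      3
 1  ba+*           2.660      3
"""

default_mean = mean(ba_times(default_log))
mkl_mean = mean(ba_times(mkl_log))
openblas_mean = mean(ba_times(openblas_log))

print(f"default  {default_mean:.3f} s")
print(f"MKL      {mkl_mean:.3f} s  ({default_mean / mkl_mean:.2f}x)")
print(f"OpenBLAS {openblas_mean:.3f} s  ({default_mean / openblas_mean:.2f}x)")
```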
   
   On a two-socket machine with 2x 28-core Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz:
   ```bash
   Total elapsed time:		5.208 sec.
    1  ba+*           4.210      3
   Total elapsed time:		5.189 sec.
    1  ba+*           4.234      3
   Total elapsed time:		5.215 sec.
    1  ba+*           4.255      3
   Total elapsed time:		5.352 sec.
    1  ba+*           4.333      3
   Total elapsed time:		5.485 sec.
    1  ba+*           4.321      3
            403205.65 msec task-clock                #   68.960 CPUs utilized            ( +-  2.07% )
        1045022715778      cycles                    #    2.592 GHz                      ( +-  2.16% )  (30.72%)
        1111149539772      instructions              #    1.06  insn per cycle           ( +-  0.08% )  (38.41%)
   Total elapsed time:		3.651 sec.
    1  ba+*           2.165      3
   Total elapsed time:		2.687 sec.
    1  ba+*           1.535      3
   Total elapsed time:		3.269 sec.
    1  ba+*           1.924      3
   Total elapsed time:		3.148 sec.
    1  ba+*           1.818      3
   Total elapsed time:		2.899 sec.
    1  ba+*           1.547      3
            131080.09 msec task-clock                #   35.417 CPUs utilized            ( +-  7.67% )
         327562943703      cycles                    #    2.499 GHz                      ( +-  8.49% )  (30.69%)
         176246317254      instructions              #    0.54  insn per cycle           ( +-  3.37% )  (38.37%)
   Total elapsed time:		3.132 sec.
    1  ba+*           1.490      3
   Total elapsed time:		2.763 sec.
    1  ba+*           1.505      3
   Total elapsed time:		3.474 sec.
    1  ba+*           1.524      3
   Total elapsed time:		2.930 sec.
    1  ba+*           1.516      3
   Total elapsed time:		3.184 sec.
    1  ba+*           1.662      3
            153409.48 msec task-clock                #   41.661 CPUs utilized            ( +-  7.83% )
         397930300980      cycles                    #    2.594 GHz                      ( +-  9.09% )  (30.66%)
         250592134329      instructions              #    0.63  insn per cycle           ( +-  1.03% )  (38.33%)
   ```



[GitHub] [systemds] mboehm7 commented on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison

Posted by GitBox <gi...@apache.org>.

mboehm7 commented on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723317125


   One difference that might affect performance here is the associativity of the L2 cache: the Intel 6238R setup has 16-way, the AMD EPYC 7302 has 8-way, and your i9-9980HK has 4-way associativity. This might matter because our dense mm cache blocking largely optimizes for common L2 sizes and behavior.
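
For readers unfamiliar with cache blocking, the following is a minimal sketch of a blocked dense matrix multiplication whose tile size is derived from an L2 budget. It is not the SystemDS implementation; the sizing rule (three square tiles of doubles resident at once) and the L2 sizes used in the example are assumptions. The sketch sizes tiles by capacity only and ignores associativity, which is exactly the effect a size-only heuristic can miss.

```python
import math


def block_size_for_l2(l2_bytes, elem_bytes=8, resident_blocks=3):
    """Largest b such that `resident_blocks` b x b tiles fit in `l2_bytes`."""
    return max(1, math.isqrt(l2_bytes // (resident_blocks * elem_bytes)))


def blocked_matmul(a, b, block):
    """Multiply row-major list-of-lists matrices a (n x k) and b (k x m)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Iterate over tiles so each tile of a, b and c is reused while cached.
    for i0 in range(0, n, block):
        for k0 in range(0, k, block):
            for j0 in range(0, m, block):
                for i in range(i0, min(i0 + block, n)):
                    row_c, row_a = c[i], a[i]
                    for kk in range(k0, min(k0 + block, k)):
                        aik, row_b = row_a[kk], b[kk]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += aik * row_b[j]
    return c


# Assumed per-core L2 capacities, for illustration only.
for label, l2_bytes in [("256 KiB", 256 * 1024), ("1 MiB", 1024 * 1024)]:
    print(f"L2 {label}: tile size {block_size_for_l2(l2_bytes)}")

result = blocked_matmul([[1, 2, 3], [4, 5, 6]],
                        [[7, 8], [9, 10], [11, 12]], block=2)
print(result)
```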



[GitHub] [systemds] Baunsgaard commented on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison

Posted by GitBox <gi...@apache.org>.

Baunsgaard commented on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723297664


   On my laptop I get the following,
   of which the first 3 are normal, then 3 MKL, and then 3 OpenBLAS.
   If anyone would like to review, I would like measurements from your machine as well.
   It should be as easy as running the script in scripts/perftest/runAll.sh
   
   ```bash
   Total elapsed time:		21.939 sec.
    1  ba+*          21.214      3
   Total elapsed time:		23.765 sec.
    1  ba+*          22.940      3
   Total elapsed time:		24.867 sec.
    1  ba+*          24.023      3
   Total elapsed time:		26.128 sec.
    1  ba+*          25.239      3
   Total elapsed time:		25.375 sec.
    1  ba+*          24.495      3
           342.403,88 msec task-clock                #   13,748 CPUs utilized            ( +-  3,22% )
      911.071.333.279      cycles                    #    2,661 GHz                      ( +-  0,60% )  (30,77%)
    1.099.760.123.229      instructions              #    1,21  insn per cycle           ( +-  0,03% )  (38,47%)
   Total elapsed time:		4.513 sec.
    1  ba+*           3.616      3
   Total elapsed time:		4.465 sec.
    1  ba+*           3.579      3
   Total elapsed time:		4.542 sec.
    1  ba+*           3.594      3
   Total elapsed time:		4.808 sec.
    1  ba+*           3.966      3
   Total elapsed time:		4.493 sec.
    1  ba+*           3.611      3
            34.108,67 msec task-clock                #    6,727 CPUs utilized            ( +-  1,57% )
       85.121.161.235      cycles                    #    2,496 GHz                      ( +-  1,27% )  (30,78%)
      202.393.564.071      instructions              #    2,38  insn per cycle           ( +-  0,59% )  (38,46%)
   Total elapsed time:		5.259 sec.
    1  ba+*           4.333      3
   Total elapsed time:		5.359 sec.
    1  ba+*           4.471      3
   Total elapsed time:		5.086 sec.
    1  ba+*           4.175      3
   Total elapsed time:		5.277 sec.
    1  ba+*           4.316      3
   Total elapsed time:		5.495 sec.
    1  ba+*           4.493      3
            71.032,45 msec task-clock                #   12,188 CPUs utilized            ( +-  0,96% )
      158.323.022.360      cycles                    #    2,229 GHz                      ( +-  1,12% )  (30,79%)
      232.735.293.658      instructions              #    1,47  insn per cycle           ( +-  1,33% )  (38,47%)
   ```



[GitHub] [systemds] Baunsgaard commented on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison

Posted by GitBox <gi...@apache.org>.

Baunsgaard commented on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723307897


   > I would recommend running it with larger matrices and/or many more operations to give the JIT a chance for tiered (cold, warm, hot) just-in-time compilation into native code.
   
   I did try larger matrices, but the JIT compilation times did not increase.
   
   A 5k x 5k matrix multiplication is fast on these CPUs: 3 repetitions take 1 to 8 seconds. Is that big enough if we do more than 3 repetitions? Or would you rather suggest using larger matrices and fewer repetitions? We could also do both.



[GitHub] [systemds] mboehm7 commented on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison

Posted by GitBox <gi...@apache.org>.

mboehm7 commented on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723299484


   I would recommend running it with larger matrices and/or many more operations to give the JIT a chance for tiered (cold, warm, hot) just-in-time compilation into native code.
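
The measurement pattern behind this suggestion can be sketched as a small timing harness that runs a few untimed warm-up iterations before the timed repetitions. This is only an illustration of the pattern: Python has no tiered JIT like the JVM, and the `benchmark` helper, the `workload` function, and the warm-up and repetition counts are hypothetical.

```python
import time


def benchmark(fn, warmup=2, reps=5):
    """Run `fn` untimed `warmup` times, then return `reps` timed durations."""
    for _ in range(warmup):
        fn()  # results discarded; gives a JIT (if any) time to compile
    timings = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def workload():
    """Stand-in for the operation under test (hypothetical)."""
    return sum(i * i for i in range(100_000))


timings = benchmark(workload, warmup=2, reps=5)
print(f"min {min(timings):.4f} s, mean {sum(timings) / len(timings):.4f} s")
```

Reporting both the minimum and the mean makes it easier to see whether the first timed repetitions are still affected by compilation or cache effects.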



[GitHub] [systemds] mboehm7 commented on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison

Posted by GitBox <gi...@apache.org>.

mboehm7 commented on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723311181


   No, your experiments are fine - I misread it as milliseconds. So JIT will already happen during the first matrix multiplication.
   
   For me the take-away is that we should look into the influence of the higher turbo-frequency, different cache sizes and other characteristics of your laptop to see if there is something we can do about it.
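
When comparing clock behavior across machines, the effective frequency and instructions per cycle can be recomputed directly from the raw perf counters. The sketch below uses the first group of the laptop output from this thread, whose numbers are printed with locale-specific separators ("." for thousands, "," for decimals). The helper names are hypothetical, and since perf multiplexed the counters in these runs, the counts are themselves scaled estimates.

```python
def parse_localized(text):
    """Parse numbers such as '342.403,88' or '911.071.333.279'."""
    return float(text.replace(".", "").replace(",", "."))


def effective_ghz(cycles, task_clock_msec):
    """Average clock in GHz over the CPU time actually spent running."""
    return cycles / (task_clock_msec * 1e-3) / 1e9


def instructions_per_cycle(instructions, cycles):
    return instructions / cycles


# Counters copied from the first group of the laptop perf output.
task_clock_msec = parse_localized("342.403,88")
cycles = parse_localized("911.071.333.279")
instructions = parse_localized("1.099.760.123.229")

ghz = effective_ghz(cycles, task_clock_msec)
ipc = instructions_per_cycle(instructions, cycles)
print(f"{ghz:.3f} GHz, {ipc:.2f} insn per cycle")
```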



[GitHub] [systemds] mboehm7 edited a comment on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison

Posted by GitBox <gi...@apache.org>.

mboehm7 edited a comment on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723300314


   never mind, I read the results as ms not seconds. Then I suspect what you see is the vector width of AVX512.




[GitHub] [systemds] Baunsgaard closed pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison

Posted by GitBox <gi...@apache.org>.

Baunsgaard closed pull request #1095:
URL: https://github.com/apache/systemds/pull/1095


   



[GitHub] [systemds] mboehm7 edited a comment on pull request #1095: [SYSTEMDS-2718] Matrix Mult Accelerator Comparison

Posted by GitBox <gi...@apache.org>.

mboehm7 edited a comment on pull request #1095:
URL: https://github.com/apache/systemds/pull/1095#issuecomment-723311181


   No, your experiments are fine - I misread it as milliseconds without looking at the dimensions. So JIT will already happen during the first matrix multiplication.
   
   For me the take-away is that we should look into the influence of the higher turbo-frequency, different cache sizes and other characteristics of your laptop to see if there is something we can do about it.

