Posted to dev@systemml.apache.org by Rana Alotaibi <ra...@eng.ucsd.edu> on 2020/06/19 04:00:14 UTC

Performance Question

Hi SystemML developers,

I have been experimenting with SystemML lately and found some interesting
numbers when evaluating these two expressions: *t(M%*%N)* and *t(N)%*%t(M)*
(the size of matrix M is 20Kx100 and N is 100x20K). The pipeline
*t(N)%*%t(M)* drastically outperforms *t(M%*%N)*.

   - *t(M%*%N)*
      - The multiplication took ~2.53 secs
      - The transpose took ~1.41 secs (the size of the intermediate
      *M%*%N* is 20Kx20K)
   - *t(N)%*%t(M)*
      - The multiplication took ~0.002 secs (given that the size of t(N)
      is 20Kx100 and t(M) is 100x20K)
      - The total transpose time for both matrices is 0.023 secs

So I'm interested to know why the matrix multiplication in the pipeline
*t(N)%*%t(M)* is faster than in *t(M%*%N)*, and, in general, why
*t(N)%*%t(M)* is faster than *t(M%*%N)*.

*Experiment Configuration:*

   - I'm using systemml-1.2.0 (binary tgz) on a single node.
   - CPU: E5-2640 v4 @ 2.40GHz, 40 cores
   - RAM: 123GB

   *t(N)%*%t(M) DML Script*
    | time = externalFunction(Integer i) return (Double B)
    |     implemented in (classname="org.apache.sysml.udf.lib.TimeWrapper",
    |     exectype="mem");
    | times = matrix(0.0, rows = 5, cols = 1)
    | M = rand(rows = 20000, cols = 100, pdf = 'uniform')
    | N = rand(rows = 100, cols = 20000, pdf = 'uniform')
    | for (ix in 1:5) {
    |   start = time(1)
    |   print("StartTime: "+start)
    |   res = t(N)%*%t(M)
    |   print(as.scalar(res[1,1]))
    |   stop = time(1)
    |   print("EndTime:- "+stop)
    |   pipTime = (stop - start) / 1000
    |   print("Time:"+pipTime)
    |   times[ix,1] = pipTime
    | }
    | times = t(times)

   *t(M%*%N) DML Script*
    | time = externalFunction(Integer i) return (Double B)
    |     implemented in (classname="org.apache.sysml.udf.lib.TimeWrapper",
    |     exectype="mem");
    | times = matrix(0.0, rows = 5, cols = 1)
    | M = rand(rows = 20000, cols = 100, pdf = 'uniform')
    | N = rand(rows = 100, cols = 20000, pdf = 'uniform')
    | for (ix in 1:5) {
    |   start = time(1)
    |   print("StartTime: "+start)
    |   res = t(M%*%N)
    |   print(as.scalar(res[1,1]))
    |   end = time(1)
    |   print("EndTime: "+end)
    |   pipTime = (end - start)/1000
    |   print("Time:"+pipTime)
    |   times[ix,1] = pipTime
    | }
    | times = t(times)

*t(N)%*%t(M) RUNTIME EXPLAIN*
# Memory Budget local/remote = 63716MB/140MB/140MB
# Degree of Parallelism (vcores) local/remote = 40/1/1
PROGRAM ( size CP/MR = 58/0 )
--FUNCTIONS
----FUNCTION CALL GRAPH
------MAIN PROGRAM
--------.defaultNS::time
----EXTERNAL FUNCTION .defaultNS::time
--MAIN PROGRAM
----GENERIC (lines 4-6) [recompile=false]
------CP createvar _mVar0 scratch_space//_p32870_192.168.250.59//_t0/temp0
true MATRIX binaryblock 5 1 1000 1000 0 copy
------CP rand 5.SCALAR.INT.true 1.SCALAR.INT.true 1000 1000 0.0 0.0 1.0 -1
uniform 1.0 40 _mVar0.MATRIX.DOUBLE
------CP createvar _mVar1 scratch_space//_p32870_192.168.250.59//_t0/temp1
true MATRIX binaryblock 20000 100 1000 1000 2000000 copy
------CP rand 20000.SCALAR.INT.true 100.SCALAR.INT.true 1000 1000 0.0 1.0
1.0 -1 uniform 1.0 40 _mVar1.MATRIX.DOUBLE
------CP createvar _mVar2 scratch_space//_p32870_192.168.250.59//_t0/temp2
true MATRIX binaryblock 100 20000 1000 1000 2000000 copy
------CP rand 100.SCALAR.INT.true 20000.SCALAR.INT.true 1000 1000 0.0 1.0
1.0 -1 uniform 1.0 40 _mVar2.MATRIX.DOUBLE
------CP cpvar _mVar0 times
------CP cpvar _mVar1 M
------CP cpvar _mVar2 N
------CP rmvar _mVar0 _mVar1 _mVar2
----FOR (lines 0-0) [in-place=[times]]
------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT
------CP assignvar 5.SCALAR.INT.true __pred.SCALAR.INT
------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT
------GENERIC (lines 8-8) [recompile=false]
--------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true start
------GENERIC (lines 9-12) [recompile=false]
--------CP + StartTime: .SCALAR.STRING.true start.SCALAR.DOUBLE.false
_Var3.SCALAR.STRING
--------CP createvar _mVar4
scratch_space//_p32870_192.168.250.59//_t0/temp3 true MATRIX binarycell
20000 100 -1 -1 2000000 copy
--------CP r' N.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE 40
--------CP createvar _mVar5
scratch_space//_p32870_192.168.250.59//_t0/temp4 true MATRIX binarycell 1
100 -1 -1 -1 copy
--------CP rightIndex _mVar4.MATRIX.DOUBLE 1.SCALAR.INT.true
1.SCALAR.INT.true 1.SCALAR.INT.true 100.SCALAR.INT.true _mVar5.MATRIX.DOUBLE
--------CP rmvar _mVar4
--------CP createvar _mVar6
scratch_space//_p32870_192.168.250.59//_t0/temp5 true MATRIX binarycell 100
20000 -1 -1 2000000 copy
--------CP r' M.MATRIX.DOUBLE _mVar6.MATRIX.DOUBLE 40
--------CP createvar _mVar7
scratch_space//_p32870_192.168.250.59//_t0/temp6 true MATRIX binarycell 100
1 -1 -1 -1 copy
--------CP rightIndex _mVar6.MATRIX.DOUBLE 1.SCALAR.INT.true
100.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar7.MATRIX.DOUBLE
--------CP rmvar _mVar6
--------CP createvar _mVar8
scratch_space//_p32870_192.168.250.59//_t0/temp7 true MATRIX binarycell 1 1
-1 -1 -1 copy
--------CP ba+* _mVar5.MATRIX.DOUBLE _mVar7.MATRIX.DOUBLE
_mVar8.MATRIX.DOUBLE 40
--------CP rmvar _mVar5 _mVar7
--------CP createvar _mVar9
scratch_space//_p32870_192.168.250.59//_t0/temp8 true MATRIX binarycell 1 1
-1 -1 -1 copy
--------CP rightIndex _mVar8.MATRIX.DOUBLE 1.SCALAR.INT.true
1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar9.MATRIX.DOUBLE
--------CP rmvar _mVar8
--------CP castdts _mVar9.MATRIX.DOUBLE.false _Var10.SCALAR.STRING
--------CP rmvar _mVar9
--------CP print _Var3.SCALAR.STRING.false _Var11.SCALAR.STRING
--------CP rmvar _Var3
--------CP print _Var10.SCALAR.STRING.false _Var12.SCALAR.STRING
--------CP rmvar _Var10
--------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true stop
--------CP rmvar _Var11 _Var12
------GENERIC (lines 13-16) [recompile=false]
--------CP + EndTime:- .SCALAR.STRING.true stop.SCALAR.DOUBLE.false
_Var13.SCALAR.STRING
--------CP - stop.SCALAR.DOUBLE.false start.SCALAR.DOUBLE.false
_Var14.SCALAR.DOUBLE
--------CP / _Var14.SCALAR.DOUBLE.false 1000.SCALAR.INT.true
_Var15.SCALAR.DOUBLE
--------CP rmvar _Var14
--------CP + Time:.SCALAR.STRING.true _Var15.SCALAR.DOUBLE.false
_Var16.SCALAR.STRING
--------CP createvar _mVar17
scratch_space//_p32870_192.168.250.59//_t0/temp9 true MATRIX binarycell 5 1
-1 -1 -1 copy
--------CP leftIndex times.MATRIX.DOUBLE _Var15.SCALAR.DOUBLE.false
ix.SCALAR.INT.false ix.SCALAR.INT.false 1.SCALAR.INT.true 1.SCALAR.INT.true
_mVar17.MATRIX.DOUBLE
--------CP rmvar _Var15
--------CP print _Var13.SCALAR.STRING.false _Var18.SCALAR.STRING
--------CP rmvar _Var13
--------CP print _Var16.SCALAR.STRING.false _Var19.SCALAR.STRING
--------CP rmvar _Var16 times _Var18 _Var19
--------CP cpvar _mVar17 times
--------CP rmvar _mVar17 start stop
----GENERIC (lines 18-19) [recompile=false]
------CP createvar _mVar20
scratch_space//_p32870_192.168.250.59//_t0/temp10 true MATRIX binarycell 1
5 -1 -1 -1 copy
------CP r' times.MATRIX.DOUBLE _mVar20.MATRIX.DOUBLE 40
------CP rmvar times
------CP cpvar _mVar20 times
------CP rmvar _mVar20 times

SystemML Statistics:
Total elapsed time: 1.178 sec.
Total compilation time: 0.591 sec.
Total execution time: 0.587 sec.
Number of compiled MR Jobs: 0.
Number of executed MR Jobs: 0.
Cache hits (Mem, WB, FS, HDFS): 47/0/0/0.
Cache writes (WB, FS, HDFS): 12/0/0.
Cache times (ACQr/m, RLS, EXP): 0.001/0.001/0.002/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0.000 sec.
Total JIT compile time: 29.332 sec.
Total JVM GC count: 8.
Total JVM GC time: 0.38 sec.
Heavy hitter instructions:
  #  Instruction  Time(s)  Count
  1  rand           0.307      3
  2  r'             0.256     11
  3  ba+*           0.010      5
  4  time           0.004     10
  5  rmvar          0.003     67
  6  rightIndex     0.002     15
  7  print          0.001     20
  8  createvar      0.001     39
  9  leftIndex      0.000      5
 10  +              0.000     15

*t(M%*%N) RUNTIME EXPLAIN*
# Memory Budget local/remote = 63716MB/140MB/140MB
# Degree of Parallelism (vcores) local/remote = 40/1/1
PROGRAM ( size CP/MR = 50/0 )
--FUNCTIONS
----FUNCTION CALL GRAPH
------MAIN PROGRAM
--------.defaultNS::time
----EXTERNAL FUNCTION .defaultNS::time
--MAIN PROGRAM
----GENERIC (lines 4-6) [recompile=false]
------CP createvar _mVar0 scratch_space//_p32616_192.168.250.59//_t0/temp0
true MATRIX binaryblock 5 1 1000 1000 0 copy
------CP rand 5.SCALAR.INT.true 1.SCALAR.INT.true 1000 1000 0.0 0.0 1.0 -1
uniform 1.0 40 _mVar0.MATRIX.DOUBLE
------CP createvar _mVar1 scratch_space//_p32616_192.168.250.59//_t0/temp1
true MATRIX binaryblock 20000 100 1000 1000 2000000 copy
------CP rand 20000.SCALAR.INT.true 100.SCALAR.INT.true 1000 1000 0.0 1.0
1.0 -1 uniform 1.0 40 _mVar1.MATRIX.DOUBLE
------CP createvar _mVar2 scratch_space//_p32616_192.168.250.59//_t0/temp2
true MATRIX binaryblock 100 20000 1000 1000 2000000 copy
------CP rand 100.SCALAR.INT.true 20000.SCALAR.INT.true 1000 1000 0.0 1.0
1.0 -1 uniform 1.0 40 _mVar2.MATRIX.DOUBLE
------CP cpvar _mVar0 times
------CP cpvar _mVar1 M
------CP cpvar _mVar2 N
------CP rmvar _mVar0 _mVar1 _mVar2
----FOR (lines 0-0) [in-place=[times]]
------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT
------CP assignvar 5.SCALAR.INT.true __pred.SCALAR.INT
------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT
------GENERIC (lines 8-8) [recompile=false]
--------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true start
------GENERIC (lines 9-12) [recompile=false]
--------CP + StartTime: .SCALAR.STRING.true start.SCALAR.DOUBLE.false
_Var3.SCALAR.STRING
--------CP createvar _mVar4
scratch_space//_p32616_192.168.250.59//_t0/temp3 true MATRIX binarycell
20000 20000 -1 -1 -1 copy
--------CP ba+* M.MATRIX.DOUBLE N.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE 40
--------CP createvar _mVar5
scratch_space//_p32616_192.168.250.59//_t0/temp4 true MATRIX binarycell
20000 20000 -1 -1 -1 copy
--------CP r' _mVar4.MATRIX.DOUBLE _mVar5.MATRIX.DOUBLE 40
--------CP rmvar _mVar4
--------CP createvar _mVar6
scratch_space//_p32616_192.168.250.59//_t0/temp5 true MATRIX binarycell 1 1
-1 -1 -1 copy
--------CP rightIndex _mVar5.MATRIX.DOUBLE 1.SCALAR.INT.true
1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar6.MATRIX.DOUBLE
--------CP rmvar _mVar5
--------CP castdts _mVar6.MATRIX.DOUBLE.false _Var7.SCALAR.STRING
--------CP rmvar _mVar6
--------CP print _Var3.SCALAR.STRING.false _Var8.SCALAR.STRING
--------CP rmvar _Var3
--------CP print _Var7.SCALAR.STRING.false _Var9.SCALAR.STRING
--------CP rmvar _Var7
--------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true end
--------CP rmvar _Var8 _Var9
------GENERIC (lines 13-16) [recompile=false]
--------CP + EndTime: .SCALAR.STRING.true end.SCALAR.DOUBLE.false
_Var10.SCALAR.STRING
--------CP - end.SCALAR.DOUBLE.false start.SCALAR.DOUBLE.false
_Var11.SCALAR.DOUBLE
--------CP / _Var11.SCALAR.DOUBLE.false 1000.SCALAR.INT.true
_Var12.SCALAR.DOUBLE
--------CP rmvar _Var11
--------CP + Time:.SCALAR.STRING.true _Var12.SCALAR.DOUBLE.false
_Var13.SCALAR.STRING
--------CP createvar _mVar14
scratch_space//_p32616_192.168.250.59//_t0/temp6 true MATRIX binarycell 5 1
-1 -1 -1 copy
--------CP leftIndex times.MATRIX.DOUBLE _Var12.SCALAR.DOUBLE.false
ix.SCALAR.INT.false ix.SCALAR.INT.false 1.SCALAR.INT.true 1.SCALAR.INT.true
_mVar14.MATRIX.DOUBLE
--------CP rmvar _Var12
--------CP print _Var10.SCALAR.STRING.false _Var15.SCALAR.STRING
--------CP rmvar _Var10
--------CP print _Var13.SCALAR.STRING.false _Var16.SCALAR.STRING
--------CP rmvar _Var13 times _Var15 _Var16
--------CP cpvar _mVar14 times
--------CP rmvar _mVar14 start end
----GENERIC (lines 18-19) [recompile=false]
------CP createvar _mVar17 scratch_space//_p32616_192.168.250.59//_t0/temp7
true MATRIX binarycell 1 5 -1 -1 -1 copy
------CP r' times.MATRIX.DOUBLE _mVar17.MATRIX.DOUBLE 40
------CP rmvar times
------CP cpvar _mVar17 times
------CP rmvar _mVar17 times

SystemML Statistics:
Total elapsed time: 22.074 sec.
Total compilation time: 0.567 sec.
Total execution time: 21.507 sec.
Number of compiled MR Jobs: 0.
Number of executed MR Jobs: 0.
Cache hits (Mem, WB, FS, HDFS): 32/0/0/0.
Cache writes (WB, FS, HDFS): 12/0/0.
Cache times (ACQr/m, RLS, EXP): 0.000/0.001/0.003/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0.000 sec.
Total JIT compile time: 33.155 sec.
Total JVM GC count: 7.
Total JVM GC time: 0.279 sec.
Heavy hitter instructions:
  #  Instruction  Time(s)  Count
  1  ba+*          12.661      5
  2  r'             8.496      6
  3  rand           0.334      3
  4  time           0.006     10
  5  rmvar          0.003     57
  6  rightIndex     0.002      5
  7  print          0.001     20
  8  createvar      0.001     24
  9  +              0.000     15
 10  leftIndex      0.000      5

Thanks,
Rana

Re: Performance Question

Posted by Rana Alotaibi <ra...@eng.ucsd.edu>.
Thanks Matthias for your prompt response.

Rana

On Thu, Jun 18, 2020 at 9:00 PM Rana Alotaibi <ra...@eng.ucsd.edu> wrote:

Re: Performance Question

Posted by Matthias Boehm <mb...@gmail.com>.
Thanks for the question and the detailed inputs - this is an effect of 
simplification rewrites that apply in only one of the two cases. 
Specifically, (t(N)%*%t(M))[1,1] is rewritten to t(N)[1,] %*% t(M)[,1], 
which is a form of selection pushdown. For benchmarking*, you can 
introduce an explicit program-block cut to separate the indexing 
operation - since most simplification rewrites (including this one) are 
applied only at the DAG level, you will then run the full matrix 
multiplication:

res = t(N)%*%t(M)
while(FALSE){}
print(as.scalar(res[1,1]))
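For intuition, the pushdown itself is easy to check outside SystemML. Below is a minimal NumPy sketch of my own (not SystemML code; the matrices are scaled down from the 20Kx100 / 100x20K sizes in the question) showing that the rewritten form produces the same scalar while touching only one row/column pair - k multiply-adds instead of the m*n*k of the full product:

```python
import numpy as np

# Scaled-down stand-ins for M (m x k) and N (k x n) from the question.
rng = np.random.default_rng(0)
M = rng.random((200, 10))
N = rng.random((10, 200))

# Full pipeline: materializes the complete 200x200 product, then indexes.
full = (N.T @ M.T)[0, 0]

# After selection pushdown: a single dot product of length k=10.
pushed = N.T[0, :] @ M.T[:, 0]

assert np.isclose(full, pushed)
```

This is exactly why the t(N)%*%t(M) run finishes in milliseconds: the engine never computes the 20Kx20K result at all.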

Note that there is also further simplification potential we don't 
exploit yet: the indexing could be pushed into the transpose (in both 
cases) as well. For benchmarking MM, it's probably also a good idea to 
compare that with our native BLAS integration (e.g., OpenBLAS or MKL), 
which you can enable in our configuration file and by passing the 
library path to the JVM.
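For reference, a sketch of what enabling native BLAS looks like - the property name below is from memory, so please verify it against the SystemML-config.xml shipped with the 1.2 binary release:

```xml
<!-- SystemML-config.xml: request a native BLAS library (sketch) -->
<sysml.native.blas>openblas</sysml.native.blas>
```

and start the JVM with the native library on its path, e.g. -Djava.library.path=/path/to/openblas/lib.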

* We don't remove while loops, both to allow these kinds of explicit 
cuts and because removing them offers little potential in real scripts.

Regards,
Matthias

On 6/19/2020 6:00 AM, Rana Alotaibi wrote:
> _Var14.SCALAR.DOUBLE
> --------CP / _Var14.SCALAR.DOUBLE.false 1000.SCALAR.INT.true
> _Var15.SCALAR.DOUBLE
> --------CP rmvar _Var14
> --------CP + Time:.SCALAR.STRING.true _Var15.SCALAR.DOUBLE.false
> _Var16.SCALAR.STRING
> --------CP createvar _mVar17
> scratch_space//_p32870_192.168.250.59//_t0/temp9 true MATRIX binarycell 5 1
> -1 -1 -1 copy
> --------CP leftIndex times.MATRIX.DOUBLE _Var15.SCALAR.DOUBLE.false
> ix.SCALAR.INT.false ix.SCALAR.INT.false 1.SCALAR.INT.true 1.SCALAR.INT.true
> _mVar17.MATRIX.DOUBLE
> --------CP rmvar _Var15
> --------CP print _Var13.SCALAR.STRING.false _Var18.SCALAR.STRING
> --------CP rmvar _Var13
> --------CP print _Var16.SCALAR.STRING.false _Var19.SCALAR.STRING
> --------CP rmvar _Var16 times _Var18 _Var19
> --------CP cpvar _mVar17 times
> --------CP rmvar _mVar17 start stop
> ----GENERIC (lines 18-19) [recompile=false]
> ------CP createvar _mVar20
> scratch_space//_p32870_192.168.250.59//_t0/temp10 true MATRIX binarycell 1
> 5 -1 -1 -1 copy
> ------CP r' times.MATRIX.DOUBLE _mVar20.MATRIX.DOUBLE 40
> ------CP rmvar times
> ------CP cpvar _mVar20 times
> ------CP rmvar _mVar20 times
> 
> SystemML Statistics:
> Total elapsed time: 1.178 sec.
> Total compilation time: 0.591 sec.
> Total execution time: 0.587 sec.
> Number of compiled MR Jobs: 0.
> Number of executed MR Jobs: 0.
> Cache hits (Mem, WB, FS, HDFS): 47/0/0/0.
> Cache writes (WB, FS, HDFS): 12/0/0.
> Cache times (ACQr/m, RLS, EXP): 0.001/0.001/0.002/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/0.
> HOP DAGs recompile time: 0.000 sec.
> Total JIT compile time: 29.332 sec.
> Total JVM GC count: 8.
> Total JVM GC time: 0.38 sec.
> Heavy hitter instructions:
>    #  Instruction  Time(s)  Count
>    1  rand           0.307      3
>    2  r'             0.256     11
>    3  ba+*           0.010      5
>    4  time           0.004     10
>    5  rmvar          0.003     67
>    6  rightIndex     0.002     15
>    7  print          0.001     20
>    8  createvar      0.001     39
>    9  leftIndex      0.000      5
>   10  +              0.000     15
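One thing worth noting in this plan: the two rightIndex instructions run *before* the ba+* multiply. That ordering suggests the optimizer pushes the res[1,1] indexing through the transposes, so it multiplies a 1x100 row slice by a 100x1 column slice instead of forming the 20Kx20K product. The rewrite rests on the identity (t(N) %*% t(M))[1,1] = t(N)[1,] %*% t(M)[,1]. A hedged NumPy sketch of that identity (my reading of the plan above, not SystemML internals):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.uniform(size=(200, 10))   # stands in for 20000 x 100
N = rng.uniform(size=(10, 200))   # stands in for 100 x 20000

# Full product, then index: what a plan without the rewrite would do.
full = (N.T @ M.T)[0, 0]

# Index pushed below the multiply, as the rightIndex-before-ba+*
# ordering in the plan suggests: a length-10 row dotted with a
# length-10 column instead of a 200x200 intermediate.
sliced = N.T[0, :] @ M.T[:, 0]

print(np.isclose(full, sliced))  # same scalar, far less work
```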
> 
> *t(M%*%N) RUNTIME EXPLAIN*
> # Memory Budget local/remote = 63716MB/140MB/140MB
> # Degree of Parallelism (vcores) local/remote = 40/1/1
> PROGRAM ( size CP/MR = 50/0 )
> --FUNCTIONS
> ----FUNCTION CALL GRAPH
> ------MAIN PROGRAM
> --------.defaultNS::time
> ----EXTERNAL FUNCTION .defaultNS::time
> --MAIN PROGRAM
> ----GENERIC (lines 4-6) [recompile=false]
> ------CP createvar _mVar0 scratch_space//_p32616_192.168.250.59//_t0/temp0
> true MATRIX binaryblock 5 1 1000 1000 0 copy
> ------CP rand 5.SCALAR.INT.true 1.SCALAR.INT.true 1000 1000 0.0 0.0 1.0 -1
> uniform 1.0 40 _mVar0.MATRIX.DOUBLE
> ------CP createvar _mVar1 scratch_space//_p32616_192.168.250.59//_t0/temp1
> true MATRIX binaryblock 20000 100 1000 1000 2000000 copy
> ------CP rand 20000.SCALAR.INT.true 100.SCALAR.INT.true 1000 1000 0.0 1.0
> 1.0 -1 uniform 1.0 40 _mVar1.MATRIX.DOUBLE
> ------CP createvar _mVar2 scratch_space//_p32616_192.168.250.59//_t0/temp2
> true MATRIX binaryblock 100 20000 1000 1000 2000000 copy
> ------CP rand 100.SCALAR.INT.true 20000.SCALAR.INT.true 1000 1000 0.0 1.0
> 1.0 -1 uniform 1.0 40 _mVar2.MATRIX.DOUBLE
> ------CP cpvar _mVar0 times
> ------CP cpvar _mVar1 M
> ------CP cpvar _mVar2 N
> ------CP rmvar _mVar0 _mVar1 _mVar2
> ----FOR (lines 0-0) [in-place=[times]]
> ------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT
> ------CP assignvar 5.SCALAR.INT.true __pred.SCALAR.INT
> ------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT
> ------GENERIC (lines 8-8) [recompile=false]
> --------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true start
> ------GENERIC (lines 9-12) [recompile=false]
> --------CP + StartTime: .SCALAR.STRING.true start.SCALAR.DOUBLE.false
> _Var3.SCALAR.STRING
> --------CP createvar _mVar4
> scratch_space//_p32616_192.168.250.59//_t0/temp3 true MATRIX binarycell
> 20000 20000 -1 -1 -1 copy
> --------CP ba+* M.MATRIX.DOUBLE N.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE 40
> --------CP createvar _mVar5
> scratch_space//_p32616_192.168.250.59//_t0/temp4 true MATRIX binarycell
> 20000 20000 -1 -1 -1 copy
> --------CP r' _mVar4.MATRIX.DOUBLE _mVar5.MATRIX.DOUBLE 40
> --------CP rmvar _mVar4
> --------CP createvar _mVar6
> scratch_space//_p32616_192.168.250.59//_t0/temp5 true MATRIX binarycell 1 1
> -1 -1 -1 copy
> --------CP rightIndex _mVar5.MATRIX.DOUBLE 1.SCALAR.INT.true
> 1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar6.MATRIX.DOUBLE
> --------CP rmvar _mVar5
> --------CP castdts _mVar6.MATRIX.DOUBLE.false _Var7.SCALAR.STRING
> --------CP rmvar _mVar6
> --------CP print _Var3.SCALAR.STRING.false _Var8.SCALAR.STRING
> --------CP rmvar _Var3
> --------CP print _Var7.SCALAR.STRING.false _Var9.SCALAR.STRING
> --------CP rmvar _Var7
> --------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true end
> --------CP rmvar _Var8 _Var9
> ------GENERIC (lines 13-16) [recompile=false]
> --------CP + EndTime: .SCALAR.STRING.true end.SCALAR.DOUBLE.false
> _Var10.SCALAR.STRING
> --------CP - end.SCALAR.DOUBLE.false start.SCALAR.DOUBLE.false
> _Var11.SCALAR.DOUBLE
> --------CP / _Var11.SCALAR.DOUBLE.false 1000.SCALAR.INT.true
> _Var12.SCALAR.DOUBLE
> --------CP rmvar _Var11
> --------CP + Time:.SCALAR.STRING.true _Var12.SCALAR.DOUBLE.false
> _Var13.SCALAR.STRING
> --------CP createvar _mVar14
> scratch_space//_p32616_192.168.250.59//_t0/temp6 true MATRIX binarycell 5 1
> -1 -1 -1 copy
> --------CP leftIndex times.MATRIX.DOUBLE _Var12.SCALAR.DOUBLE.false
> ix.SCALAR.INT.false ix.SCALAR.INT.false 1.SCALAR.INT.true 1.SCALAR.INT.true
> _mVar14.MATRIX.DOUBLE
> --------CP rmvar _Var12
> --------CP print _Var10.SCALAR.STRING.false _Var15.SCALAR.STRING
> --------CP rmvar _Var10
> --------CP print _Var13.SCALAR.STRING.false _Var16.SCALAR.STRING
> --------CP rmvar _Var13 times _Var15 _Var16
> --------CP cpvar _mVar14 times
> --------CP rmvar _mVar14 start end
> ----GENERIC (lines 18-19) [recompile=false]
> ------CP createvar _mVar17 scratch_space//_p32616_192.168.250.59//_t0/temp7
> true MATRIX binarycell 1 5 -1 -1 -1 copy
> ------CP r' times.MATRIX.DOUBLE _mVar17.MATRIX.DOUBLE 40
> ------CP rmvar times
> ------CP cpvar _mVar17 times
> ------CP rmvar _mVar17 times
> 
> SystemML Statistics:
> Total elapsed time: 22.074 sec.
> Total compilation time: 0.567 sec.
> Total execution time: 21.507 sec.
> Number of compiled MR Jobs: 0.
> Number of executed MR Jobs: 0.
> Cache hits (Mem, WB, FS, HDFS): 32/0/0/0.
> Cache writes (WB, FS, HDFS): 12/0/0.
> Cache times (ACQr/m, RLS, EXP): 0.000/0.001/0.003/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/0.
> HOP DAGs recompile time: 0.000 sec.
> Total JIT compile time: 33.155 sec.
> Total JVM GC count: 7.
> Total JVM GC time: 0.279 sec.
> Heavy hitter instructions:
>    #  Instruction  Time(s)  Count
>    1  ba+*          12.661      5
>    2  r'             8.496      6
>    3  rand           0.334      3
>    4  time           0.006     10
>    5  rmvar          0.003     57
>    6  rightIndex     0.002      5
>    7  print          0.001     20
>    8  createvar      0.001     24
>    9  +              0.000     15
>   10  leftIndex      0.000      5
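The contrast with the previous plan: here ba+* runs on the full inputs and r' transposes the dense 20000x20000 intermediate before rightIndex, which lines up with ba+* and r' dominating the heavy hitters. A quick back-of-the-envelope check of what that unrewritten plan costs (plain arithmetic, not SystemML output):

```python
# Cost of the unrewritten plan t(M %*% N) at the posted sizes.
rows, inner, cols = 20000, 100, 20000

# Dense double-precision 20000x20000 intermediate M %*% N
# (and the transpose allocates a second buffer of the same size).
bytes_intermediate = rows * cols * 8
flops_multiply = 2 * rows * cols * inner  # one multiply + one add per term

print(f"intermediate: {bytes_intermediate / 1e9:.1f} GB")
print(f"multiply:     {flops_multiply:.1e} FLOPs")
```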
> 
> Thanks,
> Rana
>