Posted to issues@systemml.apache.org by "Matthias Boehm (JIRA)" <ji...@apache.org> on 2016/10/02 23:35:20 UTC

[jira] [Updated] (SYSTEMML-1004) New spark tsmm2 matrix multiplication operator

     [ https://issues.apache.org/jira/browse/SYSTEMML-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Boehm updated SYSTEMML-1004:
-------------------------------------
    Description: 
The performance experiments for our 0.11 release revealed performance issues for LinregDS and PCA (specifically for {{t(X)%*%X}}) whenever the number of columns is larger than the blocksize. For example, the following scenario shows LinregDS results for an input size of 10M x 1K with a blocksize of 1K. For scenarios with ict>0, we append a column of ones, so the number of columns exceeds the blocksize and we compile a {{cpmm}} instead of a {{tsmm}} instruction.

{code}
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 80
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 293
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 340
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 80
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 291
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 302
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 80
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 274
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 316
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 81
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 279
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 322
{code}
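
As a rough illustration of the blocksize condition described above, the choice between the two instructions can be sketched as follows. The function name is hypothetical, 1K is assumed to mean 1,000, and this is not SystemML's actual compiler logic, only the rule as stated in this issue.

```python
# Hypothetical sketch of the operator choice described in this issue;
# not taken from the SystemML code base.

def choose_tx_x_operator(num_cols: int, blocksize: int) -> str:
    """Return the instruction name for t(X) %*% X under the stated rule."""
    # tsmm applies only while all columns of X fit into one column block.
    if num_cols <= blocksize:
        return "tsmm"
    # Otherwise the general cpmm instruction is compiled.
    return "cpmm"


print(choose_tx_x_operator(1000, 1000))  # ict=0: 1K columns -> tsmm
print(choose_tx_x_operator(1001, 1000))  # ict>0: one appended column -> cpmm
```

A single appended column is enough to flip the choice, which matches the jump between ict=0 and ict>0 in the results above.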

In comparison, LinregCG is much more robust: its results stay between 50 and 72 for all three ict settings:

{code}
-- Running runLinearRegCG on 10M_1k_dense (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_dense: 62
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_dense: 67
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_dense: 65
-- Running runLinearRegCG on 10M_1k_dense (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_dense: 57
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_dense: 68
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_dense: 58
-- Running runLinearRegCG on 10M_1k_dense (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_dense: 50
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_dense: 72
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_dense: 59
-- Running runLinearRegCG on 10M_1k_dense (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_dense: 57
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_dense: 67
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_dense: 67
{code}
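
To put the two result sets side by side, the following small sketch computes the mean per ict setting and the slowdown relative to ict=0. The numbers are copied from the logs above; the helper name is hypothetical, and the logs do not state the unit.

```python
from statistics import mean

# Results copied from the logs above, keyed by ict setting (unit not stated).
linreg_ds = {0: [80, 80, 80, 81], 1: [293, 291, 274, 279], 2: [340, 302, 316, 322]}
linreg_cg = {0: [62, 57, 50, 57], 1: [67, 68, 72, 67], 2: [65, 58, 59, 67]}


def slowdown_vs_ict0(runs):
    """Mean result per ict setting divided by the mean result for ict=0."""
    base = mean(runs[0])
    return {ict: mean(values) / base for ict, values in runs.items()}


ds_slowdown = slowdown_vs_ict0(linreg_ds)
cg_slowdown = slowdown_vs_ict0(linreg_cg)

for name, ratios in (("LinRegDS", ds_slowdown), ("LinRegCG", cg_slowdown)):
    for ict, ratio in sorted(ratios.items()):
        print(f"{name} ict={ict}: {ratio:.2f}x of ict=0")
```

On these numbers, LinRegDS is roughly 3.5x (ict=1) and 4.0x (ict=2) slower than with ict=0, while LinRegCG stays at about 1.2x and 1.1x.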

We should introduce a new {{tsmm2}} operation for the scenario where the excess columns fit into the broadcast memory budget, which would allow us to compute this expression without shuffling t(X) and X.
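
The idea can be sketched in plain Python. This is only an illustration under assumptions: X is taken to be split into row blocks, the first blocksize columns of each row block are held locally, and the excess columns are available everywhere through a lookup that stands in for a broadcast. All names are hypothetical; this is not the proposed SystemML implementation.

```python
# Pure-Python sketch of the tsmm2 idea; names and data layout are assumptions.

def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def tsmm2(main_blocks, excess_broadcast):
    """Compute t(X) %*% X where row block i of X is [main_blocks[i] | excess_broadcast[i]]."""
    result = None
    for i, x1 in main_blocks.items():
        x2 = excess_broadcast[i]  # local lookup, stands in for a broadcast variable
        xi = [r1 + r2 for r1, r2 in zip(x1, x2)]  # full rows of this row block
        partial = matmul(transpose(xi), xi)  # local t(X_i) %*% X_i, no shuffle of X
        result = partial if result is None else add(result, partial)
    return result


# 4 x 3 example with a blocksize of 2: two main columns plus one excess column of ones.
X = [[1, 2, 1], [3, 4, 1], [5, 6, 1], [7, 8, 1]]
main_blocks = {0: [[1, 2], [3, 4]], 1: [[5, 6], [7, 8]]}
excess_broadcast = {0: [[1], [1]], 1: [[1], [1]]}

result = tsmm2(main_blocks, excess_broadcast)
print(result)  # [[84, 100, 16], [100, 120, 20], [16, 20, 4]]
```

Each row block contributes an independent partial product, so only the small partial results need to be aggregated, which is the property that would avoid shuffling t(X) and X.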

  was:
The performance experiments for our 0.11 release revealed performance issues for LinregDS and PCA (specifically for {{t(X)%*%X}}) whenever the number of columns is larger than the blocksize. For example, the following scenario shows LinregDS results for an input size of 10M x 1K with a blocksize of 1K. For scenarios with ict>0, we append a column of ones, so the number of columns exceeds the blocksize and we compile a {{cpmm}} instead of a {{tsmm}} instruction.

{code}
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 122
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 350
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 297
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 81
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 279
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 360
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 80
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 286
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 299
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 82
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 292
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 292
-- Running runLinearRegDS on 10M_1k_dense (all configs)
LinRegDS train ict=0 on mbperftest/binomial/X10M_1k_dense: 82
LinRegDS train ict=1 on mbperftest/binomial/X10M_1k_dense: 290
LinRegDS train ict=2 on mbperftest/binomial/X10M_1k_dense: 301
{code}

We should introduce a new {{tsmm2}} operation for the scenario where the excess columns fit into the broadcast memory budget, which would allow us to compute this expression without shuffling t(X) and X.


> New spark tsmm2 matrix multiplication operator
> ----------------------------------------------
>
>                 Key: SYSTEMML-1004
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1004
>             Project: SystemML
>          Issue Type: Task
>            Reporter: Matthias Boehm


