Posted to issues@systemml.apache.org by "Matthias Boehm (JIRA)" <ji...@apache.org> on 2016/03/06 23:55:40 UTC

[jira] [Updated] (SYSTEMML-552) Performance features ALS-CG

     [ https://issues.apache.org/jira/browse/SYSTEMML-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Boehm updated SYSTEMML-552:
------------------------------------
    Description: 
Over a spectrum of data sizes, ALS-CG does not always perform as well as we would expect, due to unnecessary overheads. This task captures related performance features:

1) Cache-conscious sparse wdivmm left/right: For large factors, the current approach of iterating over the non-zeros of W and computing a dot product per entry leads to repeated (unnecessary) scans of the factor matrices from main memory; a cache-blocked traversal avoids these rescans (see the first sketch below).
2) Preparation sparse W = (X!=0) w/ intrinsics: For scalar operations with !=0, there is already a special case, but it is unnecessarily conservative. We should realize this as a plain memcpy of the column indexes and a memset of 1 for the values (see the second sketch below).
3) Flop-aware operator selection QuaternaryOp: For large ranks, all quaternary operators become compute-intensive. In these situations, our heuristic of choosing ExecType.CP whenever the operation fits in driver memory does not work well. Hence, we should take the number of floating point operations and the local/cluster degree of parallelism into account when deciding on the execution type (see the third sketch below).
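
As a first sketch, the following cache-blocked wdivmm-right kernel computes R = (W / (U %*% t(V))) %*% V with W in CSR form; blocking the column range of W keeps the touched rows of V hot in cache across many rows of W. The names (wPtr, wIx, wVal, blocksize) and the standalone method are illustrative assumptions, not SystemML's actual kernel API:

    import java.util.Arrays;

    // R = (W / (U %*% t(V))) %*% V, with sparse W given as CSR arrays
    // (row pointers wPtr, column indexes wIx, values wVal); U is m x rank,
    // V is n x rank, and R is the m x rank output (zero-initialized).
    static void wdivmmRightBlocked(int[] wPtr, int[] wIx, double[] wVal,
            double[][] U, double[][] V, double[][] R,
            int m, int n, int rank, int blocksize) {
        int[] pos = Arrays.copyOf(wPtr, m);          // per-row scan positions
        for (int bj = 0; bj < n; bj += blocksize) {  // column blocks of W
            int bjmax = Math.min(bj + blocksize, n);
            for (int i = 0; i < m; i++) {            // all rows per block
                int k = pos[i];
                for (; k < wPtr[i + 1] && wIx[k] < bjmax; k++) {
                    int j = wIx[k];
                    double uv = 0;                   // dot product U[i,] . V[j,]
                    for (int l = 0; l < rank; l++)
                        uv += U[i][l] * V[j][l];
                    double wdiv = wVal[k] / uv;      // (W / (U %*% t(V)))[i,j]
                    for (int l = 0; l < rank; l++)   // R[i,] += wdiv * V[j,]
                        R[i][l] += wdiv * V[j][l];
                }
                pos[i] = k;                          // resume here in next block
            }
        }
    }

The block size would be chosen such that blocksize * rank * 8 bytes (the touched slice of V) fits into the last-level cache; the wdivmm-left variant would block over rows of W analogously.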
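
As a second sketch, the (X!=0) fast path for one sparse row could look as follows; it assumes X stores no explicit zero values, and the method signature is illustrative rather than SystemML's internal sparse-block API:

    import java.util.Arrays;

    // W = (X != 0) for one sparse row: every stored non-zero of X maps to
    // a 1 in W, so the column indexes are copied verbatim and the values
    // are filled with 1, bypassing the generic scalar operator entirely.
    static void sparseNotEqualZero(int[] srcIx, int len,
            int[] dstIx, double[] dstVal) {
        System.arraycopy(srcIx, 0, dstIx, 0, len);  // plain memcpy of indexes
        Arrays.fill(dstVal, 0, len, 1.0);           // memset 1 for values
    }

Both calls compile down to optimized bulk copy/fill loops in HotSpot, which is the point of the proposed special case.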
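
As a third sketch, a flop-aware selection could compare estimated local and distributed execution times instead of only checking the memory budget. The flop estimate (roughly 4 * nnz(W) * rank for wdivmm-like patterns: one dot product plus one scaled vector addition per non-zero) follows from the kernel above, but the peak rate, the latency constant, and the use of SPARK as the distributed backend are illustrative assumptions:

    enum ExecType { CP, SPARK }

    static final double PEAK_FLOPS_PER_CORE = 4.0e9; // assumed peak [flop/s]
    static final double DIST_JOB_LATENCY    = 2.0;   // assumed fixed overhead [s]

    static ExecType chooseQuaternaryExecType(long nnzW, int rank,
            double memEstimate, double driverBudget,
            int localPar, int clusterPar) {
        if (memEstimate > driverBudget)
            return ExecType.SPARK;                // does not fit in the driver
        double flops = 4.0 * nnzW * rank;         // dot product + scaling per nnz
        double cpTime   = flops / (localPar * PEAK_FLOPS_PER_CORE);
        double distTime = flops / (clusterPar * PEAK_FLOPS_PER_CORE)
                        + DIST_JOB_LATENCY;       // distributed job overhead
        return (cpTime <= distTime) ? ExecType.CP : ExecType.SPARK;
    }

With such a rule, a large-rank operation that fits in the driver but would run much longer locally than the fixed distributed overhead is correctly pushed to the cluster.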

> Performance features ALS-CG
> ---------------------------
>
>                 Key: SYSTEMML-552
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-552
>             Project: SystemML
>          Issue Type: Task
>            Reporter: Matthias Boehm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)