Posted to issues@systemml.apache.org by "Matthias Boehm (JIRA)" <ji...@apache.org> on 2017/05/24 03:05:04 UTC

[jira] [Updated] (SYSTEMML-1623) Memory-efficient JMLC matrix and frame conversions

     [ https://issues.apache.org/jira/browse/SYSTEMML-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Boehm updated SYSTEMML-1623:
-------------------------------------
    Description: 
The current JMLC conversion functions follow a very inefficient and memory-intensive code path, which leads to unnecessary OOMs that can easily be avoided. This task aims to add and improve these primitives to allow convenient data conversions with much better memory efficiency.

For example, consider a scenario of a 500k x 90 input model available as a CSV file on the classpath. The typical code path currently in use looks as follows:
{code}
ResourceStream(model_file)
-> prep
---> StringBuilder -> String [3GB tmp, 1GB]
-> convertToDoubleMatrix
---> byte[] -> ByteInputStream [2GB]
---> MatrixBlock [360MB]
---> double[][] [400MB]
-> setMatrix
---> MatrixBlock [360MB]
{code} 

which requires at least 4GB of memory because all intermediates remain strongly referenced at the same time. The goal of this task is to reduce this to the following:

{code}
ResourceStream(model_file)
-> convertToMatrix
---> MatrixBlock [360MB]
-> setMatrix
---> by references
{code} 
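
For concreteness, a minimal Java sketch of both paths is given below, assuming a typical JMLC scoring setup. The script name, input/output variable names, resource path, and exact method signatures are illustrative assumptions only; in particular, convertToMatrix and the by-reference setMatrix overload in the second half are the primitives this task proposes, so their signatures are tentative, not the final API.

{code}
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

import org.apache.sysml.api.jmlc.Connection;
import org.apache.sysml.api.jmlc.PreparedScript;
import org.apache.sysml.runtime.matrix.data.MatrixBlock;

public class JmlcConversionSketch {
  public static void main(String[] args) throws Exception {
    Connection conn = new Connection();
    // Hypothetical scoring script with input matrix W and output P.
    PreparedScript pscript = conn.prepareScript(
      conn.readScript("scoring.dml"), new String[]{"W"}, new String[]{"P"}, false);

    // --- Current, memory-intensive path ---
    // 1) Materialize the classpath resource as one large String
    //    (StringBuilder temp ~3GB, retained String ~1GB for the 500k x 90 model).
    String csv;
    try( InputStream is = JmlcConversionSketch.class.getResourceAsStream("/model.csv");
         Scanner s = new Scanner(is, StandardCharsets.UTF_8.name()).useDelimiter("\\A") ) {
      csv = s.hasNext() ? s.next() : "";
    }
    // 2) Convert to double[][]; internally this creates byte[]/stream copies (~2GB),
    //    a MatrixBlock (~360MB), and the double[][] itself (~400MB).
    double[][] model = conn.convertToDoubleMatrix(csv, 500000, 90);
    // 3) Bind the input, which converts the double[][] back into a MatrixBlock (~360MB).
    pscript.setMatrix("W", model);

    // --- Target path proposed by this task (assumed API) ---
    // Convert the stream directly into a single MatrixBlock and pass it by reference,
    // so only one ~360MB intermediate is ever strongly referenced.
    try( InputStream is = JmlcConversionSketch.class.getResourceAsStream("/model.csv") ) {
      MatrixBlock mb = conn.convertToMatrix(is, 500000, 90);
      pscript.setMatrix("W", mb, true); // pass by reference, no defensive copy
    }
  }
}
{code}

In the current path, the String, byte[]/stream copies, MatrixBlock, double[][], and second MatrixBlock are all strongly reachable around the same time, which is what pushes the peak footprint past the 4GB mentioned above; the target path keeps a single MatrixBlock alive end to end.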


> Memory-efficient JMLC matrix and frame conversions
> --------------------------------------------------
>
>                 Key: SYSTEMML-1623
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1623
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>
> The current JMLC conversion functions follow a very inefficient and memory-intensive code path, which leads to unnecessary OOMs that can easily be avoided. This task aims to add and improve these primitives to allow convenient data conversions with much better memory efficiency.
> For example, consider a scenario of a 500k x 90 input model available as a CSV file on the classpath. The typical code path currently in use looks as follows:
> {code}
> ResourceStream(model_file)
> -> prep
> ---> StringBuilder -> String [3GB tmp, 1GB]
> -> convertToDoubleMatrix
> ---> byte[] -> ByteInputStream [2GB]
> ---> MatrixBlock [360MB]
> ---> double[][] [400MB]
> -> setMatrix
> ---> MatrixBlock [360MB]
> {code} 
> which requires at least 4GB of memory because all intermediates remain strongly referenced at the same time. The goal of this task is to reduce this to the following:
> {code}
> ResourceStream(model_file)
> -> convertToMatrix
> ---> MatrixBlock [360MB]
> -> setMatrix
> ---> by references
> {code} 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)