Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:38:06 UTC

[jira] [Resolved] (SPARK-16820) Sparse - Sparse matrix multiplication

     [ https://issues.apache.org/jira/browse/SPARK-16820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-16820.
----------------------------------
    Resolution: Incomplete

> Sparse - Sparse matrix multiplication
> -------------------------------------
>
>                 Key: SPARK-16820
>                 URL: https://issues.apache.org/jira/browse/SPARK-16820
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.0.0
>            Reporter: Ohad Raviv
>            Priority: Major
>              Labels: bulk-closed
>
> While working on an MCL implementation on Spark we encountered some difficulties.
> The main part of this process is distributed sparse matrix multiplication, which has two main steps (sketched below):
> 1. Simulate multiply: a preparation pass before the real multiplication that determines which pairs of blocks need to be multiplied.
> 2. The actual block multiplication and summation.
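> As a rough sketch of how the two steps surface in MLlib's BlockMatrix API (step 1 is internal to multiply(); the standalone simulate() below is only a conceptual illustration, not Spark's actual implementation):
>
>     import org.apache.spark.mllib.linalg.distributed.BlockMatrix
>
>     // Both steps hide behind multiply(): internally, a simulate step
>     // first works out where each block must be shipped so that only
>     // pairs (i, k) x (k, j) that can contribute to an output block
>     // (i, j) are co-located; the co-located pairs are then multiplied
>     // and the partial products summed per output block.
>     def blockProduct(a: BlockMatrix, b: BlockMatrix): BlockMatrix =
>       a.multiply(b)
>
>     // Conceptual step 1 over block coordinates alone: pair up
>     // A-blocks and B-blocks whose inner index matches.
>     def simulate(aBlocks: Seq[(Int, Int)],
>                  bBlocks: Seq[(Int, Int)]): Seq[((Int, Int), (Int, Int))] =
>       for {
>         (i, k)  <- aBlocks
>         (k2, j) <- bBlocks
>         if k == k2
>       } yield ((i, k), (k2, j))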
> In our case the sparse matrix has 50M rows and columns, and 2B non-zeros.
> The current multiplication suffers from these issues:
> 1. A relatively trivial bug in this first step that caused the process to be very slow; it has already been fixed [SPARK-16469].
> 2. Even after that fix, with too many blocks the simulate-multiply step takes a very long time and replicates the data many times: it is O(n^3), where n is the number of blocks per dimension.
> 3. Spark supports multiplication only with dense matrices, so it converts a sparse matrix into a dense one before multiplying.
> 4. For summing the intermediate block results, Spark uses Breeze's CSC matrix operations. The problem is that updating an entry of a CSC matrix that is structurally zero is very inefficient (see the sketch after this list).
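> A minimal sketch of issue 4, assuming Breeze's breeze.linalg.CSCMatrix and its Builder (the usual way to avoid the per-update cost):
>
>     import breeze.linalg.CSCMatrix
>
>     val rows = 4; val cols = 4
>
>     // Slow: writing to a position that is structurally zero forces an
>     // insertion into the compressed column arrays, shifting everything
>     // stored after it (roughly O(nnz) per update), which is what hurts
>     // when many intermediate block products are accumulated this way.
>     val acc = CSCMatrix.zeros[Double](rows, cols)
>     acc(1, 2) = acc(1, 2) + 3.0
>
>     // Faster: buffer all updates and build the matrix once; entries
>     // with duplicate (row, col) coordinates are summed by result().
>     val builder = new CSCMatrix.Builder[Double](rows, cols)
>     builder.add(1, 2, 3.0)
>     builder.add(1, 2, 4.0)   // summed with the previous entry
>     val summed = builder.result()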
> That means that with many blocks (the default block size is 1024), in our case 50M/1024 ~= 50K blocks per dimension, the simulate multiply will effectively never finish, or will generate 50K*16GB ~= 1000TB of data. On the other hand, if we use a bigger block size, e.g. 100K, we get an OutOfMemoryError in the “toDense” step of the multiply.
> We worked around this by implementing both the sparse multiplication and the addition ourselves in a very naïve way, but at least it works (a rough sketch follows below).
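> The workaround code itself is not part of this ticket; purely as an illustration, a naïve sparse multiply and add over (row, col, value) triplets could look like this (multiplyCOO/addCOO are hypothetical names):
>
>     // Hypothetical COO representation of one block: (row, col, value).
>     type COO = Seq[(Int, Int, Double)]
>
>     // Naive product: index B by row, join on A's column index, then
>     // sum the partial products per output coordinate.
>     def multiplyCOO(a: COO, b: COO): COO = {
>       val bByRow = b.groupBy(_._1)
>       a.flatMap { case (i, k, av) =>
>           bByRow.getOrElse(k, Nil).map { case (_, j, bv) => ((i, j), av * bv) }
>         }
>         .groupBy(_._1)
>         .map { case ((i, j), vs) => (i, j, vs.map(_._2).sum) }
>         .toSeq
>     }
>
>     // Naive sum: concatenate the triplets and combine per coordinate.
>     def addCOO(a: COO, b: COO): COO =
>       (a ++ b).groupBy(t => (t._1, t._2))
>         .map { case ((i, j), vs) => (i, j, vs.map(_._3).sum) }
>         .toSeq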



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org