Posted to issues@spark.apache.org by "Michał Wesołowski (JIRA)" <ji...@apache.org> on 2016/07/11 11:48:11 UTC

[jira] [Comment Edited] (SPARK-16478) strongly connected components doesn't cache returned RDD

    [ https://issues.apache.org/jira/browse/SPARK-16478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370595#comment-15370595 ] 

Michał Wesołowski edited comment on SPARK-16478 at 7/11/16 11:47 AM:
---------------------------------------------------------------------

If you run the code I provided on Databricks, you can see that, without materializing the returned graph, a simple count on its vertices takes about 20 minutes, whereas strongly connected components itself runs in 2 minutes. 
I tried to use it on some real data and wasn't able to save the result because of this. After materializing the graph in every iteration, I can save the results with no problem. Materializing only within the outer loop caused less severe problems but wasn't sufficient. 

The original implementation caches and immediately materializes a lot of RDDs. Some of them are evicted before scc returns, because Spark drops cached data in LRU fashion, but the returned RDDs are not materialized and depend on the ones already removed from RAM. That is my current understanding of the observed behavior. 
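
To make "cache and materialize" concrete, here is a minimal caller-side sketch in Scala. The object and method names (SccMaterializeSketch, sccMaterialized) are hypothetical and not part of GraphX; only StronglyConnectedComponents.run, Graph.cache and RDD.count are existing calls. It does not reproduce the per-iteration materialization inside the algorithm described above, so the first count may still be slow for the reasons given in this comment.

```scala
import scala.reflect.ClassTag

import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.lib.StronglyConnectedComponents

// Hypothetical helper for illustration only; not part of GraphX.
object SccMaterializeSketch {

  // Runs SCC, marks the returned graph as cached, and forces it to be
  // computed once so that later actions can read it from the cache
  // instead of replaying the whole lineage.
  def sccMaterialized[VD: ClassTag, ED: ClassTag](
      graph: Graph[VD, ED],
      numIter: Int): Graph[VertexId, ED] = {
    val result = StronglyConnectedComponents.run(graph, numIter).cache()

    // cache() is lazy: an action is needed to actually populate the cache.
    result.vertices.count()
    result.edges.count()

    result
  }
}
```

Assuming an existing graph value, usage would look like `val scc = SccMaterializeSketch.sccMaterialized(graph, 10)`, after which `scc.vertices` can be saved or counted again.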


> strongly connected components doesn't cache returned RDD
> --------------------------------------------------------
>
>                 Key: SPARK-16478
>                 URL: https://issues.apache.org/jira/browse/SPARK-16478
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 1.6.2
>            Reporter: Michał Wesołowski
>
> The Strongly Connected Components algorithm caches intermediate RDDs but doesn't cache the one that is going to be returned. When the graph is large compared to available memory, any action on the returned RDD forces the whole RDD to be computed from scratch, which takes much more time than StronglyConnectedComponents alone. 
> I managed to replicate the issue on the Databricks platform. [Here|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4889410027417133/3634650767364730/3117184429335832/latest.html] is the notebook. 


