Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/01/12 22:52:39 UTC
[jira] [Closed] (SPARK-12781) MLlib FPGrowth does not scale to large numbers of frequent items
[ https://issues.apache.org/jira/browse/SPARK-12781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen closed SPARK-12781.
-----------------------------
> MLlib FPGrowth does not scale to large numbers of frequent items
> ----------------------------------------------------------------
>
> Key: SPARK-12781
> URL: https://issues.apache.org/jira/browse/SPARK-12781
> Project: Spark
> Issue Type: Improvement
> Reporter: Raj Tiwari
>
> See some background discussion here: [http://stackoverflow.com/questions/34690682/spark-mlib-fpgrowth-job-fails-with-memory-error/]
> The FPGrowth model's {{run()}} method seems to do the following:
> # Count items
> # Generate frequent items
> # Generate frequent item sets
> The model is trained based on the outcome of the above. When generating frequent items, the code does the following:
> {code}
> data.flatMap { t =>
>     val uniq = t.toSet
>     if (t.size != uniq.size) {
>       throw new SparkException(s"Items in a transaction must be unique but got ${t.toSeq}.")
>     }
>     t
>   }.map(v => (v, 1L))
>   .reduceByKey(partitioner, _ + _)
>   .filter(_._2 >= minCount)
>   .collect()
>   .sortBy(-_._2)
>   .map(_._1)
> {code}
> The {{collect()}} call in the snippet above is causing my executors to blow past any amount of memory I can give them. Is there a way to write {{genFreqItems()}} and {{genFreqItemsets()}} so they won't try to collect all frequent items in memory?
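[Editor's note] The counting logic in the quoted snippet can be mirrored in plain Scala over local collections, which makes the duplicate check, the {{minCount}} threshold, and the sort-by-frequency step concrete without needing a Spark cluster. This is only an illustrative sketch: `FreqItemsSketch`, `transactions`, and `minCount` are stand-ins invented here, not Spark API, and a local `groupBy` plays the role of the distributed `reduceByKey`.

```scala
// Illustrative, non-Spark sketch of the frequent-item counting step.
// `transactions` stands in for the RDD of transactions and `minCount`
// for the support threshold in the real FPGrowth code.
object FreqItemsSketch {
  def genFreqItems[T](transactions: Seq[Seq[T]], minCount: Long): Seq[T] = {
    transactions.flatMap { t =>
      // Same invariant as the quoted snippet: items within one
      // transaction must be unique.
      val uniq = t.toSet
      require(t.size == uniq.size,
        s"Items in a transaction must be unique but got $t.")
      t
    }
      .groupBy(identity)                                  // local analogue of reduceByKey
      .map { case (item, occs) => (item, occs.size.toLong) }
      .filter(_._2 >= minCount)                           // drop infrequent items
      .toSeq
      .sortBy(-_._2)                                      // most frequent first
      .map(_._1)
  }
}
```

In the real implementation the equivalent of this result is what `collect()` pulls back to the driver, which is why its size is bounded by the number of distinct frequent items rather than by the number of transactions.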
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org