You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "wangfei (JIRA)" <ji...@apache.org> on 2014/10/21 11:38:34 UTC

[jira] [Comment Edited] (SPARK-4001) Add Apriori algorithm to Spark MLlib

    [ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178183#comment-14178183 ] 

wangfei edited comment on SPARK-4001 at 10/21/14 9:38 AM:
----------------------------------------------------------

.


was (Author: scwf):


Thanks Sean Owen for explaining! Frequent itemset algorithm works by scanning the input data set, there is no probabilistic model in nature. 

To answer Xiangrui Meng’s earlier questions:
1.	These algorithm is used for finding major patterns / association rules in a data set. For a real use case, some analytic applications in telecom domain use them to find subscriber behavior from the data set combining service record, network traffic record, and demographic data. Please refer to this Chinese article for example:  http://www.ss-lw.com/wxxw-361.html
And, sometimes we use frequent itemset algorithm for preparing features input to other algorithm which selects feature and do other ML task like training a classifier, like this paper: http://dl.acm.org/citation.cfm?id=1401922, 

2.	Since Apriori is a basic algorithm for frequent itemset mining, I am not aware of any parallel implementation for it. But I think the algorithm fits Spark’s data parallel model since it only need to scan the input data set. And for FP-Growth, I do know there is a Parallel FP-Growth from Haoyuan Li: http://dl.acm.org/citation.cfm?id=1454027  . I think I probably will refer  to this paper to implement FP-Growth in Spark 

3.	The Apriori computation complexity is about O(N*k) where N is the number of item in input data and k is the depth of the frequent item tree to search. FP-Grwoth complexity is about O(N),  it is more efficient comparing to Apriori. For space efficiency, FP-growth is also more efficient than Apriori. But in case of smaller data and if frequent itemset is more, Apriori is more efficient. This is because FP-Growth need to construct a FP Tree out of the input data set, and it needs some time. And another advantage of Apriori is that it can output association rules while FP-Growth can not.

Although these two algorithms are basic algo (FP-Growth is more complex), I think it will be handy if mllib can include them since there is no frequent itemset mining algo in Spark yet, and especially in distributed environment. 
Please suggest how to handle this issue. Thanks a lot. 


> Add Apriori algorithm to Spark MLlib
> ------------------------------------
>
>                 Key: SPARK-4001
>                 URL: https://issues.apache.org/jira/browse/SPARK-4001
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Jacky Li
>            Assignee: Jacky Li
>
> Apriori is the classic algorithm for frequent item set mining in a transactional data set.  It will be useful if Apriori algorithm is added to MLLib in Spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org