Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2015/08/01 18:05:05 UTC

[jira] [Commented] (SPARK-8999) Support non-temporal sequence in PrefixSpan

    [ https://issues.apache.org/jira/browse/SPARK-8999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650415#comment-14650415 ] 

Xiangrui Meng commented on SPARK-8999:
--------------------------------------

[~srowen] Thanks for your feedback! The PrefixSpan paper has ~2k citations, and I can find implementations in many libraries, e.g., SPMF and R. I think it is fair to say the algorithm is popular in data mining. The question I had is whether we want to support sequences of itemsets instead of sequences of single items. The former complicates both the API and the implementation. I asked the author of SPMF for advice. He said that without itemset support the problem is called string mining, which is handled more efficiently by other specialized algorithms. So it seems we should implement PrefixSpan as in the paper, with itemset support.

> Support non-temporal sequence in PrefixSpan
> -------------------------------------------
>
>                 Key: SPARK-8999
>                 URL: https://issues.apache.org/jira/browse/SPARK-8999
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.5.0
>            Reporter: Xiangrui Meng
>            Assignee: Zhang JiaJin
>            Priority: Critical
>             Fix For: 1.5.0
>
>
> In SPARK-6487, we assumed that all items are totally ordered. However, PrefixSpan should also support non-temporal sequences, i.e., sequences of itemsets. This should be done before 1.5 because it changes the PrefixSpan APIs.
> To represent a sequence of itemsets, we can use `Array[Array[Int]]`, or follow SPMF and use a flat `Array[Int]` with -1 marking itemset boundaries. The latter is more efficient for storage. If we support a generic item type, we can use null as the boundary marker instead.
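For illustration, here is a minimal sketch of the two candidate encodings mentioned in the description: nested arrays versus a flat array with -1 as the itemset delimiter (SPMF-style). This is a hypothetical helper, not the actual Spark or SPMF API; the function names `flatten` and `unflatten` are made up for this example.

```python
# A sequence of itemsets, e.g. <(1,2) (3)>, in the nested representation:
nested = [[1, 2], [3]]

def flatten(seq):
    """Encode a sequence of itemsets as a flat list,
    using -1 to mark the end of each itemset (SPMF-style)."""
    flat = []
    for itemset in seq:
        flat.extend(itemset)
        flat.append(-1)  # itemset boundary marker
    return flat

def unflatten(flat):
    """Decode the flat -1-delimited encoding back into nested itemsets."""
    seq, cur = [], []
    for x in flat:
        if x == -1:
            seq.append(cur)  # close the current itemset
            cur = []
        else:
            cur.append(x)
    return seq

print(flatten(nested))               # [1, 2, -1, 3, -1]
print(unflatten([1, 2, -1, 3, -1]))  # [[1, 2], [3]]
```

The flat form stores one extra sentinel per itemset but avoids the per-itemset array overhead of the nested form, which is the storage advantage noted above; the trade-off is that -1 (or null, for a generic item type) becomes a reserved value that cannot appear as an item.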



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
