Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2017/02/01 18:27:52 UTC
[jira] [Commented] (SPARK-14503) spark.ml Scala API for FPGrowth

    [ https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848747#comment-15848747 ] 

Joseph K. Bradley commented on SPARK-14503:
-------------------------------------------

There are a couple of design issues which have been mentioned in either the design doc or the PR, but which should probably be discussed in more detail in JIRA:
* Item type: It looks like this currently assumes every item is represented as a String.  I'd like us to support any Catalyst type.  If that's hard to do (until we port the implementation over to DataFrames), then just supporting String is OK as long as it's clearly documented.
* FPGrowth vs AssociationRules: The APIs are a bit fuzzy right now.  I’ve listed them out below.  The problem is that AssociationRules is tightly tied to FPGrowth.  While I like the idea of being able to use AssociationRules to analyze the output of multiple FPM algorithms, I don’t think it’s applicable to PrefixSpan, since AssociationRules does not take the ordering of the itemsets into account.  I’d propose we provide a single API under the name "FPGrowth."
** Q: Have you heard of anyone needing the AssociationRules API without going through FPGrowth first?  If so, then we could expose the AssociationRules algorithm as a @DeveloperApi static method.

What does everyone think?

Current APIs
* FPGrowth
** Input to fit() and transform(): Seq(items)
** Output
*** transform(): -> same as AssociationRules
*** getFreqItems: {code}DataFrame["items", "freq"]{code}

* AssociationRules
** Input
*** fit(): (output of FPGrowth)
*** transform(): Seq(items) -> Not good that fit/transform take different inputs
** Output
*** transform(): predicted items for each Seq(items)
*** associationRules: {code}DataFrame["antecedent", "consequent", "confidence"]{code}

Proposal: Combine under FPGrowth
* FPGrowth
** Input to fit() and transform(): Seq(items)
** Output
*** transform(): predicted items for each Seq(items)
*** getFreqItems: {code}DataFrame["items", "freq"]{code}
*** associationRules: {code}DataFrame["antecedent", "consequent", "confidence"]{code}
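
To make the combined proposal concrete, here is a minimal Spark-free sketch of what a single "FPGrowth" entry point could look like. Everything in it is an assumption for illustration: the names FPGrowthSketch, FPGrowthModelSketch, FreqItemset and Rule are hypothetical, items are plain Strings, DataFrames are replaced by Scala collections, and itemsets are mined by brute-force subset enumeration rather than by the actual FP-growth algorithm. It only shows the shape: fit() and transform() both take Seq(items), and the fitted model exposes both the frequent itemsets and the association rules.

```scala
// Hypothetical sketch of the combined API; not the spark.ml implementation.
// Plain collections stand in for DataFrames; mining is brute force, NOT FP-growth.
final case class FreqItemset(items: Set[String], freq: Long)
final case class Rule(antecedent: Set[String], consequent: Set[String], confidence: Double)

final class FPGrowthModelSketch(val freqItems: Seq[FreqItemset], minConfidence: Double) {
  private val freqOf: Map[Set[String], Long] =
    freqItems.map(f => f.items -> f.freq).toMap

  // Rules with single-item consequents only (a simplification for this sketch).
  // confidence = freq(antecedent + consequent) / freq(antecedent)
  lazy val associationRules: Seq[Rule] =
    for {
      f <- freqItems if f.items.size > 1
      c <- f.items.toSeq
      antecedent = f.items - c
      confidence = f.freq.toDouble / freqOf(antecedent)
      if confidence >= minConfidence
    } yield Rule(antecedent, Set(c), confidence)

  // Same input type as fit(): one Seq(items).
  // Predicts consequents of every rule whose antecedent is contained in the input.
  def transform(items: Seq[String]): Set[String] = {
    val seen = items.toSet
    val predicted: Set[String] =
      associationRules.filter(_.antecedent.subsetOf(seen)).flatMap(_.consequent).toSet
    predicted -- seen
  }
}

object FPGrowthSketch {
  def fit(
      transactions: Seq[Seq[String]],
      minSupport: Double,
      minConfidence: Double): FPGrowthModelSketch = {
    val baskets = transactions.map(_.toSet)
    val minCount = math.ceil(minSupport * baskets.size).toLong
    // Enumerate every non-empty subset of every basket (exponential; demo only).
    val candidates = baskets.flatMap(_.subsets().filter(_.nonEmpty)).distinct
    val frequent = candidates
      .map(c => FreqItemset(c, baskets.count(b => c.subsetOf(b)).toLong))
      .filter(_.freq >= minCount)
    new FPGrowthModelSketch(frequent, minConfidence)
  }
}
```

For example, fitting on Seq(Seq("a", "b"), Seq("a", "b", "c"), Seq("a")) with minSupport = 0.5 and minConfidence = 0.8 keeps the itemsets {a}, {b} and {a, b}, yields the single rule {b} -> {a}, and transform(Seq("b")) then predicts "a". A caller never has to pass the output of one estimator into another, which is the point of combining the two APIs.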


> spark.ml Scala API for FPGrowth
> -------------------------------
>
>                 Key: SPARK-14503
>                 URL: https://issues.apache.org/jira/browse/SPARK-14503
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based API, with details for this class.  The doc could also look ahead to the other fpm classes, especially if their API decisions will affect FPGrowth.


