You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by zhengruifeng <gi...@git.apache.org> on 2016/02/19 08:48:26 UTC

[GitHub] spark pull request: https://issues.apache.org/jira/browse/SPARK-13...

GitHub user zhengruifeng opened a pull request:

    https://github.com/apache/spark/pull/11263

    https://issues.apache.org/jira/browse/SPARK-13385

    ## What changes were proposed in this pull request?
    
    Enable AssociationRules to generate consequents with user-defined lengths
    
    
    ## How was the this patch tested?
    
    unit test and manual test were done. The output‘s accuracy was checked on three open datasets comparing to SPMF toolbox (www.philippe-fournier-viger.com/spmf
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhengruifeng/spark rules

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11263.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11263
    
----
commit 705caac3ac85caa0126f7685515c785a715955ac
Author: Zheng RuiFeng <ru...@foxmail.com>
Date:   2016-02-18T03:37:45Z

    AssociationRules

commit 6729bac6fc35cc3c7deb6ae46ca5f704fbb4a10e
Author: Zheng RuiFeng <ru...@foxmail.com>
Date:   2016-02-18T08:47:04Z

    update AssociationRules

commit ce59d47ecb72e095322f0825aee19d9299b7617a
Author: Zheng RuiFeng <ru...@foxmail.com>
Date:   2016-02-18T11:02:26Z

    update AssociationRules

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11263#discussion_r54164393
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala ---
    @@ -56,30 +59,49 @@ class AssociationRules private[fpm] (
       }
     
       /**
    +   * Sets the maximum size of consequents used by Apriori Algorithm (default: `1`).
    +   */
    +  @Since("1.5.0")
    +  def setMaxConsequent(maxConsequent: Int): this.type = {
    +    this.maxConsequent = maxConsequent
    +    this
    +  }
    +
    +  /**
        * Computes the association rules with confidence above [[minConfidence]].
    +   *
        * @param freqItemsets frequent itemset model obtained from [[FPGrowth]]
        * @return a [[Set[Rule[Item]]] containing the assocation rules.
        *
        */
       @Since("1.5.0")
       def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]]): RDD[Rule[Item]] = {
    -    // For candidate rule X => Y, generate (X, (Y, freq(X union Y)))
    -    val candidates = freqItemsets.flatMap { itemset =>
    -      val items = itemset.items
    -      items.flatMap { item =>
    -        items.partition(_ == item) match {
    -          case (consequent, antecedent) if !antecedent.isEmpty =>
    -            Some((antecedent.toSeq, (consequent.toSeq, itemset.freq)))
    -          case _ => None
    +
    --- End diff --
    
    remove unnecessary empty lines


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng closed the pull request at:

    https://github.com/apache/spark/pull/11263


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11263#discussion_r54164432
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala ---
    @@ -88,6 +110,204 @@ class AssociationRules private[fpm] (
         val tag = fakeClassTag[Item]
         run(freqItemsets.rdd)(tag)
       }
    +
    +  /**
    +   * Computes the union seq.
    +   *
    +   * @param freqItemIndices Frequent Items with Integer Indices.
    +   * @param minConfidence   minConfidence.
    +   * @param maxConsequent   maxConsequent.
    +   * @return an ordered union Seq of s1 and s2.
    +   *
    +   */
    +  @Since("2.0.0")
    +  private def genRules(freqItemIndices: RDD[(Seq[Int], Long)],
    +                       minConfidence: Double,
    +                       maxConsequent: Int
    +                      ): RDD[(Seq[Int], Seq[Int], Long, Long)] = {
    +
    +    val sc = freqItemIndices.sparkContext
    +
    +    val initCandidates = freqItemIndices.flatMap {
    +      case (indices, freq) =>
    +        indices.flatMap {
    +          index =>
    +            indices.partition(_ == index) match {
    +              case (consequent, antecendent) if antecendent.nonEmpty =>
    +                Some((antecendent, (consequent, freq)))
    +              case _ => None
    +            }
    +        }
    +    }
    +
    +    val initRules = sc.emptyRDD[(Seq[Int], Seq[Int], Long, Long)]
    +
    +    @tailrec
    +    def loop(candidates: RDD[(Seq[Int], (Seq[Int], Long))],
    +             lenConsequent: Int,
    +             rules: RDD[(Seq[Int], Seq[Int], Long, Long)]
    +            ): RDD[(Seq[Int], Seq[Int], Long, Long)] = {
    +
    +      val numCandidates = candidates.count()
    +
    +      log.info(s"Candidates for ${lenConsequent}-consequent rules : ${numCandidates}")
    +
    +      if (numCandidates == 0 || lenConsequent > maxConsequent) {
    +        rules
    +      } else {
    +        val newRules = candidates.join(freqItemIndices).filter{
    +          case (antecendent, ((consequent, freqUnion), freqAntecedent)) =>
    +            freqUnion >= minConfidence * freqAntecedent
    +        }.map {
    +          case (antecendent, ((consequent, freqUnion), freqAntecedent)) =>
    +            (antecendent, consequent, freqUnion, freqAntecedent)
    +        }
    +
    +        val numNewRules = newRules.count()
    +        log.info(s"Generated ${lenConsequent}-consequent rules : ${numNewRules}")
    +
    +        if (numNewRules == 0) {
    +          rules
    +        } else if (lenConsequent == maxConsequent) {
    +          sc.union(rules, newRules)
    +        } else {
    +          val numPairs = (lenConsequent + 1) * lenConsequent / 2
    +          val newCandidates = newRules.filter{
    +            // rules whose antecendent's length equals to 1 can not be used to generate new rules
    +            case (antecendent, consequent, freqUnion, freqAntecedent) =>
    +              antecendent.size > 1
    +          }.map {
    +            case (antecendent, consequent, freqUnion, freqAntecedent) =>
    +              val union = seqAdd(antecendent, consequent)
    +              ((union, freqUnion), consequent)
    +          }.groupByKey().filter {
    +            // Rule pruning. For a (lenConsequent + 1)-consequent rules, at least
    +            // (lenConsequent + 1) lenConsequent-consequent rules are needed.
    +            case ((union, freqUnion), consequents) =>
    +              consequents.size > lenConsequent
    +          }.flatMap {
    +            case ((union, freqUnion), consequents) =>
    +              val array = consequents.toArray
    +              val newConsequentCount = collection.mutable.Map[Seq[Int], Int]()
    +              for (i <- 0 until array.length; j <- i + 1 until array.length) {
    +                val newConsequent = seqAdd(array(i), array(j))
    +                if (newConsequent.length == lenConsequent + 1) {
    +                  val cnt = newConsequentCount.getOrElse(newConsequent, 0)
    +                  newConsequentCount.update(newConsequent, cnt + 1)
    +                }
    +              }
    --- End diff --
    
    This paragraph is too long. We should break it down and give meaningful names to the intermediate result.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the pull request:

    https://github.com/apache/spark/pull/11263#issuecomment-189584541
  
    There is something wrong with my git operation. I will close this PR, and send another PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11263#discussion_r54164389
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala ---
    @@ -56,30 +59,49 @@ class AssociationRules private[fpm] (
       }
     
       /**
    +   * Sets the maximum size of consequents used by Apriori Algorithm (default: `1`).
    +   */
    +  @Since("1.5.0")
    --- End diff --
    
    The next release is 2.0.0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11263#discussion_r54164412
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala ---
    @@ -56,30 +59,49 @@ class AssociationRules private[fpm] (
       }
     
       /**
    +   * Sets the maximum size of consequents used by Apriori Algorithm (default: `1`).
    +   */
    +  @Since("1.5.0")
    +  def setMaxConsequent(maxConsequent: Int): this.type = {
    +    this.maxConsequent = maxConsequent
    +    this
    +  }
    +
    +  /**
        * Computes the association rules with confidence above [[minConfidence]].
    +   *
        * @param freqItemsets frequent itemset model obtained from [[FPGrowth]]
        * @return a [[Set[Rule[Item]]] containing the assocation rules.
        *
        */
       @Since("1.5.0")
       def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]]): RDD[Rule[Item]] = {
    -    // For candidate rule X => Y, generate (X, (Y, freq(X union Y)))
    -    val candidates = freqItemsets.flatMap { itemset =>
    -      val items = itemset.items
    -      items.flatMap { item =>
    -        items.partition(_ == item) match {
    -          case (consequent, antecedent) if !antecedent.isEmpty =>
    -            Some((antecedent.toSeq, (consequent.toSeq, itemset.freq)))
    -          case _ => None
    +
    +    val sc = freqItemsets.sparkContext
    +
    +    val freqItems = freqItemsets.filter(_.items.length == 1).flatMap(_.items).collect()
    +
    +    val freqItemIndices = freqItemsets.mapPartitions {
    +      it =>
    +        val itemToRank = freqItems.zipWithIndex.toMap
    +        it.map {
    +          itemset =>
    +            val indices = itemset.items.flatMap(itemToRank.get).sorted.toSeq
    +            (indices, itemset.freq)
             }
    -      }
         }
     
    -    // Join to get (X, ((Y, freq(X union Y)), freq(X))), generate rules, and filter by confidence
    -    candidates.join(freqItemsets.map(x => (x.items.toSeq, x.freq)))
    -      .map { case (antecendent, ((consequent, freqUnion), freqAntecedent)) =>
    -      new Rule(antecendent.toArray, consequent.toArray, freqUnion, freqAntecedent)
    -    }.filter(_.confidence >= minConfidence)
    +    val rules = genRules(freqItemIndices, minConfidence, maxConsequent)
    +
    +    rules.mapPartitions {
    --- End diff --
    
    this is the same as `rules.map { case ... =>`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11263#discussion_r54164398
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala ---
    @@ -56,30 +59,49 @@ class AssociationRules private[fpm] (
       }
     
       /**
    +   * Sets the maximum size of consequents used by Apriori Algorithm (default: `1`).
    +   */
    +  @Since("1.5.0")
    +  def setMaxConsequent(maxConsequent: Int): this.type = {
    +    this.maxConsequent = maxConsequent
    +    this
    +  }
    +
    +  /**
        * Computes the association rules with confidence above [[minConfidence]].
    +   *
        * @param freqItemsets frequent itemset model obtained from [[FPGrowth]]
        * @return a [[Set[Rule[Item]]] containing the assocation rules.
        *
        */
       @Since("1.5.0")
       def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]]): RDD[Rule[Item]] = {
    -    // For candidate rule X => Y, generate (X, (Y, freq(X union Y)))
    -    val candidates = freqItemsets.flatMap { itemset =>
    -      val items = itemset.items
    -      items.flatMap { item =>
    -        items.partition(_ == item) match {
    -          case (consequent, antecedent) if !antecedent.isEmpty =>
    -            Some((antecedent.toSeq, (consequent.toSeq, itemset.freq)))
    -          case _ => None
    +
    +    val sc = freqItemsets.sparkContext
    +
    +    val freqItems = freqItemsets.filter(_.items.length == 1).flatMap(_.items).collect()
    +
    +    val freqItemIndices = freqItemsets.mapPartitions {
    +      it =>
    --- End diff --
    
    * merge this with the one above
    * rename `it` to `iter`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11263#issuecomment-188989692
  
    @zhengruifeng I marked a few places. Please update your code or it is easier for others to review. I don't see any unit tests in our PR. Maybe you forget to include it in your commit. You should add them back to test the feature you implemented.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11263#discussion_r54164435
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala ---
    @@ -88,6 +110,204 @@ class AssociationRules private[fpm] (
         val tag = fakeClassTag[Item]
         run(freqItemsets.rdd)(tag)
       }
    +
    +  /**
    +   * Computes the union seq.
    +   *
    +   * @param freqItemIndices Frequent Items with Integer Indices.
    +   * @param minConfidence   minConfidence.
    +   * @param maxConsequent   maxConsequent.
    +   * @return an ordered union Seq of s1 and s2.
    +   *
    +   */
    +  @Since("2.0.0")
    +  private def genRules(freqItemIndices: RDD[(Seq[Int], Long)],
    +                       minConfidence: Double,
    +                       maxConsequent: Int
    +                      ): RDD[(Seq[Int], Seq[Int], Long, Long)] = {
    +
    +    val sc = freqItemIndices.sparkContext
    +
    +    val initCandidates = freqItemIndices.flatMap {
    +      case (indices, freq) =>
    +        indices.flatMap {
    +          index =>
    +            indices.partition(_ == index) match {
    +              case (consequent, antecendent) if antecendent.nonEmpty =>
    +                Some((antecendent, (consequent, freq)))
    +              case _ => None
    +            }
    +        }
    +    }
    +
    +    val initRules = sc.emptyRDD[(Seq[Int], Seq[Int], Long, Long)]
    +
    +    @tailrec
    +    def loop(candidates: RDD[(Seq[Int], (Seq[Int], Long))],
    +             lenConsequent: Int,
    +             rules: RDD[(Seq[Int], Seq[Int], Long, Long)]
    +            ): RDD[(Seq[Int], Seq[Int], Long, Long)] = {
    +
    +      val numCandidates = candidates.count()
    +
    +      log.info(s"Candidates for ${lenConsequent}-consequent rules : ${numCandidates}")
    +
    +      if (numCandidates == 0 || lenConsequent > maxConsequent) {
    +        rules
    +      } else {
    +        val newRules = candidates.join(freqItemIndices).filter{
    +          case (antecendent, ((consequent, freqUnion), freqAntecedent)) =>
    +            freqUnion >= minConfidence * freqAntecedent
    +        }.map {
    +          case (antecendent, ((consequent, freqUnion), freqAntecedent)) =>
    +            (antecendent, consequent, freqUnion, freqAntecedent)
    +        }
    +
    +        val numNewRules = newRules.count()
    +        log.info(s"Generated ${lenConsequent}-consequent rules : ${numNewRules}")
    +
    +        if (numNewRules == 0) {
    +          rules
    +        } else if (lenConsequent == maxConsequent) {
    +          sc.union(rules, newRules)
    +        } else {
    +          val numPairs = (lenConsequent + 1) * lenConsequent / 2
    +          val newCandidates = newRules.filter{
    +            // rules whose antecendent's length equals to 1 can not be used to generate new rules
    +            case (antecendent, consequent, freqUnion, freqAntecedent) =>
    +              antecendent.size > 1
    +          }.map {
    +            case (antecendent, consequent, freqUnion, freqAntecedent) =>
    +              val union = seqAdd(antecendent, consequent)
    +              ((union, freqUnion), consequent)
    +          }.groupByKey().filter {
    +            // Rule pruning. For a (lenConsequent + 1)-consequent rules, at least
    +            // (lenConsequent + 1) lenConsequent-consequent rules are needed.
    +            case ((union, freqUnion), consequents) =>
    +              consequents.size > lenConsequent
    +          }.flatMap {
    +            case ((union, freqUnion), consequents) =>
    +              val array = consequents.toArray
    +              val newConsequentCount = collection.mutable.Map[Seq[Int], Int]()
    +              for (i <- 0 until array.length; j <- i + 1 until array.length) {
    +                val newConsequent = seqAdd(array(i), array(j))
    +                if (newConsequent.length == lenConsequent + 1) {
    +                  val cnt = newConsequentCount.getOrElse(newConsequent, 0)
    +                  newConsequentCount.update(newConsequent, cnt + 1)
    +                }
    +              }
    +
    +              newConsequentCount.filter{
    --- End diff --
    
    space before `{`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11263#issuecomment-188984338
  
    @zhengruifeng There are still inconsistent code. You can run `dev/lint-scala` locally to check. But it doesn't cover all issues. I will list a few inline.
    
    @zhangjiajin @feynmanliang Do you have time to make a pass?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11263#issuecomment-188987613
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11263#issuecomment-188987614
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51993/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11263#issuecomment-187587555
  
    @zhengruifeng Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and follow the Spark code style defined there. There are some unnecessary changes due to code style change in your PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11263#discussion_r54164424
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala ---
    @@ -88,6 +110,204 @@ class AssociationRules private[fpm] (
         val tag = fakeClassTag[Item]
         run(freqItemsets.rdd)(tag)
       }
    +
    +  /**
    +   * Computes the union seq.
    +   *
    +   * @param freqItemIndices Frequent Items with Integer Indices.
    +   * @param minConfidence   minConfidence.
    +   * @param maxConsequent   maxConsequent.
    +   * @return an ordered union Seq of s1 and s2.
    +   *
    +   */
    +  @Since("2.0.0")
    +  private def genRules(freqItemIndices: RDD[(Seq[Int], Long)],
    --- End diff --
    
    chop down arguments and use 4-space indentation. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11263#issuecomment-188984392
  
    ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11263#discussion_r54164418
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala ---
    @@ -88,6 +110,204 @@ class AssociationRules private[fpm] (
         val tag = fakeClassTag[Item]
         run(freqItemsets.rdd)(tag)
       }
    +
    +  /**
    +   * Computes the union seq.
    +   *
    +   * @param freqItemIndices Frequent Items with Integer Indices.
    +   * @param minConfidence   minConfidence.
    --- End diff --
    
    no vertical alignment


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the pull request:

    https://github.com/apache/spark/pull/11263#issuecomment-187644623
  
    @mengxr Ok, I have fixed it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11263#discussion_r54164401
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala ---
    @@ -56,30 +59,49 @@ class AssociationRules private[fpm] (
       }
     
       /**
    +   * Sets the maximum size of consequents used by Apriori Algorithm (default: `1`).
    +   */
    +  @Since("1.5.0")
    +  def setMaxConsequent(maxConsequent: Int): this.type = {
    +    this.maxConsequent = maxConsequent
    +    this
    +  }
    +
    +  /**
        * Computes the association rules with confidence above [[minConfidence]].
    +   *
        * @param freqItemsets frequent itemset model obtained from [[FPGrowth]]
        * @return a [[Set[Rule[Item]]] containing the assocation rules.
        *
        */
       @Since("1.5.0")
       def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]]): RDD[Rule[Item]] = {
    -    // For candidate rule X => Y, generate (X, (Y, freq(X union Y)))
    -    val candidates = freqItemsets.flatMap { itemset =>
    -      val items = itemset.items
    -      items.flatMap { item =>
    -        items.partition(_ == item) match {
    -          case (consequent, antecedent) if !antecedent.isEmpty =>
    -            Some((antecedent.toSeq, (consequent.toSeq, itemset.freq)))
    -          case _ => None
    +
    +    val sc = freqItemsets.sparkContext
    +
    +    val freqItems = freqItemsets.filter(_.items.length == 1).flatMap(_.items).collect()
    +
    +    val freqItemIndices = freqItemsets.mapPartitions {
    +      it =>
    +        val itemToRank = freqItems.zipWithIndex.toMap
    --- End diff --
    
    move this line outside the closure and use `.map {` instead of `mapPartitions`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11263#issuecomment-186107355
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11263#issuecomment-188987597
  
    **[Test build #51993 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51993/consoleFull)** for PR 11263 at commit [`848b958`](https://github.com/apache/spark/commit/848b9581eebf164ad136c3fd00d38509347d3622).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-13385][MLlib] Enable AssociationRules t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11263#issuecomment-188986185
  
    **[Test build #51993 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51993/consoleFull)** for PR 11263 at commit [`848b958`](https://github.com/apache/spark/commit/848b9581eebf164ad136c3fd00d38509347d3622).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org