Posted to reviews@spark.apache.org by yu-iskw <gi...@git.apache.org> on 2014/10/23 11:29:22 UTC

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

GitHub user yu-iskw opened a pull request:

    https://github.com/apache/spark/pull/2906

    [SPARK-2429] [MLlib] Hierarchical Implementation of KMeans

    I want to add a divisive hierarchical clustering algorithm implementation to MLlib. It doesn't support distance metrics other than the Euclidean distance metric yet; it would be nice to add support for other metrics in a follow-up issue.
    Could you review it?
    
    Thanks!
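
    For reference, here is a minimal usage sketch of the API added in this patch (the input data is made up for illustration, and `sc` is assumed to be an existing SparkContext):

        import org.apache.spark.mllib.clustering.HierarchicalClustering
        import org.apache.spark.mllib.linalg.Vectors

        // four 2-dimensional points forming two obvious groups
        val data = sc.parallelize(Seq(
          Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
          Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
        // train a model with at most 2 clusters
        val model = HierarchicalClustering.train(data, 2)
        // assign each point to its closest cluster: RDD[(Int, Vector)]
        val assignments = model.predict(data)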


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yu-iskw/spark hierarchical

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2906.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2906
    


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60460891
  
      [Test build #22179 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22179/consoleFull) for   PR 2906 at commit [`91a38e3`](https://github.com/apache/spark/commit/91a38e361ac89933cb6e774cd05624f20e7b0344).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22632804
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,627 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * This trait is used for the configuration of the hierarchical clustering
    + */
    +sealed
    +trait HierarchicalClusteringConf extends Serializable {
    +  this: HierarchicalClustering =>
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def setSubIterations(subIterations: Int): this.type = {
    +    this.subIterations = subIterations
    +    this
    +  }
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
    + *
    + * The main idea of this algorithm is derived from:
    + * "A comparison of document clustering techniques",
    + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
    + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations in each splitting (digging) step
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClustering(
    +  private[mllib] var numClusters: Int,
    +  private[mllib] var subIterations: Int,
    +  private[mllib] var numRetries: Int,
    +  private[mllib] var epsilon: Double,
    +  private[mllib] var randomSeed: Int,
    +  private[mllib] var randomRange: Double)
    +    extends Serializable with Logging with HierarchicalClusteringConf {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
    +
    +  /** Shows the parameters */
    +  override def toString(): String = {
    +    Array(
    +      s"numClusters:${numClusters}",
    +      s"subIterations:${subIterations}",
    +      s"numRetries:${numRetries}",
    +      s"epsilon:${epsilon}",
    +      s"randomSeed:${randomSeed}",
    +      s"randomRange:${randomRange}"
    +    ).mkString(", ")
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${this}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // Stop the training once the following conditions are satisfied:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters reaches the requested number of clusters
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.numClusters) {
    +
    +      // retry the split several times in order to avoid a bad clustering result
    +      var isMerged = false
    +      for (i <- 1 to this.numRetries) {
    +        if (node.get.getVariance().get > this.epsilon && isMerged == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          if (subNodes.size == 2) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            // unpersist unnecessary cache because its children nodes are cached
    +            node.get.data.unpersist()
    +            logInfo(s"the number of clusters is ${model.clusterTree.getTreeSize()} at step ${step}")
    +            isMerged = true
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    val trainTime = (System.currentTimeMillis() - startTime).toInt
    +    logInfo(s"Elapsed Time for Training: ${trainTime.toDouble / 1000} [sec]")
    +    model
    +  }
    +
    +  /**
    +   * Validates the given training data
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    require(this.numClusters <= data.count(), "# clusters must be less than or equal to # data rows")
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the leaf cluster with the maximum variance
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bisecting k-means
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bisecting k-means
    +   *
    +   * @param clusterTree the cluster to split
    +   * @return an array of ClusterTree; its size is generally 2, but it can be 1
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    val sc = data.sparkContext
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    sc.broadcast(metric)
    +
    +    // The iteration stops when any of the following conditions is satisfied:
    +    //   1. the relative error is less than the configured epsilon
    +    //   2. the number of executed iterations exceeds the configured maximum
    +    //   3. there is only one center, which means the cluster is not splittable
    +    var numIter = 0
    +    var error = Double.MaxValue
    +    while (error > this.epsilon
    +        && numIter < this.subIterations
    +        && centers.size > 1) {
    +      val startTimeOfIter = System.currentTimeMillis()
    +
    +      sc.broadcast(centers)
    +      val newCenters = data.mapPartitions { iter =>
    +        // accumulate the sum of all points in a partition and count the rows
    +        val map = scala.collection.mutable.Map.empty[Int, (BV[Double], Int)]
    +        iter.foreach { point =>
    +          val idx = ClusterTree.findClosestCenter(metric)(centers)(point)
    +          val (sumBV, n) = map.get(idx)
    +              .getOrElse((new BSV[Double](Array(), Array(), point.size), 0))
    +          map(idx) = (sumBV + point, n + 1)
    +        }
    +        map.toIterator
    +      }.reduceByKeyLocally {
    +        // sum the accumulations and the counts across all partitions
    +        case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2)
    +      }.map { case ((idx: Int, (center: BV[Double], counts: Int))) =>
    +        center :/ counts.toDouble
    +      }
    +
    +      val normSum = centers.map(v => breezeNorm(v, 2.0)).sum
    +      val newNormSum = newCenters.map(v => breezeNorm(v, 2.0)).sum
    +      error = math.abs((normSum - newNormSum) / normSum)
    +      centers = newCenters.toArray
    +      numIter += 1
    +
    +      logInfo(s"${numIter} iterations are finished" +
    +          s" in ${System.currentTimeMillis() - startTimeOfIter} [ms]" +
    +          s" at ${getClass}.split")
    +    }
    +
    +    val vectors = centers.map(center => Vectors.fromBreeze(center))
    +    val nodes = centers.size match {
    +      case 1 => Array(new ClusterTree(vectors(0), data))
    +      case 2 => {
    +        val closest = data.map(p => (ClusterTree.findClosestCenter(metric)(centers)(p), p))
    +        centers.zipWithIndex.map { case (center, i) =>
    +          val subData = closest.filter(_._1 == i).map(_._2)
    +          subData.cache
    +          new ClusterTree(vectors(i), subData)
    +        }
    +      }
    +      case _ => throw new RuntimeException(s"something wrong with # centers:${centers.size}")
    +    }
    +    logInfo(s"${this.getClass.getSimpleName}.split end" +
    +        s" with ${numIter} total iterations" +
    +        s" in ${System.currentTimeMillis() - startTime} [ms]")
    +    nodes
    +  }
    +}
    +
    +/**
    + * top-level methods for calling the hierarchical clustering algorithm
    + */
    +object HierarchicalClustering {
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data
    +   *
    +   * @param data training data
    +   * @param numClusters the maximum number of clusters you want
    +   * @return a hierarchical clustering model
    +   */
    +  def train(data: RDD[Vector], numClusters: Int): HierarchicalClusteringModel = {
    +    val app = new HierarchicalClustering().setNumClusters(numClusters)
    +    app.run(data)
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data
    +   *
    +   * @param data training data
    +   * @param numClusters the maximum number of clusters you want
    +   * @param subIterations the iteration of
    --- End diff --
    
    Incomplete sentence?




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60546587
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22267/
    Test PASSed.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22633847
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala ---
    @@ -0,0 +1,126 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * This class represents the model of the hierarchical clustering
    + *
    + * @param clusterTree a cluster as a tree node
    + * @param isTrained if the model has been trained, the flag is true
    + */
    +class HierarchicalClusteringModel private (
    +  val clusterTree: ClusterTree,
    +  private[mllib] var isTrained: Boolean) extends Serializable with Logging with Cloneable {
    +
    +  def this(clusterTree: ClusterTree) = this(clusterTree, false)
    +
    +  override def clone(): HierarchicalClusteringModel = {
    +    new HierarchicalClusteringModel(this.clusterTree.clone(), true)
    +  }
    +
    +  /**
    +   * Cuts a cluster tree by given threshold of dendrogram height
    +   *
    +   * @param height a threshold to cut a cluster tree
    +   * @return a hierarchical clustering model
    +   */
    +  def cut(height: Double): HierarchicalClusteringModel = {
    +    val cloned = this.clone()
    +    cloned.clusterTree.cut(height)
    +    cloned
    +  }
    +
    +  /**
    +   * Predicts the closest cluster of each point
    +   */
    +  def predict(vector: Vector): Int = {
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    this.clusterTree.assignClusterIndex(metric)(vector)
    +  }
    +
    +  /**
    +   * Predicts the closest cluster of each point
    +   */
    +  def predict(data: RDD[Vector]): RDD[(Int, Vector)] = {
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val sc = data.sparkContext
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    val treeRoot = this.clusterTree
    +    sc.broadcast(metric)
    +    sc.broadcast(treeRoot)
    +    val predicted = data.map(point => (treeRoot.assignClusterIndex(metric)(point), point))
    +
    +    val predictTime = System.currentTimeMillis() - startTime
    +    logInfo(s"Predicting Time: ${predictTime.toDouble / 1000} [sec]")
    +
    +    predicted
    +  }
    +
    +  /** Maps given points to their cluster indices. */
    +  def predict(points: JavaRDD[Vector]): JavaRDD[java.lang.Integer] =
    +    predict(points.rdd).map(_._1).toJavaRDD().asInstanceOf[JavaRDD[java.lang.Integer]]
    +
    +  /**
    +   * Computes the sum of the variances of all clusters
    +   */
    +  def getSumOfVariance(): Double = this.getClusters().map(_.getVariance().get).sum
    +
    +  def getClusters(): Array[ClusterTree] = clusterTree.getClusters().toArray
    +
    +  def getCenters(): Array[Vector] = getClusters().map(_.center)
    +
    +  /**
    +   * Converts the tree into a cluster merging list.
    +   * The returned data format fits scipy's dendrogram function.
    +   * SEE ALSO: scipy.cluster.hierarchy.dendrogram
    +   *
    +   * @return List[(node1, node2, distance, tree size)]
    +   */
    +  def toMergeList(): List[(Int, Int, Double, Int)] = {
    +    val seq = this.clusterTree.toSeq().sortWith{ case (a, b) => a.getHeight() < b.getHeight()}
    +    val leaves = seq.filter(_.isLeaf())
    +    val nodes = seq.filter(!_.isLeaf()).filter(_.children.size > 1)
    +    val clusters = leaves ++ nodes
    +    val treeMap = clusters.zipWithIndex.map { case (tree, idx) => (tree -> idx)}.toMap
    +
    +    // If a node only has one-child, the child is regarded as the cluster of the child.
    --- End diff --
    
    I find this description a little hard to follow.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60460562
  
      [Test build #22177 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22177/consoleFull) for   PR 2906 at commit [`91a38e3`](https://github.com/apache/spark/commit/91a38e361ac89933cb6e774cd05624f20e7b0344).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22640249
  
    --- Diff: data/mllib/sample_hierarchical_data.csv ---
    @@ -0,0 +1,150 @@
    +5.1,3.5,1.4,0.2
    --- End diff --
    
    Minor point - this wouldn't really be CSV though. I imagine the example shows parsing a common encoding like this on purpose.
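
    To illustrate, a line of such a file parses with a plain split rather than a CSV library; a minimal sketch, assuming an existing SparkContext `sc`:

        import org.apache.spark.mllib.linalg.Vectors

        // each line is a comma-separated list of doubles, e.g. "5.1,3.5,1.4,0.2"
        val points = sc.textFile("data/mllib/sample_hierarchical_data.csv")
          .map(line => Vectors.dense(line.split(',').map(_.toDouble)))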




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22632678
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,627 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * This trait is used for the configuration of the hierarchical clustering
    + */
    +sealed
    +trait HierarchicalClusteringConf extends Serializable {
    +  this: HierarchicalClustering =>
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def setSubIterations(subIterations: Int): this.type = {
    +    this.subIterations = subIterations
    +    this
    +  }
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
    + *
    + * The main idea of this algorithm is derived from:
    + * "A comparison of document clustering techniques",
    + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
    + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations in each splitting (digging) step
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClustering(
    +  private[mllib] var numClusters: Int,
    +  private[mllib] var subIterations: Int,
    +  private[mllib] var numRetries: Int,
    +  private[mllib] var epsilon: Double,
    +  private[mllib] var randomSeed: Int,
    +  private[mllib] var randomRange: Double)
    +    extends Serializable with Logging with HierarchicalClusteringConf {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
    +
    +  /** Shows the parameters */
    +  override def toString(): String = {
    +    Array(
    +      s"numClusters:${numClusters}",
    +      s"subIterations:${subIterations}",
    +      s"numRetries:${numRetries}",
    +      s"epsilon:${epsilon}",
    +      s"randomSeed:${randomSeed}",
    +      s"randomRange:${randomRange}"
    +    ).mkString(", ")
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${this}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // Stop the training once the following conditions are satisfied:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters reaches the requested number of clusters
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.numClusters) {
    +
    +      // retry the split several times in order to avoid a bad clustering result
    +      var isMerged = false
    +      for (i <- 1 to this.numRetries) {
    +        if (node.get.getVariance().get > this.epsilon && isMerged == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          if (subNodes.size == 2) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            // unpersist unnecessary cache because its children nodes are cached
    +            node.get.data.unpersist()
    +            logInfo(s"the number of clusters is ${model.clusterTree.getTreeSize()} at step ${step}")
    +            isMerged = true
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    val trainTime = (System.currentTimeMillis() - startTime).toInt
    +    logInfo(s"Elapsed Time for Training: ${trainTime.toDouble / 1000} [sec]")
    +    model
    +  }
    +
    +  /**
    +   * Validates the given training data
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    require(this.numClusters <= data.count(), "# clusters must be less than or equal to # data rows")
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the leaf cluster with the maximum variance
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bisecting k-means
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bisecting k-means
    +   *
    +   * @param clusterTree the cluster to split
    +   * @return an array of ClusterTree; its size is generally 2, but it can be 1
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    val sc = data.sparkContext
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    sc.broadcast(metric)
    --- End diff --
    
    No output, see my other note about `sc.broadcast`.
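
    For context, a broadcast only takes effect when the returned handle is read on the executors; a minimal sketch of that pattern (the variable names here are illustrative):

        // keep the handle returned by sc.broadcast ...
        val bcCenters = sc.broadcast(centers)
        val closest = data.map { point =>
          // ... and dereference it inside the closure instead of
          // capturing `centers` directly
          ClusterTree.findClosestCenter(metric)(bcCenters.value)(point)
        }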




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19288634
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,549 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * the configuration for a hierarchical clustering algorithm
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations in each splitting (digging) step
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClusteringConf(
    +  private var numClusters: Int,
    +  private var subIterations: Int,
    +  private var numRetries: Int,
    +  private var epsilon: Double,
    +  private var randomSeed: Int,
    +  private[mllib] var randomRange: Double) extends Serializable {
    +
    +  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setSubIterations(iterations: Int): this.type = {
    +    this.subIterations = iterations
    +    this
    +  }
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
    + *
    + * @param conf the configuration class for the hierarchical clustering
    + */
    +class HierarchicalClustering(val conf: HierarchicalClusteringConf)
    +    extends Serializable with Logging {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(new HierarchicalClusteringConf())
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${conf.toString}")
    --- End diff --
    
    Trivial but can this be just `$conf`? and similarly for other format strings
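
    That is, Scala string interpolation already calls `toString` on the interpolated value, so the two forms are equivalent:

        logInfo(s"Run with $conf")            // interpolation calls conf.toString
        logInfo(s"Run with ${conf.toString}") // same output, more noise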




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60544649
  
      [Test build #22270 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22270/consoleFull) for   PR 2906 at commit [`8dbbacd`](https://github.com/apache/spark/commit/8dbbacd2e7f27e111b7237006fde73d1cf3eb5e7).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-62325997
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23124/
    Test FAILed.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19288713
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,549 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * the configuration for a hierarchical clustering algorithm
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations in each splitting (digging) step
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClusteringConf(
    +  private var numClusters: Int,
    +  private var subIterations: Int,
    +  private var numRetries: Int,
    +  private var epsilon: Double,
    +  private var randomSeed: Int,
    +  private[mllib] var randomRange: Double) extends Serializable {
    +
    +  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setSubIterations(iterations: Int): this.type = {
    +    this.subIterations = iterations
    +    this
    +  }
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
    + *
    + * @param conf the configuration class for the hierarchical clustering
    + */
    +class HierarchicalClustering(val conf: HierarchicalClusteringConf)
    +    extends Serializable with Logging {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(new HierarchicalClusteringConf())
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${conf.toString}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // Stop the training once the following conditions are satisfied:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters reaches the requested number of clusters
    +    //   3. The total variance of all clusters increases when a cluster is split
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.conf.getNumClusters
    +        && totalVariance >= newTotalVariance) {
    +
    +      // retry the split several times in order to avoid a bad clustering result
    +      var isMerged = false
    +      var isSingleCluster = false
    +      for (retry <- 1 to this.conf.getNumRetries()) {
    +        if (isMerged == false && isSingleCluster == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          // if only one sub node came back, the cluster is not splittable
    +          if (subNodes.size == 1) isSingleCluster = true
    +          // add the sub nodes into the tree if the sum of the variances
    +          // of the sub nodes is less than that of the pre-split node
    +          if (node.get.getVariance().get > subNodes.map(_.getVariance().get).sum) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            isMerged = true
    +            logInfo(s"the number of clusters is ${model.clusterTree.getTreeSize()} at step ${step}")
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      totalVariance = newTotalVariance
    +      newTotalVariance = model.clusterTree.toSeq().filter(_.isLeaf()).map(_.getVariance().get).sum
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    model.trainTime = (System.currentTimeMillis() - startTime).toInt
    +    model
    +  }
    +
    +  /**
    +   * Validates the given training data
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    conf match {
    +      case conf if conf.getNumClusters() > data.count() =>
    --- End diff --
    
    Can this use `require`?
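
    That is, the whole check could collapse to a single precondition; a sketch:

        private def validateData(data: RDD[Vector]) {
          require(conf.getNumClusters() <= data.count(),
            "the number of clusters must not exceed the number of data rows")
        }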




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-68787775
  
    @yu-iskw @rnowling, I asked @freeman-lab to make one pass on this PR. Let's ping him :)




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22632101
  
    --- Diff: data/mllib/sample_hierarchical_data.csv ---
    @@ -0,0 +1,150 @@
    +5.1,3.5,1.4,0.2
    --- End diff --
    
    It might be nice if this could be parsed directly by `Vectors.parse`, it would just require adding a `[` and `]` at the start and end of each line.
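
    Alternatively, the brackets could be added on the fly while loading, so the file stays as it is; a sketch assuming an existing SparkContext `sc`:

        // "5.1,3.5,1.4,0.2" -> "[5.1,3.5,1.4,0.2]", which Vectors.parse accepts
        val points = sc.textFile("data/mllib/sample_hierarchical_data.csv")
          .map(line => Vectors.parse(s"[$line]"))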




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-62147990
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23052/
    Test PASSed.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-62302445
  
    There are a few conflicts with the master branch. I will rebase my PR branch and then force-push it.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-62328415
  
      [Test build #23125 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23125/consoleFull) for   PR 2906 at commit [`b0b061e`](https://github.com/apache/spark/commit/b0b061edc4c2ad42deda00bb664534e1334b50e5).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19289245
  
    --- Diff: mllib/src/test/java/org/apache/spark/mllib/clustering/JavaHierarchicalClusteringSuite.java ---
    @@ -0,0 +1,78 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering;
    +
    +import com.google.common.collect.Lists;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.Vector;
    +import org.apache.spark.mllib.linalg.Vectors;
    +import org.junit.After;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.io.Serializable;
    +import java.util.List;
    +
    +import static org.junit.Assert.assertEquals;
    +
    +public class JavaHierarchicalClusteringSuite implements Serializable {
    +    private transient JavaSparkContext sc;
    --- End diff --
    
    Looks like this is using 4-space indent but should be 2.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-62147985
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23052/consoleFull) for   PR 2906 at commit [`8355f95`](https://github.com/apache/spark/commit/8355f959f02ca67454c9cb070912480db0a44671).
     * This patch **passes all tests**.
     * This patch **does not merge cleanly**.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaHierarchicalClustering `
      * `trait HierarchicalClusteringConf extends Serializable `
      * `class HierarchicalClustering(`
      * `class HierarchicalClusteringModel(object):`
      * `class HierarchicalClustering(object):`





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60921575
  
    @mengxr I added a performance test for vector sparsity as "Experiment 5: The Effects of Vector Sparsity". You can download the new result at the URL below. Please take a look.
    
    https://issues.apache.org/jira/secure/attachment/12677880/benchmark-result.2014-10-29.html




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by rnowling <gi...@git.apache.org>.
Github user rnowling commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19267891
  
    --- Diff: python/pyspark/mllib/clustering.py ---
    @@ -91,6 +99,58 @@ def train(cls, rdd, k, maxIterations=100, runs=1, initializationMode="k-means||"
             return KMeansModel([c.toArray() for c in centers])
     
     
    +class HierarchicalClusteringModel(ClusteringModel):
    --- End diff --
    
    The predict method seems to be O(kN), but you can do assignment in O(N log k) time with the tree, right? (N is the number of data points, k is the number of cluster centers.)
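    
    For illustration, a minimal sketch of that tree descent (hedged: `ClusterNode`, `center`, and `children` are made-up names for this sketch, not the PR's actual API):
    
        import numpy as np
        
        class ClusterNode(object):
            """Hypothetical binary cluster-tree node: a center plus 0 or 2 children."""
            def __init__(self, center, children=None):
                self.center = np.asarray(center, dtype=float)
                self.children = children or []
        
        def assign(root, point):
            """Descend the tree, taking the closer child at each level,
            so each point costs O(log k) instead of O(k)."""
            point = np.asarray(point, dtype=float)
            node = root
            while node.children:
                node = min(node.children,
                           key=lambda c: np.linalg.norm(c.center - point))
            return node
        
        # toy usage: a root with two leaves
        root = ClusterNode([0.5, 0.5],
                           [ClusterNode([0.0, 0.0]), ClusterNode([1.0, 1.0])])
        print(assign(root, [0.9, 0.8]).center)  # -> [1. 1.]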




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60566674
  
    @mengxr thank you for your feedback.
    
    > Is there a paper that you used as reference? If so, please cite it in the doc.
    Yes. I added the citation to the doc.
    https://github.com/yu-iskw/spark/commit/6b22f0752d5d692912c1e8a5e3390326e5d8ebc6
    
    > Could you send some performance testing results on dense and sparse datasets?
    I had only tested the performance on dense datasets; you can download that benchmark result at the URL below. However, because I changed the algorithm, I will run the tests again and send you the new result.
    https://issues.apache.org/jira/secure/attachment/12675783/benchmark2.html




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22632146
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/mllib/JavaHierarchicalClustering.java ---
    @@ -0,0 +1,73 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib;
    +
    +import org.apache.spark.SparkConf;
    --- End diff --
    
    Imports should be ordered alphabetically.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19396916
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,549 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * the configuration for a hierarchical clustering algorithm
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations in each bisecting step
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used when sampling data to initialize centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClusteringConf(
    +  private var numClusters: Int,
    +  private var subIterations: Int,
    +  private var numRetries: Int,
    +  private var epsilon: Double,
    +  private var randomSeed: Int,
    +  private[mllib] var randomRange: Double) extends Serializable {
    +
    +  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setSubIterations(iterations: Int): this.type = {
    +    this.subIterations = iterations
    +    this
    +  }
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
    + *
    + * @param conf the configuration class for the hierarchical clustering
    + */
    +class HierarchicalClustering(val conf: HierarchicalClusteringConf)
    +    extends Serializable with Logging {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(new HierarchicalClusteringConf())
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${conf.toString}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // Stop the training when any of the following conditions is satisfied:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters reaches the requested number of clusters
    +    //   3. The total variance over all clusters increases when a cluster is split
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.conf.getNumClusters
    +        && totalVariance >= newTotalVariance) {
    +
    +      // retry the split several times so as not to end up with a poor clustering result
    +      var isMerged = false
    +      var isSingleCluster = false
    +      for (retry <- 1 to this.conf.getNumRetries()) {
    +        if (isMerged == false && isSingleCluster == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          // a single sub node means there is no splittable node
    +          if (subNodes.size == 1) isSingleCluster = true
    +          // add the sub nodes to the tree only if the split decreases the variance,
    +          // i.e. the parent's variance is greater than the sum of the sub nodes' variances
    +          if (node.get.getVariance().get > subNodes.map(_.getVariance().get).sum) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            isMerged = true
    +            logInfo(s"the number of cluster is ${model.clusterTree.getTreeSize()} at step ${step}")
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      totalVariance = newTotalVariance
    +      newTotalVariance = model.clusterTree.toSeq().filter(_.isLeaf()).map(_.getVariance().get).sum
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    model.trainTime = (System.currentTimeMillis() - startTime).toInt
    +    model
    +  }
    +
    +  /**
    +   * validate the given data to train
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    conf match {
    +      case conf if conf.getNumClusters() > data.count() =>
    +        throw new IllegalArgumentException("# clusters must be less than # input data records")
    +      case _ =>
    +    }
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the max variance of clusters which are leafs of a tree
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bi-sect k-means
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.conf.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.conf.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bi-sect k-means
    +   *
    +   * @param clusterTree the cluster to split
    +   * @return an array of ClusterTree. its size is generally 2, but its size can be 1
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    var finder = ClusterTree.findClosestCenter(metric)(centers) _
    +
    +    // The iteration stops when any of the following conditions is satisfied:
    +    //   1. the relative error falls below the configured epsilon
    +    //   2. the number of executed iterations reaches the configured limit
    +    //   3. only one center remains, which means the cluster is not splittable
    +    var numIter = 0
    +    var error = Double.MaxValue
    +    while (error > conf.getEpsilon()
    +        && numIter < conf.getSubIterations()
    +        && centers.size > 1) {
    +
    +      val startTimeOfIter = System.currentTimeMillis()
    +      // finds the closest center of each point
    +      data.sparkContext.broadcast(finder)
    +      val newCenters = data.mapPartitions { iter =>
    +        // accumulate the sum of all points in each partition and count the rows
    +        val map = scala.collection.mutable.Map.empty[Int, (BV[Double], Int)]
    +        iter.foreach { point =>
    +          val idx = finder(point)
    +          val (sumBV, n) = map.get(idx).getOrElse((BV.zeros[Double](point.size), 0))
    +          map(idx) = (sumBV + point, n + 1)
    +        }
    +        map.toIterator
    +      }.reduceByKeyLocally {
    +        // sum the accumulations and the counts across all partitions
    +        case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2)
    +      }.map { case ((idx: Int, (center: BV[Double], counts: Int))) =>
    +        center :/ counts.toDouble
    +      }
    +
    +      val normSum = centers.map(v => breezeNorm(v, 2.0)).sum
    +      val newNormSum = newCenters.map(v => breezeNorm(v, 2.0)).sum
    +      error = Math.abs((normSum - newNormSum) / normSum)
    +      centers = newCenters.toArray
    +      numIter += 1
    +      finder = ClusterTree.findClosestCenter(metric)(centers) _
    +
    +      logInfo(s"${numIter} iterations is finished" +
    +          s" for ${System.currentTimeMillis() - startTimeOfIter}" +
    +          s" at ${getClass}.split")
    +    }
    +
    +    val vectors = centers.map(center => Vectors.fromBreeze(center))
    +    val nodes = centers.size match {
    +      case 1 => Array(new ClusterTree(vectors(0), data))
    +      case 2 => {
    +        val closest = data.map(point => (finder(point), point))
    +        centers.zipWithIndex.map { case (center, i) =>
    +          val subData = closest.filter(_._1 == i).map(_._2)
    +          subData.cache
    --- End diff --
    
    I added the `unpersist` call.
    https://github.com/yu-iskw/spark/commit/028d317438bdf9e3dd11246bc331f771d7d1dffe
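    
    For reference, a minimal PySpark sketch of the pattern (hedged: toy one-dimensional data and a made-up `closest` helper, not the PR's code):
    
        from pyspark import SparkContext
        
        sc = SparkContext("local", "unpersist-sketch")
        parent = sc.parallelize([0.1, 0.2, 0.9, 1.1]).cache()
        
        # assign each point to the nearer of two hypothetical centers
        centers = [0.0, 1.0]
        def closest(p):
            return min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        
        left = parent.filter(lambda p: closest(p) == 0).cache()
        right = parent.filter(lambda p: closest(p) == 1).cache()
        left.count(), right.count()  # materialize the children's caches
        
        # the children now hold their own cached subsets, so the parent's
        # cached data is no longer needed
        parent.unpersist()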




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-69134341
  
    Hi @yu-iskw and @rnowling , I've spent time reviewing the code and using it in both Python and Scala. Overall great work, terrific to see my little gist turned into something so refined and performant! =) I left lots of comments, most minor, though documenting the caching behavior seems quite important.
    
    The one significant addition I'd suggest is exposing another model output: a list of the centers at all nodes in the learned tree. This would be in addition to just the centers of the leaves, which is currently returned by `getCenters` (or `clusterCenters` in Python). Maybe call it `getTreeCenters`. It's basically given by `model.clusterTree.toSeq().map(_.center)`. But we should make sure it's sorted so that it can be indexed using the values from the merge list. In other words, if `Z` is the merge list, and row i indicates that `Z[i,0]` and `Z[i,1]` were merged, we want to be able to get the centers associated with those nodes by calling, for example, `model.treeCenters[Z[i,0]]` and `model.treeCenters[Z[i,1]]`. What do you think?
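    
    A rough sketch of the indexing this would enable (hedged: `treeCenters` and `Z` are the proposed names here, not attributes the model exposes today):
    
        import numpy as np
        
        # hypothetical model outputs: one center per tree node, sorted by node
        # id, and a merge list Z where row i records that nodes Z[i, 0] and
        # Z[i, 1] were merged
        treeCenters = [np.array([0.0, 0.0]),
                       np.array([1.0, 1.0]),
                       np.array([0.5, 0.5])]  # node 2 = merge of nodes 0 and 1
        Z = np.array([[0, 1]])
        
        for i in range(Z.shape[0]):
            left, right = treeCenters[Z[i, 0]], treeCenters[Z[i, 1]]
            print(i, left, right)  # the centers of the two nodes merged at step i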




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60542976
  
      [Test build #22267 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22267/consoleFull) for   PR 2906 at commit [`1a08510`](https://github.com/apache/spark/commit/1a0851079bf145939e665aa78f0e77b3995e6e66).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22632265
  
    --- Diff: python/pyspark/mllib/clustering.py ---
    @@ -88,6 +92,162 @@ def train(cls, rdd, k, maxIterations=100, runs=1, initializationMode="k-means||"
             return KMeansModel([c.toArray() for c in centers])
     
     
    +class HierarchicalClusteringModel(object):
    +
    +    """A clustering model derived from the hierarchical clustering method.
    +
    +    >>> from numpy import array
    +    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4,2)
    +    >>> train_rdd = sc.parallelize(data)
    +    >>> model = HierarchicalClustering.train(train_rdd, 2)
    +    >>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
    +    True
    +    >>> model.predict(array([8.0, 9.0])) == model.predict(array([9.0, 8.0]))
    +    True
    +    >>> x = model.predict(data[0])
    +    >>> type(x)
    +    <type 'int'>
    +    >>> predicted_rdd = model.predict(train_rdd)
    +    >>> type(predicted_rdd)
    +    <class 'pyspark.rdd.RDD'>
    +    >>> predicted_rdd.collect() == [0, 0, 1, 1]
    +    True
    +    >>> sparse_data = [
    +    ...     SparseVector(3, {1: 1.0}),
    +    ...     SparseVector(3, {1: 1.1}),
    +    ...     SparseVector(3, {2: 1.0}),
    +    ...     SparseVector(3, {2: 1.1})
    +    ... ]
    +    >>> train_rdd = sc.parallelize(sparse_data)
    +    >>> model = HierarchicalClustering.train(train_rdd, 2, numRetries=100)
    +    >>> model.predict(array([0., 1., 0.])) == model.predict(array([0, 1.1, 0.]))
    +    True
    +    >>> model.predict(array([0., 0., 1.])) == model.predict(array([0, 0, 1.1]))
    +    True
    +    >>> model.predict(sparse_data[0]) == model.predict(sparse_data[1])
    +    True
    +    >>> model.predict(sparse_data[2]) == model.predict(sparse_data[3])
    +    True
    +    >>> x = model.predict(array([0., 1., 0.]))
    +    >>> type(x)
    +    <type 'int'>
    +    >>> predicted_rdd = model.predict(train_rdd)
    +    >>> type(predicted_rdd)
    +    <class 'pyspark.rdd.RDD'>
    +    >>> (predicted_rdd.collect() == [0, 0, 1, 1]
    +    ...     or predicted_rdd.collect() == [1, 1, 0, 0] )
    +    True
    +    >>> type(model.clusterCenters)
    +    <type 'list'>
    +    """
    +
    +    def __init__(self, sc, java_model, centers):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        :param centers: the cluster centers
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +        self.centers = centers
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    @property
    +    def clusterCenters(self):
    +        """Get the cluster centers, represented as a list of NumPy arrays."""
    +        return self.centers
    +
    +    def predict(self, x):
    +        """Predict the closest cluster index
    +
    +        :param x: an ndarray, a SparseVector, or an RDD of vectors
    +        :return: the closest cluster index, or an RDD of the closest indices
    +        """
    +        if isinstance(x, ndarray) or isinstance(x, Vector):
    +            return self.__predict_by_array(x)
    +        elif isinstance(x, RDD):
    +            return self.__predict_by_rdd(x)
    +        else:
    +            print 'Invalid input data type x: ' + str(type(x))
    +
    +    def __predict_by_array(self, x):
    +        """Predict the closest cluster index with an ndarray or an SparseVector
    +
    +        :param x: a vector
    +        :return: the closest cluster index
    +        """
    +        ser = PickleSerializer()
    +        bytes = bytearray(ser.dumps(_convert_to_vector(x)))
    +        vec = self._sc._jvm.SerDe.loads(bytes)
    +        result = self._java_model.predict(vec)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def __predict_by_rdd(self, x):
    +        """Predict the closest cluster index with a RDD
    +        :param x: a RDD of vector
    +        :return: a RDD of int
    +        """
    +        ser = PickleSerializer()
    +        cached = x.map(_convert_to_vector)._reserialize(AutoBatchedSerializer(ser)).cache()
    +        rdd = _to_java_object_rdd(cached)
    +        jrdd = self._java_model.predict(rdd)
    +        jpyrdd = self._sc._jvm.SerDe.javaToPython(jrdd)
    +        return RDD(jpyrdd, self._sc, AutoBatchedSerializer(PickleSerializer()))
    +
    +    def cut(self, height):
    --- End diff --
    
    This currently breaks if an integer is passed as `height` (which is likely to be common). For example, after creating the model from the example, I got an error when calling `model.cut(4)` but not `model.cut(4.0)`. Probably just recast the input here as a float.
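    
    A minimal sketch of the recast (hedged: assuming `cut` forwards to the Java model the same way the other model methods do):
    
        def cut(self, height):
            """Cut the dendrogram at the given height."""
            # Py4J dispatches on the exact Java signature, so a Python int
            # would not match the double-typed parameter; recast defensively.
            height = float(height)
            return self._java_model.cut(height)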




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19396796
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,549 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * the configuration for a hierarchical clustering algorithm
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations at digging
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed uses in sampling data for initializing centers in each sub iterations
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClusteringConf(
    +  private var numClusters: Int,
    +  private var subIterations: Int,
    +  private var numRetries: Int,
    +  private var epsilon: Double,
    +  private var randomSeed: Int,
    +  private[mllib] var randomRange: Double) extends Serializable {
    +
    +  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    --- End diff --
    
    I changed the `HierarchicalClusteringConf` class into a trait mixed into `HierarchicalClustering`, and moved the class parameters, such as `numClusters`, into `HierarchicalClustering` itself. If the accessor methods were defined directly in `HierarchicalClustering`, the class would get larger, so I delegated those methods to the trait.
    
    https://github.com/yu-iskw/spark/commit/2879b00c39880a4ffc29cefaaffde26df655e63f




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22632654
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,627 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * This trait is used for the configuration of the hierarchical clustering
    + */
    +sealed
    +trait HierarchicalClusteringConf extends Serializable {
    +  this: HierarchicalClustering =>
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def setSubIterations(subIterations: Int): this.type = {
    +    this.subIterations = subIterations
    +    this
    +  }
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
    + *
    + * The main idea of this algorithm is derived from:
    + * "A comparison of document clustering techniques",
    + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
    + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations in each bisecting step
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used when sampling data to initialize centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClustering(
    +  private[mllib] var numClusters: Int,
    +  private[mllib] var subIterations: Int,
    +  private[mllib] var numRetries: Int,
    +  private[mllib] var epsilon: Double,
    +  private[mllib] var randomSeed: Int,
    +  private[mllib] var randomRange: Double)
    +    extends Serializable with Logging with HierarchicalClusteringConf {
    --- End diff --
    
    Indent by 2 spaces.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22634865
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,627 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * This trait is used for the configuration of the hierarchical clustering
    + */
    +sealed
    +trait HierarchicalClusteringConf extends Serializable {
    +  this: HierarchicalClustering =>
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def setSubIterations(subIterations: Int): this.type = {
    +    this.subIterations = subIterations
    +    this
    +  }
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
    + *
    + * The main idea of this algorithm is derived from:
    + * "A comparison of document clustering techniques",
    + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
    + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations at digging
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed uses in sampling data for initializing centers in each sub iterations
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClustering(
    +  private[mllib] var numClusters: Int,
    +  private[mllib] var subIterations: Int,
    +  private[mllib] var numRetries: Int,
    +  private[mllib] var epsilon: Double,
    +  private[mllib] var randomSeed: Int,
    +  private[mllib] var randomRange: Double)
    +    extends Serializable with Logging with HierarchicalClusteringConf {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
    +
    +  /** Shows the parameters */
    +  override def toString(): String = {
    +    Array(
    +      s"numClusters:${numClusters}",
    +      s"subIterations:${subIterations}",
    +      s"numRetries:${numRetries}",
    +      s"epsilon:${epsilon}",
    +      s"randomSeed:${randomSeed}",
    +      s"randomRange:${randomRange}"
    +    ).mkString(", ")
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${this}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // Stop the training when either of the following conditions is satisfied:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters reaches the requested number of clusters
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.numClusters) {
    +
    +      // retry the split several times so as not to end up with a poor clustering result
    +      var isMerged = false
    +      for (i <- 1 to this.numRetries) {
    +        if (node.get.getVariance().get > this.epsilon && isMerged == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          if (subNodes.size == 2) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            // unpersist unnecessary cache because its children nodes are cached
    +            node.get.data.unpersist()
    +            logInfo(s"the number of cluster is ${model.clusterTree.getTreeSize()} at step ${step}")
    +            isMerged = true
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    val trainTime = (System.currentTimeMillis() - startTime).toInt
    +    logInfo(s"Elapsed Time for Training: ${trainTime.toDouble / 1000} [sec]")
    +    model
    +  }
    +
    +  /**
    +   * validate the given data to train
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    require(this.numClusters <= data.count(), "# clusters must be less than # data rows")
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the max variance of clusters which are leaves of a tree
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bi-sect k-means
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bi-sect k-means
    +   *
    +   * @param clusterTree the cluster to split
    +   * @return an array of ClusterTree. its size is generally 2, but its size can be 1
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    val sc = data.sparkContext
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    sc.broadcast(metric)
    +
    +    // The iteration stops when any of the following conditions is satisfied:
    +    //   1. the relative error falls below the configured epsilon
    +    //   2. the number of executed iterations reaches the configured limit
    +    //   3. only one center remains, which means the cluster is not splittable
    +    var numIter = 0
    +    var error = Double.MaxValue
    +    while (error > this.epsilon
    +        && numIter < this.subIterations
    +        && centers.size > 1) {
    +      val startTimeOfIter = System.currentTimeMillis()
    +
    +      sc.broadcast(centers)
    +      val newCenters = data.mapPartitions { iter =>
    +        // accumulate the sum of all points in each partition and count the rows
    +        val map = scala.collection.mutable.Map.empty[Int, (BV[Double], Int)]
    +        iter.foreach { point =>
    +          val idx = ClusterTree.findClosestCenter(metric)(centers)(point)
    +          val (sumBV, n) = map.get(idx)
    +              .getOrElse((new BSV[Double](Array(), Array(), point.size), 0))
    +          map(idx) = (sumBV + point, n + 1)
    +        }
    +        map.toIterator
    +      }.reduceByKeyLocally {
    +        // sum the accumulations and the counts across all partitions
    +        case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2)
    +      }.map { case ((idx: Int, (center: BV[Double], counts: Int))) =>
    +        center :/ counts.toDouble
    +      }
    +
    +      val normSum = centers.map(v => breezeNorm(v, 2.0)).sum
    +      val newNormSum = newCenters.map(v => breezeNorm(v, 2.0)).sum
    +      error = math.abs((normSum - newNormSum) / normSum)
    +      centers = newCenters.toArray
    +      numIter += 1
    +
    +      logInfo(s"${numIter} iterations is finished" +
    +          s" for ${System.currentTimeMillis() - startTimeOfIter}" +
    +          s" at ${getClass}.split")
    +    }
    +
    +    val vectors = centers.map(center => Vectors.fromBreeze(center))
    +    val nodes = centers.size match {
    +      case 1 => Array(new ClusterTree(vectors(0), data))
    +      case 2 => {
    +        val closest = data.map(p => (ClusterTree.findClosestCenter(metric)(centers)(p), p))
    +        centers.zipWithIndex.map { case (center, i) =>
    +          val subData = closest.filter(_._1 == i).map(_._2)
    +          subData.cache
    +          new ClusterTree(vectors(i), subData)
    +        }
    +      }
    +      case _ => throw new RuntimeException(s"something wrong with # centers:${centers.size}")
    +    }
    +    logInfo(s"${this.getClass.getSimpleName}.split end" +
    +        s" with total iterations" +
    +        s" for ${System.currentTimeMillis() - startTime}")
    +    nodes
    +  }
    +}
    +
    +/**
    + * top-level methods for calling the hierarchical clustering algorithm
    + */
    +object HierarchicalClustering {
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data
    +   *
    +   * @param data training points
    +   * @param numClusters the maximum number of clusters you want
    +   * @return a hierarchical clustering model
    +   */
    +  def train(data: RDD[Vector], numClusters: Int): HierarchicalClusteringModel = {
    +    val app = new HierarchicalClustering().setNumClusters(numClusters)
    +    app.run(data)
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data
    +   *
    +   * @param data training points
    +   * @param numClusters the maximum number of clusters you want
    +   * @param subIterations the number of iterations in each bisecting step
    +   * @param numRetries the number of retries when a split does not succeed
    +   * @param epsilon the relative error threshold at which bisecting is considered converged
    +   * @param randomSeed the random seed used to generate the initial vectors for each bisecting
    +   * @param randomRange the range of perturbation used to generate the initial vectors for each bisecting
    +   * @return a hierarchical clustering model
    +   */
    +  def train(
    +    data: RDD[Vector],
    +    numClusters: Int,
    +    subIterations: Int,
    +    numRetries: Int,
    +    epsilon: Double,
    +    randomSeed: Int,
    +    randomRange: Double): HierarchicalClusteringModel = {
    +    val algo = new HierarchicalClustering()
    +        .setNumClusters(numClusters)
    +        .setSubIterations(subIterations)
    +        .setNumRetries(numRetries)
    +        .setEpsilon(epsilon)
    +        .setRandomSeed(randomSeed)
    +        .setRandomRange(randomRange)
    +    algo.run(data)
    +  }
    +}
    +
    +
    +/**
    + * A cluster as a tree node which can have its sub nodes
    + *
    + * @param center the center of the cluster
    + * @param data the data in the cluster
    + * @param height distance between sub nodes
    + * @param variance the statistics for splitting of the cluster
    --- End diff --
    
    "of the cluster" => "the cluster"




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22634887
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,627 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * This trait is used for the configuration of the hierarchical clustering
    + */
    +sealed
    +trait HierarchicalClusteringConf extends Serializable {
    +  this: HierarchicalClustering =>
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def setSubIterations(subIterations: Int): this.type = {
    +    this.subIterations = subIterations
    +    this
    +  }
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
    + *
    + * The main idea of this algorithm is derived from:
    + * "A comparison of document clustering techniques",
    + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
    + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations in each bisecting step
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used when sampling data to initialize centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClustering(
    +  private[mllib] var numClusters: Int,
    +  private[mllib] var subIterations: Int,
    +  private[mllib] var numRetries: Int,
    +  private[mllib] var epsilon: Double,
    +  private[mllib] var randomSeed: Int,
    +  private[mllib] var randomRange: Double)
    +    extends Serializable with Logging with HierarchicalClusteringConf {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
    +
    +  /** Shows the parameters */
    +  override def toString(): String = {
    +    Array(
    +      s"numClusters:${numClusters}",
    +      s"subIterations:${subIterations}",
    +      s"numRetries:${numRetries}",
    +      s"epsilon:${epsilon}",
    +      s"randomSeed:${randomSeed}",
    +      s"randomRange:${randomRange}"
    +    ).mkString(", ")
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${this}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // Stop the training when either of the following conditions is satisfied:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters reaches the requested number of clusters
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.numClusters) {
    +
    +      // retry the split several times so as not to end up with a poor clustering result
    +      var isMerged = false
    +      for (i <- 1 to this.numRetries) {
    +        if (node.get.getVariance().get > this.epsilon && isMerged == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          if (subNodes.size == 2) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            // unpersist unnecessary cache because its children nodes are cached
    +            node.get.data.unpersist()
    +            logInfo(s"the number of cluster is ${model.clusterTree.getTreeSize()} at step ${step}")
    +            isMerged = true
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    val trainTime = (System.currentTimeMillis() - startTime).toInt
    +    logInfo(s"Elapsed Time for Training: ${trainTime.toDouble / 1000} [sec]")
    +    model
    +  }
    +
    +  /**
    +   * validate the given data to train
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    require(this.numClusters <= data.count(), "# clusters must be less than # data rows")
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the cluster with the maximum variance among the leaves of the tree
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bi-sect k-means
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bi-sect k-means
    +   *
    +   * @param clusterTree the cluster to be split
    +   * @return an array of ClusterTree; its size is generally 2, but it can be 1
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    val sc = data.sparkContext
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO: Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    sc.broadcast(metric)
    +
    +    // The iteration stops when any of the following conditions is satisfied:
    +    //   1. the relative error is less than the configured threshold
    +    //   2. the number of executed iterations exceeds the configured maximum
    +    //   3. there is only one center, which means the cluster is not splittable
    +    var numIter = 0
    +    var error = Double.MaxValue
    +    while (error > this.epsilon
    +        && numIter < this.subIterations
    +        && centers.size > 1) {
    +      val startTimeOfIter = System.currentTimeMillis()
    +
    +      sc.broadcast(centers)
    +      val newCenters = data.mapPartitions { iter =>
    +        // accumulate the sum of all points in a partition and count the rows
    +        val map = scala.collection.mutable.Map.empty[Int, (BV[Double], Int)]
    +        iter.foreach { point =>
    +          val idx = ClusterTree.findClosestCenter(metric)(centers)(point)
    +          val (sumBV, n) = map.get(idx)
    +              .getOrElse((new BSV[Double](Array(), Array(), point.size), 0))
    +          map(idx) = (sumBV + point, n + 1)
    +        }
    +        map.toIterator
    +      }.reduceByKeyLocally {
    +        // sum the accumulated vectors and the counts across all partitions
    +        case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2)
    +      }.map { case ((idx: Int, (center: BV[Double], counts: Int))) =>
    +        center :/ counts.toDouble
    +      }
    +
    +      val normSum = centers.map(v => breezeNorm(v, 2.0)).sum
    +      val newNormSum = newCenters.map(v => breezeNorm(v, 2.0)).sum
    +      error = math.abs((normSum - newNormSum) / normSum)
    +      centers = newCenters.toArray
    +      numIter += 1
    +
    +      logInfo(s"${numIter} iterations is finished" +
    +          s" for ${System.currentTimeMillis() - startTimeOfIter}" +
    +          s" at ${getClass}.split")
    +    }
    +
    +    val vectors = centers.map(center => Vectors.fromBreeze(center))
    +    val nodes = centers.size match {
    +      case 1 => Array(new ClusterTree(vectors(0), data))
    +      case 2 => {
    +        val closest = data.map(p => (ClusterTree.findClosestCenter(metric)(centers)(p), p))
    +        centers.zipWithIndex.map { case (center, i) =>
    +          val subData = closest.filter(_._1 == i).map(_._2)
    +          subData.cache
    +          new ClusterTree(vectors(i), subData)
    +        }
    +      }
    +      case _ => throw new RuntimeException(s"something wrong with # centers:${centers.size}")
    +    }
    +    logInfo(s"${this.getClass.getSimpleName}.split end" +
    +        s" with total iterations" +
    +        s" for ${System.currentTimeMillis() - startTime}")
    +    nodes
    +  }
    +}
    +
    +/**
    + * top-level methods for calling the hierarchical clustering algorithm
    + */
    +object HierarchicalClustering {
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data
    +   *
    +   * @param data training data
    +   * @param numClusters the maximum number of clusters you want
    +   * @return a hierarchical clustering model
    +   */
    +  def train(data: RDD[Vector], numClusters: Int): HierarchicalClusteringModel = {
    +    val app = new HierarchicalClustering().setNumClusters(numClusters)
    +    app.run(data)
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data
    +   *
    +   * @param data training data
    +   * @param numClusters the maximum number of clusters you want
    +   * @param subIterations the maximum number of iterations in each bisecting step
    +   * @param numRetries the number of retries when a split does not succeed
    +   * @param epsilon the relative error threshold below which bisecting stops
    +   * @param randomSeed the random seed used to generate the initial vectors for each bisecting
    +   * @param randomRange the range of noise used to generate the initial vectors for each bisecting
    +   * @return a hierarchical clustering model
    +   */
    +  def train(
    +    data: RDD[Vector],
    +    numClusters: Int,
    +    subIterations: Int,
    +    numRetries: Int,
    +    epsilon: Double,
    +    randomSeed: Int,
    +    randomRange: Double): HierarchicalClusteringModel = {
    +    val algo = new HierarchicalClustering()
    +        .setNumClusters(numClusters)
    +        .setSubIterations(subIterations)
    +        .setNumRetries(numRetries)
    +        .setEpsilon(epsilon)
    +        .setRandomSeed(randomSeed)
    +        .setRandomRange(randomRange)
    +    algo.run(data)
    +  }
    +}
    +
    +
    +/**
    + * A cluster as a tree node which can have its sub nodes
    + *
    + * @param center the center of the cluster
    + * @param data the data in the cluster
    + * @param height distance between sub nodes
    + * @param variance the statistics for splitting of the cluster
    + * @param dataSize the data size of its data
    + * @param children the sub node(s) of the cluster
    + * @param parent the parent node of the cluster
    + * @param isVisited a flag indicating whether the node has already been visited
    + */
    +private[mllib]
    +class ClusterTree private (
    +  val center: Vector,
    +  private[mllib] val data: RDD[BV[Double]],
    +  private[mllib] var height: Option[Double],
    +  private[mllib] var variance: Option[Double],
    +  private[mllib] var dataSize: Option[Long],
    +  private[mllib] var children: List[ClusterTree],
    +  private[mllib] var parent: Option[ClusterTree],
    +  private[mllib] var isVisited: Boolean) extends Serializable with Cloneable with Logging {
    +
    +  def this(center: Vector, data: RDD[BV[Double]]) =
    +    this(center, data, None, None, None, List.empty[ClusterTree], None, false)
    +
    +  override def clone(): ClusterTree = {
    +    val cloned = new ClusterTree(
    +      this.center,
    +      this.data,
    +      this.height,
    +      this.variance,
    +      this.dataSize,
    +      List.empty[ClusterTree],
    +      None,
    +      this.isVisited
    +    )
    +    val clonedChildren = this.children.map(child => child.clone()).toList
    +    cloned.insert(clonedChildren)
    +    cloned
    +  }
    +
    +  override def toString(): String = {
    +    val elements = Array(
    +      s"hashCode:${this.hashCode()}",
    +      s"depth:${this.getDepth()}",
    +      s"dataSize:${this.dataSize.get}",
    +      s"variance:${this.variance.get}",
    +      s"parent:${this.parent.hashCode()}",
    +      s"children:${this.children.map(_.hashCode())}",
    +      s"isLeaf:${this.isLeaf()}",
    +      s"isVisited:${this.isVisited}"
    +    )
    +    elements.mkString(", ")
    +  }
    +
    +  /**
    +   * Cuts a cluster tree
    +   *
    +   * @param height the threshold of height to cut a cluster tree
    +   * @return a cut hierarchical clustering model
    +   */
    +  private[mllib] def cut(height: Double): ClusterTree = {
    +    this.children.foreach { child =>
    +      if (child.getHeight() < height && child.children.size > 0) {
    +        child.children.foreach(grandchild => child.delete(grandchild))
    +      }
    +    }
    +    this.children.foreach(child => child.cut(height))
    +    this
    +  }
    +
    +  /**
    +   * Inserts sub nodes as its children
    +   *
    +   * @param children inserted sub nodes
    +   */
    +  def insert(children: List[ClusterTree]): Unit = {
    +    this.children = this.children ++ children
    +    children.foreach(child => child.parent = Some(this))
    +  }
    +
    +  /**
    +   * Inserts a sub node as its child
    +   *
    +   * @param child inserted sub node
    +   */
    +  def insert(child: ClusterTree): Unit = insert(List(child))
    +
    +  /** Deletes all children */
    +  def delete() = this.children = List.empty[ClusterTree]
    +
    +  /** Deletes a child */
    +  def delete(target: ClusterTree) {
    +    this.children.contains(target) match {
    +      case true => this.children = this.children.filter(child => child != target)
    +      case false => logWarning("You attempted to delete a node which is not contained")
    +    }
    +  }
    +
    +  /**
    +   * Converts the tree into a Seq,
    +   * recursively expanding the sub nodes
    +   *
    +   * @return a Seq into which the cluster tree is expanded
    +   */
    +  def toSeq(): Seq[ClusterTree] = {
    +    val seq = this.children.size match {
    +      case 0 => Seq(this)
    +      case _ => Seq(this) ++ this.children.map(child => child.toSeq()).flatten
    +    }
    +    seq.sortWith { case (a, b) =>
    +      a.getDepth() < b.getDepth() &&
    +          breezeNorm(a.center.toBreeze, 2) < breezeNorm(b.center.toBreeze, 2)
    +    }
    +  }
    +
    +  /**
    +   * Gets all the clusters which are leaves in the cluster tree
    +   * @return the Seq of the leaf clusters
    --- End diff --
    
    Insert line break
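
    For reference, the requested fix is just a blank scaladoc line between the
    description and the `@return` tag, e.g. (the method name `getClusters` is
    assumed here, since the quoted diff is cut off at this point):

        /**
         * Gets all the clusters which are leaves in the cluster tree
         *
         * @return the Seq of the leaf clusters
         */
        def getClusters(): Seq[ClusterTree] = toSeq().filter(_.isLeaf())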




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-62332987
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23125/
    Test PASSed.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22633758
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,627 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * This trait is used for the configuration of the hierarchical clustering
    + */
    +sealed
    +trait HierarchicalClusteringConf extends Serializable {
    +  this: HierarchicalClustering =>
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def setSubIterations(subIterations: Int): this.type = {
    +    this.subIterations = subIterations
    +    this
    +  }
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
    + *
    + * The main idea of this algorithm is derived from:
    + * "A comparison of document clustering techniques",
    + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
    + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations in each bisecting step
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClustering(
    +  private[mllib] var numClusters: Int,
    +  private[mllib] var subIterations: Int,
    +  private[mllib] var numRetries: Int,
    +  private[mllib] var epsilon: Double,
    +  private[mllib] var randomSeed: Int,
    +  private[mllib] var randomRange: Double)
    +    extends Serializable with Logging with HierarchicalClusteringConf {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
    +
    +  /** Shows the parameters */
    +  override def toString(): String = {
    +    Array(
    +      s"numClusters:${numClusters}",
    +      s"subIterations:${subIterations}",
    +      s"numRetries:${numRetries}",
    +      s"epsilon:${epsilon}",
    +      s"randomSeed:${randomSeed}",
    +      s"randomRange:${randomRange}"
    +    ).mkString(", ")
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${this}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // Training stops when either of the following conditions is satisfied:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters reaches the given number of clusters
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.numClusters) {
    +
    +      // retry the split several times to avoid a poor clustering result
    +      var isMerged = false
    +      for (i <- 1 to this.numRetries) {
    +        if (node.get.getVariance().get > this.epsilon && isMerged == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          if (subNodes.size == 2) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            // unpersist unnecessary cache because its children nodes are cached
    +            node.get.data.unpersist()
    +            logInfo(s"the number of cluster is ${model.clusterTree.getTreeSize()} at step ${step}")
    +            isMerged = true
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    val trainTime = (System.currentTimeMillis() - startTime).toInt
    +    logInfo(s"Elapsed Time for Training: ${trainTime.toDouble / 1000} [sec]")
    +    model
    +  }
    +
    +  /**
    +   * validate the given data to train
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    require(this.numClusters <= data.count(), "# clusters must be less than or equal to # data rows")
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the cluster with the maximum variance among the leaves of the tree
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bi-sect k-means
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bi-sect k-means
    +   *
    +   * @param clusterTree the cluster to be split
    +   * @return an array of ClusterTree; its size is generally 2, but it can be 1
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    val sc = data.sparkContext
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO: Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    sc.broadcast(metric)
    +
    +    // The iteration stops when any of the following conditions is satisfied:
    +    //   1. the relative error is less than the configured threshold
    +    //   2. the number of executed iterations exceeds the configured maximum
    +    //   3. there is only one center, which means the cluster is not splittable
    +    var numIter = 0
    +    var error = Double.MaxValue
    +    while (error > this.epsilon
    +        && numIter < this.subIterations
    +        && centers.size > 1) {
    +      val startTimeOfIter = System.currentTimeMillis()
    +
    +      sc.broadcast(centers)
    +      val newCenters = data.mapPartitions { iter =>
    +        // accumulate the sum of all points in a partition and count the rows
    +        val map = scala.collection.mutable.Map.empty[Int, (BV[Double], Int)]
    +        iter.foreach { point =>
    +          val idx = ClusterTree.findClosestCenter(metric)(centers)(point)
    +          val (sumBV, n) = map.get(idx)
    +              .getOrElse((new BSV[Double](Array(), Array(), point.size), 0))
    +          map(idx) = (sumBV + point, n + 1)
    +        }
    +        map.toIterator
    +      }.reduceByKeyLocally {
    +        // sum the accumulated vectors and the counts across all partitions
    +        case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2)
    +      }.map { case ((idx: Int, (center: BV[Double], counts: Int))) =>
    +        center :/ counts.toDouble
    +      }
    +
    +      val normSum = centers.map(v => breezeNorm(v, 2.0)).sum
    +      val newNormSum = newCenters.map(v => breezeNorm(v, 2.0)).sum
    +      error = math.abs((normSum - newNormSum) / normSum)
    +      centers = newCenters.toArray
    +      numIter += 1
    +
    +      logInfo(s"${numIter} iterations is finished" +
    +          s" for ${System.currentTimeMillis() - startTimeOfIter}" +
    +          s" at ${getClass}.split")
    +    }
    +
    +    val vectors = centers.map(center => Vectors.fromBreeze(center))
    +    val nodes = centers.size match {
    +      case 1 => Array(new ClusterTree(vectors(0), data))
    +      case 2 => {
    +        val closest = data.map(p => (ClusterTree.findClosestCenter(metric)(centers)(p), p))
    +        centers.zipWithIndex.map { case (center, i) =>
    +          val subData = closest.filter(_._1 == i).map(_._2)
    +          subData.cache
    --- End diff --
    
    Is it correct that caching every split (but then unpersisting) like this can potentially end up caching twice the size of the original data set? If so, that should be made very clear in the documentation, as it may affect the user's choice of what size data sets to run this on (and their hardware or instance provisioning).
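
    To make the concern concrete, here is a minimal sketch of the caching
    pattern that `split()`/`run()` use (a toy predicate stands in for the
    closest-center assignment; this is not the PR's code):

        import org.apache.spark.rdd.RDD

        // The parent RDD stays cached while both children are cached and
        // materialized, so peak cache usage is roughly twice the parent's size.
        def splitAndSwap(parent: RDD[Double]): (RDD[Double], RDD[Double]) = {
          val left = parent.filter(_ < 0.0).cache()    // child 1 cached
          val right = parent.filter(_ >= 0.0).cache()  // child 2 cached
          left.count()                                 // materializes child 1
          right.count()                                // materializes child 2
          parent.unpersist()                           // parent dropped only afterwards
          (left, right)
        }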




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22641364
  
    --- Diff: mllib/src/test/java/org/apache/spark/mllib/clustering/JavaHierarchicalClusteringSuite.java ---
    @@ -0,0 +1,77 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering;
    +
    +import com.google.common.collect.Lists;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.Vector;
    +import org.apache.spark.mllib.linalg.Vectors;
    +import org.junit.After;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.io.Serializable;
    +import java.util.List;
    +
    +import static org.junit.Assert.assertEquals;
    +
    +public class JavaHierarchicalClusteringSuite implements Serializable {
    +  private transient JavaSparkContext sc;
    --- End diff --
    
    This is not a comment on this PR per se, but this whole `implements Serializable` and `transient JavaSparkContext` thing is an anti-pattern I wish weren't used even in the tests.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60463501
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22177/
    Test FAILed.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22632512
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,627 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * This trait is used for the configuration of the hierarchical clustering
    + */
    +sealed
    +trait HierarchicalClusteringConf extends Serializable {
    +  this: HierarchicalClustering =>
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def setSubIterations(subIterations: Int): this.type = {
    +    this.subIterations = subIterations
    +    this
    +  }
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
    + *
    + * The main idea of this algorithm is derived from:
    + * "A comparison of document clustering techniques",
    + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
    + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations in each bisecting step
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClustering(
    +  private[mllib] var numClusters: Int,
    +  private[mllib] var subIterations: Int,
    +  private[mllib] var numRetries: Int,
    +  private[mllib] var epsilon: Double,
    +  private[mllib] var randomSeed: Int,
    +  private[mllib] var randomRange: Double)
    +    extends Serializable with Logging with HierarchicalClusteringConf {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
    +
    +  /** Shows the parameters */
    +  override def toString(): String = {
    +    Array(
    +      s"numClusters:${numClusters}",
    +      s"subIterations:${subIterations}",
    +      s"numRetries:${numRetries}",
    +      s"epsilon:${epsilon}",
    +      s"randomSeed:${randomSeed}",
    +      s"randomRange:${randomRange}"
    +    ).mkString(", ")
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${this}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // Training stops when either of the following conditions is satisfied:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters reaches the given number of clusters
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.numClusters) {
    +
    +      // retry the split several times to avoid a poor clustering result
    +      var isMerged = false
    +      for (i <- 1 to this.numRetries) {
    +        if (node.get.getVariance().get > this.epsilon && isMerged == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          if (subNodes.size == 2) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            // unpersist unnecessary cache because its children nodes are cached
    +            node.get.data.unpersist()
    +            logInfo(s"the number of cluster is ${model.clusterTree.getTreeSize()} at step ${step}")
    +            isMerged = true
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    val trainTime = (System.currentTimeMillis() - startTime).toInt
    +    logInfo(s"Elapsed Time for Training: ${trainTime.toDouble / 1000} [sec]")
    +    model
    +  }
    +
    +  /**
    +   * validate the given data to train
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    require(this.numClusters <= data.count(), "# clusters must be less than or equal to # data rows")
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the cluster with the maximum variance among the leaves of the tree
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bi-sect k-means
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bi-sect k-means
    +   *
    +   * @param clusterTree the cluster to be split
    +   * @return an array of ClusterTree; its size is generally 2, but it can be 1
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    val sc = data.sparkContext
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO: Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    sc.broadcast(metric)
    +
    +    // The iteration stops when any of the following conditions is satisfied:
    +    //   1. the relative error is less than the configured threshold
    +    //   2. the number of executed iterations exceeds the configured maximum
    +    //   3. there is only one center, which means the cluster is not splittable
    +    var numIter = 0
    +    var error = Double.MaxValue
    +    while (error > this.epsilon
    +        && numIter < this.subIterations
    +        && centers.size > 1) {
    +      val startTimeOfIter = System.currentTimeMillis()
    +
    +      sc.broadcast(centers)
    --- End diff --
    
    This use of `sc.broadcast` has no effect because the output isn't assigned or used. Instead, you want something like `val bcCenters = sc.broadcast(centers)` and then access within the `map` as `bcCenters.value`. See [KMeans](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L181-185) for an example.
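
    A minimal sketch of the corrected pattern (the names here are illustrative,
    not the PR's):

        import breeze.linalg.{Vector => BV}
        import org.apache.spark.SparkContext
        import org.apache.spark.rdd.RDD

        def assignClosest(
            sc: SparkContext,
            data: RDD[BV[Double]],
            centers: Array[BV[Double]],
            findClosest: (Array[BV[Double]], BV[Double]) => Int): RDD[Int] = {
          // keep the handle returned by sc.broadcast ...
          val bcCenters = sc.broadcast(centers)
          // ... and dereference it with .value inside the closure, so the
          // executors read the broadcast copy instead of a serialized field
          data.map(point => findClosest(bcCenters.value, point))
        }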




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60566364
  
      [Test build #22290 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22290/consoleFull) for   PR 2906 at commit [`2676166`](https://github.com/apache/spark/commit/2676166ba6f307b4605ea1e7ecf6ece5b9e200b3).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaHierarchicalClustering `
      * `trait HierarchicalClusteringConf extends Serializable `
      * `class HierarchicalClustering(`
      * `class ClusteringModel(object):`
      * `class KMeansModel(ClusteringModel):`
      * `class HierarchicalClusteringModel(ClusteringModel):`
      * `class HierarchicalClustering(object):`





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60922006
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22450/
    Test FAILed.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22641250
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/mllib/JavaHierarchicalClustering.java ---
    @@ -0,0 +1,73 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib;
    +
    +import org.apache.spark.SparkConf;
    --- End diff --
    
    These look correctly ordered in the sense that package `a.b.c` sorts entirely before `a.b.c.d`
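
    I.e., an ordering like the following is consistent with that rule (a
    sketch, shown with Scala import syntax):

        import org.apache.spark.SparkConf            // a.b.c member ...
        import org.apache.spark.api.java.JavaRDD     // ... before a.b.c.d members
        import org.apache.spark.api.java.JavaSparkContext
        import org.apache.spark.mllib.linalg.Vectors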




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60464011
  
      [Test build #22179 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22179/consoleFull) for   PR 2906 at commit [`91a38e3`](https://github.com/apache/spark/commit/91a38e361ac89933cb6e774cd05624f20e7b0344).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaHierarchicalClustering `
      * `class HierarchicalClusteringConf(`
      * `class HierarchicalClustering(val conf: HierarchicalClusteringConf)`
      * `class ClusterTree(`
      * `class ClusteringModel(object):`
      * `class KMeansModel(ClusteringModel):`
      * `class HierarchicalClusteringModel(ClusteringModel):`
      * `class HierarchicalClustering(object):`





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19288686
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,549 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * the configuration for a hierarchical clustering algorithm
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations in each bisecting step
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClusteringConf(
    +  private var numClusters: Int,
    +  private var subIterations: Int,
    +  private var numRetries: Int,
    +  private var epsilon: Double,
    +  private var randomSeed: Int,
    +  private[mllib] var randomRange: Double) extends Serializable {
    +
    +  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setSubIterations(iterations: Int): this.type = {
    +    this.subIterations = iterations
    +    this
    +  }
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
    + *
    + * @param conf the configuration class for the hierarchical clustering
    + */
    +class HierarchicalClustering(val conf: HierarchicalClusteringConf)
    +    extends Serializable with Logging {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(new HierarchicalClusteringConf())
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${conf.toString}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // Training stops when any of the following conditions is satisfied:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters reaches the given number of clusters
    +    //   3. The total variance of all clusters increases when a cluster is split
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.conf.getNumClusters
    +        && totalVariance >= newTotalVariance) {
    +
    +      // retry the split several times to avoid a poor clustering result
    +      var isMerged = false
    +      var isSingleCluster = false
    +      for (retry <- 1 to this.conf.getNumRetries()) {
    +        if (isMerged == false && isSingleCluster == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          // it seems that there is no splittable node
    +          if (subNodes.size == 1) isSingleCluster = false
    --- End diff --
    
    Consider splitting this onto two lines with braces
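
    I.e., something like the following (keeping the PR's assignment verbatim):

        if (subNodes.size == 1) {
          isSingleCluster = false  // as in the PR; the comment above it suggests `true` was intended
        }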




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60543195
  
      [Test build #22268 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22268/consoleFull) for   PR 2906 at commit [`b014f50`](https://github.com/apache/spark/commit/b014f500112df597edfbe1a5cef8c02e06b1bbb0).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by rnowling <gi...@git.apache.org>.
Github user rnowling commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-68870971
  
    Thanks @mengxr @freeman-lab! :)




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22632686
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala ---
    @@ -0,0 +1,126 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * This class represents the model of the hierarchical clustering
    + *
    + * @param clusterTree a cluster as a tree node
    + * @param isTrained a flag which is true if the model has been trained
    + */
    +class HierarchicalClusteringModel private (
    +  val clusterTree: ClusterTree,
    +  private[mllib] var isTrained: Boolean) extends Serializable with Logging with Cloneable {
    +
    +  def this(clusterTree: ClusterTree) = this(clusterTree, false)
    +
    +  override def clone(): HierarchicalClusteringModel = {
    +    new HierarchicalClusteringModel(this.clusterTree.clone(), true)
    +  }
    +
    +  /**
    +   * Cuts a cluster tree by given threshold of dendrogram height
    +   *
    +   * @param height a threshold to cut a cluster tree
    +   * @return a hierarchical clustering model
    +   */
    +  def cut(height: Double): HierarchicalClusteringModel = {
    +    val cloned = this.clone()
    +    cloned.clusterTree.cut(height)
    +    cloned
    +  }
    +
    +  /**
    +   * Predicts the closest cluster of the given point
    +   */
    +  def predict(vector: Vector): Int = {
    +    // TODO: Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    this.clusterTree.assignClusterIndex(metric)(vector)
    +  }
    +
    +  /**
    +   * Predicts the closest cluster of each point
    +   */
    +  def predict(data: RDD[Vector]): RDD[(Int, Vector)] = {
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val sc = data.sparkContext
    +
    +    // TODO: Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    val treeRoot = this.clusterTree
    +    sc.broadcast(metric)
    --- End diff --
    
    The result isn't assigned or used; see my other note about `sc.broadcast`.
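
    For context, a minimal end-to-end sketch of the model API, using only the
    methods quoted in this thread's diffs (train, cut and predict), with
    made-up data values:

        import org.apache.spark.SparkContext
        import org.apache.spark.mllib.linalg.Vectors

        def example(sc: SparkContext): Unit = {
          val data = sc.parallelize(Seq(
            Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
            Vectors.dense(9.0, 9.0), Vectors.dense(10.0, 10.0)))
          // train with at most 2 leaf clusters
          val model = HierarchicalClustering.train(data, 2)
          // cut the dendrogram at a given height, then assign each point
          val cut = model.cut(5.0)
          val assigned = cut.predict(data) // RDD[(Int, Vector)]: (cluster index, point)
          assigned.collect().foreach(println)
        }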




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22632194
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringSuite.scala ---
    @@ -0,0 +1,330 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    --- End diff --
    
    Import formatting, see other comment.
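
    For reference, a sketch of the grouping usually requested in Spark reviews (the exact ordering is an assumption based on the Spark style guide: java/javax first, then scala, then third-party libraries such as breeze, then org.apache.spark, with a blank line between groups):

        import scala.collection.mutable

        import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}

        import org.apache.spark.mllib.linalg.Vector
        import org.apache.spark.rdd.RDD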




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60922265
  
    @srowen I finished modifying the source code that you pointed out. Can you review it?




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19288871
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,549 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * the configuration for a hierarchical clustering algorithm
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations for each bisecting step
    + * @param numRetries the number of retries when a splitting fails
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClusteringConf(
    +  private var numClusters: Int,
    +  private var subIterations: Int,
    +  private var numRetries: Int,
    +  private var epsilon: Double,
    +  private var randomSeed: Int,
    +  private[mllib] var randomRange: Double) extends Serializable {
    +
    +  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setSubIterations(iterations: Int): this.type = {
    +    this.subIterations = iterations
    +    this
    +  }
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
    + *
    + * @param conf the configuration class for the hierarchical clustering
    + */
    +class HierarchicalClustering(val conf: HierarchicalClusteringConf)
    +    extends Serializable with Logging {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(new HierarchicalClusteringConf())
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${conf.toString}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // If any of the following conditions is satisfied, the training stops:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters is greater than the given number of clusters
    +    //   3. The total variance of all clusters increases when a cluster is split
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.conf.getNumClusters
    +        && totalVariance >= newTotalVariance) {
    +
    +      // retry the split several times to avoid a poor clustering result
    +      var isMerged = false
    +      var isSingleCluster = false
    +      for (retry <- 1 to this.conf.getNumRetries()) {
    +        if (isMerged == false && isSingleCluster == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          // a single sub node means there is no splittable node
    +          if (subNodes.size == 1) isSingleCluster = true
    +          // add the sub nodes into the tree if the sum of the sub nodes'
    +          // variances is less than the variance of the pre-split node
    +          if (node.get.getVariance().get > subNodes.map(_.getVariance().get).sum) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            isMerged = true
    +            logInfo(s"the number of cluster is ${model.clusterTree.getTreeSize()} at step ${step}")
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      totalVariance = newTotalVariance
    +      newTotalVariance = model.clusterTree.toSeq().filter(_.isLeaf()).map(_.getVariance().get).sum
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    model.trainTime = (System.currentTimeMillis() - startTime).toInt
    +    model
    +  }
    +
    +  /**
    +   * Validates the given training data
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    conf match {
    +      case conf if conf.getNumClusters() > data.count() =>
    +        throw new IllegalArgumentException("# clusters must be less than # input data records")
    +      case _ =>
    +    }
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the max variance of clusters which are leafs of a tree
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bi-sect k-means
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.conf.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.conf.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bi-sect k-means
    +   *
    +   * @param clusterTree the cluster to split
    +   * @return an array of ClusterTree. Its size is generally 2, but it can be 1
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    var finder = ClusterTree.findClosestCenter(metric)(centers) _
    +
    +    // The iteration stops when any of the following holds:
    +    //   1. the relative error is less than the configured epsilon
    +    //   2. the number of executed iterations reaches the configured maximum
    +    //   3. only one center remains, which means the cluster is not splittable
    +    var numIter = 0
    +    var error = Double.MaxValue
    +    while (error > conf.getEpsilon()
    +        && numIter < conf.getSubIterations()
    +        && centers.size > 1) {
    +
    +      val startTimeOfIter = System.currentTimeMillis()
    +      // finds the closest center of each point
    +      data.sparkContext.broadcast(finder)
    +      val newCenters = data.mapPartitions { iter =>
    +        // calculate the accumulation of the all point in a partition and count the rows
    +        val map = scala.collection.mutable.Map.empty[Int, (BV[Double], Int)]
    +        iter.foreach { point =>
    +          val idx = finder(point)
    +          val (sumBV, n) = map.get(idx).getOrElse((BV.zeros[Double](point.size), 0))
    +          map(idx) = (sumBV + point, n + 1)
    +        }
    +        map.toIterator
    +      }.reduceByKeyLocally {
    +        // sum the accumulation and the count in the all partition
    +        case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2)
    +      }.map { case ((idx: Int, (center: BV[Double], counts: Int))) =>
    +        center :/ counts.toDouble
    +      }
    +
    +      val normSum = centers.map(v => breezeNorm(v, 2.0)).sum
    +      val newNormSum = newCenters.map(v => breezeNorm(v, 2.0)).sum
    +      error = Math.abs((normSum - newNormSum) / normSum)
    +      centers = newCenters.toArray
    +      numIter += 1
    +      finder = ClusterTree.findClosestCenter(metric)(centers) _
    +
    +      logInfo(s"${numIter} iterations is finished" +
    +          s" for ${System.currentTimeMillis() - startTimeOfIter}" +
    +          s" at ${getClass}.split")
    +    }
    +
    +    val vectors = centers.map(center => Vectors.fromBreeze(center))
    +    val nodes = centers.size match {
    +      case 1 => Array(new ClusterTree(vectors(0), data))
    +      case 2 => {
    +        val closest = data.map(point => (finder(point), point))
    +        centers.zipWithIndex.map { case (center, i) =>
    +          val subData = closest.filter(_._1 == i).map(_._2)
    +          subData.cache
    +          new ClusterTree(vectors(i), subData)
    +        }
    +      }
    +      case _ => throw new RuntimeException(s"something wrong with # centers:${centers.size}")
    +    }
    +    logInfo(s"${this.getClass.getSimpleName}.split end" +
    +        s" with total iterations" +
    +        s" for ${System.currentTimeMillis() - startTime}")
    +    nodes
    +  }
    +}
    +
    +/**
    + * top-level methods for calling the hierarchical clustering algorithm
    + */
    +object HierarchicalClustering {
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data and the number of clusters
    +   *
    +   * NOTE: If there is no splittable cluster, the clustering stops
    +   * even if the number of clusters is still less than the given number
    +   *
    +   * @param data training data
    +   * @param numClusters the maximum number of clusters you want
    +   * @return a hierarchical clustering model
    +   *
    +   *         TODO: The other parameters for the hierarchical clustering will be applied
    +   */
    +  def train(data: RDD[Vector], numClusters: Int): HierarchicalClusteringModel = {
    +    val conf = new HierarchicalClusteringConf()
    +        .setNumClusters(numClusters)
    +    val app = new HierarchicalClustering(conf)
    +    app.run(data)
    +  }
    +}
    +
    +
    +/**
    + * A cluster as a tree node which can have its sub nodes
    + *
    + * @param data the data in the cluster
    + * @param center the center of the cluster
    + * @param variance the statistics for splitting of the cluster
    + * @param dataSize the data size of its data
    + * @param children the sub node(s) of the cluster
    + * @param parent the parent node of the cluster
    + */
    +class ClusterTree(
    --- End diff --
    
    Can this class be private to the package? I haven't looked that carefully.
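
    For illustration, a minimal sketch of that restriction, with the constructor abbreviated from the PR (only the visibility modifier is the point here):

        // visible throughout org.apache.spark.mllib, but not to user code
        private[mllib] class ClusterTree(
            val center: Vector,
            private[mllib] val data: RDD[BV[Double]]) extends Serializable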




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19288604
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,549 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * the configuration for a hierarchical clustering algorithm
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations for each bisecting step
    + * @param numRetries the number of retries when a splitting fails
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClusteringConf(
    +  private var numClusters: Int,
    +  private var subIterations: Int,
    +  private var numRetries: Int,
    +  private var epsilon: Double,
    +  private var randomSeed: Int,
    +  private[mllib] var randomRange: Double) extends Serializable {
    +
    +  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    --- End diff --
    
    This may be my Scala ignorance, but if the constructor params aren't private, don't you get setters for free? I see you're going for a fluent style, and that makes sense, but I don't know if the other conf-like or algo-like classes do this. Pretty minor, and I could be wrong, but consider whether it's worth the extra code and the consistency issue.
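
    To illustrate the trade-off: a plain `var` constructor parameter already yields a generated getter and setter, while the explicit setters return `this.type` so calls can be chained. A sketch under that assumption (`ExampleConf` is an illustrative name, not a class from the PR):

        class ExampleConf(var numClusters: Int) {
          // `var numClusters` alone generates conf.numClusters and conf.numClusters_=

          // the fluent style returns this, enabling conf.setNumClusters(3).setNumClusters(4)
          def setNumClusters(k: Int): this.type = {
            this.numClusters = k
            this
          }
        }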




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60931967
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22451/
    Test FAILed.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22633425
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,627 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * This trait is used for the configuration of the hierarchical clustering
    + */
    +sealed
    +trait HierarchicalClusteringConf extends Serializable {
    +  this: HierarchicalClustering =>
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def setSubIterations(subIterations: Int): this.type = {
    +    this.subIterations = subIterations
    +    this
    +  }
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
    + *
    + * The main idea of this algorithm is derived from:
    + * "A comparison of document clustering techniques",
    + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
    + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations for each bisecting step
    + * @param numRetries the number of retries when a splitting fails
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClustering(
    +  private[mllib] var numClusters: Int,
    +  private[mllib] var subIterations: Int,
    +  private[mllib] var numRetries: Int,
    +  private[mllib] var epsilon: Double,
    +  private[mllib] var randomSeed: Int,
    +  private[mllib] var randomRange: Double)
    +    extends Serializable with Logging with HierarchicalClusteringConf {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
    +
    +  /** Shows the parameters */
    +  override def toString(): String = {
    +    Array(
    +      s"numClusters:${numClusters}",
    +      s"subIterations:${subIterations}",
    +      s"numRetries:${numRetries}",
    +      s"epsilon:${epsilon}",
    +      s"randomSeed:${randomSeed}",
    +      s"randomRange:${randomRange}"
    +    ).mkString(", ")
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${this}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // If any of the following conditions is satisfied, the training stops:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters is greater than the given number of clusters
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.numClusters) {
    +
    +      // retry the split several times to avoid a poor clustering result
    +      var isMerged = false
    +      for (i <- 1 to this.numRetries) {
    +        if (node.get.getVariance().get > this.epsilon && isMerged == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          if (subNodes.size == 2) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            // unpersist unnecessary cache because its children nodes are cached
    +            node.get.data.unpersist()
    +            logInfo(s"the number of cluster is ${model.clusterTree.getTreeSize()} at step ${step}")
    +            isMerged = true
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    val trainTime = (System.currentTimeMillis() - startTime).toInt
    +    logInfo(s"Elapsed Time for Training: ${trainTime.toDouble / 1000} [sec]")
    +    model
    +  }
    +
    +  /**
    +   * Validates the given training data
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    require(this.numClusters <= data.count(), "# clusters must be less than or equal to # data rows")
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the max variance of clusters which are leaves of a tree
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bi-sect k-means
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bi-sect k-means
    +   *
    +   * @param clusterTree the cluster to split
    +   * @return an array of ClusterTree. Its size is generally 2, but it can be 1
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    val sc = data.sparkContext
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO Supports distance metrics other Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    sc.broadcast(metric)
    +
    +    // The iteration stops when any of the following holds:
    +    //   1. the relative error is less than the configured epsilon
    +    //   2. the number of executed iterations reaches the configured maximum
    +    //   3. only one center remains, which means the cluster is not splittable
    +    var numIter = 0
    +    var error = Double.MaxValue
    +    while (error > this.epsilon
    +        && numIter < this.subIterations
    +        && centers.size > 1) {
    +      val startTimeOfIter = System.currentTimeMillis()
    +
    +      sc.broadcast(centers)
    +      val newCenters = data.mapPartitions { iter =>
    +        // calculate the accumulation of the all point in a partition and count the rows
    +        val map = scala.collection.mutable.Map.empty[Int, (BV[Double], Int)]
    +        iter.foreach { point =>
    +          val idx = ClusterTree.findClosestCenter(metric)(centers)(point)
    +          val (sumBV, n) = map.get(idx)
    +              .getOrElse((new BSV[Double](Array(), Array(), point.size), 0))
    +          map(idx) = (sumBV + point, n + 1)
    +        }
    +        map.toIterator
    +      }.reduceByKeyLocally {
    +        // sum the accumulation and the count in the all partition
    +        case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2)
    +      }.map { case ((idx: Int, (center: BV[Double], counts: Int))) =>
    +        center :/ counts.toDouble
    +      }
    +
    +      val normSum = centers.map(v => breezeNorm(v, 2.0)).sum
    +      val newNormSum = newCenters.map(v => breezeNorm(v, 2.0)).sum
    +      error = math.abs((normSum - newNormSum) / normSum)
    +      centers = newCenters.toArray
    +      numIter += 1
    +
    +      logInfo(s"${numIter} iterations is finished" +
    +          s" for ${System.currentTimeMillis() - startTimeOfIter}" +
    +          s" at ${getClass}.split")
    +    }
    +
    +    val vectors = centers.map(center => Vectors.fromBreeze(center))
    +    val nodes = centers.size match {
    +      case 1 => Array(new ClusterTree(vectors(0), data))
    +      case 2 => {
    +        val closest = data.map(p => (ClusterTree.findClosestCenter(metric)(centers)(p), p))
    +        centers.zipWithIndex.map { case (center, i) =>
    +          val subData = closest.filter(_._1 == i).map(_._2)
    +          subData.cache
    +          new ClusterTree(vectors(i), subData)
    +        }
    +      }
    +      case _ => throw new RuntimeException(s"something wrong with # centers:${centers.size}")
    +    }
    +    logInfo(s"${this.getClass.getSimpleName}.split end" +
    +        s" with total iterations" +
    +        s" for ${System.currentTimeMillis() - startTime}")
    +    nodes
    +  }
    +}
    +
    +/**
    + * top-level methods for calling the hierarchical clustering algorithm
    + */
    +object HierarchicalClustering {
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data
    +   *
    +   * @param data training data
    +   * @param numClusters the maximum number of clusters you want
    +   * @return a hierarchical clustering model
    +   */
    +  def train(data: RDD[Vector], numClusters: Int): HierarchicalClusteringModel = {
    +    val app = new HierarchicalClustering().setNumClusters(numClusters)
    +    app.run(data)
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data
    +   *
    +   * @param data training data
    +   * @param numClusters the maximum number of clusters you want
    +   * @param subIterations the number of iterations for each bisecting step
    +   * @param numRetries the number of retries when a splitting fails
    +   * @param epsilon the relative error at which a bisecting step is considered converged
    +   * @param randomSeed the random seed to generate the initial vectors for each bisecting
    +   * @param randomRange the range of error to generate the initial vectors for each bisecting
    +   * @return a hierarchical clustering model
    +   */
    +  def train(
    +    data: RDD[Vector],
    +    numClusters: Int,
    +    subIterations: Int,
    +    numRetries: Int,
    +    epsilon: Double,
    +    randomSeed: Int,
    +    randomRange: Double): HierarchicalClusteringModel = {
    +    val algo = new HierarchicalClustering()
    +        .setNumClusters(numClusters)
    +        .setSubIterations(subIterations)
    +        .setNumRetries(numRetries)
    +        .setEpsilon(epsilon)
    +        .setRandomSeed(randomSeed)
    +        .setRandomRange(randomRange)
    +    algo.run(data)
    +  }
    +}
    +
    +
    +/**
    + * A cluster as a tree node which can have its sub nodes
    + *
    + * @param center the center of the cluster
    + * @param data the data in the cluster
    + * @param height distance between sub nodes
    + * @param variance the statistics for splitting of the cluster
    + * @param dataSize the data size of its data
    + * @param children the sub node(s) of the cluster
    + * @param parent the parent node of the cluster
    + * @param isVisited a flag indicating whether the node has already been visited in the search
    + */
    +private[mllib]
    +class ClusterTree private (
    +  val center: Vector,
    +  private[mllib] val data: RDD[BV[Double]],
    +  private[mllib] var height: Option[Double],
    +  private[mllib] var variance: Option[Double],
    +  private[mllib] var dataSize: Option[Long],
    +  private[mllib] var children: List[ClusterTree],
    +  private[mllib] var parent: Option[ClusterTree],
    +  private[mllib] var isVisited: Boolean) extends Serializable with Cloneable with Logging {
    +
    +  def this(center: Vector, data: RDD[BV[Double]]) =
    +    this(center, data, None, None, None, List.empty[ClusterTree], None, false)
    +
    +  override def clone(): ClusterTree = {
    +    val cloned = new ClusterTree(
    +      this.center,
    +      this.data,
    +      this.height,
    +      this.variance,
    +      this.dataSize,
    +      List.empty[ClusterTree],
    +      None,
    +      this.isVisited
    +    )
    +    val clonedChildren = this.children.map(child => child.clone()).toList
    +    cloned.insert(clonedChildren)
    +    cloned
    +  }
    +
    +  override def toString(): String = {
    +    val elements = Array(
    +      s"hashCode:${this.hashCode()}",
    +      s"depth:${this.getDepth()}",
    +      s"dataSize:${this.dataSize.get}",
    +      s"variance:${this.variance.get}",
    +      s"parent:${this.parent.hashCode()}",
    +      s"children:${this.children.map(_.hashCode())}",
    +      s"isLeaf:${this.isLeaf()}",
    +      s"isVisited:${this.isVisited}"
    +    )
    +    elements.mkString(", ")
    +  }
    +
    +  /**
    +   * Cuts a cluster tree
    +   *
    +   * @param height the threshold of height to cut a cluster tree
    +   * @return the cut cluster tree
    +   */
    +  private[mllib] def cut(height: Double): ClusterTree = {
    +    this.children.foreach { child =>
    +      if (child.getHeight() < height && child.children.size > 0) {
    +        child.children.foreach(grandchild => child.delete(grandchild))
    +      }
    +    }
    +    this.children.foreach(child => child.cut(height))
    +    this
    +  }
    +
    +  /**
    +   * Inserts sub nodes as its children
    +   *
    +   * @param children inserted sub nodes
    +   */
    +  def insert(children: List[ClusterTree]): Unit = {
    +    this.children = this.children ++ children
    +    children.foreach(child => child.parent = Some(this))
    +  }
    +
    +  /**
    +   * Inserts a sub node as its child
    +   *
    +   * @param child inserted sub node
    +   */
    +  def insert(child: ClusterTree): Unit = insert(List(child))
    +
    +  /** Deletes all children */
    +  def delete() = this.children = List.empty[ClusterTree]
    +
    +  /** Deletes a child */
    +  def delete(target: ClusterTree) {
    +    this.children.contains(target) match {
    +      case true => this.children = this.children.filter(child => child != target)
    +      case false => logWarning("You attempted to delete a node which is not contained")
    +    }
    +  }
    +
    +  /**
    +   * Converts the tree into Seq class
    +   * the sub nodes are recursively expanded
    +   *
    +   * @return a Seq into which the cluster tree is expanded
    +   */
    +  def toSeq(): Seq[ClusterTree] = {
    +    val seq = this.children.size match {
    +      case 0 => Seq(this)
    +      case _ => Seq(this) ++ this.children.map(child => child.toSeq()).flatten
    +    }
    +    seq.sortWith { case (a, b) =>
    +      a.getDepth() < b.getDepth() &&
    +          breezeNorm(a.center.toBreeze, 2) < breezeNorm(b.center.toBreeze, 2)
    +    }
    +  }
    +
    +  /**
    +   * Gets the all clusters which are leaves in the cluster tree
    +   * @return the Seq of the clusters
    +   */
    +  def getClusters(): Seq[ClusterTree] = toSeq().filter(_.isLeaf())
    +
    +  /**
    +   * Gets the depth of the cluster in the tree
    +   *
    +   * @return the depth
    +   */
    +  def getDepth(): Int = {
    +    this.parent match {
    +      case None => 0
    +      case _ => 1 + this.parent.get.getDepth()
    +    }
    +  }
    +
    +  /**
    +   * Gets the dendrogram height of the cluster at the cluster tree
    +   *
    +   * @return the dendrogram height
    +   */
    +  def getHeight(): Double = {
    +    this.children.size match {
    +      case 0 => 0.0
    +      case _ => this.height.get + this.children.map(_.getHeight()).max
    +    }
    +  }
    +
    +  /**
    +   * Assigns the closest cluster with a vector
    +   * @param metric distance metric
    +   * @param v the vector you want to assign to
    +   * @return the closest cluster
    +   */
    +  private[mllib]
    +  def assignCluster(metric: Function2[BV[Double], BV[Double], Double])(v: Vector): ClusterTree = {
    +    this.children.size match {
    +      case 0 => this
    +      case size if size > 0 => {
    +        val distances = this.children.map(tree => metric(tree.center.toBreeze, v.toBreeze))
    +        val minIndex = distances.indexOf(distances.min)
    +        this.children(minIndex).assignCluster(metric)(v)
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Assigns the closest cluster index of the clusters with a vector
    +   * @param metric distance metric
    +   * @param vector the vector you want to assign to
    +   * @return the closest cluster index of the all clusters
    +   */
    +  private[mllib]
    +  def assignClusterIndex(metric: Function2[BV[Double], BV[Double], Double])(vector: Vector): Int = {
    +    val assignedTree = this.assignCluster(metric)(vector)
    +    this.getClusters().indexOf(assignedTree)
    +  }
    +
    +  /**
    +   * Gets the number of the clusters in the tree. The clusters are only leaves
    +   *
    +   * @return the number of the clusters in the tree
    +   */
    +  def getTreeSize(): Int = this.toSeq().filter(_.isLeaf()).size
    +
    +  def getVariance(): Option[Double] = this.variance
    +
    +  def getDataSize(): Option[Long] = this.dataSize
    +
    +  def getParent(): Option[ClusterTree] = this.parent
    +
    +  def getChildren(): List[ClusterTree] = this.children
    +
    +  def isLeaf(): Boolean = (this.children.size == 0)
    +
    +  /**
    +   * Whether the cluster is splittable
    +   *
    +   * @return true if the cluster is splittable
    +   */
    +  def isSplittable(): Boolean = {
    +    this.isLeaf && this.getDataSize != None && this.getDataSize.get >= 2
    +  }
    +}
    +
    +/**
    + * Companion object for ClusterTree class
    + */
    +object ClusterTree {
    +
    +  /**
    +   * Converts `RDD[Vector]` into a ClusterTree instance
    +   *
    +   * @param data the data in a cluster
    +   * @return a ClusterTree instance
    +   */
    +  def fromRDD(data: RDD[Vector]): ClusterTree = {
    +    val breezeData = data.map(_.toBreeze).cache
    --- End diff --
    
    This automatically caches the user's input data. Elsewhere, like in `KMeans`, the behavior is to issue a warning if the input data is not cached, but not to automatically cache it for them. Because you are also automatically caching (and then unpersisting) all split data sets throughout the algorithm, this might be ok, but there should be a note in the doc string explaining that data will be cached by this algorithm.
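
    A minimal sketch of the KMeans-style check described above, which warns when the input is uncached rather than caching it on the caller's behalf (`CacheCheck` and `warnIfUncached` are illustrative names, not part of the PR):

        import org.apache.spark.Logging
        import org.apache.spark.mllib.linalg.Vector
        import org.apache.spark.rdd.RDD
        import org.apache.spark.storage.StorageLevel

        object CacheCheck extends Logging {
          // warn, as KMeans.run does, instead of silently caching the user's RDD
          def warnIfUncached(data: RDD[Vector]): Unit = {
            if (data.getStorageLevel == StorageLevel.NONE) {
              logWarning("The input data is not directly cached, which may hurt performance.")
            }
          }
        }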




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22634203
  
    --- Diff: python/pyspark/mllib/clustering.py ---
    @@ -88,6 +92,162 @@ def train(cls, rdd, k, maxIterations=100, runs=1, initializationMode="k-means||"
             return KMeansModel([c.toArray() for c in centers])
     
     
    +class HierarchicalClusteringModel(object):
    +
    +    """A clustering model derived from the hierarchical clustering method.
    +
    +    >>> from numpy import array
    +    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4,2)
    +    >>> train_rdd = sc.parallelize(data)
    +    >>> model = HierarchicalClustering.train(train_rdd, 2)
    +    >>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
    +    True
    +    >>> model.predict(array([8.0, 9.0])) == model.predict(array([9.0, 8.0]))
    +    True
    +    >>> x = model.predict(data[0])
    +    >>> type(x)
    +    <type 'int'>
    +    >>> predicted_rdd = model.predict(train_rdd)
    +    >>> type(predicted_rdd)
    +    <class 'pyspark.rdd.RDD'>
    +    >>> predicted_rdd.collect() == [0, 0, 1, 1]
    +    True
    +    >>> sparse_data = [
    +    ...     SparseVector(3, {1: 1.0}),
    +    ...     SparseVector(3, {1: 1.1}),
    +    ...     SparseVector(3, {2: 1.0}),
    +    ...     SparseVector(3, {2: 1.1})
    +    ... ]
    +    >>> train_rdd = sc.parallelize(sparse_data)
    +    >>> model = HierarchicalClustering.train(train_rdd, 2, numRetries=100)
    +    >>> model.predict(array([0., 1., 0.])) == model.predict(array([0, 1.1, 0.]))
    +    True
    +    >>> model.predict(array([0., 0., 1.])) == model.predict(array([0, 0, 1.1]))
    +    True
    +    >>> model.predict(sparse_data[0]) == model.predict(sparse_data[1])
    +    True
    +    >>> model.predict(sparse_data[2]) == model.predict(sparse_data[3])
    +    True
    +    >>> x = model.predict(array([0., 1., 0.]))
    +    >>> type(x)
    +    <type 'int'>
    +    >>> predicted_rdd = model.predict(train_rdd)
    +    >>> type(predicted_rdd)
    +    <class 'pyspark.rdd.RDD'>
    +    >>> (predicted_rdd.collect() == [0, 0, 1, 1]
    +    ...     or predicted_rdd.collect() == [1, 1, 0, 0] )
    +    True
    +    >>> type(model.clusterCenters)
    +    <type 'list'>
    +    """
    +
    +    def __init__(self, sc, java_model, centers):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        :param centers: the cluster centers
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +        self.centers = centers
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    @property
    +    def clusterCenters(self):
    +        """Get the cluster centers, represented as a list of NumPy arrays."""
    +        return self.centers
    +
    +    def predict(self, x):
    +        """Predict the closest cluster index
    +
    +        :param x: a ndarray of list, a SparseVector or RDD[SparseVector]
    +        :return: the closest index or a RDD of int which means the closest index
    +        """
    +        if isinstance(x, ndarray) or isinstance(x, Vector):
    +            return self.__predict_by_array(x)
    +        elif isinstance(x, RDD):
    +            return self.__predict_by_rdd(x)
    +        else:
    +            print 'Invalid input data type x: %s' % type(x)
    +
    +    def __predict_by_array(self, x):
    +        """Predict the closest cluster index with an ndarray or an SparseVector
    +
    +        :param x: a vector
    +        :return: the closest cluster index
    +        """
    +        ser = PickleSerializer()
    +        bytes = bytearray(ser.dumps(_convert_to_vector(x)))
    +        vec = self._sc._jvm.SerDe.loads(bytes)
    +        result = self._java_model.predict(vec)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def __predict_by_rdd(self, x):
    +        """Predict the closest cluster index with a RDD
    +        :param x: a RDD of vector
    +        :return: a RDD of int
    +        """
    +        ser = PickleSerializer()
    +        cached = x.map(_convert_to_vector)._reserialize(AutoBatchedSerializer(ser)).cache()
    +        rdd = _to_java_object_rdd(cached)
    +        jrdd = self._java_model.predict(rdd)
    +        jpyrdd = self._sc._jvm.SerDe.javaToPython(jrdd)
    +        return RDD(jpyrdd, self._sc, AutoBatchedSerializer(PickleSerializer()))
    +
    +    def cut(self, height):
    +        """Cut nodes and leaves in a cluster tree by a dendrogram height.
    +        :param height: a threshold to cut a cluster tree
    +        """
    +        ser = PickleSerializer()
    +        model = self._java_model.cut(height)
    +        bytes = self._sc._jvm.SerDe.dumps(model.getCenters())
    +        centers = ser.loads(str(bytes))
    +        return HierarchicalClusteringModel(self._sc, model, [c.toArray() for c in centers])
    +
    +    def sum_of_variance(self):
    +        """Gets the sum of variance of all clusters.
    +        :return: sum of variance of all clusters
    +        """
    +        ser = PickleSerializer()
    +        model = self._java_model
    +        bytes = self._sc._jvm.SerDe.dumps(model.getSumOfVariance())
    +        variance = ser.loads(str(bytes))
    +        return variance
    +
    +    def to_merge_list(self):
    +        """Extract an array for dendrogram
    +
    +        The array is formatted for SciPy's dendrogram function.
    +        :return: an array suitable for scipy's dendrogram
    +        """
    +        ser = PickleSerializer()
    +        model = self._java_model
    +        bytes = self._sc._jvm.SerDe.dumps(model.toMergeList())
    +        centers = ser.loads(str(bytes))
    +        return array([c.toArray() for c in centers])
    +
    +
    +class HierarchicalClustering(object):
    +
    +    @classmethod
    +    def train(cls, rdd, k,
    +              subIterations=20, numRetries=10, epsilon=1.0e-4, randomSeed=1, randomRange=0.1):
    +        """Train a hierarchical clustering model."""
    +        sc = rdd.context
    +        ser = PickleSerializer()
    +        # cache serialized data to avoid objects over head in JVM
    +        cached = rdd.map(_convert_to_vector)._reserialize(AutoBatchedSerializer(ser)).cache()
    +        model = sc._jvm.PythonMLLibAPI().trainHierarchicalClusteringModel(
    +            _to_java_object_rdd(cached), k,
    +            subIterations, numRetries, epsilon, randomSeed, randomRange)
    +        bytes = sc._jvm.SerDe.dumps(model.getCenters())
    +        centers = ser.loads(str(bytes))
    +        # TODO: because centers are SparseVector, we will convert them to numpy.array.
    --- End diff --
    
    Did you mean convert from `DenseVector`? That's currently the output. I agree it would be good to change to `numpy.array`, especially because after calling `cut` the output is an array of `numpy.array`s. The two behaviors should be consistent, as calling `cut` should not change the type.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by rnowling <gi...@git.apache.org>.
Github user rnowling commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-68746596
  
    @mengxr This PR has been lingering for a while.  What can we do to get it a little more attention?  Thanks!




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60463499
  
      [Test build #22177 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22177/consoleFull) for   PR 2906 at commit [`91a38e3`](https://github.com/apache/spark/commit/91a38e361ac89933cb6e774cd05624f20e7b0344).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaHierarchicalClustering `
      * `class HierarchicalClusteringConf(`
      * `class HierarchicalClustering(val conf: HierarchicalClusteringConf)`
      * `class ClusterTree(`
      * `class ClusteringModel(object):`
      * `class KMeansModel(ClusteringModel):`
      * `class HierarchicalClusteringModel(ClusteringModel):`
      * `class HierarchicalClustering(object):`





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22640591
  
    --- Diff: data/mllib/sample_hierarchical_data.csv ---
    @@ -0,0 +1,150 @@
    +5.1,3.5,1.4,0.2
    --- End diff --
    
    Good point =) Leave as is then. Maybe at some point we should give all the vector-valued example data sets the same format / file type just for consistency, but that can be a separate PR.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw closed the pull request at:

    https://github.com/apache/spark/pull/2906




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22634674
  
    --- Diff: docs/mllib-clustering.md ---
    @@ -154,6 +156,175 @@ section of the Spark
     Quick Start guide. Be sure to also include *spark-mllib* to your build file as
     a dependency.
     
    +
    +### Hierarchical Clustering
    +
    +MLlib supports
    +[hierarchical clustering](http://en.wikipedia.org/wiki/Hierarchical_clustering), one of the most commonly used clustering algorithms, which seeks to build a hierarchy of clusters.
    +Strategies for hierarchical clustering generally fall into two types.
    +One is agglomerative clustering, a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
    +The other is divisive clustering, a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
    +The MLlib implementation only includes a divisive hierarchical clustering algorithm.
    +
    +The implementation in MLlib has the following parameters:
    +
    +* *k* is the maximum number of desired clusters.
    +* *subIterations* is the maximum number of iterations to split a cluster into its 2 sub-clusters.
    +* *numRetries* is the maximum number of retries if a splitting doesn't work as expected.
    +* *epsilon* determines the saturation threshold at which the splitting is considered to have converged.
    +
    +
    +
    +### Hierarchical Clustering Example
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +The following code snippets can be executed in `spark-shell`.
    +
    +In the following example, after loading and parsing data, 
    +we use the hierarchical clustering object to cluster the sample data into three clusters. 
    --- End diff --
    
    Clarify that this means three clusters at the bottom-most levels of a hierarchical tree.
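    
    For example (just a sketch using the `HierarchicalClustering.train` API proposed in this PR, not final wording):
    
    ```scala
    import org.apache.spark.mllib.clustering.HierarchicalClustering
    import org.apache.spark.mllib.linalg.Vectors
    
    // Load and parse the data: one comma-separated vector per line.
    val data = sc.textFile("data/mllib/sample_hierarchical_data.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
    
    // Build a cluster tree whose bottom-most level has (at most) three leaves;
    // those three leaf clusters are the final clustering.
    val model = HierarchicalClustering.train(data, 3)
    model.getCenters().foreach(println)
    ```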




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22634802
  
    --- Diff: examples/src/main/python/mllib/hierarchical_clustering.py ---
    @@ -0,0 +1,84 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +A hierarchical clustering program using MLlib.
    +
    +This example requires NumPy, SciPy and matplotlib.
    +"""
    +
    +import os
    +import sys
    +
    +from numpy import array
    +import matplotlib.pyplot as plt
    --- End diff --
    
    I love that you've made it so easy to visualize the output, but this now adds a "dependency" on matplotlib which isn't used anywhere else in PySpark AFAIK. Strictly, because PySpark doesn't currently use formal package management (e.g. through PyPi), this isn't really adding a dependency, and it's just an example. But might be safer to just use a line note showing how the output can be visualized with matplotlib. Curious what others think. cc @davies
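    
    Concretely, the live example could use only NumPy and leave the plotting as a comment for the reader, e.g. (a sketch; the variable names are illustrative, not the ones in this file):
    
    ```python
    # ... run the clustering as above, then:
    # assignments = model.predict(parsedData).collect()  # [(clusterIndex, vector), ...]
    #
    # Optional visualization (requires matplotlib):
    # import matplotlib.pyplot as plt
    # xs = [v[0] for (_, v) in assignments]
    # ys = [v[1] for (_, v) in assignments]
    # cs = [i for (i, _) in assignments]
    # plt.scatter(xs, ys, c=cs)
    # plt.show()
    ```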




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19396869
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,549 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * the configuration for a hierarchical clustering algorithm
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the maximum number of iterations for each bisecting (digging) step
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClusteringConf(
    +  private var numClusters: Int,
    +  private var subIterations: Int,
    +  private var numRetries: Int,
    +  private var epsilon: Double,
    +  private var randomSeed: Int,
    +  private[mllib] var randomRange: Double) extends Serializable {
    +
    +  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setSubIterations(iterations: Int): this.type = {
    +    this.subIterations = iterations
    +    this
    +  }
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
    + *
    + * @param conf the configuration class for the hierarchical clustering
    + */
    +class HierarchicalClustering(val conf: HierarchicalClusteringConf)
    +    extends Serializable with Logging {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(new HierarchicalClusteringConf())
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${conf.toString}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // If the following conditions are satisfied, then stop the training:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters is greater than the desired number of clusters
    +    //   3. The total variance of all clusters increases when a cluster is split
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.conf.getNumClusters
    +        && totalVariance >= newTotalVariance) {
    +
    +      // retry the split several times in order to avoid a poor clustering result
    +      var isMerged = false
    +      var isSingleCluster = false
    +      for (retry <- 1 to this.conf.getNumRetries()) {
    +        if (isMerged == false && isSingleCluster == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          // if there is only one sub node, the node is not splittable
    +          if (subNodes.size == 1) isSingleCluster = true
    +          // add the sub nodes into the tree
    +          // if the sum of the variances of the sub nodes is less than that of the pre-split node
    +          if (node.get.getVariance().get > subNodes.map(_.getVariance().get).sum) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            isMerged = true
    +            logInfo(s"the number of clusters is ${model.clusterTree.getTreeSize()} at step ${step}")
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      totalVariance = newTotalVariance
    +      newTotalVariance = model.clusterTree.toSeq().filter(_.isLeaf()).map(_.getVariance().get).sum
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    model.trainTime = (System.currentTimeMillis() - startTime).toInt
    +    model
    +  }
    +
    +  /**
    +   * Validates the given training data
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    conf match {
    +      case conf if conf.getNumClusters() > data.count() =>
    +        throw new IllegalArgumentException("# clusters must be less than # input data records")
    +      case _ =>
    +    }
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the cluster with the max variance among the leaves of the tree
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bi-sect k-means
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.conf.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.conf.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bi-sect k-means
    +   *
    +   * @param clusterTree the cluster to be split
    +   * @return an array of ClusterTree. Its size is generally 2, but it can be 1
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    var finder = ClusterTree.findClosestCenter(metric)(centers) _
    +
    +    // If the following conditions are satisfied, the iteration is stopped
    +    //   1. the relative error is less than the configured epsilon
    +    //   2. the number of executed iterations reaches the configured maximum
    +    //   3. the number of centers becomes 1, which means the cluster is not splittable
    +    var numIter = 0
    +    var error = Double.MaxValue
    +    while (error > conf.getEpsilon()
    +        && numIter < conf.getSubIterations()
    +        && centers.size > 1) {
    +
    +      val startTimeOfIter = System.currentTimeMillis()
    +      // finds the closest center of each point
    +      data.sparkContext.broadcast(finder)
    +      val newCenters = data.mapPartitions { iter =>
    +        // calculate the sum of all points in a partition and count the rows
    +        val map = scala.collection.mutable.Map.empty[Int, (BV[Double], Int)]
    +        iter.foreach { point =>
    +          val idx = finder(point)
    +          val (sumBV, n) = map.get(idx).getOrElse((BV.zeros[Double](point.size), 0))
    +          map(idx) = (sumBV + point, n + 1)
    +        }
    +        map.toIterator
    +      }.reduceByKeyLocally {
    +        // sum the accumulations and the counts across all partitions
    +        case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2)
    +      }.map { case ((idx: Int, (center: BV[Double], counts: Int))) =>
    +        center :/ counts.toDouble
    +      }
    +
    +      val normSum = centers.map(v => breezeNorm(v, 2.0)).sum
    +      val newNormSum = newCenters.map(v => breezeNorm(v, 2.0)).sum
    +      error = Math.abs((normSum - newNormSum) / normSum)
    +      centers = newCenters.toArray
    +      numIter += 1
    +      finder = ClusterTree.findClosestCenter(metric)(centers) _
    +
    +      logInfo(s"${numIter} iterations are finished" +
    +          s" for ${System.currentTimeMillis() - startTimeOfIter}" +
    +          s" at ${getClass}.split")
    +    }
    +
    +    val vectors = centers.map(center => Vectors.fromBreeze(center))
    +    val nodes = centers.size match {
    +      case 1 => Array(new ClusterTree(vectors(0), data))
    +      case 2 => {
    +        val closest = data.map(point => (finder(point), point))
    +        centers.zipWithIndex.map { case (center, i) =>
    +          val subData = closest.filter(_._1 == i).map(_._2)
    +          subData.cache
    +          new ClusterTree(vectors(i), subData)
    +        }
    +      }
    +      case _ => throw new RuntimeException(s"something wrong with # centers:${centers.size}")
    +    }
    +    logInfo(s"${this.getClass.getSimpleName}.split end" +
    +        s" with total iterations" +
    +        s" for ${System.currentTimeMillis() - startTime}")
    +    nodes
    +  }
    +}
    +
    +/**
    + * top-level methods for calling the hierarchical clustering algorithm
    + */
    +object HierarchicalClustering {
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data and the number of clusters
    +   *
    +   * NOTE: If there is no splittable cluster, the clustering is stopped,
    +   * even if the number of clusters is still less than the given one
    +   *
    +   * @param data training data
    +   * @param numClusters the maximum number of clusters you want
    +   * @return a hierarchical clustering model
    +   *
    +   *         TODO: Support the other parameters of the hierarchical clustering
    +   */
    +  def train(data: RDD[Vector], numClusters: Int): HierarchicalClusteringModel = {
    +    val conf = new HierarchicalClusteringConf()
    +        .setNumClusters(numClusters)
    +    val app = new HierarchicalClustering(conf)
    +    app.run(data)
    +  }
    +}
    +
    +
    +/**
    + * A cluster as a tree node which can have its sub nodes
    + *
    + * @param data the data in the cluster
    + * @param center the center of the cluster
    + * @param variance the statistics for splitting of the cluster
    + * @param dataSize the data size of its data
    + * @param children the sub node(s) of the cluster
    + * @param parent the parent node of the cluster
    + */
    +class ClusterTree(
    --- End diff --
    
    https://github.com/yu-iskw/spark/commit/fc8676ec06b4ed26369e4fa40a92620d193a3bee




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by rnowling <gi...@git.apache.org>.
Github user rnowling commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19267797
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala ---
    @@ -0,0 +1,79 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * This class represents a model of the hierarchical clustering
    + *
    + * @param clusterTree a cluster as a tree node
    + * @param trainTime the milliseconds for executing a training
    + * @param predictTime the milliseconds for executing a prediction
    + * @param isTrained if the model has been trained, the flag is true
    + */
    +class HierarchicalClusteringModel private (
    +  val clusterTree: ClusterTree,
    +  var trainTime: Int,
    +  var predictTime: Int,
    +  var isTrained: Boolean) extends Serializable {
    +
    +  def this(clusterTree: ClusterTree) = this(clusterTree, 0, 0, false)
    +
    +  def getClusters(): Array[ClusterTree] = clusterTree.getClusters().toArray
    +
    +  def getCenters(): Array[Vector] = getClusters().map(_.center)
    +
    +  /**
    +   * Predicts the closest cluster for a given point
    +   */
    +  def predict(vector: Vector): Int = {
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    this.clusterTree.assignClusterIndex(metric)(vector)
    +  }
    +
    +  /**
    +   * Predicts the closest cluster of each point
    +   */
    +  def predict(data: RDD[Vector]): RDD[(Int, Vector)] = {
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    val centers = getClusters().map(_.center.toBreeze)
    +    val treeRoot = this.clusterTree
    +    val closestClusterIndexFinder = treeRoot.assignClusterIndex(metric) _
    +    data.sparkContext.broadcast(closestClusterIndexFinder)
    +    val predicted = data.map(point => (closestClusterIndexFinder(point), point))
    --- End diff --
    
    I don't think you're using the broadcast variable correctly:
    
    http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
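    
    The handle returned by `broadcast()` has to be kept and dereferenced via `.value` inside the closure; calling `sc.broadcast(x)` and then capturing `x` directly just ships the object with the task closure. Roughly (a sketch):
    
    ```scala
    // keep the broadcast handle and read it with .value in the closure
    val bcFinder = data.sparkContext.broadcast(closestClusterIndexFinder)
    val predicted = data.map(point => (bcFinder.value(point), point))
    ```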




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22633778
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,627 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * This trait is used for the configuration of the hierarchical clustering
    + */
    +sealed
    +trait HierarchicalClusteringConf extends Serializable {
    +  this: HierarchicalClustering =>
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def setSubIterations(subIterations: Int): this.type = {
    +    this.subIterations = subIterations
    +    this
    +  }
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
    --- End diff --
    
    Give a slightly longer overview of how the algorithm works?
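    
    For example, something along these lines, summarizing the `run`/`split` loop below (a sketch, not final wording):
    
    ```scala
    /**
     * This is a divisive hierarchical clustering algorithm based on the bisecting
     * k-means algorithm. Starting from a single root cluster that contains all
     * points, it repeatedly (1) picks the leaf cluster with the largest variance,
     * (2) splits it into two sub-clusters with a 2-means pass, retrying up to
     * numRetries times if the split fails, and (3) records the distance between
     * the two new centers as the dendrogram height, until numClusters leaves
     * exist or no leaf is splittable.
     */
    ```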




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-83757019
  
    I've spoken with @freeman-lab. I am going to send a new PR after replacing the algorithm with the new one and adding wrapper classes for the ml package.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19535947
  
    --- Diff: python/pyspark/mllib/clustering.py ---
    @@ -91,6 +99,58 @@ def train(cls, rdd, k, maxIterations=100, runs=1, initializationMode="k-means||"
             return KMeansModel([c.toArray() for c in centers])
     
     
    +class HierarchicalClusteringModel(ClusteringModel):
    --- End diff --
    
    I changed the way `predict` is called in the Python code, using the Java API.
    
    https://github.com/yu-iskw/spark/commit/8aa6a00f0dd53a5913be668a64332fc050314040




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-62310159
  
      [Test build #23121 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23121/consoleFull) for   PR 2906 at commit [`691c49a`](https://github.com/apache/spark/commit/691c49adf9751193f3b8928211e77d307ef44c37).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaHierarchicalClustering `
      * `trait HierarchicalClusteringConf extends Serializable `
      * `class HierarchicalClustering(`
      * `class HierarchicalClusteringModel(object):`
      * `class HierarchicalClustering(object):`





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22632182
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala ---
    @@ -0,0 +1,126 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    --- End diff --
    
    Import formatting, see other comment.
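    
    Presumably the usual Spark grouping: third-party imports first, then the `org.apache.spark` ones, with a blank line between groups. A sketch against this file's imports:
    
    ```scala
    import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    
    import org.apache.spark.api.java.JavaRDD
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD
    ```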




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-62135443
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23052/consoleFull) for   PR 2906 at commit [`8355f95`](https://github.com/apache/spark/commit/8355f959f02ca67454c9cb070912480db0a44671).
     * This patch **does not merge cleanly**.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by rnowling <gi...@git.apache.org>.
Github user rnowling commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-69192847
  
    @freeman-lab @srowen @mengxr many thanks!




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60931955
  
      [Test build #22451 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22451/consoleFull) for   PR 2906 at commit [`825fbfb`](https://github.com/apache/spark/commit/825fbfbe62de7787d7b343f84036a4933b53e0ff).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaHierarchicalClustering `
      * `trait HierarchicalClusteringConf extends Serializable `
      * `class HierarchicalClustering(`
      * `class HierarchicalClusteringModel(object):`
      * `class HierarchicalClustering(object):`





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60546582
  
      [Test build #22267 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22267/consoleFull) for   PR 2906 at commit [`1a08510`](https://github.com/apache/spark/commit/1a0851079bf145939e665aa78f0e77b3995e6e66).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60923278
  
      [Test build #22451 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22451/consoleFull) for   PR 2906 at commit [`825fbfb`](https://github.com/apache/spark/commit/825fbfbe62de7787d7b343f84036a4933b53e0ff).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-78214203
  
    @freeman-lab, @srowen, I apologize for the delay in replying. I will modify the code ASAP.
    And I have a question about the implementation. I think this implementation is very slow, and it is difficult to handle a large number of clusters as an argument. So I tried to implement a new one which is more scalable and faster than the current one. The new one is 1000 times faster than the current one.
    
    https://github.com/yu-iskw/more-scalable-hierarchical-clustering-with-spark
    
    Should we continue with this PR, or replace the current implementation with the new one? Thanks!




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22634890
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,627 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * This trait is used for the configuration of the hierarchical clustering
    + */
    +sealed
    +trait HierarchicalClusteringConf extends Serializable {
    +  this: HierarchicalClustering =>
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def setSubIterations(subIterations: Int): this.type = {
    +    this.subIterations = subIterations
    +    this
    +  }
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
    + *
    + * The main idea of this algorithm is derived from:
    + * "A comparison of document clustering techniques",
    + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
    + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the maximum number of iterations for each bisecting (digging) step
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClustering(
    +  private[mllib] var numClusters: Int,
    +  private[mllib] var subIterations: Int,
    +  private[mllib] var numRetries: Int,
    +  private[mllib] var epsilon: Double,
    +  private[mllib] var randomSeed: Int,
    +  private[mllib] var randomRange: Double)
    +    extends Serializable with Logging with HierarchicalClusteringConf {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
    +
    +  /** Shows the parameters */
    +  override def toString(): String = {
    +    Array(
    +      s"numClusters:${numClusters}",
    +      s"subIterations:${subIterations}",
    +      s"numRetries:${numRetries}",
    +      s"epsilon:${epsilon}",
    +      s"randomSeed:${randomSeed}",
    +      s"randomRange:${randomRange}"
    +    ).mkString(", ")
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${this}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // If the following conditions are satisfied, then stop the training:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters is greater than the desired number of clusters
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.numClusters) {
    +
    +      // retry the split several times in order to avoid a poor clustering result
    +      var isMerged = false
    +      for (i <- 1 to this.numRetries) {
    +        if (node.get.getVariance().get > this.epsilon && isMerged == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          if (subNodes.size == 2) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            // unpersist unnecessary cache because its children nodes are cached
    +            node.get.data.unpersist()
    +            logInfo(s"the number of clusters is ${model.clusterTree.getTreeSize()} at step ${step}")
    +            isMerged = true
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    val trainTime = (System.currentTimeMillis() - startTime).toInt
    +    logInfo(s"Elapsed Time for Training: ${trainTime.toDouble / 1000} [sec]")
    +    model
    +  }
    +
    +  /**
    +   * validate the given data to train
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    require(this.numClusters <= data.count(), "# clusters must be less than or equal to # data rows")
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the cluster with the max variance among the leaves of the tree
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bi-sect k-means
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bi-sect k-means
    +   *
    +   * @param clusterTree the cluster to be split
    +   * @return an array of ClusterTree. its size is generally 2, but its size can be 1
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    val sc = data.sparkContext
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    sc.broadcast(metric)
    +
    +    // If the following conditions are satisfied, the iteration is stopped
    +    //   1. the relative error is less than the configured epsilon
    +    //   2. the number of executed iterations reaches the configured maximum
    +    //   3. the number of centers becomes one, which means the cluster is not splittable
    +    var numIter = 0
    +    var error = Double.MaxValue
    +    while (error > this.epsilon
    +        && numIter < this.subIterations
    +        && centers.size > 1) {
    +      val startTimeOfIter = System.currentTimeMillis()
    +
    +      sc.broadcast(centers)
    +      val newCenters = data.mapPartitions { iter =>
    +        // calculate the sum of all points in a partition and count the rows
    +        val map = scala.collection.mutable.Map.empty[Int, (BV[Double], Int)]
    +        iter.foreach { point =>
    +          val idx = ClusterTree.findClosestCenter(metric)(centers)(point)
    +          val (sumBV, n) = map.get(idx)
    +              .getOrElse((new BSV[Double](Array(), Array(), point.size), 0))
    +          map(idx) = (sumBV + point, n + 1)
    +        }
    +        map.toIterator
    +      }.reduceByKeyLocally {
    +        // sum the accumulations and the counts across all partitions
    +        case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2)
    +      }.map { case ((idx: Int, (center: BV[Double], counts: Int))) =>
    +        center :/ counts.toDouble
    +      }
    +
    +      val normSum = centers.map(v => breezeNorm(v, 2.0)).sum
    +      val newNormSum = newCenters.map(v => breezeNorm(v, 2.0)).sum
    +      error = math.abs((normSum - newNormSum) / normSum)
    +      centers = newCenters.toArray
    +      numIter += 1
    +
    +      logInfo(s"${numIter} iterations are finished" +
    +          s" for ${System.currentTimeMillis() - startTimeOfIter}" +
    +          s" at ${getClass}.split")
    +    }
    +
    +    val vectors = centers.map(center => Vectors.fromBreeze(center))
    +    val nodes = centers.size match {
    +      case 1 => Array(new ClusterTree(vectors(0), data))
    +      case 2 => {
    +        val closest = data.map(p => (ClusterTree.findClosestCenter(metric)(centers)(p), p))
    +        centers.zipWithIndex.map { case (center, i) =>
    +          val subData = closest.filter(_._1 == i).map(_._2)
    +          subData.cache
    +          new ClusterTree(vectors(i), subData)
    +        }
    +      }
    +      case _ => throw new RuntimeException(s"something wrong with # centers:${centers.size}")
    +    }
    +    logInfo(s"${this.getClass.getSimpleName}.split end" +
    +        s" with total iterations" +
    +        s" for ${System.currentTimeMillis() - startTime}")
    +    nodes
    +  }
    +}
    +
    +/**
    + * top-level methods for calling the hierarchical clustering algorithm
    + */
    +object HierarchicalClustering {
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data
    +   *
    +   * @param data training data
    +   * @param numClusters the maximum number of clusters you want
    +   * @return a hierarchical clustering model
    +   */
    +  def train(data: RDD[Vector], numClusters: Int): HierarchicalClusteringModel = {
    +    val app = new HierarchicalClustering().setNumClusters(numClusters)
    +    app.run(data)
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data
    +   *
    +   * @param data training data
    +   * @param numClusters the maximum number of clusters you want
    +   * @param subIterations the maximum number of iterations for each bisecting step
    +   * @param numRetries the number of retries when a split doesn't succeed
    +   * @param epsilon the relative error below which bisecting is considered converged
    +   * @param randomSeed the random seed used to generate the initial vectors for each bisecting
    +   * @param randomRange the range of perturbation used to generate the initial vectors for each bisecting
    +   * @return a hierarchical clustering model
    +   */
    +  def train(
    +    data: RDD[Vector],
    +    numClusters: Int,
    +    subIterations: Int,
    +    numRetries: Int,
    +    epsilon: Double,
    +    randomSeed: Int,
    +    randomRange: Double): HierarchicalClusteringModel = {
    +    val algo = new HierarchicalClustering()
    +        .setNumClusters(numClusters)
    +        .setSubIterations(subIterations)
    +        .setNumRetries(numRetries)
    +        .setEpsilon(epsilon)
    +        .setRandomSeed(randomSeed)
    +        .setRandomRange(randomRange)
    +    algo.run(data)
    +  }
    +}
    +
    +
    +/**
    + * A cluster as a tree node which can have its sub nodes
    + *
    + * @param center the center of the cluster
    + * @param data the data in the cluster
    + * @param height distance between sub nodes
    + * @param variance the statistics for splitting of the cluster
    + * @param dataSize the data size of its data
    + * @param children the sub node(s) of the cluster
    + * @param parent the parent node of the cluster
    + * @param isVisited a flag to be searched
    + */
    +private[mllib]
    +class ClusterTree private (
    +  val center: Vector,
    +  private[mllib] val data: RDD[BV[Double]],
    +  private[mllib] var height: Option[Double],
    +  private[mllib] var variance: Option[Double],
    +  private[mllib] var dataSize: Option[Long],
    +  private[mllib] var children: List[ClusterTree],
    +  private[mllib] var parent: Option[ClusterTree],
    +  private[mllib] var isVisited: Boolean) extends Serializable with Cloneable with Logging {
    +
    +  def this(center: Vector, data: RDD[BV[Double]]) =
    +    this(center, data, None, None, None, List.empty[ClusterTree], None, false)
    +
    +  override def clone(): ClusterTree = {
    +    val cloned = new ClusterTree(
    +      this.center,
    +      this.data,
    +      this.height,
    +      this.variance,
    +      this.dataSize,
    +      List.empty[ClusterTree],
    +      None,
    +      this.isVisited
    +    )
    +    val clonedChildren = this.children.map(child => child.clone()).toList
    +    cloned.insert(clonedChildren)
    +    cloned
    +  }
    +
    +  override def toString(): String = {
    +    val elements = Array(
    +      s"hashCode:${this.hashCode()}",
    +      s"depth:${this.getDepth()}",
    +      s"dataSize:${this.dataSize.get}",
    +      s"variance:${this.variance.get}",
    +      s"parent:${this.parent.hashCode()}",
    +      s"children:${this.children.map(_.hashCode())}",
    +      s"isLeaf:${this.isLeaf()}",
    +      s"isVisited:${this.isVisited}"
    +    )
    +    elements.mkString(", ")
    +  }
    +
    +  /**
    +   * Cuts a cluster tree
    +   *
    +   * @param height the threshold of height to cut a cluster tree
    +   * @return a cut hierarchical clustering model
    +   */
    +  private[mllib] def cut(height: Double): ClusterTree = {
    +    this.children.foreach { child =>
    +      if (child.getHeight() < height && child.children.size > 0) {
    +        child.children.foreach(grandchild => child.delete(grandchild))
    +      }
    +    }
    +    this.children.foreach(child => child.cut(height))
    +    this
    +  }
    +
    +  /**
    +   * Inserts sub nodes as its children
    +   *
    +   * @param children inserted sub nodes
    +   */
    +  def insert(children: List[ClusterTree]): Unit = {
    +    this.children = this.children ++ children
    +    children.foreach(child => child.parent = Some(this))
    +  }
    +
    +  /**
    +   * Inserts a sub node as its child
    +   *
    +   * @param child inserted sub node
    +   */
    +  def insert(child: ClusterTree): Unit = insert(List(child))
    +
    +  /** Deletes all children */
    +  def delete() = this.children = List.empty[ClusterTree]
    +
    +  /** Deletes a child */
    +  def delete(target: ClusterTree) {
    +    this.children.contains(target) match {
    +      case true => this.children = this.children.filter(child => child != target)
    +      case false => logWarning("You attempted to delete a node which is not contained")
    +    }
    +  }
    +
    +  /**
    +   * Converts the tree into Seq class
    +   * the sub nodes are recursively expanded
    +   *
    +   * @return Seq class which the cluster tree is expanded
    +   */
    +  def toSeq(): Seq[ClusterTree] = {
    +    val seq = this.children.size match {
    +      case 0 => Seq(this)
    +      case _ => Seq(this) ++ this.children.map(child => child.toSeq()).flatten
    +    }
    +    seq.sortWith { case (a, b) =>
    +      a.getDepth() < b.getDepth() &&
    +          breezeNorm(a.center.toBreeze, 2) < breezeNorm(b.center.toBreeze, 2)
    +    }
    +  }
    +
    +  /**
    +   * Gets the all clusters which are leaves in the cluster tree
    +   * @return the Seq of the clusters
    +   */
    +  def getClusters(): Seq[ClusterTree] = toSeq().filter(_.isLeaf())
    +
    +  /**
    +   * Gets the depth of the cluster in the tree
    +   *
    +   * @return the depth
    +   */
    +  def getDepth(): Int = {
    +    this.parent match {
    +      case None => 0
    +      case _ => 1 + this.parent.get.getDepth()
    +    }
    +  }
    +
    +  /**
    +   * Gets the dendrogram height of the cluster in the cluster tree
    +   *
    +   * @return the dendrogram height
    +   */
    +  def getHeight(): Double = {
    +    this.children.size match {
    +      case 0 => 0.0
    +      case _ => this.height.get + this.children.map(_.getHeight()).max
    +    }
    +  }
    +
    +  /**
    +   * Assigns the closest cluster with a vector
    +   * @param metric distance metric
    --- End diff --
    
    Insert line break




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60460354
  
    ok to test




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60464016
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22179/
    Test FAILed.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-62310162
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23121/
    Test FAILed.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60214129
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-62325994
  
      [Test build #23124 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23124/consoleFull) for   PR 2906 at commit [`cfdf842`](https://github.com/apache/spark/commit/cfdf8429bf4afb3e7a6a329dd285fe48429aec46).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaHierarchicalClustering `
      * `trait HierarchicalClusteringConf extends Serializable `
      * `class HierarchicalClustering(`
      * `class HierarchicalClusteringModel(object):`
      * `class HierarchicalClustering(object):`





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60547118
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22270/
    Test FAILed.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22632172
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,627 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
    --- End diff --
    
    There should be a line separating third-party imports (e.g. breeze) from spark imports. And within each group, imports should be ordered alphabetically.
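
    A sketch of the grouping suggested here, using the import names from the
    diff above (the breeze group first, a blank line, then the spark group,
    each ordered alphabetically):

        import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}

        import org.apache.spark.Logging
        import org.apache.spark.SparkContext._
        import org.apache.spark.mllib.linalg.{Vector, Vectors}
        import org.apache.spark.rdd.RDD
        import org.apache.spark.util.random.XORShiftRandom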




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22635337
  
    --- Diff: examples/src/main/python/mllib/hierarchical_clustering.py ---
    @@ -0,0 +1,84 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +A hierarchical clustering program using MLlib.
    +
    +This example requires NumPy, SciPy and matplotlib.
    +"""
    +
    +import os
    +import sys
    +
    +from numpy import array
    +import matplotlib.pyplot as plt
    --- End diff --
    
    We should be careful about adding any dependency, even in an example. Here, I'd like to make it optional: if matplotlib and SciPy are not installed, tell the user to install them for a better experience.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22641223
  
    --- Diff: docs/mllib-clustering.md ---
    @@ -154,6 +156,175 @@ section of the Spark
     Quick Start guide. Be sure to also include *spark-mllib* to your build file as
     a dependency.
     
    +
    +### Hierarchical Clustering
    +
    +MLlib supports
    +[hierarchical clustering](http://en.wikipedia.org/wiki/Hierarchical_clustering), one of the most commonly used clustering algorithms, which seeks to build a hierarchy of clusters.
    +Strategies for hierarchical clustering generally fall into two types.
    +One is agglomerative clustering, a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
    +The other is divisive clustering, a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
    +The MLlib implementation includes only a divisive hierarchical clustering algorithm.
    +
    +The implementation in MLlib has the following parameters:
    +
    +* *k* is the maximum number of desired clusters.
    +* *subIterations* is the maximum number of iterations for splitting a cluster into its two sub-clusters.
    +* *numRetries* is the maximum number of retries if a split doesn't work as expected.
    +* *epsilon* is the convergence threshold used to decide that a split has converged.
    +
    +
    +
    +### Hierarchical Clustering Example
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +The following code snippets can be executed in `spark-shell`.
    +
    +In the following example, after loading and parsing data,
    +we use the hierarchical clustering object to cluster the sample data.
    +The maximum number of desired clusters is passed to the algorithm.
    +However, even if the number of clusters is still less than *k* partway through the clustering,
    +the clustering is stopped once the clusters cannot be split any further.
    +
    +{% highlight scala %}
    +import org.apache.spark.mllib.clustering.HierarchicalClustering
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +// Load and parse the data
    +val data = sc.textFile("data/mllib/sample_hierarchical_data.csv")
    +val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()
    +
    +// Cluster the data into at most ten clusters using the HierarchicalClustering object
    +val numClusters = 10
    +val model = HierarchicalClustering.train(parsedData, numClusters)
    +println(s"# Clusters: ${model.getClusters().size}")
    +
    +// Show the cluster centers
    +model.getCenters.foreach(println)
    +
    +// Evaluate clustering by computing the sum of variance of the clusters
    +val variance = model.getClusters.map(_.getVariance.get).sum
    +println(s"Sum of Variance of the Clusters = ${variance}")
    +
    +// Cut the cluster tree by height
    +val cutModel = model.cut(4.0)
    +println(s"# Clusters: ${cutModel.getClusters().size}")
    +val cutVariance = cutModel.getClusters.map(_.getVariance.get).sum
    +println(s"Sum of Variance of the Clusters = ${cutVariance}")
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +All of MLlib's methods use Java-friendly types, so you can import and call them there the same
    +way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
    +Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
    +calling `.rdd()` on your `JavaRDD` object. A self-contained application example
    +that is equivalent to the provided example in Scala is given below:
    +
    +{% highlight java %}
    +import org.apache.spark.SparkConf;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.mllib.clustering.HierarchicalClustering;
    +import org.apache.spark.mllib.clustering.HierarchicalClusteringModel;
    +import org.apache.spark.mllib.linalg.Vector;
    +import org.apache.spark.mllib.linalg.Vectors;
    +
    +public class JavaHierarchicalClustering {
    --- End diff --
    
    The other example code I see forgoes a lot of the boilerplate used here: declaring a class, a main method, System.out, etc. The indentation here is also significantly deeper than the 2-space indent used in the code. Addressing these would make the example easier to scan on the web page.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60921831
  
      [Test build #22450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22450/consoleFull) for   PR 2906 at commit [`e772fdf`](https://github.com/apache/spark/commit/e772fdf0318b87ae4c2c4cc728d82752036a67db).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60567740
  
      [Test build #22291 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22291/consoleFull) for   PR 2906 at commit [`8be11da`](https://github.com/apache/spark/commit/8be11da1f045e9ffc8c56886eea7c133aefe3eaf).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60463744
  
    @yu-iskw I added you to the whitelist. Future commits from you should trigger Jenkins automatically. Just took a very brief scan over the code and really appreciate the fact that more than half of the code is doc/test/example. I will check the implementation after the feature freeze. Some high-level questions for now:
    
    1. Is there a paper that you used as reference? If so, please cite it in the doc.
    2. Could you send some performance testing results on dense and sparse datasets?




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-62307500
  
      [Test build #23121 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23121/consoleFull) for   PR 2906 at commit [`691c49a`](https://github.com/apache/spark/commit/691c49adf9751193f3b8928211e77d307ef44c37).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-68794407
  
    Hey all, thanks for the nudge =) I've been going through it, will get you feedback ASAP.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60546524
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22268/
    Test PASSed.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60922004
  
      [Test build #22450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22450/consoleFull) for   PR 2906 at commit [`e772fdf`](https://github.com/apache/spark/commit/e772fdf0318b87ae4c2c4cc728d82752036a67db).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaHierarchicalClustering `
      * `trait HierarchicalClusteringConf extends Serializable `
      * `class HierarchicalClustering(`
      * `class HierarchicalClusteringModel(object):`
      * `class HierarchicalClustering(object):`





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19288793
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,549 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * the configuration for a hierarchical clustering algorithm
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations at each bisecting step
    + * @param numRetries the maximum number of retries if a split doesn't work as expected
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClusteringConf(
    +  private var numClusters: Int,
    +  private var subIterations: Int,
    +  private var numRetries: Int,
    +  private var epsilon: Double,
    +  private var randomSeed: Int,
    +  private[mllib] var randomRange: Double) extends Serializable {
    +
    +  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setSubIterations(iterations: Int): this.type = {
    +    this.subIterations = iterations
    +    this
    +  }
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bi-sect k-means algorithm.
    + *
    + * @param conf the configuration class for the hierarchical clustering
    + */
    +class HierarchicalClustering(val conf: HierarchicalClusteringConf)
    +    extends Serializable with Logging {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(new HierarchicalClusteringConf())
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${conf.toString}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // If any of the following conditions is satisfied, stop the training:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters reaches the given number of clusters
    +    //   3. The total variance of all clusters increases when a cluster is split
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.conf.getNumClusters
    +        && totalVariance >= newTotalVariance) {
    +
    +      // retry the split several times to avoid a bad clustering result
    +      var isMerged = false
    +      var isSingleCluster = false
    +      for (retry <- 1 to this.conf.getNumRetries()) {
    +        if (isMerged == false && isSingleCluster == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          // it seems that there is no splittable node
    +          if (subNodes.size == 1) isSingleCluster = true
    +          // add the sub nodes into the tree
    +          // if the sum of the variances of the sub nodes is less than that of the pre-split node
    +          if (node.get.getVariance().get > subNodes.map(_.getVariance().get).sum) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            isMerged = true
    +            logInfo(s"the number of cluster is ${model.clusterTree.getTreeSize()} at step ${step}")
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      totalVariance = newTotalVariance
    +      newTotalVariance = model.clusterTree.toSeq().filter(_.isLeaf()).map(_.getVariance().get).sum
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    model.trainTime = (System.currentTimeMillis() - startTime).toInt
    +    model
    +  }
    +
    +  /**
    +   * Validates the given training data
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    conf match {
    +      case conf if conf.getNumClusters() > data.count() =>
    +        throw new IllegalArgumentException("# clusters must be less than # input data records")
    +      case _ =>
    +    }
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the leaf cluster with the maximum variance
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bi-sect k-means
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.conf.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.conf.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bi-sect k-means
    +   *
    +   * @param clusterTree the cluster to split
    +   * @return an array of ClusterTree; its size is generally 2, but it can be 1
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    var finder = ClusterTree.findClosestCenter(metric)(centers) _
    +
    +    // The iteration is stopped when any of the following conditions is satisfied:
    +    //   1. the relative error is less than the configured epsilon
    +    //   2. the number of executed iterations exceeds the configured maximum
    +    //   3. only one center is left, which means the cluster is not splittable
    +    var numIter = 0
    +    var error = Double.MaxValue
    +    while (error > conf.getEpsilon()
    +        && numIter < conf.getSubIterations()
    +        && centers.size > 1) {
    +
    +      val startTimeOfIter = System.currentTimeMillis()
    +      // finds the closest center of each point
    +      data.sparkContext.broadcast(finder)
    +      val newCenters = data.mapPartitions { iter =>
    +        // accumulate the sum of all points in a partition and count the rows
    +        val map = scala.collection.mutable.Map.empty[Int, (BV[Double], Int)]
    +        iter.foreach { point =>
    +          val idx = finder(point)
    +          val (sumBV, n) = map.get(idx).getOrElse((BV.zeros[Double](point.size), 0))
    +          map(idx) = (sumBV + point, n + 1)
    +        }
    +        map.toIterator
    +      }.reduceByKeyLocally {
    +        // sum the accumulations and the counts across all partitions
    +        case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2)
    +      }.map { case ((idx: Int, (center: BV[Double], counts: Int))) =>
    +        center :/ counts.toDouble
    +      }
    +
    +      val normSum = centers.map(v => breezeNorm(v, 2.0)).sum
    +      val newNormSum = newCenters.map(v => breezeNorm(v, 2.0)).sum
    +      error = Math.abs((normSum - newNormSum) / normSum)
    --- End diff --
    
    More trivia, but I think you're using `java.lang.Math` here instead of `scala.math`. Both are fine, but I wonder if the latter is more standard in Scala code.
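
    For instance, the line above could be (scala.math is available as plain
    `math` without an import):

        val error = math.abs((normSum - newNormSum) / normSum)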




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22634895
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,627 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * This trait is used for the configuration of the hierarchical clustering
    + */
    +sealed
    +trait HierarchicalClusteringConf extends Serializable {
    +  this: HierarchicalClustering =>
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def setSubIterations(subIterations: Int): this.type = {
    +    this.subIterations = subIterations
    +    this
    +  }
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bi-sect k-means algorithm.
    + *
    + * The main idea of this algorithm is derived from:
    + * "A comparison of document clustering techniques",
    + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
    + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the number of iterations at each bisecting step
    + * @param numRetries the maximum number of retries if a split doesn't work as expected
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClustering(
    +  private[mllib] var numClusters: Int,
    +  private[mllib] var subIterations: Int,
    +  private[mllib] var numRetries: Int,
    +  private[mllib] var epsilon: Double,
    +  private[mllib] var randomSeed: Int,
    +  private[mllib] var randomRange: Double)
    +    extends Serializable with Logging with HierarchicalClusteringConf {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
    +
    +  /** Shows the parameters */
    +  override def toString(): String = {
    +    Array(
    +      s"numClusters:${numClusters}",
    +      s"subIterations:${subIterations}",
    +      s"numRetries:${numRetries}",
    +      s"epsilon:${epsilon}",
    +      s"randomSeed:${randomSeed}",
    +      s"randomRange:${randomRange}"
    +    ).mkString(", ")
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${this}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // If any of the following conditions is satisfied, stop the training:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters reaches the given number of clusters
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.numClusters) {
    +
    +      // retry the split several times to avoid a bad clustering result
    +      var isMerged = false
    +      for (i <- 1 to this.numRetries) {
    +        if (node.get.getVariance().get > this.epsilon && isMerged == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          if (subNodes.size == 2) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            // unpersist the unnecessary cache because its child nodes are cached
    +            node.get.data.unpersist()
    +            logInfo(s"the number of cluster is ${model.clusterTree.getTreeSize()} at step ${step}")
    +            isMerged = true
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    val trainTime = (System.currentTimeMillis() - startTime).toInt
    +    logInfo(s"Elapsed Time for Training: ${trainTime.toDouble / 1000} [sec]")
    +    model
    +  }
    +
    +  /**
    +   * Validates the given training data
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    require(this.numClusters <= data.count(), "# clusters must not be greater than # data rows")
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the leaf cluster with the maximum variance
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bi-sect k-means
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bi-sect k-means
    +   *
    +   * @param clusterTree the cluster to split
    +   * @return an array of ClusterTree; its size is generally 2, but it can be 1
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    val sc = data.sparkContext
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    sc.broadcast(metric)
    +
    +    // The iteration is stopped when any of the following conditions is satisfied:
    +    //   1. the relative error is less than the configured epsilon
    +    //   2. the number of executed iterations exceeds the configured maximum
    +    //   3. only one center is left, which means the cluster is not splittable
    +    var numIter = 0
    +    var error = Double.MaxValue
    +    while (error > this.epsilon
    +        && numIter < this.subIterations
    +        && centers.size > 1) {
    +      val startTimeOfIter = System.currentTimeMillis()
    +
    +      sc.broadcast(centers)
    +      val newCenters = data.mapPartitions { iter =>
    +        // accumulate the sum of all points in a partition and count the rows
    +        val map = scala.collection.mutable.Map.empty[Int, (BV[Double], Int)]
    +        iter.foreach { point =>
    +          val idx = ClusterTree.findClosestCenter(metric)(centers)(point)
    +          val (sumBV, n) = map.get(idx)
    +              .getOrElse((new BSV[Double](Array(), Array(), point.size), 0))
    +          map(idx) = (sumBV + point, n + 1)
    +        }
    +        map.toIterator
    +      }.reduceByKeyLocally {
    +        // sum the accumulations and the counts across all partitions
    +        case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2)
    +      }.map { case ((idx: Int, (center: BV[Double], counts: Int))) =>
    +        center :/ counts.toDouble
    +      }
    +
    +      val normSum = centers.map(v => breezeNorm(v, 2.0)).sum
    +      val newNormSum = newCenters.map(v => breezeNorm(v, 2.0)).sum
    +      error = math.abs((normSum - newNormSum) / normSum)
    +      centers = newCenters.toArray
    +      numIter += 1
    +
    +      logInfo(s"${numIter} iterations is finished" +
    +          s" for ${System.currentTimeMillis() - startTimeOfIter}" +
    +          s" at ${getClass}.split")
    +    }
    +
    +    val vectors = centers.map(center => Vectors.fromBreeze(center))
    +    val nodes = centers.size match {
    +      case 1 => Array(new ClusterTree(vectors(0), data))
    +      case 2 => {
    +        val closest = data.map(p => (ClusterTree.findClosestCenter(metric)(centers)(p), p))
    +        centers.zipWithIndex.map { case (center, i) =>
    +          val subData = closest.filter(_._1 == i).map(_._2)
    +          subData.cache
    +          new ClusterTree(vectors(i), subData)
    +        }
    +      }
    +      case _ => throw new RuntimeException(s"something wrong with # centers:${centers.size}")
    +    }
    +    logInfo(s"${this.getClass.getSimpleName}.split end" +
    +        s" with total iterations" +
    +        s" for ${System.currentTimeMillis() - startTime}")
    +    nodes
    +  }
    +}
    +
    +/**
    + * top-level methods for calling the hierarchical clustering algorithm
    + */
    +object HierarchicalClustering {
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data
    +   *
    +   * @param data training data
    +   * @param numClusters the maximum number of clusters you want
    +   * @return a hierarchical clustering model
    +   */
    +  def train(data: RDD[Vector], numClusters: Int): HierarchicalClusteringModel = {
    +    val app = new HierarchicalClustering().setNumClusters(numClusters)
    +    app.run(data)
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data
    +   *
    +   * @param data training data
    +   * @param numClusters the maximum number of clusters you want
    +   * @param subIterations the maximum number of iterations at each bisecting step
    +   * @param numRetries the number of retries when a split fails
    +   * @param epsilon the relative error at which bisecting is considered converged
    +   * @param randomSeed the random seed used to generate the initial vectors for each bisecting
    +   * @param randomRange the range of error used to generate the initial vectors for each bisecting
    +   * @return a hierarchical clustering model
    +   */
    +  def train(
    +    data: RDD[Vector],
    +    numClusters: Int,
    +    subIterations: Int,
    +    numRetries: Int,
    +    epsilon: Double,
    +    randomSeed: Int,
    +    randomRange: Double): HierarchicalClusteringModel = {
    +    val algo = new HierarchicalClustering()
    +        .setNumClusters(numClusters)
    +        .setSubIterations(subIterations)
    +        .setNumRetries(numRetries)
    +        .setEpsilon(epsilon)
    +        .setRandomSeed(randomSeed)
    +        .setRandomRange(randomRange)
    +    algo.run(data)
    +  }
    +}
    +
    +
    +/**
    + * A cluster as a tree node which can have its sub nodes
    + *
    + * @param center the center of the cluster
    + * @param data the data in the cluster
    + * @param height distance between sub nodes
    + * @param variance the statistic used for splitting the cluster
    + * @param dataSize the size of its data
    + * @param children the sub node(s) of the cluster
    + * @param parent the parent node of the cluster
    + * @param isVisited a flag indicating whether the node has been visited during the search
    + */
    +private[mllib]
    +class ClusterTree private (
    +  val center: Vector,
    +  private[mllib] val data: RDD[BV[Double]],
    +  private[mllib] var height: Option[Double],
    +  private[mllib] var variance: Option[Double],
    +  private[mllib] var dataSize: Option[Long],
    +  private[mllib] var children: List[ClusterTree],
    +  private[mllib] var parent: Option[ClusterTree],
    +  private[mllib] var isVisited: Boolean) extends Serializable with Cloneable with Logging {
    +
    +  def this(center: Vector, data: RDD[BV[Double]]) =
    +    this(center, data, None, None, None, List.empty[ClusterTree], None, false)
    +
    +  override def clone(): ClusterTree = {
    +    val cloned = new ClusterTree(
    +      this.center,
    +      this.data,
    +      this.height,
    +      this.variance,
    +      this.dataSize,
    +      List.empty[ClusterTree],
    +      None,
    +      this.isVisited
    +    )
    +    val clonedChildren = this.children.map(child => child.clone()).toList
    +    cloned.insert(clonedChildren)
    +    cloned
    +  }
    +
    +  override def toString(): String = {
    +    val elements = Array(
    +      s"hashCode:${this.hashCode()}",
    +      s"depth:${this.getDepth()}",
    +      s"dataSize:${this.dataSize.get}",
    +      s"variance:${this.variance.get}",
    +      s"parent:${this.parent.hashCode()}",
    +      s"children:${this.children.map(_.hashCode())}",
    +      s"isLeaf:${this.isLeaf()}",
    +      s"isVisited:${this.isVisited}"
    +    )
    +    elements.mkString(", ")
    +  }
    +
    +  /**
    +   * Cuts a cluster tree
    +   *
    +   * @param height the threshold of height to cut a cluster tree
    +   * @return a cut hierarchical clustering model
    +   */
    +  private[mllib] def cut(height: Double): ClusterTree = {
    +    this.children.foreach { child =>
    +      if (child.getHeight() < height && child.children.size > 0) {
    +        child.children.foreach(grandchild => child.delete(grandchild))
    +      }
    +    }
    +    this.children.foreach(child => child.cut(height))
    +    this
    +  }
    +
    +  /**
    +   * Inserts sub nodes as its children
    +   *
    +   * @param children inserted sub nodes
    +   */
    +  def insert(children: List[ClusterTree]): Unit = {
    +    this.children = this.children ++ children
    +    children.foreach(child => child.parent = Some(this))
    +  }
    +
    +  /**
    +   * Inserts a sub node as its child
    +   *
    +   * @param child inserted sub node
    +   */
    +  def insert(child: ClusterTree): Unit = insert(List(child))
    +
    +  /** Deletes all children */
    +  def delete() = this.children = List.empty[ClusterTree]
    +
    +  /** Deletes a child */
    +  def delete(target: ClusterTree) {
    +    this.children.contains(target) match {
    +      case true => this.children = this.children.filter(child => child != target)
    +      case false => logWarning("You attempted to delete a node which is not contained")
    +    }
    +  }
    +
    +  /**
    +   * Converts the tree into a Seq;
    +   * the sub nodes are recursively expanded
    +   *
    +   * @return a Seq into which the cluster tree is expanded
    +   */
    +  def toSeq(): Seq[ClusterTree] = {
    +    val seq = this.children.size match {
    +      case 0 => Seq(this)
    +      case _ => Seq(this) ++ this.children.map(child => child.toSeq()).flatten
    +    }
    +    seq.sortWith { case (a, b) =>
    +      a.getDepth() < b.getDepth() &&
    +          breezeNorm(a.center.toBreeze, 2) < breezeNorm(b.center.toBreeze, 2)
    +    }
    +  }
    +
    +  /**
    +   * Gets all the clusters which are leaves in the cluster tree
    +   * @return the Seq of the clusters
    +   */
    +  def getClusters(): Seq[ClusterTree] = toSeq().filter(_.isLeaf())
    +
    +  /**
    +   * Gets the depth of the cluster in the tree
    +   *
    +   * @return the depth
    +   */
    +  def getDepth(): Int = {
    +    this.parent match {
    +      case None => 0
    +      case _ => 1 + this.parent.get.getDepth()
    +    }
    +  }
    +
    +  /**
    +   * Gets the dendrogram height of the cluster in the cluster tree
    +   *
    +   * @return the dendrogram height
    +   */
    +  def getHeight(): Double = {
    +    this.children.size match {
    +      case 0 => 0.0
    +      case _ => this.height.get + this.children.map(_.getHeight()).max
    +    }
    +  }
    +
    +  /**
    +   * Assigns the closest cluster with a vector
    +   * @param metric distance metric
    +   * @param v the vector you want to assign to
    +   * @return the closest cluster
    +   */
    +  private[mllib]
    +  def assignCluster(metric: Function2[BV[Double], BV[Double], Double])(v: Vector): ClusterTree = {
    +    this.children.size match {
    +      case 0 => this
    +      case size if size > 0 => {
    +        val distances = this.children.map(tree => metric(tree.center.toBreeze, v.toBreeze))
    +        val minIndex = distances.indexOf(distances.min)
    +        this.children(minIndex).assignCluster(metric)(v)
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Assigns the closest cluster index of the clusters with a vector
    +   * @param metric distance metric
    --- End diff --
    
    Insert line break




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60460281
  
    Jenkins, add to whitelist.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60268305
  
    I just gave this a quick read-through, and the structure makes sense. I left several small comments. I see the chunks of logic I would expect, but did not evaluate them in detail. The existence of some tests suggests this probably basically works :) I am also wondering about performance, since this relies on Scala idioms in many places; it might be worth a quick look with JProfiler, if you can, to see whether there are any easy-win optimizations.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19396451
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala ---
    @@ -0,0 +1,79 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * This class represents a hierarchical clustering model
    + *
    + * @param clusterTree a cluster as a tree node
    + * @param trainTime the training time in milliseconds
    + * @param predictTime the prediction time in milliseconds
    + * @param isTrained true if the model has been trained
    + */
    +class HierarchicalClusteringModel private (
    +  val clusterTree: ClusterTree,
    +  var trainTime: Int,
    +  var predictTime: Int,
    +  var isTrained: Boolean) extends Serializable {
    +
    +  def this(clusterTree: ClusterTree) = this(clusterTree, 0, 0, false)
    +
    +  def getClusters(): Array[ClusterTree] = clusterTree.getClusters().toArray
    +
    +  def getCenters(): Array[Vector] = getClusters().map(_.center)
    +
    +  /**
    +   * Predicts the closest cluster for a given point
    +   */
    +  def predict(vector: Vector): Int = {
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    this.clusterTree.assignClusterIndex(metric)(vector)
    +  }
    +
    +  /**
    +   * Predicts the closest cluster for each point
    +   */
    +  def predict(data: RDD[Vector]): RDD[(Int, Vector)] = {
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    val centers = getClusters().map(_.center.toBreeze)
    +    val treeRoot = this.clusterTree
    +    val closestClusterIndexFinder = treeRoot.assignClusterIndex(metric) _
    +    data.sparkContext.broadcast(closestClusterIndexFinder)
    +    val predicted = data.map(point => (closestClusterIndexFinder(point), point))
    --- End diff --
    
    Modified the way `broadcast` is used (see the sketch below):
    https://github.com/yu-iskw/spark/commit/290d492c1c2d193ddf399b623fbdd97186bc1e75
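
    For readers following the thread: the usual pattern is to keep the `Broadcast`
    handle returned by `sc.broadcast` and to read it back with `.value` inside the
    closure that runs on the executors; calling `broadcast` without using the
    returned handle has no effect. A minimal sketch in scala, assuming the
    `closestClusterIndexFinder` defined above:

        // Hypothetical sketch (not the exact commit): capture the Broadcast
        // handle and dereference it with .value inside the map closure.
        val bcFinder = data.sparkContext.broadcast(closestClusterIndexFinder)
        val predicted = data.map(point => (bcFinder.value(point), point))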




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60566367
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22290/
    Test FAILed.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22633951
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala ---
    @@ -0,0 +1,126 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * This class represents a hierarchical clustering model
    + *
    + * @param clusterTree a cluster as a tree node
    + * @param isTrained true if the model has been trained
    + */
    +class HierarchicalClusteringModel private (
    +  val clusterTree: ClusterTree,
    +  private[mllib] var isTrained: Boolean) extends Serializable with Logging with Cloneable {
    +
    +  def this(clusterTree: ClusterTree) = this(clusterTree, false)
    +
    +  override def clone(): HierarchicalClusteringModel = {
    +    new HierarchicalClusteringModel(this.clusterTree.clone(), true)
    +  }
    +
    +  /**
    +   * Cuts a cluster tree by given threshold of dendrogram height
    +   *
    +   * @param height a threshold to cut a cluster tree
    +   * @return a hierarchical clustering model
    +   */
    +  def cut(height: Double): HierarchicalClusteringModel = {
    +    val cloned = this.clone()
    +    cloned.clusterTree.cut(height)
    +    cloned
    +  }
    +
    +  /**
    +   * Predicts the closest cluster for a given point
    +   */
    +  def predict(vector: Vector): Int = {
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    this.clusterTree.assignClusterIndex(metric)(vector)
    +  }
    +
    +  /**
    +   * Predicts the closest cluster for each point
    +   */
    +  def predict(data: RDD[Vector]): RDD[(Int, Vector)] = {
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val sc = data.sparkContext
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    val treeRoot = this.clusterTree
    +    sc.broadcast(metric)
    +    sc.broadcast(treeRoot)
    +    val predicted = data.map(point => (treeRoot.assignClusterIndex(metric)(point), point))
    +
    +    val predictTime = System.currentTimeMillis() - startTime
    +    logInfo(s"Predicting Time: ${predictTime.toDouble / 1000} [sec]")
    +
    +    predicted
    +  }
    +
    +  /** Maps given points to their cluster indices. */
    +  def predict(points: JavaRDD[Vector]): JavaRDD[java.lang.Integer] =
    +    predict(points.rdd).map(_._1).toJavaRDD().asInstanceOf[JavaRDD[java.lang.Integer]]
    +
    +  /**
    +   * Computes the sum of the variances of all clusters
    +   */
    +  def getSumOfVariance(): Double = this.getClusters().map(_.getVariance().get).sum
    +
    +  def getClusters(): Array[ClusterTree] = clusterTree.getClusters().toArray
    +
    +  def getCenters(): Array[Vector] = getClusters().map(_.center)
    +
    +  /**
    +   * Converts the clustering result to a merge list
    +   * The returned data format fits scipy's dendrogram function
    --- End diff --
    
    I think it's a little weird to justify this based on a connection to scipy, and to reference that code so explicitly. This is primarily scala code, after all =) More importantly, the basic logic of this data structure is quite general, and is used in at least scipy and matlab (and possibly also R?). I'd instead give a longer description of how the list is organized here in the doc, and maybe mention that it is used by other libraries.
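
    For concreteness, a hedged sketch (not the PR's code) of how such a merge list
    is commonly organized, following the convention shared by scipy and matlab:
    with n leaves numbered 0 to n - 1, each row records the two nodes being merged,
    the distance between them, and the size of the resulting subtree, and the i-th
    merge creates node n + i:

        // Hypothetical 3-leaf example of List[(node1, node2, distance, tree size)]:
        val mergeList: List[(Int, Int, Double, Int)] = List(
          (0, 2, 0.8, 2), // leaves 0 and 2 merge into node 3 (size 2)
          (1, 3, 1.5, 3)  // leaf 1 merges with node 3 into node 4 (size 3)
        )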




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19288355
  
    --- Diff: docs/mllib-clustering.md ---
    @@ -153,3 +157,152 @@ provided in the [Self-Contained Applications](quick-start.html#self-contained-ap
     section of the Spark
     Quick Start guide. Be sure to also include *spark-mllib* to your build file as
     a dependency.
    +
    +
    +### Hierarchical Clustering
    +
    +MLlib supports
    +[hierarchical clustering](http://en.wikipedia.org/wiki/Hierarchical_clustering), one of the most commonly used clustering algorithms, which seeks to build a hierarchy of clusters.
    +Strategies for hierarchical clustering generally fall into two types.
    +One is agglomerative clustering, a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
    +The other is divisive clustering, a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
    +The MLlib implementation only includes a divisive hierarchical clustering algorithm.
    +
    +The implementation in MLlib has the following parameters:
    +
    +* *k* is the maximum number of desired clusters.
    +* *subIterations* is the maximum number of iterations to split a cluster into its 2 sub-clusters.
    +* *numRetries* is the maximum number of retries if a splitting doesn't work as expected.
    +* *epsilon* is the threshold used to decide that the splitting has converged.
    +
    +
    +
    +### Hierarchical Clustering Example
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +The following code snippets can be executed in `spark-shell`.
    +
    +In the following example, after loading and parsing data, 
    +we use the hierarchical clustering object to cluster the sample data into three clusters. 
    +The number of desired clusters is passed to the algorithm. 
    +Hoerver, even though the number of clusters is less than *k* in the middle of the clustering,
    --- End diff --
    
    Hoerver -> However, and 'not be splitted' -> 'not be split'




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22632220
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/mllib/JavaHierarchicalClustering.java ---
    @@ -0,0 +1,73 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib;
    +
    +import org.apache.spark.SparkConf;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.mllib.clustering.HierarchicalClustering;
    +import org.apache.spark.mllib.clustering.HierarchicalClusteringModel;
    +import org.apache.spark.mllib.linalg.Vector;
    +import org.apache.spark.mllib.linalg.Vectors;
    +
    +public class JavaHierarchicalClustering {
    --- End diff --
    
    Would it be possible to also add a similar example in scala? At least for MLlib, there are examples for almost all algorithms in scala, and then a subset of examples Java.
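
    A rough sketch of what such a scala example might look like, built only from
    the API surface visible in this PR (`HierarchicalClustering.train` and
    `getCenters`); the input path and the parsing are placeholders:

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.mllib.clustering.HierarchicalClustering
        import org.apache.spark.mllib.linalg.Vectors

        object HierarchicalClusteringExample {
          def main(args: Array[String]) {
            val sc = new SparkContext(new SparkConf().setAppName("HierarchicalClusteringExample"))
            // Placeholder input: one space-separated dense vector per line.
            val data = sc.textFile("data/mllib/kmeans_data.txt")
              .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
              .cache()

            // Cluster the data into at most three clusters.
            val model = HierarchicalClustering.train(data, 3)
            model.getCenters().foreach(println)
            sc.stop()
          }
        }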




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60546522
  
      [Test build #22268 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22268/consoleFull) for   PR 2906 at commit [`b014f50`](https://github.com/apache/spark/commit/b014f500112df597edfbe1a5cef8c02e06b1bbb0).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaHierarchicalClustering `
      * `class HierarchicalClusteringConf(`
      * `class HierarchicalClustering(val conf: HierarchicalClusteringConf)`
      * `class ClusterTree(`
      * `class ClusteringModel(object):`
      * `class KMeansModel(ClusteringModel):`
      * `class HierarchicalClusteringModel(ClusteringModel):`
      * `class HierarchicalClustering(object):`





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22641414
  
    --- Diff: mllib/src/test/java/org/apache/spark/mllib/clustering/JavaHierarchicalClusteringSuite.java ---
    @@ -0,0 +1,77 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering;
    +
    +import com.google.common.collect.Lists;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.mllib.linalg.Vector;
    +import org.apache.spark.mllib.linalg.Vectors;
    +import org.junit.After;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.io.Serializable;
    +import java.util.List;
    +
    +import static org.junit.Assert.assertEquals;
    +
    +public class JavaHierarchicalClusteringSuite implements Serializable {
    +  private transient JavaSparkContext sc;
    +
    +  @Before
    +  public void setUp() {
    +    sc = new JavaSparkContext("local", "JavaHierarchicalClustering");
    +  }
    +
    +  @After
    +  public void tearDown() {
    +    sc.stop();
    +    sc = null;
    +  }
    +
    +  @Test
    +  public void runHierarchicalClusteringConstructor() {
    +    List<Vector> points = Lists.newArrayList(
    +        Vectors.dense(1.0, 2.0, 6.0),
    +        Vectors.dense(1.0, 3.0, 0.0),
    +        Vectors.dense(1.0, 4.0, 6.0)
    +    );
    +    Vector expectedCenter = Vectors.dense(1.0, 3.0, 4.0);
    +
    +    JavaRDD<Vector> data = sc.parallelize(points, 2);
    +    HierarchicalClusteringModel model = HierarchicalClustering.train(data.rdd(), 1);
    +    assertEquals(1, model.getCenters().length);
    +    assertEquals(expectedCenter, model.getCenters()[0]);
    +  }
    +
    +  @Test
    +  public void predictJavaRDD() {
    +    List<Vector> points = Lists.newArrayList(
    +        Vectors.dense(1.0, 2.0, 6.0),
    +        Vectors.dense(1.0, 3.0, 0.0),
    +        Vectors.dense(1.0, 4.0, 6.0)
    +    );
    +    JavaRDD<Vector> data = sc.parallelize(points, 2);
    +    HierarchicalClustering algo = new HierarchicalClustering().setNumClusters(1);
    +    HierarchicalClusteringModel model = algo.run(data.rdd());
    +    JavaRDD<Integer> predictions = model.predict(data);
    +    // Should be able to get the first prediction.
    +    predictions.first();
    --- End diff --
    
    assert what the first one is? or is it not stable enough to reliably test for?
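
    If the assignment is deterministic with a single cluster, the check could be
    tightened along these lines (a scala sketch for brevity; it assumes every point
    receives the same index rather than a particular index value):

        // With one cluster, all points should map to the same cluster index.
        val predictions = model.predict(data.rdd).collect()
        assert(predictions.forall(_._1 == predictions.head._1))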




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60547114
  
      [Test build #22270 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22270/consoleFull) for   PR 2906 at commit [`8dbbacd`](https://github.com/apache/spark/commit/8dbbacd2e7f27e111b7237006fde73d1cf3eb5e7).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaHierarchicalClustering `
      * `class HierarchicalClusteringConf(`
      * `class HierarchicalClustering(val conf: HierarchicalClusteringConf)`
      * `class ClusterTree(`
      * `class ClusteringModel(object):`
      * `class KMeansModel(ClusteringModel):`
      * `class HierarchicalClusteringModel(ClusteringModel):`
      * `class HierarchicalClustering(object):`





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19396833
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,549 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * The configuration for a hierarchical clustering algorithm
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the maximum number of iterations when splitting a cluster
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClusteringConf(
    +  private var numClusters: Int,
    +  private var subIterations: Int,
    +  private var numRetries: Int,
    +  private var epsilon: Double,
    +  private var randomSeed: Int,
    +  private[mllib] var randomRange: Double) extends Serializable {
    +
    +  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setSubIterations(iterations: Int): this.type = {
    +    this.subIterations = iterations
    +    this
    +  }
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
    + *
    + * @param conf the configuration class for the hierarchical clustering
    + */
    +class HierarchicalClustering(val conf: HierarchicalClusteringConf)
    +    extends Serializable with Logging {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(new HierarchicalClusteringConf())
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${conf.toString}")
    --- End diff --
    
    I added a `toString` method to `HierarchicalClustering`.
    https://github.com/yu-iskw/spark/commit/2898c3fb0b99697f5600f584f7051b12830a75e0




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60566259
  
      [Test build #22290 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22290/consoleFull) for   PR 2906 at commit [`2676166`](https://github.com/apache/spark/commit/2676166ba6f307b4605ea1e7ecf6ece5b9e200b3).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22633997
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala ---
    @@ -0,0 +1,126 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * This class represents a hierarchical clustering model
    + *
    + * @param clusterTree a cluster as a tree node
    + * @param isTrained true if the model has been trained
    + */
    +class HierarchicalClusteringModel private (
    +  val clusterTree: ClusterTree,
    +  private[mllib] var isTrained: Boolean) extends Serializable with Logging with Cloneable {
    +
    +  def this(clusterTree: ClusterTree) = this(clusterTree, false)
    +
    +  override def clone(): HierarchicalClusteringModel = {
    +    new HierarchicalClusteringModel(this.clusterTree.clone(), true)
    +  }
    +
    +  /**
    +   * Cuts a cluster tree by given threshold of dendrogram height
    +   *
    +   * @param height a threshold to cut a cluster tree
    +   * @return a hierarchical clustering model
    +   */
    +  def cut(height: Double): HierarchicalClusteringModel = {
    +    val cloned = this.clone()
    +    cloned.clusterTree.cut(height)
    +    cloned
    +  }
    +
    +  /**
    +   * Predicts the closest cluster for a given point
    +   */
    +  def predict(vector: Vector): Int = {
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    this.clusterTree.assignClusterIndex(metric)(vector)
    +  }
    +
    +  /**
    +   * Predicts the closest cluster for each point
    +   */
    +  def predict(data: RDD[Vector]): RDD[(Int, Vector)] = {
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val sc = data.sparkContext
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    val treeRoot = this.clusterTree
    +    sc.broadcast(metric)
    +    sc.broadcast(treeRoot)
    +    val predicted = data.map(point => (treeRoot.assignClusterIndex(metric)(point), point))
    +
    +    val predictTime = System.currentTimeMillis() - startTime
    +    logInfo(s"Predicting Time: ${predictTime.toDouble / 1000} [sec]")
    +
    +    predicted
    +  }
    +
    +  /** Maps given points to their cluster indices. */
    +  def predict(points: JavaRDD[Vector]): JavaRDD[java.lang.Integer] =
    +    predict(points.rdd).map(_._1).toJavaRDD().asInstanceOf[JavaRDD[java.lang.Integer]]
    +
    +  /**
    +   * Computes the sum of the variances of all clusters
    +   */
    +  def getSumOfVariance(): Double = this.getClusters().map(_.getVariance().get).sum
    +
    +  def getClusters(): Array[ClusterTree] = clusterTree.getClusters().toArray
    +
    +  def getCenters(): Array[Vector] = getClusters().map(_.center)
    +
    +  /**
    +   * Converts the clustering result to a merge list
    +   * The returned data format fits scipy's dendrogram function
    +   * SEE ALSO: scipy.cluster.hierarchy.dendrogram
    +   *
    +   * @return List[(node1, node2, distance, tree size)]
    +   */
    +  def toMergeList(): List[(Int, Int, Double, Int)] = {
    --- End diff --
    
    Consider renaming -> `toLinkageMatrix`? I think that's a more general term for this data structure. Would require renaming here and elsewhere (e.g. in the Python code).




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22632647
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,627 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * This trait is used for the configuration of the hierarchical clustering
    + */
    +sealed trait HierarchicalClusteringConf extends Serializable {
    +  this: HierarchicalClustering =>
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def setSubIterations(subIterations: Int): this.type = {
    +    this.subIterations = subIterations
    +    this
    +  }
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
    + *
    + * The main idea of this algorithm is derived from:
    + * "A comparison of document clustering techniques",
    + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
    + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the maximum number of iterations when splitting a cluster
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClustering(
    +  private[mllib] var numClusters: Int,
    --- End diff --
    
    Indent these variable definition lines by 4 spaces.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r22632919
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,627 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * This trait is used for the configuration of the hierarchical clustering
    + */
    +sealed trait HierarchicalClusteringConf extends Serializable {
    +  this: HierarchicalClustering =>
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def setSubIterations(subIterations: Int): this.type = {
    +    this.subIterations = subIterations
    +    this
    +  }
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
    + *
    + * The main idea of this algorithm is derived from:
    + * "A comparison of document clustering techniques",
    + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
    + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the maximum number of iterations when splitting a cluster
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed used in sampling data for initializing centers in each sub-iteration
    + * @param randomRange the range coefficient to generate random points in each clustering step
    + */
    +class HierarchicalClustering(
    +  private[mllib] var numClusters: Int,
    +  private[mllib] var subIterations: Int,
    +  private[mllib] var numRetries: Int,
    +  private[mllib] var epsilon: Double,
    +  private[mllib] var randomSeed: Int,
    +  private[mllib] var randomRange: Double)
    +    extends Serializable with Logging with HierarchicalClusteringConf {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
    +
    +  /** Shows the parameters */
    +  override def toString(): String = {
    +    Array(
    +      s"numClusters:${numClusters}",
    +      s"subIterations:${subIterations}",
    +      s"numRetries:${numRetries}",
    +      s"epsilon:${epsilon}",
    +      s"randomSeed:${randomSeed}",
    +      s"randomRange:${randomRange}"
    +    ).mkString(", ")
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${this}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // If the following conditions are satisfied, stop the training:
    +    //   1. There is no splittable cluster
    +    //   2. The number of split clusters is greater than the number of given clusters
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.numClusters) {
    +
    +      // split several times to avoid a poor clustering result
    +      var isMerged = false
    +      for (i <- 1 to this.numRetries) {
    +        if (node.get.getVariance().get > this.epsilon && isMerged == false) {
    +          var subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          if (subNodes.size == 2) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            // unpersist unnecessary cache because its children nodes are cached
    +            node.get.data.unpersist()
    +            logInfo(s"the number of cluster is ${model.clusterTree.getTreeSize()} at step ${step}")
    +            isMerged = true
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    val trainTime = (System.currentTimeMillis() - startTime).toInt
    +    logInfo(s"Elapsed Time for Training: ${trainTime.toDouble / 1000} [sec]")
    +    model
    +  }
    +
    +  /**
    +   * Validates the given training data
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    require(this.numClusters <= data.count(), "# clusters must be less than # data rows")
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the leaf cluster with the maximum variance
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bisecting k-means
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bisecting k-means
    +   *
    +   * @param clusterTree the cluster to split
    +   * @return an array of ClusterTree. Its size is generally 2, but can be 1
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    val sc = data.sparkContext
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    sc.broadcast(metric)
    +
    +    // If the following conditions are satisfied, the iteration is stopped
    +    //   1. the relative error is less than the configured threshold
    +    //   2. the number of executed iterations is greater than the configured maximum
    +    //   3. the number of centers is one, which means the cluster is not splittable
    +    var numIter = 0
    +    var error = Double.MaxValue
    +    while (error > this.epsilon
    +        && numIter < this.subIterations
    +        && centers.size > 1) {
    +      val startTimeOfIter = System.currentTimeMillis()
    +
    +      sc.broadcast(centers)
    +      val newCenters = data.mapPartitions { iter =>
    +        // accumulate the sum of all points in a partition and count the rows
    +        val map = scala.collection.mutable.Map.empty[Int, (BV[Double], Int)]
    +        iter.foreach { point =>
    +          val idx = ClusterTree.findClosestCenter(metric)(centers)(point)
    +          val (sumBV, n) = map.get(idx)
    +              .getOrElse((new BSV[Double](Array(), Array(), point.size), 0))
    +          map(idx) = (sumBV + point, n + 1)
    +        }
    +        map.toIterator
    +      }.reduceByKeyLocally {
    +        // sum the accumulations and the counts across all partitions
    +        case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2)
    +      }.map { case ((idx: Int, (center: BV[Double], counts: Int))) =>
    +        center :/ counts.toDouble
    +      }
    +
    +      val normSum = centers.map(v => breezeNorm(v, 2.0)).sum
    +      val newNormSum = newCenters.map(v => breezeNorm(v, 2.0)).sum
    +      error = math.abs((normSum - newNormSum) / normSum)
    +      centers = newCenters.toArray
    +      numIter += 1
    +
    +      logInfo(s"${numIter} iterations is finished" +
    +          s" for ${System.currentTimeMillis() - startTimeOfIter}" +
    +          s" at ${getClass}.split")
    +    }
    +
    +    val vectors = centers.map(center => Vectors.fromBreeze(center))
    +    val nodes = centers.size match {
    +      case 1 => Array(new ClusterTree(vectors(0), data))
    +      case 2 => {
    +        val closest = data.map(p => (ClusterTree.findClosestCenter(metric)(centers)(p), p))
    +        centers.zipWithIndex.map { case (center, i) =>
    +          val subData = closest.filter(_._1 == i).map(_._2)
    +          subData.cache
    +          new ClusterTree(vectors(i), subData)
    +        }
    +      }
    +      case _ => throw new RuntimeException(s"something wrong with # centers:${centers.size}")
    +    }
    +    logInfo(s"${this.getClass.getSimpleName}.split end" +
    +        s" with total iterations" +
    +        s" for ${System.currentTimeMillis() - startTime}")
    +    nodes
    +  }
    +}
    +
    +/**
    + * top-level methods for calling the hierarchical clustering algorithm
    + */
    +object HierarchicalClustering {
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data
    +   *
    +   * @param data training data
    +   * @param numClusters the maximum number of clusters you want
    +   * @return a hierarchical clustering model
    +   */
    +  def train(data: RDD[Vector], numClusters: Int): HierarchicalClusteringModel = {
    +    val app = new HierarchicalClustering().setNumClusters(numClusters)
    +    app.run(data)
    +  }
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given data
    +   *
    +   * @param data training data
    +   * @param numClusters the maximum number of clusters you want
    +   * @param subIterations the maximum number of iterations to split a cluster
    +   * @param numRetries the number of retries when the clustering can't be succeeded
    --- End diff --
    
    Rephrase -> "the number of retries to perform if the clustering does not succeed"




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-62323346
  
      [Test build #23124 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23124/consoleFull) for   PR 2906 at commit [`cfdf842`](https://github.com/apache/spark/commit/cfdf8429bf4afb3e7a6a329dd285fe48429aec46).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-62332985
  
      [Test build #23125 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23125/consoleFull) for   PR 2906 at commit [`b0b061e`](https://github.com/apache/spark/commit/b0b061edc4c2ad42deda00bb664534e1334b50e5).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaHierarchicalClustering `
      * `trait HierarchicalClusteringConf extends Serializable `
      * `class HierarchicalClustering(`
      * `class HierarchicalClusteringModel(object):`
      * `class HierarchicalClustering(object):`





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by yu-iskw <gi...@git.apache.org>.
Github user yu-iskw commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-62333314
  
    @srowen and @rnowling , 
    Sorry for my complicated commits. I modified my source code. Could you review my PR?
    
    - I modified what you pointed out.
    - I added a function to cut the cluster tree of a trained hierarchical clustering model by a dendrogram height (see the sketch after this message).
    - I rebased my PR onto the latest master branch and then force-pushed my branch, because there were a few conflicts with it.
    
    Thanks,
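
    For readers skimming the thread, a hedged sketch of the new cut-by-height API
    described above (the height value is illustrative):

        // Train once, then cut the dendrogram at a given height to obtain a
        // coarser model without re-running the clustering.
        val model = HierarchicalClustering.train(data, 20)
        val coarser = model.cut(5.0)
        println(coarser.getCenters().length)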




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60575130
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22291/
    Test PASSed.




[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2906#issuecomment-60575123
  
      [Test build #22291 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22291/consoleFull) for   PR 2906 at commit [`8be11da`](https://github.com/apache/spark/commit/8be11da1f045e9ffc8c56886eea7c133aefe3eaf).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaHierarchicalClustering `
      * `trait HierarchicalClusteringConf extends Serializable `
      * `class HierarchicalClustering(`
      * `class ClusteringModel(object):`
      * `class KMeansModel(ClusteringModel):`
      * `class HierarchicalClusteringModel(ClusteringModel):`
      * `class HierarchicalClustering(object):`





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2906#discussion_r19289138
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
    @@ -0,0 +1,549 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.random.XORShiftRandom
    +
    +/**
    + * The configuration for a hierarchical clustering algorithm
    + *
    + * @param numClusters the number of clusters you want
    + * @param subIterations the maximum number of iterations for each split
    + * @param numRetries the number of times a failed split is retried
    + * @param epsilon the threshold to stop the sub-iterations
    + * @param randomSeed the seed used when sampling data to initialize the centers in each split
    + * @param randomRange the range coefficient used to generate the random initial centers in each clustering step
    +class HierarchicalClusteringConf(
    +  private var numClusters: Int,
    +  private var subIterations: Int,
    +  private var numRetries: Int,
    +  private var epsilon: Double,
    +  private var randomSeed: Int,
    +  private[mllib] var randomRange: Double) extends Serializable {
    +
    +  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
    +
    +  def setNumClusters(numClusters: Int): this.type = {
    +    this.numClusters = numClusters
    +    this
    +  }
    +
    +  def getNumClusters(): Int = this.numClusters
    +
    +  def setSubIterations(iterations: Int): this.type = {
    +    this.subIterations = iterations
    +    this
    +  }
    +
    +  def setNumRetries(numRetries: Int): this.type = {
    +    this.numRetries = numRetries
    +    this
    +  }
    +
    +  def getNumRetries(): Int = this.numRetries
    +
    +  def getSubIterations(): Int = this.subIterations
    +
    +  def setEpsilon(epsilon: Double): this.type = {
    +    this.epsilon = epsilon
    +    this
    +  }
    +
    +  def getEpsilon(): Double = this.epsilon
    +
    +  def setRandomSeed(seed: Int): this.type = {
    +    this.randomSeed = seed
    +    this
    +  }
    +
    +  def getRandomSeed(): Int = this.randomSeed
    +
    +  def setRandomRange(range: Double): this.type = {
    +    this.randomRange = range
    +    this
    +  }
    +}
    +
    +
    +/**
    + * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
    + *
    + * @param conf the configuration class for the hierarchical clustering
    + */
    +class HierarchicalClustering(val conf: HierarchicalClusteringConf)
    +    extends Serializable with Logging {
    +
    +  /**
    +   * Constructs with the default configuration
    +   */
    +  def this() = this(new HierarchicalClusteringConf())
    +
    +  /**
    +   * Trains a hierarchical clustering model with the given configuration
    +   *
    +   * @param data training points
    +   * @return a model for hierarchical clustering
    +   */
    +  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
    +    validateData(data)
    +    logInfo(s"Run with ${conf.toString}")
    +
    +    val startTime = System.currentTimeMillis() // to measure the execution time
    +    val clusterTree = ClusterTree.fromRDD(data) // make the root node
    +    val model = new HierarchicalClusteringModel(clusterTree)
    +    val statsUpdater = new ClusterTreeStatsUpdater()
    +
    +    var node: Option[ClusterTree] = Some(model.clusterTree)
    +    statsUpdater(node.get)
    +
    +    // Training stops when any of the following conditions is met:
    +    //   1. There is no splittable cluster
    +    //   2. The number of clusters reaches the requested number
    +    //   3. The total variance over all clusters increases when a cluster is split
    +    var totalVariance = Double.MaxValue
    +    var newTotalVariance = model.clusterTree.getVariance().get
    +    var step = 1
    +    while (node != None
    +        && model.clusterTree.getTreeSize() < this.conf.getNumClusters
    +        && totalVariance >= newTotalVariance) {
    +
    +      // retry the split several times so that one bad split does not distort the clustering result
    +      var isMerged = false
    +      var isSingleCluster = false
    +      for (retry <- 1 to this.conf.getNumRetries()) {
    +        if (!isMerged && !isSingleCluster) {
    +          val subNodes = split(node.get).map(subNode => statsUpdater(subNode))
    +          // only one sub-node means there is no splittable node here
    +          if (subNodes.size == 1) isSingleCluster = true
    +          // add the sub-nodes into the tree only if the split reduces the variance,
    +          // i.e. the parent's variance is greater than the sum of the sub-nodes' variances
    +          if (node.get.getVariance().get > subNodes.map(_.getVariance().get).sum) {
    +            // insert the nodes to the tree
    +            node.get.insert(subNodes.toList)
    +            // calculate the local dendrogram height
    +            val dist = breezeNorm(subNodes(0).center.toBreeze - subNodes(1).center.toBreeze, 2)
    +            node.get.height = Some(dist)
    +            isMerged = true
    +            logInfo(s"the number of clusters is ${model.clusterTree.getTreeSize()} at step ${step}")
    +          }
    +        }
    +      }
    +      node.get.isVisited = true
    +
    +      // update the total variance and select the next splittable node
    +      totalVariance = newTotalVariance
    +      newTotalVariance = model.clusterTree.toSeq().filter(_.isLeaf()).map(_.getVariance().get).sum
    +      node = nextNode(model.clusterTree)
    +      step += 1
    +    }
    +
    +    model.isTrained = true
    +    model.trainTime = (System.currentTimeMillis() - startTime).toInt
    +    model
    +  }
    +
    +  /**
    +   * validate the given data to train
    +   */
    +  private def validateData(data: RDD[Vector]) {
    +    conf match {
    +      case conf if conf.getNumClusters() > data.count() =>
    +        throw new IllegalArgumentException(
    +          "the number of clusters must not be greater than the number of input data records")
    +      case _ =>
    +    }
    +  }
    +
    +  /**
    +   * Selects the next node to split
    +   */
    +  private[clustering] def nextNode(clusterTree: ClusterTree): Option[ClusterTree] = {
    +    // select the splittable, not-yet-visited leaf cluster with the maximum variance
    +    clusterTree.toSeq().filter(tree => tree.isSplittable() && !tree.isVisited) match {
    +      case list if list.isEmpty => None
    +      case list => Some(list.maxBy(_.getVariance()))
    +    }
    +  }
    +
    +  /**
    +   * Takes the initial centers for bisecting k-means by perturbing the parent center:
    +   * each element is shifted down and up by a random fraction (at most `randomRange`)
    +   * of its own value
    +   */
    +  private[clustering] def takeInitCenters(centers: Vector): Array[BV[Double]] = {
    +    val random = new XORShiftRandom()
    +    Array(
    +      centers.toBreeze.map(elm => elm - random.nextDouble() * elm * this.conf.randomRange),
    +      centers.toBreeze.map(elm => elm + random.nextDouble() * elm * this.conf.randomRange)
    +    )
    +  }
    +
    +  /**
    +   * Splits the given cluster (tree) with bi-sect k-means
    +   *
    +   * @param clusterTree the cluster to split
    +   * @return an array of ClusterTree; its size is generally 2, but it can be 1 when the cluster is not splittable
    +   */
    +  private def split(clusterTree: ClusterTree): Array[ClusterTree] = {
    +    val startTime = System.currentTimeMillis()
    +    val data = clusterTree.data
    +    var centers = takeInitCenters(clusterTree.center)
    +
    +    // TODO Support distance metrics other than the Euclidean distance metric
    +    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
    +    var finder = ClusterTree.findClosestCenter(metric)(centers) _
    +
    +    // The iteration stops when any of the following conditions is met:
    +    //   1. the relative error falls below the configured epsilon
    +    //   2. the number of executed iterations reaches the configured limit
    +    //   3. only one center is left, which means the cluster is not splittable
    +    var numIter = 0
    +    var error = Double.MaxValue
    +    while (error > conf.getEpsilon()
    +        && numIter < conf.getSubIterations()
    +        && centers.size > 1) {
    +
    +      val startTimeOfIter = System.currentTimeMillis()
    +      // find the closest center for each point; broadcast the finder so that it is
    +      // shipped to the executors once per iteration instead of with every task
    +      val bcFinder = data.sparkContext.broadcast(finder)
    +      val newCenters = data.mapPartitions { iter =>
    +        // accumulate the sum of the points assigned to each center in this partition and count them
    +        val map = scala.collection.mutable.Map.empty[Int, (BV[Double], Int)]
    +        iter.foreach { point =>
    +          val idx = bcFinder.value(point)
    +          val (sumBV, n) = map.get(idx).getOrElse((BV.zeros[Double](point.size), 0))
    +          map(idx) = (sumBV + point, n + 1)
    +        }
    +        map.toIterator
    +      }.reduceByKeyLocally {
    +        // sum the accumulated vectors and the counts across all partitions
    +        case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2)
    +      }.map { case ((idx: Int, (center: BV[Double], counts: Int))) =>
    +        center :/ counts.toDouble
    +      }
    +
    +      val normSum = centers.map(v => breezeNorm(v, 2.0)).sum
    +      val newNormSum = newCenters.map(v => breezeNorm(v, 2.0)).sum
    +      error = Math.abs((normSum - newNormSum) / normSum)
    +      centers = newCenters.toArray
    +      numIter += 1
    +      finder = ClusterTree.findClosestCenter(metric)(centers) _
    +
    +      logInfo(s"iteration ${numIter} finished" +
    +          s" in ${System.currentTimeMillis() - startTimeOfIter} ms" +
    +          s" at ${getClass}.split")
    +    }
    +
    +    val vectors = centers.map(center => Vectors.fromBreeze(center))
    +    val nodes = centers.size match {
    +      case 1 => Array(new ClusterTree(vectors(0), data))
    +      case 2 => {
    +        val closest = data.map(point => (finder(point), point))
    +        centers.zipWithIndex.map { case (center, i) =>
    +          val subData = closest.filter(_._1 == i).map(_._2)
    +          subData.cache
    --- End diff --
    
    I see a number of RDDs cached in the code, but nothing unpersisted. Is it possible to unpersist them at some point here when they are definitely no longer needed?
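    
    For illustration, one common way to do that (a sketch under assumed names,
    not code from this PR) is to cache and materialize each child RDD and then
    unpersist the parent it was filtered from:
    
        // Sketch: cache the children, force their computation with an action,
        // then release the parent's cached blocks, which are no longer needed.
        import org.apache.spark.mllib.linalg.Vector
        import org.apache.spark.rdd.RDD
    
        def splitAndRelease(parent: RDD[Vector], assign: Vector => Int): Array[RDD[Vector]] = {
          val children = Array(0, 1).map { i =>
            val child = parent.filter(p => assign(p) == i)
            child.cache()
            child.count() // an action, so the child is computed and cached here
            child
          }
          parent.unpersist(blocking = false)
          children
        }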

