Posted to user@spark.apache.org by Pierce Lamb <ri...@gmail.com> on 2015/03/08 00:20:32 UTC

MLlib/kmeans newbie question(s)

Hi all,

I'm very new to machine learning algorithms and Spark. I'm following the
Twitter Streaming Language Classifier found here:

http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html

Specifically this code:

http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala

Except that I'm trying to run it in batch mode on some tweets pulled out
of Cassandra, in this case 200 tweets total.

As the example shows, I am using this object for "vectorizing" a set of tweets:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

object Utils {
  val numFeatures = 1000
  val tf = new HashingTF(numFeatures)

  /**
   * Create feature vectors by turning each tweet into bigrams of
   * characters (an n-gram model) and then hashing those to a
   * length-1000 feature vector that we can pass to MLlib.
   * This is a common way to decrease the number of features in a
   * model while still getting excellent accuracy (otherwise every
   * pair of Unicode characters would potentially be a feature).
   */
  def featurize(s: String): Vector = {
    // s.sliding(2) yields every consecutive pair of characters (bigrams);
    // HashingTF hashes each bigram into one of numFeatures buckets and counts it.
    tf.transform(s.sliding(2).toSeq)
  }
}
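
To make that concrete, here is roughly what featurize gives back for a
short string (the exact hash buckets depend on HashingTF's hash function,
so the comments below are illustrative rather than exact):

// "spark" yields the character bigrams "sp", "pa", "ar", "rk"; HashingTF
// hashes each bigram into one of 1000 buckets and counts it, so the result
// is a sparse length-1000 term-frequency vector.
val v = Utils.featurize("spark")
println(v.size)  // 1000
// v has at most 4 non-zero entries, one per distinct bigram.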

Here is my code, which is modified from ExamineAndTrain.scala:

val noSets = rawTweets.map(set => set.mkString("\n"))

val vectors = noSets.map(Utils.featurize).cache()
vectors.count()

val numClusters = 5
val numIterations = 30

val model = KMeans.train(vectors, numClusters, numIterations)

for (i <- 0 until numClusters) {
  println(s"\nCLUSTER $i")
  noSets.foreach { t =>
    if (model.predict(Utils.featurize(t)) == 1) {
      println(t)
    }
  }
}

This code runs, and each cluster prints "CLUSTER 0", "CLUSTER 1", etc.,
with nothing printed beneath. If I flip

model.predict(Utils.featurize(t)) == 1 to
model.predict(Utils.featurize(t)) == 0

the same thing happens, except every tweet is printed beneath every cluster.

Here is what I intuitively think is happening (please correct my
thinking if it's wrong): this code turns each tweet into a vector,
randomly picks some initial clusters, then runs k-means to group the
tweets (at a really high level, the clusters, I assume, would be
common "topics"). As such, when it checks each tweet to see if
model.predict == 1, different sets of tweets should appear under each
cluster (and because it's checking the training set against itself,
every tweet should be in a cluster). Why isn't it doing this? Either
my understanding of what k-means does is wrong, my training set is
too small, or I'm missing a step.
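
For what it's worth, a quick sanity check I could add (sketched against
the same model, noSets, and Utils as above) would be to count how many
tweets land in each predicted cluster instead of printing them:

// Predict a cluster index for every tweet and count assignments per cluster.
// countByValue() returns the small per-cluster count map to the driver.
val clusterCounts = noSets
  .map(t => model.predict(Utils.featurize(t)))
  .countByValue()

clusterCounts.toSeq.sorted.foreach { case (cluster, count) =>
  println(s"cluster $cluster: $count tweets")
}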

Any help is greatly appreciated.



Re: MLlib/kmeans newbie question(s)

Posted by Xiangrui Meng <me...@gmail.com>.
You need to change `== 1` to `== i`. `println(t)` happens on the
workers, which may not be what you want. Try the following:

noSets.filter(t => model.predict(Utils.featurize(t)) == i)
  .collect()
  .foreach(println)

-Xiangrui
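
Folding that suggestion back into the original loop would look roughly
like this (a sketch reusing model, numClusters, noSets, and Utils from
the post above; collect() brings each cluster's matching tweets to the
driver so println runs there rather than on the workers):

for (i <- 0 until numClusters) {
  println(s"\nCLUSTER $i")
  // Keep only the tweets assigned to cluster i, then print them on the driver.
  noSets
    .filter(t => model.predict(Utils.featurize(t)) == i)
    .collect()
    .foreach(println)
}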
