Posted to user@spark.apache.org by amin mohebbi <am...@yahoo.com.INVALID> on 2014/11/25 11:39:58 UTC

K-means clustering

I have generated a sparse matrix in Python (saved as a .pkl file) of size 4000*174000. The following is a small part of this matrix:

  (0, 45) 1
  (0, 413) 1
  (0, 445) 1
  (0, 107) 4
  (0, 80) 2
  (0, 352) 1
  (0, 157) 1
  (0, 191) 1
  (0, 315) 1
  (0, 395) 4
  (0, 282) 3
  (0, 184) 1
  (0, 403) 1
  (0, 169) 1
  (0, 267) 1
  (0, 148) 1
  (0, 449) 1
  (0, 241) 1
  (0, 303) 1
  (0, 364) 1
  (0, 257) 1
  (0, 372) 1
  (0, 73) 1
  (0, 64) 1
  (0, 427) 1
  : :
  (2, 399) 1
  (2, 277) 1
  (2, 229) 1
  (2, 255) 1
  (2, 409) 1
  (2, 355) 1
  (2, 391) 1
  (2, 28) 1
  (2, 384) 1
  (2, 86) 1
  (2, 285) 2
  (2, 166) 1
  (2, 165) 1
  (2, 419) 1
  (2, 367) 2
  (2, 133) 1
  (2, 61) 1
  (2, 434) 1
  (2, 51) 1
  (2, 423) 1
  (2, 398) 1
  (2, 438) 1
  (2, 389) 1
  (2, 26) 1
  (2, 455) 1

I am new to Spark and would like to cluster this matrix with the k-means algorithm. Can anyone explain what kinds of problems I might face? Please note that I do not want to use MLlib and would like to write my own k-means.

Best Regards

.......................................................

Amin Mohebbi
PhD candidate in Software Engineering
at university of Malaysia

Tel : +60 18 2040 017
E-Mail : TP025921@ex.apiit.edu.my
         amin_524@me.com

Re: K-means clustering

Posted by Xiangrui Meng <me...@gmail.com>.

There is a simple example here:
https://github.com/apache/spark/blob/master/examples/src/main/python/kmeans.py

You can take advantage of sparsity by computing the distance via
inner products:
http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2

-Xiangrui
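
The inner-product suggestion above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked example or from MLlib: it assumes each matrix row is held as a dict of {column index: value} and each center as a dense list, and all function names are hypothetical. It relies on the identity ||x - c||^2 = ||x||^2 - 2<x, c> + ||c||^2, so the cost per distance depends on the number of nonzeros in the row rather than on all 174000 columns, provided the center norms are computed once per iteration.

```python
# Sketch only: sparse rows are dicts {column_index: value}, centers are dense lists.
# The helper names below are hypothetical and not part of Spark or MLlib.

def sparse_dot(row, center):
    """Inner product of a sparse row with a dense center, visiting only nonzeros."""
    return sum(value * center[col] for col, value in row.items())

def squared_norm_sparse(row):
    """Squared Euclidean norm of a sparse row."""
    return sum(value * value for value in row.values())

def squared_norm_dense(center):
    """Squared Euclidean norm of a dense center."""
    return sum(c * c for c in center)

def squared_distance(row, row_norm, center, center_norm):
    """||x - c||^2 = ||x||^2 - 2 <x, c> + ||c||^2, using precomputed norms."""
    d = row_norm - 2.0 * sparse_dot(row, center) + center_norm
    return max(d, 0.0)  # clamp tiny negative values caused by rounding

def closest_center(row, centers, center_norms):
    """Return (index, squared distance) of the nearest center for one sparse row."""
    row_norm = squared_norm_sparse(row)
    best_index, best_dist = -1, float("inf")
    for i, (center, c_norm) in enumerate(zip(centers, center_norms)):
        dist = squared_distance(row, row_norm, center, c_norm)
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index, best_dist

# Toy data (made up for illustration, 4 columns instead of 174000).
rows = [{0: 1.0, 3: 2.0}, {1: 4.0}, {0: 1.0, 3: 1.0}]
centers = [[1.0, 0.0, 0.0, 2.0], [0.0, 4.0, 0.0, 0.0]]

# Center norms are computed once and reused for every row.
center_norms = [squared_norm_dense(c) for c in centers]
assignments = [closest_center(r, centers, center_norms)[0] for r in rows]
print(assignments)
```

In a Spark job, a function like closest_center could presumably be applied to each row inside an RDD map, with the centers and their norms shipped to the workers once per iteration; that wiring is left out here because it depends on how the .pkl data is loaded.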

