Posted to user@spark.apache.org by Ravishankar Rajagopalan <vi...@gmail.com> on 2014/07/17 10:48:58 UTC

Speeding up K-Means Clustering

I am trying to use MLlib for K-Means clustering on a data set with 1
million rows and 50 columns (all values are doubles), stored on HDFS
(the raw text file is 28 MB).

I initially tried the following:

import org.apache.spark.mllib.clustering.KMeans

val data3 = sc.textFile("hdfs://...inputData.txt")
// Parse each tab-separated line into an Array[Double].
val parsedData3 = data3.map(_.split('\t').map(_.toDouble))
val numIterations = 10
val numClusters = 200
val clusters = KMeans.train(parsedData3, numClusters, numIterations)

This took me nearly 850 seconds.

I then tried persisting the parsed RDD with the MEMORY_ONLY storage
level, hoping this would significantly speed up the algorithm:

import org.apache.spark.storage.StorageLevel

val data3 = sc.textFile("hdfs://...inputData.txt")
val parsedData3 = data3.map(_.split('\t').map(_.toDouble))
parsedData3.persist(StorageLevel.MEMORY_ONLY)
val numIterations = 10
val numClusters = 200
val clusters = KMeans.train(parsedData3, numClusters, numIterations)

This resulted in only a marginal improvement and took around 720 seconds.

Is there any other way to speed up the algorithm further?

Thank you.

Regards,
Ravi

Re: Speeding up K-Means Clustering

Posted by Xiangrui Meng <me...@gmail.com>.
Please try

val parsedData3 =
  data3.repartition(12).map(_.split("\t").map(_.toDouble)).cache()

and check the storage and driver/executor memory in the web UI. Make
sure the data is fully cached.
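
For example, a minimal sketch for verifying this before timing
KMeans.train (the exact getStorageLevel output varies by version):

// A materializing action populates the cache, so the first k-means
// iteration reads from memory instead of re-parsing the text file.
parsedData3.count()
// Confirm that cache() took effect.
println(parsedData3.getStorageLevel)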

-Xiangrui



Re: Speeding up K-Means Clustering

Posted by Ravishankar Rajagopalan <vi...@gmail.com>.
Hi Xiangrui,

Yes I am using Spark v0.9 and am not running it in local mode.

I set the memory using "export SPARK_MEM=4G" before starting the Spark
instance.

Also, I was previously starting it with -c 1, but changed it to -c 12
since it is a 12-core machine. That brought the time down from over 700
seconds to under 200 seconds.
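
For reference, in a standalone application the same knobs would look
roughly like the sketch below; the master URL is a placeholder, and
mapping SPARK_MEM to spark.executor.memory is my reading of the
0.9-era configuration names.

import org.apache.spark.SparkContext

// Roughly what SPARK_MEM=4G and -c 12 set under the hood (0.9-era names).
System.setProperty("spark.executor.memory", "4g")
System.setProperty("spark.cores.max", "12")
val sc = new SparkContext("spark://master:7077", "KMeansClustering")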

I am not sure how to repartition the data to match the CPU cores. How do I
do it?

Thank you.

Ravi



Re: Speeding up K-Means Clustering

Posted by Xiangrui Meng <me...@gmail.com>.
Is it v0.9? Did you run in local mode? Try setting --driver-memory 4g
and repartitioning your data to match the number of CPU cores so that
the data is evenly distributed. You need 1M rows * 50 doubles * 8 bytes
~ 400 MB to store the data. Make sure there is enough memory for
caching. -Xiangrui
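
To make that arithmetic concrete (a sketch; the 2x headroom for JVM
object overhead is a rough rule of thumb, not an exact figure):

// Estimated cache size: rows * columns * 8 bytes per double.
val rows = 1000000L
val cols = 50L
val rawBytes = rows * cols * 8L     // 400,000,000 bytes ~ 400 MB
val budgetBytes = rawBytes * 2L     // leave headroom for JVM overhead
println(s"raw: ${rawBytes / 1000000} MB, budget: ${budgetBytes / 1000000} MB")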
