Posted to user@spark.apache.org by Nirmal Fernando <ni...@wso2.com> on 2015/07/13 11:53:08 UTC

[MLLib][KMeans] KMeansModel.computeCost takes a lot of time

Hi,

For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot of time
(16+ minutes).

It takes a lot of time in this task:

org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

Can this be improved?
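
For context, a minimal Java sketch of the call pattern in question — the
input path, parsing, and variable names are illustrative, not from this
report:
```
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// sc: an existing JavaSparkContext. "features.csv" is a hypothetical input.
JavaRDD<Vector> points = sc.textFile("features.csv").map(line -> {
    String[] cols = line.split(",");
    double[] values = new double[cols.length];
    for (int i = 0; i < cols.length; i++) {
        values[i] = Double.parseDouble(cols[i]);
    }
    return Vectors.dense(values);
});
points.cache(); // computeCost is a second full pass over the data

KMeansModel model = KMeans.train(points.rdd(), 3, 20); // k = 3, 20 iterations
// Cost = sum of squared distances from each point to its nearest centre.
// Without cache(), this pass re-reads and re-parses the entire dataset.
double cost = model.computeCost(points.rdd());
```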

-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/

Re: [MLLib][KMeans] KMeansModel.computeCost takes a lot of time

Posted by Nirmal Fernando <ni...@wso2.com>.
Could limited memory be causing this slowness?
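
(Limited memory is a plausible cause: if cached partitions no longer fit on
the heap, Spark evicts them and recomputes or re-reads them on the next pass.
A hedged sketch of two settings worth experimenting with; the names data and
"4g" are illustrative, not recommendations from this thread:)
```
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.storage.StorageLevel;

SparkConf conf = new SparkConf()
    .setAppName("kmeans-cost")
    .set("spark.executor.memory", "4g"); // illustrative value; size to your data
JavaSparkContext sc = new JavaSparkContext(conf);

// data: the JavaRDD<Vector> being clustered (hypothetical, built elsewhere).
// MEMORY_AND_DISK spills partitions that don't fit instead of recomputing them.
JavaRDD<Vector> input = data.persist(StorageLevel.MEMORY_AND_DISK());
```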

On Tue, Jul 14, 2015 at 9:00 AM, Nirmal Fernando <ni...@wso2.com> wrote:

> Thanks, Burak.
>
> Now it takes minutes to repartition:
>
> Active Stages (1):
>
>   Stage 42: repartition at UnsupervisedSparkModelBuilder.java:120
>   Submitted: 2015/07/14 08:59:30 | Duration: 3.6 min | Tasks: 0/3 | Input: 14.6 MB
>
> org.apache.spark.api.java.JavaRDD.repartition(JavaRDD.scala:100)
> org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:120)
> org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
> org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
>
> Pending Stages (1):
>
>   Stage 43: sum at KMeansModel.scala:70
>   Submitted: unknown | Duration: unknown | Tasks: 0/8
>
> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
> org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:121)
> org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
> org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
>
> On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz <br...@gmail.com> wrote:
>
>> Can you call repartition(8) (or 16) on data.rdd() before KMeans, and also
>> .cache()?
>>
>> Something like this (I'm assuming you are using Java):
>> ```
>> JavaRDD<Vector> input = data.repartition(8).cache();
>> org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
>> ```
>>
>> On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando <ni...@wso2.com>
>> wrote:
>>
>>> I'm using:
>>>
>>> org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);
>>>
>>> CPU cores: 8 (using the default Spark conf, though)
>>>
>>> As for partitions, I'm not sure how to find that.
>>>
>>> On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz <br...@gmail.com> wrote:
>>>
>>>> What are the other parameters? Are you just setting k=3? What about #
>>>> of runs? How many partitions do you have? How many cores does your machine
>>>> have?
>>>>
>>>> Thanks,
>>>> Burak
>>>>
>>>> On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando <ni...@wso2.com>
>>>> wrote:
>>>>
>>>>> Hi Burak,
>>>>>
>>>>> k = 3
>>>>> dimension = 785 features
>>>>> Spark 1.4
>>>>>
>>>>> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz <br...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> How are you running K-Means? What is your k? What is the dimension of
>>>>>> your dataset (columns)? Which Spark version are you using?
>>>>>>
>>>>>> Thanks,
>>>>>> Burak
>>>>>>
>>>>>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando <ni...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot
>>>>>>> of time (16+ minutes).
>>>>>>>
>>>>>>> It takes a lot of time in this task:
>>>>>>>
>>>>>>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
>>>>>>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>>>>>>>
>>>>>>> Can this be improved?
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Thanks & regards,
>>>>>>> Nirmal
>>>>>>>
>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>> Mobile: +94715779733
>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Thanks & regards,
>>>>> Nirmal
>>>>>
>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>> Mobile: +94715779733
>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Thanks & regards,
>>> Nirmal
>>>
>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>> Mobile: +94715779733
>>> Blog: http://nirmalfdo.blogspot.com/
>>>
>>>
>>>
>>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/

Re: [MLLib][KMeans] KMeansModel.computeCost takes a lot of time

Posted by Nirmal Fernando <ni...@wso2.com>.
Thanks, Burak.

Now it takes minutes to repartition:

Active Stages (1):

  Stage 42: repartition at UnsupervisedSparkModelBuilder.java:120
  Submitted: 2015/07/14 08:59:30 | Duration: 3.6 min | Tasks: 0/3 | Input: 14.6 MB

org.apache.spark.api.java.JavaRDD.repartition(JavaRDD.scala:100)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:120)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

Pending Stages (1):

  Stage 43: sum at KMeansModel.scala:70
  Submitted: unknown | Duration: unknown | Tasks: 0/8

org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:121)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
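
(A hedged aside: repartition() always performs a full shuffle, so a one-off
cost at this stage is expected — it is the cached result that the training
iterations and the computeCost pass reuse. Checking the source partition
count first shows whether the shuffle is needed at all; data here is a
hypothetical name for the JavaRDD<Vector> being clustered:)
```
// repartition() always shuffles, so skip it when enough partitions exist.
int before = data.partitions().size();
JavaRDD<Vector> input = (before >= 8) ? data.cache() : data.repartition(8).cache();
```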

On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz <br...@gmail.com> wrote:

> Can you call repartition(8) (or 16) on data.rdd() before KMeans, and also
> .cache()?
>
> Something like this (I'm assuming you are using Java):
> ```
> JavaRDD<Vector> input = data.repartition(8).cache();
> org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
> ```
>
> On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando <ni...@wso2.com> wrote:
>
>> I'm using:
>>
>> org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);
>>
>> CPU cores: 8 (using the default Spark conf, though)
>>
>> As for partitions, I'm not sure how to find that.
>>
>> On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz <br...@gmail.com> wrote:
>>
>>> What are the other parameters? Are you just setting k=3? What about # of
>>> runs? How many partitions do you have? How many cores does your machine
>>> have?
>>>
>>> Thanks,
>>> Burak
>>>
>>> On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando <ni...@wso2.com>
>>> wrote:
>>>
>>>> Hi Burak,
>>>>
>>>> k = 3
>>>> dimension = 785 features
>>>> Spark 1.4
>>>>
>>>> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz <br...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How are you running K-Means? What is your k? What is the dimension of
>>>>> your dataset (columns)? Which Spark version are you using?
>>>>>
>>>>> Thanks,
>>>>> Burak
>>>>>
>>>>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando <ni...@wso2.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot
>>>>>> of time (16+ minutes).
>>>>>>
>>>>>> It takes a lot of time in this task:
>>>>>>
>>>>>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
>>>>>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>>>>>>
>>>>>> Can this be improved?
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Thanks & regards,
>>>>>> Nirmal
>>>>>>
>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>> Mobile: +94715779733
>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Thanks & regards,
>>>> Nirmal
>>>>
>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>> Mobile: +94715779733
>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>>
>>
>>
>


-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/

Re: [MLLib][KMeans] KMeansModel.computeCost takes a lot of time

Posted by Burak Yavuz <br...@gmail.com>.
Can you call repartition(8) (or 16) on data.rdd() before KMeans, and also
.cache()?

Something like this (I'm assuming you are using Java):
```
// data: the JavaRDD<Vector> being clustered
JavaRDD<Vector> input = data.repartition(8).cache();
org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
```
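
(Why this helps: repartition(8) matches the partition count to the 8
available cores, so each of the 20 training passes runs 8 tasks in parallel,
and cache() keeps the parsed vectors in memory for the training iterations
and the subsequent computeCost pass.)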

On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando <ni...@wso2.com> wrote:

> I'm using:
>
> org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);
>
> CPU cores: 8 (using the default Spark conf, though)
>
> As for partitions, I'm not sure how to find that.
>
> On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz <br...@gmail.com> wrote:
>
>> What are the other parameters? Are you just setting k=3? What about # of
>> runs? How many partitions do you have? How many cores does your machine
>> have?
>>
>> Thanks,
>> Burak
>>
>> On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando <ni...@wso2.com>
>> wrote:
>>
>>> Hi Burak,
>>>
>>> k = 3
>>> dimension = 785 features
>>> Spark 1.4
>>>
>>> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz <br...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> How are you running K-Means? What is your k? What is the dimension of
>>>> your dataset (columns)? Which Spark version are you using?
>>>>
>>>> Thanks,
>>>> Burak
>>>>
>>>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando <ni...@wso2.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot of
>>>>> time (16+ minutes).
>>>>>
>>>>> It takes a lot of time in this task:
>>>>>
>>>>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
>>>>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>>>>>
>>>>> Can this be improved?
>>>>>
>>>>> --
>>>>>
>>>>> Thanks & regards,
>>>>> Nirmal
>>>>>
>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>> Mobile: +94715779733
>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Thanks & regards,
>>> Nirmal
>>>
>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>> Mobile: +94715779733
>>> Blog: http://nirmalfdo.blogspot.com/
>>>
>>>
>>>
>>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>

Re: [MLLib][KMeans] KMeansModel.computeCost takes a lot of time

Posted by Nirmal Fernando <ni...@wso2.com>.
I'm using:

org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20); // k = 3, maxIterations = 20

CPU cores: 8 (using the default Spark conf, though)

As for partitions, I'm not sure how to find that.
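
(A hedged aside on finding the partition count — partitions() is available
on the Java RDD API, so something like this works; data is the JavaRDD<Vector>
fed to KMeans.train above:)
```
// Number of partitions backing the RDD.
int numPartitions = data.partitions().size();
System.out.println("partitions = " + numPartitions);
```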

On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz <br...@gmail.com> wrote:

> What are the other parameters? Are you just setting k=3? What about # of
> runs? How many partitions do you have? How many cores does your machine
> have?
>
> Thanks,
> Burak
>
> On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando <ni...@wso2.com> wrote:
>
>> Hi Burak,
>>
>> k = 3
>> dimension = 785 features
>> Spark 1.4
>>
>> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz <br...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> How are you running K-Means? What is your k? What is the dimension of
>>> your dataset (columns)? Which Spark version are you using?
>>>
>>> Thanks,
>>> Burak
>>>
>>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando <ni...@wso2.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot of
>>>> time (16+ minutes).
>>>>
>>>> It takes a lot of time in this task:
>>>>
>>>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
>>>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>>>>
>>>> Can this be improved?
>>>>
>>>> --
>>>>
>>>> Thanks & regards,
>>>> Nirmal
>>>>
>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>> Mobile: +94715779733
>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>>
>>
>>
>


-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/

Re: [MLLib][KMeans] KMeansModel.computeCost takes a lot of time

Posted by Burak Yavuz <br...@gmail.com>.
What are the other parameters? Are you just setting k=3? What about # of
runs? How many partitions do you have? How many cores does your machine
have?
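
(For reference, a hedged sketch of the MLlib overload that also sets the
number of runs — present in Spark 1.4, where runs > 1 repeats the clustering
with different random seeds and keeps the lowest-cost model; the argument
values are illustrative:)
```
// KMeans.train(data, k, maxIterations, runs)
KMeansModel model = KMeans.train(data.rdd(), 3, 20, 4); // illustrative: 4 runs
```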

Thanks,
Burak

On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando <ni...@wso2.com> wrote:

> Hi Burak,
>
> k = 3
> dimension = 785 features
> Spark 1.4
>
> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz <br...@gmail.com> wrote:
>
>> Hi,
>>
>> How are you running K-Means? What is your k? What is the dimension of
>> your dataset (columns)? Which Spark version are you using?
>>
>> Thanks,
>> Burak
>>
>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando <ni...@wso2.com> wrote:
>>
>>> Hi,
>>>
>>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot of
>>> time (16+ minutes).
>>>
>>> It takes a lot of time in this task:
>>>
>>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
>>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>>>
>>> Can this be improved?
>>>
>>> --
>>>
>>> Thanks & regards,
>>> Nirmal
>>>
>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>> Mobile: +94715779733
>>> Blog: http://nirmalfdo.blogspot.com/
>>>
>>>
>>>
>>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>

Re: [MLLib][KMeans] KMeansModel.computeCost takes a lot of time

Posted by Nirmal Fernando <ni...@wso2.com>.
Hi Burak,

k = 3
dimension = 785 features
Spark 1.4

On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz <br...@gmail.com> wrote:

> Hi,
>
> How are you running K-Means? What is your k? What is the dimension of your
> dataset (columns)? Which Spark version are you using?
>
> Thanks,
> Burak
>
> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando <ni...@wso2.com> wrote:
>
>> Hi,
>>
>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot of
>> time (16+ minutes).
>>
>> It takes a lot of time in this task:
>>
>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>>
>> Can this be improved?
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>>
>>
>>
>


-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/

Re: [MLLib][KMeans] KMeansModel.computeCost takes a lot of time

Posted by Burak Yavuz <br...@gmail.com>.
Hi,

How are you running K-Means? What is your k? What is the dimension of your
dataset (columns)? Which Spark version are you using?

Thanks,
Burak

On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando <ni...@wso2.com> wrote:

> Hi,
>
> For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot of
> time (16+ minutes).
>
> It takes a lot of time in this task:
>
> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>
> Can this be improved?
>
> --
>
> Thanks & regards,
> Nirmal
>
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>