Posted to user@spark.apache.org by Tsai Li Ming <ma...@ltsai.com> on 2014/03/23 11:15:23 UTC

Kmeans example reduceByKey slow

Hi,

At the reduceByKey stage, it takes a few minutes before the tasks start working.

I have set -Dspark.default.parallelism=127, which is the total number of cores minus one (n-1).
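For reference, the same parallelism can also be requested explicitly per RDD rather than through that property. A rough sketch, with a placeholder input path:

// sc is the spark-shell's SparkContext; the path below is a placeholder.
val points = sc.textFile("/data/kmeans_data.txt", 127)   // ask for 127 input splits
  .map(_.split(' ').map(_.toDouble))
  .cache()
// Shuffles such as reduceByKey also accept an explicit partition count,
// e.g. pairs.reduceByKey(mergeFn, 127).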

CPU/Network/IO is idling across all nodes when this is happening. 

And there is nothing particular on the master log file. From the spark-shell:

14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on executor 2: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 bytes in 193 ms
14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on executor 1: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 bytes in 96 ms
14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on executor 0: XXX (PROCESS_LOCAL)
14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 bytes in 100 ms

But it stops there for some significant time before any movement. 

In the stage detail of the UI, I can see that there are 127 tasks running but the duration each is at least a few minutes.

I’m working off local storage (not hdfs) and the kmeans data is about 6.5GB (50M rows).

Is this a normal behaviour?

Thanks!

Re: Kmeans example reduceByKey slow

Posted by Xiangrui Meng <me...@gmail.com>.
Sorry, I meant the master branch of https://github.com/apache/spark. -Xiangrui

On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming <ma...@ltsai.com> wrote:
> Thanks again.
>
>> If you use the KMeans implementation from MLlib, the
>> initialization stage is done on master,
>
> The "master" here is the app/driver/spark-shell?
>
> Thanks!
>
> On 25 Mar, 2014, at 1:03 am, Xiangrui Meng <me...@gmail.com> wrote:
>
>> Number of rows doesn't matter much as long as you have enough workers
>> to distribute the work. K-means has complexity O(n * d * k), where n
>> is number of points, d is the dimension, and k is the number of
>> clusters. If you use the KMeans implementation from MLlib, the
>> initialization stage is done on master, so a large k would slow down
>> the initialization stage. If your data is sparse, the latest change to
>> KMeans will help with the speed, depending on how sparse your data is.
>> -Xiangrui
>>
>> On Mon, Mar 24, 2014 at 12:44 AM, Tsai Li Ming <ma...@ltsai.com> wrote:
>>> Thanks, let me try with a smaller K.
>>>
>>> Does the size of the input data matter for the example? Currently I have 50M rows. What is a reasonable size to demonstrate the capability of Spark?
>>>
>>>
>>>
>>>
>>>
>>> On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng <me...@gmail.com> wrote:
>>>
>>>> K = 500000 is certainly a large number for k-means. If there is no
>>>> particular reason to have 500000 clusters, could you try to reduce it
>>>> to, e.g., 100 or 1000? Also, the example code is not for large-scale
>>>> problems. You should use the KMeans algorithm in mllib clustering for
>>>> your problem.
>>>>
>>>> -Xiangrui
>>>>
>>>> On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming <ma...@ltsai.com> wrote:
>>>>> Hi,
>>>>>
>>>>> This is on a 4-node cluster, each node with 32 cores/256GB RAM.
>>>>>
>>>>> Spark 0.9.0 is deployed in standalone mode.
>>>>>
>>>>> Each worker is configured with 192GB. Spark executor memory is also 192GB.
>>>>>
>>>>> This is on the first iteration. K=500000. Here's the code I use:
>>>>> http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>> On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng <me...@gmail.com> wrote:
>>>>>
>>>>>> Hi Tsai,
>>>>>>
>>>>>> Could you share more information about the machine you used and the
>>>>>> training parameters (runs, k, and iterations)? It can help solve your
>>>>>> issues. Thanks!
>>>>>>
>>>>>> Best,
>>>>>> Xiangrui
>>>>>>
>>>>>> On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming <ma...@ltsai.com> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> At the reduceByKey stage, it takes a few minutes before the tasks start working.
>>>>>>>
>>>>>>> I have set -Dspark.default.parallelism=127, which is the total number of cores minus one (n-1).
>>>>>>>
>>>>>>> CPU/Network/IO is idling across all nodes when this is happening.
>>>>>>>
>>>>>>> And there is nothing particular on the master log file. From the spark-shell:
>>>>>>>
>>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on executor 2: XXX (PROCESS_LOCAL)
>>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 bytes in 193 ms
>>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on executor 1: XXX (PROCESS_LOCAL)
>>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 bytes in 96 ms
>>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on executor 0: XXX (PROCESS_LOCAL)
>>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 bytes in 100 ms
>>>>>>>
>>>>>>> But it stops there for some significant time before any movement.
>>>>>>>
>>>>>>> In the stage detail of the UI, I can see that there are 127 tasks running but the duration each is at least a few minutes.
>>>>>>>
>>>>>>> I'm working off local storage (not hdfs) and the kmeans data is about 6.5GB (50M rows).
>>>>>>>
>>>>>>> Is this a normal behaviour?
>>>>>>>
>>>>>>> Thanks!
>>>>>
>>>
>

Re: Kmeans example reduceByKey slow

Posted by Tsai Li Ming <ma...@ltsai.com>.
Thanks again.

> If you use the KMeans implementation from MLlib, the
> initialization stage is done on master, 

The “master” here is the app/driver/spark-shell?

Thanks!

On 25 Mar, 2014, at 1:03 am, Xiangrui Meng <me...@gmail.com> wrote:

> Number of rows doesn't matter much as long as you have enough workers
> to distribute the work. K-means has complexity O(n * d * k), where n
> is number of points, d is the dimension, and k is the number of
> clusters. If you use the KMeans implementation from MLlib, the
> initialization stage is done on master, so a large k would slow down
> the initialization stage. If your data is sparse, the latest change to
> KMeans will help with the speed, depending on how sparse your data is.
> -Xiangrui
> 
> On Mon, Mar 24, 2014 at 12:44 AM, Tsai Li Ming <ma...@ltsai.com> wrote:
>> Thanks, let me try with a smaller K.
>>
>> Does the size of the input data matter for the example? Currently I have 50M rows. What is a reasonable size to demonstrate the capability of Spark?
>> 
>> 
>> 
>> 
>> 
>> On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng <me...@gmail.com> wrote:
>> 
>>> K = 500000 is certainly a large number for k-means. If there is no
>>> particular reason to have 500000 clusters, could you try to reduce it
>>> to, e.g., 100 or 1000? Also, the example code is not for large-scale
>>> problems. You should use the KMeans algorithm in mllib clustering for
>>> your problem.
>>> 
>>> -Xiangrui
>>> 
>>> On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming <ma...@ltsai.com> wrote:
>>>> Hi,
>>>> 
>>>> This is on a 4-node cluster, each node with 32 cores/256GB RAM.
>>>>
>>>> Spark 0.9.0 is deployed in standalone mode.
>>>> 
>>>> Each worker is configured with 192GB. Spark executor memory is also 192GB.
>>>> 
>>>> This is on the first iteration. K=500000. Here's the code I use:
>>>> http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
>>>> 
>>>> Thanks!
>>>> 
>>>> 
>>>> 
>>>> On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng <me...@gmail.com> wrote:
>>>> 
>>>>> Hi Tsai,
>>>>> 
>>>>> Could you share more information about the machine you used and the
>>>>> training parameters (runs, k, and iterations)? It can help solve your
>>>>> issues. Thanks!
>>>>> 
>>>>> Best,
>>>>> Xiangrui
>>>>> 
>>>>> On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming <ma...@ltsai.com> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> At the reduceByKey stage, it takes a few minutes before the tasks start working.
>>>>>>
>>>>>> I have set -Dspark.default.parallelism=127, which is the total number of cores minus one (n-1).
>>>>>> 
>>>>>> CPU/Network/IO is idling across all nodes when this is happening.
>>>>>> 
>>>>>> And there is nothing particular on the master log file. From the spark-shell:
>>>>>> 
>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on executor 2: XXX (PROCESS_LOCAL)
>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 bytes in 193 ms
>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on executor 1: XXX (PROCESS_LOCAL)
>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 bytes in 96 ms
>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on executor 0: XXX (PROCESS_LOCAL)
>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 bytes in 100 ms
>>>>>> 
>>>>>> But it stops there for some significant time before any movement.
>>>>>> 
>>>>>> In the stage detail of the UI, I can see that there are 127 tasks running but the duration each is at least a few minutes.
>>>>>> 
>>>>>> I'm working off local storage (not hdfs) and the kmeans data is about 6.5GB (50M rows).
>>>>>> 
>>>>>> Is this a normal behaviour?
>>>>>> 
>>>>>> Thanks!
>>>> 
>> 


Re: Kmeans example reduceByKey slow

Posted by Xiangrui Meng <me...@gmail.com>.
Number of rows doesn't matter much as long as you have enough workers
to distribute the work. K-means has complexity O(n * d * k), where n
is number of points, d is the dimension, and k is the number of
clusters. If you use the KMeans implementation from MLlib, the
initialization stage is done on master, so a large k would slow down
the initialization stage. If your data is sparse, the latest change to
KMeans will help with the speed, depending on how sparse your data is.
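As a rough back-of-envelope check of that O(n * d * k) cost for the numbers in this thread (the dimension d is assumed here, it was never stated):

val n = 5e7       // 50M points
val d = 10.0      // assumed dimensionality
val k = 5e5       // 500,000 clusters
val opsPerIter = n * d * k                        // = 2.5e14 distance terms per iteration
val secsOn128Cores = opsPerIter / (128 * 1e9)     // ~1950 s, assuming 1e9 ops/s per core

Even under generous assumptions that is on the order of half an hour of pure arithmetic per iteration, which is why reducing k makes such a difference.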
-Xiangrui

On Mon, Mar 24, 2014 at 12:44 AM, Tsai Li Ming <ma...@ltsai.com> wrote:
> Thanks, let me try with a smaller K.
>
> Does the size of the input data matter for the example? Currently I have 50M rows. What is a reasonable size to demonstrate the capability of Spark?
>
>
>
>
>
> On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng <me...@gmail.com> wrote:
>
>> K = 500000 is certainly a large number for k-means. If there is no
>> particular reason to have 500000 clusters, could you try to reduce it
>> to, e.g., 100 or 1000? Also, the example code is not for large-scale
>> problems. You should use the KMeans algorithm in mllib clustering for
>> your problem.
>>
>> -Xiangrui
>>
>> On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming <ma...@ltsai.com> wrote:
>>> Hi,
>>>
>>> This is on a 4-node cluster, each node with 32 cores/256GB RAM.
>>>
>>> Spark 0.9.0 is deployed in standalone mode.
>>>
>>> Each worker is configured with 192GB. Spark executor memory is also 192GB.
>>>
>>> This is on the first iteration. K=500000. Here's the code I use:
>>> http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng <me...@gmail.com> wrote:
>>>
>>>> Hi Tsai,
>>>>
>>>> Could you share more information about the machine you used and the
>>>> training parameters (runs, k, and iterations)? It can help solve your
>>>> issues. Thanks!
>>>>
>>>> Best,
>>>> Xiangrui
>>>>
>>>> On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming <ma...@ltsai.com> wrote:
>>>>> Hi,
>>>>>
>>>>> At the reduceByKey stage, it takes a few minutes before the tasks start working.
>>>>>
>>>>> I have set -Dspark.default.parallelism=127, which is the total number of cores minus one (n-1).
>>>>>
>>>>> CPU/Network/IO is idling across all nodes when this is happening.
>>>>>
>>>>> And there is nothing particular on the master log file. From the spark-shell:
>>>>>
>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on executor 2: XXX (PROCESS_LOCAL)
>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 bytes in 193 ms
>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on executor 1: XXX (PROCESS_LOCAL)
>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 bytes in 96 ms
>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on executor 0: XXX (PROCESS_LOCAL)
>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 bytes in 100 ms
>>>>>
>>>>> But it stops there for some significant time before any movement.
>>>>>
>>>>> In the stage detail of the UI, I can see that there are 127 tasks running but the duration each is at least a few minutes.
>>>>>
>>>>> I'm working off local storage (not hdfs) and the kmeans data is about 6.5GB (50M rows).
>>>>>
>>>>> Is this a normal behaviour?
>>>>>
>>>>> Thanks!
>>>
>

Re: Kmeans example reduceByKey slow

Posted by Tsai Li Ming <ma...@ltsai.com>.
Thanks, let me try with a smaller K.

Does the size of the input data matter for the example? Currently I have 50M rows. What is a reasonable size to demonstrate the capability of Spark?





On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng <me...@gmail.com> wrote:

> K = 500000 is certainly a large number for k-means. If there is no
> particular reason to have 500000 clusters, could you try to reduce it
> to, e.g., 100 or 1000? Also, the example code is not for large-scale
> problems. You should use the KMeans algorithm in mllib clustering for
> your problem.
> 
> -Xiangrui
> 
> On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming <ma...@ltsai.com> wrote:
>> Hi,
>> 
>> This is on a 4-node cluster, each node with 32 cores/256GB RAM.
>>
>> Spark 0.9.0 is deployed in standalone mode.
>> 
>> Each worker is configured with 192GB. Spark executor memory is also 192GB.
>> 
>> This is on the first iteration. K=500000. Here's the code I use:
>> http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
>> 
>> Thanks!
>> 
>> 
>> 
>> On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng <me...@gmail.com> wrote:
>> 
>>> Hi Tsai,
>>> 
>>> Could you share more information about the machine you used and the
>>> training parameters (runs, k, and iterations)? It can help solve your
>>> issues. Thanks!
>>> 
>>> Best,
>>> Xiangrui
>>> 
>>> On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming <ma...@ltsai.com> wrote:
>>>> Hi,
>>>> 
>>>> At the reduceByKey stage, it takes a few minutes before the tasks start working.
>>>>
>>>> I have set -Dspark.default.parallelism=127, which is the total number of cores minus one (n-1).
>>>> 
>>>> CPU/Network/IO is idling across all nodes when this is happening.
>>>> 
>>>> And there is nothing particular on the master log file. From the spark-shell:
>>>> 
>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on executor 2: XXX (PROCESS_LOCAL)
>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 bytes in 193 ms
>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on executor 1: XXX (PROCESS_LOCAL)
>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 bytes in 96 ms
>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on executor 0: XXX (PROCESS_LOCAL)
>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 bytes in 100 ms
>>>> 
>>>> But it stops there for some significant time before any movement.
>>>> 
>>>> In the stage detail of the UI, I can see that there are 127 tasks running but the duration each is at least a few minutes.
>>>> 
>>>> I'm working off local storage (not hdfs) and the kmeans data is about 6.5GB (50M rows).
>>>> 
>>>> Is this a normal behaviour?
>>>> 
>>>> Thanks!
>> 


Re: Kmeans example reduceByKey slow

Posted by Xiangrui Meng <me...@gmail.com>.
K = 500000 is certainly a large number for k-means. If there is no
particular reason to have 500000 clusters, could you try to reduce it
to, e.g., 100 or 1000? Also, the example code is not for large-scale
problems. You should use the KMeans algorithm in mllib clustering for
your problem.
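Presumably part of what hurts the example at this scale is that its closures capture the full array of k current centers, which would also explain the ~38 MB serialized tasks in the log above. A minimal sketch of the MLlib route on 0.9 (the input path and the k/maxIterations values below are placeholders, not recommendations):

import org.apache.spark.mllib.clustering.KMeans

// sc is the spark-shell's SparkContext; in 0.9 MLlib's KMeans expects an RDD[Array[Double]].
val points = sc.textFile("/data/kmeans_data.txt")
  .map(_.split(' ').map(_.toDouble))
  .cache()

val model = KMeans.train(points, 1000, 20)   // k = 1000, maxIterations = 20
println("Found " + model.clusterCenters.length + " centers")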

-Xiangrui

On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming <ma...@ltsai.com> wrote:
> Hi,
>
> This is on a 4-node cluster, each node with 32 cores/256GB RAM.
>
> Spark 0.9.0 is deployed in standalone mode.
>
> Each worker is configured with 192GB. Spark executor memory is also 192GB.
>
> This is on the first iteration. K=500000. Here's the code I use:
> http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
>
> Thanks!
>
>
>
> On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng <me...@gmail.com> wrote:
>
>> Hi Tsai,
>>
>> Could you share more information about the machine you used and the
>> training parameters (runs, k, and iterations)? It can help solve your
>> issues. Thanks!
>>
>> Best,
>> Xiangrui
>>
>> On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming <ma...@ltsai.com> wrote:
>>> Hi,
>>>
>>> At the reduceByKey stage, it takes a few minutes before the tasks start working.
>>>
>>> I have set -Dspark.default.parallelism=127, which is the total number of cores minus one (n-1).
>>>
>>> CPU/Network/IO is idling across all nodes when this is happening.
>>>
>>> And there is nothing particular on the master log file. From the spark-shell:
>>>
>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on executor 2: XXX (PROCESS_LOCAL)
>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 bytes in 193 ms
>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on executor 1: XXX (PROCESS_LOCAL)
>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 bytes in 96 ms
>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on executor 0: XXX (PROCESS_LOCAL)
>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 bytes in 100 ms
>>>
>>> But it stops there for some significant time before any movement.
>>>
>>> In the stage detail of the UI, I can see that there are 127 tasks running but the duration each is at least a few minutes.
>>>
>>> I'm working off local storage (not hdfs) and the kmeans data is about 6.5GB (50M rows).
>>>
>>> Is this a normal behaviour?
>>>
>>> Thanks!
>

Re: Kmeans example reduceByKey slow

Posted by Tsai Li Ming <ma...@ltsai.com>.
Hi,

This is on a 4-node cluster, each node with 32 cores/256GB RAM.

Spark 0.9.0 is deployed in standalone mode.

Each worker is configured with 192GB. Spark executor memory is also 192GB. 
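For completeness, these memory settings correspond to SPARK_WORKER_MEMORY in spark-env.sh on each node plus the spark.executor.memory property on the driver side. A sketch of the latter, with a placeholder master URL and this cluster's values rather than recommended ones:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://<master-host>:7077")   // placeholder standalone master URL
  .setAppName("KMeansExample")
  .set("spark.executor.memory", "192g")
  .set("spark.default.parallelism", "127")
val sc = new SparkContext(conf)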

This is on the first iteration. K=500000. Here’s the code I use:
http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.

Thanks!



On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng <me...@gmail.com> wrote:

> Hi Tsai,
> 
> Could you share more information about the machine you used and the
> training parameters (runs, k, and iterations)? It can help solve your
> issues. Thanks!
> 
> Best,
> Xiangrui
> 
> On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming <ma...@ltsai.com> wrote:
>> Hi,
>> 
>> At the reduceByKey stage, it takes a few minutes before the tasks start working.
>>
>> I have set -Dspark.default.parallelism=127, which is the total number of cores minus one (n-1).
>> 
>> CPU/Network/IO is idling across all nodes when this is happening.
>> 
>> And there is nothing particular on the master log file. From the spark-shell:
>> 
>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on executor 2: XXX (PROCESS_LOCAL)
>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 bytes in 193 ms
>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on executor 1: XXX (PROCESS_LOCAL)
>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 bytes in 96 ms
>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on executor 0: XXX (PROCESS_LOCAL)
>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 bytes in 100 ms
>> 
>> But it stops there for some significant time before any movement.
>> 
>> In the stage detail of the UI, I can see that there are 127 tasks running but the duration each is at least a few minutes.
>> 
>> I'm working off local storage (not hdfs) and the kmeans data is about 6.5GB (50M rows).
>> 
>> Is this a normal behaviour?
>> 
>> Thanks!


Re: Kmeans example reduceByKey slow

Posted by Xiangrui Meng <me...@gmail.com>.
Hi Tsai,

Could you share more information about the machine you used and the
training parameters (runs, k, and iterations)? It can help solve your
issues. Thanks!

Best,
Xiangrui

On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming <ma...@ltsai.com> wrote:
> Hi,
>
> At the reduceByKey stage, it takes a few minutes before the tasks start working.
>
> I have set -Dspark.default.parallelism=127, which is the total number of cores minus one (n-1).
>
> CPU/Network/IO is idling across all nodes when this is happening.
>
> And there is nothing particular on the master log file. From the spark-shell:
>
> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 on executor 2: XXX (PROCESS_LOCAL)
> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 38765155 bytes in 193 ms
> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 on executor 1: XXX (PROCESS_LOCAL)
> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 38765155 bytes in 96 ms
> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 on executor 0: XXX (PROCESS_LOCAL)
> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 38765155 bytes in 100 ms
>
> But it stops there for some significant time before any movement.
>
> In the stage detail of the UI, I can see that there are 127 tasks running but the duration each is at least a few minutes.
>
> I'm working off local storage (not hdfs) and the kmeans data is about 6.5GB (50M rows).
>
> Is this a normal behaviour?
>
> Thanks!