Posted to user@spark.apache.org by Muler <mu...@gmail.com> on 2015/08/28 20:47:02 UTC

Help Explain Tasks in WebUI:4040

I have a 7-node cluster running in standalone mode (1 executor per node,
100 GB per executor, 18 cores per executor).
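
For reference, the setup is configured roughly like this (the app name and
master URL are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Standalone mode: one executor per worker, sized as described above
    val conf = new SparkConf()
      .setAppName("my-job")                  // placeholder
      .setMaster("spark://master:7077")      // placeholder master URL
      .set("spark.executor.memory", "100g")  // 100 GB per executor
      .set("spark.executor.cores", "18")     // 18 cores per executor
    val sc = new SparkContext(conf)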

Attached is the task status for two of my nodes. I'm not clear why some of
my tasks are taking so long:

   1. [node sk5, green] task 197 took 35 mins while task 218 took less than
   2 mins, even though their output sizes/record counts are almost the
   same. Stranger still, the shuffle spill for both memory and disk is 0
   for task 197, yet it is taking that long.

The same thing happens on my other node (sk3, red).

Can you please explain what is going on?

Thanks,

Re: Help Explain Tasks in WebUI:4040

Posted by Igor Berman <ig...@gmail.com>.
are there other processes on sk3? or, more generally, are you sharing
resources with something else (virtualization, etc.)?

does your transformation consume other services? (e.g., reading from S3,
in which case S3 latency can play a role...)

can it be that a task for some key takes longer than the same task for
another key (I mean in your business logic...)? I see that some tasks take
~1 min and others ~1 h, which is strange.
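
if skew is the suspect, a quick way to check the key distribution (rough
sketch; "pairs" stands for whatever keyed RDD feeds your shuffle):

    // count records per key and print the 20 heaviest keys
    val keyCounts = pairs.mapValues(_ => 1L).reduceByKey(_ + _)
    keyCounts.top(20)(Ordering.by(_._2)).foreach(println)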




Re: Help Explain Tasks in WebUI:4040

Posted by Alexey Grishchenko <pr...@gmail.com>.
It really depends on the code. I would say the easiest way is to restart
the problematic action, find the straggler task, and analyze what's
happening with it using jstack, or make a heap dump and analyze it locally.
For example, it might be the case that your tasks are connecting to some
external resource and this resource is timing out under the pressure. Also,
call toDebugString on the problematic RDD before calling the action that
triggers the computation; this will give you an understanding of what your
execution tasks are really doing.
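
A minimal sketch (the RDD name here is hypothetical):

    // Print the RDD's lineage before triggering the action, so the
    // straggler task can be mapped back to the stage that produced it
    println(suspectRdd.toDebugString)
    suspectRdd.count() // the action that triggers the computation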

-- 
Alexey Grishchenko, http://0x0fff.com

Re: Help Explain Tasks in WebUI:4040

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Are you doing a join/groupBy or a similar operation? In that case I would
suspect that the keys are not evenly distributed, and that's why a few of
the tasks are spending way too much time doing the actual processing. You
might want to look into custom partitioners
<http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where>
to avoid these scenarios.
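
A rough sketch of the idea (the class name and the hot key are
illustrative, not from your job):

    import org.apache.spark.Partitioner

    // Routes one known hot key to its own partition and hashes the rest,
    // so the straggler key no longer slows down the other partitions
    class HotKeyPartitioner(partitions: Int, hotKey: String)
        extends Partitioner {
      require(partitions >= 2)

      override def numPartitions: Int = partitions

      override def getPartition(key: Any): Int = key match {
        case k: String if k == hotKey => 0 // dedicated partition
        case k => 1 + math.abs(k.hashCode % (partitions - 1))
      }
    }

    // hypothetical usage before the wide operation:
    // pairs.partitionBy(new HotKeyPartitioner(18, "hotKey")).reduceByKey(_ + _)

Note that this only isolates the hot key; if a single key is genuinely too
big for one task, you would additionally need to salt the key itself.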

Thanks
Best Regards
