Posted to user@spark.apache.org by unk1102 <um...@gmail.com> on 2016/01/01 22:56:28 UTC

How to find cause(waiting threads etc) of hanging job for 7 hours?

Hi, I have a Spark job that hangs for around 7 hours or more, until it is
killed by Autosys because of a timeout. The data is not huge. I am sure it
gets stuck because of GC, but I can't find the source code that causes the
GC. I am reusing almost all variables to minimize creating local objects,
though I can't avoid creating many String objects in order to update
DataFrame values. When I look at the live thread dump of the executor where
the job is running, I see the attached running/waiting threads. Please guide
me in finding which waiting thread is the culprit and is preventing my job
from finishing. My code uses dataframe.groupBy on around 8 fields and also
uses coalesce(1) twice, so it shuffles huge amounts of data, in terms of GBs
per executor, as I see in the UI.
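
One way to test the GC suspicion is to turn on GC logging for the executors,
which the Spark tuning guide suggests for measuring GC impact. The following
is only a sketch and not part of the original post: the gc_conf helper, the
class name, and the jar name are hypothetical, and the flags are the Java 7/8
era ones (newer JVMs use -Xlog:gc* instead).

```shell
# GC logging flags for Java 7/8 era JVMs.
GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

# Hypothetical helper: prints the value to pass to spark-submit --conf so
# that every executor JVM starts with GC logging enabled.
gc_conf() {
  printf 'spark.executor.extraJavaOptions=%s\n' "$GC_OPTS"
}

# Usage (com.example.MyJob and my-job.jar are placeholder names):
#   spark-submit --conf "$(gc_conf)" --class com.example.MyJob my-job.jar
```

The GC lines then appear in each executor's stdout log, where long or very
frequent full collections would support the GC theory.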

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n25850/Screen_Shot_2016-01-02_at_2.jpg> 

Here is the heap space error, which I don't understand how to resolve in my
code:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n25850/Screen_Shot_2016-01-02_at_2.jpg> 




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-find-cause-waiting-threads-etc-of-hanging-job-for-7-hours-tp25850.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Prabhu, thanks for the response. I did the same. The problem is that when
I get the process ID using jps or ps -ef, I don't see the user name in the
very first column; I see a number in its place, so I can't run jstack on the
process because of a permission issue. The output looks something like this:

7000028852   3553   9833   0   04:30   ?  00:00:00   /bin/bash blabal
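
jstack can only attach to a JVM when it runs as the same user that owns the
JVM, so a possible workaround is to look up the owning UID and run jstack
under it. This is a sketch only: it assumes a Linux host with /proc, that
the account has sudo rights to act as that user, and the helper names
owner_uid and jstack_as_owner are made up for illustration.

```shell
# Print the numeric UID that owns the given PID (Linux only, reads /proc).
owner_uid() {
  stat -c '%u' "/proc/$1"
}

# Run jstack against the PID as the user that owns it.
# Assumption: sudo is configured to let the caller run commands as that UID.
jstack_as_owner() {
  uid="$(owner_uid "$1")"
  sudo -u "#${uid}" jstack -l "$1"
}

# Usage (3553 stands in for the executor PID):
#   jstack_as_owner 3553 > jstack_3553.txt
```

If sudo is not available, a cluster administrator or the account that runs
the YARN containers would have to take the dump instead.
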
On Jan 12, 2016 12:02, "Prabhu Joseph" <pr...@gmail.com> wrote:

> Umesh,
>
>   A running task is a thread within the executor process, so we need to
> take the stack trace of the executor process. The executor runs as a
> container on one of the NodeManager machines.
>
>   The running jobs view of the YARN RM UI has the host details of where
> the executor is running. Log in to that NodeManager machine; jps -l will
> list all Java processes, and jstack -l <pid> will give the stack trace.
>
>
> Thanks,
> Prabhu Joseph

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

Posted by Prabhu Joseph <pr...@gmail.com>.
Umesh,

  A running task is a thread within the executor process, so we need to
take the stack trace of the executor process. The executor runs as a
container on one of the NodeManager machines.

  The running jobs view of the YARN RM UI has the host details of where
the executor is running. Log in to that NodeManager machine; jps -l will
list all Java processes, and jstack -l <pid> will give the stack trace.
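
As a rough sketch of the jps step, the executor JVMs can be picked out of
the jps -l listing by their main class, which for Spark executors is
org.apache.spark.executor.CoarseGrainedExecutorBackend. The JPS variable and
the executor_pids helper are assumptions of this example. Note that jps only
lists JVMs the current user is allowed to see.

```shell
# JPS may be overridden, for example with the full path to the JDK's jps.
# This variable is an assumption of the sketch, not a Spark or YARN setting.
JPS="${JPS:-jps}"

# Print the PIDs of the Spark executor JVMs visible to the current user.
executor_pids() {
  "$JPS" -l | awk '$2 == "org.apache.spark.executor.CoarseGrainedExecutorBackend" { print $1 }'
}

# Usage on the NodeManager host:
#   for pid in $(executor_pids); do jstack -l "$pid" > "jstack_${pid}.txt"; done
```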


Thanks,
Prabhu Joseph

On Mon, Jan 11, 2016 at 7:56 PM, Umesh Kacha <um...@gmail.com> wrote:

> Hi Prabhu, thanks for the response. How do I find the pid of a
> slow-running task? The task is running on a YARN cluster node. When I try
> to see the pid of a running task using my user, I see some 7-8 digit
> number instead of the user running the process. Any idea why Spark creates
> this number instead of displaying the user?

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Prabhu, thanks for the response. How do I find the pid of a slow-running
task? The task is running on a YARN cluster node. When I try to see the pid
of a running task using my user, I see some 7-8 digit number instead of the
user running the process. Any idea why Spark creates this number instead of
displaying the user?
On Jan 3, 2016 6:06 AM, "Prabhu Joseph" <pr...@gmail.com> wrote:

> The attached image just has thread states, and WAITING threads need not be
> the issue. We need to take thread stack traces and identify the areas of
> code where threads are spending a lot of time.
>
> Use jstack -l <pid> or kill -3 <pid>, where pid is the process ID of the
> executor process. Take a jstack stack trace every 2 seconds for a total of
> 1 minute. This will help identify the code where threads are spending a
> lot of time, which you can then try to tune.
>
> Thanks,
> Prabhu Joseph

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

Posted by Prabhu Joseph <pr...@gmail.com>.
The attached image just has thread states, and WAITING threads need not be
the issue. We need to take thread stack traces and identify the areas of
code where threads are spending a lot of time.

Use jstack -l <pid> or kill -3 <pid>, where pid is the process ID of the
executor process. Take a jstack stack trace every 2 seconds for a total of
1 minute. This will help identify the code where threads are spending a
lot of time, which you can then try to tune.
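
The sampling described above (one dump every 2 seconds for 1 minute, so 30
dumps) could be scripted roughly as follows. This is a sketch under
assumptions: the take_dumps helper, the JSTACK override, and the output file
naming are invented for this example. With kill -3 the JVM instead prints
the dump to its own standard output, so it would end up in the container's
stdout log rather than in these files.

```shell
# JSTACK may be overridden, for example with the full path to the JDK's
# jstack. This variable is an assumption of the sketch.
JSTACK="${JSTACK:-jstack}"

# take_dumps PID [INTERVAL_SECONDS] [COUNT] [OUTPUT_DIR]
# Defaults: every 2 seconds, 30 samples, current directory.
take_dumps() {
  pid="$1"
  interval="${2:-2}"
  count="${3:-30}"
  outdir="${4:-.}"
  i=1
  while [ "$i" -le "$count" ]; do
    # One file per sample, zero-padded so the files sort in capture order.
    "$JSTACK" -l "$pid" > "$outdir/jstack_${pid}_$(printf '%03d' "$i").txt" 2>&1
    i=$((i + 1))
    # Sleep between samples, but not after the last one.
    if [ "$i" -le "$count" ]; then
      sleep "$interval"
    fi
  done
}

# Usage (12345 stands in for the executor PID):
#   take_dumps 12345 2 30 /tmp/dumps
```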

Thanks,
Prabhu Joseph



On Sat, Jan 2, 2016 at 1:28 PM, Umesh Kacha <um...@gmail.com> wrote:

> Hi, thanks. I did that, and I have attached the thread dump images. That
> was the intention of my question: asking for help to identify which
> waiting thread is the culprit.
>
> Regards,
> Umesh

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

Posted by Umesh Kacha <um...@gmail.com>.
Hi, thanks. I did that, and I have attached the thread dump images. That
was the intention of my question: asking for help to identify which
waiting thread is the culprit.

Regards,
Umesh

On Sat, Jan 2, 2016 at 8:38 AM, Prabhu Joseph <pr...@gmail.com>
wrote:

> Take thread dumps of the executor process several times within a short
> period and check what each thread is doing at the different times. This
> will help identify the expensive sections in the user code.
>
> Thanks,
> Prabhu Joseph

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

Posted by Prabhu Joseph <pr...@gmail.com>.
Take thread dumps of the executor process several times within a short
period and check what each thread is doing at the different times. This
will help identify the expensive sections in the user code.
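
One rough way to compare several dumps is to count which method sits at the
top of the stack of RUNNABLE threads across all of them; frames that keep
showing up are candidates for the expensive sections. The hot_frames helper
and the file names are assumptions of this sketch, and it relies on the
usual jstack layout where the first "at ..." line follows the
java.lang.Thread.State line.

```shell
# hot_frames FILE...
# Counts the top stack frame of every RUNNABLE thread across the given
# jstack dump files and prints the 20 most frequent ones.
hot_frames() {
  grep -h -A1 'java.lang.Thread.State: RUNNABLE' "$@" \
    | grep -E '^[[:space:]]+at ' \
    | sed -E 's/^[[:space:]]+//' \
    | sort | uniq -c | sort -rn | head -n 20
}

# Usage (assuming the dumps were saved as jstack_*.txt):
#   hot_frames jstack_*.txt
```

Threads that are RUNNABLE but blocked in native I/O also show up in this
count, so the result is a starting point for reading the full traces, not
a verdict.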

Thanks,
Prabhu Joseph

On Sat, Jan 2, 2016 at 3:28 AM, unk1102 <um...@gmail.com> wrote:

> Sorry, please see the attached waiting thread log.
>
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n25851/Screen_Shot_2016-01-02_at_2.jpg
> >

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

Posted by unk1102 <um...@gmail.com>.
Sorry, please see the attached waiting thread log.

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n25851/Screen_Shot_2016-01-02_at_2.jpg> 


