Posted to user@spark.apache.org by Rohit Karlupia <ro...@qubole.com> on 2018/01/01 15:41:16 UTC

Re: Spark on EMR suddenly stalling

Here is the list that I will probably try to fill:

   1. Check GC on the offending executor while the task is running. Maybe
   you need even more memory.
   2. Go back to some previous successful run of the job, check the
   Spark UI for the offending stage, and compare max task time/max input/max
   shuffle in/out for the largest task. This will help you understand the
   degree of skew in this stage.
   3. Take a thread dump of the executor from the Spark UI and verify
   whether the task is really doing any work or is stuck in some deadlock.
   Some of the Hive SerDes are not really usable from multi-threaded/
   multi-use Spark executors.
   4. Take a thread dump of the executor from the Spark UI and verify
   whether the task is spilling to disk. Playing with the storage and
   memory fractions, or generally increasing the memory, will help.
   5. Check the disk utilisation on the machine running the executor.
   6. Look for event-loss messages in the logs caused by a full event
   queue. Loss of events can send some of the Spark components into
   really bad states.
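
The skew check in step 2 can be rough-sketched offline: given the per-task
metrics read off the Spark UI stage page, compare the largest task against
the median. This is plain Python with made-up numbers, not tied to any
Spark API:

```python
# Rough skew check: compare the largest task against the median.
# Input is any list of per-task metrics (durations in seconds,
# shuffle bytes, input records, ...) copied from the Spark UI.
from statistics import median

def skew_ratio(task_metrics):
    """Return max/median for a list of per-task metrics."""
    if not task_metrics:
        raise ValueError("no task metrics given")
    med = median(task_metrics)
    if med == 0:
        return float("inf")
    return max(task_metrics) / med

# Hypothetical task durations for the offending stage: one straggler.
durations = [12, 14, 13, 15, 11, 13, 900, 14]
ratio = skew_ratio(durations)
print(f"skew ratio (max/median): {ratio:.1f}")
if ratio > 10:
    print("heavy skew: one task dominates the stage")
```

A ratio near 1 means the stage is well balanced; a large ratio (the
threshold of 10 above is arbitrary) suggests a skewed key or partition.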


thanks,
rohitk



On Sun, Dec 31, 2017 at 12:50 AM, Gourav Sengupta <gourav.sengupta@gmail.com
> wrote:

> Hi,
>
> Please try to use the Spark UI in the way that AWS EMR recommends; it
> should be available from the resource manager. I never ever had any problem
> working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF
> DEBUGGING.
>
> Sadly, I cannot be of much help unless we go for a screen-share session
> over Google Chat or Skype.
>
> Also, I ALWAYS prefer the maximizeResourceAllocation setting in EMR to
> be set to true.
>
> Besides that, there is a metric in the EMR console which shows, as graphs,
> the number of containers spawned by your job.
>
>
>
> Regards,
> Gourav Sengupta
>
> On Fri, Dec 29, 2017 at 6:23 PM, Jeroen Miller <bl...@gmail.com>
> wrote:
>
>> Hello,
>>
>> Just a quick update, as I have not made much progress yet.
>>
>> On 28 Dec 2017, at 21:09, Gourav Sengupta <go...@gmail.com>
>> wrote:
>> > can you try to use EMR version 5.10 or EMR version 5.11 instead?
>>
>> Same issue with EMR 5.11.0. Task 0 in one stage never finishes.
>>
>> > can you please try selecting a subnet which is in a different
>> availability zone?
>>
>> I did not try this yet. But why should that make a difference?
>>
>> > if possible just try to increase the number of task instances and see
>> the difference?
>>
>> I tried with 512 partitions -- no difference.
>>
>> > also in case you are using caching,
>>
>> No caching used.
>>
>> > Also can you please report the number of containers that your job is
>> creating by looking at the metrics in the EMR console?
>>
>> 8 containers, if I trust the directories in
>> j-xxx/containers/application_xxx/.
>>
>> > Also if you see the spark UI then you can easily see which particular
>> step is taking the longest period of time - you just have to drill in a bit
>> in order to see that. Generally in case shuffling is an issue then it
>> definitely appears in the SPARK UI as I drill into the steps and see which
>> particular one is taking the longest.
>>
>> I always have issues with the Spark UI on EC2 -- it never seems to be up
>> to date.
>>
>> JM
>>
>>
>

Re: Spark on EMR suddenly stalling

Posted by Jeroen Miller <bl...@gmail.com>.
Hello Mans,

On 1 Jan 2018, at 17:12, M Singh <ma...@yahoo.com> wrote:
> I am not sure if I missed it, but can you let us know what your input source and output sink are?

Reading from S3 and writing to S3.

However, the never-ending task 0.0 happens in a stage well before anything is output to S3.

Regards,

Jeroen


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark on EMR suddenly stalling

Posted by M Singh <ma...@yahoo.com.INVALID>.
Hi Jeroen:
I am not sure if I missed it, but can you let us know what your input source and output sink are?
In some cases, I found that saving to S3 was a problem. In that case I started saving the output to the EMR cluster's HDFS and later copied it to S3 using s3-dist-cp, which solved our issue.
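
The two-step workaround above can be sketched as follows (the HDFS path and
bucket name are placeholders; s3-dist-cp ships with EMR):

```shell
# 1. Inside the Spark job, write the output to the cluster-local HDFS
#    instead of S3, e.g. df.write.parquet("hdfs:///tmp/job-output").

# 2. After the job finishes, bulk-copy the result to S3 with s3-dist-cp,
#    which runs as a separate MapReduce job and avoids slow per-task
#    S3 commits. Paths and bucket are hypothetical.
s3-dist-cp \
  --src hdfs:///tmp/job-output \
  --dest s3://my-bucket/job-output
```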

Mans 
