Posted to user@spark.apache.org by Jeroen Miller <bl...@gmail.com> on 2017/12/28 16:06:14 UTC

Spark on EMR suddenly stalling

Dear Sparkers,

Once again in times of desperation, I leave what remains of my mental sanity to this wise and knowledgeable community.

I have a Spark job (on EMR 5.8.0) which had been running daily for months, if not the whole year, with absolutely no supervision. This changed all of a sudden for reasons I do not understand.

The volume of data processed daily has been slowly increasing over the past year but has been stable in the last couple of months. Since I'm only processing the past 8 days' worth of data, I do not think that increased data volume is to blame here. Yes, I did check the volume of data for the past few days.

Here is a short description of the issue.

- The Spark job starts normally and proceeds successfully with the first few stages.
- Once we reach the dreaded stage, all tasks are performed successfully (they typically take no more than 1 minute each), except for the /very/ first one (task 0.0), which never finishes.

Here is what the log looks like (simplified for readability):

----------------------------------------
INFO TaskSetManager: Finished task 243.0 in stage 4.0 (TID 929) in 49412 ms on ... (executor 12) (254/256)
INFO TaskSetManager: Finished task 255.0 in stage 4.0 (TID 941) in 48394 ms on ... (executor 7) (255/256)
INFO ExecutorAllocationManager: Request to remove executorIds: 14
INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 14
INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 14
INFO YarnAllocator: Driver requested a total number of 0 executor(s).
----------------------------------------

Why is that? There is still a task waiting to be completed, right? Isn't an executor needed for that?

Afterwards, all executors are getting killed (dynamic allocation is turned on):

----------------------------------------
INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 14.
INFO ExecutorAllocationManager: Removing executor 14 because it has been idle for 60 seconds (new desired total will be 5)
    .
    .
    .
INFO ExecutorAllocationManager: Request to remove executorIds: 7
INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 7
INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 7
INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 7.
INFO ExecutorAllocationManager: Removing executor 7 because it has been idle for 60 seconds (new desired total will be 1)
INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 7.
INFO DAGScheduler: Executor lost: 7 (epoch 4)
INFO BlockManagerMasterEndpoint: Trying to remove executor 7 from BlockManagerMaster.
INFO YarnClusterScheduler: Executor 7 on ... killed by driver.
INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(7, ..., 44289, None)
INFO BlockManagerMaster: Removed 7 successfully in removeExecutor
INFO ExecutorAllocationManager: Existing executor 7 has been removed (new total is 1)
----------------------------------------

Then, there's nothing more in the driver's log. Nothing. The cluster then runs for hours, with no progress being made and no executors allocated.

Here is what I tried:

    - More memory per executor: from 13 GB to 24 GB by increments.
    - Explicit repartition() on the RDD: from 128 to 256 partitions.

The offending stage used to be a rather innocent-looking keyBy(). After adding some repartition() calls, the offending stage became a mapToPair(). During my last experiments, it turned out that the repartition(256) itself is now the culprit.
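
For illustration, here is a minimal sketch of what such a pipeline might look like in the Java RDD API -- the record type, key extraction and pairing logic are hypothetical, since the actual code is not shown here:

----------------------------------------
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public class OffendingStageSketch {
    // Hypothetical shape of the job around the stalling stage: a keyBy(),
    // an explicit repartition(256), then a mapToPair().
    public static JavaPairRDD<String, Long> run(JavaRDD<String> records) {
        return records
            .keyBy(line -> line.split(",")[0])           // key each record by its first CSV field
            .repartition(256)                            // force a shuffle into 256 partitions
            .mapToPair(kv -> new Tuple2<>(kv._1(), 1L)); // re-pair each (key, record) tuple
    }
}
----------------------------------------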

I like Spark, but its mysteries will manage to send me to a mental hospital one of these days.

Can anyone shed light on what is going on here, or maybe offer some suggestions or pointers to relevant sources of information?

I am completely clueless.

Season's greetings,

Jeroen


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark on EMR suddenly stalling

Posted by Jeroen Miller <bl...@gmail.com>.
Hello Mans,

On 1 Jan 2018, at 17:12, M Singh <ma...@yahoo.com> wrote:
> I am not sure if I missed it - but can you let us know what is your input source and output sink ?

Reading from S3 and writing to S3.

However, the never-ending task 0.0 happens in a stage well before anything is output to S3.

Regards,

Jeroen


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark on EMR suddenly stalling

Posted by M Singh <ma...@yahoo.com.INVALID>.
Hi Jeroen:
I am not sure if I missed it, but can you let us know what your input source and output sink are?
In some cases, I found that saving to S3 was a problem. In those cases, saving the output to the EMR HDFS first and later copying it to S3 using s3-dist-cp solved our issue.
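
For what it is worth, a minimal sketch of that workaround, assuming the output is written with the Dataset API as Parquet; the class name and paths are placeholders:

----------------------------------------
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class HdfsFirstThenS3DistCp {
    // Write the job output to the cluster-local HDFS instead of directly to S3.
    public static void writeOutput(Dataset<Row> result) {
        result.write()
              .mode(SaveMode.Overwrite)
              .parquet("hdfs:///output/daily/");  // instead of "s3://some-bucket/output/daily/"
        // Afterwards, copy it to S3, e.g. as a separate EMR step:
        //   s3-dist-cp --src hdfs:///output/daily/ --dest s3://some-bucket/output/daily/
    }
}
----------------------------------------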

Mans 


Re: Spark on EMR suddenly stalling

Posted by Rohit Karlupia <ro...@qubole.com>.
Here is the list that I would probably try to work through:

   1. Check GC on the offending executor while the task is running. Maybe
   you need even more memory.
   2. Go back to some previous successful run of the job, check the Spark UI
   for the offending stage, and look at max task time/max input/max
   shuffle in/out for the largest task. This will help you understand the
   degree of skew in this stage.
   3. Take a thread dump of the executor from the Spark UI and verify
   whether the task is really doing any work or is stuck in some deadlock.
   Some of the Hive SerDes are not really usable from multi-threaded/multi-use
   Spark executors.
   4. Take a thread dump of the executor from the Spark UI and verify
   whether the task is spilling to disk. Playing with the storage and memory
   fractions or generally increasing the memory will help (see the sketch
   after this list).
   5. Check the disk utilisation on the machine running the executor.
   6. Look for event loss messages in the logs caused by the event queue
   being full. Loss of events can send some of the Spark components into
   really bad states.
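
As a rough sketch of items 4 and 6, these are the properties one might play
with; the values are illustrative only, and the exact name of the event-queue
property differs between Spark versions:

----------------------------------------
import org.apache.spark.SparkConf;

public class TuningSketch {
    public static SparkConf tuned() {
        return new SparkConf()
            // Item 4: adjust the unified memory split (Spark 2.x defaults: 0.6 and 0.5).
            .set("spark.memory.fraction", "0.7")
            .set("spark.memory.storageFraction", "0.3")
            // Item 6: enlarge the listener-bus event queue so events are not dropped.
            // The property is spark.scheduler.listenerbus.eventqueue.size up to Spark 2.2
            // and spark.scheduler.listenerbus.eventqueue.capacity from Spark 2.3 onwards.
            .set("spark.scheduler.listenerbus.eventqueue.size", "100000");
    }
}
----------------------------------------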


thanks,
rohitk



On Sun, Dec 31, 2017 at 12:50 AM, Gourav Sengupta <gourav.sengupta@gmail.com
> wrote:

> Hi,
>
> Please try to use the SPARK UI from the way that AWS EMR recommends, it
> should be available from the resource manager. I never ever had any problem
> working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF
> DEBUGGING.
>
> Sadly, I cannot be of much help unless we go for a screen share session
> over google chat or skype.
>
> Also, I ALWAYS prefer the maximize Resource Allocation setting in EMR to
> be set to true.
>
> Besides that, there is a metrics in the EMR console which shows the number
> of containers getting generated by your job on graphs.
>
>
>
> Regards,
> Gourav Sengupta
>
> On Fri, Dec 29, 2017 at 6:23 PM, Jeroen Miller <bl...@gmail.com>
> wrote:
>
>> Hello,
>>
>> Just a quick update as I did not made much progress yet.
>>
>> On 28 Dec 2017, at 21:09, Gourav Sengupta <go...@gmail.com>
>> wrote:
>> > can you try to then use the EMR version 5.10 instead or EMR version
>> 5.11 instead?
>>
>> Same issue with EMR 5.11.0. Task 0 in one stage never finishes.
>>
>> > can you please try selecting a subnet which is in a different
>> availability zone?
>>
>> I did not try this yet. But why should that make a difference?
>>
>> > if possible just try to increase the number of task instances and see
>> the difference?
>>
>> I tried with 512 partitions -- no difference.
>>
>> > also in case you are using caching,
>>
>> No caching used.
>>
>> > Also can you please report the number of containers that your job is
>> creating by looking at the metrics in the EMR console?
>>
>> 8 containers if I trust the directories in j-xxx/containers/application_x
>> xx/.
>>
>> > Also if you see the spark UI then you can easily see which particular
>> step is taking the longest period of time - you just have to drill in a bit
>> in order to see that. Generally in case shuffling is an issue then it
>> definitely appears in the SPARK UI as I drill into the steps and see which
>> particular one is taking the longest.
>>
>> I always have issues with the Spark UI on EC2 -- it never seems to be up
>> to date.
>>
>> JM
>>
>>
>

Re: Spark on EMR suddenly stalling

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Jeroen,

In case you are using Hive partitions, how many partitions do you have?

Also is there any chance that you might post the code?

Regards,
Gourav Sengupta

On Tue, Jan 2, 2018 at 7:50 AM, Jeroen Miller <bl...@gmail.com>
wrote:

> Hello Gourav,
>
> On 30 Dec 2017, at 20:20, Gourav Sengupta <go...@gmail.com>
> wrote:
> > Please try to use the SPARK UI from the way that AWS EMR recommends, it
> should be available from the resource manager. I never ever had any problem
> working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF
> DEBUGGING.
>
> For some reason sometimes there is absolutely nothing showing up in the
> Spark UI or the UI is not refreshed, e.g. for the current stage is #x while
> the logs shows stage #y (with y > x) is currently under way.
>
> It may very well be that the source of this problem lies between the
> keyboard and the chair, but if this is the case, I do not know how to solve
> this.
>
> > Also, I ALWAYS prefer the maximize Resource Allocation setting in EMR to
> be set to true.
>
> Thanks for the tip -- will try this setting in my next batch of
> experiments!
>
> JM
>
>

Re: Spark on EMR suddenly stalling

Posted by Jeroen Miller <bl...@gmail.com>.
Hello Gourav,

On 30 Dec 2017, at 20:20, Gourav Sengupta <go...@gmail.com> wrote:
> Please try to use the SPARK UI from the way that AWS EMR recommends, it should be available from the resource manager. I never ever had any problem working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF DEBUGGING.

For some reason, sometimes there is absolutely nothing showing up in the Spark UI, or the UI is not refreshed, e.g. the current stage shown is #x while the logs show that stage #y (with y > x) is currently under way.

It may very well be that the source of this problem lies between the keyboard and the chair, but if this is the case, I do not know how to solve this.

> Also, I ALWAYS prefer the maximize Resource Allocation setting in EMR to be set to true. 

Thanks for the tip -- will try this setting in my next batch of experiments!

JM


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark on EMR suddenly stalling

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

Please try to use the SPARK UI in the way that AWS EMR recommends; it
should be available from the resource manager. I never ever had any problem
working with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF
DEBUGGING.

Sadly, I cannot be of much help unless we go for a screen share session
over google chat or skype.

Also, I ALWAYS prefer the maximizeResourceAllocation setting in EMR to be
set to true.

Besides that, there is a metric in the EMR console which shows, on graphs,
the number of containers generated by your job.



Regards,
Gourav Sengupta

On Fri, Dec 29, 2017 at 6:23 PM, Jeroen Miller <bl...@gmail.com>
wrote:

> Hello,
>
> Just a quick update as I did not made much progress yet.
>
> On 28 Dec 2017, at 21:09, Gourav Sengupta <go...@gmail.com>
> wrote:
> > can you try to then use the EMR version 5.10 instead or EMR version 5.11
> instead?
>
> Same issue with EMR 5.11.0. Task 0 in one stage never finishes.
>
> > can you please try selecting a subnet which is in a different
> availability zone?
>
> I did not try this yet. But why should that make a difference?
>
> > if possible just try to increase the number of task instances and see
> the difference?
>
> I tried with 512 partitions -- no difference.
>
> > also in case you are using caching,
>
> No caching used.
>
> > Also can you please report the number of containers that your job is
> creating by looking at the metrics in the EMR console?
>
> 8 containers if I trust the directories in j-xxx/containers/application_
> xxx/.
>
> > Also if you see the spark UI then you can easily see which particular
> step is taking the longest period of time - you just have to drill in a bit
> in order to see that. Generally in case shuffling is an issue then it
> definitely appears in the SPARK UI as I drill into the steps and see which
> particular one is taking the longest.
>
> I always have issues with the Spark UI on EC2 -- it never seems to be up
> to date.
>
> JM
>
>

Re: Spark on EMR suddenly stalling

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Jeroen,

can you try to then use the EMR version 5.10 instead or EMR version 5.11
instead?
can you please try selecting a subnet which is in a different availability
zone?
if possible just try to increase the number of task instances and see the
difference?
also, in case you are using caching, try to see the total amount of space
being used. In the worst-case scenario you may also want to persist
intermediate data into S3 in the default Parquet format (see the sketch
below) and then work through the steps that you think are failing using a
Jupyter or Spark notebook.
Also can you please report the number of containers that your job is
creating by looking at the metrics in the EMR console?
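
A minimal sketch of the Parquet checkpointing idea mentioned above, assuming
the intermediate data can be expressed as a Dataset; the class name and S3
path are placeholders:

----------------------------------------
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class IntermediateSnapshotSketch {
    // Persist an intermediate result to S3 as Parquet so the suspect steps
    // can be re-run in isolation.
    public static void snapshot(Dataset<Row> intermediate) {
        intermediate.write()
                    .mode(SaveMode.Overwrite)
                    .parquet("s3://some-bucket/debug/intermediate-step/");
    }

    // Later, e.g. from a notebook, reload the snapshot and continue from there.
    public static Dataset<Row> reload(SparkSession spark) {
        return spark.read().parquet("s3://some-bucket/debug/intermediate-step/");
    }
}
----------------------------------------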

Also if you see the spark UI then you can easily see which particular step
is taking the longest period of time - you just have to drill in a bit in
order to see that. Generally in case shuffling is an issue then it
definitely appears in the SPARK UI as I drill into the steps and see which
particular one is taking the longest.


Since you do not have a long-running cluster (which I mistook from your
statement about a long-running job), things should be fine.


Regards,
Gourav Sengupta


On Thu, Dec 28, 2017 at 7:43 PM, Jeroen Miller <bl...@gmail.com>
wrote:

> On 28 Dec 2017, at 19:42, Gourav Sengupta <go...@gmail.com>
> wrote:
> > In the EMR cluster what are the other applications that you have enabled
> (like HIVE, FLUME, Livy, etc).
>
> Nothing that I can think of, just a Spark step (unless EMR is doing fancy
> stuff behind my back).
>
> > Are you using SPARK Session?
>
> Yes.
>
> > If yes is your application using cluster mode or client mode?
>
> Cluster mode.
>
> > Have you read the EC2 service level agreement?
>
> I did not -- I doubt it has the answer to my problem though! :-)
>
> > Is your cluster on auto scaling group?
>
> Nope.
>
> > Are you scheduling your job by adding another new step into the EMR
> cluster? Or is it the same job running always triggered by some background
> process?
> > Since EMR are supposed to be ephemeral, have you tried creating a new
> cluster and trying your job in that?
>
> I'm creating a new cluster on demand, specifically for that job. No other
> application runs on it.
>
> JM
>
>

Re: Spark on EMR suddenly stalling

Posted by Jeroen Miller <bl...@gmail.com>.
On 28 Dec 2017, at 19:42, Gourav Sengupta <go...@gmail.com> wrote:
> In the EMR cluster what are the other applications that you have enabled (like HIVE, FLUME, Livy, etc).

Nothing that I can think of, just a Spark step (unless EMR is doing fancy stuff behind my back).

> Are you using SPARK Session?

Yes.

> If yes is your application using cluster mode or client mode?

Cluster mode.

> Have you read the EC2 service level agreement?

I did not -- I doubt it has the answer to my problem though! :-)

> Is your cluster on auto scaling group?

Nope.

> Are you scheduling your job by adding another new step into the EMR cluster? Or is it the same job running always triggered by some background process?
> Since EMR are supposed to be ephemeral, have you tried creating a new cluster and trying your job in that?

I'm creating a new cluster on demand, specifically for that job. No other application runs on it.

JM


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark on EMR suddenly stalling

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Jeroen,

Can I get a few pieces of additional information please?

In the EMR cluster what are the other applications that you have enabled
(like HIVE, FLUME, Livy, etc).
Are you using SPARK Session? If yes is your application using cluster mode
or client mode?
Have you read the EC2 service level agreement?
Is your cluster on auto scaling group?
Are you scheduling your job by adding another new step into the EMR
cluster? Or is it the same job running always triggered by some background
process?
Since EMR are supposed to be ephemeral, have you tried creating a new
cluster and trying your job in that?


Regards,
Gourav Sengupta

On Thu, Dec 28, 2017 at 4:06 PM, Jeroen Miller <bl...@gmail.com>
wrote:

> Dear Sparkers,
>
> Once again in times of desperation, I leave what remains of my mental
> sanity to this wise and knowledgeable community.
>
> I have a Spark job (on EMR 5.8.0) which had been running daily for months,
> if not the whole year, with absolutely no supervision. This changed all of
> sudden for reasons I do not understand.
>
> The volume of data processed daily has been slowly increasing over the
> past year but has been stable in the last couple months. Since I'm only
> processing the past 8 days's worth of data I do not think that increased
> data volume is to blame here. Yes, I did check the volume of data for the
> past few days.
>
> Here is a short description of the issue.
>
> - The Spark job starts normally and proceeds successfully with the first
> few stages.
> - Once we reach the dreaded stage, all tasks are performed successfully
> (they typically take not more than 1 minute each), except for the /very/
> first one (task 0.0) which never finishes.
>
> Here is what the log looks like (simplified for readability):
>
> ----------------------------------------
> INFO TaskSetManager: Finished task 243.0 in stage 4.0 (TID 929) in 49412
> ms on ... (executor 12) (254/256)
> INFO TaskSetManager: Finished task 255.0 in stage 4.0 (TID 941) in 48394
> ms on ... (executor 7) (255/256)
> INFO ExecutorAllocationManager: Request to remove executorIds: 14
> INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 14
> INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed
> is 14
> INFO YarnAllocator: Driver requested a total number of 0 executor(s).
> ----------------------------------------
>
> Why is that? There is still a task waiting to be completed right? Isn't an
> executor needed for that?
>
> Afterwards, all executors are getting killed (dynamic allocation is turned
> on):
>
> ----------------------------------------
> INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 14.
> INFO ExecutorAllocationManager: Removing executor 14 because it has been
> idle for 60 seconds (new desired total will be 5)
>     .
>     .
>     .
> INFO ExecutorAllocationManager: Request to remove executorIds: 7
> INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 7
> INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed
> is 7
> INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 7.
> INFO ExecutorAllocationManager: Removing executor 7 because it has been
> idle for 60 seconds (new desired total will be 1)
> INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 7.
> INFO DAGScheduler: Executor lost: 7 (epoch 4)
> INFO BlockManagerMasterEndpoint: Trying to remove executor 7 from
> BlockManagerMaster.
> INFO YarnClusterScheduler: Executor 7 on ... killed by driver.
> INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(7,
> ..., 44289, None)
> INFO BlockManagerMaster: Removed 7 successfully in removeExecutor
> INFO ExecutorAllocationManager: Existing executor 7 has been removed (new
> total is 1)
> ----------------------------------------
>
> Then, there's nothing more in the driver's log. Nothing. The cluster then
> run for hours, with no progress being made, and no executors allocated.
>
> Here is what I tried:
>
>     - More memory per executor: from 13 GB to 24 GB by increments.
>     - Explicit repartition() on the RDD: from 128 to 256 partitions.
>
> The offending stage used to be a rather innocent looking keyBy(). After
> adding some repartition() the offending stage was then a mapToPair().
> During my last experiments, it turned out the repartition(256) itself is
> now the culprit.
>
> I like Spark, but its mysteries will manage to send me in a mental
> hospital one of those days.
>
> Can anyone shed light on what is going on here, or maybe offer some
> suggestions or pointers to relevant source of information?
>
> I am completely clueless.
>
> Seasons greetings,
>
> Jeroen
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Re: Spark on EMR suddenly stalling

Posted by Jeroen Miller <bl...@gmail.com>.
On 28 Dec 2017, at 19:40, Maximiliano Felice <ma...@gmail.com> wrote:
> I experienced a similar issue a few weeks ago. The situation was a result of a mix of speculative execution and OOM issues in the container.

Interesting! However, I don't have any OOM exceptions in the logs. Does that rule out your hypothesis?

> We've managed to check that when we have speculative execution enabled and some YARN containers which were running speculative tasks died, they did take a chance from the max-attempts number. This wouldn't represent any issue in normal behavior, but it seems that if all the retries were consumed in a task that has started speculative execution, the application itself doesn't fail, but it hangs the task expecting to reschedule it sometime. As the attempts are zero, it never reschedules it and the application itself fails to finish.

Hmm, this sounds like a huge design fail to me, but I'm sure there are very complicated issues that go way over my head.

> 1. Check the number of tasks scheduled. If you see one (or more) tasks missing when you do the final sum, then you might be encountering this issue.
> 2. Check the container logs to see if anything broke. OOM is what failed to me.

I can't find anything in the logs from EMR. Should I expect to find explicit OOM exception messages? 

JM


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark on EMR suddenly stalling

Posted by Maximiliano Felice <ma...@gmail.com>.
Hi Jeroen,

I experienced a similar issue a few weeks ago. The situation was a result
of a mix of speculative execution and OOM issues in the container.

First of all, when an executor takes too much time in Spark, it is handled
by the YARN speculative execution, which will launch a new executor and
allocate it in a new container. In our case, some tasks were throwing OOM
exceptions while executing, not on the executor itself *but on the YARN
container*.

It turns out that YARN will try several times to run an application when
something fails in it. Specifically, it will try
*yarn.resourcemanager.am.max-attempts* times to run the application before
failing, which has a default value of 2 and is not modified in EMR
configurations (check here
<https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml>
).

We've managed to check that when we have speculative execution enabled and
some YARN containers which were running speculative tasks died, they did
take a chance from the *max-attempts *number. This wouldn't represent any
issue in normal behavior, but it seems that if all the retries were
consumed in a task that has started speculative execution, the application
itself doesn't fail, but it hangs the task expecting to reschedule it
sometime. As the attempts are zero, it never reschedules it and the
application itself fails to finish.
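
For reference, these are the two knobs involved in the interaction described
above; the sketch below only shows where they would be set, and the values
are illustrative:

----------------------------------------
import org.apache.spark.SparkConf;

public class SpeculationSketch {
    public static SparkConf conf() {
        return new SparkConf()
            // Spark-side speculative execution (off by default).
            .set("spark.speculation", "false")
            // Cap the number of application attempts Spark asks YARN for; it must not
            // exceed the cluster-wide yarn.resourcemanager.am.max-attempts (default 2)
            // configured in yarn-site.xml.
            .set("spark.yarn.maxAppAttempts", "2");
    }
}
----------------------------------------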

I checked this theory repeatedly, always getting the expected results.
Several times I changed the named YARN configuration, and it always started
speculative retries on this task and hung when reaching the max-attempts
number of broken YARN containers.

I personally think that this issue should be possible to reproduce without
the speculative execution configured.

So, what would I do if I were you?

1. Check the number of tasks scheduled. If you see one (or more) tasks
missing when you do the final sum, then you might be encountering this
issue.
2. Check the *container* logs to see if anything broke. OOM is what failed
for me.
3. Contact AWS EMR support, although in my experience they were of no help
at all.


Hope this helps you a bit!



2017-12-28 14:57 GMT-03:00 Jeroen Miller <bl...@gmail.com>:

> On 28 Dec 2017, at 17:41, Richard Qiao <ri...@gmail.com> wrote:
> > Are you able to specify which path of data filled up?
>
> I can narrow it down to a bunch of files but it's not so straightforward.
>
> > Any logs not rolled over?
>
> I have to manually terminate the cluster but there is nothing more in the
> driver's log when I check it from the AWS console when the cluster is still
> running.
>
> JM
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Re: Spark on EMR suddenly stalling

Posted by Shushant Arora <sh...@gmail.com>.
You may have to recreate your cluster with the configuration below at EMR
creation:
    "Configurations": [
            {
                "Properties": {
                    "maximizeResourceAllocation": "false"
                },
                "Classification": "spark"
            }
        ]

On Fri, Dec 29, 2017 at 11:57 PM, Jeroen Miller <bl...@gmail.com>
wrote:

> On 28 Dec 2017, at 19:25, Patrick Alwell <pa...@hortonworks.com> wrote:
> > Dynamic allocation is great; but sometimes I’ve found explicitly setting
> the num executors, cores per executor, and memory per executor to be a
> better alternative.
>
> No difference with spark.dynamicAllocation.enabled set to false.
>
> JM
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Re: Spark on EMR suddenly stalling

Posted by Jeroen Miller <bl...@gmail.com>.
On 28 Dec 2017, at 19:25, Patrick Alwell <pa...@hortonworks.com> wrote:
> Dynamic allocation is great; but sometimes I’ve found explicitly setting the num executors, cores per executor, and memory per executor to be a better alternative.

No difference with spark.dynamicAllocation.enabled set to false.

JM


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark on EMR suddenly stalling

Posted by Patrick Alwell <pa...@hortonworks.com>.
Jeroen,

Anytime there is a shuffle over the network, Spark moves to a new stage. It seems like you are having issues either pre- or post-shuffle. Have you looked at a resource management tool like Ganglia to determine if this is a memory- or thread-related issue? The Spark UI?

You are using groupByKey(); have you thought of an alternative like aggregateByKey() or combineByKey() to reduce shuffling?
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/avoid_groupbykey_when_performing_an_associative_re/avoid-groupbykey-when-performing-a-group-of-multiple-items-by-key.html
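
In case the job does use groupByKey(), here is a sketch of the kind of substitution meant above; the key/value types and the counting logic are just an example:

----------------------------------------
import org.apache.spark.api.java.JavaPairRDD;

public class AggregateByKeySketch {
    // Count values per key: groupByKey() ships every value across the network,
    // while aggregateByKey() combines locally on each partition before shuffling.
    public static JavaPairRDD<String, Long> countPerKey(JavaPairRDD<String, String> pairs) {
        return pairs.aggregateByKey(0L,
            (acc, value) -> acc + 1,       // fold a value into the per-partition count
            (acc1, acc2) -> acc1 + acc2);  // merge partial counts across partitions
    }
}
----------------------------------------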

Dynamic allocation is great, but sometimes I’ve found explicitly setting the number of executors, cores per executor, and memory per executor to be a better alternative.
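
A sketch of pinning the resources explicitly, assuming a SparkSession-based job; the numbers are placeholders and have to fit the cluster's instance types (these settings are often passed as spark-submit --conf flags instead):

----------------------------------------
import org.apache.spark.sql.SparkSession;

public class ExplicitResourcesSketch {
    public static SparkSession build() {
        return SparkSession.builder()
                .appName("daily-job")
                .config("spark.dynamicAllocation.enabled", "false")
                .config("spark.executor.instances", "8")
                .config("spark.executor.cores", "4")
                .config("spark.executor.memory", "16g")
                // level of parallelism, see the tuning guide linked below
                .config("spark.default.parallelism", "64")
                .getOrCreate();
    }
}
----------------------------------------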

Take a look at the YARN logs as well for the particular executor in question. Executors can have multiple tasks and will often fail if they have more tasks than available threads.

As for partitioning the data: you could also look into your level of parallelism, which is correlated to the splittability (blocks) of the data. This will be based on your largest RDD.
https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism

Spark is like C/C++: you need to manage the memory buffer or the compiler will throw you out ;)
https://spark.apache.org/docs/latest/hardware-provisioning.html

Hang in there, this is the more complicated stage of putting a Spark application into production. The YARN logs should point you in the right direction.

It’s tough to debug over email, so hopefully this information is helpful.

-Pat


On 12/28/17, 9:57 AM, "Jeroen Miller" <bl...@gmail.com> wrote:

    On 28 Dec 2017, at 17:41, Richard Qiao <ri...@gmail.com> wrote:
    > Are you able to specify which path of data filled up?
    
    I can narrow it down to a bunch of files but it's not so straightforward.
    
    > Any logs not rolled over?
    
    I have to manually terminate the cluster but there is nothing more in the driver's log when I check it from the AWS console when the cluster is still running. 
    
    JM
    
    
    ---------------------------------------------------------------------
    To unsubscribe e-mail: user-unsubscribe@spark.apache.org
    
    


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark on EMR suddenly stalling

Posted by Jeroen Miller <bl...@gmail.com>.
On 28 Dec 2017, at 17:41, Richard Qiao <ri...@gmail.com> wrote:
> Are you able to specify which path of data filled up?

I can narrow it down to a bunch of files but it's not so straightforward.

> Any logs not rolled over?

I have to manually terminate the cluster, but there is nothing more in the driver's log when I check it from the AWS console while the cluster is still running.

JM


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org