Posted to user@spark.apache.org by Yash Sharma <ya...@gmail.com> on 2016/09/24 00:27:55 UTC

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

Have been playing around with configs to crack this. Adding them here in case
they are helpful to others :)
The number of executors and the timeouts seemed to be the core issue.

{code}
--driver-memory 4G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=500 \
--conf spark.core.connection.ack.wait.timeout=6000 \
--conf spark.akka.heartbeat.interval=6000 \
--conf spark.akka.frameSize=100 \
--conf spark.akka.timeout=6000 \
{code}
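
In case it helps, here is a minimal sketch of the same settings applied through
SparkConf instead of spark-submit flags (assuming a Spark 1.x build, where the
spark.akka.* keys still exist; driver memory still has to go on spark-submit,
since the driver JVM is already running by the time this code executes):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: programmatic equivalent of the flags above.
object CappedJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("capped-dynamic-allocation")
      .set("spark.dynamicAllocation.enabled", "true")
      // dynamic allocation also needs the external shuffle service
      // (spark.shuffle.service.enabled=true) on the node managers
      .set("spark.dynamicAllocation.maxExecutors", "500") // the cap that fixed it
      .set("spark.core.connection.ack.wait.timeout", "6000")
      .set("spark.akka.heartbeat.interval", "6000")
      .set("spark.akka.frameSize", "100")
      .set("spark.akka.timeout", "6000")

    val sc = new SparkContext(conf)
    try {
      // ... job body ...
    } finally {
      sc.stop()
    }
  }
}
{code}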

Cheers !

On Fri, Sep 23, 2016 at 7:50 PM, <ad...@augmentiq.co.in> wrote:

> For testing purpose can you run with fix number of executors and try. May
> be 12 executors for testing and let know the status.
>
> Get Outlook for Android <https://aka.ms/ghei36>
>
>
>
> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <ya...@gmail.com>
> wrote:
>
> Thanks Aditya, appreciate the help.
>>
>> I had the exact thought about the huge number of executors requested.
>> I am going with the dynamic executors and not specifying the number of
>> executors. Are you suggesting that I should limit the number of executors
>> when the dynamic allocator requests for more number of executors.
>>
>> Its a 12 node EMR cluster and has more than a Tb of memory.
>>
>>
>>
>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <aditya.calangutkar@augmentiq.
>> co.in> wrote:
>>
>>> Hi Yash,
>>>
>>> What is your total cluster memory and number of cores?
>>> Problem might be with the number of executors you are allocating. The
>>> logs shows it as 168510 which is on very high side. Try reducing your
>>> executors.
>>>
>>>
>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>
>>>> Hi All,
>>>> I have a spark job which runs over a huge bulk of data with Dynamic
>>>> allocation enabled.
>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>> starts*.
>>>>
>>>> Is there anything I can check to debug this problem. There is not a lot
>>>> of information in logs for the exact cause but here is some snapshot below.
>>>>
>>>> Thanks All.
>>>>
>>>> * - by starts I mean when it shows something on the spark web ui,
>>>> before that its just blank page.
>>>>
>>>> Logs here -
>>>>
>>>> {code}
>>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
>>>> thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number
>>>> of 168510 executor(s).
>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
>>>> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 22
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 19
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 18
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 12
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 11
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 20
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 15
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 7
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 8
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 16
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 21
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 6
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 13
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 14
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 9
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 3
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 17
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 1
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 10
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 4
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 2
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>> non-existent executor 5
>>>> 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1
>>>> time(s) in a row.
>>>> java.lang.StackOverflowError
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>> {code}
>>>>
>>>> ... <trimmed logs>
>>>>
>>>> {code}
>>>> 16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
>>>> Attempted to get executor loss reason for executor id 7 at RPC address ,
>>>> but got no response. Marking as slave lost.
>>>> org.apache.spark.SparkException: Fail to find loss reason for
>>>> non-existent executor 7
>>>>         at org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
>>>>         at org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
>>>>         at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>>>>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>>>>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>>>         at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>         at java.lang.Thread.run(Thread.java:745)
>>>> {code}
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

Posted by Yash Sharma <ya...@gmail.com>.
Hi Dhruve, thanks.
I've solved the issue by setting max executors.
I wanted to find some place where I could add this behavior in Spark itself, so
that users don't have to worry about setting max executors themselves.
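
Roughly the behaviour I had in mind, as a purely illustrative sketch (the class
and method names below are mine, not the actual YarnAllocator code): clamp
whatever total the driver asks for against a configured ceiling before
recording it.

{code}
// Illustrative only -- not the real allocator; names are hypothetical.
// The idea: never record a request larger than a configured ceiling.
class CappedAllocator(configuredMaxExecutors: Int) {
  private var targetNumExecutors = 0

  def requestTotal(requestedTotal: Int): Int = synchronized {
    val granted = math.min(requestedTotal, configuredMaxExecutors)
    if (granted != requestedTotal) {
      println(s"Capping executor request from $requestedTotal to $granted")
    }
    targetNumExecutors = granted
    targetNumExecutors
  }
}

// e.g. new CappedAllocator(500).requestTotal(168510) returns 500
{code}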

Cheers

- Thanks, via mobile,  excuse brevity.

On Sep 24, 2016 1:15 PM, "dhruve ashar" <dh...@gmail.com> wrote:

> From your log, its trying to launch every executor with approximately
> 6.6GB of memory. 168510 is an extremely huge no. executors and 168510 x
> 6.6GB is unrealistic for a 12 node cluster.
> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>
> I don't know the size of the data that you are processing here.
>
> Here are some general choices that I would start with.
>
> Start with a smaller no. of minimum executors and assign them reasonable
> memory. This can be around 48 assuming 12 nodes x 4 cores each. You could
> start with processing a subset of your data and see if you are able to get
> a decent performance. Then gradually increase the maximum # of execs for
> dynamic allocation and process the remaining data.
>
>
>
>
> On Fri, Sep 23, 2016 at 7:54 PM, Yash Sharma <ya...@gmail.com> wrote:
>
>> Is there anywhere I can help fix this ?
>>
>> I can see the requests being made in the yarn allocator. What should be
>> the upperlimit of the requests made ?
>>
>> https://github.com/apache/spark/blob/master/yarn/src/main/
>> scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L222
>>
>> On Sat, Sep 24, 2016 at 10:27 AM, Yash Sharma <ya...@gmail.com> wrote:
>>
>>> Have been playing around with configs to crack this. Adding them here
>>> where it would be helpful to others :)
>>> Number of executors and timeout seemed like the core issue.
>>>
>>> {code}
>>> --driver-memory 4G \
>>> --conf spark.dynamicAllocation.enabled=true \
>>> --conf spark.dynamicAllocation.maxExecutors=500 \
>>> --conf spark.core.connection.ack.wait.timeout=6000 \
>>> --conf spark.akka.heartbeat.interval=6000 \
>>> --conf spark.akka.frameSize=100 \
>>> --conf spark.akka.timeout=6000 \
>>> {code}
>>>
>>> Cheers !
>>>
>>> On Fri, Sep 23, 2016 at 7:50 PM, <ad...@augmentiq.co.in>
>>> wrote:
>>>
>>>> For testing purpose can you run with fix number of executors and try.
>>>> May be 12 executors for testing and let know the status.
>>>>
>>>> Get Outlook for Android <https://aka.ms/ghei36>
>>>>
>>>>
>>>>
>>>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <yash360@gmail.com
>>>> > wrote:
>>>>
>>>> Thanks Aditya, appreciate the help.
>>>>>
>>>>> I had the exact thought about the huge number of executors requested.
>>>>> I am going with the dynamic executors and not specifying the number of
>>>>> executors. Are you suggesting that I should limit the number of executors
>>>>> when the dynamic allocator requests for more number of executors.
>>>>>
>>>>> Its a 12 node EMR cluster and has more than a Tb of memory.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <aditya.calangutkar@augmentiq.
>>>>> co.in> wrote:
>>>>>
>>>>>> Hi Yash,
>>>>>>
>>>>>> What is your total cluster memory and number of cores?
>>>>>> Problem might be with the number of executors you are allocating. The
>>>>>> logs shows it as 168510 which is on very high side. Try reducing your
>>>>>> executors.
>>>>>>
>>>>>>
>>>>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>> I have a spark job which runs over a huge bulk of data with Dynamic
>>>>>>> allocation enabled.
>>>>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>>>>> starts*.
>>>>>>>
>>>>>>> Is there anything I can check to debug this problem. There is not a
>>>>>>> lot of information in logs for the exact cause but here is some snapshot
>>>>>>> below.
>>>>>>>
>>>>>>> Thanks All.
>>>>>>>
>>>>>>> * - by starts I mean when it shows something on the spark web ui,
>>>>>>> before that its just blank page.
>>>>>>>
>>>>>>> Logs here -
>>>>>>>
>>>>>>> {code}
>>>>>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
>>>>>>> thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total
>>>>>>> number of 168510 executor(s).
>>>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
>>>>>>> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 22
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 19
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 18
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 12
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 11
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 20
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 15
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 7
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 8
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 16
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 21
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 6
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 13
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 14
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 9
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 3
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 17
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 1
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 10
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 4
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 2
>>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>>> for non-existent executor 5
>>>>>>> 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1
>>>>>>> time(s) in a row.
>>>>>>> java.lang.StackOverflowError
>>>>>>>         at scala.collection.MapLike$Mappe
>>>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.MapLike$Mappe
>>>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.TraversableLi
>>>>>>> ke$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>>         at scala.collection.MapLike$Mappe
>>>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.MapLike$Mappe
>>>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.TraversableLi
>>>>>>> ke$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>>         at scala.collection.MapLike$Mappe
>>>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.MapLike$Mappe
>>>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.TraversableLi
>>>>>>> ke$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>>         at scala.collection.MapLike$Mappe
>>>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>>         at scala.collection.MapLike$Mappe
>>>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>> {code}
>>>>>>>
>>>>>>> ... <trimmed logs>
>>>>>>>
>>>>>>> {code}
>>>>>>> 16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
>>>>>>> Attempted to get executor loss reason for executor id 7 at RPC address ,
>>>>>>> but got no response. Marking as slave lost.
>>>>>>> org.apache.spark.SparkException: Fail to find loss reason for
>>>>>>> non-existent executor 7
>>>>>>>         at org.apache.spark.deploy.yarn.Y
>>>>>>> arnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
>>>>>>>         at org.apache.spark.deploy.yarn.A
>>>>>>> pplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.apply
>>>>>>> OrElse(ApplicationMaster.scala:632)
>>>>>>>         at org.apache.spark.rpc.netty.Inb
>>>>>>> ox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>>>>>>>         at org.apache.spark.rpc.netty.Inb
>>>>>>> ox.safelyCall(Inbox.scala:204)
>>>>>>>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>>>>>>         at org.apache.spark.rpc.netty.Dis
>>>>>>> patcher$MessageLoop.run(Dispatcher.scala:215)
>>>>>>>         at java.util.concurrent.ThreadPoo
>>>>>>> lExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>         at java.util.concurrent.ThreadPoo
>>>>>>> lExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>>> {code}
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> -Dhruve Ashar
>
>

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

Posted by Yash Sharma <ya...@gmail.com>.
Is there anywhere I can help fix this?

I can see the requests being made in the YarnAllocator. What should be the
upper limit on the requests made?

https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L222
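
For context, as far as I can tell the ceiling the allocator honours is
spark.dynamicAllocation.maxExecutors, which defaults to Int.MaxValue, i.e.
effectively unbounded unless it is set. A quick way to check what a job is
actually running with (paste into spark-shell):

{code}
// Shows the ceiling dynamic allocation is using for the current job;
// the default is effectively unbounded.
val ceiling = sc.getConf.getInt("spark.dynamicAllocation.maxExecutors", Int.MaxValue)
println(s"spark.dynamicAllocation.maxExecutors = $ceiling")
{code}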

On Sat, Sep 24, 2016 at 10:27 AM, Yash Sharma <ya...@gmail.com> wrote:

> Have been playing around with configs to crack this. Adding them here
> where it would be helpful to others :)
> Number of executors and timeout seemed like the core issue.
>
> {code}
> --driver-memory 4G \
> --conf spark.dynamicAllocation.enabled=true \
> --conf spark.dynamicAllocation.maxExecutors=500 \
> --conf spark.core.connection.ack.wait.timeout=6000 \
> --conf spark.akka.heartbeat.interval=6000 \
> --conf spark.akka.frameSize=100 \
> --conf spark.akka.timeout=6000 \
> {code}
>
> Cheers !
>
> On Fri, Sep 23, 2016 at 7:50 PM, <ad...@augmentiq.co.in>
> wrote:
>
>> For testing purpose can you run with fix number of executors and try. May
>> be 12 executors for testing and let know the status.
>>
>> Get Outlook for Android <https://aka.ms/ghei36>
>>
>>
>>
>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <ya...@gmail.com>
>> wrote:
>>
>> Thanks Aditya, appreciate the help.
>>>
>>> I had the exact thought about the huge number of executors requested.
>>> I am going with the dynamic executors and not specifying the number of
>>> executors. Are you suggesting that I should limit the number of executors
>>> when the dynamic allocator requests for more number of executors.
>>>
>>> Its a 12 node EMR cluster and has more than a Tb of memory.
>>>
>>>
>>>
>>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <aditya.calangutkar@augmentiq.
>>> co.in> wrote:
>>>
>>>> Hi Yash,
>>>>
>>>> What is your total cluster memory and number of cores?
>>>> Problem might be with the number of executors you are allocating. The
>>>> logs shows it as 168510 which is on very high side. Try reducing your
>>>> executors.
>>>>
>>>>
>>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>>
>>>>> Hi All,
>>>>> I have a spark job which runs over a huge bulk of data with Dynamic
>>>>> allocation enabled.
>>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>>> starts*.
>>>>>
>>>>> Is there anything I can check to debug this problem. There is not a
>>>>> lot of information in logs for the exact cause but here is some snapshot
>>>>> below.
>>>>>
>>>>> Thanks All.
>>>>>
>>>>> * - by starts I mean when it shows something on the spark web ui,
>>>>> before that its just blank page.
>>>>>
>>>>> Logs here -
>>>>>
>>>>> {code}
>>>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
>>>>> thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number
>>>>> of 168510 executor(s).
>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
>>>>> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 22
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 19
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 18
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 12
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 11
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 20
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 15
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 7
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 8
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 16
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 21
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 6
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 13
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 14
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 9
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 3
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 17
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 1
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 10
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 4
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 2
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 5
>>>>> 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1
>>>>> time(s) in a row.
>>>>> java.lang.StackOverflowError
>>>>>         at scala.collection.MapLike$Mappe
>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.MapLike$Mappe
>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.TraversableLi
>>>>> ke$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>         at scala.collection.MapLike$Mappe
>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.MapLike$Mappe
>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.TraversableLi
>>>>> ke$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>         at scala.collection.MapLike$Mappe
>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.MapLike$Mappe
>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.TraversableLi
>>>>> ke$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>         at scala.collection.MapLike$Mappe
>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.MapLike$Mappe
>>>>> dValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>> {code}
>>>>>
>>>>> ... <trimmed logs>
>>>>>
>>>>> {code}
>>>>> 16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
>>>>> Attempted to get executor loss reason for executor id 7 at RPC address ,
>>>>> but got no response. Marking as slave lost.
>>>>> org.apache.spark.SparkException: Fail to find loss reason for
>>>>> non-existent executor 7
>>>>>         at org.apache.spark.deploy.yarn.Y
>>>>> arnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
>>>>>         at org.apache.spark.deploy.yarn.A
>>>>> pplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.apply
>>>>> OrElse(ApplicationMaster.scala:632)
>>>>>         at org.apache.spark.rpc.netty.Inb
>>>>> ox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>>>>>         at org.apache.spark.rpc.netty.Inb
>>>>> ox.safelyCall(Inbox.scala:204)
>>>>>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>>>>         at org.apache.spark.rpc.netty.Dis
>>>>> patcher$MessageLoop.run(Dispatcher.scala:215)
>>>>>         at java.util.concurrent.ThreadPoo
>>>>> lExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>         at java.util.concurrent.ThreadPoo
>>>>> lExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>> {code}
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

Posted by Yash Sharma <ya...@gmail.com>.
We have too many (large) files. We have about 30k partitions with about 4
years' worth of data, and we need to process the entire history in a one-time
monolithic job.

I would like to know how Spark decides the number of executors it requests.
I've seen test cases where the max executor count is Integer's max value, and
was wondering if we can compute an appropriate max executor count based on
the cluster resources.
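
Something along these lines is what I'm imagining, as a minimal sketch (the
object and parameter names are hypothetical, not existing Spark code),
assuming the cluster totals are known up front:

{code}
// Hypothetical sketch: derive a ceiling for dynamic allocation from known
// cluster totals instead of defaulting to Int.MaxValue. Names are illustrative.
object ExecutorCap {
  def maxExecutors(totalClusterMemoryMb: Long,
                   totalClusterCores: Int,
                   executorMemoryMb: Long, // including overhead, e.g. 6758
                   executorCores: Int): Int = {
    val byMemory = totalClusterMemoryMb / executorMemoryMb
    val byCores = (totalClusterCores / executorCores).toLong
    math.max(1L, math.min(byMemory, byCores)).toInt
  }
}

// e.g. ~1 TB of memory and, say, 384 cores across 12 nodes, with 2-core /
// 6758 MB executors: ExecutorCap.maxExecutors(1048576L, 384, 6758L, 2) == 155,
// a far more reasonable ceiling than 168510.
{code}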

Would be happy to contribute back if I can get some info on how the executor
requests are made.

Cheers


On Sat, Sep 24, 2016, 6:39 PM ayan guha <gu...@gmail.com> wrote:

> Do you have too many small files you are trying to read? Number of
> executors are very high
> On 24 Sep 2016 10:28, "Yash Sharma" <ya...@gmail.com> wrote:
>
>> Have been playing around with configs to crack this. Adding them here
>> where it would be helpful to others :)
>> Number of executors and timeout seemed like the core issue.
>>
>> {code}
>> --driver-memory 4G \
>> --conf spark.dynamicAllocation.enabled=true \
>> --conf spark.dynamicAllocation.maxExecutors=500 \
>> --conf spark.core.connection.ack.wait.timeout=6000 \
>> --conf spark.akka.heartbeat.interval=6000 \
>> --conf spark.akka.frameSize=100 \
>> --conf spark.akka.timeout=6000 \
>> {code}
>>
>> Cheers !
>>
>> On Fri, Sep 23, 2016 at 7:50 PM, <ad...@augmentiq.co.in>
>> wrote:
>>
>>> For testing purpose can you run with fix number of executors and try.
>>> May be 12 executors for testing and let know the status.
>>>
>>> Get Outlook for Android <https://aka.ms/ghei36>
>>>
>>>
>>>
>>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <ya...@gmail.com>
>>> wrote:
>>>
>>> Thanks Aditya, appreciate the help.
>>>>
>>>> I had the exact thought about the huge number of executors requested.
>>>> I am going with the dynamic executors and not specifying the number of
>>>> executors. Are you suggesting that I should limit the number of executors
>>>> when the dynamic allocator requests for more number of executors.
>>>>
>>>> Its a 12 node EMR cluster and has more than a Tb of memory.
>>>>
>>>>
>>>>
>>>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <
>>>> aditya.calangutkar@augmentiq.co.in> wrote:
>>>>
>>>>> Hi Yash,
>>>>>
>>>>> What is your total cluster memory and number of cores?
>>>>> Problem might be with the number of executors you are allocating. The
>>>>> logs shows it as 168510 which is on very high side. Try reducing your
>>>>> executors.
>>>>>
>>>>>
>>>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>>>
>>>>>> Hi All,
>>>>>> I have a spark job which runs over a huge bulk of data with Dynamic
>>>>>> allocation enabled.
>>>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>>>> starts*.
>>>>>>
>>>>>> Is there anything I can check to debug this problem. There is not a
>>>>>> lot of information in logs for the exact cause but here is some snapshot
>>>>>> below.
>>>>>>
>>>>>> Thanks All.
>>>>>>
>>>>>> * - by starts I mean when it shows something on the spark web ui,
>>>>>> before that its just blank page.
>>>>>>
>>>>>> Logs here -
>>>>>>
>>>>>> {code}
>>>>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
>>>>>> thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number
>>>>>> of 168510 executor(s).
>>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
>>>>>> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 22
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 19
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 18
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 12
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 11
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 20
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 15
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 7
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 8
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 16
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 21
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 6
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 13
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 14
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 9
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 3
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 17
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 1
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 10
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 4
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 2
>>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason
>>>>>> for non-existent executor 5
>>>>>> 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1
>>>>>> time(s) in a row.
>>>>>> java.lang.StackOverflowError
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>         at
>>>>>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>         at
>>>>>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>         at
>>>>>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>>         at
>>>>>> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>> {code}
>>>>>>
>>>>>> ... <trimmed logs>
>>>>>>
>>>>>> {code}
>>>>>> 16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
>>>>>> Attempted to get executor loss reason for executor id 7 at RPC address ,
>>>>>> but got no response. Marking as slave lost.
>>>>>> org.apache.spark.SparkException: Fail to find loss reason for
>>>>>> non-existent executor 7
>>>>>>         at
>>>>>> org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
>>>>>>         at
>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
>>>>>>         at
>>>>>> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>>>>>>         at
>>>>>> org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>>>>>>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>>>>>         at
>>>>>> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>>>>>>         at
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>         at
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>> {code}
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>

>>>>>>         at
>>>>>> org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>>>>>>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>>>>>         at
>>>>>> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>>>>>>         at
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>         at
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>>> {code}
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>

Re: Spark job fails as soon as it starts. Driver requested a total number of 168510 executor

Posted by ayan guha <gu...@gmail.com>.
Do you have too many small files that you are trying to read? The number of
executors requested is very high.
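
A quick way to check would be something like the snippet below (a rough sketch
for spark-shell; the input path and the coalesce target are placeholders, not
taken from your job):

{code}
// Paste into spark-shell on the cluster; sc is the SparkContext it provides.
// The path and the coalesce target below are placeholders.
val input = sc.textFile("s3://your-bucket/path/to/input")

// Each small file becomes at least one partition, and dynamic allocation
// sizes its executor requests from the pending tasks, so a huge partition
// count can turn into a huge executor request.
println("input partitions: " + input.getNumPartitions)  // Spark 1.6+; older: input.partitions.size

// If the count is huge, compact before the heavy work; 200 is only an example.
val compacted = input.coalesce(200)
println("after coalesce: " + compacted.getNumPartitions)
{code}

If that partition count is in the same ballpark as the 168510 executors in
your log, the small files are the likely culprit.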
On 24 Sep 2016 10:28, "Yash Sharma" <ya...@gmail.com> wrote:

> Have been playing around with configs to crack this. Adding them here in
> case they're helpful to others :)
> The number of executors and the timeouts seemed to be the core issue.
>
> {code}
> --driver-memory 4G \
> --conf spark.dynamicAllocation.enabled=true \
> --conf spark.dynamicAllocation.maxExecutors=500 \
> --conf spark.core.connection.ack.wait.timeout=6000 \
> --conf spark.akka.heartbeat.interval=6000 \
> --conf spark.akka.frameSize=100 \
> --conf spark.akka.timeout=6000 \
> {code}
>
> Cheers !
>
> On Fri, Sep 23, 2016 at 7:50 PM, <ad...@augmentiq.co.in>
> wrote:
>
>> For testing purposes, can you run with a fixed number of executors and try?
>> Maybe 12 executors for testing, and let us know the status.
>>
>>
>>
>>
>> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <ya...@gmail.com>
>> wrote:
>>
>> Thanks Aditya, appreciate the help.
>>>
>>> I had the same thought about the huge number of executors requested.
>>> I am using dynamic allocation and not specifying the number of
>>> executors. Are you suggesting that I should cap the number of executors
>>> that the dynamic allocator can request?
>>>
>>> It's a 12-node EMR cluster with more than a TB of memory.
>>>
>>>
>>>
>>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <aditya.calangutkar@augmentiq.
>>> co.in> wrote:
>>>
>>>> Hi Yash,
>>>>
>>>> What is your total cluster memory and number of cores?
>>>> The problem might be with the number of executors you are allocating. The
>>>> log shows 168510, which is on the very high side. Try reducing the number
>>>> of executors.
>>>>
>>>>
>>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>>
>>>>> Hi All,
>>>>> I have a Spark job which runs over a huge bulk of data with dynamic
>>>>> allocation enabled.
>>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>>> starts*.
>>>>>
>>>>> Is there anything I can check to debug this problem? There is not a
>>>>> lot of information in the logs about the exact cause, but here is a
>>>>> snapshot below.
>>>>>
>>>>> Thanks All.
>>>>>
>>>>> * - by "starts" I mean when it shows something on the Spark web UI;
>>>>> before that it's just a blank page.
>>>>>
>>>>> Logs here -
>>>>>
>>>>> {code}
>>>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter
>>>>> thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number
>>>>> of 168510 executor(s).
>>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor
>>>>> containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 22
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 19
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 18
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 12
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 11
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 20
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 15
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 7
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 8
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 16
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 21
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 6
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 13
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 14
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 9
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 3
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 17
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 1
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 10
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 4
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 2
>>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for
>>>>> non-existent executor 5
>>>>> 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1
>>>>> time(s) in a row.
>>>>> java.lang.StackOverflowError
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>> {code}
>>>>>
>>>>> ... <trimmed logs>
>>>>>
>>>>> {code}
>>>>> 16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
>>>>> Attempted to get executor loss reason for executor id 7 at RPC address ,
>>>>> but got no response. Marking as slave lost.
>>>>> org.apache.spark.SparkException: Fail to find loss reason for
>>>>> non-existent executor 7
>>>>>         at org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
>>>>>         at org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
>>>>>         at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>>>>>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>>>>>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>>>>         at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>> {code}
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>