Posted to user@spark.apache.org by Arun Luthra <ar...@gmail.com> on 2016/01/21 23:02:56 UTC

TaskCommitDenied (Driver denied task commit)

Example warning:

16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
4436, XXXXXXX): TaskCommitDenied (Driver denied task commit) for job: 1,
partition: 2168, attempt: 4436


Is there a solution for this? Increase driver memory? I'm using just 1G
driver memory but ideally I won't have to increase it.

The RDD being processed has 2262 partitions.

Arun

Re: TaskCommitDenied (Driver denied task commit)

Posted by Arun Luthra <ar...@gmail.com>.
Correction: I have to use spark.yarn.am.memoryOverhead because I'm in YARN
client mode. I set it to 13% of the executor memory.
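
For reference, a minimal sketch of how these two overhead settings could be
set together (values are illustrative, not taken from the actual job; 13% of
a 60g executor is roughly 7800 MB, and both overheads are given in MB):

    import org.apache.spark.SparkConf

    // Sketch only: spark.yarn.am.memoryOverhead covers the YARN Application
    // Master in yarn-client mode; spark.yarn.driver.memoryOverhead does not
    // apply there (as the warning later in this thread points out).
    val conf = new SparkConf()
      .set("spark.yarn.am.memoryOverhead", "7800")        // ~13% of a 60g executor, in MB
      .set("spark.yarn.executor.memoryOverhead", "2000")  // per-executor overhead, in MB

The same settings can also be passed on the spark-submit command line with
--conf.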

Increasing the overall executor memory was also quite helpful.

It will be great when the Tungsten enhancements make their way into RDDs.

Thanks!

Arun

On Thu, Jan 21, 2016 at 6:19 PM, Arun Luthra <ar...@gmail.com> wrote:

> Two changes I made that appear to be keeping various errors at bay:
>
> 1) bumped up spark.yarn.executor.memoryOverhead to 2000 in the spirit of
> https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACBYxKLD8qasYMj2GHK__vtTzV4GeJCZCqfAw++s1d5te1dbJA@mail.gmail.com%3E
> even though I couldn't find the same error in my YARN log.
>
> 2) very important: I ran coalesce(1000) on the RDD at the start of the
> DAG. I know keeping the # of partitions lower is helpful, based on past
> experience with groupByKey. I haven't run this pipeline in a bit so that
> rule of thumb was not forefront in my mind.
>
> On Thu, Jan 21, 2016 at 5:35 PM, Arun Luthra <ar...@gmail.com>
> wrote:
>
>> Looking into the yarn logs for a similar job where an executor was
>> associated with the same error, I find:
>>
>> ...
>> 16/01/22 01:17:18 INFO client.TransportClientFactory: Found inactive
>> connection to (SERVER), creating a new one.
>> 16/01/22 01:17:18 *ERROR shuffle.RetryingBlockFetcher: Exception while
>> beginning fetch of 46 outstanding blocks*
>> *java.io.IOException: Failed to connect to (SERVER)*
>>     at
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
>>     at
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>>     at
>> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
>>     at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>>     at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>>     at
>> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:97)
>>     at
>> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:152)
>>     at
>> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:265)
>>     at
>> org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:112)
>>     at
>> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:43)
>>     at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>     at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>     at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>> *Caused by: java.net.ConnectException: Connection refused:* (SERVER)
>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>     at
>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>>     at
>> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>>     at
>> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>>     at
>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>>     at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>>     at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>>     at
>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>>     ... 1 more
>>
>> ...
>>
>>
>> Not sure if this reveals anything at all.
>>
>>
>> On Thu, Jan 21, 2016 at 2:58 PM, Holden Karau <ho...@pigscanfly.ca>
>> wrote:
>>
>>> My hunch is that the TaskCommitDenied is perhaps a red herring and the
>>> problem is groupByKey - but I've also just seen a lot of people be bitten
>>> by it so that might not be the issue. If you just do a count at the point of
>>> the groupByKey, does the pipeline succeed?
>>>
>>> On Thu, Jan 21, 2016 at 2:56 PM, Arun Luthra <ar...@gmail.com>
>>> wrote:
>>>
>>>> Usually the pipeline works, it just failed on this particular input
>>>> data. The other data it has run on is of similar size.
>>>>
>>>> Speculation is enabled.
>>>>
>>>> I'm using Spark 1.5.0.
>>>>
>>>> Here is the config. Many of these may not be needed anymore, they are
>>>> from trying to get things working in Spark 1.2 and 1.3.
>>>>
>>>>         .set("spark.storage.memoryFraction","0.2") // default 0.6
>>>>         .set("spark.shuffle.memoryFraction","0.2") // default 0.2
>>>>         .set("spark.shuffle.manager","SORT") // preferred setting for
>>>> optimized joins
>>>>         .set("spark.shuffle.consolidateFiles","true") // helpful for
>>>> "too many files open"
>>>>         .set("spark.mesos.coarse", "true") // helpful for
>>>> MapOutputTracker errors?
>>>>         .set("spark.akka.frameSize","300") // helpful when using
>>>> consolidateFiles=true
>>>>         .set("spark.shuffle.compress","false") //
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>>>>         .set("spark.file.transferTo","false") //
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>>>>         .set("spark.core.connection.ack.wait.timeout","600") //
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>>>>         .set("spark.speculation","true")
>>>>         .set("spark.worker.timeout","600") //
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
>>>>         .set("spark.akka.timeout","300") //
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
>>>>         .set("spark.storage.blockManagerSlaveTimeoutMs","120000")
>>>>         .set("spark.driver.maxResultSize","2048") // in response to
>>>> error: Total size of serialized results of 39901 tasks (1024.0 MB) is
>>>> bigger than spark.driver.maxResultSize (1024.0 MB)
>>>>         .set("spark.serializer",
>>>> "org.apache.spark.serializer.KryoSerializer")
>>>>         .set("spark.kryo.registrator","------.MyRegistrator")
>>>>         .set("spark.kryo.registrationRequired", "true")
>>>>         .set("spark.yarn.executor.memoryOverhead","600")
>>>>
>>>> On Thu, Jan 21, 2016 at 2:50 PM, Josh Rosen <jo...@databricks.com>
>>>> wrote:
>>>>
>>>>> Is speculation enabled? This TaskCommitDenied by driver error is
>>>>> thrown by writers who lost the race to commit an output partition. I don't
>>>>> think this had anything to do with key skew etc. Replacing the groupbykey
>>>>> with a count will mask this exception because the coordination does not get
>>>>> triggered in non save/write operations.
>>>>>
>>>>> On Thu, Jan 21, 2016 at 2:46 PM Holden Karau <ho...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> Before we dig too far into this, the thing which most quickly jumps
>>>>>> out to me is groupByKey which could be causing some problems - whats the
>>>>>> distribution of keys like? Try replacing the groupByKey with a count() and
>>>>>> see if the pipeline works up until that stage. Also 1G of driver memory is
>>>>>> also a bit small for something with 90 executors...
>>>>>>
>>>>>> On Thu, Jan 21, 2016 at 2:40 PM, Arun Luthra <ar...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load
>>>>>>> native-hadoop library for your platform... using builtin-java classes where
>>>>>>> applicable
>>>>>>>
>>>>>>> 16/01/21 21:52:14 WARN MetricsSystem: Using default name
>>>>>>> DAGScheduler for source because spark.app.id is not set.
>>>>>>>
>>>>>>> spark.yarn.driver.memoryOverhead is set but does not apply in client
>>>>>>> mode.
>>>>>>>
>>>>>>> 16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local
>>>>>>> reads feature cannot be used because libhadoop cannot be loaded.
>>>>>>>
>>>>>>> 16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache
>>>>>>> broadcast_4 in memory! (computed 60.2 MB so far)
>>>>>>>
>>>>>>> 16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to
>>>>>>> disk instead.
>>>>>>>
>>>>>>> [Stage 1:====================================================>(2260
>>>>>>> + 7) / 2262]16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in
>>>>>>> stage 1.0 (TID 4530, --): TaskCommitDenied (Driver denied task commit) for
>>>>>>> job: 1, partition: 1440, attempt: 4530
>>>>>>>
>>>>>>> [Stage 1:====================================================>(2260
>>>>>>> + 6) / 2262]16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in
>>>>>>> stage 1.0 (TID 4531, --): TaskCommitDenied (Driver denied task commit) for
>>>>>>> job: 1, partition: 1488, attempt: 4531
>>>>>>>
>>>>>>> [Stage 1:====================================================>(2261
>>>>>>> + 4) / 2262]16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in
>>>>>>> stage 1.0 (TID 4532, --): TaskCommitDenied (Driver denied task commit) for
>>>>>>> job: 1, partition: 1982, attempt: 4532
>>>>>>>
>>>>>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0
>>>>>>> (TID 4482, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>>>>>> partition: 2214, attempt: 4482
>>>>>>>
>>>>>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0
>>>>>>> (TID 4436, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>>>>>> partition: 2168, attempt: 4436
>>>>>>>
>>>>>>>
>>>>>>> I am running with:
>>>>>>>
>>>>>>>     spark-submit --class "myclass" \
>>>>>>>
>>>>>>>       --num-executors 90 \
>>>>>>>
>>>>>>>       --driver-memory 1g \
>>>>>>>
>>>>>>>       --executor-memory 60g \
>>>>>>>
>>>>>>>       --executor-cores 8 \
>>>>>>>
>>>>>>>       --master yarn-client \
>>>>>>>
>>>>>>>       --conf "spark.executor.extraJavaOptions=-verbose:gc
>>>>>>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>>>>>>>
>>>>>>>       my.jar
>>>>>>>
>>>>>>>
>>>>>>> There are 2262 input files totaling just 98.6G. The DAG is basically
>>>>>>> textFile().map().filter().groupByKey().saveAsTextFile().
>>>>>>>
>>>>>>> On Thu, Jan 21, 2016 at 2:14 PM, Holden Karau <ho...@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Can you post more of your log? How big are the partitions? What is
>>>>>>>> the action you are performing?
>>>>>>>>
>>>>>>>> On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra <arun.luthra@gmail.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Example warning:
>>>>>>>>>
>>>>>>>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage
>>>>>>>>> 1.0 (TID 4436, XXXXXXX): TaskCommitDenied (Driver denied task commit) for
>>>>>>>>> job: 1, partition: 2168, attempt: 4436
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Is there a solution for this? Increase driver memory? I'm using
>>>>>>>>> just 1G driver memory but ideally I won't have to increase it.
>>>>>>>>>
>>>>>>>>> The RDD being processed has 2262 partitions.
>>>>>>>>>
>>>>>>>>> Arun
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Cell : 425-233-8271
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Cell : 425-233-8271
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>

Re: TaskCommitDenied (Driver denied task commit)

Posted by Arun Luthra <ar...@gmail.com>.
Two changes I made that appear to be keeping various errors at bay:

1) bumped up spark.yarn.executor.memoryOverhead to 2000 in the spirit of
https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCACBYxKLD8qasYMj2GHK__vtTzV4GeJCZCqfAw++s1d5te1dbJA@mail.gmail.com%3E
even though I couldn't find the same error in my YARN log.

2) very important: I ran coalesce(1000) on the RDD at the start of the DAG.
I know keeping the # of partitions lower is helpful, based on past
experience with groupByKey. I haven't run this pipeline in a bit so that
rule of thumb was not forefront in my mind.
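
Concretely, change 2) amounts to something like the following (a sketch
assuming an existing SparkContext sc; the path is a placeholder, not from the
actual job):

    // Sketch only: coalesce without shuffle=true just merges existing
    // partitions, so the later groupByKey shuffle works with fewer, larger
    // partitions instead of one per input file.
    val input = sc.textFile("hdfs:///path/to/input")   // placeholder path
      .coalesce(1000)                                  // down from ~2262 input partitions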

On Thu, Jan 21, 2016 at 5:35 PM, Arun Luthra <ar...@gmail.com> wrote:

> Looking into the yarn logs for a similar job where an executor was
> associated with the same error, I find:
>
> ...
> 16/01/22 01:17:18 INFO client.TransportClientFactory: Found inactive
> connection to (SERVER), creating a new one.
> 16/01/22 01:17:18 *ERROR shuffle.RetryingBlockFetcher: Exception while
> beginning fetch of 46 outstanding blocks*
> *java.io.IOException: Failed to connect to (SERVER)*
>     at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
>     at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>     at
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
>     at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>     at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>     at
> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:97)
>     at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:152)
>     at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:265)
>     at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:112)
>     at
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:43)
>     at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:88)
>     at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> *Caused by: java.net.ConnectException: Connection refused:* (SERVER)
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>     at
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>     at
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>     at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>     at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>     at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>     at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>     ... 1 more
>
> ...
>
>
> Not sure if this reveals anything at all.
>
>
> On Thu, Jan 21, 2016 at 2:58 PM, Holden Karau <ho...@pigscanfly.ca>
> wrote:
>
>> My hunch is that the TaskCommitDenied is perhaps a red herring and the
>> problem is groupByKey - but I've also just seen a lot of people be bitten
>> by it so that might not be the issue. If you just do a count at the point of
>> the groupByKey, does the pipeline succeed?
>>
>> On Thu, Jan 21, 2016 at 2:56 PM, Arun Luthra <ar...@gmail.com>
>> wrote:
>>
>>> Usually the pipeline works, it just failed on this particular input
>>> data. The other data it has run on is of similar size.
>>>
>>> Speculation is enabled.
>>>
>>> I'm using Spark 1.5.0.
>>>
>>> Here is the config. Many of these may not be needed anymore, they are
>>> from trying to get things working in Spark 1.2 and 1.3.
>>>
>>>         .set("spark.storage.memoryFraction","0.2") // default 0.6
>>>         .set("spark.shuffle.memoryFraction","0.2") // default 0.2
>>>         .set("spark.shuffle.manager","SORT") // preferred setting for
>>> optimized joins
>>>         .set("spark.shuffle.consolidateFiles","true") // helpful for
>>> "too many files open"
>>>         .set("spark.mesos.coarse", "true") // helpful for
>>> MapOutputTracker errors?
>>>         .set("spark.akka.frameSize","300") // helpful when using
>>> consolidateFiles=true
>>>         .set("spark.shuffle.compress","false") //
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>>>         .set("spark.file.transferTo","false") //
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>>>         .set("spark.core.connection.ack.wait.timeout","600") //
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>>>         .set("spark.speculation","true")
>>>         .set("spark.worker.timeout","600") //
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
>>>         .set("spark.akka.timeout","300") //
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
>>>         .set("spark.storage.blockManagerSlaveTimeoutMs","120000")
>>>         .set("spark.driver.maxResultSize","2048") // in response to
>>> error: Total size of serialized results of 39901 tasks (1024.0 MB) is
>>> bigger than spark.driver.maxResultSize (1024.0 MB)
>>>         .set("spark.serializer",
>>> "org.apache.spark.serializer.KryoSerializer")
>>>         .set("spark.kryo.registrator","------.MyRegistrator")
>>>         .set("spark.kryo.registrationRequired", "true")
>>>         .set("spark.yarn.executor.memoryOverhead","600")
>>>
>>> On Thu, Jan 21, 2016 at 2:50 PM, Josh Rosen <jo...@databricks.com>
>>> wrote:
>>>
>>>> Is speculation enabled? This TaskCommitDenied by driver error is thrown
>>>> by writers who lost the race to commit an output partition. I don't think
>>>> this had anything to do with key skew etc. Replacing the groupbykey with a
>>>> count will mask this exception because the coordination does not get
>>>> triggered in non save/write operations.
>>>>
>>>> On Thu, Jan 21, 2016 at 2:46 PM Holden Karau <ho...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> Before we dig too far into this, the thing which most quickly jumps
>>>>> out to me is groupByKey which could be causing some problems - whats the
>>>>> distribution of keys like? Try replacing the groupByKey with a count() and
>>>>> see if the pipeline works up until that stage. Also 1G of driver memory is
>>>>> also a bit small for something with 90 executors...
>>>>>
>>>>> On Thu, Jan 21, 2016 at 2:40 PM, Arun Luthra <ar...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> 16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
>>>>>> library for your platform... using builtin-java classes where applicable
>>>>>>
>>>>>> 16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler
>>>>>> for source because spark.app.id is not set.
>>>>>>
>>>>>> spark.yarn.driver.memoryOverhead is set but does not apply in client
>>>>>> mode.
>>>>>>
>>>>>> 16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local
>>>>>> reads feature cannot be used because libhadoop cannot be loaded.
>>>>>>
>>>>>> 16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache
>>>>>> broadcast_4 in memory! (computed 60.2 MB so far)
>>>>>>
>>>>>> 16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to
>>>>>> disk instead.
>>>>>>
>>>>>> [Stage 1:====================================================>(2260 +
>>>>>> 7) / 2262]16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in stage
>>>>>> 1.0 (TID 4530, --): TaskCommitDenied (Driver denied task commit) for job:
>>>>>> 1, partition: 1440, attempt: 4530
>>>>>>
>>>>>> [Stage 1:====================================================>(2260 +
>>>>>> 6) / 2262]16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in stage
>>>>>> 1.0 (TID 4531, --): TaskCommitDenied (Driver denied task commit) for job:
>>>>>> 1, partition: 1488, attempt: 4531
>>>>>>
>>>>>> [Stage 1:====================================================>(2261 +
>>>>>> 4) / 2262]16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in stage
>>>>>> 1.0 (TID 4532, --): TaskCommitDenied (Driver denied task commit) for job:
>>>>>> 1, partition: 1982, attempt: 4532
>>>>>>
>>>>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0
>>>>>> (TID 4482, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>>>>> partition: 2214, attempt: 4482
>>>>>>
>>>>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0
>>>>>> (TID 4436, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>>>>> partition: 2168, attempt: 4436
>>>>>>
>>>>>>
>>>>>> I am running with:
>>>>>>
>>>>>>     spark-submit --class "myclass" \
>>>>>>
>>>>>>       --num-executors 90 \
>>>>>>
>>>>>>       --driver-memory 1g \
>>>>>>
>>>>>>       --executor-memory 60g \
>>>>>>
>>>>>>       --executor-cores 8 \
>>>>>>
>>>>>>       --master yarn-client \
>>>>>>
>>>>>>       --conf "spark.executor.extraJavaOptions=-verbose:gc
>>>>>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>>>>>>
>>>>>>       my.jar
>>>>>>
>>>>>>
>>>>>> There are 2262 input files totaling just 98.6G. The DAG is basically
>>>>>> textFile().map().filter().groupByKey().saveAsTextFile().
>>>>>>
>>>>>> On Thu, Jan 21, 2016 at 2:14 PM, Holden Karau <ho...@pigscanfly.ca>
>>>>>> wrote:
>>>>>>
>>>>>>> Can you post more of your log? How big are the partitions? What is
>>>>>>> the action you are performing?
>>>>>>>
>>>>>>> On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra <ar...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Example warning:
>>>>>>>>
>>>>>>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage
>>>>>>>> 1.0 (TID 4436, XXXXXXX): TaskCommitDenied (Driver denied task commit) for
>>>>>>>> job: 1, partition: 2168, attempt: 4436
>>>>>>>>
>>>>>>>>
>>>>>>>> Is there a solution for this? Increase driver memory? I'm using
>>>>>>>> just 1G driver memory but ideally I won't have to increase it.
>>>>>>>>
>>>>>>>> The RDD being processed has 2262 partitions.
>>>>>>>>
>>>>>>>> Arun
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Cell : 425-233-8271
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Cell : 425-233-8271
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>
>>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>

Re: TaskCommitDenied (Driver denied task commit)

Posted by Arun Luthra <ar...@gmail.com>.
Looking into the yarn logs for a similar job where an executor was
associated with the same error, I find:

...
16/01/22 01:17:18 INFO client.TransportClientFactory: Found inactive
connection to (SERVER), creating a new one.
16/01/22 01:17:18 *ERROR shuffle.RetryingBlockFetcher: Exception while
beginning fetch of 46 outstanding blocks*
*java.io.IOException: Failed to connect to (SERVER)*
    at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
    at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
    at
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
    at
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
    at
org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
    at
org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:97)
    at
org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:152)
    at
org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:265)
    at
org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:112)
    at
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:43)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
*Caused by: java.net.ConnectException: Connection refused:* (SERVER)
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
    at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    ... 1 more

...


Not sure if this reveals anything at all.
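
If fetch failures like this keep showing up, the shuffle transport retry
settings are one knob to look at (a sketch only; the values are illustrative,
and these are the property names as of roughly Spark 1.5):

    import org.apache.spark.SparkConf

    // Sketch only: be more patient when a shuffle server is briefly
    // unreachable, instead of failing the fetch after a few quick retries.
    val conf = new SparkConf()
      .set("spark.shuffle.io.maxRetries", "10")   // default is 3
      .set("spark.shuffle.io.retryWait", "10s")   // default is 5s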


On Thu, Jan 21, 2016 at 2:58 PM, Holden Karau <ho...@pigscanfly.ca> wrote:

> My hunch is that the TaskCommitDenied is perhaps a red herring and the
> problem is groupByKey - but I've also just seen a lot of people be bitten
> by it so that might not be the issue. If you just do a count at the point of
> the groupByKey, does the pipeline succeed?
>
> On Thu, Jan 21, 2016 at 2:56 PM, Arun Luthra <ar...@gmail.com>
> wrote:
>
>> Usually the pipeline works, it just failed on this particular input data.
>> The other data it has run on is of similar size.
>>
>> Speculation is enabled.
>>
>> I'm using Spark 1.5.0.
>>
>> Here is the config. Many of these may not be needed anymore, they are
>> from trying to get things working in Spark 1.2 and 1.3.
>>
>>         .set("spark.storage.memoryFraction","0.2") // default 0.6
>>         .set("spark.shuffle.memoryFraction","0.2") // default 0.2
>>         .set("spark.shuffle.manager","SORT") // preferred setting for
>> optimized joins
>>         .set("spark.shuffle.consolidateFiles","true") // helpful for "too
>> many files open"
>>         .set("spark.mesos.coarse", "true") // helpful for
>> MapOutputTracker errors?
>>         .set("spark.akka.frameSize","300") // helpful when using
>> consolidateFiles=true
>>         .set("spark.shuffle.compress","false") //
>> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>>         .set("spark.file.transferTo","false") //
>> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>>         .set("spark.core.connection.ack.wait.timeout","600") //
>> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>>         .set("spark.speculation","true")
>>         .set("spark.worker.timeout","600") //
>> http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
>>         .set("spark.akka.timeout","300") //
>> http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
>>         .set("spark.storage.blockManagerSlaveTimeoutMs","120000")
>>         .set("spark.driver.maxResultSize","2048") // in response to
>> error: Total size of serialized results of 39901 tasks (1024.0 MB) is
>> bigger than spark.driver.maxResultSize (1024.0 MB)
>>         .set("spark.serializer",
>> "org.apache.spark.serializer.KryoSerializer")
>>         .set("spark.kryo.registrator","------.MyRegistrator")
>>         .set("spark.kryo.registrationRequired", "true")
>>         .set("spark.yarn.executor.memoryOverhead","600")
>>
>> On Thu, Jan 21, 2016 at 2:50 PM, Josh Rosen <jo...@databricks.com>
>> wrote:
>>
>>> Is speculation enabled? This TaskCommitDenied by driver error is thrown
>>> by writers who lost the race to commit an output partition. I don't think
>>> this had anything to do with key skew etc. Replacing the groupbykey with a
>>> count will mask this exception because the coordination does not get
>>> triggered in non save/write operations.
>>>
>>> On Thu, Jan 21, 2016 at 2:46 PM Holden Karau <ho...@pigscanfly.ca>
>>> wrote:
>>>
>>>> Before we dig too far into this, the thing which most quickly jumps out
>>>> to me is groupByKey which could be causing some problems - whats the
>>>> distribution of keys like? Try replacing the groupByKey with a count() and
>>>> see if the pipeline works up until that stage. Also 1G of driver memory is
>>>> also a bit small for something with 90 executors...
>>>>
>>>> On Thu, Jan 21, 2016 at 2:40 PM, Arun Luthra <ar...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> 16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
>>>>> library for your platform... using builtin-java classes where applicable
>>>>>
>>>>> 16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler
>>>>> for source because spark.app.id is not set.
>>>>>
>>>>> spark.yarn.driver.memoryOverhead is set but does not apply in client
>>>>> mode.
>>>>>
>>>>> 16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local
>>>>> reads feature cannot be used because libhadoop cannot be loaded.
>>>>>
>>>>> 16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache
>>>>> broadcast_4 in memory! (computed 60.2 MB so far)
>>>>>
>>>>> 16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to
>>>>> disk instead.
>>>>>
>>>>> [Stage 1:====================================================>(2260 +
>>>>> 7) / 2262]16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in stage
>>>>> 1.0 (TID 4530, --): TaskCommitDenied (Driver denied task commit) for job:
>>>>> 1, partition: 1440, attempt: 4530
>>>>>
>>>>> [Stage 1:====================================================>(2260 +
>>>>> 6) / 2262]16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in stage
>>>>> 1.0 (TID 4531, --): TaskCommitDenied (Driver denied task commit) for job:
>>>>> 1, partition: 1488, attempt: 4531
>>>>>
>>>>> [Stage 1:====================================================>(2261 +
>>>>> 4) / 2262]16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in stage
>>>>> 1.0 (TID 4532, --): TaskCommitDenied (Driver denied task commit) for job:
>>>>> 1, partition: 1982, attempt: 4532
>>>>>
>>>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0
>>>>> (TID 4482, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>>>> partition: 2214, attempt: 4482
>>>>>
>>>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0
>>>>> (TID 4436, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>>>> partition: 2168, attempt: 4436
>>>>>
>>>>>
>>>>> I am running with:
>>>>>
>>>>>     spark-submit --class "myclass" \
>>>>>
>>>>>       --num-executors 90 \
>>>>>
>>>>>       --driver-memory 1g \
>>>>>
>>>>>       --executor-memory 60g \
>>>>>
>>>>>       --executor-cores 8 \
>>>>>
>>>>>       --master yarn-client \
>>>>>
>>>>>       --conf "spark.executor.extraJavaOptions=-verbose:gc
>>>>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>>>>>
>>>>>       my.jar
>>>>>
>>>>>
>>>>> There are 2262 input files totaling just 98.6G. The DAG is basically
>>>>> textFile().map().filter().groupByKey().saveAsTextFile().
>>>>>
>>>>> On Thu, Jan 21, 2016 at 2:14 PM, Holden Karau <ho...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> Can you post more of your log? How big are the partitions? What is
>>>>>> the action you are performing?
>>>>>>
>>>>>> On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra <ar...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Example warning:
>>>>>>>
>>>>>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0
>>>>>>> (TID 4436, XXXXXXX): TaskCommitDenied (Driver denied task commit) for job:
>>>>>>> 1, partition: 2168, attempt: 4436
>>>>>>>
>>>>>>>
>>>>>>> Is there a solution for this? Increase driver memory? I'm using just
>>>>>>> 1G driver memory but ideally I won't have to increase it.
>>>>>>>
>>>>>>> The RDD being processed has 2262 partitions.
>>>>>>>
>>>>>>> Arun
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Cell : 425-233-8271
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Cell : 425-233-8271
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>

Re: TaskCommitDenied (Driver denied task commit)

Posted by Holden Karau <ho...@pigscanfly.ca>.
My hunch is that the TaskCommitDenied is perhaps a red herring and the
problem is groupByKey - but I've also just seen a lot of people be bitten
by it so that might not be the issue. If you just do a count at the point of
the groupByKey, does the pipeline succeed?

On Thu, Jan 21, 2016 at 2:56 PM, Arun Luthra <ar...@gmail.com> wrote:

> Usually the pipeline works, it just failed on this particular input data.
> The other data it has run on is of similar size.
>
> Speculation is enabled.
>
> I'm using Spark 1.5.0.
>
> Here is the config. Many of these may not be needed anymore, they are from
> trying to get things working in Spark 1.2 and 1.3.
>
>         .set("spark.storage.memoryFraction","0.2") // default 0.6
>         .set("spark.shuffle.memoryFraction","0.2") // default 0.2
>         .set("spark.shuffle.manager","SORT") // preferred setting for
> optimized joins
>         .set("spark.shuffle.consolidateFiles","true") // helpful for "too
> many files open"
>         .set("spark.mesos.coarse", "true") // helpful for MapOutputTracker
> errors?
>         .set("spark.akka.frameSize","300") // helpful when using
> consolidateFiles=true
>         .set("spark.shuffle.compress","false") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>         .set("spark.file.transferTo","false") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>         .set("spark.core.connection.ack.wait.timeout","600") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
>         .set("spark.speculation","true")
>         .set("spark.worker.timeout","600") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
>         .set("spark.akka.timeout","300") //
> http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
>         .set("spark.storage.blockManagerSlaveTimeoutMs","120000")
>         .set("spark.driver.maxResultSize","2048") // in response to error:
> Total size of serialized results of 39901 tasks (1024.0 MB) is bigger than
> spark.driver.maxResultSize (1024.0 MB)
>         .set("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer")
>         .set("spark.kryo.registrator","------.MyRegistrator")
>         .set("spark.kryo.registrationRequired", "true")
>         .set("spark.yarn.executor.memoryOverhead","600")
>
> On Thu, Jan 21, 2016 at 2:50 PM, Josh Rosen <jo...@databricks.com>
> wrote:
>
>> Is speculation enabled? This TaskCommitDenied by driver error is thrown
>> by writers who lost the race to commit an output partition. I don't think
>> this had anything to do with key skew etc. Replacing the groupbykey with a
>> count will mask this exception because the coordination does not get
>> triggered in non save/write operations.
>>
>> On Thu, Jan 21, 2016 at 2:46 PM Holden Karau <ho...@pigscanfly.ca>
>> wrote:
>>
>>> Before we dig too far into this, the thing which most quickly jumps out
>>> to me is groupByKey which could be causing some problems - whats the
>>> distribution of keys like? Try replacing the groupByKey with a count() and
>>> see if the pipeline works up until that stage. Also 1G of driver memory is
>>> also a bit small for something with 90 executors...
>>>
>>> On Thu, Jan 21, 2016 at 2:40 PM, Arun Luthra <ar...@gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> 16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
>>>> library for your platform... using builtin-java classes where applicable
>>>>
>>>> 16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler
>>>> for source because spark.app.id is not set.
>>>>
>>>> spark.yarn.driver.memoryOverhead is set but does not apply in client
>>>> mode.
>>>>
>>>> 16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local
>>>> reads feature cannot be used because libhadoop cannot be loaded.
>>>>
>>>> 16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache
>>>> broadcast_4 in memory! (computed 60.2 MB so far)
>>>>
>>>> 16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to
>>>> disk instead.
>>>>
>>>> [Stage 1:====================================================>(2260 +
>>>> 7) / 2262]16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in stage
>>>> 1.0 (TID 4530, --): TaskCommitDenied (Driver denied task commit) for job:
>>>> 1, partition: 1440, attempt: 4530
>>>>
>>>> [Stage 1:====================================================>(2260 +
>>>> 6) / 2262]16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in stage
>>>> 1.0 (TID 4531, --): TaskCommitDenied (Driver denied task commit) for job:
>>>> 1, partition: 1488, attempt: 4531
>>>>
>>>> [Stage 1:====================================================>(2261 +
>>>> 4) / 2262]16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in stage
>>>> 1.0 (TID 4532, --): TaskCommitDenied (Driver denied task commit) for job:
>>>> 1, partition: 1982, attempt: 4532
>>>>
>>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0
>>>> (TID 4482, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>>> partition: 2214, attempt: 4482
>>>>
>>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0
>>>> (TID 4436, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>>> partition: 2168, attempt: 4436
>>>>
>>>>
>>>> I am running with:
>>>>
>>>>     spark-submit --class "myclass" \
>>>>
>>>>       --num-executors 90 \
>>>>
>>>>       --driver-memory 1g \
>>>>
>>>>       --executor-memory 60g \
>>>>
>>>>       --executor-cores 8 \
>>>>
>>>>       --master yarn-client \
>>>>
>>>>       --conf "spark.executor.extraJavaOptions=-verbose:gc
>>>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>>>>
>>>>       my.jar
>>>>
>>>>
>>>> There are 2262 input files totaling just 98.6G. The DAG is basically
>>>> textFile().map().filter().groupByKey().saveAsTextFile().
>>>>
>>>> On Thu, Jan 21, 2016 at 2:14 PM, Holden Karau <ho...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> Can you post more of your log? How big are the partitions? What is the
>>>>> action you are performing?
>>>>>
>>>>> On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra <ar...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Example warning:
>>>>>>
>>>>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0
>>>>>> (TID 4436, XXXXXXX): TaskCommitDenied (Driver denied task commit) for job:
>>>>>> 1, partition: 2168, attempt: 4436
>>>>>>
>>>>>>
>>>>>> Is there a solution for this? Increase driver memory? I'm using just
>>>>>> 1G driver memory but ideally I won't have to increase it.
>>>>>>
>>>>>> The RDD being processed has 2262 partitions.
>>>>>>
>>>>>> Arun
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Cell : 425-233-8271
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

Re: TaskCommitDenied (Driver denied task commit)

Posted by Arun Luthra <ar...@gmail.com>.
Usually the pipeline works; it just failed on this particular input data.
The other data it has run on is of similar size.

Speculation is enabled.

I'm using Spark 1.5.0.

Here is the config. Many of these may not be needed anymore, they are from
trying to get things working in Spark 1.2 and 1.3.

        .set("spark.storage.memoryFraction","0.2") // default 0.6
        .set("spark.shuffle.memoryFraction","0.2") // default 0.2
        .set("spark.shuffle.manager","SORT") // preferred setting for
optimized joins
        .set("spark.shuffle.consolidateFiles","true") // helpful for "too
many files open"
        .set("spark.mesos.coarse", "true") // helpful for MapOutputTracker
errors?
        .set("spark.akka.frameSize","300") // helpful when using
consolidateFiles=true
        .set("spark.shuffle.compress","false") //
http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
        .set("spark.file.transferTo","false") //
http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
        .set("spark.core.connection.ack.wait.timeout","600") //
http://apache-spark-user-list.1001560.n3.nabble.com/Fetch-Failure-tp20787p20811.html
        .set("spark.speculation","true")
        .set("spark.worker.timeout","600") //
http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
        .set("spark.akka.timeout","300") //
http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-td3798.html
        .set("spark.storage.blockManagerSlaveTimeoutMs","120000")
        .set("spark.driver.maxResultSize","2048") // in response to error:
Total size of serialized results of 39901 tasks (1024.0 MB) is bigger than
spark.driver.maxResultSize (1024.0 MB)
        .set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrator","------.MyRegistrator")
        .set("spark.kryo.registrationRequired", "true")
        .set("spark.yarn.executor.memoryOverhead","600")

On Thu, Jan 21, 2016 at 2:50 PM, Josh Rosen <jo...@databricks.com>
wrote:

> Is speculation enabled? This TaskCommitDenied by driver error is thrown by
> writers who lost the race to commit an output partition. I don't think this
> had anything to do with key skew etc. Replacing the groupbykey with a count
> will mask this exception because the coordination does not get triggered in
> non save/write operations.
>
> On Thu, Jan 21, 2016 at 2:46 PM Holden Karau <ho...@pigscanfly.ca> wrote:
>
>> Before we dig too far into this, the thing which most quickly jumps out
>> to me is groupByKey which could be causing some problems - whats the
>> distribution of keys like? Try replacing the groupByKey with a count() and
>> see if the pipeline works up until that stage. Also 1G of driver memory is
>> also a bit small for something with 90 executors...
>>
>> On Thu, Jan 21, 2016 at 2:40 PM, Arun Luthra <ar...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> 16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>>
>>> 16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler
>>> for source because spark.app.id is not set.
>>>
>>> spark.yarn.driver.memoryOverhead is set but does not apply in client
>>> mode.
>>>
>>> 16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local
>>> reads feature cannot be used because libhadoop cannot be loaded.
>>>
>>> 16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache
>>> broadcast_4 in memory! (computed 60.2 MB so far)
>>>
>>> 16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to disk
>>> instead.
>>>
>>> [Stage 1:====================================================>(2260 + 7)
>>> / 2262]16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in stage 1.0
>>> (TID 4530, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>> partition: 1440, attempt: 4530
>>>
>>> [Stage 1:====================================================>(2260 + 6)
>>> / 2262]16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in stage 1.0
>>> (TID 4531, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>> partition: 1488, attempt: 4531
>>>
>>> [Stage 1:====================================================>(2261 + 4)
>>> / 2262]16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in stage 1.0
>>> (TID 4532, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>> partition: 1982, attempt: 4532
>>>
>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0
>>> (TID 4482, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>> partition: 2214, attempt: 4482
>>>
>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0
>>> (TID 4436, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>>> partition: 2168, attempt: 4436
>>>
>>>
>>> I am running with:
>>>
>>>     spark-submit --class "myclass" \
>>>
>>>       --num-executors 90 \
>>>
>>>       --driver-memory 1g \
>>>
>>>       --executor-memory 60g \
>>>
>>>       --executor-cores 8 \
>>>
>>>       --master yarn-client \
>>>
>>>       --conf "spark.executor.extraJavaOptions=-verbose:gc
>>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>>>
>>>       my.jar
>>>
>>>
>>> There are 2262 input files totaling just 98.6G. The DAG is basically
>>> textFile().map().filter().groupByKey().saveAsTextFile().
>>>
>>> On Thu, Jan 21, 2016 at 2:14 PM, Holden Karau <ho...@pigscanfly.ca>
>>> wrote:
>>>
>>>> Can you post more of your log? How big are the partitions? What is the
>>>> action you are performing?
>>>>
>>>> On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra <ar...@gmail.com>
>>>> wrote:
>>>>
>>>>> Example warning:
>>>>>
>>>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0
>>>>> (TID 4436, XXXXXXX): TaskCommitDenied (Driver denied task commit) for job:
>>>>> 1, partition: 2168, attempt: 4436
>>>>>
>>>>>
>>>>> Is there a solution for this? Increase driver memory? I'm using just
>>>>> 1G driver memory but ideally I won't have to increase it.
>>>>>
>>>>> The RDD being processed has 2262 partitions.
>>>>>
>>>>> Arun
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Cell : 425-233-8271
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>

Re: TaskCommitDenied (Driver denied task commit)

Posted by Josh Rosen <jo...@databricks.com>.
Is speculation enabled? This TaskCommitDenied by driver error is thrown by
writers that lost the race to commit an output partition. I don't think this
has anything to do with key skew etc. Replacing the groupByKey with a count
will mask this exception because the commit coordination does not get
triggered in non-save/write operations.
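
To make the race concrete, here is a toy model of the idea (not Spark's
actual coordinator code): the driver allows exactly one task attempt to
commit each output partition, and any other attempt for that partition, e.g.
a speculative copy, is denied.

    import scala.collection.mutable

    // Toy model only, for intuition about the commit race.
    class ToyCommitCoordinator {
      private val winners = mutable.Map[Int, Long]()   // partition -> winning attempt

      def canCommit(partition: Int, attempt: Long): Boolean = synchronized {
        winners.get(partition) match {
          case Some(winner) => winner == attempt       // already decided for this partition
          case None =>
            winners(partition) = attempt               // first attempt to ask wins
            true
        }
      }
    }

In the warnings quoted above, the attempt that logs TaskCommitDenied is the
one for which this kind of check came back false.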
On Thu, Jan 21, 2016 at 2:46 PM Holden Karau <ho...@pigscanfly.ca> wrote:

> Before we dig too far into this, the thing which most quickly jumps out to
> me is groupByKey which could be causing some problems - whats the
> distribution of keys like? Try replacing the groupByKey with a count() and
> see if the pipeline works up until that stage. Also 1G of driver memory is
> also a bit small for something with 90 executors...
>
> On Thu, Jan 21, 2016 at 2:40 PM, Arun Luthra <ar...@gmail.com>
> wrote:
>
>>
>>
>> 16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>>
>> 16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler for
>> source because spark.app.id is not set.
>>
>> spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
>>
>> 16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local reads
>> feature cannot be used because libhadoop cannot be loaded.
>>
>> 16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache broadcast_4
>> in memory! (computed 60.2 MB so far)
>>
>> 16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to disk
>> instead.
>>
>> [Stage 1:====================================================>(2260 + 7)
>> / 2262]16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in stage 1.0
>> (TID 4530, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 1440, attempt: 4530
>>
>> [Stage 1:====================================================>(2260 + 6)
>> / 2262]16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in stage 1.0
>> (TID 4531, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 1488, attempt: 4531
>>
>> [Stage 1:====================================================>(2261 + 4)
>> / 2262]16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in stage 1.0
>> (TID 4532, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 1982, attempt: 4532
>>
>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0 (TID
>> 4482, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 2214, attempt: 4482
>>
>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
>> 4436, --): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 2168, attempt: 4436
>>
>>
>> I am running with:
>>
>>     spark-submit --class "myclass" \
>>
>>       --num-executors 90 \
>>
>>       --driver-memory 1g \
>>
>>       --executor-memory 60g \
>>
>>       --executor-cores 8 \
>>
>>       --master yarn-client \
>>
>>       --conf "spark.executor.extraJavaOptions=-verbose:gc
>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>>
>>       my.jar
>>
>>
>> There are 2262 input files totaling just 98.6G. The DAG is basically
>> textFile().map().filter().groupByKey().saveAsTextFile().
>>
>> On Thu, Jan 21, 2016 at 2:14 PM, Holden Karau <ho...@pigscanfly.ca>
>> wrote:
>>
>>> Can you post more of your log? How big are the partitions? What is the
>>> action you are performing?
>>>
>>> On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra <ar...@gmail.com>
>>> wrote:
>>>
>>>> Example warning:
>>>>
>>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0
>>>> (TID 4436, XXXXXXX): TaskCommitDenied (Driver denied task commit) for job:
>>>> 1, partition: 2168, attempt: 4436
>>>>
>>>>
>>>> Is there a solution for this? Increase driver memory? I'm using just 1G
>>>> driver memory but ideally I won't have to increase it.
>>>>
>>>> The RDD being processed has 2262 partitions.
>>>>
>>>> Arun
>>>>
>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>

Re: TaskCommitDenied (Driver denied task commit)

Posted by Holden Karau <ho...@pigscanfly.ca>.
Before we dig too far into this, the thing that most quickly jumps out to
me is groupByKey, which could be causing some problems - what's the
distribution of keys like? Try replacing the groupByKey with a count() and
see if the pipeline works up until that stage. Also, 1G of driver memory is
a bit small for something with 90 executors...
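
Concretely, the substitution meant here looks something like the following
(a sketch assuming an existing SparkContext sc; the paths and the keying
logic are placeholders, with the pipeline shape taken from the quoted message
below):

    // Debugging sketch: run everything up to and including the shuffle, but
    // use a cheap action (count) instead of the write, so no output commits
    // happen.
    val grouped = sc.textFile("hdfs:///path/to/input")        // placeholder path
      .map(line => (line.split("\t")(0), line))               // placeholder keying
      .filter { case (_, value) => value.nonEmpty }
      .groupByKey()

    println(grouped.count())                                   // forces the shuffle
    // grouped.saveAsTextFile("hdfs:///path/to/output")        // the original action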

On Thu, Jan 21, 2016 at 2:40 PM, Arun Luthra <ar...@gmail.com> wrote:

>
>
> 16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> 16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler for
> source because spark.app.id is not set.
>
> spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
>
> 16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local reads
> feature cannot be used because libhadoop cannot be loaded.
>
> 16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache broadcast_4
> in memory! (computed 60.2 MB so far)
>
> 16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to disk
> instead.
>
> [Stage 1:====================================================>(2260 + 7) /
> 2262]16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in stage 1.0
> (TID 4530, --): TaskCommitDenied (Driver denied task commit) for job: 1,
> partition: 1440, attempt: 4530
>
> [Stage 1:====================================================>(2260 + 6) /
> 2262]16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in stage 1.0
> (TID 4531, --): TaskCommitDenied (Driver denied task commit) for job: 1,
> partition: 1488, attempt: 4531
>
> [Stage 1:====================================================>(2261 + 4) /
> 2262]16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in stage 1.0
> (TID 4532, --): TaskCommitDenied (Driver denied task commit) for job: 1,
> partition: 1982, attempt: 4532
>
> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0 (TID
> 4482, --): TaskCommitDenied (Driver denied task commit) for job: 1,
> partition: 2214, attempt: 4482
>
> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
> 4436, --): TaskCommitDenied (Driver denied task commit) for job: 1,
> partition: 2168, attempt: 4436
>
>
> I am running with:
>
>     spark-submit --class "myclass" \
>
>       --num-executors 90 \
>
>       --driver-memory 1g \
>
>       --executor-memory 60g \
>
>       --executor-cores 8 \
>
>       --master yarn-client \
>
>       --conf "spark.executor.extraJavaOptions=-verbose:gc
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>
>       my.jar
>
>
> There are 2262 input files totaling just 98.6G. The DAG is basically
> textFile().map().filter().groupByKey().saveAsTextFile().
>
> On Thu, Jan 21, 2016 at 2:14 PM, Holden Karau <ho...@pigscanfly.ca>
> wrote:
>
>> Can you post more of your log? How big are the partitions? What is the
>> action you are performing?
>>
>> On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra <ar...@gmail.com>
>> wrote:
>>
>>> Example warning:
>>>
>>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0
>>> (TID 4436, XXXXXXX): TaskCommitDenied (Driver denied task commit) for job:
>>> 1, partition: 2168, attempt: 4436
>>>
>>>
>>> Is there a solution for this? Increase driver memory? I'm using just 1G
>>> driver memory but ideally I won't have to increase it.
>>>
>>> The RDD being processed has 2262 partitions.
>>>
>>> Arun
>>>
>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

Re: TaskCommitDenied (Driver denied task commit)

Posted by Arun Luthra <ar...@gmail.com>.
16/01/21 21:52:11 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

16/01/21 21:52:14 WARN MetricsSystem: Using default name DAGScheduler for
source because spark.app.id is not set.

spark.yarn.driver.memoryOverhead is set but does not apply in client mode.

16/01/21 21:52:16 WARN DomainSocketFactory: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.

16/01/21 21:52:52 WARN MemoryStore: Not enough space to cache broadcast_4
in memory! (computed 60.2 MB so far)

16/01/21 21:52:52 WARN MemoryStore: Persisting block broadcast_4 to disk
instead.

[Stage 1:====================================================>(2260 + 7) /
2262]16/01/21 21:57:24 WARN TaskSetManager: Lost task 1440.1 in stage 1.0
(TID 4530, --): TaskCommitDenied (Driver denied task commit) for job: 1,
partition: 1440, attempt: 4530

[Stage 1:====================================================>(2260 + 6) /
2262]16/01/21 21:57:27 WARN TaskSetManager: Lost task 1488.1 in stage 1.0
(TID 4531, --): TaskCommitDenied (Driver denied task commit) for job: 1,
partition: 1488, attempt: 4531

[Stage 1:====================================================>(2261 + 4) /
2262]16/01/21 21:57:39 WARN TaskSetManager: Lost task 1982.1 in stage 1.0
(TID 4532, --): TaskCommitDenied (Driver denied task commit) for job: 1,
partition: 1982, attempt: 4532

16/01/21 21:57:57 WARN TaskSetManager: Lost task 2214.0 in stage 1.0 (TID
4482, --): TaskCommitDenied (Driver denied task commit) for job: 1,
partition: 2214, attempt: 4482

16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
4436, --): TaskCommitDenied (Driver denied task commit) for job: 1,
partition: 2168, attempt: 4436


I am running with:

    spark-submit --class "myclass" \

      --num-executors 90 \

      --driver-memory 1g \

      --executor-memory 60g \

      --executor-cores 8 \

      --master yarn-client \

      --conf "spark.executor.extraJavaOptions=-verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \

      my.jar


There are 2262 input files totaling just 98.6G. The DAG is basically
textFile().map().filter().groupByKey().saveAsTextFile().
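
For scale (a back-of-the-envelope figure, not measured): 98.6 GB over 2262
input partitions is about 98.6 * 1024 / 2262 ≈ 45 MB of input per task before
the shuffle, so the input partitions themselves are on the small side; most
of the work is presumably in the groupByKey shuffle and the final write.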

On Thu, Jan 21, 2016 at 2:14 PM, Holden Karau <ho...@pigscanfly.ca> wrote:

> Can you post more of your log? How big are the partitions? What is the
> action you are performing?
>
> On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra <ar...@gmail.com>
> wrote:
>
>> Example warning:
>>
>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
>> 4436, XXXXXXX): TaskCommitDenied (Driver denied task commit) for job: 1,
>> partition: 2168, attempt: 4436
>>
>>
>> Is there a solution for this? Increase driver memory? I'm using just 1G
>> driver memory but ideally I won't have to increase it.
>>
>> The RDD being processed has 2262 partitions.
>>
>> Arun
>>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>

Re: TaskCommitDenied (Driver denied task commit)

Posted by Holden Karau <ho...@pigscanfly.ca>.
Can you post more of your log? How big are the partitions? What is the
action you are performing?

On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra <ar...@gmail.com> wrote:

> Example warning:
>
> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
> 4436, XXXXXXX): TaskCommitDenied (Driver denied task commit) for job: 1,
> partition: 2168, attempt: 4436
>
>
> Is there a solution for this? Increase driver memory? I'm using just 1G
> driver memory but ideally I won't have to increase it.
>
> The RDD being processed has 2262 partitions.
>
> Arun
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau