Posted to dev@spark.apache.org by Yash Sharma <ya...@gmail.com> on 2016/04/11 04:46:11 UTC

Spark SQL on a large number of files (~500 MB each) fails after a couple of hours

Hi All,
I am trying Spark SQL on a dataset of ~16 TB with a large number of files (~50K).
Each file is roughly 400-500 MB.

I am issuing a fairly simple Hive query on the dataset with just filters
(no groupBys or joins), and the job is very slow. It runs for 7-8 hours
and processes about 80-100 GB on a 12-node cluster.

I have experimented with different values of spark.sql.shuffle.partitions,
from 20 to 4000, but haven't seen much difference.
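
One way to pass this, for example, is as an extra --conf flag on the spark-submit command listed in [2] below; the value 2000 here is just an illustration from the range tried:

--conf spark.sql.shuffle.partitions=2000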

From the logs I have the YARN error attached at the end [1]. I am using the
Spark configs below [2] for the job.

Is there any other tuning I should look into? Any tips would be
appreciated.

Thanks


2. Spark config -
spark-submit
--master yarn-client
--driver-memory 1G
--executor-memory 10G
--executor-cores 5
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.initialExecutors=2
--conf spark.dynamicAllocation.minExecutors=2


1. Yarn Error:

>
> 16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed:
> container_1459747472046_1618_02_000003. Exit status: 1. Diagnostics:
> Exception from container-launch.
> Container id: container_1459747472046_1618_02_000003
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>         at org.apache.hadoop.util.Shell.run(Shell.java:455)
>         at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>         at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>         at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>         at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> Container exited with a non-zero exit code 1

RE: Spark SQL on a large number of files (~500 MB each) fails after a couple of hours

Posted by "Yu, Yucai" <yu...@intel.com>.
It is possibly not the first failure; could you increase the setting below and rerun?
spark.yarn.executor.memoryOverhead           4096

In my experience, Netty sometimes uses a lot of off-heap memory, which may exceed the container memory limit and cause the executor to be killed by YARN's NodeManager.
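
For example, added to the spark-submit command from your original post, it would look roughly like this (the value is in MB):

--conf spark.yarn.executor.memoryOverhead=4096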

Thanks,
Yucai

From: Yash Sharma [mailto:yash360@gmail.com]
Sent: Monday, April 11, 2016 11:51 AM
To: Yu, Yucai <yu...@intel.com>
Cc: dev@spark.apache.org
Subject: Re: Spark SQL on a large number of files (~500 MB each) fails after a couple of hours

Hi Yucai,
Thanks for the info. I have explored the container logs but did not get much information from them.

I have seen these errors in the logs for a few containers but am not sure of the cause:
1. java.lang.NullPointerException (DiskBlockManager.scala:167)
2. java.lang.ClassCastException: RegisterExecutorFailed

Attaching the log for reference.


16/04/07 13:05:43 INFO storage.MemoryStore: MemoryStore started with capacity 2.6 GB
16/04/07 13:05:43 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkDriver@10.65.224.199:44692/user/CoarseGrainedScheduler
16/04/07 13:05:43 ERROR executor.CoarseGrainedExecutorBackend: Cannot register with driver: akka.tcp://sparkDriver@10.65.224.199:44692/user/CoarseGrainedScheduler
java.lang.ClassCastException: Cannot cast org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisterExecutorFailed to org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisteredExecutor$
        at java.lang.Class.cast(Class.java:3186)
        at scala.concurrent.Future$$anonfun$mapTo$1.apply(Future.scala:405)
        at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
        at scala.util.Try$.apply(Try.scala:161)
        at scala.util.Success.map(Try.scala:206)
        at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
        at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
        at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
        at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
        at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
        at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
        at scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
        at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
        at scala.concurrent.impl.Promise$KeptPromise.onComplete(Promise.scala:333)
        at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:254)
        at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
        at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
        at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
        at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:266)
        at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:89)
        at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:935)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
16/04/07 13:05:44 INFO storage.DiskBlockManager: Shutdown hook called
16/04/07 13:05:44 ERROR util.Utils: Uncaught exception in thread Thread-2
java.lang.NullPointerException
        at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:167)
        at org.apache.spark.storage.DiskBlockManager$$anonfun$addShutdownHook$1.apply$mcV$sp(DiskBlockManager.scala:149)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
        at scala.util.Try$.apply(Try.scala:161)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
16/04/07 13:05:44 INFO util.ShutdownHookManager: Shutdown hook called

On Mon, Apr 11, 2016 at 1:10 PM, Yu, Yucai <yu...@intel.com> wrote:
Hi Yash,

How about checking the executor (YARN container) logs? Most of the time they show more details. We are using CDH; the logs are at:

[yucai@sr483 container_1457699919227_0094_01_000014]$ pwd
/mnt/DP_disk1/yucai/yarn/logs/application_1457699919227_0094/container_1457699919227_0094_01_000014
[yucai@sr483 container_1457699919227_0094_01_000014]$ ls -tlr
total 408
-rw-r--r-- 1 yucai DP 382676 Mar 13 18:04 stderr
-rw-r--r-- 1 yucai DP  22302 Mar 13 18:04 stdout

Please note, you had better check the first failed container.
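
If log aggregation is enabled, you can usually also fetch the same logs with the YARN CLI, for example (assuming the application id taken from the failed container name in your error):

yarn logs -applicationId application_1459747472046_1618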

Thanks,
Yucai

From: Yash Sharma [mailto:yash360@gmail.com]
Sent: Monday, April 11, 2016 10:46 AM
To: dev@spark.apache.org
Subject: Spark SQL on a large number of files (~500 MB each) fails after a couple of hours

Hi All,
I am trying Spark SQL on a dataset of ~16 TB with a large number of files (~50K). Each file is roughly 400-500 MB.

I am issuing a fairly simple Hive query on the dataset with just filters (no groupBys or joins), and the job is very slow. It runs for 7-8 hours and processes about 80-100 GB on a 12-node cluster.

I have experimented with different values of spark.sql.shuffle.partitions, from 20 to 4000, but haven't seen much difference.

From the logs I have the YARN error attached at the end [1]. I am using the Spark configs below [2] for the job.

Is there any other tuning I should look into? Any tips would be appreciated.

Thanks


2. Spark config -
spark-submit
--master yarn-client
--driver-memory 1G
--executor-memory 10G
--executor-cores 5
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.initialExecutors=2
--conf spark.dynamicAllocation.minExecutors=2


1. Yarn Error:

16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed: container_1459747472046_1618_02_000003. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1459747472046_1618_02_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1


Re: Spark SQL on a large number of files (~500 MB each) fails after a couple of hours

Posted by Yash Sharma <ya...@gmail.com>.
Hi Yucai,
Thanks for the info. I have explored the container logs but did not get much
information from them.

I have seen these errors in the logs for a few containers but am not sure of
the cause:
1. java.lang.NullPointerException (DiskBlockManager.scala:167)
2. java.lang.ClassCastException: RegisterExecutorFailed

Attaching the log for reference.


> 16/04/07 13:05:43 INFO storage.MemoryStore: MemoryStore started with
> capacity 2.6 GB
> 16/04/07 13:05:43 INFO executor.CoarseGrainedExecutorBackend: Connecting
> to driver: akka.tcp://
> sparkDriver@10.65.224.199:44692/user/CoarseGrainedScheduler
> 16/04/07 13:05:43 ERROR executor.CoarseGrainedExecutorBackend: Cannot
> register with driver: akka.tcp://
> sparkDriver@10.65.224.199:44692/user/CoarseGrainedScheduler
> java.lang.ClassCastException: Cannot cast
> org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisterExecutorFailed
> to
> org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisteredExecutor$
>         at java.lang.Class.cast(Class.java:3186)
>         at scala.concurrent.Future$$anonfun$mapTo$1.apply(Future.scala:405)
>         at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
>         at scala.util.Try$.apply(Try.scala:161)
>         at scala.util.Success.map(Try.scala:206)
>         at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>         at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>         at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>         at
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
>         at
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
>         at
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
>         at
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
>         at
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>         at
> scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
>         at
> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>         at
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
>         at
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>         at
> scala.concurrent.impl.Promise$KeptPromise.onComplete(Promise.scala:333)
>         at
> scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:254)
>         at
> scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249)
>         at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>         at
> org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
>         at
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
>         at
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>         at
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>         at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:266)
>         at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:89)
>         at
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:935)
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>         at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>         at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>         at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 16/04/07 13:05:44 INFO storage.DiskBlockManager: Shutdown hook called
> 16/04/07 13:05:44 ERROR util.Utils: Uncaught exception in thread Thread-2
> java.lang.NullPointerException
>         at org.apache.spark.storage.DiskBlockManager.org
> $apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:167)
>         at
> org.apache.spark.storage.DiskBlockManager$$anonfun$addShutdownHook$1.apply$mcV$sp(DiskBlockManager.scala:149)
>         at
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
>         at
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
>         at
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
>         at
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
>         at
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>         at
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
>         at
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
>         at
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
>         at scala.util.Try$.apply(Try.scala:161)
>         at
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
>         at
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
>         at
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> 16/04/07 13:05:44 INFO util.ShutdownHookManager: Shutdown hook called


On Mon, Apr 11, 2016 at 1:10 PM, Yu, Yucai <yu...@intel.com> wrote:

> Hi Yash,
>
>
>
> How about checking the executor (YARN container) logs? Most of the time
> they show more details. We are using CDH; the logs are at:
>
>
>
> [yucai@sr483 container_1457699919227_0094_01_000014]$ pwd
>
>
> /mnt/DP_disk1/yucai/yarn/logs/application_1457699919227_0094/container_1457699919227_0094_01_000014
>
> [yucai@sr483 container_1457699919227_0094_01_000014]$ ls -tlr
>
> total 408
>
> -rw-r--r-- 1 yucai DP 382676 Mar 13 18:04 stderr
>
> -rw-r--r-- 1 yucai DP  22302 Mar 13 18:04 stdout
>
>
>
> Please note, you had better check the first failed container.
>
>
>
> Thanks,
>
> Yucai
>
>
>
> *From:* Yash Sharma [mailto:yash360@gmail.com]
> *Sent:* Monday, April 11, 2016 10:46 AM
> *To:* dev@spark.apache.org
> *Subject:* Spark SQL on a large number of files (~500 MB each) fails after
> a couple of hours
>
>
>
> Hi All,
>
> I am trying Spark SQL on a dataset of ~16 TB with a large number of files
> (~50K). Each file is roughly 400-500 MB.
>
>
>
> I am issuing a fairly simple Hive query on the dataset with just filters
> (no groupBys or joins), and the job is very slow. It runs for 7-8 hours
> and processes about 80-100 GB on a 12-node cluster.
>
>
>
> I have experimented with different values of spark.sql.shuffle.partitions,
> from 20 to 4000, but haven't seen much difference.
>
>
>
> From the logs I have the YARN error attached at the end [1]. I am using the
> Spark configs below [2] for the job.
>
>
>
> Is there any other tuning I should look into? Any tips would be
> appreciated.
>
>
>
> Thanks
>
>
>
>
>
> 2. Spark config -
>
> spark-submit
>
> --master yarn-client
>
> --driver-memory 1G
>
> --executor-memory 10G
>
> --executor-cores 5
>
> --conf spark.dynamicAllocation.enabled=true
>
> --conf spark.shuffle.service.enabled=true
>
> --conf spark.dynamicAllocation.initialExecutors=2
>
> --conf spark.dynamicAllocation.minExecutors=2
>
>
>
>
>
> 1. Yarn Error:
>
>
> 16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed:
> container_1459747472046_1618_02_000003. Exit status: 1. Diagnostics:
> Exception from container-launch.
> Container id: container_1459747472046_1618_02_000003
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>         at org.apache.hadoop.util.Shell.run(Shell.java:455)
>         at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>         at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>         at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>         at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> Container exited with a non-zero exit code 1
>
>

RE: Spark SQL on a large number of files (~500 MB each) fails after a couple of hours

Posted by "Yu, Yucai" <yu...@intel.com>.
Hi Yash,

How about checking the executor (YARN container) logs? Most of the time they show more details. We are using CDH; the logs are at:

[yucai@sr483 container_1457699919227_0094_01_000014]$ pwd
/mnt/DP_disk1/yucai/yarn/logs/application_1457699919227_0094/container_1457699919227_0094_01_000014
[yucai@sr483 container_1457699919227_0094_01_000014]$ ls -tlr
total 408
-rw-r--r-- 1 yucai DP 382676 Mar 13 18:04 stderr
-rw-r--r-- 1 yucai DP  22302 Mar 13 18:04 stdout

Please note, you had better check the first failed container.

Thanks,
Yucai

From: Yash Sharma [mailto:yash360@gmail.com]
Sent: Monday, April 11, 2016 10:46 AM
To: dev@spark.apache.org
Subject: Spark SQL on a large number of files (~500 MB each) fails after a couple of hours

Hi All,
I am trying Spark SQL on a dataset of ~16 TB with a large number of files (~50K). Each file is roughly 400-500 MB.

I am issuing a fairly simple Hive query on the dataset with just filters (no groupBys or joins), and the job is very slow. It runs for 7-8 hours and processes about 80-100 GB on a 12-node cluster.

I have experimented with different values of spark.sql.shuffle.partitions, from 20 to 4000, but haven't seen much difference.

From the logs I have the YARN error attached at the end [1]. I am using the Spark configs below [2] for the job.

Is there any other tuning I should look into? Any tips would be appreciated.

Thanks


2. Spark config -
spark-submit
--master yarn-client
--driver-memory 1G
--executor-memory 10G
--executor-cores 5
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.initialExecutors=2
--conf spark.dynamicAllocation.minExecutors=2


1. Yarn Error:

16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed: container_1459747472046_1618_02_000003. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1459747472046_1618_02_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1