You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Yash Sharma <ya...@gmail.com> on 2016/04/11 04:46:11 UTC
Spark Sql on large number of files (~500Megs each) fails after couple
of hours
Hi All,
I am trying Spark Sql on a dataset ~16Tb with large number of files (~50K).
Each file is roughly 400-500 Megs.
I am issuing a fairly simple hive query on the dataset with just filters
(No groupBy's and Joins) and the job is very very slow. It runs for 7-8 hrs
and processes about 80-100 Gigs on a 12 node cluster.
I have experimented with different values of spark.sql.shuffle.partitions
from 20 to 4000 but havn't seen lot of difference.
>From the logs I have the yarn error attached at end [1]. I have got the
below spark configs [2] for the job.
Is there any other tuning I need to look into. Any tips would be
appreciated,
Thanks
2. Spark config -
spark-submit
--master yarn-client
--driver-memory 1G
--executor-memory 10G
--executor-cores 5
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.initialExecutors=2
--conf spark.dynamicAllocation.minExecutors=2
1. Yarn Error:
>
> 16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed:
> container_1459747472046_1618_02_000003. Exit status: 1. Diagnostics:
> Exception from container-launch.
> Container id: container_1459747472046_1618_02_000003
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Container exited with a non-zero exit code 1
RE: Spark Sql on large number of files (~500Megs each) fails after
couple of hours
Posted by "Yu, Yucai" <yu...@intel.com>.
It is possible not the first failure, could you increase below and rerun?
spark.yarn.executor.memoryOverhead 4096
In my experience, sometimes, netty will use lots of off-heap memory, which may lead to exceed container memory limitation and be killed by yarn’s node manager.
Thanks,
Yucai
From: Yash Sharma [mailto:yash360@gmail.com]
Sent: Monday, April 11, 2016 11:51 AM
To: Yu, Yucai <yu...@intel.com>
Cc: dev@spark.apache.org
Subject: Re: Spark Sql on large number of files (~500Megs each) fails after couple of hours
Hi Yucai,
Thanks for the info. I have explored the container logs but did not get lot of information from it.
I have seen this error log for few containers but not sure of the cause for it.
1. java.lang.NullPointerException (DiskBlockManager.scala:167)
2. java.lang.ClassCastException: RegisterExecutorFailed
Attaching the log for reference.
16/04/07 13:05:43 INFO storage.MemoryStore: MemoryStore started with capacity 2.6 GB
16/04/07 13:05:43 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkDriver@10.65.224.199:44692/user/CoarseGrainedScheduler<http://sparkDriver@10.65.224.199:44692/user/CoarseGrainedScheduler>
16/04/07 13:05:43 ERROR executor.CoarseGrainedExecutorBackend: Cannot register with driver: akka.tcp://sparkDriver@10.65.224.199:44692/user/CoarseGrainedScheduler<http://sparkDriver@10.65.224.199:44692/user/CoarseGrainedScheduler>
java.lang.ClassCastException: Cannot cast org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisterExecutorFailed to org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisteredExecutor$
at java.lang.Class.cast(Class.java:3186)
at scala.concurrent.Future$$anonfun$mapTo$1.apply(Future.scala:405)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Success.map(Try.scala:206)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$KeptPromise.onComplete(Promise.scala:333)
at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:254)
at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:266)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:89)
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:935)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
16/04/07 13:05:44 INFO storage.DiskBlockManager: Shutdown hook called
16/04/07 13:05:44 ERROR util.Utils: Uncaught exception in thread Thread-2
java.lang.NullPointerException
at org.apache.spark.storage.DiskBlockManager.org<http://org.apache.spark.storage.DiskBlockManager.org>$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:167)
at org.apache.spark.storage.DiskBlockManager$$anonfun$addShutdownHook$1.apply$mcV$sp(DiskBlockManager.scala:149)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
16/04/07 13:05:44 INFO util.ShutdownHookManager: Shutdown hook called
On Mon, Apr 11, 2016 at 1:10 PM, Yu, Yucai <yu...@intel.com>> wrote:
Hi Yash,
How about checking the executor(yarn container) log? Most of time, it shows more details, we are using CDH, the log is at:
[yucai@sr483 container_1457699919227_0094_01_000014]$ pwd
/mnt/DP_disk1/yucai/yarn/logs/application_1457699919227_0094/container_1457699919227_0094_01_000014
[yucai@sr483 container_1457699919227_0094_01_000014]$ ls -tlr
total 408
-rw-r--r-- 1 yucai DP 382676 Mar 13 18:04 stderr
-rw-r--r-- 1 yucai DP 22302 Mar 13 18:04 stdout
Please pay attention, you had better check the first failure container .
Thanks,
Yucai
From: Yash Sharma [mailto:yash360@gmail.com<ma...@gmail.com>]
Sent: Monday, April 11, 2016 10:46 AM
To: dev@spark.apache.org<ma...@spark.apache.org>
Subject: Spark Sql on large number of files (~500Megs each) fails after couple of hours
Hi All,
I am trying Spark Sql on a dataset ~16Tb with large number of files (~50K). Each file is roughly 400-500 Megs.
I am issuing a fairly simple hive query on the dataset with just filters (No groupBy's and Joins) and the job is very very slow. It runs for 7-8 hrs and processes about 80-100 Gigs on a 12 node cluster.
I have experimented with different values of spark.sql.shuffle.partitions from 20 to 4000 but havn't seen lot of difference.
From the logs I have the yarn error attached at end [1]. I have got the below spark configs [2] for the job.
Is there any other tuning I need to look into. Any tips would be appreciated,
Thanks
2. Spark config -
spark-submit
--master yarn-client
--driver-memory 1G
--executor-memory 10G
--executor-cores 5
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.initialExecutors=2
--conf spark.dynamicAllocation.minExecutors=2
1. Yarn Error:
16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed: container_1459747472046_1618_02_000003. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1459747472046_1618_02_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Re: Spark Sql on large number of files (~500Megs each) fails after
couple of hours
Posted by Yash Sharma <ya...@gmail.com>.
Hi Yucai,
Thanks for the info. I have explored the container logs but did not get lot
of information from it.
I have seen this error log for few containers but not sure of the cause for
it.
1. java.lang.NullPointerException (DiskBlockManager.scala:167)
2. java.lang.ClassCastException: RegisterExecutorFailed
Attaching the log for reference.
16/04/07 13:05:43 INFO storage.MemoryStore: MemoryStore started with
> capacity 2.6 GB
> 16/04/07 13:05:43 INFO executor.CoarseGrainedExecutorBackend: Connecting
> to driver: akka.tcp://
> sparkDriver@10.65.224.199:44692/user/CoarseGrainedScheduler
> 16/04/07 13:05:43 ERROR executor.CoarseGrainedExecutorBackend: Cannot
> register with driver: akka.tcp://
> sparkDriver@10.65.224.199:44692/user/CoarseGrainedScheduler
> java.lang.ClassCastException: Cannot cast
> org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisterExecutorFailed
> to
> org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisteredExecutor$
> at java.lang.Class.cast(Class.java:3186)
> at scala.concurrent.Future$$anonfun$mapTo$1.apply(Future.scala:405)
> at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
> at scala.util.Try$.apply(Try.scala:161)
> at scala.util.Success.map(Try.scala:206)
> at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
> at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> at
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
> at
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
> at
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
> at
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
> at
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
> at
> scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
> at
> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
> at
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
> at
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
> at
> scala.concurrent.impl.Promise$KeptPromise.onComplete(Promise.scala:333)
> at
> scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:254)
> at
> scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> at
> org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
> at
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
> at
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
> at
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
> at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:266)
> at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:89)
> at
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:935)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
> at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
> at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 16/04/07 13:05:44 INFO storage.DiskBlockManager: Shutdown hook called
> 16/04/07 13:05:44 ERROR util.Utils: Uncaught exception in thread Thread-2
> java.lang.NullPointerException
> at org.apache.spark.storage.DiskBlockManager.org
> $apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:167)
> at
> org.apache.spark.storage.DiskBlockManager$$anonfun$addShutdownHook$1.apply$mcV$sp(DiskBlockManager.scala:149)
> at
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> 16/04/07 13:05:44 INFO util.ShutdownHookManager: Shutdown hook called
On Mon, Apr 11, 2016 at 1:10 PM, Yu, Yucai <yu...@intel.com> wrote:
> Hi Yash,
>
>
>
> How about checking the executor(yarn container) log? Most of time, it
> shows more details, we are using CDH, the log is at:
>
>
>
> [yucai@sr483 container_1457699919227_0094_01_000014]$ pwd
>
>
> /mnt/DP_disk1/yucai/yarn/logs/application_1457699919227_0094/container_1457699919227_0094_01_000014
>
> [yucai@sr483 container_1457699919227_0094_01_000014]$ ls -tlr
>
> total 408
>
> -rw-r--r-- 1 yucai DP 382676 Mar 13 18:04 stderr
>
> -rw-r--r-- 1 yucai DP 22302 Mar 13 18:04 stdout
>
>
>
> Please pay attention, you had better check the first failure container .
>
>
>
> Thanks,
>
> Yucai
>
>
>
> *From:* Yash Sharma [mailto:yash360@gmail.com]
> *Sent:* Monday, April 11, 2016 10:46 AM
> *To:* dev@spark.apache.org
> *Subject:* Spark Sql on large number of files (~500Megs each) fails after
> couple of hours
>
>
>
> Hi All,
>
> I am trying Spark Sql on a dataset ~16Tb with large number of files
> (~50K). Each file is roughly 400-500 Megs.
>
>
>
> I am issuing a fairly simple hive query on the dataset with just filters
> (No groupBy's and Joins) and the job is very very slow. It runs for 7-8 hrs
> and processes about 80-100 Gigs on a 12 node cluster.
>
>
>
> I have experimented with different values of spark.sql.shuffle.partitions
> from 20 to 4000 but havn't seen lot of difference.
>
>
>
> From the logs I have the yarn error attached at end [1]. I have got the
> below spark configs [2] for the job.
>
>
>
> Is there any other tuning I need to look into. Any tips would be
> appreciated,
>
>
>
> Thanks
>
>
>
>
>
> 2. Spark config -
>
> spark-submit
>
> --master yarn-client
>
> --driver-memory 1G
>
> --executor-memory 10G
>
> --executor-cores 5
>
> --conf spark.dynamicAllocation.enabled=true
>
> --conf spark.shuffle.service.enabled=true
>
> --conf spark.dynamicAllocation.initialExecutors=2
>
> --conf spark.dynamicAllocation.minExecutors=2
>
>
>
>
>
> 1. Yarn Error:
>
>
> 16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed:
> container_1459747472046_1618_02_000003. Exit status: 1. Diagnostics:
> Exception from container-launch.
> Container id: container_1459747472046_1618_02_000003
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Container exited with a non-zero exit code 1
>
>
RE: Spark Sql on large number of files (~500Megs each) fails after
couple of hours
Posted by "Yu, Yucai" <yu...@intel.com>.
Hi Yash,
How about checking the executor(yarn container) log? Most of time, it shows more details, we are using CDH, the log is at:
[yucai@sr483 container_1457699919227_0094_01_000014]$ pwd
/mnt/DP_disk1/yucai/yarn/logs/application_1457699919227_0094/container_1457699919227_0094_01_000014
[yucai@sr483 container_1457699919227_0094_01_000014]$ ls -tlr
total 408
-rw-r--r-- 1 yucai DP 382676 Mar 13 18:04 stderr
-rw-r--r-- 1 yucai DP 22302 Mar 13 18:04 stdout
Please pay attention, you had better check the first failure container .
Thanks,
Yucai
From: Yash Sharma [mailto:yash360@gmail.com]
Sent: Monday, April 11, 2016 10:46 AM
To: dev@spark.apache.org
Subject: Spark Sql on large number of files (~500Megs each) fails after couple of hours
Hi All,
I am trying Spark Sql on a dataset ~16Tb with large number of files (~50K). Each file is roughly 400-500 Megs.
I am issuing a fairly simple hive query on the dataset with just filters (No groupBy's and Joins) and the job is very very slow. It runs for 7-8 hrs and processes about 80-100 Gigs on a 12 node cluster.
I have experimented with different values of spark.sql.shuffle.partitions from 20 to 4000 but havn't seen lot of difference.
From the logs I have the yarn error attached at end [1]. I have got the below spark configs [2] for the job.
Is there any other tuning I need to look into. Any tips would be appreciated,
Thanks
2. Spark config -
spark-submit
--master yarn-client
--driver-memory 1G
--executor-memory 10G
--executor-cores 5
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.initialExecutors=2
--conf spark.dynamicAllocation.minExecutors=2
1. Yarn Error:
16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed: container_1459747472046_1618_02_000003. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1459747472046_1618_02_000003
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1