Posted to user@spark.apache.org by Diego Fanesi <di...@gmail.com> on 2017/03/21 02:55:10 UTC

Workers connected to a standalone cluster are continuously crashing

Hello everybody,

I configured a simple standalone cluster with a few machines, and I am
trying to submit a very simple job just to test the cluster.

My laptop is the client and one of the workers; my server runs the
master and the second worker.

If I run the job by executing the Scala code directly from IntelliJ,
without using spark-submit, it completes correctly, but this is not a
supported procedure and many operations, such as UDFs, do not work
properly.
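
For context, the test job is essentially a trivial count, along these
lines (class name and master URL are placeholders, not my literal code):

import org.apache.spark.sql.SparkSession

object ClusterSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cluster-smoke-test")
      // placeholder master URL; with spark-submit this can instead
      // be supplied via --master
      .master("spark://myserver:7077")
      .getOrCreate()

    // trivial distributed computation, just to exercise the executors
    val n = spark.sparkContext.parallelize(1 to 1000).map(_ * 2).count()
    println(s"count = $n")

    spark.stop()
  }
}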

If I use spark-submit instead, the worker on my laptop runs correctly,
but the second worker on my server continuously exits with the stack
trace attached at the end of this email.
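
For completeness, the submission looks roughly like this (class name,
hostname and jar path are placeholders):

spark-submit \
  --class ClusterSmokeTest \
  --master spark://myserver:7077 \
  --deploy-mode client \
  path/to/my-test-job.jar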

I made the following attempts to solve the issue:
- I forced the resolution of the hostname to the IP address of the
network card.
- I replaced OpenJDK with Oracle JDK.
- I set SPARK_LOCAL_IP to the IP of the network card in spark-env.sh
(see the sketch after this list).
- I changed /etc/nsswitch.conf to resolve hostnames through the
internal DNS server first.
- I created a VM and separated the master from the worker: the master
works, but the VM containing the worker keeps failing. Installing three
workers on three VMs, I get the same issue.
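
The relevant spark-env.sh line is along these lines (the address shown
is illustrative, not my actual IP):

# conf/spark-env.sh on each node:
# bind Spark's services to the network card's address instead of the
# loopback interface
SPARK_LOCAL_IP=192.168.1.20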

All systems run Arch Linux, and I can't understand why my laptop
behaves correctly while every other system I install presents the same
problem. Pinging the hostname, I can see that after my configuration it
is no longer resolved to 127.0.0.1, so this configuration should be OK.
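
To illustrate, after the change the hosts file looks roughly like this
(names and addresses are illustrative, not my literal file):

# /etc/hosts
127.0.0.1      localhost
# the machine's own hostname now resolves to the NIC address,
# not to the loopback address
192.168.1.10   myserver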

I really can't understand what I am doing wrong.

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult
	at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:202)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
	... 4 more
Caused by: java.io.IOException: Failed to connect to /127.0.0.1:41851
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
	at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:41851
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:640)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	... 1 more