Posted to user@spark.apache.org by hmxxyy <hm...@gmail.com> on 2014/11/11 06:04:47 UTC

Strange behavior of spark-shell while accessing hdfs

I am trying out spark-shell on a single host and am seeing some strange
behavior.

If I run bin/spark-shell without connecting to a master, it can access an
HDFS file on a remote cluster that uses Kerberos authentication:

scala> val textFile = sc.textFile("hdfs://*.*.*.*:8020/user/lih/drill_test/test.csv")
scala> textFile.count()
res0: Long = 9

However, if I start the master and a worker on the same host, connect with

bin/spark-shell --master spark://*.*.*.*:7077

and run the same commands, I get:

scala> textFile.count()
14/11/11 05:00:23 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, stgace-launcher06.diy.corp.ne1.yahoo.com): java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "*.*.*.*.com/98.138.236.95"; destination host is: "*.*.*.*":8020;
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
	at org.apache.hadoop.ipc.Client.call(Client.java:1375)
	at org.apache.hadoop.ipc.Client.call(Client.java:1324)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:225)
	at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:601)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy20.getBlockLocations(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1165)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1155)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1145)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:268)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:235)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:228)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1318)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:293)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:289)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:289)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
	at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
	at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:233)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
	at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:657)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
	at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:621)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
	at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:368)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1423)
	at org.apache.hadoop.ipc.Client.call(Client.java:1342)
	... 38 more
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
	at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:171)
	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:388)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:702)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:698)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:697)
	... 41 more


I figure that when connecting to a master, the job is executed by a child
process whose environment variables or classpath might differ from the
no-master case.
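As a sanity check (this is just a guess at a useful probe, and HADOOP_CONF_DIR
is simply the variable I suspect matters), I could compare what the executors
see against my local shell:

scala> sc.parallelize(1 to 8).map { _ =>
     |   // hostname and HADOOP_CONF_DIR as seen inside the executor JVM
     |   (java.net.InetAddress.getLocalHost.getHostName,
     |    sys.env.getOrElse("HADOOP_CONF_DIR", "<unset>"))
     | }.distinct.collect()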

Please suggest how to troubleshoot and fix this. I am pulling my hair
out...

Thanks so much.





Re: Strange behavior of spark-shell while accessing hdfs

Posted by ramblingpolak <ad...@wibidata.com>.
You need to set the Spark configuration property spark.yarn.access.namenodes
to point at your namenode.

e.g. spark.yarn.access.namenodes=hdfs://mynamenode:8020
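For example, assuming your spark-shell forwards --conf to spark-submit
(otherwise put the property in conf/spark-defaults.conf; the hostname below is
a placeholder):

bin/spark-shell --master yarn-client \
  --conf spark.yarn.access.namenodes=hdfs://mynamenode:8020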

Separately, I'm curious whether you're also running high-availability HDFS
with an HA nameservice.

I currently have HA HDFS with Kerberos, and I've noticed that I must set the
above property to the currently active namenode's hostname and port. Simply
using the HA nameservice to get delegation tokens does NOT seem to work with
Spark 1.1.0 (even though I can confirm the token is acquired).

I believe this may be a bug. Unfortunately, adding both the active and
standby namenodes does not work either; that actually causes an error. This
means that when my active namenode fails over, my Spark configuration
becomes invalid.
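To make the difference concrete, this is roughly what I see on 1.1.0
(nn1.example.com and mycluster are placeholder names for my active namenode
and my HA nameservice):

# works, but only while nn1 is the active namenode:
spark.yarn.access.namenodes=hdfs://nn1.example.com:8020

# what I would prefer (the HA nameservice), but it does not work for me:
spark.yarn.access.namenodes=hdfs://mycluster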





Re: Strange behavior of spark-shell while accessing hdfs

Posted by hmxxyy <hm...@gmail.com>.
Thanks, guys, for the info.

So I have to use YARN to access a Kerberos-secured cluster.
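If I understand the advice correctly, something like the following is what I
should try (principal, paths, and namenode address are placeholders, and I
assume the Kerberos ticket has to come from kinit before launching):

kinit lih@EXAMPLE.COM
export HADOOP_CONF_DIR=/etc/hadoop/conf
bin/spark-shell --master yarn-client \
  --conf spark.yarn.access.namenodes=hdfs://mynamenode:8020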





Re: Strange behavior of spark-shell while accessing hdfs

Posted by ramblingpolak <ad...@wibidata.com>.
Only YARN mode is supported with Kerberos. You can't use a spark:// master
with Kerberos.


Tobias Pfeiffer wrote
> When you give a "spark://*" master, Spark will run on a different machine,
> where you have not yet authenticated to HDFS, I think. I don't know how to
> solve this, though, maybe some Kerberos token must be passed on to the
> Spark cluster?







Re: Strange behavior of spark-shell while accessing hdfs

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Hi,

On Tue, Nov 11, 2014 at 2:04 PM, hmxxyy <hm...@gmail.com> wrote:
>
> If I run bin/spark-shell without connecting a master, it can access a hdfs
> file on a remote cluster with kerberos authentication.

[...]

> However, if I start the master and slave on the same host and using
> bin/spark-shell --master spark://*.*.*.*:7077
> run the same commands

[...]
> org.apache.hadoop.security.AccessControlException: Client cannot
> authenticate via:[TOKEN, KERBEROS]; Host Details : local host is:
> "*.*.*.*.com/98.138.236.95"; destination host is: "*.*.*.*":8020;
>

When you give no master, it defaults to "local[*]", so Spark will (implicitly?)
authenticate to HDFS from your local machine using the local environment
variables, key files, etc., I guess.

When you give a "spark://*" master, Spark will run on a different machine,
where you have not yet authenticated to HDFS, I think. I don't know how to
solve this, though, maybe some Kerberos token must be passed on to the
Spark cluster?
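If you want to test that theory, perhaps something like this (just a guess at
a useful probe) would show which identity the executors actually use:

scala> import org.apache.hadoop.security.UserGroupInformation
scala> sc.parallelize(1 to 4).map { _ =>
     |   // the Hadoop user and auth method as seen from the executor JVM
     |   UserGroupInformation.getCurrentUser.toString
     | }.distinct.collect()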

Tobias