Posted to user@spark.apache.org by Matt Mead <ma...@matthewcmead.com> on 2014/12/20 21:44:30 UTC

v1.2.0 (re?)introduces Wrong FS behavior in thriftserver

First, thanks for your efforts and contributions to such a useful software
stack!  Spark is great!

I have been using the git tags for v1.2.0-rc1 and v1.2.0-rc2 built as
follows:

./make-distribution.sh -Dhadoop.version=2.5.0-cdh5.2.0 -Dyarn.version=2.5.0-cdh5.2.0 \
  -Phadoop-2.4 -Phive -Pyarn -Phive-thriftserver


I have been starting the thriftserver as follows:

HADOOP_CONF_DIR=/etc/hadoop/conf ./sbin/start-thriftserver.sh \
  --master yarn --num-executors 16


Under v1.2.0-rc1 and v1.2.0-rc2 this worked properly: the thriftserver
started up, and I was able to interact with it and execute queries as
expected through the JDBC driver.
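
For reference, this is roughly how I exercise the server over JDBC. This is
a minimal standalone sketch rather than my actual client code; the host,
port, credentials, and sample query are placeholders.

import java.sql.DriverManager

object ThriftServerSmokeTest {
  def main(args: Array[String]): Unit = {
    // The Spark thriftserver speaks the HiveServer2 protocol, so the Hive JDBC driver is used.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // Placeholder host/port; 10000 is the default thriftserver listen port.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
    try {
      val rs = conn.createStatement().executeQuery("SHOW TABLES")
      while (rs.next()) println(rs.getString(1))
    } finally {
      conn.close()
    }
  }
}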

I have updated to git tag v1.2.0, built it identically, and started the
thriftserver identically, but am now running into the following issue at
startup:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://myhdfs/user/user/.sparkStaging/application_1416150945509_0055/datanucleus-api-jdo-3.2.6.jar, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
        at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:519)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
        at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67)
        at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:257)
        at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:242)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:242)
        at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:35)
        at org.apache.spark.deploy.yarn.ClientBase$class.createContainerLaunchContext(ClientBase.scala:350)
        at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:35)
        at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:80)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:140)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:335)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:38)
        at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:56)
        at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Looking at SPARK-4757, it appears others saw this behavior in earlier
releases and it was fixed for v1.2.0, whereas I did not see the behavior in
the earlier release candidates and am now seeing it in v1.2.0.
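
For what it's worth, my reading of the trace is that the hdfs:// staging
path is being checked against a FileSystem bound to file:///. Below is a
minimal standalone sketch of that mismatch using the Hadoop API directly;
it is not the Spark code itself, and the staging path is made up.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object WrongFsSketch {
  def main(args: Array[String]): Unit = {
    // With no cluster configuration on the classpath, fs.defaultFS falls back to file:///
    val conf = new Configuration()
    val localFs = FileSystem.get(conf) // resolves to the local filesystem
    val staged = new Path("hdfs://myhdfs/user/user/.sparkStaging/app_0001/some.jar")
    try {
      localFs.getFileStatus(staged) // checkPath rejects the hdfs:// scheme
    } catch {
      case e: IllegalArgumentException =>
        println(e.getMessage) // Wrong FS: hdfs://..., expected: file:///
    }
    // Resolving the FileSystem from the path itself avoids the mismatch:
    // val hdfs = staged.getFileSystem(conf)
  }
}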

I have tested this with the exact same build and launch commands on two
separate CDH 5.2.0 clusters with identical results.  Both machines where the
build and execution take place have a proper HDFS/YARN client configuration
in /etc/hadoop/conf, and other Hadoop tools such as MR2 on YARN function as
expected.
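
As a sanity check on those client configs, this is a small sketch of how I
confirm what fs.defaultFS the configuration in /etc/hadoop/conf resolves to;
the file locations assume the standard CDH layout.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object DefaultFsCheck {
  def main(args: Array[String]): Unit = {
    // Load only the cluster client config, not whatever happens to be on the classpath.
    val conf = new Configuration(false)
    conf.addResource(new Path("file:///etc/hadoop/conf/core-site.xml"))
    // This should print hdfs://<nameservice>, not file:///
    println(conf.get("fs.defaultFS", conf.get("fs.default.name", "file:///")))
  }
}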

Any ideas on what to do to resolve this issue?

Thanks!




-matt

Re: v1.2.0 (re?)introduces Wrong FS behavior in thriftserver

Posted by Matt Mead <ma...@matthewcmead.com>.
That seemed to correct the issue.  Thanks for pointing out the lack of
diffs between v1.2.0-rc2 and v1.2.0 -- I'm not sure how my git repo ended
up not matching its origin.




-matt



Re: v1.2.0 (re?)introduces Wrong FS behavior in thriftserver

Posted by Matt Mead <ma...@matthewcmead.com>.
Bizarre.  I originally cloned from and have been pulling from
https://github.com/apache/spark, and my repo shows the following:

user@host:~/development/spark$ git diff v1.2.0-rc2..v1.2.0 | wc -l
1898


If I pull a fresh clone, I get this:

user@host:~$ git clone https://github.com/apache/spark
Cloning into 'spark'...
remote: Counting objects: 152765, done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 152765 (delta 16), reused 64 (delta 16)
Receiving objects: 100% (152765/152765), 85.01 MiB | 3.29 MiB/s, done.
Resolving deltas: 100% (68247/68247), done.
user@host:~$ cd spark
user@host:~/spark$ git diff v1.2.0-rc2..v1.2.0 | wc -l
0


I will do a build from the fresh clone and report back on whether the
behavior persists.




-matt



Re: v1.2.0 (re?)introduces Wrong FS behavior in thriftserver

Posted by Mark Hamstra <ma...@clearstorydata.com>.
This makes no sense.  There is no difference between v1.2.0-rc2 and v1.2.0:
https://github.com/apache/spark/compare/v1.2.0-rc2...v1.2.0
