Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/01/22 11:43:48 UTC

[jira] [Commented] (SPARK-5350) There are issues when combining Spark and CDK (https://github.com/egonw/cdk).

    [ https://issues.apache.org/jira/browse/SPARK-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287242#comment-14287242 ] 

Sean Owen commented on SPARK-5350:
----------------------------------

Is this really a Spark issue? My hunch is that this third-party project depends on Hadoop 2.x. You need to run Hadoop 2.x, and a build of Spark for Hadoop 2.x, on your cluster; your app should then not bundle its own copies of the Hadoop libraries. Here it looks like you are either running the app against Hadoop 1.x / a Spark build for Hadoop 1.x, or packaging Hadoop 1.x libraries in your app.
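
For example (a rough sketch only; the versions and artifacts below are illustrative assumptions, not taken from the reporter's actual build), Spark and Hadoop dependencies can be declared with "provided" scope so the application jar does not bundle its own Hadoop copies, and {{mvn dependency:tree}} will show where any Hadoop 1.x jar is coming from:

{code:xml}
<!-- Illustrative sketch only: artifact IDs and versions are assumptions. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.0</version>
  <!-- provided: the cluster's Spark/Hadoop installation supplies these at
       runtime, so the application jar does not ship its own (possibly 1.x)
       copies of the Hadoop classes -->
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.4.0</version>
  <scope>provided</scope>
</dependency>
{code}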

> There are issues when combining Spark and CDK (https://github.com/egonw/cdk). 
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-5350
>                 URL: https://issues.apache.org/jira/browse/SPARK-5350
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.1, 1.2.0
>         Environment: Running Spark using a local computer, using both Mac OS X and a VM with Linux Ubuntu.
>            Reporter: Staffan Arvidsson
>
> I'm using Maven and Eclipse to build my project. When I import the CDK (https://github.com/egonw/cdk) jar files that I need, set up the SparkContext, and try, for instance, to read a file (simply "val lines = sc.textFile(filePath)"), I get the following errors in the log:
> {quote}
> [main] DEBUG org.apache.spark.rdd.HadoopRDD  - SplitLocationInfo and other new Hadoop classes are unavailable. Using the older Hadoop location info code.
> java.lang.ClassNotFoundException: org.apache.hadoop.mapred.InputSplitWithLocationInfo
> 	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> 	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> 	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> 	at java.lang.Class.forName0(Native Method)
> 	at java.lang.Class.forName(Class.java:191)
> 	at org.apache.spark.rdd.HadoopRDD$SplitInfoReflections.<init>(HadoopRDD.scala:381)
> 	at org.apache.spark.rdd.HadoopRDD$.liftedTree1$1(HadoopRDD.scala:391)
> 	at org.apache.spark.rdd.HadoopRDD$.<init>(HadoopRDD.scala:390)
> 	at org.apache.spark.rdd.HadoopRDD$.<clinit>(HadoopRDD.scala)
> 	at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:159)
> 	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> 	at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
> 	at org.apache.spark.rdd.RDD.foreach(RDD.scala:765)
> {quote}
> Later in the log:
> {quote}
> [Executor task launch worker-0] DEBUG org.apache.spark.deploy.SparkHadoopUtil  - Couldn't find method for retrieving thread-level FileSystem input data
> java.lang.NoSuchMethodException: org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()
> 	at java.lang.Class.getDeclaredMethod(Class.java:2009)
> 	at org.apache.spark.util.Utils$.invoke(Utils.scala:1733)
> 	at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178)
> 	at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> 	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> 	at org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178)
> 	at org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:138)
> 	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:220)
> 	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
> 	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> 	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:56)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {quote}
> There have also been issues related to "HADOOP_HOME" not being set, but these seem to be intermittent and occur only sometimes.
> After testing different versions of both CDK and Spark, I've found that Spark version 0.9.1 seems to make things work. This will not solve my problem though, as I will later need functionality from MLlib that is only available in newer versions of Spark.
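
A quick way to see which Hadoop version actually ends up on the application classpath is sketched below (a minimal illustration; the object name is hypothetical, and it assumes the Hadoop client jars are on the classpath). A 1.x version string, or a missing InputSplitWithLocationInfo class, would match the mixed-version diagnosis above:

{code:scala}
import org.apache.hadoop.util.VersionInfo

object HadoopVersionCheck {
  def main(args: Array[String]): Unit = {
    // Prints the Hadoop version that is actually on the classpath;
    // a 1.x value here would explain the missing Hadoop 2.x classes.
    println(s"Hadoop on classpath: ${VersionInfo.getVersion}")

    // The class the DEBUG log complains about only exists in Hadoop 2.x,
    // so probing for it is another quick sanity check.
    val present =
      try { Class.forName("org.apache.hadoop.mapred.InputSplitWithLocationInfo"); true }
      catch { case _: ClassNotFoundException => false }
    println(s"InputSplitWithLocationInfo present: $present")
  }
}
{code}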



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org