Posted to user@spark.apache.org by Stéphane Verlet <ka...@gmail.com> on 2014/11/08 04:12:45 UTC

Spark 1.1.0 cannot read Snappy-compressed sequence file

I first saw this with Spark SQL, but the result is the same with plain
Spark.

14/11/07 19:46:36 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
    at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method)
    at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)

The full stack trace is below.

I tried many different things without luck (sketched below):
    * extracted libsnappyjava.so from the Spark assembly and put it on
      the library path:
           * added -Djava.library.path=... to SPARK_MASTER_OPTS
             and SPARK_WORKER_OPTS
           * added the library path to SPARK_LIBRARY_PATH
           * added the Hadoop native library path to SPARK_LIBRARY_PATH
    * rebuilt Spark with different versions (previous and next) of Snappy,
      as suggested in various search results
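
For reference, the library-path attempts amounted to spark-env.sh entries
along these lines (/opt/hadoop/lib/native is a placeholder, not the exact
path from my setup):

    # Sketch of the attempted spark-env.sh settings; /opt/hadoop/lib/native
    # stands in for the actual directory containing libsnappyjava.so.
    export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS -Djava.library.path=/opt/hadoop/lib/native"
    export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Djava.library.path=/opt/hadoop/lib/native"
    export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/opt/hadoop/lib/native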


Env:
   CentOS 6.4
   Hadoop 2.3 (CDH5.1)
   Running in standalone/local mode
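
For what it is worth, a quick way to see whether Hadoop itself can load
the native codecs (assuming this Hadoop build ships the checknative
command, as CDH 5 does):

    # Prints which native libraries (hadoop, zlib, snappy, lz4, bzip2) can
    # be loaded; snappy should report "true" with a path to the library.
    hadoop checknative -a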


Any help would be appreciated

Thank you

Stephane


scala> import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.BytesWritable

scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text

scala> import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.io.NullWritable

scala> var seq = sc.sequenceFile[NullWritable,Text]("/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/000000_0").map(_._2.toString())
14/11/07 19:46:19 INFO MemoryStore: ensureFreeSpace(157973) called with curMem=0, maxMem=278302556
14/11/07 19:46:19 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 154.3 KB, free 265.3 MB)
seq: org.apache.spark.rdd.RDD[String] = MappedRDD[2] at map at <console>:15

scala> seq.collect().foreach(println)
14/11/07 19:46:35 INFO FileInputFormat: Total input paths to process : 1
14/11/07 19:46:35 INFO SparkContext: Starting job: collect at <console>:18
14/11/07 19:46:35 INFO DAGScheduler: Got job 0 (collect at <console>:18) with 2 output partitions (allowLocal=false)
14/11/07 19:46:35 INFO DAGScheduler: Final stage: Stage 0(collect at <console>:18)
14/11/07 19:46:35 INFO DAGScheduler: Parents of final stage: List()
14/11/07 19:46:35 INFO DAGScheduler: Missing parents: List()
14/11/07 19:46:35 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[2] at map at <console>:15), which has no missing parents
14/11/07 19:46:35 INFO MemoryStore: ensureFreeSpace(2928) called with curMem=157973, maxMem=278302556
14/11/07 19:46:35 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.9 KB, free 265.3 MB)
14/11/07 19:46:36 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[2] at map at <console>:15)
14/11/07 19:46:36 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/11/07 19:46:36 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1243 bytes)
14/11/07 19:46:36 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1243 bytes)
14/11/07 19:46:36 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/11/07 19:46:36 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
14/11/07 19:46:36 INFO HadoopRDD: Input split: file:/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/000000_0:6504064+6504065
14/11/07 19:46:36 INFO HadoopRDD: Input split: file:/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/000000_0:0+6504064
14/11/07 19:46:36 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/11/07 19:46:36 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/11/07 19:46:36 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/11/07 19:46:36 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/11/07 19:46:36 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/11/07 19:46:36 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
    at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method)
    at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)
    at org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:190)
    at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:176)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
    at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1759)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:49)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    at org.apache.spark.scheduler.Task.run(Task.scala:54)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
14/11/07 19:46:36 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
    [same stack trace as above]
14/11/07 19:46:36 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-1,5,main]
java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
    [same stack trace as above]
14/11/07 19:46:36 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
    [same stack trace as above]
14/11/07 19:46:36 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
    [same stack trace as above]
14/11/07 19:46:36 ERROR TaskSetManager: Task 1 in stage 0.0 failed 1 times; aborting job
14/11/07 19:46:36 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/11/07 19:46:36 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on executor localhost: java.lang.UnsatisfiedLinkError (org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z) [duplicate 1]
14/11/07 19:46:36 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

Re: Spark 1.1.0 cannot read Snappy-compressed sequence file

Posted by Stéphane Verlet <ka...@gmail.com>.
Yes, it is working with this in spark-env.sh:

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_HOME/lib/lib/snappy-java-1.0.4.1.jar
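
With those in place, a quick confirmation from spark-shell is to call the
same Hadoop class the stack trace points at (both are public static
methods on NativeCodeLoader; both should now return true instead of
throwing UnsatisfiedLinkError):

    scala> org.apache.hadoop.util.NativeCodeLoader.isNativeCodeLoaded()
    scala> org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()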

I tried so many things that I do not even know where I got this from :-(

Stephane


On Wed, Nov 26, 2014 at 8:08 AM, cjdc <cr...@cern.ch> wrote:

> Hi,
>
> did you get a solution for this?

Re: Spark 1.1.0 cannot read Snappy-compressed sequence file

Posted by cjdc <cr...@cern.ch>.
Hi,

did you get a solution for this? 



