Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2019/09/05 17:46:00 UTC

[jira] [Resolved] (SPARK-28981) Missing library for reading/writing Snappy-compressed files

     [ https://issues.apache.org/jira/browse/SPARK-28981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-28981.
-----------------------------------
    Resolution: Cannot Reproduce

Since this is fixed in 2.4.4, this issue seems to have been reported with the wrong affected version. The following is the Apache Spark 2.4.4 result. I'll close this as `Cannot Reproduce` according to the current `Affects Versions`. In addition, I linked SPARK-26995 as a duplicate for future reference.

{code}
$ docker build -t spark:2.4.4 -f kubernetes/dockerfiles/spark/Dockerfile .
$ docker run --rm -it spark:2.4.4 /opt/spark/bin/spark-shell
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ SPARK_K8S_CMD=/opt/spark/bin/spark-shell
+ case "$SPARK_K8S_CMD" in
+ echo 'Non-spark-on-k8s command provided, proceeding in pass-through mode...'
Non-spark-on-k8s command provided, proceeding in pass-through mode...
+ exec /sbin/tini -s -- /opt/spark/bin/spark-shell
19/09/05 17:39:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://454a817f8cee:4040
Spark context available as 'sc' (master = local[*], app id = local-1567705163260).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(10).write.parquet("/tmp/p")
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 75.08% for 9 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 67.58% for 10 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 61.43% for 11 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 67.58% for 10 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 75.08% for 9 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
19/09/05 17:39:38 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers

scala> spark.read.parquet("/tmp/p").count
res1: Long = 10
{code}
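
For completeness, a minimal variant of the check above that pins the Parquet codec to Snappy explicitly (Snappy is Spark's default Parquet compression codec, so the session above already exercises the same code path; the /tmp/p_snappy path is only illustrative):

{code}
scala> spark.range(10).write.option("compression", "snappy").parquet("/tmp/p_snappy")

scala> spark.read.parquet("/tmp/p_snappy").count
{code}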

> Missing library for reading/writing Snappy-compressed files
> -----------------------------------------------------------
>
>                 Key: SPARK-28981
>                 URL: https://issues.apache.org/jira/browse/SPARK-28981
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.4.4
>            Reporter: Paul Schweigert
>            Priority: Minor
>
> The current Dockerfile for Spark on Kubernetes is missing the "ld-linux-x86-64.so.2" library needed to read/write Snappy-compressed files.
>  
> Sample error message when trying to read a Parquet file compressed with Snappy:
>  
> {code:java}
> 19/09/02 05:33:19 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 172.30.189.77, executor 2): org.apache.spark.SparkException: Task failed while writing rows.    
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)    
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)    
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)    
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)    
>     at org.apache.spark.scheduler.Task.run(Task.scala:121)    
>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)    
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)    
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)    
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)    
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)    
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.1.7-04145e2f-cc82-4217-99b8-641cdd755a87-libsnappyjava.so: Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by /tmp/snappy-1.1.7-04145e2f-cc82-4217-99b8-641cdd755a87-libsnappyjava.so)    
>     at java.lang.ClassLoader$NativeLibrary.load(Native Method)    
>     at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)    
>     at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)    
>     at java.lang.Runtime.load0(Runtime.java:809)    
>     at java.lang.System.load(System.java:1086)    
>     at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:179)    
>     at org.xerial.snappy.SnappyLoader.loadSnappyApi(SnappyLoader.java:154)    
>     at org.xerial.snappy.Snappy.<clinit>(Snappy.java:47)    
>     at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)    
>     at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)    
>     at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)    
>     at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.compress(CodecFactory.java:165)    
>     at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:95)    
>     at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:147)    
>     at org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:235)    
>     at org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:122)    
>     at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:172)    
>     at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:114)    
>     at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)    
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)    
>     at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)    
>     at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)    
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)    
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)    
>     at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)    
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)    
>     ... 10 more
> {code}
> The relevant library is in the Alpine Linux "gcompat" package ([https://pkgs.alpinelinux.org/package/edge/community/x86/gcompat]). Adding this library to the Dockerfile enables the reading/writing of Snappy-compressed files.
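> For illustration, a minimal sketch of the kind of Dockerfile change described above (hypothetical, not a tested patch; it assumes the Alpine-based image built by kubernetes/dockerfiles/spark/Dockerfile and that gcompat is available in the configured Alpine repositories):
> {code}
> # Hypothetical sketch: install Alpine's glibc compatibility layer so the
> # bundled libsnappyjava.so can find the ld-linux-x86-64.so.2 loader it expects.
> RUN apk add --no-cache gcompat
> {code}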
>  


