Posted to issues@spark.apache.org by "Graham Dennis (JIRA)" <ji...@apache.org> on 2014/09/03 09:54:51 UTC

[jira] [Created] (SPARK-3368) Spark cannot be used with Avro and Parquet

Graham Dennis created SPARK-3368:
------------------------------------

             Summary: Spark cannot be used with Avro and Parquet
                 Key: SPARK-3368
                 URL: https://issues.apache.org/jira/browse/SPARK-3368
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.0.2
            Reporter: Graham Dennis


Spark (as of 1.0.2) cannot use any Parquet write support classes that are not part of the Spark assembly jar, at least when the application is launched via `spark-submit`. In particular, this prevents using Avro with Parquet.

See https://github.com/GrahamDennis/spark-avro-parquet for a test case to reproduce this issue.
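
For concreteness, here is a minimal sketch of the kind of job that triggers the failure. The object name, schema, and output path are illustrative (the linked repository above is the authoritative reproduction); the sketch assumes the parquet-mr 1.x package names (`parquet.avro.*`, `parquet.hadoop.*`). It registers `parquet.avro.AvroWriteSupport` via `ParquetOutputFormat.setWriteSupportClass` and then writes through `saveAsNewAPIHadoopFile`:

{code}
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed in 1.0.x)
import parquet.avro.{AvroParquetOutputFormat, AvroWriteSupport}
import parquet.hadoop.ParquetOutputFormat

object AvroParquetRepro {
  val schemaJson =
    """{"type": "record", "name": "User",
      | "fields": [{"name": "name", "type": "string"}]}""".stripMargin

  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("avro-parquet-repro"))
    val schema = new Schema.Parser().parse(schemaJson)

    val job = new Job()
    // Stores "parquet.avro.AvroWriteSupport" under parquet.write.support.class;
    // the executor later fails to load it because AvroWriteSupport lives in
    // the application jar, not the Spark assembly.
    ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
    AvroParquetOutputFormat.setSchema(job, schema)

    val records = sc.parallelize(Seq("alice", "bob")).map { name =>
      val record: GenericRecord = new GenericData.Record(schema)
      record.put("name", name)
      (null.asInstanceOf[Void], record)
    }

    // Fails on the executor with the BadConfigurationException /
    // ClassNotFoundException shown below.
    records.saveAsNewAPIHadoopFile(
      "/tmp/users.parquet", // illustrative output path
      classOf[Void],
      classOf[GenericRecord],
      classOf[AvroParquetOutputFormat],
      job.getConfiguration)
  }
}
{code}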

The failure appears in the executor logs as:

{noformat}
    14/09/03 17:31:10 ERROR Executor: Exception in task ID 0
    parquet.hadoop.BadConfigurationException: could not instanciate class parquet.avro.AvroWriteSupport set in job conf at parquet.write.support.class
    	at parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:121)
    	at parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:302)
    	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
    	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
    	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:714)
    	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:699)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    	at org.apache.spark.scheduler.Task.run(Task.scala:51)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    	at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.ClassNotFoundException: parquet.avro.AvroWriteSupport
    	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    	at java.security.AccessController.doPrivileged(Native Method)
    	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    	at java.lang.Class.forName0(Native Method)
    	at java.lang.Class.forName(Class.java:190)
    	at parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:115)
    	... 11 more
{noformat}

The root cause is that the class loader used to find the Parquet write support class only searches the Spark assembly jar and does not also search the application jar. As the stack trace shows, {{ParquetOutputFormat}} resolves the class via {{Class.forName}}, which consults the JVM's application class loader ({{sun.misc.Launcher$AppClassLoader}}) rather than the executor class loader that contains the user jar. A fix would be to ensure that the application jar is always available on the executor classpath. This is the same underlying issue as SPARK-2878 and SPARK-3166.
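
Until that is fixed, one possible workaround (a sketch, not part of the linked reproduction: the jar path is hypothetical and must already exist on every worker node) is to pre-deploy the application jar and put it on the executor JVM's launch classpath via `spark.executor.extraClassPath`, so the class loader that `Class.forName` consults can see `AvroWriteSupport`:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical workaround: /opt/jobs/avro-parquet-assembly.jar is an assumed
// path that must already exist on every worker. Putting the application jar
// on the executor JVM's launch classpath makes it visible to the system
// class loader, which is the one Class.forName uses in the stack trace above.
val conf = new SparkConf()
  .setAppName("avro-parquet-repro")
  .set("spark.executor.extraClassPath", "/opt/jobs/avro-parquet-assembly.jar")
val sc = new SparkContext(conf)
{code}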


