Posted to issues@spark.apache.org by "Graham Dennis (JIRA)" <ji...@apache.org> on 2014/09/03 10:05:52 UTC

[jira] [Commented] (SPARK-3368) Spark cannot be used with Avro and Parquet

    [ https://issues.apache.org/jira/browse/SPARK-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119522#comment-14119522 ] 

Graham Dennis commented on SPARK-3368:
--------------------------------------

There are a couple of GitHub repos that demonstrate using Avro & Parquet with Spark; see https://github.com/AndreSchumacher/avro-parquet-spark-example and https://github.com/massie/spark-parquet-example

In the first case the data is written out locally, and in the second Spark is launched via Maven (not spark-submit), which puts both the Spark jars and the application jars on the classpath at launch time.
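
For reference, a rough way to get the same "application jars on the classpath at launch time" behaviour when using spark-submit (a sketch only; the jar path is a placeholder and the jar must already exist at that path on every worker) is to add the application jar to the executor JVM classpath explicitly:

{noformat}
import org.apache.spark.{SparkConf, SparkContext}

// Put the application jar on the executor JVM's launch-time classpath so that
// classes resolved by name from the Hadoop output format (e.g.
// parquet.avro.AvroWriteSupport) are visible to it, similar to the Maven launch.
val conf = new SparkConf()
  .setAppName("avro-parquet-workaround")
  .set("spark.executor.extraClassPath", "/path/on/each/worker/my-app.jar")

val sc = new SparkContext(conf)
{noformat}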

> Spark cannot be used with Avro and Parquet
> ------------------------------------------
>
>                 Key: SPARK-3368
>                 URL: https://issues.apache.org/jira/browse/SPARK-3368
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.2
>            Reporter: Graham Dennis
>
> Spark cannot currently (as of 1.0.2) use any Parquet write support classes that are not part of the Spark assembly jar (at least when launched using `spark-submit`).  This prevents using Avro with Parquet.
> See https://github.com/GrahamDennis/spark-avro-parquet for a test case to reproduce this issue.
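> A minimal sketch of the kind of job that hits this (not the exact test case from the repo above; the schema, output path, and an existing SparkContext `sc` are assumed): the write support class is recorded by name in the job configuration and then has to be resolved by reflection on the executors.
> {noformat}
>     import org.apache.avro.Schema
>     import org.apache.avro.generic.{GenericData, GenericRecord}
>     import org.apache.hadoop.mapreduce.Job
>     import parquet.avro.AvroWriteSupport
>     import parquet.hadoop.ParquetOutputFormat
>
>     val schemaJson =
>       """{"type":"record","name":"User","fields":[{"name":"name","type":"string"}]}"""
>
>     val job = new Job(sc.hadoopConfiguration)
>     // Stores "parquet.avro.AvroWriteSupport" under parquet.write.support.class in the conf.
>     ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
>     AvroWriteSupport.setSchema(job.getConfiguration, new Schema.Parser().parse(schemaJson))
>
>     // Build the Avro records inside the task so no Avro objects are captured in the closure.
>     val records = sc.parallelize(Seq("alice", "bob")).map { name =>
>       val record = new GenericData.Record(new Schema.Parser().parse(schemaJson))
>       record.put("name", name)
>       (null.asInstanceOf[Void], record)
>     }
>
>     records.saveAsNewAPIHadoopFile(
>       "/tmp/users.parquet",
>       classOf[Void],
>       classOf[GenericRecord],
>       classOf[ParquetOutputFormat[GenericRecord]],
>       job.getConfiguration)
> {noformat}
> Submitted with spark-submit, the job fails when the executor tries to instantiate the write support class.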
> The problem appears in the master logs as:
> {noformat}
>     14/09/03 17:31:10 ERROR Executor: Exception in task ID 0
>     parquet.hadoop.BadConfigurationException: could not instanciate class parquet.avro.AvroWriteSupport set in job conf at parquet.write.support.class
>     	at parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:121)
>     	at parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:302)
>     	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
>     	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>     	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:714)
>     	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:699)
>     	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>     	at org.apache.spark.scheduler.Task.run(Task.scala:51)
>     	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>     	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     	at java.lang.Thread.run(Thread.java:745)
>     Caused by: java.lang.ClassNotFoundException: parquet.avro.AvroWriteSupport
>     	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>     	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>     	at java.security.AccessController.doPrivileged(Native Method)
>     	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>     	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>     	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>     	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>     	at java.lang.Class.forName0(Native Method)
>     	at java.lang.Class.forName(Class.java:190)
>     	at parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:115)
>     	... 11 more
> {noformat}
> The root cause of the problem is that the class loader used to find the Parquet write support class only searches the Spark assembly jar and doesn't also search the application jar.  A solution would be to ensure that the application jar is always available on the executor classpath.  This is the same underlying issue as SPARK-2878 and SPARK-3166.
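> A rough way to see the mismatch from inside a task (a diagnostic sketch; assumes an existing SparkContext `sc` and that parquet-avro is packaged only in the application jar): the loader that loaded the Parquet classes from the Spark assembly cannot see AvroWriteSupport, while the executor's context class loader, which includes the application jar, can.
> {noformat}
>     import parquet.hadoop.ParquetOutputFormat
>
>     val report = sc.parallelize(Seq(1), 1).map { _ =>
>       def visible(loader: ClassLoader): String =
>         try { Class.forName("parquet.avro.AvroWriteSupport", false, loader); "found" }
>         catch { case _: ClassNotFoundException => "missing" }
>
>       // Loader that loaded the Parquet classes bundled in the Spark assembly.
>       val assemblyLoader = classOf[ParquetOutputFormat[_]].getClassLoader
>       // Loader the executor uses for user/task classes; it includes the application jar.
>       val contextLoader = Thread.currentThread().getContextClassLoader
>
>       s"assembly loader: ${visible(assemblyLoader)}, context loader: ${visible(contextLoader)}"
>     }.collect().head
>
>     println(report)
> {noformat}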



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
