Posted to dev@spark.apache.org by ÐΞ€ρ@Ҝ (๏̯͡๏) <de...@gmail.com> on 2015/03/05 05:57:04 UTC

Fwd: Unable to Read/Write Avro RDD on cluster.

I am trying to read an Avro RDD, transform it, and write it back.
It runs fine locally, but when I run it on the cluster I see issues
with Avro.
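
For context, the read side is essentially the standard newAPIHadoopFile
route with AvroKeyInputFormat, matching the stack trace below (a simplified
sketch; readAvro and the surrounding names are illustrative, not the exact
job code):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext

def readAvro(sc: SparkContext, path: String) = {
  // Records come back as (AvroKey[GenericRecord], NullWritable) pairs.
  val rdd = sc.newAPIHadoopFile(
    path,
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable])
  // Unwrap to the GenericRecord. Note that Avro may reuse record objects,
  // so copy them before caching or shuffling.
  rdd.map { case (key, _) => key.datum() }
}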


export SPARK_HOME=/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1
export SPARK_YARN_USER_ENV="CLASSPATH=/apache/hadoop/conf"
export HADOOP_CONF_DIR=/apache/hadoop/conf
export YARN_CONF_DIR=/apache/hadoop/conf
export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.0.2-hadoop2.4.1.jar
export SPARK_LIBRARY_PATH=/apache/hadoop/lib/native
export SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-company-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar:/home/dvasthimal/spark/avro-1.7.7.jar

cd $SPARK_HOME

./bin/spark-submit --master yarn-cluster \
  --jars /home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar,/home/dvasthimal/spark/avro-1.7.7.jar \
  --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 \
  --queue hdmi-spark \
  --class com.company.ep.poc.spark.reporting.SparkApp \
  /home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar \
  startDate=2015-02-16 endDate=2015-02-16 \
  epoutputdirectory=/user/dvasthimal/epdatasets_small/exptsession \
  subcommand=successevents \
  outputdir=/user/dvasthimal/epdatasets/successdetail

Spark assembly has been built with Hive, including Datanucleus jars on
classpath
15/03/04 03:20:29 INFO client.ConfiguredRMFailoverProxyProvider: Failing
over to rm2
15/03/04 03:20:30 INFO yarn.Client: Got Cluster metric info from
ApplicationsManager (ASM), number of NodeManagers: 2221
15/03/04 03:20:30 INFO yarn.Client: Queue info ... queueName: hdmi-spark,
queueCurrentCapacity: 0.7162806, queueMaxCapacity: 0.08,
      queueApplicationCount = 7, queueChildQueueCount = 0
15/03/04 03:20:30 INFO yarn.Client: Max mem capabililty of a single
resource in this cluster 16384
15/03/04 03:20:30 INFO yarn.Client: Preparing Local resources
15/03/04 03:20:30 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
15/03/04 03:20:30 WARN hdfs.BlockReaderLocal: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.


15/03/04 03:20:46 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token
7780745 for dvasthimal on 10.115.206.112:8020
15/03/04 03:20:46 INFO yarn.Client: Uploading
file:/home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar to hdfs://
apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/spark_reporting-1.0-SNAPSHOT.jar
15/03/04 03:20:47 INFO yarn.Client: Uploading
file:/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1/lib/spark-assembly-1.0.2-hadoop2.4.1.jar
to hdfs://
apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/spark-assembly-1.0.2-hadoop2.4.1.jar
15/03/04 03:20:52 INFO yarn.Client: Uploading
file:/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar to hdfs://
apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/avro-mapred-1.7.7-hadoop2.jar
15/03/04 03:20:52 INFO yarn.Client: Uploading
file:/home/dvasthimal/spark/avro-1.7.7.jar to hdfs://
apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/avro-1.7.7.jar
15/03/04 03:20:54 INFO yarn.Client: Setting up the launch environment
15/03/04 03:20:54 INFO yarn.Client: Setting up container launch context
15/03/04 03:20:54 INFO yarn.Client: Command for starting the Spark
ApplicationMaster: List($JAVA_HOME/bin/java, -server, -Xmx4096m,
-Djava.io.tmpdir=$PWD/tmp,
-Dspark.app.name=\"com.company.ep.poc.spark.reporting.SparkApp\",
 -Dlog4j.configuration=log4j-spark-container.properties,
org.apache.spark.deploy.yarn.ApplicationMaster, --class,
com.company.ep.poc.spark.reporting.SparkApp, --jar ,
file:/home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar,  --args
 'startDate=2015-02-16'  --args  'endDate=2015-02-16'  --args
 'epoutputdirectory=/user/dvasthimal/epdatasets_small/exptsession'  --args
 'subcommand=successevents'  --args
 'outputdir=/user/dvasthimal/epdatasets/successdetail' , --executor-memory,
2048, --executor-cores, 1, --num-executors , 3, 1>, <LOG_DIR>/stdout, 2>,
<LOG_DIR>/stderr)
15/03/04 03:20:54 INFO yarn.Client: Submitting application to ASM
15/03/04 03:20:54 INFO impl.YarnClientImpl: Submitted application
application_1425075571333_61948
15/03/04 03:20:56 INFO yarn.Client: Application report from ASM:
 application identifier: application_1425075571333_61948
 appId: 61948
 clientToAMToken: null
 appDiagnostics:
 appMasterHost: N/A
 appQueue: hdmi-spark
 appMasterRpcPort: -1
 appStartTime: 1425464454263
 yarnAppState: ACCEPTED
 distributedFinalState: UNDEFINED
 appTrackingUrl:
https://apollo-phx-rm-2.company.com:50030/proxy/application_1425075571333_61948/
 appUser: dvasthimal
15/03/04 03:21:18 INFO yarn.Client: Application report from ASM:
 application identifier: application_1425075571333_61948
 appId: 61948
 clientToAMToken: Token { kind: YARN_CLIENT_TOKEN, service:  }
 appDiagnostics:
 appMasterHost: phxaishdc9dn0169.phx.company.com
 appQueue: hdmi-spark
 appMasterRpcPort: 0
 appStartTime: 1425464454263
 yarnAppState: RUNNING
 distributedFinalState: UNDEFINED
 appTrackingUrl:
https://apollo-phx-rm-2.company.com:50030/proxy/application_1425075571333_61948/
 appUser: dvasthimal
….
….
15/03/04 03:21:22 INFO yarn.Client: Application report from ASM:
 application identifier: application_1425075571333_61948
 appId: 61948
 clientToAMToken: Token { kind: YARN_CLIENT_TOKEN, service:  }
 appDiagnostics:
 appMasterHost: phxaishdc9dn0169.phx.company.com
 appQueue: hdmi-spark
 appMasterRpcPort: 0
 appStartTime: 1425464454263
 yarnAppState: FINISHED
 distributedFinalState: FAILED
 appTrackingUrl:
https://apollo-phx-rm-2.company.com:50030/proxy/application_1425075571333_61948/A
 appUser: dvasthimal



The AM failed with the following exception:

/apache/hadoop/bin/yarn logs -applicationId application_1425075571333_61948
15/03/04 03:21:22 INFO NewHadoopRDD: Input split: hdfs://
apollo-phx-nn.company.com:8020/user/dvasthimal/epdatasets_small/exptsession/2015/02/16/part-r-00000.avro:0+13890
15/03/04 03:21:22 ERROR Executor: Exception in task ID 3
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



1) Having figured out the error, the fix should be to put the right version
of the Avro libs on the AM JVM classpath. Hence I included --jars
/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar,/home/dvasthimal/spark/avro-1.7.7.jar
in the spark-submit command. However, I still see the same exception.
2) I also tried including these libs in SPARK_CLASSPATH, but I still see the
same exception.


-- 
Deepak

Re: Fwd: Unable to Read/Write Avro RDD on cluster.

Posted by "M. Dale" <me...@yahoo.com.INVALID>.
There was an avro-mapred version conflict, described in
https://issues.apache.org/jira/browse/SPARK-3039 and fixed by
https://github.com/apache/spark/pull/4315 for Spark 1.3. The Spark assembly
bundles avro-mapred built against the Hadoop 1 APIs, and
org.apache.hadoop.mapreduce.TaskAttemptContext was a class in Hadoop 1.x
but became an interface in Hadoop 2.x. That binary incompatibility is
exactly the IncompatibleClassChangeError you are seeing on a Hadoop 2
cluster.

Here is a link that describes how to fix Spark 1.2.1 for avro-mapred 
hadoop2: 
https://github.com/medale/spark-mail/blob/master/presentation/CreatingAvroMapred2Spark.md
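
The crux of that fix is making the build pull in avro-mapred with the
hadoop2 classifier instead of the default hadoop1 artifact. For an
application build, the equivalent sbt dependency would be roughly this
(version shown for illustration; adjust to your build):

libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.7" classifier "hadoop2"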

We built that patched 1.2.1 version for a demo at the February Maryland
Apache Spark meetup (http://www.meetup.com/Apache-Spark-Maryland/); the
build is hosted at
https://s3.amazonaws.com/morris-datasets/ENRON/demo/spark-1.2.1.tar.gz.

Hope this helps,
Markus

On 03/04/2015 11:57 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
> [original message quoted in full; snipped]




Re: Unable to Read/Write Avro RDD on cluster.

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Here's a workaround:

- Download avro-mapred-1.7.7-hadoop2.jar
  <http://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.7.7/avro-mapred-1.7.7-hadoop2.jar>
  and add it to the SPARK_CLASSPATH on all workers.
- Make sure the jar is present at the same path on every worker.

Thanks
Best Regards

On Thu, Mar 5, 2015 at 10:27 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <de...@gmail.com> wrote:

> [original message quoted in full; snipped]

RE: Unable to Read/Write Avro RDD on cluster.

Posted by java8964 <ja...@hotmail.com>.
You can give Spark-Avro a try. It works great for our project.
https://github.com/databricks/spark-avro
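
It builds on Spark SQL, and usage per its README at the time is along these
lines (a sketch; sc, the SQLContext, and the path are placeholders for your
own setup):

import com.databricks.spark.avro._
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// avroFile infers the schema from the Avro files and returns a
// SchemaRDD/DataFrame (depending on Spark version) queryable via Spark SQL.
val sessions = sqlContext.avroFile("/user/dvasthimal/epdatasets_small/exptsession")
sessions.registerTempTable("sessions")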

> From: deepujain@gmail.com
> Date: Thu, 5 Mar 2015 10:27:04 +0530
> Subject: Fwd: Unable to Read/Write Avro RDD on cluster.
> To: dev@spark.apache.org
> 
> [original message quoted in full; snipped]