Posted to user@spark.apache.org by Jianshi Huang <ji...@gmail.com> on 2014/07/25 06:24:37 UTC

Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

I can successfully run my code in local mode using spark-submit (--master
local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode.

Any hints what is the problem? Is it a closure serialization problem? How
can I debug it? Your answers would be very helpful.

14/07/25 11:48:14 WARN scheduler.TaskSetManager: Loss was due to
java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
        at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:40)
        at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:36)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016)
        at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
        at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
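
For context, an ExceptionInInitializerError is thrown on the first use of a class whose static (or Scala object) initializer failed; the original cause is attached via getCause. A minimal Scala sketch of this failure mode (the object and resource name below are hypothetical, not taken from HBaseStore):

```scala
// Sketch: an object whose initializer loads a classpath resource.
// If the jar holding the resource is not shipped to the executors, the
// first use on a worker surfaces as ExceptionInInitializerError, while
// --master local[4] works because the resource is on the driver's classpath.
object ResourceHolder {
  val schema: String = {
    val in = getClass.getResourceAsStream("/schema.conf") // assumed name
    if (in == null) sys.error("schema.conf not found on classpath")
    scala.io.Source.fromInputStream(in).mkString
  }
}

// First access triggers initialization; with the resource missing, the
// RuntimeException above is wrapped in java.lang.ExceptionInInitializerError.
// ResourceHolder.schema
```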


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

Posted by Jianshi Huang <ji...@gmail.com>.
I see, Andrew, thanks for the explanation.

On Tue, Jul 29, 2014 at 5:29 AM, Andrew Lee <al...@hotmail.com> wrote:

>
> I was thinking maybe we can suggest the community to enhance the Spark
> HistoryServer to capture the last failure exception from the container logs
> in the last failed stage?
>

This would be helpful. I personally like Yarn-client mode, as the running
status can be checked directly from the console.


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

RE: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

Posted by Andrew Lee <al...@hotmail.com>.
Hi Jianshi,
My understanding is 'No', based on how Spark is designed, even with your own log4j.properties in Spark's conf folder.
In YARN mode, the Application Master runs inside the cluster, and all logs are part of the container logs, which are governed by another log4j.properties file from the Hadoop/YARN environment. Spark can't override that unless it can get its own log4j ahead of YARN's on the classpath. So the only way is to log in to the Resource Manager and click on the job itself to read the container logs. (Other people) Please correct me if my understanding is wrong.
You may be wondering why you can't stream the logs to an external service (e.g. Flume, syslogd) with a different appender in log4j. I don't consider this a good practice since:
1. You need two infrastructures to operate the entire cluster.
2. You will need to open up firewall ports between the two services to transfer/stream the logs.
3. Traffic is unpredictable; the YARN cluster may bring down the logging service/infra (DDoS) when someone accidentally changes the logging level from WARN to INFO, or worse, DEBUG.
I was thinking maybe we can suggest that the community enhance the Spark HistoryServer to capture the last failure exception from the container logs in the last failed stage. I'm not sure this is a good idea, since it may complicate the event model. I'm also not sure whether the Akka model can support this, or whether some other component in Spark could capture these exceptions and pass them back to the AM, to be stored somewhere for later troubleshooting. I'm not clear how this path is constructed without reading the source code, so I can't give a better answer.
AL


Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

Posted by Jianshi Huang <ji...@gmail.com>.
Hi Andrew,

Thanks for the reply, I figured out the cause of the issue. Some resource
files were missing from the JARs. A class initializer depends on those
resource files, so it threw that exception.

I appended the resource files explicitly to the --jars option and it worked
fine.
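
For reference, the fix amounts to listing the jar(s) that contain the resource files in --jars, so YARN distributes them to the executors as well as the driver; the paths and names below are placeholders:

```shell
# Placeholder paths: ship the jar containing the resource files so that
# executors (not just the driver) can load them at class-initialization time.
spark-submit \
  --master yarn-client \
  --jars resources.jar,hbase-deps.jar \
  --class com.example.Main \
  app.jar
```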

The "Caused by..." messages were actually found in the YARN logs. I think it
would be useful if I could see them from the console that runs
spark-submit. Would that be possible?

Jianshi



On Sat, Jul 26, 2014 at 7:08 AM, Andrew Lee <al...@hotmail.com> wrote:

> Hi Jianshi,
>
> Could you provide which HBase version you're using?
>
> By the way, a quick sanity check on whether the Workers can access HBase?
>
> Were you able to manually write one record to HBase with the serialize
> function? Hardcode and test it ?
>
> ------------------------------
> From: jianshi.huang@gmail.com
> Date: Fri, 25 Jul 2014 15:12:18 +0800
> Subject: Re: Need help, got java.lang.ExceptionInInitializerError in
> Yarn-Client/Cluster mode
> To: user@spark.apache.org
>
>
> I nailed it down to a union operation, here's my code snippet:
>
>     val properties: RDD[((String, String, String),
> Externalizer[KeyValue])] = vertices.map { ve =>
>       val (vertices, dsName) = ve
>       val rval = GraphConfig.getRval(datasetConf, Constants.VERTICES,
> dsName)
>       val (_, rvalAsc, rvalType) = rval
>
>       println(s"Table name: $dsName, Rval: $rval")
>       println(vertices.toDebugString)
>
>       vertices.map { v =>
>         val rk = appendHash(boxId(v.id)).getBytes
>         val cf = PROP_BYTES
>         val cq = boxRval(v.rval, rvalAsc, rvalType).getBytes
>         val value = Serializer.serialize(v.properties)
>
>         ((new String(rk), new String(cf), new String(cq)),
>           Externalizer(put(rk, cf, cq, value)))
>       }
>     }.reduce(_.union(_)).sortByKey(numPartitions = 32)
>
> Basically I read data from multiple tables (Seq[RDD[(key, value)]]), and
> they're transformed to KeyValues to be inserted into HBase, so I need to
> do a .reduce(_.union(_)) to combine them into one RDD[(key, value)].
>
> I cannot see what's wrong in my code.
>
> Jianshi
>
>
>
> On Fri, Jul 25, 2014 at 12:24 PM, Jianshi Huang <ji...@gmail.com>
> wrote:
>
> I can successfully run my code in local mode using spark-submit (--master
> local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode.
>
> Any hints what is the problem? Is it a closure serialization problem? How
> can I debug it? Your answers would be very helpful.
>
> 14/07/25 11:48:14 WARN scheduler.TaskSetManager: Loss was due to
> java.lang.ExceptionInInitializerError
> java.lang.ExceptionInInitializerError
>         at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:40)
>         at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:36)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016)
>         at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
>         at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
>         at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
>         at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

RE: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

Posted by Andrew Lee <al...@hotmail.com>.
Hi Jianshi,
Could you provide which HBase version you're using?
By the way, as a quick sanity check: can the workers access HBase?
Were you able to manually write one record to HBase with the serialize function? Hardcode and test it?
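
Such a sanity check can be done outside Spark entirely; a sketch against the classic HTable/Put client API (the table name, column family, and qualifier are made up):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}

// Sketch: hardcode one record and write it directly, to separate
// HBase connectivity/serialization problems from Spark problems.
val conf = HBaseConfiguration.create()        // picks up hbase-site.xml
val table = new HTable(conf, "sanity_check")  // assumed table name
val put = new Put("row-1".getBytes("UTF-8"))
put.add("cf".getBytes("UTF-8"), "cq".getBytes("UTF-8"),
        "value".getBytes("UTF-8"))
table.put(put)
table.close()
```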


Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

Posted by Jianshi Huang <ji...@gmail.com>.
I nailed it down to a union operation, here's my code snippet:

    val properties: RDD[((String, String, String), Externalizer[KeyValue])] = vertices.map { ve =>
      val (vertices, dsName) = ve
      val rval = GraphConfig.getRval(datasetConf, Constants.VERTICES, dsName)
      val (_, rvalAsc, rvalType) = rval

      println(s"Table name: $dsName, Rval: $rval")
      println(vertices.toDebugString)

      vertices.map { v =>
        val rk = appendHash(boxId(v.id)).getBytes
        val cf = PROP_BYTES
        val cq = boxRval(v.rval, rvalAsc, rvalType).getBytes
        val value = Serializer.serialize(v.properties)

        ((new String(rk), new String(cf), new String(cq)),
         Externalizer(put(rk, cf, cq, value)))
      }
    }.reduce(_.union(_)).sortByKey(numPartitions = 32)

Basically I read data from multiple tables (Seq[RDD[(key, value)]]), and
they're transformed to KeyValues to be inserted into HBase, so I need to
do a .reduce(_.union(_)) to combine them into one RDD[(key, value)].
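
As an aside on that combining step: reduce(_.union(_)) builds a left-nested chain of UnionRDDs, while SparkContext.union takes the whole sequence at once and produces the same elements with a flatter lineage. A sketch (the value type is illustrative; `sc` is an assumed SparkContext):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch: merging a Seq of pair RDDs into one. sc.union(rdds) yields the
// same elements as rdds.reduce(_ union _), but avoids a deeply nested
// UnionRDD chain when there are many input RDDs.
def mergeAll(sc: SparkContext,
             rdds: Seq[RDD[(String, Array[Byte])]]): RDD[(String, Array[Byte])] =
  sc.union(rdds)
```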

I cannot see what's wrong in my code.

Jianshi



On Fri, Jul 25, 2014 at 12:24 PM, Jianshi Huang <ji...@gmail.com>
wrote:

> I can successfully run my code in local mode using spark-submit (--master
> local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode.
>
> Any hints what is the problem? Is it a closure serialization problem? How
> can I debug it? Your answers would be very helpful.
>
> 14/07/25 11:48:14 WARN scheduler.TaskSetManager: Loss was due to
> java.lang.ExceptionInInitializerError
> java.lang.ExceptionInInitializerError
>         at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:40)
>         at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:36)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016)
>         at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
>         at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
>         at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
>         at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/