Posted to user@spark.apache.org by Pala M Muthaia <mc...@rocketfuelinc.com> on 2014/11/18 22:54:01 UTC

Lost executors

Hi,

I am using Spark 1.0.1 on Yarn 2.5, and doing everything through spark
shell.

I am running a job that essentially reads a bunch of HBase keys, looks up
HBase data, and performs some filtering and aggregation. The job works fine
in smaller datasets, but when i try to execute on the full dataset, the job
never completes. The few symptoms i notice are:

a. The job shows progress for a while and then starts throwing lots of the
following errors:

2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - *Executor
906 disconnected, so removing it*
2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR
org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - *Lost
executor 906 on <machine name>: remote Akka client disassociated*

2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN
 org.apache.spark.storage.BlockManagerMasterActor - *Removing BlockManager
BlockManagerId(9186, <machine name>, 54600, 0) with no recent heart beats:
82313ms exceeds 45000ms*

Looking at the logs, the job never recovers from these errors, and
continues to show errors about lost executors and launching new executors,
and this just continues for a long time.

Could this be because the executors are running out of memory?

In terms of memory usage, the intermediate data could be large (after the
HBase lookup), but partial and fully aggregated data set size should be
quite small - essentially a bunch of ids and counts (< 1 mil in total).
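(For context, if memory is indeed the culprit, executor memory and executor count can be raised when launching the shell on YARN; the values below are placeholders, not the actual settings used here:)

# placeholder sizes; tune to the workload and the YARN container limits
./bin/spark-shell --master yarn-client \
  --executor-memory 4g \
  --num-executors 50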



b. In the Spark UI, i am seeing the following errors (redacted for
brevity), not sure if they are transient or a real issue:

java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read
timed out}
...
org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
...
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:724)




I was trying to get more data to investigate but haven't been able to
figure out how to enable logging on the executors. The Spark UI appears
stuck and i only see driver side logs in the jobhistory directory specified
in the job.
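(One option i have not tried yet: if log aggregation is enabled on the cluster, executor stdout/stderr can be pulled after the application finishes with the yarn CLI:)

# <application id> is the YARN app id shown in the RM UI or by `yarn application -list`
yarn logs -applicationId <application id>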


Thanks,
pala

Re: Code works in Spark-Shell but Fails inside IntelliJ

Posted by Sanjay Subramanian <sa...@yahoo.com.INVALID>.
Not using SBT... I have been creating and adapting various Spark Scala examples and put them here; all you have to do is git clone and import as a Maven project into IntelliJ: https://github.com/sanjaysubramanian/msfx_scala.git

Side note: IMHO, IDEs encourage developers new to Spark/Scala to quickly test, experiment and debug code.
      From: Jay Vyas <ja...@gmail.com>
 To: Sanjay Subramanian <sa...@yahoo.com> 
Cc: "user@spark.apache.org" <us...@spark.apache.org> 
 Sent: Thursday, November 20, 2014 4:53 PM
 Subject: Re: Code works in Spark-Shell but Fails inside IntelliJ
   
This seems pretty standard: your IntelliJ classpath doesn't match the one used by spark-shell...
Are you using the SBT plugin? If not, how are you putting dependencies into IntelliJ?



On Nov 20, 2014, at 7:35 PM, Sanjay Subramanian <sa...@yahoo.com.INVALID> wrote:


hey guys
I am at AmpCamp 2014 at UCB right now :-) 
Funny Issue...
This code works in Spark-Shell but throws a funny exception in IntelliJ
CODE
====
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
val wikiData = sqlContext.parquetFile("/Users/sansub01/mycode/knowledge/spark_ampcamp_2014/data/wiki_parquet")
wikiData.registerTempTable("wikiData")
sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10").collect().foreach(println)

RESULTS
========
[Waacstats,2003]
[Cydebot,949]
[BattyBot,939]
[Yobot,890]
[Addbot,853]
[Monkbot,668]
[ChrisGualtieri,438]
[RjwilmsiBot,387]
[OccultZone,377]
[ClueBot NG,353]

INTELLIJ CODE
=============
object ParquetSql {
  def main(args: Array[String]) {

    val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-NamesFoodSql")
    val sc = new SparkContext(sconf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
    val wikiData = sqlContext.parquetFile("/Users/sansub01/mycode/knowledge/spark_ampcamp_2014/data/wiki_parquet")
    wikiData.registerTempTable("wikiData")
    val results = sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10")
    results.collect().foreach(println)
  }

}

INTELLIJ ERROR
==============
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.spark.serializer.Serializer, but class was expected
	at org.apache.spark.sql.parquet.ParquetFilters$.serializeFilterExpressions(ParquetFilters.scala:244)
	at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:109)
	at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:151)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
	at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
	at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:48)
	at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
	at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:151)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
	at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
	at org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:171)
	at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
	at org.medicalsidefx.common.utils.ParquetSql$.main(ParquetSql.scala:18)
	at org.medicalsidefx.common.utils.ParquetSql.main(ParquetSql.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)


 


  

Re: Code works in Spark-Shell but Fails inside IntelliJ

Posted by Jay Vyas <ja...@gmail.com>.
This seems pretty standard: your IntelliJ classpath doesn't match the one used by spark-shell...

Are you using the SBT plugin? If not, how are you putting dependencies into IntelliJ?

> On Nov 20, 2014, at 7:35 PM, Sanjay Subramanian <sa...@yahoo.com.INVALID> wrote:
> 
> hey guys
> 
> I am at AmpCamp 2014 at UCB right now :-) 
> 
> Funny Issue...
> 
> This code works in Spark-Shell but throws a funny exception in IntelliJ
> 
> CODE
> ====
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
> val wikiData = sqlContext.parquetFile("/Users/sansub01/mycode/knowledge/spark_ampcamp_2014/data/wiki_parquet")
> wikiData.registerTempTable("wikiData")
> sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10").collect().foreach(println)
> 
> RESULTS
> ========
> [Waacstats,2003]
> [Cydebot,949]
> [BattyBot,939]
> [Yobot,890]
> [Addbot,853]
> [Monkbot,668]
> [ChrisGualtieri,438]
> [RjwilmsiBot,387]
> [OccultZone,377]
> [ClueBot NG,353]
> 
> 
> INTELLIJ CODE
> =============
> object ParquetSql {
>   def main(args: Array[String]) {
> 
>     val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-NamesFoodSql")
>     val sc = new SparkContext(sconf)
>     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>     sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
>     val wikiData = sqlContext.parquetFile("/Users/sansub01/mycode/knowledge/spark_ampcamp_2014/data/wiki_parquet")
>     wikiData.registerTempTable("wikiData")
>     val results = sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10")
>     results.collect().foreach(println)
>   }
> 
> }
> 
> INTELLIJ ERROR
> ==============
> Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.spark.serializer.Serializer, but class was expected
> 	at org.apache.spark.sql.parquet.ParquetFilters$.serializeFilterExpressions(ParquetFilters.scala:244)
> 	at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:109)
> 	at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
> 	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:151)
> 	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
> 	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
> 	at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
> 	at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:48)
> 	at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
> 	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
> 	at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
> 	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:151)
> 	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
> 	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
> 	at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
> 	at org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:171)
> 	at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
> 	at org.medicalsidefx.common.utils.ParquetSql$.main(ParquetSql.scala:18)
> 	at org.medicalsidefx.common.utils.ParquetSql.main(ParquetSql.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> 
> 
> 

Re: Code works in Spark-Shell but Fails inside IntelliJ

Posted by Sanjay Subramanian <sa...@yahoo.com.INVALID>.
Awesome, that was it... Hit me with a hockey stick :-)
Unmatched Spark Core (1.0.0) and Spark SQL (1.1.1) versions. Corrected that to 1.1.0 on both:
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.10</artifactId>
   <version>1.0.0</version>
</dependency>
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-sql_2.10</artifactId>
   <version>1.1.0</version>
</dependency> 
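With both artifacts aligned, the corrected block looks like this (same artifact ids as above, only the versions change):

<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.10</artifactId>
   <version>1.1.0</version>
</dependency>
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-sql_2.10</artifactId>
   <version>1.1.0</version>
</dependency>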

      From: Michael Armbrust <mi...@databricks.com>
 To: Sanjay Subramanian <sa...@yahoo.com> 
Cc: "user@spark.apache.org" <us...@spark.apache.org> 
 Sent: Thursday, November 20, 2014 4:49 PM
 Subject: Re: Code works in Spark-Shell but Fails inside IntelliJ
   
Looks like IntelliJ might be trying to load the wrong version of Spark?


On Thu, Nov 20, 2014 at 4:35 PM, Sanjay Subramanian <sa...@yahoo.com.invalid> wrote:

hey guys
I am at AmpCamp 2014 at UCB right now :-) 
Funny Issue...
This code works in Spark-Shell but throws a funny exception in IntelliJ
CODE
====
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
val wikiData = sqlContext.parquetFile("/Users/sansub01/mycode/knowledge/spark_ampcamp_2014/data/wiki_parquet")
wikiData.registerTempTable("wikiData")
sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10").collect().foreach(println)

RESULTS
========
[Waacstats,2003]
[Cydebot,949]
[BattyBot,939]
[Yobot,890]
[Addbot,853]
[Monkbot,668]
[ChrisGualtieri,438]
[RjwilmsiBot,387]
[OccultZone,377]
[ClueBot NG,353]

INTELLIJ CODE
=============
object ParquetSql {
  def main(args: Array[String]) {

    val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-NamesFoodSql")
    val sc = new SparkContext(sconf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
    val wikiData = sqlContext.parquetFile("/Users/sansub01/mycode/knowledge/spark_ampcamp_2014/data/wiki_parquet")
    wikiData.registerTempTable("wikiData")
    val results = sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10")
    results.collect().foreach(println)
  }

}

INTELLIJ ERROR
==============
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.spark.serializer.Serializer, but class was expected
	at org.apache.spark.sql.parquet.ParquetFilters$.serializeFilterExpressions(ParquetFilters.scala:244)
	at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:109)
	at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:151)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
	at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
	at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:48)
	at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
	at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:151)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
	at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
	at org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:171)
	at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
	at org.medicalsidefx.common.utils.ParquetSql$.main(ParquetSql.scala:18)
	at org.medicalsidefx.common.utils.ParquetSql.main(ParquetSql.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)


 



  

Re: Code works in Spark-Shell but Fails inside IntelliJ

Posted by Michael Armbrust <mi...@databricks.com>.
Looks like IntelliJ might be trying to load the wrong version of Spark?

On Thu, Nov 20, 2014 at 4:35 PM, Sanjay Subramanian <
sanjaysubramanian@yahoo.com.invalid> wrote:

> hey guys
>
> I am at AmpCamp 2014 at UCB right now :-)
>
> Funny Issue...
>
> This code works in Spark-Shell but throws a funny exception in IntelliJ
>
> CODE
> ====
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
> val wikiData =
> sqlContext.parquetFile("/Users/sansub01/mycode/knowledge/spark_ampcamp_2014/data/wiki_parquet")
> wikiData.registerTempTable("wikiData")
> sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE
> username <> '' GROUP BY username ORDER BY cnt DESC LIMIT
> 10").collect().foreach(println)
>
> RESULTS
> ========
> [Waacstats,2003]
> [Cydebot,949]
> [BattyBot,939]
> [Yobot,890]
> [Addbot,853]
> [Monkbot,668]
> [ChrisGualtieri,438]
> [RjwilmsiBot,387]
> [OccultZone,377]
> [ClueBot NG,353]
>
>
> INTELLIJ CODE
> =============
>
> object ParquetSql {
>   def main(args: Array[String]) {
>
>     val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-NamesFoodSql")
>     val sc = new SparkContext(sconf)
>     val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>     sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
>     val wikiData = sqlContext.parquetFile("/Users/sansub01/mycode/knowledge/spark_ampcamp_2014/data/wiki_parquet")
>     wikiData.registerTempTable("wikiData")
>     val results = sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10")
>     results.collect().foreach(println)
>   }
>
> }
>
>
> INTELLIJ ERROR
> ==============
> Exception in thread "main" java.lang.IncompatibleClassChangeError: Found
> interface org.apache.spark.serializer.Serializer, but class was expected
> at
> org.apache.spark.sql.parquet.ParquetFilters$.serializeFilterExpressions(ParquetFilters.scala:244)
> at
> org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:109)
> at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
> at
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:151)
> at
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
> at
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
> at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
> at
> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:48)
> at
> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
> at
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
> at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
> at
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:151)
> at
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
> at
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
> at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
> at
> org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:171)
> at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
> at org.medicalsidefx.common.utils.ParquetSql$.main(ParquetSql.scala:18)
> at org.medicalsidefx.common.utils.ParquetSql.main(ParquetSql.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
>
>
>
>

Code works in Spark-Shell but Fails inside IntelliJ

Posted by Sanjay Subramanian <sa...@yahoo.com.INVALID>.
hey guys
I am at AmpCamp 2014 at UCB right now :-) 
Funny Issue...
This code works in Spark-Shell but throws a funny exception in IntelliJ
CODE
====
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
val wikiData = sqlContext.parquetFile("/Users/sansub01/mycode/knowledge/spark_ampcamp_2014/data/wiki_parquet")
wikiData.registerTempTable("wikiData")
sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10").collect().foreach(println)

RESULTS
========
[Waacstats,2003]
[Cydebot,949]
[BattyBot,939]
[Yobot,890]
[Addbot,853]
[Monkbot,668]
[ChrisGualtieri,438]
[RjwilmsiBot,387]
[OccultZone,377]
[ClueBot NG,353]

INTELLIJ CODE
=============
object ParquetSql {
  def main(args: Array[String]) {

    val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-NamesFoodSql")
    val sc = new SparkContext(sconf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
    val wikiData = sqlContext.parquetFile("/Users/sansub01/mycode/knowledge/spark_ampcamp_2014/data/wiki_parquet")
    wikiData.registerTempTable("wikiData")
    val results = sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10")
    results.collect().foreach(println)
  }

}

INTELLIJ ERROR
==============
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.spark.serializer.Serializer, but class was expected
	at org.apache.spark.sql.parquet.ParquetFilters$.serializeFilterExpressions(ParquetFilters.scala:244)
	at org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:109)
	at org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:151)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
	at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
	at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:48)
	at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
	at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:151)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
	at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
	at org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:171)
	at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
	at org.medicalsidefx.common.utils.ParquetSql$.main(ParquetSql.scala:18)
	at org.medicalsidefx.common.utils.ParquetSql.main(ParquetSql.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)



Re: Lost executors

Posted by Pala M Muthaia <mc...@rocketfuelinc.com>.
Just to close the loop: it seems no issues pop up when i submit the job
using spark-submit, so that the driver process also runs in a container on
the YARN cluster.

In the above, the driver was running on the gateway machine through which
the job was submitted, which led to quite a few issues.
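(For anyone hitting the same thing, a rough sketch of a cluster-mode submit; the class and jar names below are placeholders, not the actual job:)

# sketch only; class and jar names are placeholders
./bin/spark-submit --class com.example.HBaseAggregation \
  --master yarn-cluster \
  --executor-memory 4g \
  --num-executors 50 \
  hbase-aggregation-job.jar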

On Tue, Nov 18, 2014 at 5:01 PM, Pala M Muthaia <mchettiar@rocketfuelinc.com
> wrote:

> Sandy,
>
> Good point - i forgot about NM logs.
>
> When i looked up the NM logs, i only see the following statements that
> align with the driver side log about lost executor. Many executors show the
> same log statement at the same time, so it seems like the decision to kill
> many if not all executors happened centrally, and all executors got
> notified somehow:
>
> 14/11/18 00:18:25 INFO Executor: Executor is trying to kill task 2013
> 14/11/18 00:18:25 INFO Executor: Executor killed task 2013
>
>
> In general, i also see quite a few instances of the following exception across many executors/nodes:
>
> 14/11/17 23:58:00 INFO HadoopRDD: Input split: <hdfs dir path>/sorted_keys-1020_3-r-00255.deflate:0+415841
>
> 14/11/17 23:58:00 WARN BlockReaderLocal: error creating DomainSocket
> java.net.ConnectException: connect(2) error: Connection refused when trying to connect to '/srv/var/hadoop/runs/hdfs/dn_socket'
> 	at org.apache.hadoop.net.unix.DomainSocket.connect0(Native Method)
> 	at org.apache.hadoop.net.unix.DomainSocket.connect(DomainSocket.java:250)
> 	at org.apache.hadoop.hdfs.DomainSocketFactory.createSocket(DomainSocketFactory.java:158)
> 	at org.apache.hadoop.hdfs.BlockReaderFactory.nextDomainPeer(BlockReaderFactory.java:721)
> 	at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:441)
> 	at org.apache.hadoop.hdfs.client.ShortCircuitCache.create(ShortCircuitCache.java:780)
> 	at org.apache.hadoop.hdfs.client.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:714)
> 	at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:395)
> 	at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:303)
> 	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:567)
> 	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:790)
> 	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
> 	at java.io.DataInputStream.read(DataInputStream.java:149)
> 	at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
> 	at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
> 	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
> 	at java.io.InputStream.read(InputStream.java:101)
> 	at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
> 	at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
> 	at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
> 	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
> 	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
> 	at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:201)
> 	at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:184)
> 	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 	at scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
> 	at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
> 	at $line57.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:51)
> 	at $line57.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:50)
> 	at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
> 	at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 	at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 	at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 	at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:51)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:724)
>
> 14/11/17 23:58:00 WARN ShortCircuitCache: ShortCircuitCache(0x71a8053d): failed to load 1276010498_BP-1416824317-172.22.48.2-1387241776581
>
>
> However, in some of the nodes, it seems execution proceeded after the
> error, so the above could just be a transient error.
>
> Finally, in the driver logs, i was looking for a hint about the decision to
> kill many executors, around the 00:18:25 timestamp when many tasks were
> killed across many executors, but i didn't find anything different.
>
>
>
> On Tue, Nov 18, 2014 at 1:59 PM, Sandy Ryza <sa...@cloudera.com>
> wrote:
>
>> Hi Pala,
>>
>> Do you have access to your YARN NodeManager logs?  Are you able to check
>> whether they report killing any containers for exceeding memory limits?
>>
>> -Sandy
>>
>> On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia <
>> mchettiar@rocketfuelinc.com> wrote:
>>
>>> Hi,
>>>
>>> I am using Spark 1.0.1 on Yarn 2.5, and doing everything through spark
>>> shell.
>>>
>>> I am running a job that essentially reads a bunch of HBase keys, looks
>>> up HBase data, and performs some filtering and aggregation. The job works
>>> fine in smaller datasets, but when i try to execute on the full dataset,
>>> the job never completes. The few symptoms i notice are:
>>>
>>> a. The job shows progress for a while and then starts throwing lots of
>>> the following errors:
>>>
>>> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO
>>>  org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - *Executor
>>> 906 disconnected, so removing it*
>>> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR
>>> org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - *Lost
>>> executor 906 on <machine name>: remote Akka client disassociated*
>>>
>>> 2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN
>>>  org.apache.spark.storage.BlockManagerMasterActor - *Removing
>>> BlockManager BlockManagerId(9186, <machine name>, 54600, 0) with no recent
>>> heart beats: 82313ms exceeds 45000ms*
>>>
>>> Looking at the logs, the job never recovers from these errors, and
>>> continues to show errors about lost executors and launching new executors,
>>> and this just continues for a long time.
>>>
>>> Could this be because the executors are running out of memory?
>>>
>>> In terms of memory usage, the intermediate data could be large (after
>>> the HBase lookup), but partial and fully aggregated data set size should be
>>> quite small - essentially a bunch of ids and counts (< 1 mil in total).
>>>
>>>
>>>
>>> b. In the Spark UI, i am seeing the following errors (redacted for
>>> brevity), not sure if they are transient or real issue:
>>>
>>> java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read timed out}
>>> ...
>>> org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
>>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
>>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
>>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>> ...
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> java.lang.Thread.run(Thread.java:724)
>>>
>>>
>>>
>>>
>>> I was trying to get more data to investigate but haven't been able to
>>> figure out how to enable logging on the executors. The Spark UI appears
>>> stuck and i only see driver side logs in the jobhistory directory specified
>>> in the job.
>>>
>>>
>>> Thanks,
>>> pala
>>>
>>>
>>>
>>
>

Re: Lost executors

Posted by Pala M Muthaia <mc...@rocketfuelinc.com>.
Sandy,

Good point - i forgot about NM logs.

When i looked up the NM logs, i only see the following statements that
align with the driver side log about lost executor. Many executors show the
same log statement at the same time, so it seems like the decision to kill
many if not all executors happened centrally, and all executors got
notified somehow:

14/11/18 00:18:25 INFO Executor: Executor is trying to kill task 2013
14/11/18 00:18:25 INFO Executor: Executor killed task 2013


In general, i also see quite a few instances of the following
exception across many executors/nodes:

14/11/17 23:58:00 INFO HadoopRDD: Input split: <hdfs dir
path>/sorted_keys-1020_3-r-00255.deflate:0+415841

14/11/17 23:58:00 WARN BlockReaderLocal: error creating DomainSocket
java.net.ConnectException: connect(2) error: Connection refused when
trying to connect to '/srv/var/hadoop/runs/hdfs/dn_socket'
	at org.apache.hadoop.net.unix.DomainSocket.connect0(Native Method)
	at org.apache.hadoop.net.unix.DomainSocket.connect(DomainSocket.java:250)
	at org.apache.hadoop.hdfs.DomainSocketFactory.createSocket(DomainSocketFactory.java:158)
	at org.apache.hadoop.hdfs.BlockReaderFactory.nextDomainPeer(BlockReaderFactory.java:721)
	at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:441)
	at org.apache.hadoop.hdfs.client.ShortCircuitCache.create(ShortCircuitCache.java:780)
	at org.apache.hadoop.hdfs.client.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:714)
	at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:395)
	at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:303)
	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:567)
	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:790)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
	at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
	at java.io.InputStream.read(InputStream.java:101)
	at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
	at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
	at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:201)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:184)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
	at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
	at $line57.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:51)
	at $line57.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:50)
	at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
	at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.Task.run(Task.scala:51)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)

14/11/17 23:58:00 WARN ShortCircuitCache:
ShortCircuitCache(0x71a8053d): failed to load
1276010498_BP-1416824317-172.22.48.2-1387241776581


However, in some of the nodes, it seems execution proceeded after the
error, so the above could just be a transient error.
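(The DomainSocket error above comes from HDFS short-circuit local reads. For reference, these are the client-side hdfs-site.xml properties involved; the socket path here is just the one from the error message and must match the datanode configuration:)

<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- must match the datanode-side setting; path taken from the error above -->
  <name>dfs.domain.socket.path</name>
  <value>/srv/var/hadoop/runs/hdfs/dn_socket</value>
</property>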

Finally, in the driver logs, i was looking for a hint about the decision to
kill many executors, around the 00:18:25 timestamp when many tasks were
killed across many executors, but i didn't find anything different.



On Tue, Nov 18, 2014 at 1:59 PM, Sandy Ryza <sa...@cloudera.com> wrote:

> Hi Pala,
>
> Do you have access to your YARN NodeManager logs?  Are you able to check
> whether they report killing any containers for exceeding memory limits?
>
> -Sandy
>
> On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia <
> mchettiar@rocketfuelinc.com> wrote:
>
>> Hi,
>>
>> I am using Spark 1.0.1 on Yarn 2.5, and doing everything through spark
>> shell.
>>
>> I am running a job that essentially reads a bunch of HBase keys, looks up
>> HBase data, and performs some filtering and aggregation. The job works fine
>> in smaller datasets, but when i try to execute on the full dataset, the job
>> never completes. The few symptoms i notice are:
>>
>> a. The job shows progress for a while and then starts throwing lots of
>> the following errors:
>>
>> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO
>>  org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - *Executor
>> 906 disconnected, so removing it*
>> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR
>> org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - *Lost
>> executor 906 on <machine name>: remote Akka client disassociated*
>>
>> 2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN
>>  org.apache.spark.storage.BlockManagerMasterActor - *Removing
>> BlockManager BlockManagerId(9186, <machine name>, 54600, 0) with no recent
>> heart beats: 82313ms exceeds 45000ms*
>>
>> Looking at the logs, the job never recovers from these errors, and
>> continues to show errors about lost executors and launching new executors,
>> and this just continues for a long time.
>>
>> Could this be because the executors are running out of memory?
>>
>> In terms of memory usage, the intermediate data could be large (after the
>> HBase lookup), but partial and fully aggregated data set size should be
>> quite small - essentially a bunch of ids and counts (< 1 mil in total).
>>
>>
>>
>> b. In the Spark UI, i am seeing the following errors (redacted for
>> brevity), not sure if they are transient or real issue:
>>
>> java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read timed out}
>> ...
>> org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>> ...
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> java.lang.Thread.run(Thread.java:724)
>>
>>
>>
>>
>> I was trying to get more data to investigate but haven't been able to
>> figure out how to enable logging on the executors. The Spark UI appears
>> stuck and i only see driver side logs in the jobhistory directory specified
>> in the job.
>>
>>
>> Thanks,
>> pala
>>
>>
>>
>

Re: Lost executors

Posted by Sandy Ryza <sa...@cloudera.com>.
Hi Pala,

Do you have access to your YARN NodeManager logs?  Are you able to check
whether they report killing any containers for exceeding memory limits?
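(If they did, the NodeManager log line usually mentions the container running beyond physical memory limits, so a quick check is something like the grep below; the log directory varies by install:)

# log directory varies by install; adjust the path
grep -i "beyond physical memory limits" /var/log/hadoop-yarn/*nodemanager*.log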

-Sandy

On Tue, Nov 18, 2014 at 1:54 PM, Pala M Muthaia <mchettiar@rocketfuelinc.com
> wrote:

> Hi,
>
> I am using Spark 1.0.1 on Yarn 2.5, and doing everything through spark
> shell.
>
> I am running a job that essentially reads a bunch of HBase keys, looks up
> HBase data, and performs some filtering and aggregation. The job works fine
> in smaller datasets, but when i try to execute on the full dataset, the job
> never completes. The few symptoms i notice are:
>
> a. The job shows progress for a while and then starts throwing lots of the
> following errors:
>
> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] INFO
>  org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - *Executor
> 906 disconnected, so removing it*
> 2014-11-18 00:18:20,020 [spark-akka.actor.default-dispatcher-67] ERROR
> org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - *Lost
> executor 906 on <machine name>: remote Akka client disassociated*
>
> 2014-11-18 16:52:02,283 [spark-akka.actor.default-dispatcher-22] WARN
>  org.apache.spark.storage.BlockManagerMasterActor - *Removing
> BlockManager BlockManagerId(9186, <machine name>, 54600, 0) with no recent
> heart beats: 82313ms exceeds 45000ms*
>
> Looking at the logs, the job never recovers from these errors, and
> continues to show errors about lost executors and launching new executors,
> and this just continues for a long time.
>
> Could this be because the executors are running out of memory?
>
> In terms of memory usage, the intermediate data could be large (after the
> HBase lookup), but partial and fully aggregated data set size should be
> quite small - essentially a bunch of ids and counts (< 1 mil in total).
>
>
>
> b. In the Spark UI, i am seeing the following errors (redacted for
> brevity), not sure if they are transient or real issue:
>
> java.net.SocketTimeoutException (java.net.SocketTimeoutException: Read timed out}
> ...
> org.apache.spark.util.Utils$.fetchFile(Utils.scala:349)
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:328)
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> ...
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:724)
>
>
>
>
> I was trying to get more data to investigate but haven't been able to
> figure out how to enable logging on the executors. The Spark UI appears
> stuck and i only see driver side logs in the jobhistory directory specified
> in the job.
>
>
> Thanks,
> pala
>
>
>