Posted to issues@spark.apache.org by "Simeon Simeonov (JIRA)" <ji...@apache.org> on 2015/08/05 06:52:04 UTC

[jira] [Commented] (SPARK-9625) SparkILoop creates sql context continuously, thousands of times

    [ https://issues.apache.org/jira/browse/SPARK-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654806#comment-14654806 ] 

Simeon Simeonov commented on SPARK-9625:
----------------------------------------

This is a spark-shell-specific issue, seemingly related to closures and serialization. All code entered in the shell is wrapped in a closure, and that wrapping can affect how things get serialized; a rough sketch of the capture pattern is below.
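
To illustrate what I mean (a minimal sketch only; the names and the local-val workaround are my assumptions, not something confirmed for this ticket): spark-shell compiles each line into a wrapper object, so a closure that references a value defined on an earlier line can end up capturing the whole wrapper, and anything it references, when tasks are serialized.

{code}
// Hedged sketch of the REPL closure-capture pattern described above; the names
// and the local-copy workaround are illustrative assumptions, not a confirmed
// reproduction of this bug. Runs in spark-shell, where sc is already defined.
val threshold = 10  // defined at the REPL top level, i.e. inside a wrapper object

// This closure references a REPL-level val, so the generated wrapper object
// (which may also hold references such as sqlContext) can get serialized with it.
val direct = sc.parallelize(1 to 100).filter(_ > threshold)

// One common mitigation: copy the value into a local val inside a method so
// that only the local value is captured by the closure.
def filterAboveThreshold(): org.apache.spark.rdd.RDD[Int] = {
  val t = threshold
  sc.parallelize(1 to 100).filter(_ > t)
}
val viaLocal = filterAboveThreshold()
{code}

Whether that wrapper capture is actually what triggers the repeated sql context creation here is only a guess on my part.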

> SparkILoop creates sql context continuously, thousands of times
> ---------------------------------------------------------------
>
>                 Key: SPARK-9625
>                 URL: https://issues.apache.org/jira/browse/SPARK-9625
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell
>    Affects Versions: 1.4.1
>         Environment: Ubuntu on AWS
>            Reporter: Simeon Simeonov
>              Labels: sql
>
> Occasionally but repeatably, based on the Spark SQL operations being run, {{spark-shell}} gets into a funk where it attempts to create a sql context over and over again as it is doing its work. Example output below:
> {code}
> 15/08/05 03:04:12 INFO DAGScheduler: looking for newly runnable stages
> 15/08/05 03:04:12 INFO DAGScheduler: running: Set()
> 15/08/05 03:04:12 INFO DAGScheduler: waiting: Set(ShuffleMapStage 7, ResultStage 8)
> 15/08/05 03:04:12 INFO DAGScheduler: failed: Set()
> 15/08/05 03:04:12 INFO DAGScheduler: Missing parents for ShuffleMapStage 7: List()
> 15/08/05 03:04:12 INFO DAGScheduler: Missing parents for ResultStage 8: List(ShuffleMapStage 7)
> 15/08/05 03:04:12 INFO DAGScheduler: Submitting ShuffleMapStage 7 (MapPartitionsRDD[49] at map at <console>:474), which is now runnable
> 15/08/05 03:04:12 INFO MemoryStore: ensureFreeSpace(47840) called with curMem=685306, maxMem=26671746908
> 15/08/05 03:04:12 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 46.7 KB, free 24.8 GB)
> 15/08/05 03:04:12 INFO MemoryStore: ensureFreeSpace(15053) called with curMem=733146, maxMem=26671746908
> 15/08/05 03:04:12 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 14.7 KB, free 24.8 GB)
> 15/08/05 03:04:12 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on localhost:39451 (size: 14.7 KB, free: 24.8 GB)
> 15/08/05 03:04:12 INFO SparkContext: Created broadcast 12 from broadcast at DAGScheduler.scala:874
> 15/08/05 03:04:12 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 7 (MapPartitionsRDD[49] at map at <console>:474)
> 15/08/05 03:04:12 INFO TaskSchedulerImpl: Adding task set 7.0 with 1 tasks
> 15/08/05 03:04:12 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 684, localhost, PROCESS_LOCAL, 1461 bytes)
> 15/08/05 03:04:12 INFO Executor: Running task 0.0 in stage 7.0 (TID 684)
> 15/08/05 03:04:12 INFO ShuffleBlockFetcherIterator: Getting 214 non-empty blocks out of 214 blocks
> 15/08/05 03:04:12 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
> 15/08/05 03:04:12 INFO HiveContext: Initializing execution hive, version 0.13.1
> 15/08/05 03:04:13 INFO HiveMetaStore: No user is added in admin role, since config is empty
> 15/08/05 03:04:13 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> {code}
> In the end, the job succeeded and produced correct output, but it ran many times slower than it should have.
> Longer output, showing job completion, can be found [here|https://gist.github.com/ssimeonov/8064f5f77897bb337157]. In that example, the sql context was created about 2,500 times; I've had runs where it was created over 100,000 times.
> Could it be that an exception silently generated by the job is causing this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org