Posted to issues@spark.apache.org by "Jamie Hutton (JIRA)" <ji...@apache.org> on 2016/04/29 15:00:15 UTC

[jira] [Created] (SPARK-15000) Spark hangs indefinitely if you cache a dataframe, then show it, then do some further processing on it

Jamie Hutton created SPARK-15000:
------------------------------------

             Summary: Spark hangs indefinitely if you cache a dataframe, then show it, then do some further processing on it
                 Key: SPARK-15000
                 URL: https://issues.apache.org/jira/browse/SPARK-15000
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.0, 1.5.2
         Environment: I am running the test code on both a Hortonworks sandbox and AWS EMR / EC2. The issue occurs in both spark-submit and spark-shell.
            Reporter: Jamie Hutton


There seems to be an issue with certain combinations of cache and show when using Spark. If you read a Parquet file from disk, cache it, and then call show on it, any further processing on that DataFrame hangs indefinitely.

The following code reproduces the issue. I have run it in multiple environments, on two Spark versions (1.5.2 and 1.6.0), and in both spark-shell and spark-submit.

/* Create a DataFrame for the test. I did this so the test is self-contained, but you can use any Parquet-backed DataFrame. */

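import sqlContext.implicits._                // provides .toDF on RDDs of tuples
import org.apache.spark.sql.functions.count  // provides count(...) used in the aggregation
val r = scala.util.Random
// 501 rows: (_1 = sequential id, _2 = random value in [0, 500))
val list = (0L to 500L).map(i => (i, r.nextInt(500).toLong))
val distData = sc.parallelize(list)
val df = distData.toDF
df.write.format("parquet").mode("overwrite").save("df_hanging_test.parquet")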


/* Now read the DataFrame back in. This is where the test begins. */
val df2 = sqlContext.read.load("df_hanging_test.parquet")
df2.cache()
df2.show()
val groupresult = df2.groupBy("_2").agg(count("_1") as "count")
groupresult.show()
/* The last step hangs forever. */

If you remove either the df2.cache line or the df2.show line, the issue goes away. Also, the groupBy/agg doesn't seem to be the cause: I believe I have seen the same hang with other types of processing.
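
For comparison, here is a minimal sketch of the two variants described above, each dropping one of the two calls. The object and method names (CacheShowVariants, showWithoutCache, cacheWithoutShow) are hypothetical and exist only for illustration; the only thing taken from the report is which call is removed. Per the observation above both variants are expected to complete, but this sketch has not been verified independently.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.count

// Hypothetical helper object; the names are illustrative, not part of Spark.
object CacheShowVariants {

  // Variant 1: keep the intermediate show, drop the cache call.
  def showWithoutCache(sqlContext: SQLContext, path: String): Unit = {
    val df = sqlContext.read.load(path) // Parquet is the default data source
    df.show()
    df.groupBy("_2").agg(count("_1") as "count").show()
  }

  // Variant 2: keep the cache call, drop the intermediate show.
  def cacheWithoutShow(sqlContext: SQLContext, path: String): Unit = {
    val df = sqlContext.read.load(path)
    df.cache()
    df.groupBy("_2").agg(count("_1") as "count").show()
  }
}
```

In spark-shell this can be entered with :paste and then called as CacheShowVariants.showWithoutCache(sqlContext, "df_hanging_test.parquet"), assuming the Parquet file written by the setup code above already exists.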


