Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/09/28 10:25:20 UTC

[jira] [Commented] (SPARK-17706) DataFrame losing string data in yarn mode

    [ https://issues.apache.org/jira/browse/SPARK-17706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15529197#comment-15529197 ] 

Sean Owen commented on SPARK-17706:
-----------------------------------

If the heap size is the issue, it's almost certainly a bug triggered by different compressed oops settings on the two JVMs: the JVM turns compressed oops off by default for heaps larger than 32GB, so a driver below that threshold and an executor above it end up with different settings. See https://issues.apache.org/jira/browse/SPARK-9725 ; I thought that was fixed in 1.5.0, but there have been more since, like https://issues.apache.org/jira/browse/SPARK-10914 or https://issues.apache.org/jira/browse/SPARK-17211 .

Ideally, try a newer Spark version, because I suspect it's been fixed.
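
As a quick sanity check on that theory (untested sketch; the class and jar names below are placeholders for your actual job), you can pin the compressed oops setting to the same value on both JVMs:

{code}
# Force compressed oops off on both the driver and the executors so the two JVMs agree
spark-submit --master yarn-client \
  --driver-java-options "-XX:-UseCompressedOops" \
  --conf spark.executor.extraJavaOptions=-XX:-UseCompressedOops \
  --class MissingValue missing-value.jar
{code}

If the strings survive with the setting pinned on both sides, that points at the compressed-oops bugs linked above.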

> DataFrame losing string data in yarn mode
> -----------------------------------------
>
>                 Key: SPARK-17706
>                 URL: https://issues.apache.org/jira/browse/SPARK-17706
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, YARN
>    Affects Versions: 1.5.0
>         Environment: RedHat 6.6, CDH 5.5.2
>            Reporter: Andrey Dmitriev
>
> For some reason, when I add a new column, append a string to existing data/columns, or create a new DataFrame in code, the string data is misinterpreted: the show() function doesn't work properly, and filters and transformations (such as withColumn, where, when, etc.) don't work either.
> Here is example code:
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.{Row, SQLContext}
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
>
> object MissingValue {
>   // Render a string as the hex values of its UTF-8 bytes, e.g. "ABC" -> "41-42-43"
>   def hex(str: String): String = str
>     .getBytes("UTF-8")
>     .map(b => Integer.toHexString(b & 0xFF).toUpperCase)
>     .mkString("-")
>
>   def main(args: Array[String]) {
>     val conf = new SparkConf().setAppName("MissingValue")
>     val sc = new SparkContext(conf)
>     sc.setLogLevel("WARN")
>     val sqlContext = new SQLContext(sc)
>     import sqlContext.implicits._
>
>     val list = List((101, "ABC"), (102, "BCD"), (103, "CDE"))
>     val rdd = sc.parallelize(list).map(f => Row(f._1, f._2))
>     val schema = StructType(
>       StructField("COL1", IntegerType, true)
>         :: StructField("COL2", StringType, true)
>         :: Nil
>     )
>     val df = sqlContext.createDataFrame(rdd, schema)
>     df.show()
>
>     val str = df.first().getString(1)
>     println(s"${str} == ${hex(str)}")
>     sc.stop()
>   }
> }
> {code}
> When I run it in local mode, everything works as expected:
> {code}
>     +----+----+
>     |COL1|COL2|
>     +----+----+
>     | 101| ABC|
>     | 102| BCD|
>     | 103| CDE|
>     +----+----+
>     
>     ABC == 41-42-43
> {code}
> But if I run the same code in yarn-client mode, it produces:
> {code}
>     +----+----+
>     |COL1|COL2|
>     +----+----+
>     | 101| ^E^@^@|
>     | 102| ^E^@^@|
>     | 103| ^E^@^@|
>     +----+----+
>     ^E^@^@ == 5-0-0
> {code}
> This problem exists only for string values; the first column (Integer) is fine.
> Also, if I create an RDD from the DataFrame, everything is fine, i.e. {{df.rdd.take(1).apply(0).getString(1)}} returns the correct value.
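> For illustration, the check I mean looks like this (hypothetical snippet, reusing the hex() helper from the example above):
> {code}
> // Reading the same cell through the RDD API returns the correct bytes
> val viaRdd = df.rdd.take(1).apply(0).getString(1)
> println(s"${viaRdd} == ${hex(viaRdd)}")  // prints: ABC == 41-42-43
> {code}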
> I'm using Spark 1.5.0 from CDH 5.5.2.
> It seems that this happens when the difference between driver memory and executor memory is too large ({{--driver-memory xxG --executor-memory yyG}}): when I decrease the executor memory or increase the driver memory, the problem disappears.
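> For example (hypothetical sizes and jar name, for illustration only):
> {code}
> # Broken output, large gap between driver and executor memory:
> spark-submit --master yarn-client --driver-memory 4G --executor-memory 40G \
>   --class MissingValue missing-value.jar
> # Correct output, executor memory decreased (or driver memory increased):
> spark-submit --master yarn-client --driver-memory 4G --executor-memory 8G \
>   --class MissingValue missing-value.jar
> {code}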



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org