Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/09/28 10:25:20 UTC
[jira] [Commented] (SPARK-17706) DataFrame losing string data in yarn mode
[ https://issues.apache.org/jira/browse/SPARK-17706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15529197#comment-15529197 ]
Sean Owen commented on SPARK-17706:
-----------------------------------
If the heap size is the issue, it's almost certainly a bug triggered by different compressed OOPs settings on the two JVMs, because heaps over 32G default to uncompressed OOPs. See https://issues.apache.org/jira/browse/SPARK-9725 ; I thought that was fixed in 1.5.0, but there have been similar issues since, like https://issues.apache.org/jira/browse/SPARK-10914 or https://issues.apache.org/jira/browse/SPARK-17211 .
Ideally, try a newer Spark version, because I suspect it's been fixed.
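If upgrading isn't possible right away, one way to test the compressed-OOPs theory is to pin the flag explicitly so the driver and executor JVMs agree regardless of heap size. A sketch, not verified on this cluster: the Spark property and flags below are standard, but note that in yarn-client mode the driver JVM is already running when spark.driver.extraJavaOptions is read, so the driver flag is passed via --driver-java-options instead:

{code}
spark-submit \
  --master yarn-client \
  --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" \
  --driver-java-options "-XX:-UseCompressedOops" \
  ...
{code}

Keeping both heaps on the same side of the ~32G boundary should have the same effect, since then both JVMs pick the same default.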
> DataFrame losing string data in yarn mode
> -----------------------------------------
>
> Key: SPARK-17706
> URL: https://issues.apache.org/jira/browse/SPARK-17706
> Project: Spark
> Issue Type: Bug
> Components: SQL, YARN
> Affects Versions: 1.5.0
> Environment: RedHat 6.6, CDH 5.5.2
> Reporter: Andrey Dmitriev
>
> For some reason, when I add a new column, append a string to existing data/columns, or create a new DataFrame from code, the string data is misinterpreted: show() doesn't work properly, and operations such as withColumn, where, when, etc. don't work either.
> Here is example code:
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.{Row, SQLContext}
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
>
> object MissingValue {
>   def hex(str: String): String = str
>     .getBytes("UTF-8")
>     .map(f => Integer.toHexString(f & 0xFF).toUpperCase)
>     .mkString("-")
>
>   def main(args: Array[String]) {
>     val conf = new SparkConf().setAppName("MissingValue")
>     val sc = new SparkContext(conf)
>     sc.setLogLevel("WARN")
>     val sqlContext = new SQLContext(sc)
>     import sqlContext.implicits._
>
>     val list = List((101, "ABC"), (102, "BCD"), (103, "CDE"))
>     val rdd = sc.parallelize(list).map(f => Row(f._1, f._2))
>     val schema = StructType(
>       StructField("COL1", IntegerType, true)
>         :: StructField("COL2", StringType, true)
>         :: Nil
>     )
>     val df = sqlContext.createDataFrame(rdd, schema)
>     df.show()
>
>     val str = df.first().getString(1)
>     println(s"${str} == ${hex(str)}")
>     sc.stop()
>   }
> }
> {code}
> When I run it in local mode then everything works as expected:
> {code}
> +----+----+
> |COL1|COL2|
> +----+----+
> | 101| ABC|
> | 102| BCD|
> | 103| CDE|
> +----+----+
>
> ABC == 41-42-43
> {code}
> But if I run the same code in yarn-client mode it produces:
> {code}
> +----+----+
> |COL1|COL2|
> +----+----+
> | 101| ^E^@^@|
> | 102| ^E^@^@|
> | 103| ^E^@^@|
> +----+----+
> ^E^@^@ == 5-0-0
> {code}
> This problem exists only for string values; the first column (Integer) is fine.
> Also, if I create an RDD from the DataFrame, everything is fine, i.e. {{df.rdd.take(1).apply(0).getString(1)}} returns the correct value.
> I'm using Spark 1.5.0 from CDH 5.5.2.
> It seems this happens when the difference between driver memory and executor memory is too large ({{--driver-memory xxG --executor-memory yyG}}): when I decrease executor memory or increase driver memory, the problem disappears.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)