Posted to issues@spark.apache.org by "Tzach Zohar (JIRA)" <ji...@apache.org> on 2015/08/13 16:45:46 UTC

[jira] [Closed] (SPARK-9936) decimal precision lost when loading DataFrame from RDD

     [ https://issues.apache.org/jira/browse/SPARK-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tzach Zohar closed SPARK-9936.
------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.0

> decimal precision lost when loading DataFrame from RDD
> ------------------------------------------------------
>
>                 Key: SPARK-9936
>                 URL: https://issues.apache.org/jira/browse/SPARK-9936
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: Tzach Zohar
>             Fix For: 1.5.0
>
>
> It seems that when converting an RDD that contains BigDecimals into a DataFrame (using SQLContext.createDataFrame without specifying a schema), the precision info is lost, which means saving as a Parquet file will fail (Parquet tries to verify precision < 18, so it fails if precision is unset).
> This seems to be similar to [SPARK-7196|https://issues.apache.org/jira/browse/SPARK-7196], which fixed the same issue for DataFrames created via JDBC.
> To reproduce:
> {code:none}
> scala> val rdd: RDD[(String, BigDecimal)] = sc.parallelize(Seq(("a", BigDecimal.valueOf(0.234))))
> rdd: org.apache.spark.rdd.RDD[(String, BigDecimal)] = ParallelCollectionRDD[0] at parallelize at <console>:23
> scala> val df: DataFrame = new SQLContext(rdd.context).createDataFrame(rdd)
> df: org.apache.spark.sql.DataFrame = [_1: string, _2: decimal(10,0)]
> scala> df.write.parquet("/data/parquet-file")
> 15/08/13 10:30:07 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.RuntimeException: Unsupported datatype DecimalType()
> {code}
> To verify that this is indeed caused by the lost precision, I tried manually changing the schema to include precision (by traversing the StructFields and replacing each DecimalType with an altered DecimalType) and creating a new DataFrame with the updated schema - and indeed that fixes the problem, as sketched below.
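> A rough sketch of that workaround (the helper name and the (18, 10) precision/scale are illustrative only, not the actual fix that went into 1.5.0):
> {code:none}
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.types.{DecimalType, StructField, StructType}
>
> // Rebuild the schema, giving every DecimalType an explicit precision/scale.
> // The (18, 10) values are just an example - pick whatever fits your data.
> def withExplicitDecimals(df: DataFrame, sqlContext: SQLContext): DataFrame = {
>   val fixedFields = df.schema.fields.map {
>     case StructField(name, _: DecimalType, nullable, metadata) =>
>       StructField(name, DecimalType(18, 10), nullable, metadata)
>     case other => other
>   }
>   // Re-create the DataFrame from the same rows, but with the patched schema
>   sqlContext.createDataFrame(df.rdd, StructType(fixedFields))
> }
> {code}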
> I'm using Spark 1.4.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org