Posted to issues@spark.apache.org by "Suresh Thalamati (JIRA)" <ji...@apache.org> on 2016/04/15 23:35:25 UTC
[jira] [Commented] (SPARK-14586) SparkSQL doesn't parse decimal like Hive
[ https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243702#comment-15243702 ]
Suresh Thalamati commented on SPARK-14586:
------------------------------------------
Thanks for reporting this issue, Stephane. Which version of Hive are you using? I took a quick look at the code; here is what I found:
The type decimal(4,2) maps to BigDecimal, not Double. BigDecimal parsing fails if the input contains spaces:
{code}
scala> BigDecimal(" 2.0")
java.lang.NumberFormatException
at java.math.BigDecimal.<init>(BigDecimal.java:494)
at java.math.BigDecimal.<init>(BigDecimal.java:383)
{code}
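For comparison, the same input parses fine once trimmed. (This is also why the reporter's toDouble example below succeeds: java.lang.Double.parseDouble tolerates surrounding whitespace, while the BigDecimal(String) constructor does not.)
{code}
scala> BigDecimal(" 2.0".trim)
res0: scala.math.BigDecimal = 2.0
{code}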
Spark SQL also relies on HiveDecimal to convert the string to a BigDecimal value.
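For example (a minimal sketch, assuming a pre-2.0 Hive on the classpath), HiveDecimal.create returns null for the untrimmed string, and Spark SQL surfaces that as a NULL column value:
{code}
scala> import org.apache.hadoop.hive.common.type.HiveDecimal
import org.apache.hadoop.hive.common.type.HiveDecimal

scala> HiveDecimal.create("2.0")   // parses fine
res0: org.apache.hadoop.hive.common.type.HiveDecimal = 2

scala> HiveDecimal.create(" 2.0")  // null before the trim fix below (assumed pre-2.0 behavior)
res1: org.apache.hadoop.hive.common.type.HiveDecimal = null
{code}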
Hive fixed this in the 2.0 release by trimming the decimal input string:
https://issues.apache.org/jira/browse/HIVE-12343
https://issues.apache.org/jira/browse/HIVE-10799
commit : https://github.com/apache/hive/commit/c178a6e9d12055e5bde634123ca58f243ae39477
{code}
common/src/java/org/apache/hadoop/hive/common/type/HiveDecimal.java

  public static HiveDecimal create(String dec) {
    BigDecimal bd;
    try {
-     bd = new BigDecimal(dec);
+     bd = new BigDecimal(dec.trim());
    } catch (NumberFormatException ex) {
      return null;
    }
{code}
When Spark moves to the 2.0 version of Hive, decimal parsing should behave the same as Hive's. I am not sure about the plans for upgrading the Hive version inside Spark. Copying Yin Huai.
[~yhuai]
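Until Spark picks up the fix, one possible workaround (an untested sketch; test_csv_2_str is a hypothetical copy of the table with column_2 declared as string) is to trim and cast at query time:
{code}
// untested sketch: assumes a hypothetical variant table whose column_2 is a string
sqlContext.sql(
  "select column_1, cast(trim(column_2) as decimal(4,2)) as column_2 " +
  "from spark_testing.test_csv_2_str").show()
{code}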
> SparkSQL doesn't parse decimal like Hive
> ----------------------------------------
>
> Key: SPARK-14586
> URL: https://issues.apache.org/jira/browse/SPARK-14586
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: Stephane Maarek
>
> Create a test_data.csv with the following content:
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space before the 2 is intentional)
> Copy test_data.csv to hdfs:///spark_testing_2.
> Go into Hive and run the following statements:
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a 2
> NULL 3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2.
> Now onto the Spark shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +--------+--------+
> |column_1|column_2|
> +--------+--------+
> | a| null|
> | null| 3.00|
> +--------+--------+
> {code}
> As you can see, the " 2" got parsed to null, so Hive and Spark don't parse decimals the same way. I wouldn't say it is a bug per se, but it looks like a necessary improvement for the two engines to converge. Hive version is 1.5.1.
> Not sure if relevant, but Scala does parse numbers with a leading space correctly:
> {code}
> scala> "2.0".toDouble
> res21: Double = 2.0
> scala> " 2.0".toDouble
> res22: Double = 2.0
> {code}