Posted to issues@spark.apache.org by "Stephane Maarek (JIRA)" <ji...@apache.org> on 2016/04/15 01:53:25 UTC
[jira] [Updated] (SPARK-14583) SparkSQL doesn't apply TBLPROPERTIES('serialization.null.format'='') when Hive Table has partitions
[ https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stephane Maarek updated SPARK-14583:
------------------------------------
Summary: SparkSQL doesn't apply TBLPROPERTIES('serialization.null.format'='') when Hive Table has partitions (was: SparkSQL doesn't read TBLPROPERTIES('serialization.null.format'='') when Hive Table has partitions)
> SparkSQL doesn't apply TBLPROPERTIES('serialization.null.format'='') when Hive Table has partitions
> ---------------------------------------------------------------------------------------------------
>
> Key: SPARK-14583
> URL: https://issues.apache.org/jira/browse/SPARK-14583
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 1.5.1
> Reporter: Stephane Maarek
>
> It seems that Spark forgets, or fails to read, the TBLPROPERTIES metadata after an MSCK REPAIR TABLE is issued from within Hive.
> Here are the steps to reproduce:
> Create test_data.csv with the following content:
> {code:none}
> a,2
> ,3
> {code}
> Move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/.
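> For reference, a minimal sketch of placing the file from spark-shell via the Hadoop FileSystem API (the local source path /tmp/test_data.csv is an assumption; a plain hdfs dfs -put into the partition directory works just as well):
> {code:java}
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> val fs = FileSystem.get(sc.hadoopConfiguration)
> // Create the partition directory and copy the sample file into it
> // (the local source path below is hypothetical)
> fs.mkdirs(new Path("/spark_testing/part_a=a/part_b=b"))
> fs.copyFromLocalFile(new Path("file:///tmp/test_data.csv"),
>   new Path("/spark_testing/part_a=a/part_b=b/test_data.csv"))
> {code}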
> Run the following Hive statements:
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv;
> CREATE EXTERNAL TABLE `spark_testing.test_csv`(
>   column_1 varchar(10),
>   column_2 int)
> PARTITIONED BY (
>   `part_a` string,
>   `part_b` string)
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> LOCATION '/spark_testing'
> TBLPROPERTIES ('serialization.null.format'='');
> MSCK REPAIR TABLE spark_testing.test_csv;
> select * from spark_testing.test_csv;
> OK
> a 2 a b
> NULL 3 a b
> {code}
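> To confirm the property was actually persisted in the metastore, it can be checked from the Hive CLI (SHOW TBLPROPERTIES is standard Hive DDL; the exact output formatting varies by Hive version):
> {code:sql}
> -- Should list serialization.null.format with an empty value
> SHOW TBLPROPERTIES spark_testing.test_csv;
> {code}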
> (Note the NULL in the second row: Hive applies the null format correctly.)
> Now on to Spark, from spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv").show()
> +--------+--------+------+------+
> |column_1|column_2|part_a|part_b|
> +--------+--------+------+------+
> | a| 2| a| b|
> | | 3| a| b|
> +--------+--------+------+------+
> {code}
> As you can see, Spark does not apply the null format: the empty string comes through as-is instead of NULL.
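> As a stopgap, here is a sketch of a workaround (not a fix, and it hard-codes the affected column): map the empty strings back to nulls after reading.
> {code:java}
> import org.apache.spark.sql.functions.{col, when}
>
> val df = sqlContext.sql("select * from spark_testing.test_csv")
> // Re-apply serialization.null.format='' by hand: empty string -> NULL
> val fixed = df.withColumn("column_1",
>   when(col("column_1") === "", null).otherwise(col("column_1")))
> fixed.show()
> {code}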
> I don't know whether this affects later versions of Spark, as I can't test them in my company's environment, but the steps above are easy to reproduce and can be tested elsewhere. My Hive version is 1.2.1.
> Let me know if you have any questions. To me this is a big issue, because the data isn't read correctly.