Posted to issues@spark.apache.org by "Stephane Maarek (JIRA)" <ji...@apache.org> on 2016/04/13 03:16:25 UTC

[jira] [Updated] (SPARK-14583) Spark doesn't read hive table properly after MSCK REPAIR

     [ https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephane Maarek updated SPARK-14583:
------------------------------------
    Description: 
It seems that Spark forgets, or fails to read, the TBLPROPERTIES metadata after an MSCK REPAIR is issued from within Hive; in this case the {{serialization.null.format}} property, which tells Hive to treat empty fields as NULL, is apparently ignored.

Here are the steps to reproduce:
Create test_data.csv with the following content:
{code:csv}
a,2
,3
{code}

Move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/
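
(If it helps to script that step, here is a minimal sketch using the Hadoop FileSystem API, equivalent to an hdfs dfs -put; it assumes fs.defaultFS points at the target HDFS:)

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Create the partition directory and copy the local CSV into it.
val fs = FileSystem.get(new Configuration())
fs.mkdirs(new Path("/spark_testing/part_a=a/part_b=b"))
fs.copyFromLocalFile(new Path("test_data.csv"),
  new Path("/spark_testing/part_a=a/part_b=b/test_data.csv"))
{code}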

Run the following Hive statements:
{code:sql}
CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv;
CREATE EXTERNAL TABLE `spark_testing.test_csv`(
  column_1 varchar(10),
  column_2 int)
PARTITIONED BY (
  `part_a` string,
  `part_b` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing'
TBLPROPERTIES('serialization.null.format'='');
MSCK REPAIR TABLE spark_testing.test_csv;
select * from spark_testing.test_csv;
{code}

The final select returns:

{code}
OK
a       2       a       b
NULL    3       a       b
{code}
(note the NULL in the first column of the second row)

Now onto Spark:


{code:scala}
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv").show()
{code}

which prints:

{code}
+--------+--------+------+------+
|column_1|column_2|part_a|part_b|
+--------+--------+------+------+
|       a|       2|     a|     b|
|        |       3|     a|     b|
+--------+--------+------+------+
{code}

As you can see, Spark does not detect the NULL; the empty string comes through as-is instead of being mapped to NULL.
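
To make the symptom explicit, a quick check from the same shell (the expected counts follow from the Hive output above):

{code:scala}
// The empty field should come back as NULL, not as the empty string.
val df = sqlContext.sql("select * from spark_testing.test_csv")
df.filter("column_1 is null").count()  // expected 1, actually returns 0
df.filter("column_1 = ''").count()     // expected 0, actually returns 1
{code}
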
I don't know whether this affects later versions of Spark, as I can't test them in my company's environment. The steps are easy to reproduce, though, so this can be verified elsewhere. My Hive version is 1.2.1.
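
When reproducing in another environment, it may also be worth confirming that the property actually survives the repair in the metastore. A sketch of that check (it assumes HiveContext forwards SHOW TBLPROPERTIES to Hive as a native command, which I have not verified on every version):

{code:scala}
// Sketch: confirm serialization.null.format is still set after MSCK REPAIR.
sqlContext.sql("SHOW TBLPROPERTIES spark_testing.test_csv").show()
// The listing should include serialization.null.format with an empty value.
{code}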

Let me know if you have any questions. To me this is a big issue, because the data isn't read correctly.
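
Until the root cause is found, a possible workaround on the Spark side is to re-apply the null convention after reading. A sketch only, and it assumes an empty string never occurs as legitimate data in that column:

{code:scala}
import org.apache.spark.sql.functions.{col, when}

// Workaround sketch: map empty strings back to NULL, i.e. re-apply what
// serialization.null.format='' was supposed to do at read time.
val raw = sqlContext.sql("select * from spark_testing.test_csv")
val fixed = raw.withColumn("column_1",
  when(col("column_1") === "", null).otherwise(col("column_1")))
fixed.filter("column_1 is null").count()  // now returns 1
{code}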


> Spark doesn't read hive table properly after MSCK REPAIR
> --------------------------------------------------------
>
>                 Key: SPARK-14583
>                 URL: https://issues.apache.org/jira/browse/SPARK-14583
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.5.1
>            Reporter: Stephane Maarek
>


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
