Posted to issues@spark.apache.org by "Chetan Dalal (JIRA)" <ji...@apache.org> on 2015/07/28 21:18:04 UTC
[jira] [Created] (SPARK-9414) HiveContext:saveAsTable creates wrong
partition for existing hive table(append mode)
Chetan Dalal created SPARK-9414:
-----------------------------------
Summary: HiveContext:saveAsTable creates wrong partition for existing hive table(append mode)
Key: SPARK-9414
URL: https://issues.apache.org/jira/browse/SPARK-9414
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Environment: Hadoop 2.6, Spark 1.4.0, Hive 0.14.0.
Reporter: Chetan Dalal
Priority: Critical
Raising this bug because I found that this issue was already reported on the Apache mail archive and I am facing a similar issue.
-----------original------------------------------
I am using Spark 1.4 and HiveContext to append data into a partitioned
Hive table. I found that the data inserted into the table is correct, but the
partition (folder) created is totally wrong.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{FloatType, IntegerType, StructField, StructType}

// Column names as listed in the report (22 names; the row below carries 23 values).
val schemaString = "zone z year month date hh x y height u v w ph phb " +
  "p pb qvapor qgraup qnice qnrain tke_pbl el_pbl"

// The first eight columns are integers; every other column is a float.
val intColumns = Set("zone", "z", "year", "month", "date", "hh", "x", "y")
val schema =
  StructType(
    schemaString.split(" ").map(fieldName =>
      if (intColumns.contains(fieldName))
        StructField(fieldName, IntegerType, true)
      else
        StructField(fieldName, FloatType, true)
    ))

// A single row of raw data: 8 integer values followed by 15 float values.
val pairVarRDD =
  sc.parallelize(Seq(
    Row(2, 42, 2009, 3, 1, 0, 218, 365,
      9989.497f, 29.627113f, 19.071793f, 0.11982734f, 3174.6812f,
      97735.2f, 16.389032f, -96.62891f, 25135.365f, 2.6476808E-5f, 0.0f, 13195.351f,
      0.0f, 0.1f, 0.0f)
  ))

// sqlContext is the HiveContext used in this report; sc is the SparkContext.
val partitionedTestDF2 = sqlContext.createDataFrame(pairVarRDD, schema)
partitionedTestDF2.write.format("org.apache.spark.sql.hive.orc.DefaultSource")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .partitionBy("zone", "z", "year", "month")
  .saveAsTable("test4DimBySpark")
---------------------------------------------------------------------------------------------
The table contains 23 columns (more than the maximum Tuple length of 22), so I
use a Row object to store the raw data, not a Tuple.
Here are some messages from Spark when it saved the data>>
>>>>
15/06/16 10:39:22 INFO metadata.Hive: Renaming
src:hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-10000/zone=13195/z=0/year=0/month=0/part-00001;dest:
hdfs://service-10-0.local:8020/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-00001;Status:true
>>>>
15/06/16 10:39:22 INFO metadata.Hive: New loading path =
hdfs://service-10-0.local:8020/tmp/hive-patcharee/hive_2015-06-16_10-39-21_205_8768669104487548472-1/-ext-10000/zone=13195/z=0/year=0/month=0
with partSpec {zone=13195, z=0, year=0, month=0}
>>>>
From the raw data (pairVarRDD), zone = 2, z = 42, year = 2009, month =
3. But Spark created the partition {zone=13195, z=0, year=0, month=0}. (x)
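
One reading that is consistent with these numbers (a guess from the data in this report, not a confirmed diagnosis) is that the partition values were taken by position from the last four values of the row instead of by column name. The small Scala sketch below needs no Spark; the object and helper names are made up for illustration. It truncates the last four values of the row to integers and pairs them with the partition column names:

```scala
object PartitionMappingSketch {
  // The 23 values of the single Row from the report, in their original order.
  val rowValues: Seq[Double] = Seq[Double](
    2, 42, 2009, 3, 1, 0, 218, 365,
    9989.497, 29.627113, 19.071793, 0.11982734, 3174.6812,
    97735.2, 16.389032, -96.62891, 25135.365, 2.6476808E-5, 0.0, 13195.351,
    0.0, 0.1, 0.0)

  // Partition columns passed to partitionBy in the report.
  val partitionColumns: Seq[String] = Seq("zone", "z", "year", "month")

  // Hypothetical helper: pair each partition column with one of the trailing
  // row values (truncated to Int) and render a Hive-style partition path.
  def trailingPartitionSpec(values: Seq[Double], partCols: Seq[String]): String =
    partCols.zip(values.takeRight(partCols.length))
      .map { case (name, v) => s"$name=${v.toInt}" }
      .mkString("/")

  def main(args: Array[String]): Unit =
    println(trailingPartitionSpec(rowValues, partitionColumns))
    // prints zone=13195/z=0/year=0/month=0
}
```

The result matches the partSpec in the log above, which is why a positional mapping looks like a plausible explanation.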
>>>>
When I queried from hive>>
>>>>
hive> select * from test4dimBySpark;
OK
2 42 2009 3 1.0 0.0 218.0 365.0 9989.497
29.627113 19.071793 0.11982734 -3174.6812 97735.2 16.389032
-96.62891 25135.365 2.6476808E-5 0.0 13195 0 0 0
hive> select zone, z, year, month from test4dimBySpark;
OK
13195 0 0 0
hive> dfs -ls /apps/hive/warehouse/test4dimBySpark/*/*/*/*;
Found 2 items
-rw-r--r-- 3 patcharee hdfs 1411 2015-06-16 10:39
/apps/hive/warehouse/test4dimBySpark/zone=13195/z=0/year=0/month=0/part-00001
>>>>
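
Hive expects dynamic partition columns to come last in the data being inserted, and the query output above shows the values 13195 0 0 0 in the trailing positions. A possible workaround to try, not verified against Spark 1.4.0 here, is to reorder the DataFrame columns so that the partition columns come last before appending. The sketch below only computes that column order in plain Scala; the object and helper names are hypothetical, and the column list is a shortened subset used for illustration:

```scala
object ReorderForAppendSketch {
  // Hypothetical helper: move the partition columns to the end while keeping
  // the relative order of all other columns.
  def partitionColumnsLast(all: Seq[String], partCols: Seq[String]): Seq[String] =
    all.filterNot(c => partCols.contains(c)) ++ partCols

  def main(args: Array[String]): Unit = {
    // Shortened column list for illustration; the real table has more columns.
    val all = Seq("zone", "z", "year", "month", "date", "hh", "x", "y", "height")
    val reordered = partitionColumnsLast(all, Seq("zone", "z", "year", "month"))
    println(reordered.mkString(" "))
    // prints date hh x y height zone z year month

    // With Spark on the classpath and a DataFrame `df`, the same order could
    // be applied before writing, for example:
    //   df.select(reordered.map(org.apache.spark.sql.functions.col): _*)
  }
}
```

Whether the append then lands in the expected partition would still need to be checked on the affected version.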
The data stored in the table is correct (zone = 2, z = 42, year = 2009,
month = 3), but the partition created was wrong:
"zone=13195/z=0/year=0/month=0" (x)