Posted to user@spark.apache.org by Jerrick Hoang <je...@gmail.com> on 2015/08/11 08:14:54 UTC

Refresh table

Hi all,

I'm a little confused about how refresh table (SPARK-5833) should work. So
I did the following,

val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")

df1.write.parquet("hdfs://<path>/test_table/key=1")


Then I created an external table by doing,

CREATE EXTERNAL TABLE `tmp_table` (
  `single` int,
  `double` int)
PARTITIONED BY (
  `key` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://<path>/test_table/'

Then I added the partition to the table by `alter table tmp_table add
partition (key=1) location 'hdfs://..`
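Written out in full (using the same placeholder path as above, since the partition directory is the one df1 wrote to), the statement was roughly:

```scala
// Full form of the ADD PARTITION statement; the location is the
// placeholder path from the parquet write above, not a real path.
sqlContext.sql("""
  ALTER TABLE tmp_table ADD PARTITION (key='1')
  LOCATION 'hdfs://<path>/test_table/key=1'
""")
```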

Then I added a new partition with different schema by,

val df2 = sc.makeRDD(1 to 5).map(i => (i, i * 3)).toDF("single", "triple")

df2.write.parquet("hdfs://<path>/test_table/key=2")


And added the new partition to the table by `alter table ..`,

But when I ran `refresh table tmp_table` and then `describe table`, it didn't
pick up the new column `triple`. Can someone explain how partition discovery
and schema merging are supposed to work with `refresh table`?
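For comparison, here is a sketch of how I'd expect partition discovery and
schema merging to behave when reading the root path directly with the
DataFrame reader (the `mergeSchema` option is from the Spark 1.x Parquet
docs; the path is the same placeholder as above):

```scala
// Sketch: read the table's root directory directly, asking Spark to
// merge the per-partition Parquet schemas instead of using Hive metadata.
val merged = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("hdfs://<path>/test_table")

// Schema should be the union of both partitions' columns plus the
// discovered partition column: single, double, triple, key.
merged.printSchema()
```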

Thanks

RE: Refresh table

Posted by "Cheng, Hao" <ha...@intel.com>.
In my understanding, refreshing a table only works for Spark SQL DataSource tables; here you are apparently querying a Hive table.

Can you try to create a table like:

        CREATE TEMPORARY TABLE parquetTable (a int, b string)
        USING org.apache.spark.sql.parquet.DefaultSource
        OPTIONS (
          path '/root_path'
        )

And then df2.write.parquet("hdfs://root_path/test_table/key=2") …
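Put together, the flow would look something like the sketch below (assuming a Spark 1.x sqlContext and reusing the placeholder path and df2 from the original mail):

```scala
// Sketch of the suggested flow: register a DataSource table over the
// root path, write a new partition, then refresh so the new files and
// the extra column are picked up.
sqlContext.sql("""
  CREATE TEMPORARY TABLE parquetTable (single int, double int)
  USING org.apache.spark.sql.parquet.DefaultSource
  OPTIONS (path 'hdfs://<path>/test_table')
""")

// New partition with a different schema (single, triple).
val df2 = sc.makeRDD(1 to 5).map(i => (i, i * 3)).toDF("single", "triple")
df2.write.parquet("hdfs://<path>/test_table/key=2")

// Invalidate the cached metadata so the new partition is discovered.
sqlContext.sql("REFRESH TABLE parquetTable")
```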

Cheng

From: Jerrick Hoang [mailto:jerrickhoang@gmail.com]
Sent: Tuesday, August 11, 2015 2:15 PM
To: user
Subject: Refresh table

Hi all,

I'm a little confused about how refresh table (SPARK-5833) should work. So I did the following,

val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")

df1.write.parquet("hdfs://<path>/test_table/key=1")

Then I created an external table by doing,

CREATE EXTERNAL TABLE `tmp_table` (
`single`: int,
`double`: int)
PARTITIONED BY (
  `key` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://<path>/test_table/'

Then I added the partition to the table by `alter table tmp_table add partition (key=1) location 'hdfs://..`

Then I added a new partition with different schema by,


val df2 = sc.makeRDD(1 to 5).map(i => (i, i * 3)).toDF("single", "triple")

df2.write.parquet("hdfs://<path>/test_table/key=2")

And added the new partition to the table by `alter table ..`,

But when I did `refresh table tmp_table` and `describe table` it couldn't pick up the new column `triple`. Can someone explain to me how partition discovery and schema merging of refresh table should work?

Thanks