Posted to user@spark.apache.org by Jerrick Hoang <je...@gmail.com> on 2015/07/18 20:11:03 UTC

Spark-hive parquet schema evolution

Hi all,

I'm aware of the support for schema evolution via DataFrame API. Just
wondering what would be the best way to go about dealing with schema
evolution with Hive metastore tables. So, say I create a table via SparkSQL
CLI, how would I deal with Parquet schema evolution?

Thanks,
J

Re: Spark-hive parquet schema evolution

Posted by Cheng Lian <li...@gmail.com>.
Yeah, the benefit of `saveAsTable` is that you don't need to deal with
the schema explicitly, while the benefit of ALTER TABLE is that you still
end up with a standard, vanilla Hive table.
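
For reference, a minimal sketch of the two options (the table and column
names are placeholders, path points at the table's HDFS directory as in
the quoted example below, and sqlContext is assumed to be a HiveContext):

    // Option 1: let Spark SQL persist the merged schema as a data source table.
    val merged = sqlContext.read.option("mergeSchema", "true").parquet(path)
    merged.write.option("path", path).saveAsTable("t2")

    // Option 2: keep the vanilla Hive table and widen its schema by hand.
    sqlContext.sql("ALTER TABLE t1 ADD COLUMNS (c2 STRING)")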

Cheng

On 7/22/15 11:00 PM, Dean Wampler wrote:
> While it's not recommended to overwrite files Hive thinks it
> understands, you can add the column to Hive's metastore with an ALTER
> TABLE command, either in HiveQL from the Hive shell or via
> HiveContext.sql():
>
> ALTER TABLE mytable ADD COLUMNS (col_name data_type);
>
> See 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column 
> for full details.
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition 
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Wed, Jul 22, 2015 at 4:36 AM, Cheng Lian <lian.cs.zju@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Since Hive doesn’t support schema evolution, you’ll have to update
>     the schema stored in the metastore somehow. For example, you can
>     create a new external table with the merged schema. Say you have a
>     Hive table t1:
>
>         CREATE TABLE t1 (c0 INT, c1 DOUBLE);
>
>     By default, this table is stored in the HDFS path
>     hdfs://some-host:9000/user/hive/warehouse/t1. Now you append
>     some Parquet data with an extra column c2 to the same directory:
>
>         import org.apache.spark.sql.types._
>
>         val path = "hdfs://some-host:9000/user/hive/warehouse/t1"
>         val df1 = sqlContext.range(10).select(
>           'id as 'c0, 'id cast DoubleType as 'c1, 'id cast StringType as 'c2)
>         df1.write.mode("append").parquet(path)
>
>     Now you can create a new external table t2 like this:
>
>         val df2 = sqlContext.read.option("mergeSchema", "true").parquet(path)
>         df2.write.path(path).saveAsTable("t2")
>
>     Since we specified a path above, the newly created t2 is an
>     external table pointing to the original HDFS location, but the
>     schema of t2 is the merged version.
>
>     The drawback of this approach is that t2 is actually a Spark SQL
>     specific data source table rather than a genuine Hive table. This
>     means it can be accessed by Spark SQL only; we’re just using the
>     Hive metastore to help persist the metadata of the data source
>     table. However, since you’re asking how to access the new table
>     via the Spark SQL CLI, this should work for you. We are working on
>     making Parquet and ORC data source tables accessible via Hive in
>     Spark 1.5.0.
>
>     Cheng
>
>     On 7/22/15 10:32 AM, Jerrick Hoang wrote:
>
>>     Hi Lian,
>>
>>     Sorry, I'm new to Spark, so I did not express myself very clearly.
>>     I'm concerned about the situation where, say, I have a Parquet
>>     table with some partitions, I add a new column A to the Parquet
>>     schema, and I write some data with the new schema to a new
>>     partition in the table. If I'm not mistaken, if I do a
>>     sqlContext.read.parquet(table_path).printSchema() it will print
>>     the correct schema with the new column A. But if I do a 'describe
>>     table' from the Spark SQL CLI I won't see the new column being
>>     added. I understand that this is because Hive doesn't support
>>     schema evolution. So what is the best way to support CLI queries
>>     in this situation? Do I need to manually alter the table every
>>     time the underlying schema changes?
>>
>>     Thanks
>>
>>     On Tue, Jul 21, 2015 at 4:37 PM, Cheng Lian
>>     <lian.cs.zju@gmail.com <ma...@gmail.com>> wrote:
>>
>>         Hey Jerrick,
>>
>>         What do you mean by "schema evolution with Hive metastore
>>         tables"? Hive doesn't take schema evolution into account.
>>         Could you please give a concrete use case? Are you trying to
>>         write Parquet data with extra columns into an existing
>>         metastore Parquet table?
>>
>>         Cheng
>>
>>
>>         On 7/21/15 1:04 AM, Jerrick Hoang wrote:
>>>         I'm new to Spark, any ideas would be much appreciated! Thanks
>>>
>>>         On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang
>>>         <jerrickhoang@gmail.com <ma...@gmail.com>> wrote:
>>>
>>>             Hi all,
>>>
>>>             I'm aware of the support for schema evolution via
>>>             DataFrame API. Just wondering what would be the best way
>>>             to go about dealing with schema evolution with Hive
>>>             metastore tables. So, say I create a table via SparkSQL
>>>             CLI, how would I deal with Parquet schema evolution?
>>>
>>>             Thanks,
>>>             J
>>>
>>>
>>
>>
>
>
>


Re: Spark-hive parquet schema evolution

Posted by Dean Wampler <de...@gmail.com>.
While it's not recommended to overwrite files Hive thinks it understands,
you can add the column to Hive's metastore with an ALTER TABLE command,
either in HiveQL from the Hive shell or via HiveContext.sql():

ALTER TABLE mytable ADD COLUMNS (col_name data_type);

See
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column
for full details.
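
For the HiveContext.sql() route, a minimal sketch from a Spark shell (the
table and column names below are just placeholders):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)  // sc is the existing SparkContext

    // Register the new column in the Hive metastore so that DESCRIBE and
    // other metastore-based reads pick it up.
    hiveContext.sql("ALTER TABLE mytable ADD COLUMNS (new_col STRING)")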

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Wed, Jul 22, 2015 at 4:36 AM, Cheng Lian <li...@gmail.com> wrote:

> Since Hive doesn’t support schema evolution, you’ll have to update the
> schema stored in the metastore somehow. For example, you can create a new
> external table with the merged schema. Say you have a Hive table t1:
>
>     CREATE TABLE t1 (c0 INT, c1 DOUBLE);
>
> By default, this table is stored in the HDFS path
> hdfs://some-host:9000/user/hive/warehouse/t1. Now you append some Parquet
> data with an extra column c2 to the same directory:
>
>     import org.apache.spark.sql.types._
>
>     val path = "hdfs://some-host:9000/user/hive/warehouse/t1"
>     val df1 = sqlContext.range(10).select(
>       'id as 'c0, 'id cast DoubleType as 'c1, 'id cast StringType as 'c2)
>     df1.write.mode("append").parquet(path)
>
> Now you can create a new external table t2 like this:
>
>     val df2 = sqlContext.read.option("mergeSchema", "true").parquet(path)
>     df2.write.path(path).saveAsTable("t2")
>
> Since we specified a path above, the newly created t2 is an external
> table pointing to the original HDFS location, but the schema of t2 is the
> merged version.
>
> The drawback of this approach is that t2 is actually a Spark SQL specific
> data source table rather than a genuine Hive table. This means it can be
> accessed by Spark SQL only; we’re just using the Hive metastore to help
> persist the metadata of the data source table. However, since you’re asking
> how to access the new table via the Spark SQL CLI, this should work for you.
> We are working on making Parquet and ORC data source tables accessible via
> Hive in Spark 1.5.0.
>
> Cheng
>
> On 7/22/15 10:32 AM, Jerrick Hoang wrote:
>
>   Hi Lian,
>
> Sorry, I'm new to Spark, so I did not express myself very clearly. I'm
> concerned about the situation where, say, I have a Parquet table with some
> partitions, I add a new column A to the Parquet schema, and I write some
> data with the new schema to a new partition in the table. If I'm not
> mistaken, if I do a sqlContext.read.parquet(table_path).printSchema() it
> will print the correct schema with the new column A. But if I do a
> 'describe table' from the Spark SQL CLI I won't see the new column being
> added. I understand that this is because Hive doesn't support schema
> evolution. So what is the best way to support CLI queries in this
> situation? Do I need to manually alter the table every time the underlying
> schema changes?
>
>  Thanks
>
> On Tue, Jul 21, 2015 at 4:37 PM, Cheng Lian <li...@gmail.com> wrote:
>
>>  Hey Jerrick,
>>
>> What do you mean by "schema evolution with Hive metastore tables"? Hive
>> doesn't take schema evolution into account. Could you please give a
>> concrete use case? Are you trying to write Parquet data with extra columns
>> into an existing metastore Parquet table?
>>
>> Cheng
>>
>>
>> On 7/21/15 1:04 AM, Jerrick Hoang wrote:
>>
>> I'm new to Spark, any ideas would be much appreciated! Thanks
>>
>> On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang <
>> <je...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>>  I'm aware of the support for schema evolution via DataFrame API. Just
>>> wondering what would be the best way to go about dealing with schema
>>> evolution with Hive metastore tables. So, say I create a table via SparkSQL
>>> CLI, how would I deal with Parquet schema evolution?
>>>
>>>  Thanks,
>>> J
>>>
>>
>>
>>
>
>

Re: Spark-hive parquet schema evolution

Posted by Cheng Lian <li...@gmail.com>.
Since Hive doesn’t support schema evolution, you’ll have to update the
schema stored in the metastore somehow. For example, you can create a new
external table with the merged schema. Say you have a Hive table t1:

    CREATE TABLE t1 (c0 INT, c1 DOUBLE);

By default, this table is stored in the HDFS path
hdfs://some-host:9000/user/hive/warehouse/t1. Now you append some
Parquet data with an extra column c2 to the same directory:

    import org.apache.spark.sql.types._

    val path = "hdfs://some-host:9000/user/hive/warehouse/t1"
    val df1 = sqlContext.range(10).select(
      'id as 'c0, 'id cast DoubleType as 'c1, 'id cast StringType as 'c2)
    df1.write.mode("append").parquet(path)

Now you can create a new external table t2 like this:

    val df2 = sqlContext.read.option("mergeSchema", "true").parquet(path)
    df2.write.path(path).saveAsTable("t2")

Since we specified a path above, the newly created t2 is an external
table pointing to the original HDFS location, but the schema of t2 is
the merged version.
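
To double-check that the metastore picked up the merged schema, you can
look at the table from the Spark SQL side. (A minimal sketch; it assumes
the t2 and path defined above.)

    // The metastore-backed table should now expose c0, c1 and the new c2:
    sqlContext.table("t2").printSchema()

    // ...and it should match the file-level (merged) Parquet schema:
    sqlContext.read.option("mergeSchema", "true").parquet(path).printSchema()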

The drawback of this approach is that t2 is actually a Spark SQL specific
data source table rather than a genuine Hive table. This means it can be
accessed by Spark SQL only; we’re just using the Hive metastore to help
persist the metadata of the data source table. However, since you’re
asking how to access the new table via the Spark SQL CLI, this should work
for you. We are working on making Parquet and ORC data source tables
accessible via Hive in Spark 1.5.0.

Cheng

On 7/22/15 10:32 AM, Jerrick Hoang wrote:

> Hi Lian,
>
> Sorry, I'm new to Spark, so I did not express myself very clearly. I'm
> concerned about the situation where, say, I have a Parquet table with
> some partitions, I add a new column A to the Parquet schema, and I write
> some data with the new schema to a new partition in the table. If I'm
> not mistaken, if I do a
> sqlContext.read.parquet(table_path).printSchema() it will print the
> correct schema with the new column A. But if I do a 'describe table'
> from the Spark SQL CLI I won't see the new column being added. I
> understand that this is because Hive doesn't support schema evolution.
> So what is the best way to support CLI queries in this situation? Do I
> need to manually alter the table every time the underlying schema
> changes?
>
> Thanks
>
> On Tue, Jul 21, 2015 at 4:37 PM, Cheng Lian <lian.cs.zju@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Hey Jerrick,
>
>     What do you mean by "schema evolution with Hive metastore tables"?
>     Hive doesn't take schema evolution into account. Could you please
>     give a concrete use case? Are you trying to write Parquet data
>     with extra columns into an existing metastore Parquet table?
>
>     Cheng
>
>
>     On 7/21/15 1:04 AM, Jerrick Hoang wrote:
>>     I'm new to Spark, any ideas would be much appreciated! Thanks
>>
>>     On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang
>>     <jerrickhoang@gmail.com <ma...@gmail.com>> wrote:
>>
>>         Hi all,
>>
>>         I'm aware of the support for schema evolution via DataFrame
>>         API. Just wondering what would be the best way to go about
>>         dealing with schema evolution with Hive metastore tables. So,
>>         say I create a table via SparkSQL CLI, how would I deal with
>>         Parquet schema evolution?
>>
>>         Thanks,
>>         J
>>
>>
>
>

Re: Spark-hive parquet schema evolution

Posted by Jerrick Hoang <je...@gmail.com>.
Hi Lian,

Sorry, I'm new to Spark, so I did not express myself very clearly. I'm
concerned about the situation where, say, I have a Parquet table with some
partitions, I add a new column A to the Parquet schema, and I write some
data with the new schema to a new partition in the table. If I'm not
mistaken, if I do a sqlContext.read.parquet(table_path).printSchema() it
will print the correct schema with the new column A. But if I do a
'describe table' from the Spark SQL CLI I won't see the new column being
added. I understand that this is because Hive doesn't support schema
evolution. So what is the best way to support CLI queries in this
situation? Do I need to manually alter the table every time the underlying
schema changes?
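
(For concreteness, a minimal sketch of the mismatch I mean; table_path and
the table name below are placeholders:)

    // Reading the files directly picks up the new column A from the Parquet footers:
    sqlContext.read.parquet(table_path).printSchema()

    // The metastore-backed view still shows the old schema until the table is altered:
    sqlContext.table("my_parquet_table").printSchema()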

Thanks

On Tue, Jul 21, 2015 at 4:37 PM, Cheng Lian <li...@gmail.com> wrote:

>  Hey Jerrick,
>
> What do you mean by "schema evolution with Hive metastore tables"? Hive
> doesn't take schema evolution into account. Could you please give a
> concrete use case? Are you trying to write Parquet data with extra columns
> into an existing metastore Parquet table?
>
> Cheng
>
>
> On 7/21/15 1:04 AM, Jerrick Hoang wrote:
>
> I'm new to Spark, any ideas would be much appreciated! Thanks
>
> On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang <je...@gmail.com>
> wrote:
>
>> Hi all,
>>
>>  I'm aware of the support for schema evolution via DataFrame API. Just
>> wondering what would be the best way to go about dealing with schema
>> evolution with Hive metastore tables. So, say I create a table via SparkSQL
>> CLI, how would I deal with Parquet schema evolution?
>>
>>  Thanks,
>> J
>>
>
>
>

Re: Spark-hive parquet schema evolution

Posted by Cheng Lian <li...@gmail.com>.
Hey Jerrick,

What do you mean by "schema evolution with Hive metastore tables"? Hive 
doesn't take schema evolution into account. Could you please give a 
concrete use case? Are you trying to write Parquet data with extra 
columns into an existing metastore Parquet table?

Cheng

On 7/21/15 1:04 AM, Jerrick Hoang wrote:
> I'm new to Spark, any ideas would be much appreciated! Thanks
>
> On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang 
> <jerrickhoang@gmail.com <ma...@gmail.com>> wrote:
>
>     Hi all,
>
>     I'm aware of the support for schema evolution via DataFrame API.
>     Just wondering what would be the best way to go about dealing with
>     schema evolution with Hive metastore tables. So, say I create a
>     table via SparkSQL CLI, how would I deal with Parquet schema
>     evolution?
>
>     Thanks,
>     J
>
>


Re: Spark-hive parquet schema evolution

Posted by Jerrick Hoang <je...@gmail.com>.
I'm new to Spark, any ideas would be much appreciated! Thanks

On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang <je...@gmail.com>
wrote:

> Hi all,
>
> I'm aware of the support for schema evolution via DataFrame API. Just
> wondering what would be the best way to go about dealing with schema
> evolution with Hive metastore tables. So, say I create a table via SparkSQL
> CLI, how would I deal with Parquet schema evolution?
>
> Thanks,
> J
>