You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Rajat Khandelwal <ra...@inmobi.com> on 2016/12/07 10:32:37 UTC

Schema evolution in hive tables

So far, my understanding has been that in Hive tables, each partition has a
schema and whenever you add a partition to a Hive table, the current table
schema is copied into the partition schema. This should allow a seamless
evolution of the schema. Recently I came across something that contradicts
this. Hence, looking for some clarification.

So we have a table, and we have fixed ORC format for it. The table has a
schema say (a,b,c,d). We added one partition. The data is stored in the
same order. When we query (a, b) from this partition, the data has the two
columns in the correct order. Now we go ahead and change the schema of the
*table* to (b,c,d,a). But the schema of the partition is still (a,b,c,d) as
verified by doing describe extended on the partition. Now we issue the same
query on the old partition projecting (a,b). Surprisingly, it projects (b,
c). Is this the expected behaviour or am I missing something obvious?

Coming back to the question of schema evolution, as business usecases grow,
there is a need to add fields in the table. So am I restricted by hive to
add my fields at the end only?

Thanks

-- 
Rajat Khandelwal
Software Engineer

-- 
_____________________________________________________________
The information contained in this communication is intended solely for the 
use of the individual or entity to whom it is addressed and others 
authorized to receive it. It may contain confidential or legally privileged 
information. If you are not the intended recipient you are hereby notified 
that any disclosure, copying, distribution or taking any action in reliance 
on the contents of this information is strictly prohibited and may be 
unlawful. If you have received this communication in error, please notify 
us immediately by responding to this email and then delete it from your 
system. The firm is neither liable for the proper and complete transmission 
of the information contained in this communication nor for any delay in its 
receipt.

Re: Schema evolution in hive tables

Posted by Furcy Pin <fu...@flaminem.com>.

Hi Rajat,

All I know from the doc is that changing a table's schema will not change
its partitions' schemas, unless you use the keyword CASCADE.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns

However, the way Hive handles the data in the different cases seems to be
quite complicated and still eludes me (I will tell all that I know about
that bellow).

For this reason my favorite way to change a table schema is to :

1. Create a new table with the new schema
2. Copy the data from the old table to the new table
3. Replace the old table with the new table

This guarantees minimal downtime for my users.

[image: Inline image 1]


Now, I would love to hear if there is a better way, but I'm still looking
for it.


If you want to dive deeper into the way partition schemas are handled, here
is what I understood after doing some testing (I provide a copy at the end
of this mail).

- Changing a table's columns with an ALTER TABLE will not change the
partition's schemas unless you use the keyword CASCADE.
- You can change also the partition schemas separately, with the syntax
"ALTER TABLE tablename PARTITION(part_key=part_value)".
- INSERT INTO and INSERT OVERWRITE will NOT change the partitions schemas,
unless for newly created partitions that will inherit the table's schema.

That was the part I'm pretty sure about, now the part I'm not so sure about
:

- It seems to me that Hive will perform a first match between column names
and column numbers using only the table's schema (not the partition
schemas).
- It will then, for each partition, use its partition schema to infer the
matching between column number and column types.

Because of that, changing the column ordering will probably break your
table, and changing the column types might break it as well.

This is why the "table swap" is still my favorite option.

Hope this helps,

Furcy

PS: Bellow are the few test I ran to "retro-engineer" what I describe above.


## TEST 1 : changing column ordering
###########################################################################


-- 1)
DROP TABLE IF EXISTS test ;

-- 2)
CREATE TABLE test (a STRING, b STRING, c STRING, d STRING) PARTITIONED BY
(p STRING)
STORED AS ORC
;


-- 3)
INSERT INTO TABLE test PARTITION(p="P1")
SELECT "a", "b", "c", "d" FROM (SELECT 1) T
;

INSERT INTO TABLE test PARTITION(p="P2")
SELECT "a", "b", "c", "d" FROM (SELECT 1) T
;


-- 4)
SELECT * FROM test ;
-- +---------+---------+---------+---------+---------+--+
-- | test.a  | test.b  | test.c  | test.d  | test.p  |
-- +---------+---------+---------+---------+---------+--+
-- | a       | b       | c       | d       | P1      |
-- | a       | b       | c       | d       | P2      |
-- +---------+---------+---------+---------+---------+--+


-- 5)
ALTER TABLE test CHANGE COLUMN a a STRING AFTER d ;
ALTER TABLE test PARTITION(p="P1") CHANGE COLUMN a a STRING AFTER d ;


-- 6)
SELECT * FROM test ;
-- +---------+---------+---------+---------+---------+--+
-- | test.b  | test.c  | test.d  | test.a  | test.p  |
-- +---------+---------+---------+---------+---------+--+
-- | a       | b       | c       | d       | P1      |
-- | a       | b       | c       | d       | P2      |
-- +---------+---------+---------+---------+---------+--+


-- 7)
INSERT INTO TABLE test PARTITION(p="P1")
SELECT "b", "c", "d", "a" FROM (SELECT 1) T
;
INSERT INTO TABLE test PARTITION(p="P2")
SELECT "b", "c", "d", "a" FROM (SELECT 1) T
;


-- 8)
SELECT * FROM test ;
-- +---------+---------+---------+---------+---------+--+
-- | test.b  | test.c  | test.d  | test.a  | test.p  |
-- +---------+---------+---------+---------+---------+--+
-- | a       | b       | c       | d       | P1      |
-- | b       | c       | d       | a       | P1      |
-- | a       | b       | c       | d       | P2      |
-- | b       | c       | d       | a       | P2      |
-- +---------+---------+---------+---------+---------+--+


-- 9)
INSERT OVERWRITE TABLE test PARTITION(p="P1")
VALUES ("b", "c", "d", "a")
;
INSERT OVERWRITE TABLE test PARTITION(p="P2")
VALUES ("b", "c", "d", "a")
;


-- 10)
SELECT * FROM test ;
-- +---------+---------+---------+---------+---------+--+
-- | test.b  | test.c  | test.d  | test.a  | test.p  |
-- +---------+---------+---------+---------+---------+--+
-- | b       | c       | d       | a       | P1      |
-- | b       | c       | d       | a       | P2      |
-- +---------+---------+---------+---------+---------+--+


-- 11)
describe test partition(p="P1") ;
-- 
+--------------------------+-----------------------+-----------------------+--+
-- |         col_name         |       data_type       |        comment
   |
-- 
+--------------------------+-----------------------+-----------------------+--+
-- | b                        | string                |
  |
-- | c                        | string                |
  |
-- | d                        | string                |
  |
-- | a                        | string                |
  |
-- | p                        | string                |
  |
-- |                          | NULL                  | NULL
   |
-- | # Partition Information  | NULL                  | NULL
   |
-- | # col_name               | data_type             | comment
  |
-- |                          | NULL                  | NULL
   |
-- | p                        | string                |
  |
-- 
+--------------------------+-----------------------+-----------------------+--+


-- 12)
describe test partition(p="P2") ;
-- 
+--------------------------+-----------------------+-----------------------+--+
-- |         col_name         |       data_type       |        comment
   |
-- 
+--------------------------+-----------------------+-----------------------+--+
-- | a                        | string                |
  |
-- | b                        | string                |
  |
-- | c                        | string                |
  |
-- | d                        | string                |
  |
-- | p                        | string                |
  |
-- |                          | NULL                  | NULL
   |
-- | # Partition Information  | NULL                  | NULL
   |
-- | # col_name               | data_type             | comment
  |
-- |                          | NULL                  | NULL
   |
-- | p                        | string                |
  |
-- 
+--------------------------+-----------------------+-----------------------+--+


-- 13)
ALTER TABLE test CHANGE COLUMN a a INT ;
SELECT * FROM test ;
-- +---------+---------+---------+---------+---------+--+
-- | test.b  | test.c  | test.d  | test.a  | test.p  |
-- +---------+---------+---------+---------+---------+--+
-- | b       | c       | d       | NULL    | P1      |
-- | b       | c       | d       | NULL    | P2      |
-- +---------+---------+---------+---------+---------+--+


-- 14)
ALTER TABLE test PARTITION(p="P1") CHANGE COLUMN a a INT ;
SELECT * FROM test ;
-- Error: java.io.IOException:
org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
org.apache.hadoop.io.IntWritable (state=,code=0)




## TEST 2 : adding new column
###########################################################################


-- 1)
DROP TABLE IF EXISTS test ;

-- 2)
CREATE TABLE test (a STRING, b STRING, c STRING, d STRING) PARTITIONED BY
(p STRING)
STORED AS ORC
;


-- 3)
INSERT INTO TABLE test PARTITION(p="P1")
SELECT "a", "b", "c", "d" FROM (SELECT 1) T
;

INSERT INTO TABLE test PARTITION(p="P2")
SELECT "a", "b", "c", "d" FROM (SELECT 1) T
;


-- 4)
SELECT * FROM test ;
-- +---------+---------+---------+---------+---------+--+
-- | test.a  | test.b  | test.c  | test.d  | test.p  |
-- +---------+---------+---------+---------+---------+--+
-- | a       | b       | c       | d       | P1      |
-- | a       | b       | c       | d       | P2      |
-- +---------+---------+---------+---------+---------+--+


-- 5)
ALTER TABLE test ADD COLUMNS (e STRING) ;
ALTER TABLE test PARTITION(p="P1") ADD COLUMNS (e STRING) ;


-- 6)
SELECT * FROM test ;
-- +---------+---------+---------+---------+---------+---------+--+
-- | test.a  | test.b  | test.c  | test.d  | test.e  | test.p  |
-- +---------+---------+---------+---------+---------+---------+--+
-- | a       | b       | c       | d       | NULL    | P1      |
-- | a       | b       | c       | d       | NULL    | P2      |
-- +---------+---------+---------+---------+---------+---------+--+


-- 7)
INSERT INTO TABLE test PARTITION(p="P1")
SELECT "a", "b", "c", "d", "e" FROM (SELECT 1) T
;
INSERT INTO TABLE test PARTITION(p="P2")
SELECT "a", "b", "c", "d", "e" FROM (SELECT 1) T
;


-- 8)
SELECT * FROM test ;
-- +---------+---------+---------+---------+---------+---------+--+
-- | test.a  | test.b  | test.c  | test.d  | test.e  | test.p  |
-- +---------+---------+---------+---------+---------+---------+--+
-- | a       | b       | c       | d       | e       | P1      |
-- | a       | b       | c       | d       | NULL    | P1      |
-- | a       | b       | c       | d       | NULL    | P2      |
-- | a       | b       | c       | d       | NULL    | P2      |
-- +---------+---------+---------+---------+---------+---------+--+


-- 9)
INSERT OVERWRITE TABLE test PARTITION(p="P1")
SELECT "a", "b", "c", "d", "e" FROM (SELECT 1) T
;
INSERT OVERWRITE TABLE test PARTITION(p="P2")
SELECT "a", "b", "c", "d", "e" FROM (SELECT 1) T
;

-- 10)
SELECT * FROM test ;
-- +---------+---------+---------+---------+---------+---------+--+
-- | test.a  | test.b  | test.c  | test.d  | test.e  | test.p  |
-- +---------+---------+---------+---------+---------+---------+--+
-- | a       | b       | c       | d       | e       | P1      |
-- | a       | b       | c       | d       | NULL    | P2      |
-- +---------+---------+---------+---------+---------+---------+--+









On Wed, Dec 7, 2016 at 11:32 AM, Rajat Khandelwal <
rajat.khandelwal@inmobi.com> wrote:

> So far, my understanding has been that in Hive tables, each partition has
> a schema and whenever you add a partition to a Hive table, the current
> table schema is copied into the partition schema. This should allow
> a seamless evolution of the schema. Recently I came across something that
> contradicts this. Hence, looking for some clarification.
>
> So we have a table, and we have fixed ORC format for it. The table has a
> schema say (a,b,c,d). We added one partition. The data is stored in the
> same order. When we query (a, b) from this partition, the data has the two
> columns in the correct order. Now we go ahead and change the schema of the
> *table* to (b,c,d,a). But the schema of the partition is still (a,b,c,d) as
> verified by doing describe extended on the partition. Now we issue the same
> query on the old partition projecting (a,b). Surprisingly, it projects (b,
> c). Is this the expected behaviour or am I missing something obvious?
>
> Coming back to the question of schema evolution, as business usecases
> grow, there is a need to add fields in the table. So am I restricted by
> hive to add my fields at the end only?
>
> Thanks
>
> --
> Rajat Khandelwal
> Software Engineer
>
>
>
>
> _____________________________________________________________
> The information contained in this communication is intended solely for the
> use of the individual or entity to whom it is addressed and others
> authorized to receive it. It may contain confidential or legally privileged
> information. If you are not the intended recipient you are hereby notified
> that any disclosure, copying, distribution or taking any action in reliance
> on the contents of this information is strictly prohibited and may be
> unlawful. If you have received this communication in error, please notify
> us immediately by responding to this email and then delete it from your
> system. The firm is neither liable for the proper and complete transmission
> of the information contained in this communication nor for any delay in its
> receipt.