Posted to user@hive.apache.org by Marcin Tustin <mt...@handybook.com> on 2015/12/18 15:14:47 UTC

Importing into a hive database with minimal unavailability or renaming a database

Hi All,

We import our production database into Hive on a schedule using Sqoop.
Unfortunately, Sqoop won't update the table schema in Hive when the
schema has changed in the source database.

To pick up schema changes, we therefore drop the Hive table first.

This leaves the data unavailable in Hive for a period of time.

I'd like to know how people on this list have tackled this issue. Is
there a way to get Sqoop to update the table schema in Hive, or can we
import into a staging Hive database and rename it?
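
For concreteness, the staging-then-rename pattern asked about here could look like the following sketch (database and table names are hypothetical; the Sqoop load targets a staging table first):

    -- Sqoop writes the fresh copy to a staging table, e.g.
    --   sqoop import ... --hive-import --hive-table prod_db.orders_staging
    -- Two metadata-only renames then swap the tables:
    ALTER TABLE prod_db.orders RENAME TO prod_db.orders_old;
    ALTER TABLE prod_db.orders_staging RENAME TO prod_db.orders;
    DROP TABLE prod_db.orders_old;

Because the renames touch only the metastore, the table is unavailable only for the instant between the two ALTER statements, rather than for the whole reload.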

Thanks,
Marcin

-- 
Want to work at Handy? Check out our culture deck and open roles 
<http://www.handy.com/careers>
Latest news <http://www.handy.com/press> at Handy
Handy just raised $50m 
<http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/> led 
by Fidelity


RE: Importing into a hive database with minimal unavailability or renaming a database

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi Marcin,

If the DDL change to the main table involves a new column, Hive's support here is limited. The schema in the metastore can be changed (ALTER TABLE ... ADD COLUMNS), but the existing data files will contain no values for the new column, and there is no way to populate it in place.
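
A minimal illustration of that limitation, with hypothetical names:

    -- The metastore accepts the schema change:
    ALTER TABLE orders ADD COLUMNS (discount_code STRING);
    -- but the existing data files carry no values for the new column,
    -- so pre-existing rows simply read back NULL:
    SELECT discount_code FROM orders LIMIT 5;
    -- and a plain (non-ACID) table offers no UPDATE to backfill it.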

 

1.    Thus, as discussed in the “Adding a new column to a table and updating it” thread, the easiest option is to create a new table that includes the new column and do an INSERT ... SELECT from the existing table, supplying values for the new column.

2.    So, the classic ETL issue. The way I would approach it is to use Sqoop to import the key column plus the changed columns into a staging table, and create a new table with the correct DDL in Hive. Insert into this new table taking the unchanged columns from the existing Hive table and the changed columns from the staging table, joining on source_table.KEY = staging_table.KEY. This is faster than recreating and repopulating the original table. Once the job is done, drop the original table and rename the new table to the original name. (A sketch of this step follows below.)
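
A sketch of that insert/select step, assuming hypothetical tables orders (the existing table), orders_staging (the Sqoop import of key + changed columns) and new_orders (created with the corrected DDL):

    -- Rebuild the table: unchanged columns from the existing table,
    -- changed/new columns from the staging table, joined on the key.
    INSERT OVERWRITE TABLE new_orders
    SELECT o.order_id,
           o.customer_id,        -- unchanged column from the existing table
           s.status,             -- changed column from the staging table
           s.discount_code       -- new column, populated from staging
    FROM orders o
    JOIN orders_staging s ON o.order_id = s.order_id;
    -- If the staging table holds only the rows that changed, use a
    -- LEFT OUTER JOIN and COALESCE(s.status, o.status) instead.
    ALTER TABLE orders RENAME TO orders_old;
    ALTER TABLE new_orders RENAME TO orders;
    DROP TABLE orders_old;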

 

Remember that this is effectively a CAP-theorem trade-off: the data will be eventually consistent, with availability favoured over consistency here. I hope this is correct.

 

 

HTH,

 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

 
