Posted to user@hive.apache.org by atreju <n....@gmail.com> on 2010/06/08 20:54:01 UTC

How to apply RDBMS table updates and deletes into Hadoop

To generate smart output from base data we need to copy some base tables
from a relational database into Hadoop. Some of them are big. Dumping the
entire tables into Hadoop every day is not an option, since there are 30+
tables and each would take several hours.

The approach we took is to load a full dump of each table first. Then,
each day or every 4-6 hours, we pull only the inserts/updates/deletes
since the last copy from the RDBMS (based on a date field in the table).
Using Hive, we outer join + union the new data with the existing data and
write the result into a new file. For example, if there are 100 rows in
Hadoop and, since the last Hadoop copy, 3 records were inserted, 2 updated
and 1 deleted in the RDBMS, the Hive query takes the 97 unchanged rows +
3 inserts + 2 updates and writes them into a new file. The other
applications, like Pig or Hive, then pick the most recent file when
selecting/loading data from those base table data files.
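For concreteness, here is a minimal sketch of that merge in HiveQL; the
table and column names (base_snapshot, base_snapshot_new, delta, id, col1,
col2, is_deleted) are made up for illustration, not our real schema:

-- Unchanged rows are the ones with no matching key in the delta;
-- inserts and updates come from the delta itself; deleted rows are dropped.
INSERT OVERWRITE TABLE base_snapshot_new
SELECT u.id, u.col1, u.col2
FROM (
  SELECT b.id, b.col1, b.col2
  FROM base_snapshot b
  LEFT OUTER JOIN delta d ON (b.id = d.id)
  WHERE d.id IS NULL                 -- rows untouched since the last copy
  UNION ALL
  SELECT d.id, d.col1, d.col2
  FROM delta d
  WHERE d.is_deleted = 0             -- inserts and updates, minus deletes
) u;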

This logic works fine in lower environments for small tables. With
production data, for a table of about 30GB, the incremental re-generation
of the file in Hadoop still takes several hours. I tried a compressed
version and it took even longer. I am not convinced that this is the best
we can do to handle updates and deletes, since we had to re-write the 29GB
of unchanged data in the 30GB file into a new file... and this is not the
biggest table.

I am thinking this must be a problem for many companies. What other
approaches are there to apply updates and deletes on base tables to the
Hadoop data files?

We have 4 data nodes and are using version 0.20.3.

Thanks!

Re: How to apply RDBMS table updates and deletes into Hadoop

Posted by Sonal Goyal <so...@gmail.com>.
Hi Atreju,

You have a very valid use case here. Data changes in your database,
and you want to pull only the changes into Hadoop. Have you
considered query-based data retrieval from the RDBMS to Hadoop? As
you already have a date field in your tables which marks the changed
rows, you can query on that field and bring only the changed records
into Hadoop.
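For example, the extraction query against the source database could be as
simple as the following (the table and column names here are invented, and
the cut-off timestamp would come from your own bookkeeping of the last
import):

SELECT *
FROM source_table
WHERE last_modified_date > '2010-06-08 00:00:00';  -- time of previous import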

I have been working on an open source framework for incremental
updates and fetching such records to Hadoop. You can check

http://code.google.com/p/hiho/
http://code.google.com/p/hiho/wiki/DatabaseImportFAQ

If you have any questions or need any changes, please send me an
offline mail.

Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal



On Wed, Jun 9, 2010 at 4:39 AM, Yongqiang He <he...@gmail.com> wrote:
> Hi,
>
> I think hive’s join + transform could be helpful here.
>
> Thanks
> Yongqiang
> On 6/8/10 3:58 PM, "Aaron Kimball" <aa...@cloudera.com> wrote:
>
> I think that this might be the way to go. In general, folding updates and
> deletes into datasets is a difficult problem due to the append-only nature
> of datasets.
>
> Something that might help you here is to partition your tables in Hive based
> on some well-distributed key. Then if you have a relatively small number of
> partitions affected by an incremental import (perhaps more recently-imported
> records are more likely to be updated? in this case, partition the tables by
> the month/week you imported them?) you can only perform the fold-in of the
> new deltas on the affected partitions. This should be much faster than a
> full table scan.
>
> Have you seen the Sqoop tool? It handles imports and exports between HDFS
> (and Hive) and RDBMS systems --  but currently can only import new records
> (and subsequent INSERTs); it can't handle updates/deletes. Sqoop is
> available at http://github.com/cloudera/sqoop -- it doesn't run on Apache
> 0.20.3, but works on CDH (Cloudera's Distribution for Hadoop) and Hadoop
> 0.21/trunk.
>
> This sort of capability is something I'm really interested in adding to
> Sqoop. If you've got a well-run process for doing this, I'd really
> appreciate your help adding this feature :) Send me an email off-list if
> you're interested. At the very least, I'd urge you to try out the tool.
>
> Cheers,
> - Aaron Kimball
>
> On Tue, Jun 8, 2010 at 8:54 PM, atreju <n....@gmail.com> wrote:
>
> To generate smart output from base data we need to copy some base tables
> from relational database into Hadoop. Some of them are big. To dump the
> entire table into Hadoop everyday is not an option since there are like 30+
> tables and each would take several hours.
>
> The methodology that we approached is to get the entire table dump first.
> Then each day or every 4-6 hours get only insert/update/delete since the
> last copy from RDBMS (based on a date field in the table). Using Hive do
> outer join + union the new data with existing data and write into a new
> file. For example, if there are a 100 rows in Hadoop, and in RDBMS 3 records
> inserted, 2 records updated and 1 deleted since the last Hadoop copy, then
> the Hive query will get 97 of the not changed data + 3 inserts + 2 updates
> and write into a new file. The other applications like Pig or Hive will pick
> the most recent file to use when selecting/loading data from those base
> table data files.
>
> This logic is working fine in lower environments for small size tables. With
> production data, for about 30GB size table, the incremental re-generation of
> the file in Hadoop is still taking several hours. I tried using zipped
> version and it took even longer time. I am not convinced that this is the
> best we can do to handle updates and deletes since we had to re-write 29GB
> unchanged data of the 30GB file again into a new file. ...and this is not
> the biggest table.
>
> I am thinking that this should be problem for many companies. What are the
> other approaches to apply updates and deletes on base tables to the
> Hadoop data files?
>
> We have 4 data nodes and using version 20.3.
>
> Thanks!
>
>
>
>

Re: How to apply RDBMS table updates and deletes into Hadoop

Posted by Edward Capriolo <ed...@gmail.com>.
I do not know what made me think back to this thread but it is a
rather cool idea. I guess now NoSQL is more attractive for this, but
if you wanted to do it...

a1
pk  first  last     lastmod  deleted
1   A      lincoln  1        0
2   bob    barker   2        0

a2
pk  first  last     lastmod  deleted
1   A      Lincoln  4        0
2   X      X        6        1

SELECT TRANSFORM (pk, first, last, lastmod, deleted)
  USING 'latest_not_delete.pl'
  AS (pk1, first1, last1, lastmod1, deleted1)
FROM (SELECT pk, first, last, lastmod, deleted
      FROM (SELECT * FROM a1 UNION ALL SELECT * FROM a2) x
      DISTRIBUTE BY pk SORT BY pk, lastmod ASC) u;

Where 'latest_not_delete.pl' would be a script that, for each pk, reads the
rows it receives and emits only the most recent one that is not deleted.
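For comparison, a rough pure-HiveQL way to get a similar result without an
external script (an untested sketch against the toy a1/a2 tables above;
ties on lastmod would still need a tiebreaker):

SELECT u.pk, u.first, u.last, u.lastmod, u.deleted
FROM (SELECT * FROM a1 UNION ALL SELECT * FROM a2) u
JOIN (SELECT pk, MAX(lastmod) AS max_lastmod
      FROM (SELECT * FROM a1 UNION ALL SELECT * FROM a2) v
      GROUP BY pk) latest
  ON (u.pk = latest.pk AND u.lastmod = latest.max_lastmod)
WHERE u.deleted = 0;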

Has anyone ever tried something like this?

On Wed, Jun 9, 2010 at 6:32 PM, atreju <n....@gmail.com> wrote:
> As an ideal solution, I have a suggestion to Hive contributors to make it
> look like Hive is doing insert/update/delete:
>
>
> This will require a special type of table creation syntax. I will call it as
> "UPDATABLE TABLE". The table will have 3 special columns that are defined in
> the create script:
> 1. The primary key column. (let's say: col_pk)
> 2. BIGINT type date column that shows the ms from Jan 1st, 1970 to actual
> data manipulation date/time in RDMBS. (dml_date)
> 3. TINYINT or BOOLEAN type column that will store 0 if the record is deleted
> and 1 if it is inserted or updated. (dml_action)
>
> This will require the RDBMS table to have PK and last update date column and
> deletes recorded in some other table by pk and date.
>
> On Day 1, the entire table is put into Hadoop, with addition of 2 extra
> columns: dml_date (bigint) and dml_action.
>
> On Day 2, we first find the max of dml_date from Hive table. Then we query
> from RDBMS inserts/updates/deletes since that date/time and write into a
> file with the correct dml_date/dml_action. The file goes to the same folder
> that our Hive table is in.
>
> Right now, if on Day 1 we had 100 rows and on Day 2, 10 rows a regular Hive
> table would show 110 rows. But, since this is a special table (UPDATABLE
> TABLE), every time this table is queried in Hive, Hive first run a
> map-reduce that would find the most recent (max(dml_date)) row per pk (group
> by col_pk) that is not deleted (dml_action!=0) and use that output in the
> user's query. That is the big idea!!
>
> Hive can have Insert/Update/Delete commands that would do nothing but create
> a file with rows of manipulated data with correct date and action.
>
> There can be a special "flush" kind of command that runs the MR and replaces
> all files in the table directory with single file. That can run weekly,
> monthly or may be after each time dml data received from RDBMS.
>
> Sqoop can have Hive interface that saves certain table attributes like pk
> column, RDBMS connection info,... and with one command from Hive, the Hive
> table gets updated from RDBMS....
>
> What do you think?
>
>
>
> On Tue, Jun 8, 2010 at 3:58 PM, Aaron Kimball <aa...@cloudera.com> wrote:
>>
>> I think that this might be the way to go. In general, folding updates and
>> deletes into datasets is a difficult problem due to the append-only nature
>> of datasets.
>>
>> Something that might help you here is to partition your tables in Hive
>> based on some well-distributed key. Then if you have a relatively small
>> number of partitions affected by an incremental import (perhaps more
>> recently-imported records are more likely to be updated? in this case,
>> partition the tables by the month/week you imported them?) you can only
>> perform the fold-in of the new deltas on the affected partitions. This
>> should be much faster than a full table scan.
>>
>> Have you seen the Sqoop tool? It handles imports and exports between HDFS
>> (and Hive) and RDBMS systems --  but currently can only import new records
>> (and subsequent INSERTs); it can't handle updates/deletes. Sqoop is
>> available at http://github.com/cloudera/sqoop -- it doesn't run on Apache
>> 0.20.3, but works on CDH (Cloudera's Distribution for Hadoop) and Hadoop
>> 0.21/trunk.
>>
>> This sort of capability is something I'm really interested in adding to
>> Sqoop. If you've got a well-run process for doing this, I'd really
>> appreciate your help adding this feature :) Send me an email off-list if
>> you're interested. At the very least, I'd urge you to try out the tool.
>>
>> Cheers,
>> - Aaron Kimball
>>
>>
>> On Tue, Jun 8, 2010 at 8:54 PM, atreju <n....@gmail.com> wrote:
>>>
>>> To generate smart output from base data we need to copy some base tables
>>> from relational database into Hadoop. Some of them are big. To dump the
>>> entire table into Hadoop everyday is not an option since there are like 30+
>>> tables and each would take several hours.
>>>
>>> The methodology that we approached is to get the entire table dump first.
>>> Then each day or every 4-6 hours get only insert/update/delete since the
>>> last copy from RDBMS (based on a date field in the table). Using Hive do
>>> outer join + union the new data with existing data and write into a new
>>> file. For example, if there are a 100 rows in Hadoop, and in RDBMS 3 records
>>> inserted, 2 records updated and 1 deleted since the last Hadoop copy, then
>>> the Hive query will get 97 of the not changed data + 3 inserts + 2 updates
>>> and write into a new file. The other applications like Pig or Hive will pick
>>> the most recent file to use when selecting/loading data from those base
>>> table data files.
>>>
>>> This logic is working fine in lower environments for small size tables.
>>> With production data, for about 30GB size table, the incremental
>>> re-generation of the file in Hadoop is still taking several hours. I tried
>>> using zipped version and it took even longer time. I am not convinced that
>>> this is the best we can do to handle updates and deletes since we had to
>>> re-write 29GB unchanged data of the 30GB file again into a new file. ...and
>>> this is not the biggest table.
>>>
>>> I am thinking that this should be problem for many companies. What are
>>> the other approaches to apply updates and deletes on base tables to the
>>> Hadoop data files?
>>>
>>> We have 4 data nodes and using version 20.3.
>>>
>>> Thanks!
>>>
>>
>>
>

Re: How to apply RDBMS table updates and deletes into Hadoop

Posted by atreju <n....@gmail.com>.
Thank you for your response. I understand... Just a few points before I
accept that this is too complicated :)

The main idea is to keep different versions of data under the same table,
similar to HBase, but at the row level, and without having to make the older
versions accessible from Hive, only the most recent one. You just need to
create an access layer that works on the most recent version of each row. If
you can think of a different way of uniquely identifying a row (to know its
versions) and a timestamp (or counter, or version number) to know the most
recent one, it doesn't have to be the columns that I specified before. It
could be a separate file that you create in the background (which could also
serve as the index file!).

Oracle has ROWID for the physical location of a row and locks it before the
data manipulation. Hadoop has the advantage of cheap storage and map-reduce.
So why not use that: keep all versions of the changed data and use map-reduce
to pick the most recent one. Accessing the data can get slower over time as
versions accumulate, but that can be fixed with a flush or a full replication
of the data from time to time, in a maintenance window chosen by the end user.

Hive is a great tool for accessing and manipulating Hadoop files. You are
doing an amazing job, and I have no idea what complications you face each
day. Just disregard this if I am talking nonsense, and keep up the good work!

Cheers!



Atreju,

>
> Your work is great. Personally I would not get too tied up in the
> transactional side of hive. Once you start dealing with locking and
> concurrency the problem becomes tricky.
>
> We hivers have a long time tradition on 'punting' on complicated stuff we
> do not want to deal with. :) Thus we only have 'Insert Overwrite' no 'insert
> update' :)
>
> Again, I think you wrote a really cool application. It would make a great
> use case, blog post, or a stand alone application. Call it HiveMysqlRsync or
> something :). However you mention several requirements that are specific to
> your application timestamp and primary key. If you can abstract all your
> application specific logic it could make it's way into hive. But it might be
> a stand alone program because hive to rdbms replication might be a little
> out of scope.
>
> Edward
>

Re: How to apply RDBMS table updates and deletes into Hadoop

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Jun 9, 2010 at 8:29 PM, atreju <n....@gmail.com> wrote:

> Insert/Update/Delete is nothing but "put" command for another file to the
> same directory. Only problem is during "flush" that would replace the files.
> I assume it would use the similar kind of logic of Hive's "insert overwrite"
> (create the file in a temporary space and replace the Hive file(s) when MR
> output is ready). Only for that "replace" (move command?) the flush has to
> talk to Namenode to wait for currently running MR jobs to finish and put
> others on hold until the file is replaced. That is of course the high level
> idea. I am not sure if it is practical.
>
>
> On Wed, Jun 9, 2010 at 4:56 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> When hive is running the map-reduce job, how do we handle concurrent
>> update/deletion/insertion ?
>>
>>
Atreju,

Your work is great. Personally, I would not get too tied up in the
transactional side of Hive. Once you start dealing with locking and
concurrency, the problem becomes tricky.

We Hivers have a long-time tradition of 'punting' on complicated stuff we do
not want to deal with. :) Thus we only have 'insert overwrite', no 'insert
update'. :)

Again, I think you wrote a really cool application. It would make a great
use case, blog post, or stand-alone application. Call it HiveMysqlRsync or
something :). However, you mention several requirements that are specific to
your application (a timestamp and a primary key). If you can abstract away
your application-specific logic, it could make its way into Hive. But it
might stay a stand-alone program, because Hive-to-RDBMS replication might be
a little out of scope.

Edward

Re: How to apply RDBMS table updates and deletes into Hadoop

Posted by atreju <n....@gmail.com>.
Insert/Update/Delete is nothing but a "put" of another file into the same
directory. The only problem is the "flush" that replaces the files. I assume
it would use logic similar to Hive's "insert overwrite" (create the file in a
temporary space and replace the Hive file(s) when the MR output is ready).
Only for that "replace" (a move command?) would the flush have to talk to the
NameNode, wait for currently running MR jobs to finish, and put other jobs on
hold until the file is replaced. That is of course the high-level idea; I am
not sure if it is practical.
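The "put" half already exists today as a non-overwriting load into the
table's directory; a sketch with a made-up path and table name:

-- Append a delta file to the table's directory; no existing data is rewritten.
LOAD DATA INPATH '/staging/orders_delta_20100609' INTO TABLE orders_updatable;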


On Wed, Jun 9, 2010 at 4:56 PM, Ted Yu <yu...@gmail.com> wrote:

> When hive is running the map-reduce job, how do we handle concurrent
> update/deletion/insertion ?
>
>

Re: How to apply RDBMS table updates and deletes into Hadoop

Posted by Ted Yu <yu...@gmail.com>.
When Hive is running the map-reduce job, how do we handle concurrent
updates/deletions/insertions?

On Wed, Jun 9, 2010 at 3:32 PM, atreju <n....@gmail.com> wrote:

> As an ideal solution, I have a suggestion to Hive contributors to make it
> look like Hive is doing insert/update/delete:
>
>
> This will require a special type of table creation syntax. I will call it
> as
> "UPDATABLE TABLE". The table will have 3 special columns that are defined
> in
> the create script:
> 1. The primary key column. (let's say: col_pk)
> 2. BIGINT type date column that shows the ms from Jan 1st, 1970 to actual
> data manipulation date/time in RDMBS. (dml_date)
> 3. TINYINT or BOOLEAN type column that will store 0 if the record is
> deleted
> and 1 if it is inserted or updated. (dml_action)
>
> This will require the RDBMS table to have PK and last update date column
> and
> deletes recorded in some other table by pk and date.
>
> On Day 1, the entire table is put into Hadoop, with addition of 2 extra
> columns: dml_date (bigint) and dml_action.
>
> On Day 2, we first find the max of dml_date from Hive table. Then we query
> from RDBMS inserts/updates/deletes since that date/time and write into a
> file with the correct dml_date/dml_action. The file goes to the same folder
> that our Hive table is in.
>
> Right now, if on Day 1 we had 100 rows and on Day 2, 10 rows a regular Hive
> table would show 110 rows. But, since this is a special table (UPDATABLE
> TABLE), every time this table is queried in Hive, Hive first run a
> map-reduce that would find the most recent (max(dml_date)) row per pk
> (group
> by col_pk) that is not deleted (dml_action!=0) and use that output in the
> user's query. That is the big idea!!
>
> Hive can have Insert/Update/Delete commands that would do nothing but
> create
> a file with rows of manipulated data with correct date and action.
>
> There can be a special "flush" kind of command that runs the MR and
> replaces
> all files in the table directory with single file. That can run weekly,
> monthly or may be after each time dml data received from RDBMS.
>
> Sqoop can have Hive interface that saves certain table attributes like pk
> column, RDBMS connection info,... and with one command from Hive, the Hive
> table gets updated from RDBMS....
>
> What do you think?
>
>
>
> On Tue, Jun 8, 2010 at 3:58 PM, Aaron Kimball <aa...@cloudera.com> wrote:
>
> > I think that this might be the way to go. In general, folding updates and
> > deletes into datasets is a difficult problem due to the append-only
> nature
> > of datasets.
> >
> > Something that might help you here is to partition your tables in Hive
> > based on some well-distributed key. Then if you have a relatively small
> > number of partitions affected by an incremental import (perhaps more
> > recently-imported records are more likely to be updated? in this case,
> > partition the tables by the month/week you imported them?) you can only
> > perform the fold-in of the new deltas on the affected partitions. This
> > should be much faster than a full table scan.
> >
> > Have you seen the Sqoop tool? It handles imports and exports between HDFS
> > (and Hive) and RDBMS systems --  but currently can only import new
> records
> > (and subsequent INSERTs); it can't handle updates/deletes. Sqoop is
> > available at http://github.com/cloudera/sqoop -- it doesn't run on
> Apache
> > 0.20.3, but works on CDH (Cloudera's Distribution for Hadoop) and Hadoop
> > 0.21/trunk.
> >
> > This sort of capability is something I'm really interested in adding to
> > Sqoop. If you've got a well-run process for doing this, I'd really
> > appreciate your help adding this feature :) Send me an email off-list if
> > you're interested. At the very least, I'd urge you to try out the tool.
> >
> > Cheers,
> > - Aaron Kimball
> >
> >
> > On Tue, Jun 8, 2010 at 8:54 PM, atreju <n....@gmail.com> wrote:
> >
> >> To generate smart output from base data we need to copy some base tables
> >> from relational database into Hadoop. Some of them are big. To dump the
> >> entire table into Hadoop everyday is not an option since there are like
> 30+
> >> tables and each would take several hours.
> >>
> >> The methodology that we approached is to get the entire table dump
> first.
> >> Then each day or every 4-6 hours get only insert/update/delete since the
> >> last copy from RDBMS (based on a date field in the table). Using Hive do
> >> outer join + union the new data with existing data and write into a new
> >> file. For example, if there are a 100 rows in Hadoop, and in RDBMS 3
> records
> >> inserted, 2 records updated and 1 deleted since the last Hadoop copy,
> then
> >> the Hive query will get 97 of the not changed data + 3 inserts + 2
> updates
> >> and write into a new file. The other applications like Pig or Hive will
> pick
> >> the most recent file to use when selecting/loading data from those base
> >> table data files.
> >>
> >> This logic is working fine in lower environments for small size tables.
> >> With production data, for about 30GB size table, the incremental
> >> re-generation of the file in Hadoop is still taking several hours. I
> tried
> >> using zipped version and it took even longer time. I am not convinced
> that
> >> this is the best we can do to handle updates and deletes since we had to
> >> re-write 29GB unchanged data of the 30GB file again into a new file.
> ...and
> >> this is not the biggest table.
> >>
> >> I am thinking that this should be problem for many companies. What are
> the
> >> other approaches to apply updates and deletes on base tables to the
> >> Hadoop data files?
> >>
> >> We have 4 data nodes and using version 20.3.
> >>
> >> Thanks!
> >>
> >>
> >
> >
>

Re: How to apply RDBMS table updates and deletes into Hadoop

Posted by atreju <n....@gmail.com>.
As an ideal solution, I have a suggestion for the Hive contributors to make
it look like Hive is doing insert/update/delete:


This would require a special type of table creation syntax; I will call it an
"UPDATABLE TABLE". The table has 3 special columns that are defined in the
create script (a plain-Hive sketch of such a table follows the list):
1. The primary key column (let's say: col_pk).
2. A BIGINT date column holding the milliseconds from Jan 1st, 1970 to the
actual data manipulation date/time in the RDBMS (dml_date).
3. A TINYINT or BOOLEAN column that stores 0 if the record was deleted and 1
if it was inserted or updated (dml_action).
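The sketch below uses only table syntax that exists today; the UPDATABLE
keyword itself is the proposal, and the business columns are invented for
illustration:

CREATE TABLE customer_updatable (
  col_pk     BIGINT,    -- primary key carried over from the RDBMS
  name       STRING,    -- ...ordinary business columns...
  dml_date   BIGINT,    -- ms since Jan 1st, 1970 of the RDBMS insert/update/delete
  dml_action TINYINT    -- 0 = deleted, 1 = inserted or updated
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';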

This requires the RDBMS table to have a PK and a last-update-date column, and
deletes to be recorded in some other table by pk and date.

On Day 1, the entire table is put into Hadoop, with the addition of the 2
extra columns: dml_date (bigint) and dml_action.

On Day 2, we first find the max dml_date from the Hive table. Then we query
the RDBMS for the inserts/updates/deletes since that date/time and write them
into a file with the correct dml_date/dml_action. The file goes into the same
folder that our Hive table is in.

Right now, if on Day 1 we had 100 rows and on Day 2 we add 10 rows, a regular
Hive table would show 110 rows. But since this is a special table (UPDATABLE
TABLE), every time the table is queried in Hive, Hive first runs a map-reduce
that finds the most recent (max(dml_date)) row per pk (group by col_pk) that
is not deleted (dml_action != 0) and uses that output in the user's query.
That is the big idea!
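Roughly, the map-reduce Hive would run under the covers could look like the
following (using the hypothetical customer_updatable table sketched above;
ties on dml_date would need a tiebreaker):

-- Latest non-deleted version of each pk.
SELECT t.col_pk, t.name, t.dml_date, t.dml_action
FROM customer_updatable t
JOIN (SELECT col_pk, MAX(dml_date) AS max_dml_date
      FROM customer_updatable
      GROUP BY col_pk) latest
  ON (t.col_pk = latest.col_pk AND t.dml_date = latest.max_dml_date)
WHERE t.dml_action != 0;

The "flush" described below would then be little more than this same SELECT
wrapped in an INSERT OVERWRITE back into the table.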

Hive could have Insert/Update/Delete commands that do nothing but create a
file of the manipulated rows with the correct date and action.

There could be a special "flush" kind of command that runs the MR and
replaces all files in the table directory with a single file. That could run
weekly, monthly, or maybe each time DML data is received from the RDBMS.

Sqoop could have a Hive interface that saves certain table attributes (pk
column, RDBMS connection info, ...) so that, with one command from Hive, the
Hive table gets updated from the RDBMS...

What do you think?



On Tue, Jun 8, 2010 at 3:58 PM, Aaron Kimball <aa...@cloudera.com> wrote:

> I think that this might be the way to go. In general, folding updates and
> deletes into datasets is a difficult problem due to the append-only nature
> of datasets.
>
> Something that might help you here is to partition your tables in Hive
> based on some well-distributed key. Then if you have a relatively small
> number of partitions affected by an incremental import (perhaps more
> recently-imported records are more likely to be updated? in this case,
> partition the tables by the month/week you imported them?) you can only
> perform the fold-in of the new deltas on the affected partitions. This
> should be much faster than a full table scan.
>
> Have you seen the Sqoop tool? It handles imports and exports between HDFS
> (and Hive) and RDBMS systems --  but currently can only import new records
> (and subsequent INSERTs); it can't handle updates/deletes. Sqoop is
> available at http://github.com/cloudera/sqoop -- it doesn't run on Apache
> 0.20.3, but works on CDH (Cloudera's Distribution for Hadoop) and Hadoop
> 0.21/trunk.
>
> This sort of capability is something I'm really interested in adding to
> Sqoop. If you've got a well-run process for doing this, I'd really
> appreciate your help adding this feature :) Send me an email off-list if
> you're interested. At the very least, I'd urge you to try out the tool.
>
> Cheers,
> - Aaron Kimball
>
>
> On Tue, Jun 8, 2010 at 8:54 PM, atreju <n....@gmail.com> wrote:
>
>> To generate smart output from base data we need to copy some base tables
>> from relational database into Hadoop. Some of them are big. To dump the
>> entire table into Hadoop everyday is not an option since there are like 30+
>> tables and each would take several hours.
>>
>> The methodology that we approached is to get the entire table dump first.
>> Then each day or every 4-6 hours get only insert/update/delete since the
>> last copy from RDBMS (based on a date field in the table). Using Hive do
>> outer join + union the new data with existing data and write into a new
>> file. For example, if there are a 100 rows in Hadoop, and in RDBMS 3 records
>> inserted, 2 records updated and 1 deleted since the last Hadoop copy, then
>> the Hive query will get 97 of the not changed data + 3 inserts + 2 updates
>> and write into a new file. The other applications like Pig or Hive will pick
>> the most recent file to use when selecting/loading data from those base
>> table data files.
>>
>> This logic is working fine in lower environments for small size tables.
>> With production data, for about 30GB size table, the incremental
>> re-generation of the file in Hadoop is still taking several hours. I tried
>> using zipped version and it took even longer time. I am not convinced that
>> this is the best we can do to handle updates and deletes since we had to
>> re-write 29GB unchanged data of the 30GB file again into a new file. ...and
>> this is not the biggest table.
>>
>> I am thinking that this should be problem for many companies. What are the
>> other approaches to apply updates and deletes on base tables to the
>> Hadoop data files?
>>
>> We have 4 data nodes and using version 20.3.
>>
>> Thanks!
>>
>>
>
>

Re: How to apply RDBMS table updates and deletes into Hadoop

Posted by Yongqiang He <he...@gmail.com>.
Hi,

I think hive's join + transform could be helpful here.

Thanks
Yongqiang
On 6/8/10 3:58 PM, "Aaron Kimball" <aa...@cloudera.com> wrote:

> I think that this might be the way to go. In general, folding updates and
> deletes into datasets is a difficult problem due to the append-only nature of
> datasets.
> 
> Something that might help you here is to partition your tables in Hive based
> on some well-distributed key. Then if you have a relatively small number of
> partitions affected by an incremental import (perhaps more recently-imported
> records are more likely to be updated? in this case, partition the tables by
> the month/week you imported them?) you can only perform the fold-in of the new
> deltas on the affected partitions. This should be much faster than a full
> table scan.
> 
> Have you seen the Sqoop tool? It handles imports and exports between HDFS (and
> Hive) and RDBMS systems --  but currently can only import new records (and
> subsequent INSERTs); it can't handle updates/deletes. Sqoop is available at
> http://github.com/cloudera/sqoop -- it doesn't run on Apache 0.20.3, but works
> on CDH (Cloudera's Distribution for Hadoop) and Hadoop 0.21/trunk.
> 
> This sort of capability is something I'm really interested in adding to Sqoop.
> If you've got a well-run process for doing this, I'd really appreciate your
> help adding this feature :) Send me an email off-list if you're interested. At
> the very least, I'd urge you to try out the tool.
> 
> Cheers,
> - Aaron Kimball
> 
> On Tue, Jun 8, 2010 at 8:54 PM, atreju <n....@gmail.com> wrote:
>> To generate smart output from base data we need to copy some base tables from
>> relational database into Hadoop. Some of them are big. To dump the entire
>> table into Hadoop everyday is not an option since there are like 30+ tables
>> and each would take several hours.
>>  
>> The methodology that we approached is to get the entire table dump first.
>> Then each day or every 4-6 hours get only insert/update/delete since the last
>> copy from RDBMS (based on a date field in the table). Using Hive do outer
>> join + union the new data with existing data and write into a new file. For
>> example, if there are a 100 rows in Hadoop, and in RDBMS 3 records inserted,
>> 2 records updated and 1 deleted since the last Hadoop copy, then the Hive
>> query will get 97 of the not changed data + 3 inserts + 2 updates and write
>> into a new file. The other applications like Pig or Hive will pick the most
>> recent file to use when selecting/loading data from those base table data
>> files.
>>  
>> This logic is working fine in lower environments for small size tables. With
>> production data, for about 30GB size table, the incremental re-generation of
>> the file in Hadoop is still taking several hours. I tried using zipped
>> version and it took even longer time. I am not convinced that this is the
>> best we can do to handle updates and deletes since we had to re-write 29GB
>> unchanged data of the 30GB file again into a new file. ...and this is not the
>> biggest table.
>>  
>> I am thinking that this should be problem for many companies. What are the
>> other approaches to apply updates and deletes on base tables to the
>> Hadoop data files?
>>  
>> We have 4 data nodes and using version 20.3.
>>  
>> Thanks!
>>  
> 
> 


Re: How to apply RDBMS table updates and deletes into Hadoop

Posted by Aaron Kimball <aa...@cloudera.com>.
I think that this might be the way to go. In general, folding updates and
deletes into datasets is a difficult problem due to the append-only nature
of datasets.

Something that might help you here is to partition your tables in Hive based
on some well-distributed key. Then if you have a relatively small number of
partitions affected by an incremental import (perhaps more recently-imported
records are more likely to be updated? in this case, partition the tables by
the month/week you imported them?) you can only perform the fold-in of the
new deltas on the affected partitions. This should be much faster than a
full table scan.
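As a rough illustration of that partition-level fold-in (all names below are
made up; assume the snapshot table is partitioned by an import_month string
and the current delta only touches one month):

-- Rewrite only the affected partition rather than the whole table.
INSERT OVERWRITE TABLE base_snapshot PARTITION (import_month='2010-06')
SELECT u.id, u.col1, u.col2
FROM (
  SELECT b.id, b.col1, b.col2
  FROM base_snapshot b
  LEFT OUTER JOIN delta d ON (b.id = d.id)
  WHERE b.import_month = '2010-06' AND d.id IS NULL
  UNION ALL
  SELECT d.id, d.col1, d.col2 FROM delta d WHERE d.is_deleted = 0
) u;

(Whether your Hive version lets a query read the partition it is overwriting
is worth verifying; if not, stage the result into a twin table first.)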

Have you seen the Sqoop tool? It handles imports and exports between HDFS
(and Hive) and RDBMS systems --  but currently can only import new records
(and subsequent INSERTs); it can't handle updates/deletes. Sqoop is
available at http://github.com/cloudera/sqoop -- it doesn't run on Apache
0.20.3, but works on CDH (Cloudera's Distribution for Hadoop) and Hadoop
0.21/trunk.

This sort of capability is something I'm really interested in adding to
Sqoop. If you've got a well-run process for doing this, I'd really
appreciate your help adding this feature :) Send me an email off-list if
you're interested. At the very least, I'd urge you to try out the tool.

Cheers,
- Aaron Kimball

On Tue, Jun 8, 2010 at 8:54 PM, atreju <n....@gmail.com> wrote:

> To generate smart output from base data we need to copy some base tables
> from relational database into Hadoop. Some of them are big. To dump the
> entire table into Hadoop everyday is not an option since there are like 30+
> tables and each would take several hours.
>
> The methodology that we approached is to get the entire table dump first.
> Then each day or every 4-6 hours get only insert/update/delete since the
> last copy from RDBMS (based on a date field in the table). Using Hive do
> outer join + union the new data with existing data and write into a new
> file. For example, if there are a 100 rows in Hadoop, and in RDBMS 3 records
> inserted, 2 records updated and 1 deleted since the last Hadoop copy, then
> the Hive query will get 97 of the not changed data + 3 inserts + 2 updates
> and write into a new file. The other applications like Pig or Hive will pick
> the most recent file to use when selecting/loading data from those base
> table data files.
>
> This logic is working fine in lower environments for small size tables.
> With production data, for about 30GB size table, the incremental
> re-generation of the file in Hadoop is still taking several hours. I tried
> using zipped version and it took even longer time. I am not convinced that
> this is the best we can do to handle updates and deletes since we had to
> re-write 29GB unchanged data of the 30GB file again into a new file. ...and
> this is not the biggest table.
>
> I am thinking that this should be problem for many companies. What are the
> other approaches to apply updates and deletes on base tables to the
> Hadoop data files?
>
> We have 4 data nodes and using version 20.3.
>
> Thanks!
>
>

Re: How to apply RDBMS table updates and deletes into Hadoop

Posted by Maxim Veksler <ma...@vekslers.org>.
Hi,

On Wed, Jun 9, 2010 at 9:26 PM, Edward Capriolo <ed...@gmail.com> wrote:
>
> On Tue, Jun 8, 2010 at 2:54 PM, atreju <n....@gmail.com> wrote:
>
>> To generate smart output from base data we need to copy some base tables
>> from relational database into Hadoop. Some of them are big. To dump the
>> entire table into Hadoop everyday is not an option since there are like 30+
>> tables and each would take several hours.
>>
>> The methodology that we approached is to get the entire table dump first.
>> Then each day or every 4-6 hours get only insert/update/delete since the
>> last copy from RDBMS (based on a date field in the table). Using Hive do
>> outer join + union the new data with existing data and write into a new
>> file. For example, if there are a 100 rows in Hadoop, and in RDBMS 3 records
>> inserted, 2 records updated and 1 deleted since the last Hadoop copy, then
>> the Hive query will get 97 of the not changed data + 3 inserts + 2 updates
>> and write into a new file. The other applications like Pig or Hive will pick
>> the most recent file to use when selecting/loading data from those base
>> table data files.
>>
>
This solution is very interesting.

Could you please describe further the logic for filtering out the deleted
records, and how you handle UPDATEs for existing records in Hive (Hadoop
files)?

Thank you,
Maxim.

Re: How to apply RDBMS table updates and deletes into Hadoop

Posted by Edward Capriolo <ed...@gmail.com>.
On Tue, Jun 8, 2010 at 2:54 PM, atreju <n....@gmail.com> wrote:

> To generate smart output from base data we need to copy some base tables
> from relational database into Hadoop. Some of them are big. To dump the
> entire table into Hadoop everyday is not an option since there are like 30+
> tables and each would take several hours.
>
> The methodology that we approached is to get the entire table dump first.
> Then each day or every 4-6 hours get only insert/update/delete since the
> last copy from RDBMS (based on a date field in the table). Using Hive do
> outer join + union the new data with existing data and write into a new
> file. For example, if there are a 100 rows in Hadoop, and in RDBMS 3 records
> inserted, 2 records updated and 1 deleted since the last Hadoop copy, then
> the Hive query will get 97 of the not changed data + 3 inserts + 2 updates
> and write into a new file. The other applications like Pig or Hive will pick
> the most recent file to use when selecting/loading data from those base
> table data files.
>
> This logic is working fine in lower environments for small size tables.
> With production data, for about 30GB size table, the incremental
> re-generation of the file in Hadoop is still taking several hours. I tried
> using zipped version and it took even longer time. I am not convinced that
> this is the best we can do to handle updates and deletes since we had to
> re-write 29GB unchanged data of the 30GB file again into a new file. ...and
> this is not the biggest table.
>
> I am thinking that this should be problem for many companies. What are the
> other approaches to apply updates and deletes on base tables to the
> Hadoop data files?
>
> We have 4 data nodes and using version 20.3.
>
> Thanks!
>
>

Very interesting. An important note about compression is that you have to
carefully choose the codec (gzip, lzo), block size (1-N), compression type
(block, record), etc. Choosing the wrong compression variables results in
anti-compression: large files and slow performance.
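For what it's worth, a sketch of the kind of session settings involved
(property names are the Hadoop 0.20-era ones as I recall them; double-check
them against your build), e.g. to get block-compressed SequenceFile output
instead of gzipping one big text file:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;  -- BLOCK vs RECORD matters for SequenceFiles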

You said:
for about 30GB size table, the incremental re-generation of the file in
Hadoop is still taking several hours.

It sounds like you have worked up a solution that will scale. With the
exception of your backend database (MySQL or whatever), you should be able
to increase the number of DataNodes/TaskTrackers and get (roughly linear)
performance gains. Add some nodes and see if your run time shrinks.

I am interested to know. Great work. Very cool.