Posted to user@hive.apache.org by Elliot West <te...@gmail.com> on 2015/12/17 17:21:24 UTC

Synchronizing Hive metastores across clusters

Hello,

I'm thinking about the steps required to repeatedly push Hive datasets out
from a traditional Hadoop cluster into a parallel cloud-based cluster. This
is not a one-off; it needs to be a constantly running sync process. As new
tables and partitions are added in one cluster, they need to be synced to
the cloud cluster. Assuming for a moment that I have the HDFS data syncing
working, I'm wondering what steps I need to take to reliably ship the
HCatalog metadata across. I use HCatalog as the point of truth as to when
data is available and where it is located, and so I think that metadata is
a critical element to replicate in the cloud-based cluster.

Does anyone have any recommendations on how to achieve this in practice?
One issue (of many, I suspect) is that Hive appears to store
table/partition locations internally as absolute, fully qualified URLs;
therefore, unless the target cloud cluster is similarly named and
configured, some path transformation step will be needed as part of the
synchronisation process.
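
To illustrate the kind of transformation I have in mind, a minimal sketch
(the source and target roots here are made up):

    import java.net.URI;

    public class LocationRewriter {
        // Hypothetical mapping: source namenode URI -> cloud filesystem URI.
        private static final URI SOURCE_ROOT = URI.create("hdfs://source-nn:8020/");
        private static final URI TARGET_ROOT = URI.create("s3a://replica-warehouse/");

        /** Rewrites a fully qualified source location to its replica equivalent. */
        public static String rewrite(String location) {
            URI uri = URI.create(location);
            URI relative = SOURCE_ROOT.relativize(uri);
            if (relative.equals(uri)) {
                throw new IllegalArgumentException("Not under source root: " + location);
            }
            return TARGET_ROOT.resolve(relative).toString();
        }

        public static void main(String[] args) {
            // Prints s3a://replica-warehouse/user/hive/warehouse/sales/dt=2015-12-17
            System.out.println(rewrite("hdfs://source-nn:8020/user/hive/warehouse/sales/dt=2015-12-17"));
        }
    }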

I'd appreciate any suggestions, thoughts, or experiences related to this.

Cheers - Elliot.

Re: Synchronizing Hive metastores across clusters

Posted by Jörn Franke <jo...@gmail.com>.
Hive has the EXPORT/IMPORT commands; alternatively, Falcon + Oozie.
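
For example, roughly (a sketch driving the HiveQL over JDBC; hostnames,
table, and export path are made up, and IMPORT also accepts a LOCATION
clause to relocate the data on the target):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ExportImportSketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // On the source cluster: write the partition's metadata + data to an export dir.
            try (Connection src = DriverManager.getConnection("jdbc:hive2://source-hs2:10000/default");
                 Statement s = src.createStatement()) {
                s.execute("EXPORT TABLE sales PARTITION (dt='2015-12-17') TO '/tmp/exports/sales_20151217'");
            }

            // ... ship /tmp/exports/sales_20151217 to the cloud cluster (e.g. with DistCp) ...

            // On the destination cluster: recreate the partition from the shipped copy.
            try (Connection dst = DriverManager.getConnection("jdbc:hive2://cloud-hs2:10000/default");
                 Statement s = dst.createStatement()) {
                s.execute("IMPORT TABLE sales PARTITION (dt='2015-12-17') FROM '/tmp/exports/sales_20151217'");
            }
        }
    }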


Re: Synchronizing Hive metastores across clusters

Posted by Elliot West <te...@gmail.com>.
Hi Mich,

Thanks for your reply. The cloud cluster is to be used for read-only
analytics, so it is effectively a one-way, stand-by arrangement. I'll take
a look at your suggested technologies, as I'm not familiar with them.

Thanks - Elliot.


RE: Synchronizing Hive metastores across clusters

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi Elliot.

 

Strictly speaking, I believe your question is about what happens when the metastore on the replicate side gets out of sync, so that any query against a cloud table only shows, say, the partitions as of time T0 rather than T1?

 

I don't know what your metastore is on. With ours on Oracle, this can happen when there is a network glitch, whereupon the metadata tables can get out of sync. Each table has a Materialized View (MV) log that keeps the deltas for that table and pushes them to the replicate table every, say, 30 seconds (configurable). So the scenarios are:

 

1.    Network issue: the deltas cannot be delivered and the replicate table is out of sync. The data is kept in the primary table's MV log until the network is back and the next scheduled refresh delivers it. There could be a backlog.

2.    The replicate table gets out of sync. In this case the Oracle package DBMS_MVIEW.REFRESH is used to re-sync the replicate table (see the sketch below). Again, this is best done when there is no activity on the primary.
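
For case 2, the manual re-sync amounts to something like the following (a
JDBC sketch; connection details and the MV name are made up):

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.DriverManager;

    public class MvResync {
        public static void main(String[] args) throws Exception {
            try (Connection c = DriverManager.getConnection(
                     "jdbc:oracle:thin:@oradb:1521/ORCL", "hive_meta", "secret");
                 CallableStatement cs = c.prepareCall(
                     "BEGIN DBMS_MVIEW.REFRESH(list => 'HIVE.PARTITIONS_MV', method => 'F'); END;")) {
                cs.execute();  // 'F' = fast refresh from the MV log; 'C' = complete rebuild
            }
        }
    }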

 

 

We use Oracle for our metastore as the Bank has many instances of Oracle, Sybase and Microsoft SQL Server, and it is pretty easy for the DBAs to look after a small Hive schema on an Oracle instance.

 

I gather that if we build a model based on what classic databases do to keep reporting tables in sync (which is in essence what we are talking about here), then we should be OK.

 

That takes care of the metadata, but I noticed that you also mention syncing the data on HDFS to the replicate. Many people seem to go for DistCp <http://hadoop.apache.org/common/docs/current/distcp.html>, an application shipped with Hadoop that uses a MapReduce job to copy files in parallel. There is also a good article here <https://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920> on large-scale Hadoop data migration at Facebook.
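
If you end up writing your own sync service, DistCp can also be driven
programmatically; a sketch against the Hadoop 2.x tools API (the paths are
made up):

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class DataSync {
        public static void main(String[] args) throws Exception {
            DistCpOptions options = new DistCpOptions(
                Collections.singletonList(new Path("hdfs://source-nn:8020/user/hive/warehouse/sales")),
                new Path("s3a://replica-warehouse/user/hive/warehouse/sales"));
            options.setSyncFolder(true);  // behave like -update: copy only changed files
            new DistCp(new Configuration(), options).execute();  // runs the MapReduce copy job
        }
    }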

 

 

HTH,

 

 

Mich Talebzadeh

 


http://talebzadehmich.wordpress.com

 



Re: Synchronizing Hive metastores across clusters

Posted by Elliot West <te...@gmail.com>.
Hi Mich,

In your scenario, is there any coordination of the data syncing on HDFS
with the metadata in HCatalog? I.e., could a situation occur where the
replicated metastore shows a partition as 'present' yet the data that
backs the partition has not yet arrived on the replica filesystem? I
imagine one could avoid this by snapshotting the source metastore, then
syncing HDFS, and then finally shipping the snapshot to the replica(?).
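
In pseudo-Java, the ordering I'm imagining would look something like this
(every type and method here is hypothetical):

    import java.util.List;

    public class SnapshotThenSync {
        interface Metastore { Snapshot snapshot(); void apply(Snapshot s); }
        interface Snapshot { List<String> partitionLocations(); }
        interface DataSync { void copy(String source, String target); }

        static void replicate(Metastore source, Metastore replica, DataSync dataSync) {
            Snapshot snapshot = source.snapshot();           // 1. fix a consistent metadata view
            for (String location : snapshot.partitionLocations()) {
                dataSync.copy(location, rewrite(location));  // 2. land the data first
            }
            replica.apply(snapshot);                         // 3. expose partitions only afterwards
        }

        static String rewrite(String location) {             // placeholder path mapping
            return location.replace("hdfs://source-nn:8020", "s3a://replica-warehouse");
        }
    }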

Thanks - Elliot.


RE: Synchronizing Hive metastores across clusters

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Sounds like one-way replication of the metastore. Depending on your metastore platform, that could be achieved pretty easily.

 

Mine is Oracle and I use Materialised View replication, which is pretty good though not the latest technology. Alternatives would be GoldenGate or SAP Replication Server.

 

HTH,

 

Mich

 


RE: Synchronizing Hive metastores across clusters

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Are both clusters in active/active mode, or is the cloud-based cluster standby?

 


Re: Synchronizing Hive metastores across clusters

Posted by Elliot West <te...@gmail.com>.
Following up on this: I've spent some time trying to evaluate the Hive
replication features, but in truth it's been more an exercise in trying to
get them working! I thought I'd share my findings:

   - Conceptually this feature can sync (nearly) all Hive metadata and data
   changes between two clusters.
   - On the source cluster you require at least Hive 1.1.0
   (DbNotificationListener dependency).
   - On the destination cluster you require at least Hive 0.8.0 (IMPORT
   command dependency).
   - The environment in which you execute replication tasks requires at
   least Hive 1.2.0 (ReplicationTask dependency), although this is at the
   JAR level only (i.e. you do not need a 1.2.0 metastore running, etc.).
   - It is not an 'out of the box' solution; you must still write some kind
   of service that instantiates, schedules, and executes ReplicationTasks.
   This can be quite simple (see the sketch after this list).
   - Exporting into S3 using Hive on EMR (AMI 4.2.0) is currently broken,
   but apparently work is underway to fix it.
   - Data inserted into Hive tables using HCatalog writers will not be
   automatically synced (HIVE-9577).
   - Mappings can be applied to destination database names, table names,
   and table and partition locations.
   - All tables at the destination are managed, even if they are external
   at the source.
   - The source and destination can be running different Hadoop
   distributions and use differing metastore database providers.
   - There is no real user level documentation.
   - It might be nice to add a Kafka based NotificationListener.
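
For reference, a minimal sketch of the service loop mentioned above,
assuming the Hive 1.2.0 HCatClient API (event-id persistence, scheduling,
staging-directory providers, name mappings and error handling all omitted;
'my_db' is made up):

    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hive.hcatalog.api.HCatClient;
    import org.apache.hive.hcatalog.api.repl.Command;
    import org.apache.hive.hcatalog.api.repl.ReplicationTask;

    public class ReplicationPoller {
        public static void main(String[] args) throws Exception {
            HCatClient client = HCatClient.create(new Configuration());
            long lastEventId = 0L;  // in practice, persist this between runs
            Iterator<ReplicationTask> tasks =
                client.getReplicationTasks(lastEventId, 100, "my_db", null);  // null: all tables (assumption)
            while (tasks.hasNext()) {
                ReplicationTask task = tasks.next();
                if (!task.isActionable()) {
                    continue;  // event has no replication impact
                }
                for (Command command : task.getSrcWhCommands()) {
                    run(command.get(), "source");       // HiveQL to run against the source warehouse
                }
                for (Command command : task.getDstWhCommands()) {
                    run(command.get(), "destination");  // HiveQL to run against the destination warehouse
                }
                lastEventId = task.getEvent().getEventId();
            }
        }

        static void run(List<String> hql, String cluster) {
            // e.g. execute each statement over JDBC against the relevant HiveServer2
        }
    }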

In summary, it looks like quite a powerful and useful feature. However, as
I'm currently running Hive 1.0.0 at my source, I cannot use it in a
straightforward manner.

Thanks for your help.

Elliot.


Re: Synchronizing Hive metastores across clusters

Posted by Elliot West <te...@gmail.com>.
Eugene/Sushanth,

Thank you for pointing me in the direction of these features. I'll
investigate them further to see if I can put them to good use.

Cheers - Elliot.


Re: Synchronizing Hive metastores across clusters

Posted by Sushanth Sowmyan <kh...@gmail.com>.
Also, while I have not wiki-ized the documentation for the above, I
have uploaded slides from talks that I've given at Hive user group
meetups on the subject, and also a doc that describes the replication
protocol followed for the EXIM replication; these are attached over at
https://issues.apache.org/jira/browse/HIVE-10264


Re: Synchronizing Hive metastores across clusters

Posted by Sushanth Sowmyan <kh...@gmail.com>.
Hi,

I think that the replication work added with
https://issues.apache.org/jira/browse/HIVE-7973 is right up this
alley.

Per Eugene's suggestion of MetaStoreEventListener, this replication
system plugs into that and gets you a stream of notification events
from HCatClient for the exact purpose you mention.

There's some work still outstanding on this task, most notably
documentation (sorry!), but please have a look at
HCatClient.getReplicationTasks(...) and
org.apache.hive.hcatalog.api.repl.ReplicationTask. You can plug in
your own implementation of ReplicationTask.Factory to inject your own
logic for how to handle the replication according to your needs.
(There currently exists an implementation that uses Hive EXPORT/IMPORT
to perform replication; you can look at the code for this, and at the
tests for these classes, to see how that is achieved. Falcon already
uses this to perform cross-hive-warehouse replication.)
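
To make that concrete, a rough sketch of plugging in a custom factory
(assuming the Factory contract in the 1.2 line; EximReplicationTaskFactory
as the fallback is named from memory and the method body is illustrative):

    import org.apache.hive.hcatalog.api.HCatClient;
    import org.apache.hive.hcatalog.api.HCatNotificationEvent;
    import org.apache.hive.hcatalog.api.repl.ReplicationTask;
    import org.apache.hive.hcatalog.api.repl.exim.EximReplicationTaskFactory;

    public class CustomReplicationSetup {
        /** Illustrative factory: inject your own handling per event, or fall back to EXIM. */
        public static class MyTaskFactory implements ReplicationTask.Factory {
            private final ReplicationTask.Factory eximFallback = new EximReplicationTaskFactory();

            @Override
            public ReplicationTask create(HCatClient client, HCatNotificationEvent event) {
                // Inspect event.getEventType() / event.getDbName() here to route
                // particular events to your own ReplicationTask subclasses.
                return eximFallback.create(client, event);
            }
        }

        public static void main(String[] args) {
            // Register the factory; ReplicationTask.create(...) will now delegate to it.
            ReplicationTask.resetFactory(MyTaskFactory.class);
        }
    }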


Thanks,

-Sushanth


Re: Synchronizing Hive metastores across clusters

Posted by Eugene Koifman <ek...@hortonworks.com>.
The metastore supports MetaStoreEventListener and MetaStorePreEventListener, which may be useful here.
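
For example, a minimal listener sketch (registered on the metastore by
setting hive.metastore.event.listeners to the class name; the method
bodies are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.metastore.MetaStoreEventListener;
    import org.apache.hadoop.hive.metastore.api.MetaException;
    import org.apache.hadoop.hive.metastore.events.AddPartitionEvent;
    import org.apache.hadoop.hive.metastore.events.CreateTableEvent;

    /** Runs inside the metastore process and sees each metadata change. */
    public class SyncListener extends MetaStoreEventListener {
        public SyncListener(Configuration conf) {
            super(conf);
        }

        @Override
        public void onCreateTable(CreateTableEvent event) throws MetaException {
            // e.g. enqueue event.getTable() for replication to the cloud cluster
        }

        @Override
        public void onAddPartition(AddPartitionEvent event) throws MetaException {
            // e.g. enqueue the newly added partitions for replication
        }
    }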

Eugene
