Posted to dev@hive.apache.org by "Mithun Radhakrishnan (JIRA)" <ji...@apache.org> on 2014/08/08 19:07:12 UTC
[jira] [Updated] (HIVE-7341) Support for Table replication across HCatalog instances

     [ https://issues.apache.org/jira/browse/HIVE-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mithun Radhakrishnan updated HIVE-7341:
---------------------------------------

    Status: Open  (was: Patch Available)

> Support for Table replication across HCatalog instances
> -------------------------------------------------------
>
>                 Key: HIVE-7341
>                 URL: https://issues.apache.org/jira/browse/HIVE-7341
>             Project: Hive
>          Issue Type: New Feature
>          Components: HCatalog
>    Affects Versions: 0.13.1
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
>             Fix For: 0.14.0
>
>         Attachments: HIVE-7341.1.patch, HIVE-7341.2.patch, HIVE-7341.3.patch
>
>
> The HCatClient currently provides little support for replicating HCatTable definitions between two HCatalog server (i.e. Hive metastore) instances.
> Systems like Apache Falcon may need to replicate partition data between two clusters and keep the HCatalog metadata in sync between them. This poses a few problems:
> # The definition of the source table might change (in column schema, I/O formats, record formats, serde parameters, etc.). The system will need a way to diff two tables and update the target metastore with the changes. E.g.
> {code}
> targetTable.resolve( sourceTable, targetTable.diff(sourceTable) );
> hcatClient.updateTableSchema(dbName, tableName, targetTable);
> {code}
> # The current {{HCatClient.addPartitions()}} API requires that the partition's schema be derived from the table's schema, so the table schema must be resolved *before* partitions with the new schema are added to the table. This is problematic because it introduces race conditions when two partitions with differing column schemas (e.g. right after a schema change) are copied in parallel. This can be avoided if each HCatAddPartitionDesc keeps track of the partition's schema while in flight.
> # The source and target metastores might be running different/incompatible versions of Hive. 
> The forthcoming patch attempts to address these concerns (with some caveats).
> # {{HCatTable}} now has 
> ## a {{diff()}} method, to compare against another HCatTable instance
> ## a {{resolve(diff)}} method to copy over specified table-attributes from another HCatTable
> ## a serialize/deserialize mechanism (via {{HCatClient.serializeTable()}} and {{HCatClient.deserializeTable()}}), so that HCatTable instances constructed in other class-loaders may be used for comparison
> # {{HCatPartition}} now provides finer-grained control over a Partition's column-schema, StorageDescriptor settings, etc. This allows partitions to be copied completely from source, with the ability to override specific properties if required (e.g. location).
> # {{HCatClient.updateTableSchema()}} can now update the entire table-definition, not just the column schema.
> # I've cleaned up and removed most of the redundancy between HCatTable, HCatCreateTableDesc and HCatCreateTableDesc.Builder. The prior API failed to separate the table's attributes from the attributes of the add-table operation. By providing fluent interfaces in HCatTable, and composing an HCatTable instance in HCatCreateTableDesc, the interfaces are cleaner(ish). The old setters are deprecated in favour of those in HCatTable. The same applies to HCatPartition and HCatAddPartitionDesc.
> I'll post a patch for trunk shortly.
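
The diff-then-resolve flow described above can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: {{TableDef}}, {{Attribute}} and their fields are hypothetical stand-ins, not the HCatTable API from the patch, and the metastore update step ({{hcatClient.updateTableSchema()}}) is left out. It only shows the idea of computing which attribute groups differ and then copying a chosen subset from the source definition.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TableDiffSketch {

    /** Hypothetical attribute groups that may differ between two table definitions. */
    enum Attribute { COLUMNS, INPUT_FORMAT, SERDE_PARAMS }

    /** Hypothetical stand-in for a table definition; this is NOT the real HCatTable. */
    static final class TableDef {
        List<String> columns = new ArrayList<>();
        String inputFormat = "";
        Map<String, String> serdeParams = new LinkedHashMap<>();

        /** Returns the attribute groups in which this table differs from {@code other}. */
        EnumSet<Attribute> diff(TableDef other) {
            EnumSet<Attribute> changed = EnumSet.noneOf(Attribute.class);
            if (!columns.equals(other.columns)) changed.add(Attribute.COLUMNS);
            if (!inputFormat.equals(other.inputFormat)) changed.add(Attribute.INPUT_FORMAT);
            if (!serdeParams.equals(other.serdeParams)) changed.add(Attribute.SERDE_PARAMS);
            return changed;
        }

        /** Copies only the listed attribute groups from {@code source}; the rest is untouched. */
        TableDef resolve(TableDef source, EnumSet<Attribute> attributes) {
            if (attributes.contains(Attribute.COLUMNS)) {
                columns = new ArrayList<>(source.columns);
            }
            if (attributes.contains(Attribute.INPUT_FORMAT)) {
                inputFormat = source.inputFormat;
            }
            if (attributes.contains(Attribute.SERDE_PARAMS)) {
                serdeParams = new LinkedHashMap<>(source.serdeParams);
            }
            return this;
        }
    }

    public static void main(String[] args) {
        // Source table gained a column after a schema change.
        TableDef source = new TableDef();
        source.columns.add("id:int");
        source.columns.add("name:string");
        source.inputFormat = "orc";

        // Target still has the old schema, plus a serde parameter set only on the target.
        TableDef target = new TableDef();
        target.columns.add("id:int");
        target.inputFormat = "orc";
        target.serdeParams.put("field.delim", ",");

        EnumSet<Attribute> changed = target.diff(source);
        System.out.println("Differing attributes: " + changed);

        // Copy the column schema only, keeping the target's own serde parameters.
        target.resolve(source, EnumSet.of(Attribute.COLUMNS));
        System.out.println("Target columns after resolve: " + target.columns);
        System.out.println("Still differing: " + target.diff(source));
    }
}
```

Passing an explicit attribute set to {{resolve()}} is what lets a replication tool copy most of a definition from the source while deliberately leaving some target-side settings alone.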



--
This message was sent by Atlassian JIRA
(v6.2#6252)