Posted to user@hive.apache.org by Elliot West <te...@gmail.com> on 2018/05/10 13:52:47 UTC

Re: Hive remote databases/tables proposal

The main problem I see with a SerDe-based approach is that this abstraction
cannot expose the full set of metadata needed for the target table. While
the SerDe can return the schema (via getObjectInspector(), I presume),
there is no provision for delivering the available partitions or the table
and column statistics.
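
To make the gap concrete, here is a rough sketch of the contract a SerDe
implements (modelled on org.apache.hadoop.hive.serde2.AbstractSerDe as it
looks in current releases; the class name is made up and the body is
stripped down to the methods relevant here). Nothing on this surface gives
a remote table a way to hand partitions or column statistics to the planner:

import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.io.Writable;

// Hypothetical class name, for illustration only.
public class RemoteTableSerDe extends AbstractSerDe {

  @Override
  public void initialize(Configuration conf, Properties tableProperties) throws SerDeException {
    // Only serialization-centric table properties arrive here.
  }

  @Override
  public ObjectInspector getObjectInspector() throws SerDeException {
    // The table schema can be surfaced through this method...
    return null; // ...by building an ObjectInspector from the remote schema.
  }

  @Override
  public SerDeStats getSerDeStats() {
    // ...but the only statistics hook is this coarse, serde-level one
    // (raw data size / row count). There is no method through which
    // available partitions or column statistics could be delivered.
    return new SerDeStats();
  }

  // The remaining methods deal purely with (de)serialising rows.
  @Override
  public Class<? extends Writable> getSerializedClass() { return Writable.class; }

  @Override
  public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException { return null; }

  @Override
  public Object deserialize(Writable blob) throws SerDeException { return null; }
}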

On a related note, I believe this might also preclude the SerDe from acting
as the main integration point for an Iceberg integration
(https://issues.apache.org/jira/browse/HIVE-19457), as this too will need to
pass additional metadata that is stored outside of the metastore and does
not fall within the scope of the SerDe interface.

The org.apache.hadoop.hive.ql.metadata.MetastoreClientFactory integration
point looks promising for both of these cases, but I can only find an
operational implementation of this in EMR.
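
For completeness, here is roughly how I imagine such a factory being used.
The factory name and method signature below are my guess (they differ
between Hive versions and vendor forks), and the thrift URI is a
placeholder; only the metastore client usage is the standard API:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.MetaException;

// Hypothetical factory; the name and signature are assumed purely to
// illustrate the shape of the integration point.
public class RemoteMetastoreClientFactory {

  public IMetaStoreClient createMetaStoreClient(HiveConf conf) throws MetaException {
    // Clone the session conf and point it at a remote (federated) metastore.
    HiveConf remoteConf = new HiveConf(conf);
    remoteConf.setVar(HiveConf.ConfVars.METASTOREURIS,
        "thrift://remote-metastore.example.com:9083");
    return new HiveMetaStoreClient(remoteConf);
  }
}

A client obtained this way can answer getTable(), listPartitions() and the
column statistics calls directly from the remote instance, which is exactly
the metadata that the SerDe route above cannot provide.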

Cheers,

Elliot.

On 27 April 2018 at 17:32, Elliot West <te...@gmail.com> wrote:

> Hi Johannes,
>
> We did not. I presume that your suggestion is that my use case could be
> implemented as a storage handler, and not that we access remote Hive data
> via JDBC (and by implication, HS2)?
>
> I must confess that I hadn't considered this approach, likely because for
> some time I'd assumed that a storage handler could not also be the source
> of table metadata. However, lately I've been externalizing schemas with the
> AvroSerDe, so I now have practical experience demonstrating that this isn't
> the case.
>
> It's a very good idea and I'm keen to look into the practicalities.
>
> Thank you for your helpful reply.
>
> Elliot.
>
>
> On 26 April 2018 at 17:28, Johannes Alberti <jo...@altiscale.com>
> wrote:
>
>> Did you guys look at https://github.com/qubole/Hive-JDBC-Storage-Handler
>> and discuss the pros/cons/similarities of the Qubole approach?
>>
>> On Thu, Apr 26, 2018 at 4:01 AM, Elliot West <te...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> At the 2018 DataWorks conference in Berlin, Hotels.com presented Waggle
>>> Dance <https://github.com/HotelsDotCom/waggle-dance>, a tool for
>>> federating multiple Hive clusters and providing the illusion of a unified
>>> data catalog from disparate instances.
>>>
>>> We’ve been running Waggle Dance in production for well over a year and
>>> it has formed a critical part of our data platform architecture and
>>> infrastructure. We believe that this type of functionality will be of
>>> increasing importance as Hadoop and Hive workloads migrate to the cloud.
>>> While Waggle Dance is one solution, significant benefits could be realized
>>> if these kinds of abilities were an integral part of the Hive platform.
>>>
>>> If this sounds of interest, I've created a proposal on the Hive wiki.
>>> I've outlined why we think such a feature is needed in Hive, the benefits
>>> gained by offering it as a built-in feature, and a representation of a
>>> possible implementation. Our proposed implementation draws inspiration from
>>> the remote table features present in some traditional RDBMSes, which may
>>> already be familiar to you.
>>>
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=80452092
>>>
>>> Feedback gratefully accepted,
>>>
>>> Elliot.
>>>
>>> Senior Engineer
>>> Big Data Platform Team
>>> Hotels.com
>>>
>>
>>
>