Posted to user@hive.apache.org by Elliot West <te...@gmail.com> on 2018/04/26 11:01:32 UTC

Hive remote databases/tables proposal

Hello,

At the 2018 DataWorks conference in Berlin, Hotels.com presented Waggle
Dance <https://github.com/HotelsDotCom/waggle-dance>, a tool for federating
multiple Hive clusters and providing the illusion of a unified data catalog
from disparate instances.

We’ve been running Waggle Dance in production for well over a year and it
has formed a critical part of our data platform architecture and
infrastructure. We believe that this type of functionality will be of
increasing importance as Hadoop and Hive workloads migrate to the cloud.
While Waggle Dance is one solution, significant benefits could be realized
if these kinds of abilities were an integral part of the Hive platform.

If this sounds of interest, I've created a proposal on the Hive wiki. I've
outlined why we think such a feature is needed in Hive, the benefits gained
by offering it as a built-in feature, and a representation of a possible
implementation. Our proposed implementation draws inspiration from the
remote table features present in some traditional RDBMSes, which may
already be familiar to you.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=80452092

Feedback gratefully accepted,

Elliot.

Senior Engineer
Big Data Platform Team
Hotels.com

Re: Hive remote databases/tables proposal

Posted by Elliot West <te...@gmail.com>.

The main problem I see with a SerDe-based approach is that this abstraction
is not able to expose the needed set of metadata for the target table.
While the SerDe can return the schema (via getObjectInspector() I
presume), there is no provision for the delivery of available partitions,
or table and column statistics.
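
As a rough illustration of that gap, here is a minimal, self-contained Java
sketch. It does not use Hive's real classes: SchemaSource,
TableMetadataSource and RemoteTableSketch are hypothetical names invented
for this example, and the column, partition and row-count values are made
up. The first interface models an abstraction that can only report a
schema; the second models the extra metadata a remote table would need to
deliver.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a SerDe-like abstraction: it can describe the
// row schema, but nothing else about the table.
interface SchemaSource {
    Map<String, String> schema(); // column name -> type name
}

// Hypothetical stand-in for what a remote table would additionally expose.
interface TableMetadataSource extends SchemaSource {
    List<String> partitions();
    long rowCount();
}

public class RemoteTableSketch {
    // Made-up two-column schema shared by both example sources.
    public static Map<String, String> exampleSchema() {
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("id", "bigint");
        cols.put("name", "string");
        return cols;
    }

    // Summarises whatever metadata the given source is able to provide.
    public static String describe(SchemaSource source) {
        StringBuilder sb = new StringBuilder("columns=" + source.schema().keySet());
        if (source instanceof TableMetadataSource) {
            TableMetadataSource meta = (TableMetadataSource) source;
            sb.append(", partitions=").append(meta.partitions());
            sb.append(", rows=").append(meta.rowCount());
        } else {
            // A schema-only source has no way to answer these questions.
            sb.append(", partitions=unknown, rows=unknown");
        }
        return sb.toString();
    }

    // A source that also knows its partitions and a row count (made-up values).
    public static TableMetadataSource exampleMetadataSource() {
        return new TableMetadataSource() {
            public Map<String, String> schema() { return exampleSchema(); }
            public List<String> partitions() {
                return Arrays.asList("ds=2018-04-26", "ds=2018-04-27");
            }
            public long rowCount() { return 1000L; }
        };
    }

    public static void main(String[] args) {
        SchemaSource schemaOnly = RemoteTableSketch::exampleSchema;
        System.out.println(describe(schemaOnly));
        System.out.println(describe(exampleMetadataSource()));
    }
}
```

With only the first interface available, a planner would have to treat the
partition list and statistics as unknown, which is the limitation described
above.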

On a related note, I believe this might also preclude the SerDe from acting
as the main integration point for an Iceberg integration (
https://issues.apache.org/jira/browse/HIVE-19457), as this too will need to
pass additional metadata that is stored outside of the metastore and does
not fall into the scope of the SerDe interface.

The org.apache.hadoop.hive.ql.metadata.MetastoreClientFactory integration
point looks promising for both of these cases, but I can only find an
operational implementation of this in EMR.
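
To show the general shape of such an integration point, here is a small,
self-contained Java sketch of a pluggable client factory chosen by
configuration. Everything in it is an assumption made for illustration:
CatalogClient, CatalogClientFactory, the two factory classes and the
"sketch.client.factory.class" key are hypothetical and are not Hive's or
EMR's actual API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client interface: a metadata call a query planner might make.
interface CatalogClient {
    List<String> listPartitions(String table);
}

// Hypothetical factory interface: one implementation per kind of catalog.
interface CatalogClientFactory {
    CatalogClient create(Map<String, String> conf);
}

// Default: stands in for a local metastore (reports no partitions here).
class LocalClientFactory implements CatalogClientFactory {
    public CatalogClient create(Map<String, String> conf) {
        return table -> Collections.emptyList();
    }
}

// Alternative: stands in for a remote catalog (returns made-up partitions).
class RemoteClientFactory implements CatalogClientFactory {
    public CatalogClient create(Map<String, String> conf) {
        return table -> Arrays.asList(table + "/ds=2018-04-26", table + "/ds=2018-04-27");
    }
}

public class FactorySketch {
    // Made-up configuration key, used only by this sketch.
    static final String FACTORY_KEY = "sketch.client.factory.class";

    // Instantiates the configured factory by class name and builds a client.
    public static CatalogClient load(Map<String, String> conf) {
        String className = conf.getOrDefault(FACTORY_KEY, LocalClientFactory.class.getName());
        try {
            CatalogClientFactory factory = (CatalogClientFactory)
                    Class.forName(className).getDeclaredConstructor().newInstance();
            return factory.create(conf);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create factory " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(load(conf).listPartitions("sales"));   // default factory
        conf.put(FACTORY_KEY, RemoteClientFactory.class.getName());
        System.out.println(load(conf).listPartitions("sales"));   // remote factory
    }
}
```

The appeal of this shape is that the client, unlike a SerDe, sits on the
path where partitions and statistics are requested, so a replacement
implementation can source them from anywhere.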

Cheers,

Elliot.

On 27 April 2018 at 17:32, Elliot West <te...@gmail.com> wrote:

> Hi Johannes,
>
> We did not. I presume that your suggestion is that my use case could be
> implemented as a storage handler, and not that we access remote Hive data
> via JDBC (and by implication, HS2)?
>
> I must confess that I hadn't considered this approach, likely because for
> some time I'd assumed that a storage handler could not also be the source
> of table metadata. However, lately I've been externalizing schemas with the
> AvroSerDe and so I now have practical experience that demonstrates that
> isn't the case.
>
> It's a very good idea and I'm keen to look into the practicalities.
>
> Thank you for your helpful reply.
>
> Elliot.
>
>
> On 26 April 2018 at 17:28, Johannes Alberti <jo...@altiscale.com>
> wrote:
>
>> Did you guys look at https://github.com/qubole/Hive-JDBC-Storage-Handler
>> and discuss the pros/cons/similarities of the Qubole approach?
>>
>> On Thu, Apr 26, 2018 at 4:01 AM, Elliot West <te...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> At the 2018 DataWorks conference in Berlin, Hotels.com presented Waggle
>>> Dance <https://github.com/HotelsDotCom/waggle-dance>, a tool for
>>> federating multiple Hive clusters and providing the illusion of a unified
>>> data catalog from disparate instances.
>>>
>>> We’ve been running Waggle Dance in production for well over a year and
>>> it has formed a critical part of our data platform architecture and
>>> infrastructure. We believe that this type of functionality will be of
>>> increasing importance as Hadoop and Hive workloads migrate to the cloud.
>>> While Waggle Dance is one solution, significant benefits could be realized
>>> if these kinds of abilities were an integral part of the Hive platform.
>>>
>>> If this sounds of interest, I've created a proposal on the Hive wiki.
>>> I've outlined why we think such a feature is needed in Hive, the benefits
>>> gained by offering it as a built-in feature, and a representation of a
>>> possible implementation. Our proposed implementation draws inspiration from
>>> the remote table features present in some traditional RDBMSes, which may
>>> already be familiar to you.
>>>
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=80452092
>>>
>>> Feedback gratefully accepted,
>>>
>>> Elliot.
>>>
>>> Senior Engineer
>>> Big Data Platform Team
>>> Hotels.com
>>>
>>
>>
>

Re: Hive remote databases/tables proposal

Posted by Elliot West <te...@gmail.com>.

Hi Johannes,

We did not. I presume that your suggestion is that my use case could be
implemented as a storage handler, and not that we access remote Hive data
via JDBC (and by implication, HS2)?

I must confess that I hadn't considered this approach, likely because for
some time I'd assumed that a storage handler could not also be the source
of table metadata. However, lately I've been externalizing schemas with the
AvroSerDe and so I now have practical experience that demonstrates that
isn't the case.

It's a very good idea and I'm keen to look into the practicalities.

Thank you for your helpful reply.

Elliot.

On 26 April 2018 at 17:28, Johannes Alberti <jo...@altiscale.com> wrote:

> Did you guys look at https://github.com/qubole/Hive-JDBC-Storage-Handler
> and discuss the pros/cons/similarities of the Qubole approach?
>
> On Thu, Apr 26, 2018 at 4:01 AM, Elliot West <te...@gmail.com> wrote:
>
>> Hello,
>>
>> At the 2018 DataWorks conference in Berlin, Hotels.com presented Waggle
>> Dance <https://github.com/HotelsDotCom/waggle-dance>, a tool for
>> federating multiple Hive clusters and providing the illusion of a unified
>> data catalog from disparate instances.
>>
>> We’ve been running Waggle Dance in production for well over a year and it
>> has formed a critical part of our data platform architecture and
>> infrastructure. We believe that this type of functionality will be of
>> increasing importance as Hadoop and Hive workloads migrate to the cloud.
>> While Waggle Dance is one solution, significant benefits could be realized
>> if these kinds of abilities were an integral part of the Hive platform.
>>
>> If this sounds of interest, I've created a proposal on the Hive wiki.
>> I've outlined why we think such a feature is needed in Hive, the benefits
>> gained by offering it as a built-in feature, and a representation of a
>> possible implementation. Our proposed implementation draws inspiration from
>> the remote table features present in some traditional RDBMSes, which may
>> already be familiar to you.
>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=80452092
>>
>> Feedback gratefully accepted,
>>
>> Elliot.
>>
>> Senior Engineer
>> Big Data Platform Team
>> Hotels.com
>>
>
>

Re: Hive remote databases/tables proposal

Posted by Johannes Alberti <jo...@altiscale.com>.

Did you guys look at https://github.com/qubole/Hive-JDBC-Storage-Handler
and discuss the pros/cons/similarities of the Qubole approach?

On Thu, Apr 26, 2018 at 4:01 AM, Elliot West <te...@gmail.com> wrote:

> Hello,
>
> At the 2018 DataWorks conference in Berlin, Hotels.com presented Waggle
> Dance <https://github.com/HotelsDotCom/waggle-dance>, a tool for
> federating multiple Hive clusters and providing the illusion of a unified
> data catalog from disparate instances.
>
> We’ve been running Waggle Dance in production for well over a year and it
> has formed a critical part of our data platform architecture and
> infrastructure. We believe that this type of functionality will be of
> increasing importance as Hadoop and Hive workloads migrate to the cloud.
> While Waggle Dance is one solution, significant benefits could be realized
> if these kinds of abilities were an integral part of the Hive platform.
>
> If this sounds of interest, I've created a proposal on the Hive wiki. I've
> outlined why we think such a feature is needed in Hive, the benefits gained
> by offering it as a built-in feature, and a representation of a possible
> implementation. Our proposed implementation draws inspiration from the
> remote table features present in some traditional RDBMSes, which may
> already be familiar to you.
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=80452092
>
> Feedback gratefully accepted,
>
> Elliot.
>
> Senior Engineer
> Big Data Platform Team
> Hotels.com
>