Posted to dev@iceberg.apache.org by Elliot West <te...@gmail.com> on 2020/01/08 14:27:43 UTC

Apache Hive integration

Hello,

We're considering working on an integration of Iceberg with Apache Hive,
initially so that the latest snapshot of Iceberg tables can be queried via
Hive, but later to allow the writing of data using the Iceberg table format.

I wanted to first check for the existence and status of any similar efforts
so that we do not find ourselves duplicating work unnecessarily. I've
checked both the Iceberg and Hive projects and can find no issues that
suggest that such an integration is underway or planned (only HIVE-19457
<https://issues.apache.org/jira/browse/HIVE-19457>, which I raised and which
remains open).

If one or more efforts are underway, we'd certainly be open to contributing.
If not, we'd be keen to capture any thoughts from the community on
preferred or recommended technical approaches.

I see that some work occurred on MR In/Out formats
<https://github.com/guilload/incubator-iceberg/pull/1> which might serve as
a foundation, so we'll certainly be investigating those further.

Thanks,

Elliot.

Re: Apache Hive integration

Posted by Elliot West <te...@gmail.com>.
Thanks for the update Adrien,

Does your PR reflect the current state of your implementation or did you
make additional progress?

Elliot.

On Thu, 9 Jan 2020 at 13:38, Adrien Guillo <ad...@airbnb.com> wrote:

> Hello,
>
> Last year, we started looking into integrating Iceberg with Hive and
> working on a proof-of-concept. Unfortunately, the project was paused a few
> months later, but we're hoping to resume our work this year, ideally in Q1.
>
> We'll keep you posted.
>
> Cheers,
> Adrien

Re: Apache Hive integration

Posted by Adrien Guillo <ad...@airbnb.com.INVALID>.
Hello,

Last year, we started looking into integrating Iceberg with Hive and
working on a proof-of-concept. Unfortunately, the project was paused a few
months later, but we're hoping to resume our work this year, ideally in Q1.

We'll keep you posted.

Cheers,
Adrien

On Wed, Jan 8, 2020 at 10:43 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Thanks for the interest in Hive integration! I haven't heard about
> progress here lately, so it's good that you bring it up. Hopefully the
> other people that are interested can jump in with their current status.
>
> I think you're right that the MR input and output formats are a good place
> to start, but if I remember correctly, Hive ignores the output
> format's committer. That means we will need to plug in at the catalog level
> at some point. For that, Owen O'Malley has pointed us to the `RawStore` API,
> which backs metastore interaction.


-- 
Adrien Guillo
Data Infrastructure
San Francisco

Re: Apache Hive integration

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks for the interest in Hive integration! I haven't heard about progress
here lately, so it's good that you bring it up. Hopefully the other people
that are interested can jump in with their current status.

I think you're right that the MR input and output formats are a good place
to start, but if I remember correctly, Hive ignores the output
format's committer. That means we will need to plug in at the catalog level
at some point. For that, Owen O'Malley has pointed us to the `RawStore` API,
which backs metastore interaction.

On Wed, Jan 8, 2020 at 6:28 AM Elliot West <te...@gmail.com> wrote:

> Hello,
>
> We're considering working on an integration of Iceberg with Apache Hive,
> initially so that the latest snapshot of Iceberg tables can be queried via
> Hive, but later to allow the writing of data using the Iceberg table format.
>
> I wanted to first check for the existence and status of any similar
> efforts so that we do not find ourselves duplicating work unnecessarily.
> I've checked both the Iceberg and Hive projects and can find no issues that
> suggest that such an integration is underway or planned (only HIVE-19457
> <https://issues.apache.org/jira/browse/HIVE-19457>, which I raised and which
> remains open).
>
> If one or more efforts are underway, we'd certainly be open to contributing.
> If not, we'd be keen to capture any thoughts from the community on
> preferred or recommended technical approaches.
>
> I see that some work occurred on MR In/Out formats
> <https://github.com/guilload/incubator-iceberg/pull/1> which might serve
> as a foundation, so we'll certainly be investigating those further.
>
> Thanks,
>
> Elliot.
>


-- 
Ryan Blue
Software Engineer
Netflix