Posted to dev@iceberg.apache.org by Gustavo Torres Torres <gu...@airbnb.com.INVALID> on 2021/03/02 02:38:57 UTC

Airflow Integration

Hey folks,

Lately I've been thinking about integration between Airflow & Iceberg to
enable a smooth transition from Hive-based tables to Iceberg ones, and I
would like to hear about your experience, specifically with Iceberg
partition sensors in Airflow.

The way I see it, there are two ways to go about this (at least for
Hive-based catalogs):


   1. Modify our Hive Metastore API so that the partition APIs are handled
   directly by the Iceberg API. This has the advantage of being mostly
   transparent to users, but has the downside of being confusing, since Iceberg
   creates tables with the Hive catalog as external non-partitioned tables.
   2. Create a separate sensor that makes it clear that we are sensing over
   an Iceberg table. This is probably the most straightforward approach, but
   if we do this we would probably need to do the same for any tool that used
   the metastore to get partition information.


Would love to hear what your experiences have been.
Thanks

Re: Airflow Integration

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I think it’s difficult for Iceberg to support the partition-related
commands because it ends up being a scan over metadata files rather than a
metastore operation. We have been trying to move away from our own
getPartitions API because satisfying those queries is expensive compared
to Hive: you have to find all matching files and then aggregate the
data.
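
To see why, consider what answering "which partitions exist?" costs. Here is
a minimal sketch using the pyiceberg client library (the catalog and table
names are invented): there is no partition listing to read, so the partition
values have to be aggregated from every matching data file.

```python
from pyiceberg.catalog import load_catalog

table = load_catalog("default").load_table("db.events")

# No metastore partition listing exists; plan the scan and aggregate
# the partition tuple of every matching data file.
partitions = {str(task.file.partition) for task in table.scan().plan_files()}
print(sorted(partitions))
```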

I think a separate sensor for Airflow sounds like the right option because
that could use Iceberg metadata to find just the partitions that changed.
That’s a much more efficient operation because you’re basically consuming
snapshot metadata incrementally after some high watermark.
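
For illustration, here is a minimal sketch of what such a sensor could look
like, assuming the pyiceberg client library and using the last-seen snapshot
ID as the high watermark; the catalog name, table name, and Variable key are
all hypothetical.

```python
from airflow.models import Variable
from airflow.sensors.base import BaseSensorOperator
from pyiceberg.catalog import load_catalog


class IcebergNewSnapshotSensor(BaseSensorOperator):
    """Fires once the table has a snapshot newer than the stored watermark."""

    def __init__(self, table_name: str, watermark_key: str, **kwargs):
        super().__init__(**kwargs)
        self.table_name = table_name
        self.watermark_key = watermark_key

    def poke(self, context) -> bool:
        table = load_catalog("default").load_table(self.table_name)
        current = table.current_snapshot()
        if current is None:  # the table has no snapshots yet
            return False
        last_seen = Variable.get(self.watermark_key, default_var=None)
        if last_seen == str(current.snapshot_id):
            return False  # nothing new since the last run
        # Advance the high watermark and fire. A production version would
        # advance it only after downstream tasks succeed.
        Variable.set(self.watermark_key, str(current.snapshot_id))
        return True
```

From the new snapshot's metadata you could then work out which partitions
actually changed.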

You may also want to think about different processing patterns in Airflow.
Part of what we want to do in Iceberg is to hide partitioning so that it
becomes configuration that can be changed. Using partitions for implicit deletes
(like INSERT OVERWRITE in Hive) or for units of work ties infrastructure to
partitioning and makes it so you can’t evolve it without breaking
something. I would like to avoid that. That’s why we built the new v2 write
API in Spark that overwrites using an explicit filter rather than
implicitly replacing partitions. Here, I would recommend an alternative
sensor that identifies when a portion of a table is complete and then
produces a filter to select that portion. That way you don’t depend
directly on partitioning.
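
To make that concrete, this is what an explicit-filter overwrite looks like
through the Spark DataFrameWriterV2 API in PySpark; the table and column
names here are invented.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.table("staging.events_ready")  # hypothetical source table

# Overwrite exactly the rows matching an explicit filter, instead of
# implicitly replacing whole partitions as Hive's INSERT OVERWRITE does.
df.writeTo("prod.db.events").overwrite(col("event_date") == "2021-03-01")
```

Because the filter is explicit, the table's partition spec can later evolve
without changing the job.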


-- 
Ryan Blue
Software Engineer
Netflix

Re: Airflow Integration

Posted by Gustavo Torres Torres <gu...@airbnb.com.INVALID>.
Thanks Peter!

So in that case we do let users create Iceberg tables with Hive DDL `CREATE
EXTERNAL TABLE ice_table PARTITIONED BY ...`, but my guess is that `SHOW
PARTITIONS ice_table` would not work.

Has there been any discussion about whether Iceberg tables in Hive should
support these partition-related commands? I know Spark and Trino (formerly
PrestoSQL) have metadata tables that would let you find out the partitions
of your table.
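
For reference, this is roughly how that looks on the Spark side (the table
name is invented); Trino exposes the equivalent through an
`"events$partitions"` metadata table.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Iceberg exposes metadata tables alongside the data table; "partitions"
# lists each partition along with its record and file counts.
spark.sql("SELECT * FROM prod.db.events.partitions").show()
```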


Re: Airflow Integration

Posted by Peter Vary <pv...@cloudera.com.INVALID>.
Hi Gustavo,

Not too familiar with the Airflow user base/use cases, but we had to consider similar things when deciding what to do with `CREATE EXTERNAL TABLE ice_table PARTITIONED BY ...` Hive queries.
See: https://github.com/apache/iceberg/pull/1917

The decision there was that, even though the user issued a command to create a partitioned Hive table, we created an unpartitioned Hive table, where the backing Iceberg table used identity partitions for the originally requested columns.
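
To make "identity partitions" concrete, here is roughly what the resulting
partition spec amounts to, expressed with pyiceberg's classes (the schema
and field names are invented for illustration):

```python
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import IdentityTransform
from pyiceberg.types import LongType, NestedField, StringType

schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="event_date", field_type=StringType()),
)

# `PARTITIONED BY (event_date)` maps to an identity transform on that
# column in the backing Iceberg table, while the Hive table itself
# stays unpartitioned.
spec = PartitionSpec(
    PartitionField(
        source_id=2, field_id=1000, transform=IdentityTransform(), name="event_date"
    )
)
```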

Hope this helps a bit.

Thanks,
Peter
