You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@iceberg.apache.org by Erik Wright <er...@shopify.com.INVALID> on 2018/11/27 19:54:03 UTC

Presto Partitioning question

>
> *Presto Integration*
>>
>> If I understand correctly, there is no current support for reading
>> Iceberg data directly from Presto. I imagine, however, that as long as your
>> partition specifications are restricted to the `identity` transform it
>> should be "easy" to consume an Iceberg table snapshot and register it as an
>> external hive table complete with partition metadata. Am I missing any
>> caveats there?
>>
>
> Presto support has been posted in a PR:
> https://github.com/prestodb/presto/pull/11767
>
> Presto performance should be on par with Parquet tables in Presto because
> we use the same vectorized read code. The implementation supports reading
> from partitioned tables with either hidden or identity partitions.
>

Can you clarify what is meant by "hidden" partitions? Will Presto
integration be able to convert filter predicates to partition predicates
for the partition transforms listed in the Iceberg table spec? For example,
will filtering on a timestamp column be converted into correct partition
predicates when data is partitioned by year/month?

Re: Presto Partitioning question

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Iceberg has 2 main types of partitions:

* identity, named for the identity transform, which are just like Hive
partitions where data from some column is used to partition without
modification (identity transform)
* hidden, which derive partition values from some column but don't expose
those values as a data column

An example of a "hidden" partition is a day transform, where the partition
value (day) is derived from a timestamp or date value. The data contains
only a timestamp column and partition values are automatically derived from
those timestamps to store the data. Then filters on the timestamp are
automatically converted to partition filters when planning a scan.

Hidden partitions have lots of benefits:

* Users aren't responsible for calculating partition values when writing,
which is error prone (i.e., date in what time zone?)
* Readers aren't responsible for filtering partition values in addition to
data values, which is a common cause of bad queries and full table scans
* Queries are written for data and can be run on any partition scheme, so
partitioning can be evolved over time as data volume changes

To answer your question, "will filtering on a timestamp column be converted
into correct partition predicates when data is partitioned by year/month?"

Yes, if the year or month partition is a hidden partition that is
maintained by Iceberg because Iceberg knows the relationship between a
timestamp/date and the partition. No if the year/month partition was
supplied by a user as a data column because Iceberg doesn't know anything
about the relationship between two columns.

rb

On Tue, Nov 27, 2018 at 11:54 AM Erik Wright <er...@shopify.com>
wrote:

> *Presto Integration*
>>>
>>> If I understand correctly, there is no current support for reading
>>> Iceberg data directly from Presto. I imagine, however, that as long as your
>>> partition specifications are restricted to the `identity` transform it
>>> should be "easy" to consume an Iceberg table snapshot and register it as an
>>> external hive table complete with partition metadata. Am I missing any
>>> caveats there?
>>>
>>
>> Presto support has been posted in a PR:
>> https://github.com/prestodb/presto/pull/11767
>>
>> Presto performance should be on par with Parquet tables in Presto because
>> we use the same vectorized read code. The implementation supports reading
>> from partitioned tables with either hidden or identity partitions.
>>
>
> Can you clarify what is meant by "hidden" partitions? Will Presto
> integration be able to convert filter predicates to partition predicates
> for the partition transforms listed in the Iceberg table spec? For example,
> will filtering on a timestamp column be converted into correct partition
> predicates when data is partitioned by year/month?
>


-- 
Ryan Blue
Software Engineer
Netflix