
Posted to dev@iceberg.apache.org by Piyush Vinay Hurpade <pi...@dremio.com> on 2021/01/08 07:15:14 UTC

About proposal for including Partition stats for Iceberg tables

Hi Team,
We need some help with our proposal for a partition-stats file within
each snapshot. For each snapshot, we propose a partition-stats Avro
file that contains information about all partitions in the table. The
schema we decided on is *(partition_spec_id(int),
partition(PartitionData), file_count(int), row_count(long)).* The problem is
with the second column, *partition*. When partition evolution happens, the
schema for PartitionData (PartitionSpec) will change. For example:

{"partition_spec_id":0,"partition":"PartitionData{data=a}","file_count":2,"row_count":2}
{"partition_spec_id":0,"partition":"PartitionData{data=b}","file_count":1,"row_count":1}
{"partition_spec_id":1,"partition":"PartitionData{data=c, id=1}","file_count":1,"row_count":1}

This will be a problem for both readers and writers. We decided to make *the
partition column a "String" type and serialize PartitionData to a string.*
We want to confirm: *can all data types supported in Iceberg be
serialized to a String?* For example, if a table column has a binary type
and the table is partitioned on it, can it be serialized to a string?
Issue link: https://github.com/apache/iceberg/issues/1832
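
To make the binary question above concrete, here is a minimal Python sketch. It is not Iceberg code: the helper names are hypothetical, and Base64 is only an assumed encoding chosen for illustration. It shows that arbitrary bytes cannot always be decoded directly as text, while an explicit text encoding round-trips cleanly.

```python
import base64


def binary_to_text(value: bytes) -> str:
    """Hypothetical helper: encode arbitrary bytes as an ASCII string.

    Base64 is an assumption made for this sketch, not something the
    proposal or Iceberg prescribes.
    """
    return base64.b64encode(value).decode("ascii")


def text_to_binary(text: str) -> bytes:
    """Hypothetical helper: invert binary_to_text."""
    return base64.b64decode(text)


# A binary partition value that is not valid UTF-8.
raw = bytes([0xFF, 0xFE, 0x00, 0x61])

# Decoding the bytes directly as text fails for this value.
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# An explicit encoding always yields a string and can be reversed.
encoded = binary_to_text(raw)
restored = text_to_binary(encoded)
```

The point of the sketch is only that a binary column needs an agreed, reversible encoding before it can live in a string column; which encoding to use is a separate design decision.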

Thanks and regards
-- 

Piyush Hurpade

Software Engineer

piyush.hurpade@dremio.com

Re: About proposal for including Partition stats for Iceberg tables

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Hi Piyush,

You might want to consider having a separate partition stats file for each
partition spec. That way, each stats file contains just one partition
struct type and you can keep the struct unmodified. There is a way to
convert a partition struct to a string (PartitionSpec.partitionToPath) but
that is a one-way conversion and you shouldn't try to parse that string or
consider two partitions equal just because the string is equal.
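
The one-way nature of such a string can be illustrated with a small Python sketch. The function `to_path` below is a hypothetical, simplified imitation of a "name=value" path rendering and is not the actual PartitionSpec.partitionToPath implementation. It shows two partition values of different types producing the same string, which is why string equality is not a safe test of partition equality.

```python
def to_path(partition: dict) -> str:
    """Hypothetical helper: render a partition as name=value segments.

    This is a simplified stand-in for illustration only; type
    information is lost when each value is formatted as text.
    """
    return "/".join(f"{name}={value}" for name, value in partition.items())


# Two partitions whose values differ in type: an integer and a string.
as_int = {"id": 1}
as_str = {"id": "1"}

path_from_int = to_path(as_int)
path_from_str = to_path(as_str)

# The rendered strings match even though the partitions do not.
strings_equal = path_from_int == path_from_str
partitions_equal = as_int == as_str
```

Because the original type cannot be recovered from the rendered text alone, parsing the string back would require extra information such as the partition spec.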

Making the file specific to a partition spec fixes the problem and allows
you to find the data in each file using the same partition predicates that
we use to locate data files. That will make it easy to find the stats for
the partitions that you're looking for based on some data or partition
query filter.
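
As a rough sketch of the per-spec layout, the Python below groups stats by partition spec ID, so every group holds partition tuples of a single shape. The input format, the function name `stats_by_spec`, and the in-memory dictionaries are all assumptions for illustration; a real implementation would write one Avro stats file per spec rather than build a dictionary. The sample values mirror the illustration in the original message.

```python
from collections import defaultdict

# Hypothetical data-file entries: (partition_spec_id, partition tuple, row count).
data_files = [
    (0, ("a",), 1),
    (0, ("a",), 1),
    (0, ("b",), 1),
    (1, ("c", 1), 1),
]


def stats_by_spec(files):
    """Aggregate file and row counts per partition, grouped by spec ID."""
    stats = defaultdict(dict)
    for spec_id, partition, rows in files:
        entry = stats[spec_id].setdefault(
            partition, {"file_count": 0, "row_count": 0}
        )
        entry["file_count"] += 1
        entry["row_count"] += rows
    return dict(stats)


per_spec = stats_by_spec(data_files)

# Within one spec, every partition tuple has the same number of fields,
# so each group could be stored with a single fixed struct type.
uniform = all(
    len({len(partition) for partition in partitions}) == 1
    for partitions in per_spec.values()
)
```

Each top-level key in `per_spec` corresponds to what would be one stats file, and the `partition_spec_id` column becomes unnecessary inside that file.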

rb

-- 
Ryan Blue
Software Engineer
Netflix