Posted to dev@iceberg.apache.org by Vivekanand Vellanki <vi...@dremio.com> on 2020/11/20 09:43:50 UTC

Proposal for additional fields in Iceberg manifest files

Hi,

I would like to propose additional fields in Iceberg manifest files
<https://docs.google.com/document/d/1G6GeOXkGSiSTcu0lDS6VA1FtJ_uz9FO4tF2Pffmx9LU/edit#>
to support the following scenarios:

   - Partition index to include per-partition stats to help support planning
   - Data locality information to support split assignment in distributed
   query engines
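
To make the first scenario concrete, here is a minimal sketch of how per-partition stats could be rolled up from per-file manifest entries. The class and field names (`DataFileEntry`, `PartitionStats`, `build_partition_index`) are hypothetical and chosen for illustration only; they are not the fields in the linked doc or the actual Iceberg manifest schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical, simplified manifest entry (not the Iceberg schema)."""
    partition: str          # e.g. "dt=2020-11-20"
    record_count: int
    file_size_bytes: int


@dataclass
class PartitionStats:
    """Hypothetical per-partition summary a partition index could hold."""
    file_count: int = 0
    record_count: int = 0
    total_size_bytes: int = 0


def build_partition_index(entries):
    """Aggregate per-file entries into one stats record per partition."""
    index = defaultdict(PartitionStats)
    for entry in entries:
        stats = index[entry.partition]
        stats.file_count += 1
        stats.record_count += entry.record_count
        stats.total_size_bytes += entry.file_size_bytes
    return dict(index)


entries = [
    DataFileEntry("dt=2020-11-19", 1000, 4096),
    DataFileEntry("dt=2020-11-20", 500, 2048),
    DataFileEntry("dt=2020-11-20", 700, 3072),
]
index = build_partition_index(entries)
```

A planner could then read `index["dt=2020-11-20"].record_count` to size a scan without opening every data file entry for that partition.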

Comments are welcome.

-- 
Thanks
Vivek

Re: Proposal for additional fields in Iceberg manifest files

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks for creating the issues!

We don't usually assign issues to individuals. If you're working on it,
just comment on the issue. Assigning tends to discourage other people from
working on it, rather than contacting the person currently working to see
if they can help. And assignees don't always complete what they intend to,
so the assignments can get stale.

On Thu, Nov 26, 2020 at 12:48 AM Vivekanand Vellanki <vi...@dremio.com>
wrote:

> Ryan,
>
> I created 2 tickets to work on the partition index. Can you please assign
> them to me (vvellanki) so that I can work on them?
>
> https://github.com/apache/iceberg/issues/1832
> https://github.com/apache/iceberg/issues/1833
>
> Thanks
> Vivek
>
> On Fri, Nov 20, 2020 at 11:18 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Thanks Vivekanand!
>>
>> I made some comments on the doc. Overall, I think a partition index is a
>> good idea. We've thought about adding sketches that contain skew estimates
>> for certain columns in a partition so that we can do better join
>> estimation. Getting a start on how we would store data like this is a good
>> step.
>>
>> I'm a bit more skeptical about locality information, since it would get
>> out of date and require rewriting old, large manifests.
>>
>> On Fri, Nov 20, 2020 at 1:44 AM Vivekanand Vellanki <vi...@dremio.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I would like to propose additional fields in Iceberg manifest files
>>> <https://docs.google.com/document/d/1G6GeOXkGSiSTcu0lDS6VA1FtJ_uz9FO4tF2Pffmx9LU/edit#>
>>> to support the following scenarios:
>>>
>>>    - Partition index to include per-partition stats to help support
>>>    planning
>>>    - Data locality information to support split assignment in
>>>    distributed query engines
>>>
>>> Comments are welcome.
>>>
>>> --
>>> Thanks
>>> Vivek
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Proposal for additional fields in Iceberg manifest files

Posted by Vivekanand Vellanki <vi...@dremio.com>.
Ryan,

I created 2 tickets to work on the partition index. Can you please assign
them to me (vvellanki) so that I can work on them?

https://github.com/apache/iceberg/issues/1832
https://github.com/apache/iceberg/issues/1833

Thanks
Vivek

On Fri, Nov 20, 2020 at 11:18 PM Ryan Blue <rb...@netflix.com.invalid>
wrote:

> Thanks Vivekanand!
>
> I made some comments on the doc. Overall, I think a partition index is a
> good idea. We've thought about adding sketches that contain skew estimates
> for certain columns in a partition so that we can do better join
> estimation. Getting a start on how we would store data like this is a good
> step.
>
> I'm a bit more skeptical about locality information, since it would get
> out of date and require rewriting old, large manifests.
>
> On Fri, Nov 20, 2020 at 1:44 AM Vivekanand Vellanki <vi...@dremio.com>
> wrote:
>
>> Hi,
>>
>> I would like to propose additional fields in Iceberg manifest files
>> <https://docs.google.com/document/d/1G6GeOXkGSiSTcu0lDS6VA1FtJ_uz9FO4tF2Pffmx9LU/edit#>
>> to support the following scenarios:
>>
>>    - Partition index to include per-partition stats to help support
>>    planning
>>    - Data locality information to support split assignment in
>>    distributed query engines
>>
>> Comments are welcome.
>>
>> --
>> Thanks
>> Vivek
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Proposal for additional fields in Iceberg manifest files

Posted by Vivekanand Vellanki <vi...@dremio.com>.
Thanks for the feedback. I responded to the comments in the doc.

Regarding locality information, I introduced a timestamp field to track the
time when the information was populated. Engines can use this timestamp to
decide whether the data locality information is still valid. Further, when
manifest files are rewritten as part of MergeAppend or compaction, this
information would be updated.
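
As a rough illustration of the timestamp check described above, an engine might ignore locality hints older than some threshold. This is only a sketch under assumptions: `LocalityInfo`, `populated_at_ms`, `usable_hosts`, and the `max_age_ms` threshold are hypothetical names, not fields from the doc or the Iceberg API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LocalityInfo:
    """Hypothetical locality record stored alongside a data file entry."""
    hosts: List[str]        # hosts believed to hold the file's blocks
    populated_at_ms: int    # when the host list was recorded (epoch millis)


def usable_hosts(info, now_ms, max_age_ms):
    """Return the recorded hosts only if the record is recent enough.

    An empty list means the engine should fall back to its normal
    split assignment and ignore the stored locality hint.
    """
    age_ms = now_ms - info.populated_at_ms
    if age_ms < 0 or age_ms > max_age_ms:
        return []
    return list(info.hosts)


info = LocalityInfo(["host-a", "host-b"], 1_000_000)
one_hour_ms = 3_600_000
fresh = usable_hosts(info, now_ms=1_000_000 + 60_000, max_age_ms=one_hour_ms)
stale = usable_hosts(info, now_ms=1_000_000 + 7_200_000, max_age_ms=one_hour_ms)
```

Here each engine picks its own `max_age_ms`, so a stale record costs nothing more than a lost hint and old manifests do not need to be rewritten just to stay correct.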

On Fri, Nov 20, 2020 at 11:18 PM Ryan Blue <rb...@netflix.com.invalid>
wrote:

> Thanks Vivekanand!
>
> I made some comments on the doc. Overall, I think a partition index is a
> good idea. We've thought about adding sketches that contain skew estimates
> for certain columns in a partition so that we can do better join
> estimation. Getting a start on how we would store data like this is a good
> step.
>
> I'm a bit more skeptical about locality information, since it would get
> out of date and require rewriting old, large manifests.
>
> On Fri, Nov 20, 2020 at 1:44 AM Vivekanand Vellanki <vi...@dremio.com>
> wrote:
>
>> Hi,
>>
>> I would like to propose additional fields in Iceberg manifest files
>> <https://docs.google.com/document/d/1G6GeOXkGSiSTcu0lDS6VA1FtJ_uz9FO4tF2Pffmx9LU/edit#>
>> to support the following scenarios:
>>
>>    - Partition index to include per-partition stats to help support
>>    planning
>>    - Data locality information to support split assignment in
>>    distributed query engines
>>
>> Comments are welcome.
>>
>> --
>> Thanks
>> Vivek
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Proposal for additional fields in Iceberg manifest files

Posted by Mass Dosage <ma...@gmail.com>.
+1 - I also like the idea of having more data profiling info for the
partition, but I worry about hostnames and IP addresses and about maintaining
them as things change, especially if you have hundreds of hosts. I'd rather
leave that to the name node.

On Fri, 20 Nov 2020 at 17:48, Ryan Blue <rb...@netflix.com.invalid> wrote:

> Thanks Vivekanand!
>
> I made some comments on the doc. Overall, I think a partition index is a
> good idea. We've thought about adding sketches that contain skew estimates
> for certain columns in a partition so that we can do better join
> estimation. Getting a start on how we would store data like this is a good
> step.
>
> I'm a bit more skeptical about locality information, since it would get
> out of date and require rewriting old, large manifests.
>
> On Fri, Nov 20, 2020 at 1:44 AM Vivekanand Vellanki <vi...@dremio.com>
> wrote:
>
>> Hi,
>>
>> I would like to propose additional fields in Iceberg manifest files
>> <https://docs.google.com/document/d/1G6GeOXkGSiSTcu0lDS6VA1FtJ_uz9FO4tF2Pffmx9LU/edit#>
>> to support the following scenarios:
>>
>>    - Partition index to include per-partition stats to help support
>>    planning
>>    - Data locality information to support split assignment in
>>    distributed query engines
>>
>> Comments are welcome.
>>
>> --
>> Thanks
>> Vivek
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Proposal for additional fields in Iceberg manifest files

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks Vivekanand!

I made some comments on the doc. Overall, I think a partition index is a
good idea. We've thought about adding sketches that contain skew estimates
for certain columns in a partition so that we can do better join
estimation. Getting a start on how we would store data like this is a good
step.
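
For a sense of what a skew estimate for a column could mean, here is a small sketch that computes the ratio of the most frequent value's count to the mean count per distinct value. It uses exact counting rather than a probabilistic sketch structure, and `skew_ratio` is a hypothetical helper, not something from the doc or from Iceberg.

```python
from collections import Counter


def skew_ratio(values):
    """Ratio of the largest per-value count to the mean per-value count.

    1.0 means the values are evenly distributed; larger numbers mean one
    value dominates, which matters when estimating join output sizes.
    Exact counting is used here for clarity; a real implementation would
    likely use an approximate structure to bound memory.
    """
    counts = Counter(values)
    if not counts:
        return 0.0
    mean_count = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean_count


uniform = skew_ratio(["a", "b", "c", "d"])
skewed = skew_ratio(["a", "a", "a", "a", "a", "a", "b", "c"])
```

In this example the uniform column has a ratio of 1.0, while the skewed column, where "a" appears six times out of eight rows, has a noticeably higher ratio.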

I'm a bit more skeptical about locality information, since it would get out
of date and require rewriting old, large manifests.

On Fri, Nov 20, 2020 at 1:44 AM Vivekanand Vellanki <vi...@dremio.com>
wrote:

> Hi,
>
> I would like to propose additional fields in Iceberg manifest files
> <https://docs.google.com/document/d/1G6GeOXkGSiSTcu0lDS6VA1FtJ_uz9FO4tF2Pffmx9LU/edit#>
> to support the following scenarios:
>
>    - Partition index to include per-partition stats to help support
>    planning
>    - Data locality information to support split assignment in distributed
>    query engines
>
> Comments are welcome.
>
> --
> Thanks
> Vivek
>
>

-- 
Ryan Blue
Software Engineer
Netflix