You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@iceberg.apache.org by Ryan Blue <bl...@apache.org> on 2021/07/12 00:51:45 UTC

[DISCUSS] Adopting the v2 spec changes

Hi everyone,

I’d like to start a discussion about adopting the current set of changes to
the spec as v2. Adopting the current set of changes means that we won’t
have the ability to include any more breaking changes in v2. Any new
breaking change would require v3. This would not stop or affect any of the
ongoing work to add secondary indexes or new metadata because those
additions are forward-compatible (old readers can ignore it).

The main feature that has been added to the spec is row-level deletes. This
requires a breaking change because all readers must be updated to apply the
deletes when reading. Existing readers don’t do that, so a new format
version is required to ensure correctness for tables that track delete
files. Older readers are not forward-compatible with v2 tables.

Row-level delete changes include:

   - Addition of equality and positional delete files
   - Addition of content column in manifests to track the type of file
   (data, position deletes, equality deletes)
   - Addition of sequence_number column in manifests to track when in time
   a data or delete file was added
   - New rules for inheriting sequence numbers through the metadata tree to
   keep metadata write amplification low

In addition, v2 tightens requirements for additions that were compatible
with v1. For example, in v1 we added schemas and current-schema-id to table
metadata to allow tracking multiple versions of the table schema and
tracking the schema that was current when a given snapshot was written. The
table metadata fields were optional in v1 but became mandatory in v2. Also,
the schema field is removed in v2 in favor of the new fields.

Features that are supported by v2 are:

   - Row-level deletes
   - Multiple schemas, snapshot schemas
   - Multiple partition specs, better partition evolution
   - Table sort orders
   - Tracking identifier fields
   - NaN value counts for double and float fields
   - Manifest file encryption metadata

The full list of v2 changes <https://iceberg.apache.org/spec/#version-2> is
documented at the end of the spec.

I should also mention that v2 is not the default for tables in the
reference implementation (Java). We will continue building out better
support for row-level deletes, but I think that we are confident that the
row-level deletes design works and isn’t going to need breaking changes.

I think the next steps are to discuss the current v2 spec changes and make
sure everyone is comfortable adopting them. If that goes well, we’ll have a
vote to adopt the changes. Thanks for discussing, everyone!

Ryan
-- 
Ryan Blue

Re: [DISCUSS] Adopting the v2 spec changes

Posted by Ryan Blue <bl...@tabular.io>.

The changes that were suggested were just merged:
* #2864 <https://github.com/apache/iceberg/pull/2864> made the contains_nan
field optional
* #2805 <https://github.com/apache/iceberg/pull/2805> added back
distinct_counts (not required but good to have done, too)

It looks like there's mostly consensus around the v2 spec, and it's great
that we've already seen validation in both the Flink and Spark integration.
I'll start a vote thread to adopt the spec.

Ryan

On Fri, Jul 23, 2021 at 4:37 PM Ryan Blue <bl...@tabular.io> wrote:

> I'm going to revert the change to NaN tracking that makes that field
> required. I think we can make other fields required in table metadata.json
> files and manifests, but that one in the manifest list isn't a good idea.
> I'll open a PR to update it this weekend and I'll update the distinct
> counts PR from the discussion there. After that, I think we'll be ready to
> go.
>
> On Fri, Jul 23, 2021 at 4:29 PM Anton Okolnychyi
> <ao...@apple.com.invalid> wrote:
>
>> For the last month, I’ve been actively working on using the v2 spec in
>> Spark. Specifically, my focus is to implement merge-on-read using the
>> proposed API in Spark [1]. That’s why I would support the idea of adopting
>> v2 as the current design is sufficient to implement considered use cases. I
>> expect to find bugs and hit performance issues but I think we will be able
>> to solve them without breaking the forward compatibility.
>>
>> I’ll need a bit more time to dig into the NaN counts issue that Dan
>> mentioned but I overall support the idea of adopting the v2 spec without
>> making it default (until we prove its correctness and performance).
>>
>> - Anton
>>
>> [1] - https://github.com/apache/spark/pull/33008
>>
>> On 19 Jul 2021, at 17:01, Ryan Blue <bl...@tabular.io> wrote:
>>
>> I'll reply inline:
>>
>> On Thu, Jul 15, 2021 at 1:46 PM Daniel Weeks <dw...@apache.org> wrote:
>>
>>> Overall, I'm in favor of what Ryan has proposed for v2 above.  There are
>>> a few points that I'm not entirely clear on, which may warrant discussion:
>>>
>>> 1)  The discussion on resurfacing distinct_count may be something we
>>> want to include since there is some value.  I think it warrants keeping and
>>> dropping the deprecation (see distinct count discussion in dev list).  This
>>> actually isn't a change proposal to v2, but rather signifying that it will
>>> be carried forward.
>>>
>>
>> I don't think that this affects v2 because it is a non-breaking change to
>> forward-compatibility. I think of this as adding it back to v1 because we
>> can add optional fields without any problem. Older writers won't produce
>> it, but they never did before it was deprecated anyway. Older readers will
>> simply ignore it when reading files.
>>
>>
>>> 2)  NaN counts would now be required for field_summary and I'm not clear
>>> on what that implies for tables being promoted from v1 to v2 since missing
>>> this information would require evaluating all of the partition values and
>>> rewriting all of the manifest lists where this info is missing (it's still
>>> optional for data files).  I wasn't able to find the discussion on making
>>> this required, but I assume it is so that we can accurately apply filters
>>> at this level.  Is this case handled?
>>>
>>
>> You might be right about not making this required.
>>
>> The intent was to change just the writer requirements. While that isn't
>> strictly necessary for some fields, it helps ensure that writers follow the
>> format as it evolves. But the problem is that the caller may not have this
>> information (whether any partition tuple contained NaN for that field) when
>> writing a new manifest list file because the old (v1) manifest list file
>> didn't include it. That would be known if the manifest was rewritten, but
>> may not be if the manifest is carried forward into a new manifest list.
>>
>> Making this required would mean that we need to set it to something when
>> writing. Setting it to `true` would be fine because if the field is
>> missing, we assume there may be a NaN value in a partition tuple and scan
>> the manifest when looking for NaNs. But it does seem strange to require
>> that default so I'd support removing this.
>>
>> I should also mention that even though this is optional for data files,
>> we should always know when writing a manifest. That's because this is about
>> the partition tuple, which is always present when writing a manifest.
>>
>>
>>> 3)  If we're signaling the formal adoption of v2 by changing the default
>>> table created, do we want to do that in the next release (0.12 assuming the
>>> discussion/vote closes in time) or should we separate that with a second
>>> release (0.13 or 1.0) to separate all of the other changes going into the
>>> next release from the official format adoption?
>>>
>>
>> I don't think that we are signalling formal adoption by changing the
>> default. We are adopting by voting and will change the default when it
>> makes sense to do that based on real-world use. I think that adopting the
>> proposed v2 format means that we're confident that although there may be
>> implementation bugs, we are ready to support the new v2 format as we do v1.
>>
>> The Java release and v2 format adoption are somewhat orthogonal. We could
>> release 0.12.0 and when v2 is released later state that they are
>> compatible. I think the only question is whether we find anything that we
>> need to change in 0.12.0 in order to use v2 tables. That's why I think it
>> is good to adopt the spec without changing the default. People can opt into
>> using the new tables for the new use cases and we can get some real-world
>> use before changing the default. The main consequence of adoption is that
>> we commit to supporting tables written for the v2 spec (as opposed to
>> changing v2 in incompatible ways) and we start to guarantee
>> forward-compatibility (as opposed to making more optional -> required
>> changes).
>>
>> Ryan
>>
>> --
>> Ryan Blue
>> Tabular
>>
>>
>>
>
> --
> Ryan Blue
> Tabular
>


-- 
Ryan Blue
Tabular

Re: [DISCUSS] Adopting the v2 spec changes

Posted by Ryan Blue <bl...@tabular.io>.

I'm going to revert the change to NaN tracking that makes that field
required. I think we can make other fields required in table metadata.json
files and manifests, but that one in the manifest list isn't a good idea.
I'll open a PR to update it this weekend and I'll update the distinct
counts PR from the discussion there. After that, I think we'll be ready to
go.

On Fri, Jul 23, 2021 at 4:29 PM Anton Okolnychyi
<ao...@apple.com.invalid> wrote:

> For the last month, I’ve been actively working on using the v2 spec in
> Spark. Specifically, my focus is to implement merge-on-read using the
> proposed API in Spark [1]. That’s why I would support the idea of adopting
> v2 as the current design is sufficient to implement considered use cases. I
> expect to find bugs and hit performance issues but I think we will be able
> to solve them without breaking the forward compatibility.
>
> I’ll need a bit more time to dig into the NaN counts issue that Dan
> mentioned but I overall support the idea of adopting the v2 spec without
> making it default (until we prove its correctness and performance).
>
> - Anton
>
> [1] - https://github.com/apache/spark/pull/33008
>
> On 19 Jul 2021, at 17:01, Ryan Blue <bl...@tabular.io> wrote:
>
> I'll reply inline:
>
> On Thu, Jul 15, 2021 at 1:46 PM Daniel Weeks <dw...@apache.org> wrote:
>
>> Overall, I'm in favor of what Ryan has proposed for v2 above.  There are
>> a few points that I'm not entirely clear on, which may warrant discussion:
>>
>> 1)  The discussion on resurfacing distinct_count may be something we want
>> to include since there is some value.  I think it warrants keeping and
>> dropping the deprecation (see distinct count discussion in dev list).  This
>> actually isn't a change proposal to v2, but rather signifying that it will
>> be carried forward.
>>
>
> I don't think that this affects v2 because it is a non-breaking change to
> forward-compatibility. I think of this as adding it back to v1 because we
> can add optional fields without any problem. Older writers won't produce
> it, but they never did before it was deprecated anyway. Older readers will
> simply ignore it when reading files.
>
>
>> 2)  NaN counts would now be required for field_summary and I'm not clear
>> on what that implies for tables being promoted from v1 to v2 since missing
>> this information would require evaluating all of the partition values and
>> rewriting all of the manifest lists where this info is missing (it's still
>> optional for data files).  I wasn't able to find the discussion on making
>> this required, but I assume it is so that we can accurately apply filters
>> at this level.  Is this case handled?
>>
>
> You might be right about not making this required.
>
> The intent was to change just the writer requirements. While that isn't
> strictly necessary for some fields, it helps ensure that writers follow the
> format as it evolves. But the problem is that the caller may not have this
> information (whether any partition tuple contained NaN for that field) when
> writing a new manifest list file because the old (v1) manifest list file
> didn't include it. That would be known if the manifest was rewritten, but
> may not be if the manifest is carried forward into a new manifest list.
>
> Making this required would mean that we need to set it to something when
> writing. Setting it to `true` would be fine because if the field is
> missing, we assume there may be a NaN value in a partition tuple and scan
> the manifest when looking for NaNs. But it does seem strange to require
> that default so I'd support removing this.
>
> I should also mention that even though this is optional for data files, we
> should always know when writing a manifest. That's because this is about
> the partition tuple, which is always present when writing a manifest.
>
>
>> 3)  If we're signaling the formal adoption of v2 by changing the default
>> table created, do we want to do that in the next release (0.12 assuming the
>> discussion/vote closes in time) or should we separate that with a second
>> release (0.13 or 1.0) to separate all of the other changes going into the
>> next release from the official format adoption?
>>
>
> I don't think that we are signalling formal adoption by changing the
> default. We are adopting by voting and will change the default when it
> makes sense to do that based on real-world use. I think that adopting the
> proposed v2 format means that we're confident that although there may be
> implementation bugs, we are ready to support the new v2 format as we do v1.
>
> The Java release and v2 format adoption are somewhat orthogonal. We could
> release 0.12.0 and when v2 is released later state that they are
> compatible. I think the only question is whether we find anything that we
> need to change in 0.12.0 in order to use v2 tables. That's why I think it
> is good to adopt the spec without changing the default. People can opt into
> using the new tables for the new use cases and we can get some real-world
> use before changing the default. The main consequence of adoption is that
> we commit to supporting tables written for the v2 spec (as opposed to
> changing v2 in incompatible ways) and we start to guarantee
> forward-compatibility (as opposed to making more optional -> required
> changes).
>
> Ryan
>
> --
> Ryan Blue
> Tabular
>
>
>

-- 
Ryan Blue
Tabular

Re: [DISCUSS] Adopting the v2 spec changes

Posted by Anton Okolnychyi <ao...@apple.com.INVALID>.

For the last month, I’ve been actively working on using the v2 spec in Spark. Specifically, my focus is to implement merge-on-read using the proposed API in Spark [1]. That’s why I would support the idea of adopting v2 as the current design is sufficient to implement considered use cases. I expect to find bugs and hit performance issues but I think we will be able to solve them without breaking the forward compatibility.

I’ll need a bit more time to dig into the NaN counts issue that Dan mentioned but I overall support the idea of adopting the v2 spec without making it default (until we prove its correctness and performance).

- Anton

[1] - https://github.com/apache/spark/pull/33008 <https://github.com/apache/spark/pull/33008>

> On 19 Jul 2021, at 17:01, Ryan Blue <bl...@tabular.io> wrote:
> 
> I'll reply inline:
> 
> On Thu, Jul 15, 2021 at 1:46 PM Daniel Weeks <dweeks@apache.org <ma...@apache.org>> wrote:
> Overall, I'm in favor of what Ryan has proposed for v2 above.  There are a few points that I'm not entirely clear on, which may warrant discussion:
> 
> 1)  The discussion on resurfacing distinct_count may be something we want to include since there is some value.  I think it warrants keeping and dropping the deprecation (see distinct count discussion in dev list).  This actually isn't a change proposal to v2, but rather signifying that it will be carried forward.
> 
> I don't think that this affects v2 because it is a non-breaking change to forward-compatibility. I think of this as adding it back to v1 because we can add optional fields without any problem. Older writers won't produce it, but they never did before it was deprecated anyway. Older readers will simply ignore it when reading files.
>  
> 2)  NaN counts would now be required for field_summary and I'm not clear on what that implies for tables being promoted from v1 to v2 since missing this information would require evaluating all of the partition values and rewriting all of the manifest lists where this info is missing (it's still optional for data files).  I wasn't able to find the discussion on making this required, but I assume it is so that we can accurately apply filters at this level.  Is this case handled?
> 
> You might be right about not making this required.
> 
> The intent was to change just the writer requirements. While that isn't strictly necessary for some fields, it helps ensure that writers follow the format as it evolves. But the problem is that the caller may not have this information (whether any partition tuple contained NaN for that field) when writing a new manifest list file because the old (v1) manifest list file didn't include it. That would be known if the manifest was rewritten, but may not be if the manifest is carried forward into a new manifest list.
> 
> Making this required would mean that we need to set it to something when writing. Setting it to `true` would be fine because if the field is missing, we assume there may be a NaN value in a partition tuple and scan the manifest when looking for NaNs. But it does seem strange to require that default so I'd support removing this.
> 
> I should also mention that even though this is optional for data files, we should always know when writing a manifest. That's because this is about the partition tuple, which is always present when writing a manifest.
>  
> 3)  If we're signaling the formal adoption of v2 by changing the default table created, do we want to do that in the next release (0.12 assuming the discussion/vote closes in time) or should we separate that with a second release (0.13 or 1.0) to separate all of the other changes going into the next release from the official format adoption?
> 
> I don't think that we are signalling formal adoption by changing the default. We are adopting by voting and will change the default when it makes sense to do that based on real-world use. I think that adopting the proposed v2 format means that we're confident that although there may be implementation bugs, we are ready to support the new v2 format as we do v1.
> 
> The Java release and v2 format adoption are somewhat orthogonal. We could release 0.12.0 and when v2 is released later state that they are compatible. I think the only question is whether we find anything that we need to change in 0.12.0 in order to use v2 tables. That's why I think it is good to adopt the spec without changing the default. People can opt into using the new tables for the new use cases and we can get some real-world use before changing the default. The main consequence of adoption is that we commit to supporting tables written for the v2 spec (as opposed to changing v2 in incompatible ways) and we start to guarantee forward-compatibility (as opposed to making more optional -> required changes).
> 
> Ryan
> 
> -- 
> Ryan Blue
> Tabular

Re: [DISCUSS] Adopting the v2 spec changes

Posted by Ryan Blue <bl...@tabular.io>.

I'll reply inline:

On Thu, Jul 15, 2021 at 1:46 PM Daniel Weeks <dw...@apache.org> wrote:

> Overall, I'm in favor of what Ryan has proposed for v2 above.  There are a
> few points that I'm not entirely clear on, which may warrant discussion:
>
> 1)  The discussion on resurfacing distinct_count may be something we want
> to include since there is some value.  I think it warrants keeping and
> dropping the deprecation (see distinct count discussion in dev list).  This
> actually isn't a change proposal to v2, but rather signifying that it will
> be carried forward.
>

I don't think that this affects v2 because it is a non-breaking change to
forward-compatibility. I think of this as adding it back to v1 because we
can add optional fields without any problem. Older writers won't produce
it, but they never did before it was deprecated anyway. Older readers will
simply ignore it when reading files.

> 2)  NaN counts would now be required for field_summary and I'm not clear
> on what that implies for tables being promoted from v1 to v2 since missing
> this information would require evaluating all of the partition values and
> rewriting all of the manifest lists where this info is missing (it's still
> optional for data files).  I wasn't able to find the discussion on making
> this required, but I assume it is so that we can accurately apply filters
> at this level.  Is this case handled?
>

You might be right about not making this required.

The intent was to change just the writer requirements. While that isn't
strictly necessary for some fields, it helps ensure that writers follow the
format as it evolves. But the problem is that the caller may not have this
information (whether any partition tuple contained NaN for that field) when
writing a new manifest list file because the old (v1) manifest list file
didn't include it. That would be known if the manifest was rewritten, but
may not be if the manifest is carried forward into a new manifest list.

Making this required would mean that we need to set it to something when
writing. Setting it to `true` would be fine because if the field is
missing, we assume there may be a NaN value in a partition tuple and scan
the manifest when looking for NaNs. But it does seem strange to require
that default so I'd support removing this.

I should also mention that even though this is optional for data files, we
should always know when writing a manifest. That's because this is about
the partition tuple, which is always present when writing a manifest.

> 3)  If we're signaling the formal adoption of v2 by changing the default
> table created, do we want to do that in the next release (0.12 assuming the
> discussion/vote closes in time) or should we separate that with a second
> release (0.13 or 1.0) to separate all of the other changes going into the
> next release from the official format adoption?
>

I don't think that we are signalling formal adoption by changing the
default. We are adopting by voting and will change the default when it
makes sense to do that based on real-world use. I think that adopting the
proposed v2 format means that we're confident that although there may be
implementation bugs, we are ready to support the new v2 format as we do v1.

The Java release and v2 format adoption are somewhat orthogonal. We could
release 0.12.0 and when v2 is released later state that they are
compatible. I think the only question is whether we find anything that we
need to change in 0.12.0 in order to use v2 tables. That's why I think it
is good to adopt the spec without changing the default. People can opt into
using the new tables for the new use cases and we can get some real-world
use before changing the default. The main consequence of adoption is that
we commit to supporting tables written for the v2 spec (as opposed to
changing v2 in incompatible ways) and we start to guarantee
forward-compatibility (as opposed to making more optional -> required
changes).

Ryan

-- 
Ryan Blue
Tabular

Re: [DISCUSS] Adopting the v2 spec changes

Posted by Daniel Weeks <dw...@apache.org>.

Overall, I'm in favor of what Ryan has proposed for v2 above.  There are a
few points that I'm not entirely clear on, which may warrant discussion:

1)  The discussion on resurfacing distinct_count may be something we want
to include since there is some value.  I think it warrants keeping and
dropping the deprecation (see distinct count discussion in dev list).  This
actually isn't a change proposal to v2, but rather signifying that it will
be carried forward.

2)  NaN counts would now be required for field_summary and I'm not clear on
what that implies for tables being promoted from v1 to v2 since missing
this information would require evaluating all of the partition values and
rewriting all of the manifest lists where this info is missing (it's still
optional for data files).  I wasn't able to find the discussion on making
this required, but I assume it is so that we can accurately apply filters
at this level.  Is this case handled?

3)  If we're signaling the formal adoption of v2 by changing the default
table created, do we want to do that in the next release (0.12 assuming the
discussion/vote closes in time) or should we separate that with a second
release (0.13 or 1.0) to separate all of the other changes going into the
next release from the official format adoption?

-Dan

On Sun, Jul 11, 2021 at 5:51 PM Ryan Blue <bl...@apache.org> wrote:

> Hi everyone,
>
> I’d like to start a discussion about adopting the current set of changes
> to the spec as v2. Adopting the current set of changes means that we won’t
> have the ability to include any more breaking changes in v2. Any new
> breaking change would require v3. This would not stop or affect any of the
> ongoing work to add secondary indexes or new metadata because those
> additions are forward-compatible (old readers can ignore it).
>
> The main feature that has been added to the spec is row-level deletes.
> This requires a breaking change because all readers must be updated to
> apply the deletes when reading. Existing readers don’t do that, so a new
> format version is required to ensure correctness for tables that track
> delete files. Older readers are not forward-compatible with v2 tables.
>
> Row-level delete changes include:
>
>    - Addition of equality and positional delete files
>    - Addition of content column in manifests to track the type of file
>    (data, position deletes, equality deletes)
>    - Addition of sequence_number column in manifests to track when in
>    time a data or delete file was added
>    - New rules for inheriting sequence numbers through the metadata tree
>    to keep metadata write amplification low
>
> In addition, v2 tightens requirements for additions that were compatible
> with v1. For example, in v1 we added schemas and current-schema-id to
> table metadata to allow tracking multiple versions of the table schema and
> tracking the schema that was current when a given snapshot was written. The
> table metadata fields were optional in v1 but became mandatory in v2. Also,
> the schema field is removed in v2 in favor of the new fields.
>
> Features that are supported by v2 are:
>
>    - Row-level deletes
>    - Multiple schemas, snapshot schemas
>    - Multiple partition specs, better partition evolution
>    - Table sort orders
>    - Tracking identifier fields
>    - NaN value counts for double and float fields
>    - Manifest file encryption metadata
>
> The full list of v2 changes <https://iceberg.apache.org/spec/#version-2>
> is documented at the end of the spec.
>
> I should also mention that v2 is not the default for tables in the
> reference implementation (Java). We will continue building out better
> support for row-level deletes, but I think that we are confident that the
> row-level deletes design works and isn’t going to need breaking changes.
>
> I think the next steps are to discuss the current v2 spec changes and make
> sure everyone is comfortable adopting them. If that goes well, we’ll have a
> vote to adopt the changes. Thanks for discussing, everyone!
>
> Ryan
> --
> Ryan Blue
>