Posted to dev@iceberg.apache.org by Huadong Liu <hu...@gmail.com> on 2021/05/15 00:00:46 UTC

Stableness of V2 Spec/API

Hi iceberg-dev,

I tried v2 row-level deletion by committing equality delete files after
*upgradeToFormatVersion(2)*. It worked well. I know that Spark actions
to compact delete files and data files
<https://github.com/apache/iceberg/milestone/4>, among others, are
still in progress. I currently use the Java API for updates, queries,
and maintenance ops. I am not using Flink at the moment, and I will
definitely pick up the Spark actions once they are completed. Deletions
can be scheduled in batches (e.g. weekly) to control the volume of
delete files. I want to get a sense of the risk of losing data at some
point due to v2 Spec/API changes if I start using the v2 format now. It
is not an easy question. Any input is appreciated.
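
For concreteness, here is a minimal sketch of what I did with the Java
API. The table location, equality field id, file path, and metrics
below are placeholders, I'm assuming an unpartitioned table loaded
through HadoopTables, and the delete file contents themselves are
written separately:

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.BaseTable;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.FileMetadata;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableOperations;
import org.apache.iceberg.hadoop.HadoopTables;

public class V2EqualityDeleteSketch {
  public static void main(String[] args) {
    HadoopTables tables = new HadoopTables(new Configuration());
    // placeholder table location
    Table table = tables.load("hdfs://nn/warehouse/db/events");

    // upgrade the table metadata to format v2 (skip if already upgraded)
    TableOperations ops = ((BaseTable) table).operations();
    TableMetadata current = ops.current();
    if (current.formatVersion() < 2) {
      ops.commit(current, current.upgradeToFormatVersion(2));
    }

    // describe an equality delete file that was written out separately,
    // keyed on field id 1 (placeholder for the equality column)
    DeleteFile eqDeletes = FileMetadata.deleteFileBuilder(table.spec())
        .ofEqualityDeletes(1)
        .withPath("hdfs://nn/warehouse/db/events/data/eq-deletes-0001.parquet")
        .withFormat(FileFormat.PARQUET)
        .withFileSizeInBytes(1024L)  // placeholder metrics
        .withRecordCount(10L)
        .build();

    // commit the delete file as a new row-delta snapshot
    table.newRowDelta()
        .addDeletes(eqDeletes)
        .commit();
  }
}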

--
Huadong

Re: Stableness of V2 Spec/API

Posted by Ryan Blue <bl...@apache.org>.
I just commented on #2303. I think we should get that fixed fairly soon --
at least an interim fix to ensure that compaction correctly catches the
problem and fails. The plan for the long-term fix looks good to me as well.

On Mon, May 17, 2021 at 7:17 PM OpenInx <op...@gmail.com> wrote:

> PR 2303 defines how the batch job does the compaction work, while issue
> 2308 decides what the behavior should be when a compaction txn and a
> row-delta txn commit at the same time. They shouldn't block each other,
> but we will need to resolve both of them.
>
> On Tue, May 18, 2021 at 9:36 AM Huadong Liu <hu...@gmail.com> wrote:
>
>> Thanks. Compaction is https://github.com/apache/iceberg/pull/2303 and it
>> is currently blocked by https://github.com/apache/iceberg/issues/2308?
>>
>> On Mon, May 17, 2021 at 6:17 PM OpenInx <op...@gmail.com> wrote:
>>
>>> Hi Huadong
>>>
>>> From the perspective of Iceberg developers, we don't expose format v2
>>> to end users yet because we think there is still other work that needs
>>> to be done. As you can see, there are still some unfinished issues in
>>> the milestone you linked. As for whether v2 will cause data loss: from
>>> my perspective as a designer, semantics and correctness are handled
>>> very rigorously as long as we don't do any compaction. Once we
>>> introduce the compaction action, we will run into this issue:
>>> https://github.com/apache/iceberg/issues/2308. We've proposed a
>>> solution but have not yet reached agreement in the community. I would
>>> suggest using v2 in production only after we resolve at least this
>>> issue.
>>>
>>> On Sat, May 15, 2021 at 8:01 AM Huadong Liu <hu...@gmail.com>
>>> wrote:
>>>
>>>> Hi iceberg-dev,
>>>>
>>>> I tried v2 row-level deletion by committing equality delete files after
>>>> *upgradeToFormatVersion(2)*. It worked well. I know that Spark actions
>>>> to compact delete files and data files
>>>> <https://github.com/apache/iceberg/milestone/4>, among others, are
>>>> still in progress. I currently use the Java API for updates, queries,
>>>> and maintenance ops. I am not using Flink at the moment, and I will
>>>> definitely pick up the Spark actions once they are completed. Deletions
>>>> can be scheduled in batches (e.g. weekly) to control the volume of
>>>> delete files. I want to get a sense of the risk of losing data at some
>>>> point due to v2 Spec/API changes if I start using the v2 format now. It
>>>> is not an easy question. Any input is appreciated.
>>>>
>>>> --
>>>> Huadong
>>>>
>>>

-- 
Ryan Blue

Re: Stableness of V2 Spec/API

Posted by OpenInx <op...@gmail.com>.
PR 2303 defines how the batch job does the compaction work, while issue
2308 decides what the behavior should be when a compaction txn and a
row-delta txn commit at the same time. They shouldn't block each other,
but we will need to resolve both of them.
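
For context, a rough sketch of the two commit paths that can race. The
helper methods and variables below are hypothetical (this is not code
from either PR); validateDataFilesExist is the existing guard that
makes a row-delta commit fail, rather than silently committing deletes
that reference a data file a concurrent rewrite already removed:

import java.util.Collections;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.Table;

public class CompactionVsRowDeltaSketch {
  // compaction txn: replace a small data file with a compacted one
  static void compact(Table table, DataFile smallFile, DataFile compacted) {
    table.newRewrite()
        .rewriteFiles(Collections.singleton(smallFile),
                      Collections.singleton(compacted))
        .commit();
  }

  // concurrent row-delta txn: the deletes were planned against the old
  // file as of readSnapshotId; validateDataFilesExist makes this commit
  // fail if that file was removed (e.g. by the rewrite above) in the
  // meantime, instead of committing deletes that no longer apply
  static void commitDeletes(Table table, long readSnapshotId,
                            DataFile referenced, DeleteFile deletes) {
    table.newRowDelta()
        .validateFromSnapshot(readSnapshotId)
        .validateDataFilesExist(Collections.singletonList(referenced.path()))
        .addDeletes(deletes)
        .commit();
  }
}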

On Tue, May 18, 2021 at 9:36 AM Huadong Liu <hu...@gmail.com> wrote:

> Thanks. Compaction is https://github.com/apache/iceberg/pull/2303 and it
> is currently blocked by https://github.com/apache/iceberg/issues/2308?
>
> On Mon, May 17, 2021 at 6:17 PM OpenInx <op...@gmail.com> wrote:
>
>> Hi Huadong
>>
>> From the perspective of Iceberg developers, we don't expose format v2
>> to end users yet because we think there is still other work that needs
>> to be done. As you can see, there are still some unfinished issues in
>> the milestone you linked. As for whether v2 will cause data loss: from
>> my perspective as a designer, semantics and correctness are handled
>> very rigorously as long as we don't do any compaction. Once we
>> introduce the compaction action, we will run into this issue:
>> https://github.com/apache/iceberg/issues/2308. We've proposed a
>> solution but have not yet reached agreement in the community. I would
>> suggest using v2 in production only after we resolve at least this
>> issue.
>>
>> On Sat, May 15, 2021 at 8:01 AM Huadong Liu <hu...@gmail.com> wrote:
>>
>>> Hi iceberg-dev,
>>>
>>> I tried v2 row-level deletion by committing equality delete files after
>>> *upgradeToFormatVersion(2)*. It worked well. I know that Spark actions
>>> to compact delete files and data files
>>> <https://github.com/apache/iceberg/milestone/4>, among others, are
>>> still in progress. I currently use the Java API for updates, queries,
>>> and maintenance ops. I am not using Flink at the moment, and I will
>>> definitely pick up the Spark actions once they are completed. Deletions
>>> can be scheduled in batches (e.g. weekly) to control the volume of
>>> delete files. I want to get a sense of the risk of losing data at some
>>> point due to v2 Spec/API changes if I start using the v2 format now. It
>>> is not an easy question. Any input is appreciated.
>>>
>>> --
>>> Huadong
>>>
>>

Re: Stableness of V2 Spec/API

Posted by Huadong Liu <hu...@gmail.com>.
Thanks. Compaction is https://github.com/apache/iceberg/pull/2303 and it is
currently blocked by https://github.com/apache/iceberg/issues/2308?

On Mon, May 17, 2021 at 6:17 PM OpenInx <op...@gmail.com> wrote:

> Hi Huadong
>
> From the perspective of Iceberg developers, we don't expose format v2
> to end users yet because we think there is still other work that needs
> to be done. As you can see, there are still some unfinished issues in
> the milestone you linked. As for whether v2 will cause data loss: from
> my perspective as a designer, semantics and correctness are handled
> very rigorously as long as we don't do any compaction. Once we
> introduce the compaction action, we will run into this issue:
> https://github.com/apache/iceberg/issues/2308. We've proposed a
> solution but have not yet reached agreement in the community. I would
> suggest using v2 in production only after we resolve at least this
> issue.
>
> On Sat, May 15, 2021 at 8:01 AM Huadong Liu <hu...@gmail.com> wrote:
>
>> Hi iceberg-dev,
>>
>> I tried v2 row-level deletion by committing equality delete files after
>> *upgradeToFormatVersion(2)*. It worked well. I know that Spark actions
>> to compact delete files and data files
>> <https://github.com/apache/iceberg/milestone/4>, among others, are
>> still in progress. I currently use the Java API for updates, queries,
>> and maintenance ops. I am not using Flink at the moment, and I will
>> definitely pick up the Spark actions once they are completed. Deletions
>> can be scheduled in batches (e.g. weekly) to control the volume of
>> delete files. I want to get a sense of the risk of losing data at some
>> point due to v2 Spec/API changes if I start using the v2 format now. It
>> is not an easy question. Any input is appreciated.
>>
>> --
>> Huadong
>>
>

Re: Stableness of V2 Spec/API

Posted by OpenInx <op...@gmail.com>.
Hi Huadong

From the perspective of Iceberg developers, we don't expose format v2
to end users yet because we think there is still other work that needs
to be done. As you can see, there are still some unfinished issues in
the milestone you linked. As for whether v2 will cause data loss: from
my perspective as a designer, semantics and correctness are handled
very rigorously as long as we don't do any compaction. Once we
introduce the compaction action, we will run into this issue:
https://github.com/apache/iceberg/issues/2308. We've proposed a
solution but have not yet reached agreement in the community. I would
suggest using v2 in production only after we resolve at least this
issue.

On Sat, May 15, 2021 at 8:01 AM Huadong Liu <hu...@gmail.com> wrote:

> Hi iceberg-dev,
>
> I tried v2 row-level deletion by committing equality delete files after
> *upgradeToFormatVersion(2)*. It worked well. I know that Spark actions
> to compact delete files and data files
> <https://github.com/apache/iceberg/milestone/4>, among others, are
> still in progress. I currently use the Java API for updates, queries,
> and maintenance ops. I am not using Flink at the moment, and I will
> definitely pick up the Spark actions once they are completed. Deletions
> can be scheduled in batches (e.g. weekly) to control the volume of
> delete files. I want to get a sense of the risk of losing data at some
> point due to v2 Spec/API changes if I start using the v2 format now. It
> is not an easy question. Any input is appreciated.
>
> --
> Huadong
>