You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@iceberg.apache.org by OpenInx <op...@gmail.com> on 2020/04/01 02:50:51 UTC

Re: Open a new branch for row-delete feature ?

Hi Ryan & Miao

I'm fine if you think it's not the appropriate time to open a new feature
branch for row-delete feature. The point for us is to accomplish the
row-delete feature as soon as possible in apache iceberg, so that our
community users could benefit from it and adopt the iceberg to more
scenarios.
As we know the table specification changes is the foundation, so what do
you think about the newly introduced table specification in PR ?  If there
are some concerns in your side, please let us know, so that we could
iterate the patches and push the work forward.
1.   sequence_number.  https://github.com/apache/incubator-iceberg/pull/588.
the delete differential file is also considered as a data file, and the
2.   file_type. https://github.com/apache/incubator-iceberg/pull/885.   the
delete differential file is also considered as a data file, the field is
used for distinguishing wether is a normal data file,  a file/pos delete
file or equality-delete file.

Thanks.


On Wed, Apr 1, 2020 at 2:44 AM Miao Wang <mi...@adobe.com> wrote:

> +1 on not creating a branch now. Rebasing and maintenance are too
> expensive, comparing of fast development. Some additional thoughts below.
>
>
>
> Row-delete feature should be behind a feature flag, which implies that it
> should have minimum impact on Master branch if it is turned off. Working on
> Master avoids the pain of breaking Master at branch merge, which actually
> works at a fail-fast and fail-early mode.
>
>
>
> Working on Master Branch will not prevent splitting the feature into small
> items. Instead, it will encourage more people to work it and help the
> community stay focus on Master roadmap.
>
>
>
> Finally, if we think about rebasing, it either ends too expensive to
> rebase or easy to rebase. If it is the former case, we should not create a
> branch because it is hard to keep sync with Master. If it is the latter
> case, it has little impact on Master and there is no need to have a branch.
>
>
>
> Thanks!
>
>
>
> Miao
>
>
>
> *From: *Ryan Blue <rb...@netflix.com.INVALID>
> *Reply-To: *"dev@iceberg.apache.org" <de...@iceberg.apache.org>, "
> rblue@netflix.com" <rb...@netflix.com>
> *Date: *Tuesday, March 31, 2020 at 10:08 AM
> *To: *OpenInx <op...@gmail.com>
> *Cc: *Iceberg Dev List <de...@iceberg.apache.org>
> *Subject: *Re: Open a new branch for row-delete feature ?
>
>
>
> I'm fine starting a branch later if we do run into those issues, but I
> don't think it is a good idea to do it now in anticipation. All of the work
> that we can do on master we should try to do on master. We can start a
> branch when we need one.
>
>
>
> On Mon, Mar 30, 2020 at 7:44 PM OpenInx <op...@gmail.com> wrote:
>
> Hi Ryan
>
>
>
> The reason I suggest to open a new dev branch for row-delete development
> is:  we will split the whole feature into
>
> many small issues and each issue will have a pull request with appropriate
> length of code so the contributors/reviewers
>
> can discuss one point each time and make this feature a faster iteration.
> In the process of implementation, we will ensure
>
> that the v1 works for every separate PR but it may not ready for cutting
> release, for example, when release the 0.8.0 I'm
>
> sure we won't like the release version contains part of the v2 spec(such
> as provide the sequence_number, but no file_type).
>
> The spark reader/writer and data/delete manifest may also need some code
> refactor, it's possible to put them into several PR.
>
> Splitting into multiple Pull Requests may block the release of the new
> version for a certain period of time, that's not we want
>
> to see.
>
>
> About the new branch maintenance, in my experience we could rebase the new
> branch with master periodly(such as rebase
>
> for every three days), so that the new pull request for row-delete will be
> designed based on the newest changes. It should work
>
> for the master which would not have too many new change. This is in line
> with our current situation.
>
>
>
> In this case, I weighed the maintenance costs of the new branch against
> the delay of the row-delete. I think we should let the
>
> row-delete go a little faster (almost all community users are looking
> forward to this feature), and I think the current maintenance
>
> cost is acceptable.
>
>
>
> Thanks
>
>
>
> On Tue, Mar 31, 2020 at 5:52 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> Sorry, I didn't address the suggestion to add a Flink branch as well. The
> work needed for the Flink sink is to remove parts that are specific to
> Netflix, so I'm not sure what the rationale for a branch would be. Is there
> a reason why this can't be done in master, but requires a shared branch? If
> multiple people want to contribute, why not contribute to the same PR?
>
>
>
> A shared PR branch makes the most sense to me for this because it is
> regularly tested against master.
>
>
>
> On Mon, Mar 30, 2020 at 2:48 PM Ryan Blue <rb...@netflix.com> wrote:
>
> I think we will eventually may want a branch, but I think it is too early
> to create one now.
>
>
>
> Branches are expensive. They require maintenance to stay in sync with
> master, usually copying changes from master into the branch with updates.
> Updating the changes to master for the branch is more difficult because it
> is usually not the original contributor or reviewer porting them. And it is
> better to catch problems between changes in master and the branch early.
>
>
>
> I'm not against branches, but I don't want to create them unless they are
> valuable. In this case, I don't see the value. We plan to add v2 in
> parallel so you can still write v1 tables for compatibility, and most of
> the work that needs to be done -- like creating readers and writers for
> diff formats -- can be done in master.
>
>
>
> rb
>
>
>
> On Mon, Mar 30, 2020 at 9:00 AM Gautam <ga...@gmail.com> wrote:
>
> Thanks for bringing this up OpenInx.  That's a great idea: to open a
> separate branch for row-level deletes.
>
>
>
> I would like to help support/contribute/review this as well. If there are
> sub-tasks you guys have identified that can be added to
> https://github.com/apache/incubator-iceberg/milestone/4
> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fmilestone%2F4&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887717814&sdata=d%2BWSh2N5PBTVFP%2B9AZEwbg8ElpjoCpxbYLn8z2BSwSI%3D&reserved=0> we
> can start taking those up too.
>
>
>
> thanks for the good work,
>
> - Gautam.
>
>
>
>
>
>
>
> On Mon, Mar 30, 2020 at 8:39 AM Junjie Chen <ch...@gmail.com>
> wrote:
>
> +1 to create the branch. Some row-level delete subtasks must be based on
> the sequence number as well as end to end tests.
>
>
>
> On Fri, Mar 27, 2020 at 4:42 PM OpenInx <op...@gmail.com> wrote:
>
> Dear Dev:
>
>
>
>      Tuesday, we had a sync meeting. and discussed about the things:
>
>          1.  cut the 0.8.0 release;
>
>          2.  flink connector ;
>
>          3.  iceberg row-level delete;
>
>          4. Map-Reduce Formats and Hive support.
>
>
>
>       We'll release version 0.8.0 around April 15, the following 0.9.0
> will be
>
>      released in the next few month. On the other hand, Ryan, Junjie Chen
>
>      and I have done three PoC versions for the row-level deletes. We had
>
>      a full discussion[4] and started to do the relevant code design.
> we're sure that
>
>      the feature will introduce some incompatible specification,  such as
> the
>
>      sequence_number spec[1], file_type spec[2], the sortedOrder feature
> seems
>
>      also to be a breaking change [3].
>
>
>
>      To avoid affecting the release of version 0.8.0 and push the
> row-delete feature
>
>      early. I suggest to open a new branch for the row-delete feature,
> name it branch-1.
>
>      Once the row-delete feature is stable, we could release the 1.0.0. Or
> we can just
>
>      open a row-delete feature branch and once the work is done we will
> merge
>
>      the row-delete feature branch back to master branch, and continue to
> release the 0.9.0
>
>      version.
>
>
>
>      I guess the flink connector dev are facing the same problem ?
>
>
>
>      What do you think about this ?
>
>
>
>      Thank you.
>
>
>
>
>
>   [1]. https://github.com/apache/incubator-iceberg/pull/588
> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fpull%2F588&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887717814&sdata=P7N%2Bdwc3qejjdC%2F%2F5qU0eFn12ejmo0xG0kOfHBlRJPs%3D&reserved=0>
>
>   [2]. https://github.com/apache/incubator-iceberg/issues/824
> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fissues%2F824&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887727808&sdata=xotQnZYPCRUiw1cM83obHXhmp%2FePuwxH%2BDRW8fldJPA%3D&reserved=0>
>
>   [3]. https://github.com/apache/incubator-iceberg/issues/317
> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fissues%2F317&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887727808&sdata=5MqRFWfWJyM7XOyPBrnCYKDjDUWXw5nFTQV%2BN3znwMc%3D&reserved=0>
>
>   [4].
> https://docs.google.com/document/d/1CPFun2uG-eXdJggqKcPsTdNa2wPMpAdw8loeP-0fm_M/edit?usp=sharing
> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.google.com%2Fdocument%2Fd%2F1CPFun2uG-eXdJggqKcPsTdNa2wPMpAdw8loeP-0fm_M%2Fedit%3Fusp%3Dsharing&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887737807&sdata=qrFOHp2Ri3Q3mrHaOImcnDQWyopWDQEnkNtoWyxe3ME%3D&reserved=0>
>
>
>
>
>
>
> --
>
> Best Regards
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>

Re: Open a new branch for row-delete feature ?

Posted by OpenInx <op...@gmail.com>.
OK.  we're going to push this in two ways. the first thing is providing the
PR which don't depend on the table specification and the next thing is
providing a proposal to address the table specification things.

Thanks for the feedback.

On Thu, Apr 2, 2020 at 3:22 AM Ryan Blue <rb...@netflix.com> wrote:

> I agree with Miao’s points. I think what I missed in an email above was
> the assertion that we can move faster in a branch than merging PRs into
> master. We should be breaking the work down into small reviewable pieces,
> as you suggested, but merge them to master like Miao pointed out. That has
> the benefits of more iterative development and doesn’t have the costs of a
> branch.
>
> Iterative development is what we’ve always tried to encourage for this
> feature, which is why we broke the design down into next steps in the
> milestone. It’s good to see progress on those, like Junjie’s PR to add
> file/position delete files to the spec. There is still quite a bit to do
> that is unblocked:
>
>    - Write up a spec for equality delete files
>    - Implement a reader and writer for file/position deletes
>    - Implement a reader and writer for equality deletes
>    - Implement a row filter that applies a file/position delete file
>    (using a merge strategy)
>    - Implement a row filter that applies equality deletes using a hash set
>    - Implement a row filter that applies equality deletes using a merge
>    strategy
>    - Make progress on adding sort orders to the spec (know when to apply
>    the equality deletes with a merge)
>
> For the two PRs you referenced:
>
>    1. Adding sequence numbers is difficult because it requires an option
>    to move to the v2 format. This is going to take some careful planning that
>    I haven’t seen a proposal for yet, and haven’t had a chance to do myself
>    yet. I suggest focusing on other areas that are not blocked at the moment.
>    2. As I mentioned in the sync, how we will track delete files in
>    metadata is still outstanding. Maybe we will choose to add a type column
>    like this, but I’d like to have a design in mind before we merge these PRs.
>    Thinking through this and coming up with a proposal here is the next
>    priority for this work, because it will unlock more tasks we can do in
>    parallel.
>
>
> On Tue, Mar 31, 2020 at 7:51 PM OpenInx <op...@gmail.com> wrote:
>
>> Hi Ryan & Miao
>>
>> I'm fine if you think it's not the appropriate time to open a new feature
>> branch for row-delete feature. The point for us is to accomplish the
>> row-delete feature as soon as possible in apache iceberg, so that our
>> community users could benefit from it and adopt the iceberg to more
>> scenarios.
>> As we know the table specification changes is the foundation, so what do
>> you think about the newly introduced table specification in PR ?  If there
>> are some concerns in your side, please let us know, so that we could
>> iterate the patches and push the work forward.
>> 1.   sequence_number.
>> https://github.com/apache/incubator-iceberg/pull/588. the delete
>> differential file is also considered as a data file, and the
>> 2.   file_type. https://github.com/apache/incubator-iceberg/pull/885.
>>  the delete differential file is also considered as a data file, the field
>> is used for distinguishing wether is a normal data file,  a file/pos delete
>> file or equality-delete file.
>>
>> Thanks.
>>
>>
>> On Wed, Apr 1, 2020 at 2:44 AM Miao Wang <mi...@adobe.com> wrote:
>>
>>> +1 on not creating a branch now. Rebasing and maintenance are too
>>> expensive, comparing of fast development. Some additional thoughts below.
>>>
>>>
>>>
>>> Row-delete feature should be behind a feature flag, which implies that
>>> it should have minimum impact on Master branch if it is turned off. Working
>>> on Master avoids the pain of breaking Master at branch merge, which
>>> actually works at a fail-fast and fail-early mode.
>>>
>>>
>>>
>>> Working on Master Branch will not prevent splitting the feature into
>>> small items. Instead, it will encourage more people to work it and help the
>>> community stay focus on Master roadmap.
>>>
>>>
>>>
>>> Finally, if we think about rebasing, it either ends too expensive to
>>> rebase or easy to rebase. If it is the former case, we should not create a
>>> branch because it is hard to keep sync with Master. If it is the latter
>>> case, it has little impact on Master and there is no need to have a branch.
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>> Miao
>>>
>>>
>>>
>>> *From: *Ryan Blue <rb...@netflix.com.INVALID>
>>> *Reply-To: *"dev@iceberg.apache.org" <de...@iceberg.apache.org>, "
>>> rblue@netflix.com" <rb...@netflix.com>
>>> *Date: *Tuesday, March 31, 2020 at 10:08 AM
>>> *To: *OpenInx <op...@gmail.com>
>>> *Cc: *Iceberg Dev List <de...@iceberg.apache.org>
>>> *Subject: *Re: Open a new branch for row-delete feature ?
>>>
>>>
>>>
>>> I'm fine starting a branch later if we do run into those issues, but I
>>> don't think it is a good idea to do it now in anticipation. All of the work
>>> that we can do on master we should try to do on master. We can start a
>>> branch when we need one.
>>>
>>>
>>>
>>> On Mon, Mar 30, 2020 at 7:44 PM OpenInx <op...@gmail.com> wrote:
>>>
>>> Hi Ryan
>>>
>>>
>>>
>>> The reason I suggest to open a new dev branch for row-delete development
>>> is:  we will split the whole feature into
>>>
>>> many small issues and each issue will have a pull request with
>>> appropriate length of code so the contributors/reviewers
>>>
>>> can discuss one point each time and make this feature a faster
>>> iteration. In the process of implementation, we will ensure
>>>
>>> that the v1 works for every separate PR but it may not ready for cutting
>>> release, for example, when release the 0.8.0 I'm
>>>
>>> sure we won't like the release version contains part of the v2 spec(such
>>> as provide the sequence_number, but no file_type).
>>>
>>> The spark reader/writer and data/delete manifest may also need some code
>>> refactor, it's possible to put them into several PR.
>>>
>>> Splitting into multiple Pull Requests may block the release of the new
>>> version for a certain period of time, that's not we want
>>>
>>> to see.
>>>
>>>
>>> About the new branch maintenance, in my experience we could rebase the
>>> new branch with master periodly(such as rebase
>>>
>>> for every three days), so that the new pull request for row-delete will
>>> be designed based on the newest changes. It should work
>>>
>>> for the master which would not have too many new change. This is in line
>>> with our current situation.
>>>
>>>
>>>
>>> In this case, I weighed the maintenance costs of the new branch against
>>> the delay of the row-delete. I think we should let the
>>>
>>> row-delete go a little faster (almost all community users are looking
>>> forward to this feature), and I think the current maintenance
>>>
>>> cost is acceptable.
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>> On Tue, Mar 31, 2020 at 5:52 AM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>> Sorry, I didn't address the suggestion to add a Flink branch as well.
>>> The work needed for the Flink sink is to remove parts that are specific to
>>> Netflix, so I'm not sure what the rationale for a branch would be. Is there
>>> a reason why this can't be done in master, but requires a shared branch? If
>>> multiple people want to contribute, why not contribute to the same PR?
>>>
>>>
>>>
>>> A shared PR branch makes the most sense to me for this because it is
>>> regularly tested against master.
>>>
>>>
>>>
>>> On Mon, Mar 30, 2020 at 2:48 PM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>> I think we will eventually may want a branch, but I think it is too
>>> early to create one now.
>>>
>>>
>>>
>>> Branches are expensive. They require maintenance to stay in sync with
>>> master, usually copying changes from master into the branch with updates.
>>> Updating the changes to master for the branch is more difficult because it
>>> is usually not the original contributor or reviewer porting them. And it is
>>> better to catch problems between changes in master and the branch early.
>>>
>>>
>>>
>>> I'm not against branches, but I don't want to create them unless they
>>> are valuable. In this case, I don't see the value. We plan to add v2 in
>>> parallel so you can still write v1 tables for compatibility, and most of
>>> the work that needs to be done -- like creating readers and writers for
>>> diff formats -- can be done in master.
>>>
>>>
>>>
>>> rb
>>>
>>>
>>>
>>> On Mon, Mar 30, 2020 at 9:00 AM Gautam <ga...@gmail.com> wrote:
>>>
>>> Thanks for bringing this up OpenInx.  That's a great idea: to open a
>>> separate branch for row-level deletes.
>>>
>>>
>>>
>>> I would like to help support/contribute/review this as well. If there
>>> are sub-tasks you guys have identified that can be added to
>>> https://github.com/apache/incubator-iceberg/milestone/4
>>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fmilestone%2F4&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887717814&sdata=d%2BWSh2N5PBTVFP%2B9AZEwbg8ElpjoCpxbYLn8z2BSwSI%3D&reserved=0> we
>>> can start taking those up too.
>>>
>>>
>>>
>>> thanks for the good work,
>>>
>>> - Gautam.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Mar 30, 2020 at 8:39 AM Junjie Chen <ch...@gmail.com>
>>> wrote:
>>>
>>> +1 to create the branch. Some row-level delete subtasks must be based on
>>> the sequence number as well as end to end tests.
>>>
>>>
>>>
>>> On Fri, Mar 27, 2020 at 4:42 PM OpenInx <op...@gmail.com> wrote:
>>>
>>> Dear Dev:
>>>
>>>
>>>
>>>      Tuesday, we had a sync meeting. and discussed about the things:
>>>
>>>          1.  cut the 0.8.0 release;
>>>
>>>          2.  flink connector ;
>>>
>>>          3.  iceberg row-level delete;
>>>
>>>          4. Map-Reduce Formats and Hive support.
>>>
>>>
>>>
>>>       We'll release version 0.8.0 around April 15, the following 0.9.0
>>> will be
>>>
>>>      released in the next few month. On the other hand, Ryan, Junjie
>>> Chen
>>>
>>>      and I have done three PoC versions for the row-level deletes. We
>>> had
>>>
>>>      a full discussion[4] and started to do the relevant code design.
>>> we're sure that
>>>
>>>      the feature will introduce some incompatible specification,  such
>>> as the
>>>
>>>      sequence_number spec[1], file_type spec[2], the sortedOrder feature
>>> seems
>>>
>>>      also to be a breaking change [3].
>>>
>>>
>>>
>>>      To avoid affecting the release of version 0.8.0 and push the
>>> row-delete feature
>>>
>>>      early. I suggest to open a new branch for the row-delete feature,
>>> name it branch-1.
>>>
>>>      Once the row-delete feature is stable, we could release the 1.0.0.
>>> Or we can just
>>>
>>>      open a row-delete feature branch and once the work is done we will
>>> merge
>>>
>>>      the row-delete feature branch back to master branch, and continue
>>> to release the 0.9.0
>>>
>>>      version.
>>>
>>>
>>>
>>>      I guess the flink connector dev are facing the same problem ?
>>>
>>>
>>>
>>>      What do you think about this ?
>>>
>>>
>>>
>>>      Thank you.
>>>
>>>
>>>
>>>
>>>
>>>   [1]. https://github.com/apache/incubator-iceberg/pull/588
>>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fpull%2F588&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887717814&sdata=P7N%2Bdwc3qejjdC%2F%2F5qU0eFn12ejmo0xG0kOfHBlRJPs%3D&reserved=0>
>>>
>>>   [2]. https://github.com/apache/incubator-iceberg/issues/824
>>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fissues%2F824&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887727808&sdata=xotQnZYPCRUiw1cM83obHXhmp%2FePuwxH%2BDRW8fldJPA%3D&reserved=0>
>>>
>>>   [3]. https://github.com/apache/incubator-iceberg/issues/317
>>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fissues%2F317&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887727808&sdata=5MqRFWfWJyM7XOyPBrnCYKDjDUWXw5nFTQV%2BN3znwMc%3D&reserved=0>
>>>
>>>   [4].
>>> https://docs.google.com/document/d/1CPFun2uG-eXdJggqKcPsTdNa2wPMpAdw8loeP-0fm_M/edit?usp=sharing
>>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.google.com%2Fdocument%2Fd%2F1CPFun2uG-eXdJggqKcPsTdNa2wPMpAdw8loeP-0fm_M%2Fedit%3Fusp%3Dsharing&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887737807&sdata=qrFOHp2Ri3Q3mrHaOImcnDQWyopWDQEnkNtoWyxe3ME%3D&reserved=0>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Best Regards
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Ryan Blue
>>>
>>> Software Engineer
>>>
>>> Netflix
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Ryan Blue
>>>
>>> Software Engineer
>>>
>>> Netflix
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Ryan Blue
>>>
>>> Software Engineer
>>>
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Open a new branch for row-delete feature ?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I agree with Miao’s points. I think what I missed in an email above was the
assertion that we can move faster in a branch than merging PRs into master.
We should be breaking the work down into small reviewable pieces, as you
suggested, but merge them to master like Miao pointed out. That has the
benefits of more iterative development and doesn’t have the costs of a
branch.

Iterative development is what we’ve always tried to encourage for this
feature, which is why we broke the design down into next steps in the
milestone. It’s good to see progress on those, like Junjie’s PR to add
file/position delete files to the spec. There is still quite a bit to do
that is unblocked:

   - Write up a spec for equality delete files
   - Implement a reader and writer for file/position deletes
   - Implement a reader and writer for equality deletes
   - Implement a row filter that applies a file/position delete file (using
   a merge strategy)
   - Implement a row filter that applies equality deletes using a hash set
   - Implement a row filter that applies equality deletes using a merge
   strategy
   - Make progress on adding sort orders to the spec (know when to apply
   the equality deletes with a merge)

For the two PRs you referenced:

   1. Adding sequence numbers is difficult because it requires an option to
   move to the v2 format. This is going to take some careful planning that I
   haven’t seen a proposal for yet, and haven’t had a chance to do myself yet.
   I suggest focusing on other areas that are not blocked at the moment.
   2. As I mentioned in the sync, how we will track delete files in
   metadata is still outstanding. Maybe we will choose to add a type column
   like this, but I’d like to have a design in mind before we merge these PRs.
   Thinking through this and coming up with a proposal here is the next
   priority for this work, because it will unlock more tasks we can do in
   parallel.


On Tue, Mar 31, 2020 at 7:51 PM OpenInx <op...@gmail.com> wrote:

> Hi Ryan & Miao
>
> I'm fine if you think it's not the appropriate time to open a new feature
> branch for row-delete feature. The point for us is to accomplish the
> row-delete feature as soon as possible in apache iceberg, so that our
> community users could benefit from it and adopt the iceberg to more
> scenarios.
> As we know the table specification changes is the foundation, so what do
> you think about the newly introduced table specification in PR ?  If there
> are some concerns in your side, please let us know, so that we could
> iterate the patches and push the work forward.
> 1.   sequence_number.
> https://github.com/apache/incubator-iceberg/pull/588. the delete
> differential file is also considered as a data file, and the
> 2.   file_type. https://github.com/apache/incubator-iceberg/pull/885.
>  the delete differential file is also considered as a data file, the field
> is used for distinguishing wether is a normal data file,  a file/pos delete
> file or equality-delete file.
>
> Thanks.
>
>
> On Wed, Apr 1, 2020 at 2:44 AM Miao Wang <mi...@adobe.com> wrote:
>
>> +1 on not creating a branch now. Rebasing and maintenance are too
>> expensive, comparing of fast development. Some additional thoughts below.
>>
>>
>>
>> Row-delete feature should be behind a feature flag, which implies that it
>> should have minimum impact on Master branch if it is turned off. Working on
>> Master avoids the pain of breaking Master at branch merge, which actually
>> works at a fail-fast and fail-early mode.
>>
>>
>>
>> Working on Master Branch will not prevent splitting the feature into
>> small items. Instead, it will encourage more people to work it and help the
>> community stay focus on Master roadmap.
>>
>>
>>
>> Finally, if we think about rebasing, it either ends too expensive to
>> rebase or easy to rebase. If it is the former case, we should not create a
>> branch because it is hard to keep sync with Master. If it is the latter
>> case, it has little impact on Master and there is no need to have a branch.
>>
>>
>>
>> Thanks!
>>
>>
>>
>> Miao
>>
>>
>>
>> *From: *Ryan Blue <rb...@netflix.com.INVALID>
>> *Reply-To: *"dev@iceberg.apache.org" <de...@iceberg.apache.org>, "
>> rblue@netflix.com" <rb...@netflix.com>
>> *Date: *Tuesday, March 31, 2020 at 10:08 AM
>> *To: *OpenInx <op...@gmail.com>
>> *Cc: *Iceberg Dev List <de...@iceberg.apache.org>
>> *Subject: *Re: Open a new branch for row-delete feature ?
>>
>>
>>
>> I'm fine starting a branch later if we do run into those issues, but I
>> don't think it is a good idea to do it now in anticipation. All of the work
>> that we can do on master we should try to do on master. We can start a
>> branch when we need one.
>>
>>
>>
>> On Mon, Mar 30, 2020 at 7:44 PM OpenInx <op...@gmail.com> wrote:
>>
>> Hi Ryan
>>
>>
>>
>> The reason I suggest to open a new dev branch for row-delete development
>> is:  we will split the whole feature into
>>
>> many small issues and each issue will have a pull request with
>> appropriate length of code so the contributors/reviewers
>>
>> can discuss one point each time and make this feature a faster iteration.
>> In the process of implementation, we will ensure
>>
>> that the v1 works for every separate PR but it may not ready for cutting
>> release, for example, when release the 0.8.0 I'm
>>
>> sure we won't like the release version contains part of the v2 spec(such
>> as provide the sequence_number, but no file_type).
>>
>> The spark reader/writer and data/delete manifest may also need some code
>> refactor, it's possible to put them into several PR.
>>
>> Splitting into multiple Pull Requests may block the release of the new
>> version for a certain period of time, that's not we want
>>
>> to see.
>>
>>
>> About the new branch maintenance, in my experience we could rebase the
>> new branch with master periodly(such as rebase
>>
>> for every three days), so that the new pull request for row-delete will
>> be designed based on the newest changes. It should work
>>
>> for the master which would not have too many new change. This is in line
>> with our current situation.
>>
>>
>>
>> In this case, I weighed the maintenance costs of the new branch against
>> the delay of the row-delete. I think we should let the
>>
>> row-delete go a little faster (almost all community users are looking
>> forward to this feature), and I think the current maintenance
>>
>> cost is acceptable.
>>
>>
>>
>> Thanks
>>
>>
>>
>> On Tue, Mar 31, 2020 at 5:52 AM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>> Sorry, I didn't address the suggestion to add a Flink branch as well. The
>> work needed for the Flink sink is to remove parts that are specific to
>> Netflix, so I'm not sure what the rationale for a branch would be. Is there
>> a reason why this can't be done in master, but requires a shared branch? If
>> multiple people want to contribute, why not contribute to the same PR?
>>
>>
>>
>> A shared PR branch makes the most sense to me for this because it is
>> regularly tested against master.
>>
>>
>>
>> On Mon, Mar 30, 2020 at 2:48 PM Ryan Blue <rb...@netflix.com> wrote:
>>
>> I think we will eventually may want a branch, but I think it is too early
>> to create one now.
>>
>>
>>
>> Branches are expensive. They require maintenance to stay in sync with
>> master, usually copying changes from master into the branch with updates.
>> Updating the changes to master for the branch is more difficult because it
>> is usually not the original contributor or reviewer porting them. And it is
>> better to catch problems between changes in master and the branch early.
>>
>>
>>
>> I'm not against branches, but I don't want to create them unless they are
>> valuable. In this case, I don't see the value. We plan to add v2 in
>> parallel so you can still write v1 tables for compatibility, and most of
>> the work that needs to be done -- like creating readers and writers for
>> diff formats -- can be done in master.
>>
>>
>>
>> rb
>>
>>
>>
>> On Mon, Mar 30, 2020 at 9:00 AM Gautam <ga...@gmail.com> wrote:
>>
>> Thanks for bringing this up OpenInx.  That's a great idea: to open a
>> separate branch for row-level deletes.
>>
>>
>>
>> I would like to help support/contribute/review this as well. If there are
>> sub-tasks you guys have identified that can be added to
>> https://github.com/apache/incubator-iceberg/milestone/4
>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fmilestone%2F4&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887717814&sdata=d%2BWSh2N5PBTVFP%2B9AZEwbg8ElpjoCpxbYLn8z2BSwSI%3D&reserved=0> we
>> can start taking those up too.
>>
>>
>>
>> thanks for the good work,
>>
>> - Gautam.
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Mar 30, 2020 at 8:39 AM Junjie Chen <ch...@gmail.com>
>> wrote:
>>
>> +1 to create the branch. Some row-level delete subtasks must be based on
>> the sequence number as well as end to end tests.
>>
>>
>>
>> On Fri, Mar 27, 2020 at 4:42 PM OpenInx <op...@gmail.com> wrote:
>>
>> Dear Dev:
>>
>>
>>
>>      Tuesday, we had a sync meeting. and discussed about the things:
>>
>>          1.  cut the 0.8.0 release;
>>
>>          2.  flink connector ;
>>
>>          3.  iceberg row-level delete;
>>
>>          4. Map-Reduce Formats and Hive support.
>>
>>
>>
>>       We'll release version 0.8.0 around April 15, the following 0.9.0
>> will be
>>
>>      released in the next few month. On the other hand, Ryan, Junjie Chen
>>
>>      and I have done three PoC versions for the row-level deletes. We had
>>
>>      a full discussion[4] and started to do the relevant code design.
>> we're sure that
>>
>>      the feature will introduce some incompatible specification,  such as
>> the
>>
>>      sequence_number spec[1], file_type spec[2], the sortedOrder feature
>> seems
>>
>>      also to be a breaking change [3].
>>
>>
>>
>>      To avoid affecting the release of version 0.8.0 and push the
>> row-delete feature
>>
>>      early. I suggest to open a new branch for the row-delete feature,
>> name it branch-1.
>>
>>      Once the row-delete feature is stable, we could release the 1.0.0.
>> Or we can just
>>
>>      open a row-delete feature branch and once the work is done we will
>> merge
>>
>>      the row-delete feature branch back to master branch, and continue to
>> release the 0.9.0
>>
>>      version.
>>
>>
>>
>>      I guess the flink connector dev are facing the same problem ?
>>
>>
>>
>>      What do you think about this ?
>>
>>
>>
>>      Thank you.
>>
>>
>>
>>
>>
>>   [1]. https://github.com/apache/incubator-iceberg/pull/588
>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fpull%2F588&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887717814&sdata=P7N%2Bdwc3qejjdC%2F%2F5qU0eFn12ejmo0xG0kOfHBlRJPs%3D&reserved=0>
>>
>>   [2]. https://github.com/apache/incubator-iceberg/issues/824
>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fissues%2F824&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887727808&sdata=xotQnZYPCRUiw1cM83obHXhmp%2FePuwxH%2BDRW8fldJPA%3D&reserved=0>
>>
>>   [3]. https://github.com/apache/incubator-iceberg/issues/317
>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fissues%2F317&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887727808&sdata=5MqRFWfWJyM7XOyPBrnCYKDjDUWXw5nFTQV%2BN3znwMc%3D&reserved=0>
>>
>>   [4].
>> https://docs.google.com/document/d/1CPFun2uG-eXdJggqKcPsTdNa2wPMpAdw8loeP-0fm_M/edit?usp=sharing
>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.google.com%2Fdocument%2Fd%2F1CPFun2uG-eXdJggqKcPsTdNa2wPMpAdw8loeP-0fm_M%2Fedit%3Fusp%3Dsharing&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887737807&sdata=qrFOHp2Ri3Q3mrHaOImcnDQWyopWDQEnkNtoWyxe3ME%3D&reserved=0>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Best Regards
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix