Posted to dev@iceberg.apache.org by OpenInx <op...@gmail.com> on 2022/02/18 06:23:58 UTC

Re: [DISCUSS] Iceberg roadmap

Update:

Now that the Dell EMC EcsFileIO has been merged into the official Apache
Iceberg repo, I think it's okay to close this roadmap project:
https://github.com/apache/iceberg/projects/22

Thanks.

On Wed, Nov 10, 2021 at 10:22 AM Zhao Chun <zh...@apache.org> wrote:

> Thanks Ryan.
> We will keep a close eye on what is happening in the iceberg community and
> seek help when necessary.
>
> Thanks,
> Zhao Chun
>
>
> On Wed, Nov 10, 2021 at 8:54 AM Ryan Blue <bl...@tabular.io> wrote:
>
>> Thanks, Zhao. I think those are great ways to work together. Let us know
>> how we can help you make StarRocks successful with Iceberg as its data
>> format. We're always happy to help people understand how Iceberg works and
>> improve our docs on how to use it.
>>
>> Ryan
>>
>> On Mon, Nov 8, 2021 at 8:17 PM Zhao Chun <zh...@apache.org> wrote:
>>
>>> I feel that Ryan's response exemplifies the generosity of an Apache
>>> project creator,
>>> a quality that has touched and benefited us. We look forward to
>>> contributing
>>> further to the Apache project in the future.
>>> As for an issue to track progress, I don't think we need one for now.
>>> At the moment the main development work is done in the StarRocks
>>> repository.
>>> As for further cooperation in the future, I think there are several
>>> aspects.
>>> 1. StarRocks will be trying to support Iceberg.
>>> I think this will help StarRocks to re-examine how it integrates with
>>> the lakehouse system
>>> and we will be happy to feed back to the Apache Iceberg community the
>>> issues and benefits
>>> we encounter during the integration process.
>>> This will also validate the versatility of the iceberg project to
>>> support more query engines.
>>> I think this project will benefit both projects.
>>> 2. In the future, we will share some of our best practices for iceberg
>>> and StarRocks integration in a blog or talk.
>>> If the Apache Iceberg project feels that these blogs or talks would be
>>> beneficial to the Apache iceberg community,
>>> please consider linking our subsequent blogs or talks to the apache
>>> iceberg website blog.
>>> The Iceberg community can, of course, not link if they feel it is
>>> inappropriate.
>>> 3. We expect to contribute to the Apache Iceberg community under the
>>> Apache License V2.
>>>
>>> Thanks,
>>> Zhao Chun
>>>
>>>
>>> On Tue, Nov 9, 2021 at 3:05 AM Ryan Blue <bl...@tabular.io> wrote:
>>>
>>>> I think it is great to see another processing engine adding support for
>>>> Apache Iceberg, and I do look forward to collaborating with the StarRocks
>>>> community in the future.
>>>>
>>>> I'm not entirely sure what that collaboration would look like just yet
>>>> though. For most processing engines, it is people joining the Apache
>>>> Iceberg community. No matter what the license of the downstream project, we
>>>> always welcome more people contributing here!
>>>>
>>>> As for opening a project in our tracker, I'm not sure it makes sense to
>>>> do that just yet. As far as I know there aren't any issues to track there.
>>>> And would the StarRocks community find it helpful?
>>>>
>>>> On Mon, Nov 8, 2021 at 12:14 AM Zhao Chun <bu...@gmail.com> wrote:
>>>>
>>>>> Thanks to @OpenInx for mentioning StarRocks in the iceberg community.
>>>>>
>>>>> I'm from the StarRocks community.
>>>>>
>>>>> StarRocks is based on the Apache Doris project.
>>>>> It has been in development internally for almost two years and is
>>>>> currently used by hundreds of companies.
>>>>> It was just opened 2 months ago.
>>>>>
>>>>> Iceberg is a great project that makes analysis of huge datasets more
>>>>> convenient.
>>>>> The StarRocks community is planning to support the iceberg engine.
>>>>> This will provide StarRocks users with the ability to analyze data in
>>>>> iceberg.
>>>>>
>>>>> Regarding the license, StarRocks' ELv2 will not affect our
>>>>> contribution to the iceberg community under the Apache License V2.
>>>>>
>>>>> We are also looking forward to receiving help from the iceberg
>>>>> community and will be contributing back to the iceberg community.
>>>>>
>>>>> Thanks,
>>>>> Zhao Chun
>>>>>
>>>>>
>>>>> On Mon, Nov 8, 2021 at 2:53 PM Kyle Bendickson <ky...@tabular.io> wrote:
>>>>>
>>>>>> +1 around concerns with the Elastic license.
>>>>>>
>>>>>> Also, more importantly, how important is integration with either of
>>>>>> these tools to the Iceberg community and contributors?
>>>>>>
>>>>>> The Elastic license makes a bit more sense for elasticsearch, as it
>>>>>> was an existing project for quite some time. I won’t reiterate the details
>>>>>> of that situation, but it’s odd to see a fork of a new, active project
>>>>>> using the Elastic license in my opinion.
>>>>>>
>>>>>> StarRocks admits that at least 40% of their code comes from the Apache
>>>>>> Doris project.
>>>>>>
>>>>>> That said, StarRocks claims to not require other dependencies. It
>>>>>> seems StarRocks supports query federation with a few tools so as not to
>>>>>> have to import the data and query those systems directly. So I’m not sure
>>>>>> what Iceberg support would look like beyond additional query federation.
>>>>>> What benefit does this provide?
>>>>>>
>>>>>> If we determined that integration with one of these tools was
>>>>>> something the community valued, could a connector be built to target the
>>>>>> Apache Doris project and then StarRocks could fork that code if they liked?
>>>>>>
>>>>>> - Kyle Bendickson
>>>>>> GitHub @kbendick
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Nov 7, 2021 at 9:24 PM Reo Lei <le...@gmail.com> wrote:
>>>>>>
>>>>>>> +1, I have the same concern for the incompatible license.
>>>>>>>
>>>>>>> On Mon, Nov 8, 2021 at 11:48 AM Jacques Nadeau <ja...@gmail.com> wrote:
>>>>>>>
>>>>>>>> A few additional observations about StarRocks...
>>>>>>>>
>>>>>>>> - As far as I can tell, StarRocks has an ASF incompatible license
>>>>>>>> (Elastic License 2.0).
>>>>>>>> - It appears to be a hard fork of Apache Doris, a project still in
>>>>>>>> the incubator (and looks like it probably is destructive to the Doris
>>>>>>>> project)
>>>>>>>> - The project has only existed for ~2 months.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Nov 7, 2021 at 7:34 PM OpenInx <op...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Any thoughts on adding StarRocks integration to the roadmap?
>>>>>>>>>
>>>>>>>>> I think the folks from the StarRocks community can provide more
>>>>>>>>> background and input.
>>>>>>>>>
>>>>>>>>> On Thu, Nov 4, 2021 at 5:59 PM OpenInx <op...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Update:
>>>>>>>>>>
>>>>>>>>>> StarRocks[1] is a next-gen sub-second MPP database for full
>>>>>>>>>> analysis scenarios, including multi-dimensional analytics, real-time
>>>>>>>>>> analytics and ad-hoc queries. Their team is planning to integrate Iceberg
>>>>>>>>>> tables as StarRocks external tables in the next month [2], so that people
>>>>>>>>>> can connect the data lake and the StarRocks warehouse in the same engine.
>>>>>>>>>> The excellent performance of StarRocks will also help accelerate
>>>>>>>>>> analysis of and access to Iceberg tables. I think this is a great thing
>>>>>>>>>> for both the Iceberg community and the StarRocks community. Could we
>>>>>>>>>> add an extra project for the StarRocks integration work to the Apache
>>>>>>>>>> Iceberg roadmap [3]?
>>>>>>>>>>
>>>>>>>>>> [1].  https://github.com/StarRocks/starrocks
>>>>>>>>>> [2].  https://github.com/StarRocks/starrocks/issues/1030
>>>>>>>>>> [3].  https://github.com/apache/iceberg/projects
>>>>>>>>>>
>>>>>>>>>> On Mon, Nov 1, 2021 at 11:52 PM Ryan Blue <bl...@tabular.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I closed the upgrade project and marked the FLIP-27 project
>>>>>>>>>>> priority 1. Thanks for all the work to get this done!
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Oct 31, 2021 at 8:10 PM OpenInx <op...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Update:
>>>>>>>>>>>>
>>>>>>>>>>>> I think the project [Flink: Upgrade to 1.13.2][1] in the roadmap
>>>>>>>>>>>> can be closed now, because all of the issues have been addressed.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]. https://github.com/apache/iceberg/projects/12
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 21, 2021 at 6:17 PM Eduard Tudenhoefner <
>>>>>>>>>>>> eduard@dremio.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I created a Roadmap section in
>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/3163 that links to
>>>>>>>>>>>>> the planning boards that Jack created. I figured it makes sense if we link
>>>>>>>>>>>>> available design docs directly on those boards (as was already done),
>>>>>>>>>>>>> because then the design docs are closer to the set of related issues.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 20, 2021 at 10:02 PM Ryan Blue <bl...@tabular.io>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, Jack!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Eduard, I think that's a good idea. We should have a roadmap
>>>>>>>>>>>>>> page as well that links to the projects that Jack just created.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 20, 2021 at 12:57 PM Jack Ye <ye...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It seems like we have reached some consensus around the
>>>>>>>>>>>>>>> projects listed here. I have created corresponding Github projects for
>>>>>>>>>>>>>>> each: https://github.com/apache/iceberg/projects
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Related design docs are also linked there.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Sep 19, 2021 at 11:18 PM Eduard Tudenhoefner <
>>>>>>>>>>>>>>> eduard@dremio.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Would it make sense to have a section on the website where
>>>>>>>>>>>>>>>> we collect all the links to the design docs/specs as that would be easier
>>>>>>>>>>>>>>>> to find than searching for things on the ML?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I was thinking about something like for each component:
>>>>>>>>>>>>>>>> * link to the ML discussion
>>>>>>>>>>>>>>>> * link to the actual Spec/Design Doc
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 10, 2021 at 11:38 PM Ryan Blue <bl...@tabular.io>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> At the last sync meeting, we brought up publishing a
>>>>>>>>>>>>>>>>> community roadmap and brainstormed the many features and initiatives that
>>>>>>>>>>>>>>>>> the community is working on. In this thread, I want to make sure that we
>>>>>>>>>>>>>>>>> have a good list of what people are thinking about and I think we should
>>>>>>>>>>>>>>>>> try to categorize the projects by size and general priority. When we reach
>>>>>>>>>>>>>>>>> a rough agreement, I’ll write this up and post it on the ASF site along
>>>>>>>>>>>>>>>>> with links to some projects in Github.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> My rationale for attempting to prioritize projects is that
>>>>>>>>>>>>>>>>> if we try to do too many things, it will be slower progress across
>>>>>>>>>>>>>>>>> everything rather than getting a few important items done. I know that
>>>>>>>>>>>>>>>>> priorities don’t align very cleanly in practice, but it is hopefully worth
>>>>>>>>>>>>>>>>> trying. To come up with a priority, I’m trying to keep top priority items
>>>>>>>>>>>>>>>>> to a minimum by including only one from each group (Spark, Flink, Python,
>>>>>>>>>>>>>>>>> etc.). The remaining items are split between priority 2 and 3. Priority 3
>>>>>>>>>>>>>>>>> is not urgent, including things that can be plugged in (like other IO
>>>>>>>>>>>>>>>>> libraries), docs, etc. Everything else is priority 2.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> That something isn’t priority 1 doesn’t mean it isn’t
>>>>>>>>>>>>>>>>> important or progressing, just that it isn’t the current focus. I think of
>>>>>>>>>>>>>>>>> it this way: if someone has extra time to review something, what should be
>>>>>>>>>>>>>>>>> next? That’s top priority.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here’s my rough categorization. If you disagree, please
>>>>>>>>>>>>>>>>> speak up:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - If you think that something should be top priority,
>>>>>>>>>>>>>>>>>    what gets moved to priority 2?
>>>>>>>>>>>>>>>>>    - Should the priority for a project in 2 or 3 change?
>>>>>>>>>>>>>>>>>    - Is the S/M/L size of a project wrong?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Top priority, 1:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - API: Iceberg 1.0 [medium]
>>>>>>>>>>>>>>>>>    - Spark: Merge-on-read plans [large]
>>>>>>>>>>>>>>>>>    - Maintenance: Delete file compaction [medium]
>>>>>>>>>>>>>>>>>    - Flink: Upgrade to 1.13.2 (document compatibility) [medium]
>>>>>>>>>>>>>>>>>    - Python: Pythonic refactor [medium]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Priority 2:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - ORC: Support delete files stored as ORC [small]
>>>>>>>>>>>>>>>>>    - Spark: DSv2 streaming improvements [small]
>>>>>>>>>>>>>>>>>    - Flink: Inline file compaction [small]
>>>>>>>>>>>>>>>>>    - Flink: Support UPSERT [small]
>>>>>>>>>>>>>>>>>    - Views: Spec [medium]
>>>>>>>>>>>>>>>>>    - Spec: Z-ordering / Space-filling curves [medium]
>>>>>>>>>>>>>>>>>    - Spec: Snapshot tagging and branching [small]
>>>>>>>>>>>>>>>>>    - Spec: Secondary indexes [large]
>>>>>>>>>>>>>>>>>    - Spec v3: Encryption [large]
>>>>>>>>>>>>>>>>>    - Spec v3: Relative paths [large]
>>>>>>>>>>>>>>>>>    - Spec v3: Default field values [medium]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Priority 3:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - Docs: versioned docs [medium]
>>>>>>>>>>>>>>>>>    - IO: Support Aliyun OSS/DLF [medium]
>>>>>>>>>>>>>>>>>    - IO: Support Dell ECS [medium]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> External:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - Trino: Bucketed joins [small]
>>>>>>>>>>>>>>>>>    - Trino: Row-level delete support [medium]
>>>>>>>>>>>>>>>>>    - Trino: Merge-on-read plans [medium]
>>>>>>>>>>>>>>>>>    - Trino: Multi-catalog support [small]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>