You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@iceberg.apache.org by Anton Okolnychyi <ao...@apple.com.INVALID> on 2021/09/14 02:39:46 UTC

[DISCUSS] Spark version support strategy

Hey folks,

I want to discuss our Spark version support strategy.

So far, we have tried to support both 3.0 and 3.1. It is great to support older versions but because we compile against 3.0, we cannot use any Spark features that are offered in newer versions.
Spark 3.2 is just around the corner and it brings a lot of important features such dynamic filtering for v2 tables, required distribution and ordering for writes, etc. These features are too important to ignore them.

Apart from that, I have an end-to-end prototype for merge-on-read with Spark that actually leverages some of the 3.2 features. I’ll be implementing all new Spark DSv2 APIs for us internally and would love to share that with the rest of the community.

I see two options to move forward:

Option 1

Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing minor versions with bug fixes.

Pros: almost no changes to the build configuration, no extra work on our side as just a single Spark version is actively maintained.
Cons: some new features that we will be adding to master could also work with older Spark versions but all 0.12 releases will only contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume any new Spark or format features.

Option 2

Move our Spark integration into a separate project and introduce branches for 3.0, 3.1 and 3.2.

Pros: decouples the format version from Spark, we can support as many Spark versions as needed.
Cons: more work initially to set everything up, more work to release, will need a new release of the core format to consume any changes in the Spark integration.

Overall, I think option 2 seems better for the user but my main worry is that we will have to release the format more frequently (which is a good thing but requires more work and time) and the overall Spark development may be slower.

I’d love to hear what everybody thinks about this matter.

Thanks,
Anton

Re: [DISCUSS] Spark version support strategy

Posted by OpenInx <op...@gmail.com>.

> We should probably add a section to our Flink docs that explains and
links to Flink’s support policy and has a table of Iceberg versions that
work with Flink versions. (We should probably have the same table for
Spark, too!)

Thanks Ryan for the suggestion, I created a separate issue to address this
thing before: https://github.com/apache/iceberg/issues/3115 .  I will make
this forward.

On Thu, Oct 7, 2021 at 1:55 PM Jack Ye <ye...@gmail.com> wrote:

> Hi everyone,
>
> I tried to prototype option 3, here is the PR:
> https://github.com/apache/iceberg/pull/3237
>
> Sorry I did not see that Anton is planning to do it, but anyway it's just
> a draft, so feel free to just use it as reference.
>
> Best,
> Jack Ye
>
> On Sun, Oct 3, 2021 at 2:19 PM Ryan Blue <bl...@tabular.io> wrote:
>
>> Thanks for the context on the Flink side! I think it sounds reasonable to
>> keep up to date with the latest supported Flink version. If we want, we
>> could later go with something similar to what we do for Spark but we’ll see
>> how it goes and what the Flink community needs. We should probably add a
>> section to our Flink docs that explains and links to Flink’s support policy
>> and has a table of Iceberg versions that work with Flink versions. (We
>> should probably have the same table for Spark, too!)
>>
>> For Spark, I’m also leaning toward the modified option 3 where we keep
>> all of the code in the main repository but only build with one module at a
>> time by default. It makes sense to switch based on modules — rather than
>> selecting src paths within a module — so that it is easy to run a build
>> with all modules if you choose to — for example, when building release
>> binaries.
>>
>> The reason I think we should go with option 3 is for testing. If we have
>> a single repo with api, core, etc. and spark then changes to the common
>> modules can be tested by CI actions. Updates to individual Spark modules
>> would be completely independent. There is a slight inconvenience that when
>> an API used by Spark changes, the author would still need to fix multiple
>> Spark versions. But the trade-off is that with a separate repository like
>> option 2, changes that break Spark versions are not caught and then the
>> Spark repository’s CI ends up failing on completely unrelated changes. That
>> would be a major pain, felt by everyone contributing to the Spark
>> integration, so I think option 3 is the best path forward.
>>
>> It sounds like we probably have some agreement now, but please speak up
>> if you think another option would be better.
>>
>> The next step is to prototype the build changes to test out option 3. Or
>> if you prefer option 2, then prototype those changes as well. I think that
>> Anton is planning to do this, but if you have time and the desire to do it
>> please reach out and coordinate with us!
>>
>> Ryan
>>
>> On Wed, Sep 29, 2021 at 9:12 PM Steven Wu <st...@gmail.com> wrote:
>>
>>> Wing, sorry, my earlier message probably misled you. I was speaking my
>>> personal opinion on Flink version support.
>>>
>>> On Tue, Sep 28, 2021 at 8:03 PM Wing Yew Poon
>>> <wy...@cloudera.com.invalid> wrote:
>>>
>>>> Hi OpenInx,
>>>> I'm sorry I misunderstood the thinking of the Flink community. Thanks
>>>> for the clarification.
>>>> - Wing Yew
>>>>
>>>>
>>>> On Tue, Sep 28, 2021 at 7:15 PM OpenInx <op...@gmail.com> wrote:
>>>>
>>>>> Hi Wing
>>>>>
>>>>> As we discussed above, we community prefer to choose option.2 or
>>>>> option.3.  So in fact, when we planned to upgrade the flink version from
>>>>> 1.12 to 1.13,  we are doing our best to guarantee the master iceberg repo
>>>>> could work fine for both flink1.12 & flink1.13. More context please see
>>>>> [1], [2], [3]
>>>>>
>>>>> [1] https://github.com/apache/iceberg/pull/3116
>>>>> [2] https://github.com/apache/iceberg/issues/3183
>>>>> [3]
>>>>> https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E
>>>>>
>>>>>
>>>>> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon
>>>>> <wy...@cloudera.com.invalid> wrote:
>>>>>
>>>>>> In the last community sync, we spent a little time on this topic. For
>>>>>> Spark support, there are currently two options under consideration:
>>>>>>
>>>>>> Option 2: Separate repo for the Spark support. Use branches for
>>>>>> supporting different Spark versions. Main branch for the latest Spark
>>>>>> version (3.2 to begin with).
>>>>>> Tooling needs to be built for producing regular snapshots of core
>>>>>> Iceberg in a consumable way for this repo. Unclear if commits to core
>>>>>> Iceberg will be tested pre-commit against Spark support; my impression is
>>>>>> that they will not be, and the Spark support build can be broken by changes
>>>>>> to core.
>>>>>>
>>>>>> A variant of option 3 (which we will simply call Option 3 going
>>>>>> forward): Single repo, separate module (subdirectory) for each Spark
>>>>>> version to be supported. Code duplication in each Spark module (no attempt
>>>>>> to refactor out common code). Each module built against the specific
>>>>>> version of Spark to be supported, producing a runtime jar built against
>>>>>> that version. CI will test all modules. Support can be provided for only
>>>>>> building the modules a developer cares about.
>>>>>>
>>>>>> More input was sought and people are encouraged to voice their
>>>>>> preference.
>>>>>> I lean towards Option 3.
>>>>>>
>>>>>> - Wing Yew
>>>>>>
>>>>>> ps. In the sync, as Steven Wu wrote, the question was raised if the
>>>>>> same multi-version support strategy can be adopted across engines. Based on
>>>>>> what Steven wrote, currently the Flink developer community's bandwidth
>>>>>> makes supporting only a single Flink version (and focusing resources on
>>>>>> developing new features on that version) the preferred choice. If so, then
>>>>>> no multi-version support strategy for Flink is needed at this time.
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <st...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> During the sync meeting, people talked about if and how we can have
>>>>>>> the same version support model across engines like Flink and Spark. I can
>>>>>>> provide some input from the Flink side.
>>>>>>>
>>>>>>> Flink only supports two minor versions. E.g., right now Flink 1.13
>>>>>>> is the latest released version. That means only Flink 1.12 and 1.13 are
>>>>>>> supported. Feature changes or bug fixes will only be backported to 1.12 and
>>>>>>> 1.13, unless it is a serious bug (like security). With that context,
>>>>>>> personally I like option 1 (with one actively supported Flink version in
>>>>>>> master branch) for the iceberg-flink module.
>>>>>>>
>>>>>>> We discussed the idea of supporting multiple Flink versions via shm
>>>>>>> layer and multiple modules. While it may be a little better to support
>>>>>>> multiple Flink versions, I don't know if there is enough support and
>>>>>>> resources from the community to pull it off. Also the ongoing maintenance
>>>>>>> burden for each minor version release from Flink, which happens roughly
>>>>>>> every 4 months.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary
>>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>>
>>>>>>>> Since you mentioned Hive, I chime in with what we do there. You
>>>>>>>> might find it useful:
>>>>>>>> - metastore module - only small differences - DynConstructor solves
>>>>>>>> for us
>>>>>>>> - mr module - some bigger differences, but still manageable for
>>>>>>>> Hive 2-3. Need some new classes, but most of the code is reused - extra
>>>>>>>> module for Hive 3. For Hive 4 we use a different repo as we moved to the
>>>>>>>> Hive codebase.
>>>>>>>>
>>>>>>>> My thoughts based on the above experience:
>>>>>>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly
>>>>>>>> have problems with backporting changes between repos and we are slacking
>>>>>>>> behind which hurts both projects
>>>>>>>> - Hive 2-3 model is working better by forcing us to keep the things
>>>>>>>> in sync, but with serious differences in the Hive project it still doesn't
>>>>>>>> seem like a viable option.
>>>>>>>>
>>>>>>>> So I think the question is: How stable is the Spark code we are
>>>>>>>> integrating to. If I is fairly stable then we are better off with a "one
>>>>>>>> repo multiple modules" approach and we should consider the multirepo only
>>>>>>>> if the differences become prohibitive.
>>>>>>>>
>>>>>>>> Thanks, Peter
>>>>>>>>
>>>>>>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi,
>>>>>>>> <ao...@apple.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Okay, looks like there is consensus around supporting multiple
>>>>>>>>> Spark versions at the same time. There are folks who mentioned this on this
>>>>>>>>> thread and there were folks who brought this up during the sync.
>>>>>>>>>
>>>>>>>>> Let’s think through Option 2 and 3 in more detail then.
>>>>>>>>>
>>>>>>>>> Option 2
>>>>>>>>>
>>>>>>>>> In Option 2, there will be a separate repo. I believe the master
>>>>>>>>> branch will soon point to Spark 3.2 (the most recent supported version).
>>>>>>>>> The main development will happen there and the artifact version will be
>>>>>>>>> 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1
>>>>>>>>> branches where we will cherry-pick applicable changes. Once we are ready to
>>>>>>>>> release 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and cut 3
>>>>>>>>> releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump the
>>>>>>>>> version in master to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1
>>>>>>>>> branches for cherry-picks.
>>>>>>>>>
>>>>>>>>> I guess we will continue to shade everything in the new repo and
>>>>>>>>> will have to release every time the core is released. We will do a
>>>>>>>>> maintenance release for each supported Spark version whenever we cut a new
>>>>>>>>> maintenance Iceberg release or need to fix any bugs in the Spark
>>>>>>>>> integration.
>>>>>>>>> Under this model, we will probably need nightly snapshots (or on
>>>>>>>>> each commit) for the core format and the Spark integration will depend on
>>>>>>>>> snapshots until we are ready to release.
>>>>>>>>>
>>>>>>>>> Overall, I think this option gives us very simple builds and
>>>>>>>>> provides best separation. It will keep the main repo clean. The main
>>>>>>>>> downside is that we will have to split a Spark feature into two PRs: one
>>>>>>>>> against the core and one against the Spark integration. Certain changes in
>>>>>>>>> core can also break the Spark integration too and will require adaptations.
>>>>>>>>>
>>>>>>>>> Ryan, I am not sure I fully understood the testing part. How will
>>>>>>>>> we be able to test the Spark integration in the main repo if certain
>>>>>>>>> changes in core may break the Spark integration and require changes there?
>>>>>>>>> Will we try to prohibit such changes?
>>>>>>>>>
>>>>>>>>> Option 3 (modified)
>>>>>>>>>
>>>>>>>>> If I get correctly, the modified Option 3 sounds very close to
>>>>>>>>> the initially suggested approach by Imran but with code duplication instead
>>>>>>>>> of extra refactoring and introducing new common modules.
>>>>>>>>>
>>>>>>>>> Jack, are you suggesting we test only a single Spark version at a
>>>>>>>>> time? Or do we expect to test all versions? Will there be any difference
>>>>>>>>> compared to just having a module per version? I did not fully
>>>>>>>>> understand.
>>>>>>>>>
>>>>>>>>> My worry with this approach is that our build will be very
>>>>>>>>> complicated and we will still have a lot of Spark-related modules in the
>>>>>>>>> main repo. Once people start using Flink and Hive more, will we have to do
>>>>>>>>> the same?
>>>>>>>>>
>>>>>>>>> - Anton
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 16 Sep 2021, at 08:11, Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>> I'd support the option that Jack suggests if we can set a few
>>>>>>>>> expectations for keeping it clean.
>>>>>>>>>
>>>>>>>>> First, I'd like to avoid refactoring code to share it across Spark
>>>>>>>>> versions -- that introduces risk because we're relying on compiling against
>>>>>>>>> one version and running in another and both Spark and Scala change rapidly.
>>>>>>>>> A big benefit of options 1 and 2 is that we mostly focus on only one Spark
>>>>>>>>> version. I think we should duplicate code rather than spend time
>>>>>>>>> refactoring to rely on binary compatibility. I propose we start each new
>>>>>>>>> Spark version by copying the last one and updating it. And we should build
>>>>>>>>> just the latest supported version by default.
>>>>>>>>>
>>>>>>>>> The drawback to having everything in a single repo is that we
>>>>>>>>> wouldn't be able to cherry-pick changes across Spark versions/branches, but
>>>>>>>>> I think Jack is right that having a single build is better.
>>>>>>>>>
>>>>>>>>> Second, we should make CI faster by running the Spark builds in
>>>>>>>>> parallel. It sounds like this is what would happen anyway, with a property
>>>>>>>>> that selects the Spark version that you want to build against.
>>>>>>>>>
>>>>>>>>> Overall, this new suggestion sounds like a promising way forward.
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <ye...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I think in Ryan's proposal we will create a ton of modules
>>>>>>>>>> anyway, as Wing listed we are just using git branch as an additional
>>>>>>>>>> dimension, but my understanding is that you will still have 1 core, 1
>>>>>>>>>> extension, 1 runtime artifact published for each Spark version in either
>>>>>>>>>> approach.
>>>>>>>>>>
>>>>>>>>>> In that case, this is just brainstorming, I wonder if we can
>>>>>>>>>> explore a modified option 3 that flattens all the versions in each Spark
>>>>>>>>>> branch in option 2 into master. The repository structure would look
>>>>>>>>>> something like:
>>>>>>>>>>
>>>>>>>>>> iceberg/api/...
>>>>>>>>>>             /bundled-guava/...
>>>>>>>>>>             /core/...
>>>>>>>>>>             ...
>>>>>>>>>>             /spark/2.4/core/...
>>>>>>>>>>                             /extension/...
>>>>>>>>>>                             /runtime/...
>>>>>>>>>>                       /3.1/core/...
>>>>>>>>>>                             /extension/...
>>>>>>>>>>                             /runtime/...
>>>>>>>>>>
>>>>>>>>>> The gradle build script in the root is configured to build
>>>>>>>>>> against the latest version of Spark by default, unless otherwise specified
>>>>>>>>>> by the user.
>>>>>>>>>>
>>>>>>>>>> Intellij can also be configured to only index files of specific
>>>>>>>>>> versions based on the same config used in build.
>>>>>>>>>>
>>>>>>>>>> In this way, I imagine the CI setup to be much easier to do
>>>>>>>>>> things like testing version compatibility for a feature or running only a
>>>>>>>>>> specific subset of Spark version builds based on the Spark version
>>>>>>>>>> directories touched.
>>>>>>>>>>
>>>>>>>>>> And the biggest benefit is that we don't have the same difficulty
>>>>>>>>>> as option 2 of developing a feature when it's both in core and Spark.
>>>>>>>>>>
>>>>>>>>>> We can then develop a mechanism to vote to stop support of
>>>>>>>>>> certain versions, and archive the corresponding directory to avoid
>>>>>>>>>> accumulating too many versions in the long term.
>>>>>>>>>>
>>>>>>>>>> -Jack Ye
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <bl...@tabular.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry, I was thinking about CI integration between Iceberg Java
>>>>>>>>>>> and Iceberg Spark, I just didn't mention it and I see how that's a big
>>>>>>>>>>> thing to leave out!
>>>>>>>>>>>
>>>>>>>>>>> I would definitely want to test the projects together. One thing
>>>>>>>>>>> we could do is have a nightly build like Russell suggests. I'm also
>>>>>>>>>>> wondering if we could have some tighter integration where the Iceberg Spark
>>>>>>>>>>> build can be included in the Iceberg Java build using properties. Maybe the
>>>>>>>>>>> github action could checkout Iceberg, then checkout the Spark
>>>>>>>>>>> integration's latest branch, and then run the gradle build with a property
>>>>>>>>>>> that makes Spark a subproject in the build. That way we can continue to
>>>>>>>>>>> have Spark CI run regularly.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>>>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I agree that Option 2 is considerably more difficult for
>>>>>>>>>>>> development when core API changes need to be picked up by the external
>>>>>>>>>>>> Spark module. I also think a monthly release would probably still be
>>>>>>>>>>>> prohibitive to actually implementing new features that appear in the API, I
>>>>>>>>>>>> would hope we have a much faster process or maybe just have snapshot
>>>>>>>>>>>> artifacts published nightly?
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <
>>>>>>>>>>>> wypoon@cloudera.com.INVALID> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a
>>>>>>>>>>>> separate repo (subproject of Iceberg). Would we have branches such as
>>>>>>>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be
>>>>>>>>>>>> supported in all versions or all Spark 3 versions, then we would need to
>>>>>>>>>>>> commit the changes to all applicable branches. Basically we are trading
>>>>>>>>>>>> more work to commit to multiple branches for simplified build and CI
>>>>>>>>>>>> time per branch, which might be an acceptable trade-off. However, the
>>>>>>>>>>>> biggest downside is that changes may need to be made in core Iceberg as
>>>>>>>>>>>> well as in the engine (in this case Spark) support, and we need to wait for
>>>>>>>>>>>> a release of core Iceberg to consume the changes in the subproject. In this
>>>>>>>>>>>> case, maybe we should have a monthly release of core Iceberg (no matter how
>>>>>>>>>>>> many changes go in, as long as it is non-zero) so that the subproject can
>>>>>>>>>>>> consume changes fairly quickly?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <bl...@tabular.io>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the
>>>>>>>>>>>>> set of potential solutions well defined.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looks like the next step is to decide whether we want to
>>>>>>>>>>>>> require people to update Spark versions to pick up newer versions of
>>>>>>>>>>>>> Iceberg. If we choose to make people upgrade, then option 1 is clearly the
>>>>>>>>>>>>> best choice.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don’t think that we should make updating Spark a
>>>>>>>>>>>>> requirement. Many of the things that we’re working on are orthogonal to
>>>>>>>>>>>>> Spark versions, like table maintenance actions, secondary indexes, the 1.0
>>>>>>>>>>>>> API, views, ORC delete files, new storage implementations, etc. Upgrading
>>>>>>>>>>>>> Spark is time consuming and untrusted in my experience, so I think we would
>>>>>>>>>>>>> be setting up an unnecessary trade-off between spending lots of time to
>>>>>>>>>>>>> upgrade Spark and picking up new Iceberg features.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Another way of thinking about this is that if we went with
>>>>>>>>>>>>> option 1, then we could port bug fixes into 0.12.x. But there are many
>>>>>>>>>>>>> things that wouldn’t fit this model, like adding a FileIO implementation
>>>>>>>>>>>>> for ADLS. So some people in the community would have to maintain branches
>>>>>>>>>>>>> of newer Iceberg versions with older versions of Spark outside of the main
>>>>>>>>>>>>> Iceberg project — that defeats the purpose of simplifying things with
>>>>>>>>>>>>> option 1 because we would then have more people maintaining the same 0.13.x
>>>>>>>>>>>>> with Spark 3.1 branch. (This reminds me of the Spark community, where we
>>>>>>>>>>>>> wanted to release a 2.5 line with DSv2 backported, but the community
>>>>>>>>>>>>> decided not to so we built similar 2.4+DSv2 branches at Netflix, Tencent,
>>>>>>>>>>>>> Apple, etc.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> If the community is going to do the work anyway — and I think
>>>>>>>>>>>>> some of us would — we should make it possible to share that work. That’s
>>>>>>>>>>>>> why I don’t think that we should go with option 1.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we don’t go with option 1, then the choice is how to
>>>>>>>>>>>>> maintain multiple Spark versions. I think that the way we’re doing it right
>>>>>>>>>>>>> now is not something we want to continue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Using multiple modules (option 3) is concerning to me because
>>>>>>>>>>>>> of the changes in Spark. We currently structure the library to share as
>>>>>>>>>>>>> much code as possible. But that means compiling against different Spark
>>>>>>>>>>>>> versions and relying on binary compatibility and reflection in some cases.
>>>>>>>>>>>>> To me, this seems unmaintainable in the long run because it requires
>>>>>>>>>>>>> refactoring common classes and spending a lot of time deduplicating code.
>>>>>>>>>>>>> It also creates a ton of modules, at least one common module, then a module
>>>>>>>>>>>>> per version, then an extensions module per version, and finally a runtime
>>>>>>>>>>>>> module per version. That’s 3 modules per Spark version, plus any new common
>>>>>>>>>>>>> modules. And each module needs to be tested, which is making our CI take a
>>>>>>>>>>>>> really long time. We also don’t support multiple Scala versions, which is
>>>>>>>>>>>>> another gap that will require even more modules and tests.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like option 2 because it would allow us to compile against a
>>>>>>>>>>>>> single version of Spark (which will be much more reliable). It would give
>>>>>>>>>>>>> us an opportunity to support different Scala versions. It avoids the need
>>>>>>>>>>>>> to refactor to share code and allows people to focus on a single version of
>>>>>>>>>>>>> Spark, while also creating a way for people to maintain and update the
>>>>>>>>>>>>> older versions with newer Iceberg releases. I don’t think that this would
>>>>>>>>>>>>> slow down development. I think it would actually speed it up because we’d
>>>>>>>>>>>>> be spending less time trying to make multiple versions work in the same
>>>>>>>>>>>>> build. And anyone in favor of option 1 would basically get option 1: you
>>>>>>>>>>>>> don’t have to care about branches for older Spark versions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jack makes a good point about wanting to keep code in a single
>>>>>>>>>>>>> repository, but I think that the need to manage more version combinations
>>>>>>>>>>>>> overrides this concern. It’s easier to make this decision in python because
>>>>>>>>>>>>> we’re not trying to depend on two projects that change relatively quickly.
>>>>>>>>>>>>> We’re just trying to build a library.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <op...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for bringing this up,  Anton.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Everyone has great pros/cons to support their preferences.
>>>>>>>>>>>>>> Before giving my preference, let me raise one question:    what's the top
>>>>>>>>>>>>>> priority thing for apache iceberg project at this point in time ?  This
>>>>>>>>>>>>>> question will help us to answer the following question: Should we support
>>>>>>>>>>>>>> more engine versions more robustly or be a bit more aggressive and
>>>>>>>>>>>>>> concentrate on getting the new features that users need most in order to
>>>>>>>>>>>>>> keep the project more competitive ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If people watch the apache iceberg project and check the
>>>>>>>>>>>>>> issues & PR frequently,  I guess more than 90% people will answer the
>>>>>>>>>>>>>> priority question:   There is no doubt for making the whole v2 story to be
>>>>>>>>>>>>>> production-ready.   The current roadmap discussion also proofs the thing :
>>>>>>>>>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In order to ensure the highest priority at this point in
>>>>>>>>>>>>>> time, I will prefer option-1 to reduce the cost of engine maintenance, so
>>>>>>>>>>>>>> as to free up resources to make v2 production-ready.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <
>>>>>>>>>>>>>> sai.sai.shao@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From Dev's point, it has less burden to always support the
>>>>>>>>>>>>>>> latest version of Spark (for example). But from user's point,
>>>>>>>>>>>>>>> especially for us who maintain Spark internally, it is not easy to upgrade
>>>>>>>>>>>>>>> the Spark version for the first time (since we have many customizations
>>>>>>>>>>>>>>> internally), and we're still promoting to upgrade to 3.1.2. If the
>>>>>>>>>>>>>>> community ditches the support of old version of Spark3, users have to
>>>>>>>>>>>>>>> maintain it themselves unavoidably.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So I'm inclined to make this support in community, not by
>>>>>>>>>>>>>>> users themselves, as for Option 2 or 3, I'm fine with either. And to
>>>>>>>>>>>>>>> relieve the burden, we could support limited versions of Spark (for example
>>>>>>>>>>>>>>> 2 versions).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Just my two cents.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Saisai
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Wing Yew,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think 2.4 is a different story, we will continue to
>>>>>>>>>>>>>>>> support Spark 2.4, but as you can see it will continue to have very limited
>>>>>>>>>>>>>>>> functionalities comparing to Spark 3. I believe we discussed about option 3
>>>>>>>>>>>>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the
>>>>>>>>>>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>>>>>>>>>>>>>> consistent strategy around this, let's take this chance to make a good
>>>>>>>>>>>>>>>> community guideline for all future engine versions, especially for Spark,
>>>>>>>>>>>>>>>> Flink and Hive that are in the same repository.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I can totally understand your point of view Wing, in fact,
>>>>>>>>>>>>>>>> speaking from the perspective of AWS EMR, we have to support over 40
>>>>>>>>>>>>>>>> versions of the software because there are people who are still using Spark
>>>>>>>>>>>>>>>> 1.4, believe it or not. After all, keep backporting changes will become a
>>>>>>>>>>>>>>>> liability not only on the user side, but also on the service provider side,
>>>>>>>>>>>>>>>> so I believe it's not a bad practice to push for user upgrade, as it will
>>>>>>>>>>>>>>>> make the life of both parties easier in the end. New feature is definitely
>>>>>>>>>>>>>>>> one of the best incentives to promote an upgrade on user side.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think the biggest issue of option 3 is about its
>>>>>>>>>>>>>>>> scalability, because we will have an unbounded list of packages to add and
>>>>>>>>>>>>>>>> compile in the future, and we probably cannot drop support of that package
>>>>>>>>>>>>>>>> once created. If we go with option 1, I think we can still publish a few
>>>>>>>>>>>>>>>> patch versions for old Iceberg releases, and committers can control the
>>>>>>>>>>>>>>>> amount of patch versions to guard people from abusing the power of
>>>>>>>>>>>>>>>> patching. I see this as a consistent strategy also for Flink and Hive. With
>>>>>>>>>>>>>>>> this strategy, we can truly have a compatibility matrix for engine versions
>>>>>>>>>>>>>>>> against Iceberg versions.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>>>>>>>>>>>> wypoon@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I understand and sympathize with the desire to use new
>>>>>>>>>>>>>>>>> DSv2 features in Spark 3.2. I agree that Option 1 is the easiest for
>>>>>>>>>>>>>>>>> developers, but I don't think it considers the interests of users. I do not
>>>>>>>>>>>>>>>>> think that most users will upgrade to Spark 3.2 as soon as it is released.
>>>>>>>>>>>>>>>>> It is a "minor version" upgrade in name from 3.1 (or from 3.0), but I think
>>>>>>>>>>>>>>>>> we all know that it is not a minor upgrade. There are a lot of changes from
>>>>>>>>>>>>>>>>> 3.0 to 3.1 and from 3.1 to 3.2. I think there are even a lot of users
>>>>>>>>>>>>>>>>> running Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop
>>>>>>>>>>>>>>>>> supporting Spark 2.4?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have
>>>>>>>>>>>>>>>>> spoken out in favor of Option 1 all work for the same organization, don't
>>>>>>>>>>>>>>>>> they? And they don't have a problem with making their users, all internal,
>>>>>>>>>>>>>>>>> simply upgrade to Spark 3.2, do they? (Or they are already running an
>>>>>>>>>>>>>>>>> internal fork that is close to 3.2.)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I work for an organization with customers running
>>>>>>>>>>>>>>>>> different versions of Spark. It is true that we can backport new features
>>>>>>>>>>>>>>>>> to older versions if we wanted to. I suppose the people contributing to
>>>>>>>>>>>>>>>>> Iceberg work for some organization or other that either use Iceberg
>>>>>>>>>>>>>>>>> in-house, or provide software (possibly in the form of a service) to
>>>>>>>>>>>>>>>>> customers, and either way, the organizations have the ability to backport
>>>>>>>>>>>>>>>>> features and fixes to internal versions. Are there any users out there who
>>>>>>>>>>>>>>>>> simply use Apache Iceberg and depend on the community version?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There may be features that are broadly useful that do not
>>>>>>>>>>>>>>>>> depend on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even
>>>>>>>>>>>>>>>>> 2.4)?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1,
>>>>>>>>>>>>>>>>> but I would consider Option 3 too. Anton, you said 5 modules are required;
>>>>>>>>>>>>>>>>> what are the modules you're thinking of?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Wing Yew
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <
>>>>>>>>>>>>>>>>> flyrain000@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1. Both 2 and 3 will slow down the development.
>>>>>>>>>>>>>>>>>> Considering the limited resources in the open source community, the upsides
>>>>>>>>>>>>>>>>>> of option 2 and 3 are probably not worthy.
>>>>>>>>>>>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's
>>>>>>>>>>>>>>>>>> hard to predict anything, but even if these use cases are legit, users can
>>>>>>>>>>>>>>>>>> still get the new feature by backporting it to an older version in case of
>>>>>>>>>>>>>>>>>> upgrading to a newer version isn't an option.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3
>>>>>>>>>>>>>>>>>>> version)*
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The easiest option for us devs, forces the user to
>>>>>>>>>>>>>>>>>>> upgrade to the most recent minor Spark version to consume any new
>>>>>>>>>>>>>>>>>>> Iceberg features.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Can support as many Spark versions as needed and the
>>>>>>>>>>>>>>>>>>> codebase is still separate as we can use separate branches.
>>>>>>>>>>>>>>>>>>> Impossible to consume any unreleased changes in core,
>>>>>>>>>>>>>>>>>>> may slow down the development.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>>>>>>>>>>>> Can consume unreleased changes but it will required at
>>>>>>>>>>>>>>>>>>> least 5 modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>>>>>>>>>>>>>> complicated.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark
>>>>>>>>>>>>>>>>>>> version (e3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>>>>>>>>>>>> We follow Option 1 internally at the moment but I would
>>>>>>>>>>>>>>>>>>> like to hear what other people think/need.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>>>>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think we should go for option 1. I already am not a
>>>>>>>>>>>>>>>>>>> big fan of having runtime errors for unsupported things based on versions
>>>>>>>>>>>>>>>>>>> and I don't think minor version upgrades are a large issue for users.  I'm
>>>>>>>>>>>>>>>>>>> especially not looking forward to supporting interfaces that only exist in
>>>>>>>>>>>>>>>>>>> Spark 3.2 in a multiple Spark version support future.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>>>>>>>>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this
>>>>>>>>>>>>>>>>>>> moment.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all
>>>>>>>>>>>>>>>>>>> the minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>>>>>>>>> major version.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to
>>>>>>>>>>>>>>>>>>> support both Spark 3.1 and Spark 3.2 with a single module, it means we have
>>>>>>>>>>>>>>>>>>> to compile against 3.1. The problem is that we rely on DSv2 that is being
>>>>>>>>>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial differences. On top of
>>>>>>>>>>>>>>>>>>> that, we have our extensions that are extremely low-level and may break not
>>>>>>>>>>>>>>>>>>> only between minor versions but also between patch releases.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> f there are some features requiring a newer version, it
>>>>>>>>>>>>>>>>>>> makes sense to move that newer version in master.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Internally, we don’t deliver new features to older Spark
>>>>>>>>>>>>>>>>>>> versions as it requires a lot of effort to port things. Personally, I don’t
>>>>>>>>>>>>>>>>>>> think it is too bad to require users to upgrade if they want new features.
>>>>>>>>>>>>>>>>>>> At the same time, there are valid concerns with this approach too that we
>>>>>>>>>>>>>>>>>>> mentioned during the sync. For example, certain new features would also
>>>>>>>>>>>>>>>>>>> work fine with older Spark versions. I generally agree with that and that
>>>>>>>>>>>>>>>>>>> not supporting recent versions is not ideal. However, I want to find a
>>>>>>>>>>>>>>>>>>> balance between the complexity on our side and ease of use for the users.
>>>>>>>>>>>>>>>>>>> Ideally, supporting a few recent versions would be sufficient but our Spark
>>>>>>>>>>>>>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all
>>>>>>>>>>>>>>>>>>> the minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>>>>>>>>> major version. This avoids the problem that some users are unwilling to
>>>>>>>>>>>>>>>>>>> move to a newer version and keep patching old Spark version branches. If
>>>>>>>>>>>>>>>>>>> there are some features requiring a newer version, it makes sense to move
>>>>>>>>>>>>>>>>>>> that newer version in master.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In addition, because currently Spark is considered the
>>>>>>>>>>>>>>>>>>> most feature-complete reference implementation compared to all other
>>>>>>>>>>>>>>>>>>> engines, I think we should not add artificial barriers that would slow down
>>>>>>>>>>>>>>>>>>> its development speed.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It
>>>>>>>>>>>>>>>>>>>> is great to support older versions but because we compile against 3.0, we
>>>>>>>>>>>>>>>>>>>> cannot use any Spark features that are offered in newer versions.
>>>>>>>>>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot
>>>>>>>>>>>>>>>>>>>> of important features such dynamic filtering for v2 tables, required
>>>>>>>>>>>>>>>>>>>> distribution and ordering for writes, etc. These features are too important
>>>>>>>>>>>>>>>>>>>> to ignore them.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for
>>>>>>>>>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of the 3.2 features.
>>>>>>>>>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us internally and would
>>>>>>>>>>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Option 1
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a
>>>>>>>>>>>>>>>>>>>> while by releasing minor versions with bug fixes.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no
>>>>>>>>>>>>>>>>>>>> extra work on our side as just a single Spark version is actively
>>>>>>>>>>>>>>>>>>>> maintained.
>>>>>>>>>>>>>>>>>>>> Cons: some new features that we will be adding to
>>>>>>>>>>>>>>>>>>>> master could also work with older Spark versions but all 0.12 releases will
>>>>>>>>>>>>>>>>>>>> only contain bug fixes. Therefore, users will be forced to migrate to Spark
>>>>>>>>>>>>>>>>>>>> 3.2 to consume any new Spark or format features.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Option 2
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Move our Spark integration into a separate project and
>>>>>>>>>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Pros: decouples the format version from Spark, we can
>>>>>>>>>>>>>>>>>>>> support as many Spark versions as needed.
>>>>>>>>>>>>>>>>>>>> Cons: more work initially to set everything up, more
>>>>>>>>>>>>>>>>>>>> work to release, will need a new release of the core format to consume any
>>>>>>>>>>>>>>>>>>>> changes in the Spark integration.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user but
>>>>>>>>>>>>>>>>>>>> my main worry is that we will have to release the format more frequently
>>>>>>>>>>>>>>>>>>>> (which is a good thing but requires more work and time) and the overall
>>>>>>>>>>>>>>>>>>>> Spark development may be slower.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I’d love to hear what everybody thinks about this
>>>>>>>>>>>>>>>>>>>> matter.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Anton
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

Re: [DISCUSS] Spark version support strategy

Posted by Jack Ye <ye...@gmail.com>.

Hi everyone,

I tried to prototype option 3, here is the PR:
https://github.com/apache/iceberg/pull/3237

Sorry I did not see that Anton is planning to do it, but anyway it's just a
draft, so feel free to just use it as reference.

Best,
Jack Ye

On Sun, Oct 3, 2021 at 2:19 PM Ryan Blue <bl...@tabular.io> wrote:

> Thanks for the context on the Flink side! I think it sounds reasonable to
> keep up to date with the latest supported Flink version. If we want, we
> could later go with something similar to what we do for Spark but we’ll see
> how it goes and what the Flink community needs. We should probably add a
> section to our Flink docs that explains and links to Flink’s support policy
> and has a table of Iceberg versions that work with Flink versions. (We
> should probably have the same table for Spark, too!)
>
> For Spark, I’m also leaning toward the modified option 3 where we keep all
> of the code in the main repository but only build with one module at a time
> by default. It makes sense to switch based on modules — rather than
> selecting src paths within a module — so that it is easy to run a build
> with all modules if you choose to — for example, when building release
> binaries.
>
> The reason I think we should go with option 3 is for testing. If we have a
> single repo with api, core, etc. and spark then changes to the common
> modules can be tested by CI actions. Updates to individual Spark modules
> would be completely independent. There is a slight inconvenience that when
> an API used by Spark changes, the author would still need to fix multiple
> Spark versions. But the trade-off is that with a separate repository like
> option 2, changes that break Spark versions are not caught and then the
> Spark repository’s CI ends up failing on completely unrelated changes. That
> would be a major pain, felt by everyone contributing to the Spark
> integration, so I think option 3 is the best path forward.
>
> It sounds like we probably have some agreement now, but please speak up if
> you think another option would be better.
>
> The next step is to prototype the build changes to test out option 3. Or
> if you prefer option 2, then prototype those changes as well. I think that
> Anton is planning to do this, but if you have time and the desire to do it
> please reach out and coordinate with us!
>
> Ryan
>
> On Wed, Sep 29, 2021 at 9:12 PM Steven Wu <st...@gmail.com> wrote:
>
>> Wing, sorry, my earlier message probably misled you. I was speaking my
>> personal opinion on Flink version support.
>>
>> On Tue, Sep 28, 2021 at 8:03 PM Wing Yew Poon <wy...@cloudera.com.invalid>
>> wrote:
>>
>>> Hi OpenInx,
>>> I'm sorry I misunderstood the thinking of the Flink community. Thanks
>>> for the clarification.
>>> - Wing Yew
>>>
>>>
>>> On Tue, Sep 28, 2021 at 7:15 PM OpenInx <op...@gmail.com> wrote:
>>>
>>>> Hi Wing
>>>>
>>>> As we discussed above, we community prefer to choose option.2 or
>>>> option.3.  So in fact, when we planned to upgrade the flink version from
>>>> 1.12 to 1.13,  we are doing our best to guarantee the master iceberg repo
>>>> could work fine for both flink1.12 & flink1.13. More context please see
>>>> [1], [2], [3]
>>>>
>>>> [1] https://github.com/apache/iceberg/pull/3116
>>>> [2] https://github.com/apache/iceberg/issues/3183
>>>> [3]
>>>> https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E
>>>>
>>>>
>>>> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon
>>>> <wy...@cloudera.com.invalid> wrote:
>>>>
>>>>> In the last community sync, we spent a little time on this topic. For
>>>>> Spark support, there are currently two options under consideration:
>>>>>
>>>>> Option 2: Separate repo for the Spark support. Use branches for
>>>>> supporting different Spark versions. Main branch for the latest Spark
>>>>> version (3.2 to begin with).
>>>>> Tooling needs to be built for producing regular snapshots of core
>>>>> Iceberg in a consumable way for this repo. Unclear if commits to core
>>>>> Iceberg will be tested pre-commit against Spark support; my impression is
>>>>> that they will not be, and the Spark support build can be broken by changes
>>>>> to core.
>>>>>
>>>>> A variant of option 3 (which we will simply call Option 3 going
>>>>> forward): Single repo, separate module (subdirectory) for each Spark
>>>>> version to be supported. Code duplication in each Spark module (no attempt
>>>>> to refactor out common code). Each module built against the specific
>>>>> version of Spark to be supported, producing a runtime jar built against
>>>>> that version. CI will test all modules. Support can be provided for only
>>>>> building the modules a developer cares about.
>>>>>
>>>>> More input was sought and people are encouraged to voice their
>>>>> preference.
>>>>> I lean towards Option 3.
>>>>>
>>>>> - Wing Yew
>>>>>
>>>>> ps. In the sync, as Steven Wu wrote, the question was raised if the
>>>>> same multi-version support strategy can be adopted across engines. Based on
>>>>> what Steven wrote, currently the Flink developer community's bandwidth
>>>>> makes supporting only a single Flink version (and focusing resources on
>>>>> developing new features on that version) the preferred choice. If so, then
>>>>> no multi-version support strategy for Flink is needed at this time.
>>>>>
>>>>>
>>>>> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <st...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> During the sync meeting, people talked about if and how we can have
>>>>>> the same version support model across engines like Flink and Spark. I can
>>>>>> provide some input from the Flink side.
>>>>>>
>>>>>> Flink only supports two minor versions. E.g., right now Flink 1.13 is
>>>>>> the latest released version. That means only Flink 1.12 and 1.13 are
>>>>>> supported. Feature changes or bug fixes will only be backported to 1.12 and
>>>>>> 1.13, unless it is a serious bug (like security). With that context,
>>>>>> personally I like option 1 (with one actively supported Flink version in
>>>>>> master branch) for the iceberg-flink module.
>>>>>>
>>>>>> We discussed the idea of supporting multiple Flink versions via shm
>>>>>> layer and multiple modules. While it may be a little better to support
>>>>>> multiple Flink versions, I don't know if there is enough support and
>>>>>> resources from the community to pull it off. Also the ongoing maintenance
>>>>>> burden for each minor version release from Flink, which happens roughly
>>>>>> every 4 months.
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary
>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>
>>>>>>> Since you mentioned Hive, I chime in with what we do there. You
>>>>>>> might find it useful:
>>>>>>> - metastore module - only small differences - DynConstructor solves
>>>>>>> for us
>>>>>>> - mr module - some bigger differences, but still manageable for Hive
>>>>>>> 2-3. Need some new classes, but most of the code is reused - extra module
>>>>>>> for Hive 3. For Hive 4 we use a different repo as we moved to the Hive
>>>>>>> codebase.
>>>>>>>
>>>>>>> My thoughts based on the above experience:
>>>>>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly
>>>>>>> have problems with backporting changes between repos and we are slacking
>>>>>>> behind which hurts both projects
>>>>>>> - Hive 2-3 model is working better by forcing us to keep the things
>>>>>>> in sync, but with serious differences in the Hive project it still doesn't
>>>>>>> seem like a viable option.
>>>>>>>
>>>>>>> So I think the question is: How stable is the Spark code we are
>>>>>>> integrating to. If I is fairly stable then we are better off with a "one
>>>>>>> repo multiple modules" approach and we should consider the multirepo only
>>>>>>> if the differences become prohibitive.
>>>>>>>
>>>>>>> Thanks, Peter
>>>>>>>
>>>>>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi,
>>>>>>> <ao...@apple.com.invalid> wrote:
>>>>>>>
>>>>>>>> Okay, looks like there is consensus around supporting multiple
>>>>>>>> Spark versions at the same time. There are folks who mentioned this on this
>>>>>>>> thread and there were folks who brought this up during the sync.
>>>>>>>>
>>>>>>>> Let’s think through Option 2 and 3 in more detail then.
>>>>>>>>
>>>>>>>> Option 2
>>>>>>>>
>>>>>>>> In Option 2, there will be a separate repo. I believe the master
>>>>>>>> branch will soon point to Spark 3.2 (the most recent supported version).
>>>>>>>> The main development will happen there and the artifact version will be
>>>>>>>> 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1
>>>>>>>> branches where we will cherry-pick applicable changes. Once we are ready to
>>>>>>>> release 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and cut 3
>>>>>>>> releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump the
>>>>>>>> version in master to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1
>>>>>>>> branches for cherry-picks.
>>>>>>>>
>>>>>>>> I guess we will continue to shade everything in the new repo and
>>>>>>>> will have to release every time the core is released. We will do a
>>>>>>>> maintenance release for each supported Spark version whenever we cut a new
>>>>>>>> maintenance Iceberg release or need to fix any bugs in the Spark
>>>>>>>> integration.
>>>>>>>> Under this model, we will probably need nightly snapshots (or on
>>>>>>>> each commit) for the core format and the Spark integration will depend on
>>>>>>>> snapshots until we are ready to release.
>>>>>>>>
>>>>>>>> Overall, I think this option gives us very simple builds and
>>>>>>>> provides best separation. It will keep the main repo clean. The main
>>>>>>>> downside is that we will have to split a Spark feature into two PRs: one
>>>>>>>> against the core and one against the Spark integration. Certain changes in
>>>>>>>> core can also break the Spark integration too and will require adaptations.
>>>>>>>>
>>>>>>>> Ryan, I am not sure I fully understood the testing part. How will
>>>>>>>> we be able to test the Spark integration in the main repo if certain
>>>>>>>> changes in core may break the Spark integration and require changes there?
>>>>>>>> Will we try to prohibit such changes?
>>>>>>>>
>>>>>>>> Option 3 (modified)
>>>>>>>>
>>>>>>>> If I get correctly, the modified Option 3 sounds very close to
>>>>>>>> the initially suggested approach by Imran but with code duplication instead
>>>>>>>> of extra refactoring and introducing new common modules.
>>>>>>>>
>>>>>>>> Jack, are you suggesting we test only a single Spark version at a
>>>>>>>> time? Or do we expect to test all versions? Will there be any difference
>>>>>>>> compared to just having a module per version? I did not fully
>>>>>>>> understand.
>>>>>>>>
>>>>>>>> My worry with this approach is that our build will be very
>>>>>>>> complicated and we will still have a lot of Spark-related modules in the
>>>>>>>> main repo. Once people start using Flink and Hive more, will we have to do
>>>>>>>> the same?
>>>>>>>>
>>>>>>>> - Anton
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 16 Sep 2021, at 08:11, Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>>
>>>>>>>> I'd support the option that Jack suggests if we can set a few
>>>>>>>> expectations for keeping it clean.
>>>>>>>>
>>>>>>>> First, I'd like to avoid refactoring code to share it across Spark
>>>>>>>> versions -- that introduces risk because we're relying on compiling against
>>>>>>>> one version and running in another and both Spark and Scala change rapidly.
>>>>>>>> A big benefit of options 1 and 2 is that we mostly focus on only one Spark
>>>>>>>> version. I think we should duplicate code rather than spend time
>>>>>>>> refactoring to rely on binary compatibility. I propose we start each new
>>>>>>>> Spark version by copying the last one and updating it. And we should build
>>>>>>>> just the latest supported version by default.
>>>>>>>>
>>>>>>>> The drawback to having everything in a single repo is that we
>>>>>>>> wouldn't be able to cherry-pick changes across Spark versions/branches, but
>>>>>>>> I think Jack is right that having a single build is better.
>>>>>>>>
>>>>>>>> Second, we should make CI faster by running the Spark builds in
>>>>>>>> parallel. It sounds like this is what would happen anyway, with a property
>>>>>>>> that selects the Spark version that you want to build against.
>>>>>>>>
>>>>>>>> Overall, this new suggestion sounds like a promising way forward.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <ye...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I think in Ryan's proposal we will create a ton of modules anyway,
>>>>>>>>> as Wing listed we are just using git branch as an additional dimension, but
>>>>>>>>> my understanding is that you will still have 1 core, 1 extension, 1 runtime
>>>>>>>>> artifact published for each Spark version in either approach.
>>>>>>>>>
>>>>>>>>> In that case, this is just brainstorming, I wonder if we can
>>>>>>>>> explore a modified option 3 that flattens all the versions in each Spark
>>>>>>>>> branch in option 2 into master. The repository structure would look
>>>>>>>>> something like:
>>>>>>>>>
>>>>>>>>> iceberg/api/...
>>>>>>>>>             /bundled-guava/...
>>>>>>>>>             /core/...
>>>>>>>>>             ...
>>>>>>>>>             /spark/2.4/core/...
>>>>>>>>>                             /extension/...
>>>>>>>>>                             /runtime/...
>>>>>>>>>                       /3.1/core/...
>>>>>>>>>                             /extension/...
>>>>>>>>>                             /runtime/...
>>>>>>>>>
>>>>>>>>> The gradle build script in the root is configured to build against
>>>>>>>>> the latest version of Spark by default, unless otherwise specified by the
>>>>>>>>> user.
>>>>>>>>>
>>>>>>>>> Intellij can also be configured to only index files of specific
>>>>>>>>> versions based on the same config used in build.
>>>>>>>>>
>>>>>>>>> In this way, I imagine the CI setup to be much easier to do things
>>>>>>>>> like testing version compatibility for a feature or running only a
>>>>>>>>> specific subset of Spark version builds based on the Spark version
>>>>>>>>> directories touched.
>>>>>>>>>
>>>>>>>>> And the biggest benefit is that we don't have the same difficulty
>>>>>>>>> as option 2 of developing a feature when it's both in core and Spark.
>>>>>>>>>
>>>>>>>>> We can then develop a mechanism to vote to stop support of certain
>>>>>>>>> versions, and archive the corresponding directory to avoid accumulating too
>>>>>>>>> many versions in the long term.
>>>>>>>>>
>>>>>>>>> -Jack Ye
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry, I was thinking about CI integration between Iceberg Java
>>>>>>>>>> and Iceberg Spark, I just didn't mention it and I see how that's a big
>>>>>>>>>> thing to leave out!
>>>>>>>>>>
>>>>>>>>>> I would definitely want to test the projects together. One thing
>>>>>>>>>> we could do is have a nightly build like Russell suggests. I'm also
>>>>>>>>>> wondering if we could have some tighter integration where the Iceberg Spark
>>>>>>>>>> build can be included in the Iceberg Java build using properties. Maybe the
>>>>>>>>>> github action could checkout Iceberg, then checkout the Spark
>>>>>>>>>> integration's latest branch, and then run the gradle build with a property
>>>>>>>>>> that makes Spark a subproject in the build. That way we can continue to
>>>>>>>>>> have Spark CI run regularly.
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I agree that Option 2 is considerably more difficult for
>>>>>>>>>>> development when core API changes need to be picked up by the external
>>>>>>>>>>> Spark module. I also think a monthly release would probably still be
>>>>>>>>>>> prohibitive to actually implementing new features that appear in the API, I
>>>>>>>>>>> would hope we have a much faster process or maybe just have snapshot
>>>>>>>>>>> artifacts published nightly?
>>>>>>>>>>>
>>>>>>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <
>>>>>>>>>>> wypoon@cloudera.com.INVALID> wrote:
>>>>>>>>>>>
>>>>>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a
>>>>>>>>>>> separate repo (subproject of Iceberg). Would we have branches such as
>>>>>>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be
>>>>>>>>>>> supported in all versions or all Spark 3 versions, then we would need to
>>>>>>>>>>> commit the changes to all applicable branches. Basically we are trading
>>>>>>>>>>> more work to commit to multiple branches for simplified build and CI
>>>>>>>>>>> time per branch, which might be an acceptable trade-off. However, the
>>>>>>>>>>> biggest downside is that changes may need to be made in core Iceberg as
>>>>>>>>>>> well as in the engine (in this case Spark) support, and we need to wait for
>>>>>>>>>>> a release of core Iceberg to consume the changes in the subproject. In this
>>>>>>>>>>> case, maybe we should have a monthly release of core Iceberg (no matter how
>>>>>>>>>>> many changes go in, as long as it is non-zero) so that the subproject can
>>>>>>>>>>> consume changes fairly quickly?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <bl...@tabular.io>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the
>>>>>>>>>>>> set of potential solutions well defined.
>>>>>>>>>>>>
>>>>>>>>>>>> Looks like the next step is to decide whether we want to
>>>>>>>>>>>> require people to update Spark versions to pick up newer versions of
>>>>>>>>>>>> Iceberg. If we choose to make people upgrade, then option 1 is clearly the
>>>>>>>>>>>> best choice.
>>>>>>>>>>>>
>>>>>>>>>>>> I don’t think that we should make updating Spark a requirement.
>>>>>>>>>>>> Many of the things that we’re working on are orthogonal to Spark versions,
>>>>>>>>>>>> like table maintenance actions, secondary indexes, the 1.0 API, views, ORC
>>>>>>>>>>>> delete files, new storage implementations, etc. Upgrading Spark is time
>>>>>>>>>>>> consuming and untrusted in my experience, so I think we would be setting up
>>>>>>>>>>>> an unnecessary trade-off between spending lots of time to upgrade Spark and
>>>>>>>>>>>> picking up new Iceberg features.
>>>>>>>>>>>>
>>>>>>>>>>>> Another way of thinking about this is that if we went with
>>>>>>>>>>>> option 1, then we could port bug fixes into 0.12.x. But there are many
>>>>>>>>>>>> things that wouldn’t fit this model, like adding a FileIO implementation
>>>>>>>>>>>> for ADLS. So some people in the community would have to maintain branches
>>>>>>>>>>>> of newer Iceberg versions with older versions of Spark outside of the main
>>>>>>>>>>>> Iceberg project — that defeats the purpose of simplifying things with
>>>>>>>>>>>> option 1 because we would then have more people maintaining the same 0.13.x
>>>>>>>>>>>> with Spark 3.1 branch. (This reminds me of the Spark community, where we
>>>>>>>>>>>> wanted to release a 2.5 line with DSv2 backported, but the community
>>>>>>>>>>>> decided not to so we built similar 2.4+DSv2 branches at Netflix, Tencent,
>>>>>>>>>>>> Apple, etc.)
>>>>>>>>>>>>
>>>>>>>>>>>> If the community is going to do the work anyway — and I think
>>>>>>>>>>>> some of us would — we should make it possible to share that work. That’s
>>>>>>>>>>>> why I don’t think that we should go with option 1.
>>>>>>>>>>>>
>>>>>>>>>>>> If we don’t go with option 1, then the choice is how to
>>>>>>>>>>>> maintain multiple Spark versions. I think that the way we’re doing it right
>>>>>>>>>>>> now is not something we want to continue.
>>>>>>>>>>>>
>>>>>>>>>>>> Using multiple modules (option 3) is concerning to me because
>>>>>>>>>>>> of the changes in Spark. We currently structure the library to share as
>>>>>>>>>>>> much code as possible. But that means compiling against different Spark
>>>>>>>>>>>> versions and relying on binary compatibility and reflection in some cases.
>>>>>>>>>>>> To me, this seems unmaintainable in the long run because it requires
>>>>>>>>>>>> refactoring common classes and spending a lot of time deduplicating code.
>>>>>>>>>>>> It also creates a ton of modules, at least one common module, then a module
>>>>>>>>>>>> per version, then an extensions module per version, and finally a runtime
>>>>>>>>>>>> module per version. That’s 3 modules per Spark version, plus any new common
>>>>>>>>>>>> modules. And each module needs to be tested, which is making our CI take a
>>>>>>>>>>>> really long time. We also don’t support multiple Scala versions, which is
>>>>>>>>>>>> another gap that will require even more modules and tests.
>>>>>>>>>>>>
>>>>>>>>>>>> I like option 2 because it would allow us to compile against a
>>>>>>>>>>>> single version of Spark (which will be much more reliable). It would give
>>>>>>>>>>>> us an opportunity to support different Scala versions. It avoids the need
>>>>>>>>>>>> to refactor to share code and allows people to focus on a single version of
>>>>>>>>>>>> Spark, while also creating a way for people to maintain and update the
>>>>>>>>>>>> older versions with newer Iceberg releases. I don’t think that this would
>>>>>>>>>>>> slow down development. I think it would actually speed it up because we’d
>>>>>>>>>>>> be spending less time trying to make multiple versions work in the same
>>>>>>>>>>>> build. And anyone in favor of option 1 would basically get option 1: you
>>>>>>>>>>>> don’t have to care about branches for older Spark versions.
>>>>>>>>>>>>
>>>>>>>>>>>> Jack makes a good point about wanting to keep code in a single
>>>>>>>>>>>> repository, but I think that the need to manage more version combinations
>>>>>>>>>>>> overrides this concern. It’s easier to make this decision in python because
>>>>>>>>>>>> we’re not trying to depend on two projects that change relatively quickly.
>>>>>>>>>>>> We’re just trying to build a library.
>>>>>>>>>>>>
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <op...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for bringing this up,  Anton.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Everyone has great pros/cons to support their preferences.
>>>>>>>>>>>>> Before giving my preference, let me raise one question:    what's the top
>>>>>>>>>>>>> priority thing for apache iceberg project at this point in time ?  This
>>>>>>>>>>>>> question will help us to answer the following question: Should we support
>>>>>>>>>>>>> more engine versions more robustly or be a bit more aggressive and
>>>>>>>>>>>>> concentrate on getting the new features that users need most in order to
>>>>>>>>>>>>> keep the project more competitive ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> If people watch the apache iceberg project and check the
>>>>>>>>>>>>> issues & PR frequently,  I guess more than 90% people will answer the
>>>>>>>>>>>>> priority question:   There is no doubt for making the whole v2 story to be
>>>>>>>>>>>>> production-ready.   The current roadmap discussion also proofs the thing :
>>>>>>>>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>>>>>>>>> .
>>>>>>>>>>>>>
>>>>>>>>>>>>> In order to ensure the highest priority at this point in time,
>>>>>>>>>>>>> I will prefer option-1 to reduce the cost of engine maintenance, so as to
>>>>>>>>>>>>> free up resources to make v2 production-ready.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <
>>>>>>>>>>>>> sai.sai.shao@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> From Dev's point, it has less burden to always support the
>>>>>>>>>>>>>> latest version of Spark (for example). But from user's point,
>>>>>>>>>>>>>> especially for us who maintain Spark internally, it is not easy to upgrade
>>>>>>>>>>>>>> the Spark version for the first time (since we have many customizations
>>>>>>>>>>>>>> internally), and we're still promoting to upgrade to 3.1.2. If the
>>>>>>>>>>>>>> community ditches the support of old version of Spark3, users have to
>>>>>>>>>>>>>> maintain it themselves unavoidably.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So I'm inclined to make this support in community, not by
>>>>>>>>>>>>>> users themselves, as for Option 2 or 3, I'm fine with either. And to
>>>>>>>>>>>>>> relieve the burden, we could support limited versions of Spark (for example
>>>>>>>>>>>>>> 2 versions).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just my two cents.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Saisai
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Wing Yew,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think 2.4 is a different story, we will continue to
>>>>>>>>>>>>>>> support Spark 2.4, but as you can see it will continue to have very limited
>>>>>>>>>>>>>>> functionalities comparing to Spark 3. I believe we discussed about option 3
>>>>>>>>>>>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the
>>>>>>>>>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>>>>>>>>>>>>> consistent strategy around this, let's take this chance to make a good
>>>>>>>>>>>>>>> community guideline for all future engine versions, especially for Spark,
>>>>>>>>>>>>>>> Flink and Hive that are in the same repository.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can totally understand your point of view Wing, in fact,
>>>>>>>>>>>>>>> speaking from the perspective of AWS EMR, we have to support over 40
>>>>>>>>>>>>>>> versions of the software because there are people who are still using Spark
>>>>>>>>>>>>>>> 1.4, believe it or not. After all, keep backporting changes will become a
>>>>>>>>>>>>>>> liability not only on the user side, but also on the service provider side,
>>>>>>>>>>>>>>> so I believe it's not a bad practice to push for user upgrade, as it will
>>>>>>>>>>>>>>> make the life of both parties easier in the end. New feature is definitely
>>>>>>>>>>>>>>> one of the best incentives to promote an upgrade on user side.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think the biggest issue of option 3 is about its
>>>>>>>>>>>>>>> scalability, because we will have an unbounded list of packages to add and
>>>>>>>>>>>>>>> compile in the future, and we probably cannot drop support of that package
>>>>>>>>>>>>>>> once created. If we go with option 1, I think we can still publish a few
>>>>>>>>>>>>>>> patch versions for old Iceberg releases, and committers can control the
>>>>>>>>>>>>>>> amount of patch versions to guard people from abusing the power of
>>>>>>>>>>>>>>> patching. I see this as a consistent strategy also for Flink and Hive. With
>>>>>>>>>>>>>>> this strategy, we can truly have a compatibility matrix for engine versions
>>>>>>>>>>>>>>> against Iceberg versions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>>>>>>>>>>> wypoon@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I understand and sympathize with the desire to use new DSv2
>>>>>>>>>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest for developers,
>>>>>>>>>>>>>>>> but I don't think it considers the interests of users. I do not think that
>>>>>>>>>>>>>>>> most users will upgrade to Spark 3.2 as soon as it is released. It is a
>>>>>>>>>>>>>>>> "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all
>>>>>>>>>>>>>>>> know that it is not a minor upgrade. There are a lot of changes from 3.0 to
>>>>>>>>>>>>>>>> 3.1 and from 3.1 to 3.2. I think there are even a lot of users running
>>>>>>>>>>>>>>>> Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting
>>>>>>>>>>>>>>>> Spark 2.4?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have
>>>>>>>>>>>>>>>> spoken out in favor of Option 1 all work for the same organization, don't
>>>>>>>>>>>>>>>> they? And they don't have a problem with making their users, all internal,
>>>>>>>>>>>>>>>> simply upgrade to Spark 3.2, do they? (Or they are already running an
>>>>>>>>>>>>>>>> internal fork that is close to 3.2.)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I work for an organization with customers running different
>>>>>>>>>>>>>>>> versions of Spark. It is true that we can backport new features to older
>>>>>>>>>>>>>>>> versions if we wanted to. I suppose the people contributing to Iceberg work
>>>>>>>>>>>>>>>> for some organization or other that either use Iceberg in-house, or provide
>>>>>>>>>>>>>>>> software (possibly in the form of a service) to customers, and either way,
>>>>>>>>>>>>>>>> the organizations have the ability to backport features and fixes to
>>>>>>>>>>>>>>>> internal versions. Are there any users out there who simply use Apache
>>>>>>>>>>>>>>>> Iceberg and depend on the community version?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There may be features that are broadly useful that do not
>>>>>>>>>>>>>>>> depend on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even
>>>>>>>>>>>>>>>> 2.4)?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1,
>>>>>>>>>>>>>>>> but I would consider Option 3 too. Anton, you said 5 modules are required;
>>>>>>>>>>>>>>>> what are the modules you're thinking of?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Wing Yew
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <
>>>>>>>>>>>>>>>> flyrain000@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. Both 2 and 3 will slow down the development.
>>>>>>>>>>>>>>>>> Considering the limited resources in the open source community, the upsides
>>>>>>>>>>>>>>>>> of option 2 and 3 are probably not worthy.
>>>>>>>>>>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's
>>>>>>>>>>>>>>>>> hard to predict anything, but even if these use cases are legit, users can
>>>>>>>>>>>>>>>>> still get the new feature by backporting it to an older version in case of
>>>>>>>>>>>>>>>>> upgrading to a newer version isn't an option.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3
>>>>>>>>>>>>>>>>>> version)*
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The easiest option for us devs, forces the user to
>>>>>>>>>>>>>>>>>> upgrade to the most recent minor Spark version to consume any new
>>>>>>>>>>>>>>>>>> Iceberg features.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Can support as many Spark versions as needed and the
>>>>>>>>>>>>>>>>>> codebase is still separate as we can use separate branches.
>>>>>>>>>>>>>>>>>> Impossible to consume any unreleased changes in core, may
>>>>>>>>>>>>>>>>>> slow down the development.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>>>>>>>>>>> Can consume unreleased changes but it will required at
>>>>>>>>>>>>>>>>>> least 5 modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>>>>>>>>>>>>> complicated.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark
>>>>>>>>>>>>>>>>>> version (e3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>>>>>>>>>>> We follow Option 1 internally at the moment but I would
>>>>>>>>>>>>>>>>>> like to hear what other people think/need.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>>>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think we should go for option 1. I already am not a big
>>>>>>>>>>>>>>>>>> fan of having runtime errors for unsupported things based on versions and I
>>>>>>>>>>>>>>>>>> don't think minor version upgrades are a large issue for users.  I'm
>>>>>>>>>>>>>>>>>> especially not looking forward to supporting interfaces that only exist in
>>>>>>>>>>>>>>>>>> Spark 3.2 in a multiple Spark version support future.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>>>>>>>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this
>>>>>>>>>>>>>>>>>> moment.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all
>>>>>>>>>>>>>>>>>> the minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>>>>>>>> major version.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to
>>>>>>>>>>>>>>>>>> support both Spark 3.1 and Spark 3.2 with a single module, it means we have
>>>>>>>>>>>>>>>>>> to compile against 3.1. The problem is that we rely on DSv2 that is being
>>>>>>>>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial differences. On top of
>>>>>>>>>>>>>>>>>> that, we have our extensions that are extremely low-level and may break not
>>>>>>>>>>>>>>>>>> only between minor versions but also between patch releases.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> f there are some features requiring a newer version, it
>>>>>>>>>>>>>>>>>> makes sense to move that newer version in master.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Internally, we don’t deliver new features to older Spark
>>>>>>>>>>>>>>>>>> versions as it requires a lot of effort to port things. Personally, I don’t
>>>>>>>>>>>>>>>>>> think it is too bad to require users to upgrade if they want new features.
>>>>>>>>>>>>>>>>>> At the same time, there are valid concerns with this approach too that we
>>>>>>>>>>>>>>>>>> mentioned during the sync. For example, certain new features would also
>>>>>>>>>>>>>>>>>> work fine with older Spark versions. I generally agree with that and that
>>>>>>>>>>>>>>>>>> not supporting recent versions is not ideal. However, I want to find a
>>>>>>>>>>>>>>>>>> balance between the complexity on our side and ease of use for the users.
>>>>>>>>>>>>>>>>>> Ideally, supporting a few recent versions would be sufficient but our Spark
>>>>>>>>>>>>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all
>>>>>>>>>>>>>>>>>> the minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>>>>>>>> major version. This avoids the problem that some users are unwilling to
>>>>>>>>>>>>>>>>>> move to a newer version and keep patching old Spark version branches. If
>>>>>>>>>>>>>>>>>> there are some features requiring a newer version, it makes sense to move
>>>>>>>>>>>>>>>>>> that newer version in master.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In addition, because currently Spark is considered the
>>>>>>>>>>>>>>>>>> most feature-complete reference implementation compared to all other
>>>>>>>>>>>>>>>>>> engines, I think we should not add artificial barriers that would slow down
>>>>>>>>>>>>>>>>>> its development speed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is
>>>>>>>>>>>>>>>>>>> great to support older versions but because we compile against 3.0, we
>>>>>>>>>>>>>>>>>>> cannot use any Spark features that are offered in newer versions.
>>>>>>>>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot
>>>>>>>>>>>>>>>>>>> of important features such dynamic filtering for v2 tables, required
>>>>>>>>>>>>>>>>>>> distribution and ordering for writes, etc. These features are too important
>>>>>>>>>>>>>>>>>>> to ignore them.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for
>>>>>>>>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of the 3.2 features.
>>>>>>>>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us internally and would
>>>>>>>>>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Option 1
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a
>>>>>>>>>>>>>>>>>>> while by releasing minor versions with bug fixes.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no
>>>>>>>>>>>>>>>>>>> extra work on our side as just a single Spark version is actively
>>>>>>>>>>>>>>>>>>> maintained.
>>>>>>>>>>>>>>>>>>> Cons: some new features that we will be adding to master
>>>>>>>>>>>>>>>>>>> could also work with older Spark versions but all 0.12 releases will only
>>>>>>>>>>>>>>>>>>> contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2
>>>>>>>>>>>>>>>>>>> to consume any new Spark or format features.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Option 2
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Move our Spark integration into a separate project and
>>>>>>>>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Pros: decouples the format version from Spark, we can
>>>>>>>>>>>>>>>>>>> support as many Spark versions as needed.
>>>>>>>>>>>>>>>>>>> Cons: more work initially to set everything up, more
>>>>>>>>>>>>>>>>>>> work to release, will need a new release of the core format to consume any
>>>>>>>>>>>>>>>>>>> changes in the Spark integration.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user but
>>>>>>>>>>>>>>>>>>> my main worry is that we will have to release the format more frequently
>>>>>>>>>>>>>>>>>>> (which is a good thing but requires more work and time) and the overall
>>>>>>>>>>>>>>>>>>> Spark development may be slower.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Anton
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Tabular
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>>
>>>>>>>>
>
> --
> Ryan Blue
> Tabular
>

Re: [DISCUSS] Spark version support strategy

Posted by Ryan Blue <bl...@tabular.io>.

Thanks for the context on the Flink side! I think it sounds reasonable to
keep up to date with the latest supported Flink version. If we want, we
could later go with something similar to what we do for Spark but we’ll see
how it goes and what the Flink community needs. We should probably add a
section to our Flink docs that explains and links to Flink’s support policy
and has a table of Iceberg versions that work with Flink versions. (We
should probably have the same table for Spark, too!)

For Spark, I’m also leaning toward the modified option 3 where we keep all
of the code in the main repository but only build with one module at a time
by default. It makes sense to switch based on modules — rather than
selecting src paths within a module — so that it is easy to run a build
with all modules if you choose to — for example, when building release
binaries.

The reason I think we should go with option 3 is for testing. If we have a
single repo with api, core, etc. and spark then changes to the common
modules can be tested by CI actions. Updates to individual Spark modules
would be completely independent. There is a slight inconvenience that when
an API used by Spark changes, the author would still need to fix multiple
Spark versions. But the trade-off is that with a separate repository like
option 2, changes that break Spark versions are not caught and then the
Spark repository’s CI ends up failing on completely unrelated changes. That
would be a major pain, felt by everyone contributing to the Spark
integration, so I think option 3 is the best path forward.

It sounds like we probably have some agreement now, but please speak up if
you think another option would be better.

The next step is to prototype the build changes to test out option 3. Or if
you prefer option 2, then prototype those changes as well. I think that
Anton is planning to do this, but if you have time and the desire to do it
please reach out and coordinate with us!

Ryan

On Wed, Sep 29, 2021 at 9:12 PM Steven Wu <st...@gmail.com> wrote:

> Wing, sorry, my earlier message probably misled you. I was speaking my
> personal opinion on Flink version support.
>
> On Tue, Sep 28, 2021 at 8:03 PM Wing Yew Poon <wy...@cloudera.com.invalid>
> wrote:
>
>> Hi OpenInx,
>> I'm sorry I misunderstood the thinking of the Flink community. Thanks for
>> the clarification.
>> - Wing Yew
>>
>>
>> On Tue, Sep 28, 2021 at 7:15 PM OpenInx <op...@gmail.com> wrote:
>>
>>> Hi Wing
>>>
>>> As we discussed above, we community prefer to choose option.2 or
>>> option.3.  So in fact, when we planned to upgrade the flink version from
>>> 1.12 to 1.13,  we are doing our best to guarantee the master iceberg repo
>>> could work fine for both flink1.12 & flink1.13. More context please see
>>> [1], [2], [3]
>>>
>>> [1] https://github.com/apache/iceberg/pull/3116
>>> [2] https://github.com/apache/iceberg/issues/3183
>>> [3]
>>> https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E
>>>
>>>
>>> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon
>>> <wy...@cloudera.com.invalid> wrote:
>>>
>>>> In the last community sync, we spent a little time on this topic. For
>>>> Spark support, there are currently two options under consideration:
>>>>
>>>> Option 2: Separate repo for the Spark support. Use branches for
>>>> supporting different Spark versions. Main branch for the latest Spark
>>>> version (3.2 to begin with).
>>>> Tooling needs to be built for producing regular snapshots of core
>>>> Iceberg in a consumable way for this repo. Unclear if commits to core
>>>> Iceberg will be tested pre-commit against Spark support; my impression is
>>>> that they will not be, and the Spark support build can be broken by changes
>>>> to core.
>>>>
>>>> A variant of option 3 (which we will simply call Option 3 going
>>>> forward): Single repo, separate module (subdirectory) for each Spark
>>>> version to be supported. Code duplication in each Spark module (no attempt
>>>> to refactor out common code). Each module built against the specific
>>>> version of Spark to be supported, producing a runtime jar built against
>>>> that version. CI will test all modules. Support can be provided for only
>>>> building the modules a developer cares about.
>>>>
>>>> More input was sought and people are encouraged to voice their
>>>> preference.
>>>> I lean towards Option 3.
>>>>
>>>> - Wing Yew
>>>>
>>>> ps. In the sync, as Steven Wu wrote, the question was raised if the
>>>> same multi-version support strategy can be adopted across engines. Based on
>>>> what Steven wrote, currently the Flink developer community's bandwidth
>>>> makes supporting only a single Flink version (and focusing resources on
>>>> developing new features on that version) the preferred choice. If so, then
>>>> no multi-version support strategy for Flink is needed at this time.
>>>>
>>>>
>>>> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <st...@gmail.com> wrote:
>>>>
>>>>> During the sync meeting, people talked about if and how we can have
>>>>> the same version support model across engines like Flink and Spark. I can
>>>>> provide some input from the Flink side.
>>>>>
>>>>> Flink only supports two minor versions. E.g., right now Flink 1.13 is
>>>>> the latest released version. That means only Flink 1.12 and 1.13 are
>>>>> supported. Feature changes or bug fixes will only be backported to 1.12 and
>>>>> 1.13, unless it is a serious bug (like security). With that context,
>>>>> personally I like option 1 (with one actively supported Flink version in
>>>>> master branch) for the iceberg-flink module.
>>>>>
>>>>> We discussed the idea of supporting multiple Flink versions via shm
>>>>> layer and multiple modules. While it may be a little better to support
>>>>> multiple Flink versions, I don't know if there is enough support and
>>>>> resources from the community to pull it off. Also the ongoing maintenance
>>>>> burden for each minor version release from Flink, which happens roughly
>>>>> every 4 months.
>>>>>
>>>>>
>>>>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary <pv...@cloudera.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Since you mentioned Hive, I chime in with what we do there. You might
>>>>>> find it useful:
>>>>>> - metastore module - only small differences - DynConstructor solves
>>>>>> for us
>>>>>> - mr module - some bigger differences, but still manageable for Hive
>>>>>> 2-3. Need some new classes, but most of the code is reused - extra module
>>>>>> for Hive 3. For Hive 4 we use a different repo as we moved to the Hive
>>>>>> codebase.
>>>>>>
>>>>>> My thoughts based on the above experience:
>>>>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly
>>>>>> have problems with backporting changes between repos and we are slacking
>>>>>> behind which hurts both projects
>>>>>> - Hive 2-3 model is working better by forcing us to keep the things
>>>>>> in sync, but with serious differences in the Hive project it still doesn't
>>>>>> seem like a viable option.
>>>>>>
>>>>>> So I think the question is: How stable is the Spark code we are
>>>>>> integrating to. If I is fairly stable then we are better off with a "one
>>>>>> repo multiple modules" approach and we should consider the multirepo only
>>>>>> if the differences become prohibitive.
>>>>>>
>>>>>> Thanks, Peter
>>>>>>
>>>>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi,
>>>>>> <ao...@apple.com.invalid> wrote:
>>>>>>
>>>>>>> Okay, looks like there is consensus around supporting multiple Spark
>>>>>>> versions at the same time. There are folks who mentioned this on this
>>>>>>> thread and there were folks who brought this up during the sync.
>>>>>>>
>>>>>>> Let’s think through Option 2 and 3 in more detail then.
>>>>>>>
>>>>>>> Option 2
>>>>>>>
>>>>>>> In Option 2, there will be a separate repo. I believe the master
>>>>>>> branch will soon point to Spark 3.2 (the most recent supported version).
>>>>>>> The main development will happen there and the artifact version will be
>>>>>>> 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1
>>>>>>> branches where we will cherry-pick applicable changes. Once we are ready to
>>>>>>> release 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and cut 3
>>>>>>> releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump the
>>>>>>> version in master to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1
>>>>>>> branches for cherry-picks.
>>>>>>>
>>>>>>> I guess we will continue to shade everything in the new repo and
>>>>>>> will have to release every time the core is released. We will do a
>>>>>>> maintenance release for each supported Spark version whenever we cut a new
>>>>>>> maintenance Iceberg release or need to fix any bugs in the Spark
>>>>>>> integration.
>>>>>>> Under this model, we will probably need nightly snapshots (or on
>>>>>>> each commit) for the core format and the Spark integration will depend on
>>>>>>> snapshots until we are ready to release.
>>>>>>>
>>>>>>> Overall, I think this option gives us very simple builds and
>>>>>>> provides best separation. It will keep the main repo clean. The main
>>>>>>> downside is that we will have to split a Spark feature into two PRs: one
>>>>>>> against the core and one against the Spark integration. Certain changes in
>>>>>>> core can also break the Spark integration too and will require adaptations.
>>>>>>>
>>>>>>> Ryan, I am not sure I fully understood the testing part. How will we
>>>>>>> be able to test the Spark integration in the main repo if certain changes
>>>>>>> in core may break the Spark integration and require changes there? Will we
>>>>>>> try to prohibit such changes?
>>>>>>>
>>>>>>> Option 3 (modified)
>>>>>>>
>>>>>>> If I get correctly, the modified Option 3 sounds very close to
>>>>>>> the initially suggested approach by Imran but with code duplication instead
>>>>>>> of extra refactoring and introducing new common modules.
>>>>>>>
>>>>>>> Jack, are you suggesting we test only a single Spark version at a
>>>>>>> time? Or do we expect to test all versions? Will there be any difference
>>>>>>> compared to just having a module per version? I did not fully
>>>>>>> understand.
>>>>>>>
>>>>>>> My worry with this approach is that our build will be very
>>>>>>> complicated and we will still have a lot of Spark-related modules in the
>>>>>>> main repo. Once people start using Flink and Hive more, will we have to do
>>>>>>> the same?
>>>>>>>
>>>>>>> - Anton
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 16 Sep 2021, at 08:11, Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>
>>>>>>> I'd support the option that Jack suggests if we can set a few
>>>>>>> expectations for keeping it clean.
>>>>>>>
>>>>>>> First, I'd like to avoid refactoring code to share it across Spark
>>>>>>> versions -- that introduces risk because we're relying on compiling against
>>>>>>> one version and running in another and both Spark and Scala change rapidly.
>>>>>>> A big benefit of options 1 and 2 is that we mostly focus on only one Spark
>>>>>>> version. I think we should duplicate code rather than spend time
>>>>>>> refactoring to rely on binary compatibility. I propose we start each new
>>>>>>> Spark version by copying the last one and updating it. And we should build
>>>>>>> just the latest supported version by default.
>>>>>>>
>>>>>>> The drawback to having everything in a single repo is that we
>>>>>>> wouldn't be able to cherry-pick changes across Spark versions/branches, but
>>>>>>> I think Jack is right that having a single build is better.
>>>>>>>
>>>>>>> Second, we should make CI faster by running the Spark builds in
>>>>>>> parallel. It sounds like this is what would happen anyway, with a property
>>>>>>> that selects the Spark version that you want to build against.
>>>>>>>
>>>>>>> Overall, this new suggestion sounds like a promising way forward.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <ye...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I think in Ryan's proposal we will create a ton of modules anyway,
>>>>>>>> as Wing listed we are just using git branch as an additional dimension, but
>>>>>>>> my understanding is that you will still have 1 core, 1 extension, 1 runtime
>>>>>>>> artifact published for each Spark version in either approach.
>>>>>>>>
>>>>>>>> In that case, this is just brainstorming, I wonder if we can
>>>>>>>> explore a modified option 3 that flattens all the versions in each Spark
>>>>>>>> branch in option 2 into master. The repository structure would look
>>>>>>>> something like:
>>>>>>>>
>>>>>>>> iceberg/api/...
>>>>>>>>             /bundled-guava/...
>>>>>>>>             /core/...
>>>>>>>>             ...
>>>>>>>>             /spark/2.4/core/...
>>>>>>>>                             /extension/...
>>>>>>>>                             /runtime/...
>>>>>>>>                       /3.1/core/...
>>>>>>>>                             /extension/...
>>>>>>>>                             /runtime/...
>>>>>>>>
>>>>>>>> The gradle build script in the root is configured to build against
>>>>>>>> the latest version of Spark by default, unless otherwise specified by the
>>>>>>>> user.
>>>>>>>>
>>>>>>>> Intellij can also be configured to only index files of specific
>>>>>>>> versions based on the same config used in build.
>>>>>>>>
>>>>>>>> In this way, I imagine the CI setup to be much easier to do things
>>>>>>>> like testing version compatibility for a feature or running only a
>>>>>>>> specific subset of Spark version builds based on the Spark version
>>>>>>>> directories touched.
>>>>>>>>
>>>>>>>> And the biggest benefit is that we don't have the same difficulty
>>>>>>>> as option 2 of developing a feature when it's both in core and Spark.
>>>>>>>>
>>>>>>>> We can then develop a mechanism to vote to stop support of certain
>>>>>>>> versions, and archive the corresponding directory to avoid accumulating too
>>>>>>>> many versions in the long term.
>>>>>>>>
>>>>>>>> -Jack Ye
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> Sorry, I was thinking about CI integration between Iceberg Java
>>>>>>>>> and Iceberg Spark, I just didn't mention it and I see how that's a big
>>>>>>>>> thing to leave out!
>>>>>>>>>
>>>>>>>>> I would definitely want to test the projects together. One thing
>>>>>>>>> we could do is have a nightly build like Russell suggests. I'm also
>>>>>>>>> wondering if we could have some tighter integration where the Iceberg Spark
>>>>>>>>> build can be included in the Iceberg Java build using properties. Maybe the
>>>>>>>>> github action could checkout Iceberg, then checkout the Spark
>>>>>>>>> integration's latest branch, and then run the gradle build with a property
>>>>>>>>> that makes Spark a subproject in the build. That way we can continue to
>>>>>>>>> have Spark CI run regularly.
>>>>>>>>>
>>>>>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I agree that Option 2 is considerably more difficult for
>>>>>>>>>> development when core API changes need to be picked up by the external
>>>>>>>>>> Spark module. I also think a monthly release would probably still be
>>>>>>>>>> prohibitive to actually implementing new features that appear in the API, I
>>>>>>>>>> would hope we have a much faster process or maybe just have snapshot
>>>>>>>>>> artifacts published nightly?
>>>>>>>>>>
>>>>>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <
>>>>>>>>>> wypoon@cloudera.com.INVALID> wrote:
>>>>>>>>>>
>>>>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a
>>>>>>>>>> separate repo (subproject of Iceberg). Would we have branches such as
>>>>>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be
>>>>>>>>>> supported in all versions or all Spark 3 versions, then we would need to
>>>>>>>>>> commit the changes to all applicable branches. Basically we are trading
>>>>>>>>>> more work to commit to multiple branches for simplified build and CI
>>>>>>>>>> time per branch, which might be an acceptable trade-off. However, the
>>>>>>>>>> biggest downside is that changes may need to be made in core Iceberg as
>>>>>>>>>> well as in the engine (in this case Spark) support, and we need to wait for
>>>>>>>>>> a release of core Iceberg to consume the changes in the subproject. In this
>>>>>>>>>> case, maybe we should have a monthly release of core Iceberg (no matter how
>>>>>>>>>> many changes go in, as long as it is non-zero) so that the subproject can
>>>>>>>>>> consume changes fairly quickly?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <bl...@tabular.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the
>>>>>>>>>>> set of potential solutions well defined.
>>>>>>>>>>>
>>>>>>>>>>> Looks like the next step is to decide whether we want to require
>>>>>>>>>>> people to update Spark versions to pick up newer versions of Iceberg. If we
>>>>>>>>>>> choose to make people upgrade, then option 1 is clearly the best choice.
>>>>>>>>>>>
>>>>>>>>>>> I don’t think that we should make updating Spark a requirement.
>>>>>>>>>>> Many of the things that we’re working on are orthogonal to Spark versions,
>>>>>>>>>>> like table maintenance actions, secondary indexes, the 1.0 API, views, ORC
>>>>>>>>>>> delete files, new storage implementations, etc. Upgrading Spark is time
>>>>>>>>>>> consuming and untrusted in my experience, so I think we would be setting up
>>>>>>>>>>> an unnecessary trade-off between spending lots of time to upgrade Spark and
>>>>>>>>>>> picking up new Iceberg features.
>>>>>>>>>>>
>>>>>>>>>>> Another way of thinking about this is that if we went with
>>>>>>>>>>> option 1, then we could port bug fixes into 0.12.x. But there are many
>>>>>>>>>>> things that wouldn’t fit this model, like adding a FileIO implementation
>>>>>>>>>>> for ADLS. So some people in the community would have to maintain branches
>>>>>>>>>>> of newer Iceberg versions with older versions of Spark outside of the main
>>>>>>>>>>> Iceberg project — that defeats the purpose of simplifying things with
>>>>>>>>>>> option 1 because we would then have more people maintaining the same 0.13.x
>>>>>>>>>>> with Spark 3.1 branch. (This reminds me of the Spark community, where we
>>>>>>>>>>> wanted to release a 2.5 line with DSv2 backported, but the community
>>>>>>>>>>> decided not to so we built similar 2.4+DSv2 branches at Netflix, Tencent,
>>>>>>>>>>> Apple, etc.)
>>>>>>>>>>>
>>>>>>>>>>> If the community is going to do the work anyway — and I think
>>>>>>>>>>> some of us would — we should make it possible to share that work. That’s
>>>>>>>>>>> why I don’t think that we should go with option 1.
>>>>>>>>>>>
>>>>>>>>>>> If we don’t go with option 1, then the choice is how to maintain
>>>>>>>>>>> multiple Spark versions. I think that the way we’re doing it right now is
>>>>>>>>>>> not something we want to continue.
>>>>>>>>>>>
>>>>>>>>>>> Using multiple modules (option 3) is concerning to me because of
>>>>>>>>>>> the changes in Spark. We currently structure the library to share as much
>>>>>>>>>>> code as possible. But that means compiling against different Spark versions
>>>>>>>>>>> and relying on binary compatibility and reflection in some cases. To me,
>>>>>>>>>>> this seems unmaintainable in the long run because it requires refactoring
>>>>>>>>>>> common classes and spending a lot of time deduplicating code. It also
>>>>>>>>>>> creates a ton of modules, at least one common module, then a module per
>>>>>>>>>>> version, then an extensions module per version, and finally a runtime
>>>>>>>>>>> module per version. That’s 3 modules per Spark version, plus any new common
>>>>>>>>>>> modules. And each module needs to be tested, which is making our CI take a
>>>>>>>>>>> really long time. We also don’t support multiple Scala versions, which is
>>>>>>>>>>> another gap that will require even more modules and tests.
>>>>>>>>>>>
>>>>>>>>>>> I like option 2 because it would allow us to compile against a
>>>>>>>>>>> single version of Spark (which will be much more reliable). It would give
>>>>>>>>>>> us an opportunity to support different Scala versions. It avoids the need
>>>>>>>>>>> to refactor to share code and allows people to focus on a single version of
>>>>>>>>>>> Spark, while also creating a way for people to maintain and update the
>>>>>>>>>>> older versions with newer Iceberg releases. I don’t think that this would
>>>>>>>>>>> slow down development. I think it would actually speed it up because we’d
>>>>>>>>>>> be spending less time trying to make multiple versions work in the same
>>>>>>>>>>> build. And anyone in favor of option 1 would basically get option 1: you
>>>>>>>>>>> don’t have to care about branches for older Spark versions.
>>>>>>>>>>>
>>>>>>>>>>> Jack makes a good point about wanting to keep code in a single
>>>>>>>>>>> repository, but I think that the need to manage more version combinations
>>>>>>>>>>> overrides this concern. It’s easier to make this decision in python because
>>>>>>>>>>> we’re not trying to depend on two projects that change relatively quickly.
>>>>>>>>>>> We’re just trying to build a library.
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <op...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for bringing this up,  Anton.
>>>>>>>>>>>>
>>>>>>>>>>>> Everyone has great pros/cons to support their preferences.
>>>>>>>>>>>> Before giving my preference, let me raise one question:    what's the top
>>>>>>>>>>>> priority thing for apache iceberg project at this point in time ?  This
>>>>>>>>>>>> question will help us to answer the following question: Should we support
>>>>>>>>>>>> more engine versions more robustly or be a bit more aggressive and
>>>>>>>>>>>> concentrate on getting the new features that users need most in order to
>>>>>>>>>>>> keep the project more competitive ?
>>>>>>>>>>>>
>>>>>>>>>>>> If people watch the apache iceberg project and check the issues
>>>>>>>>>>>> & PR frequently,  I guess more than 90% people will answer the priority
>>>>>>>>>>>> question:   There is no doubt for making the whole v2 story to be
>>>>>>>>>>>> production-ready.   The current roadmap discussion also proofs the thing :
>>>>>>>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>>>>>>>> .
>>>>>>>>>>>>
>>>>>>>>>>>> In order to ensure the highest priority at this point in time,
>>>>>>>>>>>> I will prefer option-1 to reduce the cost of engine maintenance, so as to
>>>>>>>>>>>> free up resources to make v2 production-ready.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <
>>>>>>>>>>>> sai.sai.shao@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> From Dev's point, it has less burden to always support the
>>>>>>>>>>>>> latest version of Spark (for example). But from user's point,
>>>>>>>>>>>>> especially for us who maintain Spark internally, it is not easy to upgrade
>>>>>>>>>>>>> the Spark version for the first time (since we have many customizations
>>>>>>>>>>>>> internally), and we're still promoting to upgrade to 3.1.2. If the
>>>>>>>>>>>>> community ditches the support of old version of Spark3, users have to
>>>>>>>>>>>>> maintain it themselves unavoidably.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So I'm inclined to make this support in community, not by
>>>>>>>>>>>>> users themselves, as for Option 2 or 3, I'm fine with either. And to
>>>>>>>>>>>>> relieve the burden, we could support limited versions of Spark (for example
>>>>>>>>>>>>> 2 versions).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just my two cents.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Saisai
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Wing Yew,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think 2.4 is a different story, we will continue to support
>>>>>>>>>>>>>> Spark 2.4, but as you can see it will continue to have very limited
>>>>>>>>>>>>>> functionalities comparing to Spark 3. I believe we discussed about option 3
>>>>>>>>>>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the
>>>>>>>>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>>>>>>>>>>>> consistent strategy around this, let's take this chance to make a good
>>>>>>>>>>>>>> community guideline for all future engine versions, especially for Spark,
>>>>>>>>>>>>>> Flink and Hive that are in the same repository.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I can totally understand your point of view Wing, in fact,
>>>>>>>>>>>>>> speaking from the perspective of AWS EMR, we have to support over 40
>>>>>>>>>>>>>> versions of the software because there are people who are still using Spark
>>>>>>>>>>>>>> 1.4, believe it or not. After all, keep backporting changes will become a
>>>>>>>>>>>>>> liability not only on the user side, but also on the service provider side,
>>>>>>>>>>>>>> so I believe it's not a bad practice to push for user upgrade, as it will
>>>>>>>>>>>>>> make the life of both parties easier in the end. New feature is definitely
>>>>>>>>>>>>>> one of the best incentives to promote an upgrade on user side.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think the biggest issue of option 3 is about its
>>>>>>>>>>>>>> scalability, because we will have an unbounded list of packages to add and
>>>>>>>>>>>>>> compile in the future, and we probably cannot drop support of that package
>>>>>>>>>>>>>> once created. If we go with option 1, I think we can still publish a few
>>>>>>>>>>>>>> patch versions for old Iceberg releases, and committers can control the
>>>>>>>>>>>>>> amount of patch versions to guard people from abusing the power of
>>>>>>>>>>>>>> patching. I see this as a consistent strategy also for Flink and Hive. With
>>>>>>>>>>>>>> this strategy, we can truly have a compatibility matrix for engine versions
>>>>>>>>>>>>>> against Iceberg versions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>>>>>>>>>> wypoon@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I understand and sympathize with the desire to use new DSv2
>>>>>>>>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest for developers,
>>>>>>>>>>>>>>> but I don't think it considers the interests of users. I do not think that
>>>>>>>>>>>>>>> most users will upgrade to Spark 3.2 as soon as it is released. It is a
>>>>>>>>>>>>>>> "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all
>>>>>>>>>>>>>>> know that it is not a minor upgrade. There are a lot of changes from 3.0 to
>>>>>>>>>>>>>>> 3.1 and from 3.1 to 3.2. I think there are even a lot of users running
>>>>>>>>>>>>>>> Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting
>>>>>>>>>>>>>>> Spark 2.4?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have
>>>>>>>>>>>>>>> spoken out in favor of Option 1 all work for the same organization, don't
>>>>>>>>>>>>>>> they? And they don't have a problem with making their users, all internal,
>>>>>>>>>>>>>>> simply upgrade to Spark 3.2, do they? (Or they are already running an
>>>>>>>>>>>>>>> internal fork that is close to 3.2.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I work for an organization with customers running different
>>>>>>>>>>>>>>> versions of Spark. It is true that we can backport new features to older
>>>>>>>>>>>>>>> versions if we wanted to. I suppose the people contributing to Iceberg work
>>>>>>>>>>>>>>> for some organization or other that either use Iceberg in-house, or provide
>>>>>>>>>>>>>>> software (possibly in the form of a service) to customers, and either way,
>>>>>>>>>>>>>>> the organizations have the ability to backport features and fixes to
>>>>>>>>>>>>>>> internal versions. Are there any users out there who simply use Apache
>>>>>>>>>>>>>>> Iceberg and depend on the community version?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There may be features that are broadly useful that do not
>>>>>>>>>>>>>>> depend on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even
>>>>>>>>>>>>>>> 2.4)?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but
>>>>>>>>>>>>>>> I would consider Option 3 too. Anton, you said 5 modules are required; what
>>>>>>>>>>>>>>> are the modules you're thinking of?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Wing Yew
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <
>>>>>>>>>>>>>>> flyrain000@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. Both 2 and 3 will slow down the development. Considering
>>>>>>>>>>>>>>>> the limited resources in the open source community, the upsides of option 2
>>>>>>>>>>>>>>>> and 3 are probably not worthy.
>>>>>>>>>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's
>>>>>>>>>>>>>>>> hard to predict anything, but even if these use cases are legit, users can
>>>>>>>>>>>>>>>> still get the new feature by backporting it to an older version in case of
>>>>>>>>>>>>>>>> upgrading to a newer version isn't an option.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3
>>>>>>>>>>>>>>>>> version)*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The easiest option for us devs, forces the user to upgrade
>>>>>>>>>>>>>>>>> to the most recent minor Spark version to consume any new
>>>>>>>>>>>>>>>>> Iceberg features.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Can support as many Spark versions as needed and the
>>>>>>>>>>>>>>>>> codebase is still separate as we can use separate branches.
>>>>>>>>>>>>>>>>> Impossible to consume any unreleased changes in core, may
>>>>>>>>>>>>>>>>> slow down the development.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>>>>>>>>>> Can consume unreleased changes but it will required at
>>>>>>>>>>>>>>>>> least 5 modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>>>>>>>>>>>> complicated.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark
>>>>>>>>>>>>>>>>> version (e3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>>>>>>>>>> We follow Option 1 internally at the moment but I would
>>>>>>>>>>>>>>>>> like to hear what other people think/need.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think we should go for option 1. I already am not a big
>>>>>>>>>>>>>>>>> fan of having runtime errors for unsupported things based on versions and I
>>>>>>>>>>>>>>>>> don't think minor version upgrades are a large issue for users.  I'm
>>>>>>>>>>>>>>>>> especially not looking forward to supporting interfaces that only exist in
>>>>>>>>>>>>>>>>> Spark 3.2 in a multiple Spark version support future.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>>>>>>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this
>>>>>>>>>>>>>>>>> moment.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all
>>>>>>>>>>>>>>>>> the minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>>>>>>> major version.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to
>>>>>>>>>>>>>>>>> support both Spark 3.1 and Spark 3.2 with a single module, it means we have
>>>>>>>>>>>>>>>>> to compile against 3.1. The problem is that we rely on DSv2 that is being
>>>>>>>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial differences. On top of
>>>>>>>>>>>>>>>>> that, we have our extensions that are extremely low-level and may break not
>>>>>>>>>>>>>>>>> only between minor versions but also between patch releases.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> f there are some features requiring a newer version, it
>>>>>>>>>>>>>>>>> makes sense to move that newer version in master.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Internally, we don’t deliver new features to older Spark
>>>>>>>>>>>>>>>>> versions as it requires a lot of effort to port things. Personally, I don’t
>>>>>>>>>>>>>>>>> think it is too bad to require users to upgrade if they want new features.
>>>>>>>>>>>>>>>>> At the same time, there are valid concerns with this approach too that we
>>>>>>>>>>>>>>>>> mentioned during the sync. For example, certain new features would also
>>>>>>>>>>>>>>>>> work fine with older Spark versions. I generally agree with that and that
>>>>>>>>>>>>>>>>> not supporting recent versions is not ideal. However, I want to find a
>>>>>>>>>>>>>>>>> balance between the complexity on our side and ease of use for the users.
>>>>>>>>>>>>>>>>> Ideally, supporting a few recent versions would be sufficient but our Spark
>>>>>>>>>>>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all
>>>>>>>>>>>>>>>>> the minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>>>>>>> major version. This avoids the problem that some users are unwilling to
>>>>>>>>>>>>>>>>> move to a newer version and keep patching old Spark version branches. If
>>>>>>>>>>>>>>>>> there are some features requiring a newer version, it makes sense to move
>>>>>>>>>>>>>>>>> that newer version in master.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In addition, because currently Spark is considered the
>>>>>>>>>>>>>>>>> most feature-complete reference implementation compared to all other
>>>>>>>>>>>>>>>>> engines, I think we should not add artificial barriers that would slow down
>>>>>>>>>>>>>>>>> its development speed.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is
>>>>>>>>>>>>>>>>>> great to support older versions but because we compile against 3.0, we
>>>>>>>>>>>>>>>>>> cannot use any Spark features that are offered in newer versions.
>>>>>>>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot
>>>>>>>>>>>>>>>>>> of important features such dynamic filtering for v2 tables, required
>>>>>>>>>>>>>>>>>> distribution and ordering for writes, etc. These features are too important
>>>>>>>>>>>>>>>>>> to ignore them.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for
>>>>>>>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of the 3.2 features.
>>>>>>>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us internally and would
>>>>>>>>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Option 1
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while
>>>>>>>>>>>>>>>>>> by releasing minor versions with bug fixes.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no
>>>>>>>>>>>>>>>>>> extra work on our side as just a single Spark version is actively
>>>>>>>>>>>>>>>>>> maintained.
>>>>>>>>>>>>>>>>>> Cons: some new features that we will be adding to master
>>>>>>>>>>>>>>>>>> could also work with older Spark versions but all 0.12 releases will only
>>>>>>>>>>>>>>>>>> contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2
>>>>>>>>>>>>>>>>>> to consume any new Spark or format features.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Option 2
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Move our Spark integration into a separate project and
>>>>>>>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Pros: decouples the format version from Spark, we can
>>>>>>>>>>>>>>>>>> support as many Spark versions as needed.
>>>>>>>>>>>>>>>>>> Cons: more work initially to set everything up, more work
>>>>>>>>>>>>>>>>>> to release, will need a new release of the core format to consume any
>>>>>>>>>>>>>>>>>> changes in the Spark integration.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user but
>>>>>>>>>>>>>>>>>> my main worry is that we will have to release the format more frequently
>>>>>>>>>>>>>>>>>> (which is a good thing but requires more work and time) and the overall
>>>>>>>>>>>>>>>>>> Spark development may be slower.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Anton
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>>
>>>>>>>

-- 
Ryan Blue
Tabular

Re: [DISCUSS] Spark version support strategy

Posted by Steven Wu <st...@gmail.com>.

Wing, sorry, my earlier message probably misled you. I was speaking my
personal opinion on Flink version support.

On Tue, Sep 28, 2021 at 8:03 PM Wing Yew Poon <wy...@cloudera.com.invalid>
wrote:

> Hi OpenInx,
> I'm sorry I misunderstood the thinking of the Flink community. Thanks for
> the clarification.
> - Wing Yew
>
>
> On Tue, Sep 28, 2021 at 7:15 PM OpenInx <op...@gmail.com> wrote:
>
>> Hi Wing
>>
>> As we discussed above, we community prefer to choose option.2 or
>> option.3.  So in fact, when we planned to upgrade the flink version from
>> 1.12 to 1.13,  we are doing our best to guarantee the master iceberg repo
>> could work fine for both flink1.12 & flink1.13. More context please see
>> [1], [2], [3]
>>
>> [1] https://github.com/apache/iceberg/pull/3116
>> [2] https://github.com/apache/iceberg/issues/3183
>> [3]
>> https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E
>>
>>
>> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon <wy...@cloudera.com.invalid>
>> wrote:
>>
>>> In the last community sync, we spent a little time on this topic. For
>>> Spark support, there are currently two options under consideration:
>>>
>>> Option 2: Separate repo for the Spark support. Use branches for
>>> supporting different Spark versions. Main branch for the latest Spark
>>> version (3.2 to begin with).
>>> Tooling needs to be built for producing regular snapshots of core
>>> Iceberg in a consumable way for this repo. Unclear if commits to core
>>> Iceberg will be tested pre-commit against Spark support; my impression is
>>> that they will not be, and the Spark support build can be broken by changes
>>> to core.
>>>
>>> A variant of option 3 (which we will simply call Option 3 going
>>> forward): Single repo, separate module (subdirectory) for each Spark
>>> version to be supported. Code duplication in each Spark module (no attempt
>>> to refactor out common code). Each module built against the specific
>>> version of Spark to be supported, producing a runtime jar built against
>>> that version. CI will test all modules. Support can be provided for only
>>> building the modules a developer cares about.
>>>
>>> More input was sought and people are encouraged to voice their
>>> preference.
>>> I lean towards Option 3.
>>>
>>> - Wing Yew
>>>
>>> ps. In the sync, as Steven Wu wrote, the question was raised if the same
>>> multi-version support strategy can be adopted across engines. Based on what
>>> Steven wrote, currently the Flink developer community's bandwidth makes
>>> supporting only a single Flink version (and focusing resources on
>>> developing new features on that version) the preferred choice. If so, then
>>> no multi-version support strategy for Flink is needed at this time.
>>>
>>>
>>> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <st...@gmail.com> wrote:
>>>
>>>> During the sync meeting, people talked about if and how we can have the
>>>> same version support model across engines like Flink and Spark. I can
>>>> provide some input from the Flink side.
>>>>
>>>> Flink only supports two minor versions. E.g., right now Flink 1.13 is
>>>> the latest released version. That means only Flink 1.12 and 1.13 are
>>>> supported. Feature changes or bug fixes will only be backported to 1.12 and
>>>> 1.13, unless it is a serious bug (like security). With that context,
>>>> personally I like option 1 (with one actively supported Flink version in
>>>> master branch) for the iceberg-flink module.
>>>>
>>>> We discussed the idea of supporting multiple Flink versions via shm
>>>> layer and multiple modules. While it may be a little better to support
>>>> multiple Flink versions, I don't know if there is enough support and
>>>> resources from the community to pull it off. Also the ongoing maintenance
>>>> burden for each minor version release from Flink, which happens roughly
>>>> every 4 months.
>>>>
>>>>
>>>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary <pv...@cloudera.com.invalid>
>>>> wrote:
>>>>
>>>>> Since you mentioned Hive, I chime in with what we do there. You might
>>>>> find it useful:
>>>>> - metastore module - only small differences - DynConstructor solves
>>>>> for us
>>>>> - mr module - some bigger differences, but still manageable for Hive
>>>>> 2-3. Need some new classes, but most of the code is reused - extra module
>>>>> for Hive 3. For Hive 4 we use a different repo as we moved to the Hive
>>>>> codebase.
>>>>>
>>>>> My thoughts based on the above experience:
>>>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly
>>>>> have problems with backporting changes between repos and we are slacking
>>>>> behind which hurts both projects
>>>>> - Hive 2-3 model is working better by forcing us to keep the things in
>>>>> sync, but with serious differences in the Hive project it still doesn't
>>>>> seem like a viable option.
>>>>>
>>>>> So I think the question is: How stable is the Spark code we are
>>>>> integrating to. If I is fairly stable then we are better off with a "one
>>>>> repo multiple modules" approach and we should consider the multirepo only
>>>>> if the differences become prohibitive.
>>>>>
>>>>> Thanks, Peter
>>>>>
>>>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi,
>>>>> <ao...@apple.com.invalid> wrote:
>>>>>
>>>>>> Okay, looks like there is consensus around supporting multiple Spark
>>>>>> versions at the same time. There are folks who mentioned this on this
>>>>>> thread and there were folks who brought this up during the sync.
>>>>>>
>>>>>> Let’s think through Option 2 and 3 in more detail then.
>>>>>>
>>>>>> Option 2
>>>>>>
>>>>>> In Option 2, there will be a separate repo. I believe the master
>>>>>> branch will soon point to Spark 3.2 (the most recent supported version).
>>>>>> The main development will happen there and the artifact version will be
>>>>>> 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1
>>>>>> branches where we will cherry-pick applicable changes. Once we are ready to
>>>>>> release 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and cut 3
>>>>>> releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump the
>>>>>> version in master to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1
>>>>>> branches for cherry-picks.
>>>>>>
>>>>>> I guess we will continue to shade everything in the new repo and will
>>>>>> have to release every time the core is released. We will do a maintenance
>>>>>> release for each supported Spark version whenever we cut a new maintenance Iceberg
>>>>>> release or need to fix any bugs in the Spark integration.
>>>>>> Under this model, we will probably need nightly snapshots (or on each
>>>>>> commit) for the core format and the Spark integration will depend on
>>>>>> snapshots until we are ready to release.
>>>>>>
>>>>>> Overall, I think this option gives us very simple builds and provides
>>>>>> best separation. It will keep the main repo clean. The main downside is
>>>>>> that we will have to split a Spark feature into two PRs: one against the
>>>>>> core and one against the Spark integration. Certain changes in core can
>>>>>> also break the Spark integration too and will require adaptations.
>>>>>>
>>>>>> Ryan, I am not sure I fully understood the testing part. How will we
>>>>>> be able to test the Spark integration in the main repo if certain changes
>>>>>> in core may break the Spark integration and require changes there? Will we
>>>>>> try to prohibit such changes?
>>>>>>
>>>>>> Option 3 (modified)
>>>>>>
>>>>>> If I get correctly, the modified Option 3 sounds very close to
>>>>>> the initially suggested approach by Imran but with code duplication instead
>>>>>> of extra refactoring and introducing new common modules.
>>>>>>
>>>>>> Jack, are you suggesting we test only a single Spark version at a
>>>>>> time? Or do we expect to test all versions? Will there be any difference
>>>>>> compared to just having a module per version? I did not fully
>>>>>> understand.
>>>>>>
>>>>>> My worry with this approach is that our build will be very
>>>>>> complicated and we will still have a lot of Spark-related modules in the
>>>>>> main repo. Once people start using Flink and Hive more, will we have to do
>>>>>> the same?
>>>>>>
>>>>>> - Anton
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 16 Sep 2021, at 08:11, Ryan Blue <bl...@tabular.io> wrote:
>>>>>>
>>>>>> I'd support the option that Jack suggests if we can set a few
>>>>>> expectations for keeping it clean.
>>>>>>
>>>>>> First, I'd like to avoid refactoring code to share it across Spark
>>>>>> versions -- that introduces risk because we're relying on compiling against
>>>>>> one version and running in another and both Spark and Scala change rapidly.
>>>>>> A big benefit of options 1 and 2 is that we mostly focus on only one Spark
>>>>>> version. I think we should duplicate code rather than spend time
>>>>>> refactoring to rely on binary compatibility. I propose we start each new
>>>>>> Spark version by copying the last one and updating it. And we should build
>>>>>> just the latest supported version by default.
>>>>>>
>>>>>> The drawback to having everything in a single repo is that we
>>>>>> wouldn't be able to cherry-pick changes across Spark versions/branches, but
>>>>>> I think Jack is right that having a single build is better.
>>>>>>
>>>>>> Second, we should make CI faster by running the Spark builds in
>>>>>> parallel. It sounds like this is what would happen anyway, with a property
>>>>>> that selects the Spark version that you want to build against.
>>>>>>
>>>>>> Overall, this new suggestion sounds like a promising way forward.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <ye...@gmail.com> wrote:
>>>>>>
>>>>>>> I think in Ryan's proposal we will create a ton of modules anyway,
>>>>>>> as Wing listed we are just using git branch as an additional dimension, but
>>>>>>> my understanding is that you will still have 1 core, 1 extension, 1 runtime
>>>>>>> artifact published for each Spark version in either approach.
>>>>>>>
>>>>>>> In that case, this is just brainstorming, I wonder if we can explore
>>>>>>> a modified option 3 that flattens all the versions in each Spark branch in
>>>>>>> option 2 into master. The repository structure would look something like:
>>>>>>>
>>>>>>> iceberg/api/...
>>>>>>>             /bundled-guava/...
>>>>>>>             /core/...
>>>>>>>             ...
>>>>>>>             /spark/2.4/core/...
>>>>>>>                             /extension/...
>>>>>>>                             /runtime/...
>>>>>>>                       /3.1/core/...
>>>>>>>                             /extension/...
>>>>>>>                             /runtime/...
>>>>>>>
>>>>>>> The gradle build script in the root is configured to build against
>>>>>>> the latest version of Spark by default, unless otherwise specified by the
>>>>>>> user.
>>>>>>>
>>>>>>> Intellij can also be configured to only index files of specific
>>>>>>> versions based on the same config used in build.
>>>>>>>
>>>>>>> In this way, I imagine the CI setup to be much easier to do things
>>>>>>> like testing version compatibility for a feature or running only a
>>>>>>> specific subset of Spark version builds based on the Spark version
>>>>>>> directories touched.
>>>>>>>
>>>>>>> And the biggest benefit is that we don't have the same difficulty as
>>>>>>> option 2 of developing a feature when it's both in core and Spark.
>>>>>>>
>>>>>>> We can then develop a mechanism to vote to stop support of certain
>>>>>>> versions, and archive the corresponding directory to avoid accumulating too
>>>>>>> many versions in the long term.
>>>>>>>
>>>>>>> -Jack Ye
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Sorry, I was thinking about CI integration between Iceberg Java and
>>>>>>>> Iceberg Spark, I just didn't mention it and I see how that's a big thing to
>>>>>>>> leave out!
>>>>>>>>
>>>>>>>> I would definitely want to test the projects together. One thing we
>>>>>>>> could do is have a nightly build like Russell suggests. I'm also wondering
>>>>>>>> if we could have some tighter integration where the Iceberg Spark build can
>>>>>>>> be included in the Iceberg Java build using properties. Maybe the github
>>>>>>>> action could checkout Iceberg, then checkout the Spark integration's latest
>>>>>>>> branch, and then run the gradle build with a property that makes Spark a
>>>>>>>> subproject in the build. That way we can continue to have Spark CI run
>>>>>>>> regularly.
>>>>>>>>
>>>>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I agree that Option 2 is considerably more difficult for
>>>>>>>>> development when core API changes need to be picked up by the external
>>>>>>>>> Spark module. I also think a monthly release would probably still be
>>>>>>>>> prohibitive to actually implementing new features that appear in the API, I
>>>>>>>>> would hope we have a much faster process or maybe just have snapshot
>>>>>>>>> artifacts published nightly?
>>>>>>>>>
>>>>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <
>>>>>>>>> wypoon@cloudera.com.INVALID> wrote:
>>>>>>>>>
>>>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a
>>>>>>>>> separate repo (subproject of Iceberg). Would we have branches such as
>>>>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be
>>>>>>>>> supported in all versions or all Spark 3 versions, then we would need to
>>>>>>>>> commit the changes to all applicable branches. Basically we are trading
>>>>>>>>> more work to commit to multiple branches for simplified build and CI
>>>>>>>>> time per branch, which might be an acceptable trade-off. However, the
>>>>>>>>> biggest downside is that changes may need to be made in core Iceberg as
>>>>>>>>> well as in the engine (in this case Spark) support, and we need to wait for
>>>>>>>>> a release of core Iceberg to consume the changes in the subproject. In this
>>>>>>>>> case, maybe we should have a monthly release of core Iceberg (no matter how
>>>>>>>>> many changes go in, as long as it is non-zero) so that the subproject can
>>>>>>>>> consume changes fairly quickly?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the set
>>>>>>>>>> of potential solutions well defined.
>>>>>>>>>>
>>>>>>>>>> Looks like the next step is to decide whether we want to require
>>>>>>>>>> people to update Spark versions to pick up newer versions of Iceberg. If we
>>>>>>>>>> choose to make people upgrade, then option 1 is clearly the best choice.
>>>>>>>>>>
>>>>>>>>>> I don’t think that we should make updating Spark a requirement.
>>>>>>>>>> Many of the things that we’re working on are orthogonal to Spark versions,
>>>>>>>>>> like table maintenance actions, secondary indexes, the 1.0 API, views, ORC
>>>>>>>>>> delete files, new storage implementations, etc. Upgrading Spark is time
>>>>>>>>>> consuming and untrusted in my experience, so I think we would be setting up
>>>>>>>>>> an unnecessary trade-off between spending lots of time to upgrade Spark and
>>>>>>>>>> picking up new Iceberg features.
>>>>>>>>>>
>>>>>>>>>> Another way of thinking about this is that if we went with option
>>>>>>>>>> 1, then we could port bug fixes into 0.12.x. But there are many things that
>>>>>>>>>> wouldn’t fit this model, like adding a FileIO implementation for ADLS. So
>>>>>>>>>> some people in the community would have to maintain branches of newer
>>>>>>>>>> Iceberg versions with older versions of Spark outside of the main Iceberg
>>>>>>>>>> project — that defeats the purpose of simplifying things with option 1
>>>>>>>>>> because we would then have more people maintaining the same 0.13.x with
>>>>>>>>>> Spark 3.1 branch. (This reminds me of the Spark community, where we wanted
>>>>>>>>>> to release a 2.5 line with DSv2 backported, but the community decided not
>>>>>>>>>> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>>>>>>>>>>
>>>>>>>>>> If the community is going to do the work anyway — and I think
>>>>>>>>>> some of us would — we should make it possible to share that work. That’s
>>>>>>>>>> why I don’t think that we should go with option 1.
>>>>>>>>>>
>>>>>>>>>> If we don’t go with option 1, then the choice is how to maintain
>>>>>>>>>> multiple Spark versions. I think that the way we’re doing it right now is
>>>>>>>>>> not something we want to continue.
>>>>>>>>>>
>>>>>>>>>> Using multiple modules (option 3) is concerning to me because of
>>>>>>>>>> the changes in Spark. We currently structure the library to share as much
>>>>>>>>>> code as possible. But that means compiling against different Spark versions
>>>>>>>>>> and relying on binary compatibility and reflection in some cases. To me,
>>>>>>>>>> this seems unmaintainable in the long run because it requires refactoring
>>>>>>>>>> common classes and spending a lot of time deduplicating code. It also
>>>>>>>>>> creates a ton of modules, at least one common module, then a module per
>>>>>>>>>> version, then an extensions module per version, and finally a runtime
>>>>>>>>>> module per version. That’s 3 modules per Spark version, plus any new common
>>>>>>>>>> modules. And each module needs to be tested, which is making our CI take a
>>>>>>>>>> really long time. We also don’t support multiple Scala versions, which is
>>>>>>>>>> another gap that will require even more modules and tests.
>>>>>>>>>>
>>>>>>>>>> I like option 2 because it would allow us to compile against a
>>>>>>>>>> single version of Spark (which will be much more reliable). It would give
>>>>>>>>>> us an opportunity to support different Scala versions. It avoids the need
>>>>>>>>>> to refactor to share code and allows people to focus on a single version of
>>>>>>>>>> Spark, while also creating a way for people to maintain and update the
>>>>>>>>>> older versions with newer Iceberg releases. I don’t think that this would
>>>>>>>>>> slow down development. I think it would actually speed it up because we’d
>>>>>>>>>> be spending less time trying to make multiple versions work in the same
>>>>>>>>>> build. And anyone in favor of option 1 would basically get option 1: you
>>>>>>>>>> don’t have to care about branches for older Spark versions.
>>>>>>>>>>
>>>>>>>>>> Jack makes a good point about wanting to keep code in a single
>>>>>>>>>> repository, but I think that the need to manage more version combinations
>>>>>>>>>> overrides this concern. It’s easier to make this decision in python because
>>>>>>>>>> we’re not trying to depend on two projects that change relatively quickly.
>>>>>>>>>> We’re just trying to build a library.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <op...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for bringing this up,  Anton.
>>>>>>>>>>>
>>>>>>>>>>> Everyone has great pros/cons to support their preferences.
>>>>>>>>>>> Before giving my preference, let me raise one question:    what's the top
>>>>>>>>>>> priority thing for apache iceberg project at this point in time ?  This
>>>>>>>>>>> question will help us to answer the following question: Should we support
>>>>>>>>>>> more engine versions more robustly or be a bit more aggressive and
>>>>>>>>>>> concentrate on getting the new features that users need most in order to
>>>>>>>>>>> keep the project more competitive ?
>>>>>>>>>>>
>>>>>>>>>>> If people watch the apache iceberg project and check the issues
>>>>>>>>>>> & PR frequently,  I guess more than 90% people will answer the priority
>>>>>>>>>>> question:   There is no doubt for making the whole v2 story to be
>>>>>>>>>>> production-ready.   The current roadmap discussion also proofs the thing :
>>>>>>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>>>>>>> .
>>>>>>>>>>>
>>>>>>>>>>> In order to ensure the highest priority at this point in time, I
>>>>>>>>>>> will prefer option-1 to reduce the cost of engine maintenance, so as to
>>>>>>>>>>> free up resources to make v2 production-ready.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <
>>>>>>>>>>> sai.sai.shao@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> From Dev's point, it has less burden to always support the
>>>>>>>>>>>> latest version of Spark (for example). But from user's point,
>>>>>>>>>>>> especially for us who maintain Spark internally, it is not easy to upgrade
>>>>>>>>>>>> the Spark version for the first time (since we have many customizations
>>>>>>>>>>>> internally), and we're still promoting to upgrade to 3.1.2. If the
>>>>>>>>>>>> community ditches the support of old version of Spark3, users have to
>>>>>>>>>>>> maintain it themselves unavoidably.
>>>>>>>>>>>>
>>>>>>>>>>>> So I'm inclined to make this support in community, not by users
>>>>>>>>>>>> themselves, as for Option 2 or 3, I'm fine with either. And to relieve the
>>>>>>>>>>>> burden, we could support limited versions of Spark (for example 2 versions).
>>>>>>>>>>>>
>>>>>>>>>>>> Just my two cents.
>>>>>>>>>>>>
>>>>>>>>>>>> -Saisai
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Wing Yew,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think 2.4 is a different story, we will continue to support
>>>>>>>>>>>>> Spark 2.4, but as you can see it will continue to have very limited
>>>>>>>>>>>>> functionalities comparing to Spark 3. I believe we discussed about option 3
>>>>>>>>>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the
>>>>>>>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>>>>>>>>>>> consistent strategy around this, let's take this chance to make a good
>>>>>>>>>>>>> community guideline for all future engine versions, especially for Spark,
>>>>>>>>>>>>> Flink and Hive that are in the same repository.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can totally understand your point of view Wing, in fact,
>>>>>>>>>>>>> speaking from the perspective of AWS EMR, we have to support over 40
>>>>>>>>>>>>> versions of the software because there are people who are still using Spark
>>>>>>>>>>>>> 1.4, believe it or not. After all, keep backporting changes will become a
>>>>>>>>>>>>> liability not only on the user side, but also on the service provider side,
>>>>>>>>>>>>> so I believe it's not a bad practice to push for user upgrade, as it will
>>>>>>>>>>>>> make the life of both parties easier in the end. New feature is definitely
>>>>>>>>>>>>> one of the best incentives to promote an upgrade on user side.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think the biggest issue of option 3 is about its
>>>>>>>>>>>>> scalability, because we will have an unbounded list of packages to add and
>>>>>>>>>>>>> compile in the future, and we probably cannot drop support of that package
>>>>>>>>>>>>> once created. If we go with option 1, I think we can still publish a few
>>>>>>>>>>>>> patch versions for old Iceberg releases, and committers can control the
>>>>>>>>>>>>> amount of patch versions to guard people from abusing the power of
>>>>>>>>>>>>> patching. I see this as a consistent strategy also for Flink and Hive. With
>>>>>>>>>>>>> this strategy, we can truly have a compatibility matrix for engine versions
>>>>>>>>>>>>> against Iceberg versions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>>>>>>>>> wypoon@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I understand and sympathize with the desire to use new DSv2
>>>>>>>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest for developers,
>>>>>>>>>>>>>> but I don't think it considers the interests of users. I do not think that
>>>>>>>>>>>>>> most users will upgrade to Spark 3.2 as soon as it is released. It is a
>>>>>>>>>>>>>> "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all
>>>>>>>>>>>>>> know that it is not a minor upgrade. There are a lot of changes from 3.0 to
>>>>>>>>>>>>>> 3.1 and from 3.1 to 3.2. I think there are even a lot of users running
>>>>>>>>>>>>>> Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting
>>>>>>>>>>>>>> Spark 2.4?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have
>>>>>>>>>>>>>> spoken out in favor of Option 1 all work for the same organization, don't
>>>>>>>>>>>>>> they? And they don't have a problem with making their users, all internal,
>>>>>>>>>>>>>> simply upgrade to Spark 3.2, do they? (Or they are already running an
>>>>>>>>>>>>>> internal fork that is close to 3.2.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I work for an organization with customers running different
>>>>>>>>>>>>>> versions of Spark. It is true that we can backport new features to older
>>>>>>>>>>>>>> versions if we wanted to. I suppose the people contributing to Iceberg work
>>>>>>>>>>>>>> for some organization or other that either use Iceberg in-house, or provide
>>>>>>>>>>>>>> software (possibly in the form of a service) to customers, and either way,
>>>>>>>>>>>>>> the organizations have the ability to backport features and fixes to
>>>>>>>>>>>>>> internal versions. Are there any users out there who simply use Apache
>>>>>>>>>>>>>> Iceberg and depend on the community version?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There may be features that are broadly useful that do not
>>>>>>>>>>>>>> depend on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even
>>>>>>>>>>>>>> 2.4)?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but
>>>>>>>>>>>>>> I would consider Option 3 too. Anton, you said 5 modules are required; what
>>>>>>>>>>>>>> are the modules you're thinking of?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Wing Yew
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <
>>>>>>>>>>>>>> flyrain000@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Both 2 and 3 will slow down the development. Considering
>>>>>>>>>>>>>>> the limited resources in the open source community, the upsides of option 2
>>>>>>>>>>>>>>> and 3 are probably not worthy.
>>>>>>>>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's
>>>>>>>>>>>>>>> hard to predict anything, but even if these use cases are legit, users can
>>>>>>>>>>>>>>> still get the new feature by backporting it to an older version in case of
>>>>>>>>>>>>>>> upgrading to a newer version isn't an option.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3
>>>>>>>>>>>>>>>> version)*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The easiest option for us devs, forces the user to upgrade
>>>>>>>>>>>>>>>> to the most recent minor Spark version to consume any new
>>>>>>>>>>>>>>>> Iceberg features.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can support as many Spark versions as needed and the
>>>>>>>>>>>>>>>> codebase is still separate as we can use separate branches.
>>>>>>>>>>>>>>>> Impossible to consume any unreleased changes in core, may
>>>>>>>>>>>>>>>> slow down the development.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>>>>>>>>> Can consume unreleased changes but it will required at
>>>>>>>>>>>>>>>> least 5 modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>>>>>>>>>>> complicated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark
>>>>>>>>>>>>>>>> version (e3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>>>>>>>>> We follow Option 1 internally at the moment but I would
>>>>>>>>>>>>>>>> like to hear what other people think/need.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think we should go for option 1. I already am not a big
>>>>>>>>>>>>>>>> fan of having runtime errors for unsupported things based on versions and I
>>>>>>>>>>>>>>>> don't think minor version upgrades are a large issue for users.  I'm
>>>>>>>>>>>>>>>> especially not looking forward to supporting interfaces that only exist in
>>>>>>>>>>>>>>>> Spark 3.2 in a multiple Spark version support future.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>>>>>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this
>>>>>>>>>>>>>>>> moment.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>>>>>> major version.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to
>>>>>>>>>>>>>>>> support both Spark 3.1 and Spark 3.2 with a single module, it means we have
>>>>>>>>>>>>>>>> to compile against 3.1. The problem is that we rely on DSv2 that is being
>>>>>>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial differences. On top of
>>>>>>>>>>>>>>>> that, we have our extensions that are extremely low-level and may break not
>>>>>>>>>>>>>>>> only between minor versions but also between patch releases.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> f there are some features requiring a newer version, it
>>>>>>>>>>>>>>>> makes sense to move that newer version in master.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Internally, we don’t deliver new features to older Spark
>>>>>>>>>>>>>>>> versions as it requires a lot of effort to port things. Personally, I don’t
>>>>>>>>>>>>>>>> think it is too bad to require users to upgrade if they want new features.
>>>>>>>>>>>>>>>> At the same time, there are valid concerns with this approach too that we
>>>>>>>>>>>>>>>> mentioned during the sync. For example, certain new features would also
>>>>>>>>>>>>>>>> work fine with older Spark versions. I generally agree with that and that
>>>>>>>>>>>>>>>> not supporting recent versions is not ideal. However, I want to find a
>>>>>>>>>>>>>>>> balance between the complexity on our side and ease of use for the users.
>>>>>>>>>>>>>>>> Ideally, supporting a few recent versions would be sufficient but our Spark
>>>>>>>>>>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>>>>>> major version. This avoids the problem that some users are unwilling to
>>>>>>>>>>>>>>>> move to a newer version and keep patching old Spark version branches. If
>>>>>>>>>>>>>>>> there are some features requiring a newer version, it makes sense to move
>>>>>>>>>>>>>>>> that newer version in master.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In addition, because currently Spark is considered the most
>>>>>>>>>>>>>>>> feature-complete reference implementation compared to all other engines, I
>>>>>>>>>>>>>>>> think we should not add artificial barriers that would slow down its
>>>>>>>>>>>>>>>> development speed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is
>>>>>>>>>>>>>>>>> great to support older versions but because we compile against 3.0, we
>>>>>>>>>>>>>>>>> cannot use any Spark features that are offered in newer versions.
>>>>>>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of
>>>>>>>>>>>>>>>>> important features such dynamic filtering for v2 tables, required
>>>>>>>>>>>>>>>>> distribution and ordering for writes, etc. These features are too important
>>>>>>>>>>>>>>>>> to ignore them.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for
>>>>>>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of the 3.2 features.
>>>>>>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us internally and would
>>>>>>>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Option 1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while
>>>>>>>>>>>>>>>>> by releasing minor versions with bug fixes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no
>>>>>>>>>>>>>>>>> extra work on our side as just a single Spark version is actively
>>>>>>>>>>>>>>>>> maintained.
>>>>>>>>>>>>>>>>> Cons: some new features that we will be adding to master
>>>>>>>>>>>>>>>>> could also work with older Spark versions but all 0.12 releases will only
>>>>>>>>>>>>>>>>> contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2
>>>>>>>>>>>>>>>>> to consume any new Spark or format features.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Option 2
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Move our Spark integration into a separate project and
>>>>>>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Pros: decouples the format version from Spark, we can
>>>>>>>>>>>>>>>>> support as many Spark versions as needed.
>>>>>>>>>>>>>>>>> Cons: more work initially to set everything up, more work
>>>>>>>>>>>>>>>>> to release, will need a new release of the core format to consume any
>>>>>>>>>>>>>>>>> changes in the Spark integration.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user but my
>>>>>>>>>>>>>>>>> main worry is that we will have to release the format more frequently
>>>>>>>>>>>>>>>>> (which is a good thing but requires more work and time) and the overall
>>>>>>>>>>>>>>>>> Spark development may be slower.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Anton
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>>
>>>>>>

Re: [DISCUSS] Spark version support strategy

Posted by Wing Yew Poon <wy...@cloudera.com.INVALID>.

Hi OpenInx,
I'm sorry I misunderstood the thinking of the Flink community. Thanks for
the clarification.
- Wing Yew


On Tue, Sep 28, 2021 at 7:15 PM OpenInx <op...@gmail.com> wrote:

> Hi Wing
>
> As we discussed above, we community prefer to choose option.2 or
> option.3.  So in fact, when we planned to upgrade the flink version from
> 1.12 to 1.13,  we are doing our best to guarantee the master iceberg repo
> could work fine for both flink1.12 & flink1.13. More context please see
> [1], [2], [3]
>
> [1] https://github.com/apache/iceberg/pull/3116
> [2] https://github.com/apache/iceberg/issues/3183
> [3]
> https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E
>
>
> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon <wy...@cloudera.com.invalid>
> wrote:
>
>> In the last community sync, we spent a little time on this topic. For
>> Spark support, there are currently two options under consideration:
>>
>> Option 2: Separate repo for the Spark support. Use branches for
>> supporting different Spark versions. Main branch for the latest Spark
>> version (3.2 to begin with).
>> Tooling needs to be built for producing regular snapshots of core Iceberg
>> in a consumable way for this repo. Unclear if commits to core Iceberg will
>> be tested pre-commit against Spark support; my impression is that they will
>> not be, and the Spark support build can be broken by changes to core.
>>
>> A variant of option 3 (which we will simply call Option 3 going forward):
>> Single repo, separate module (subdirectory) for each Spark version to be
>> supported. Code duplication in each Spark module (no attempt to refactor
>> out common code). Each module built against the specific version of Spark
>> to be supported, producing a runtime jar built against that version. CI
>> will test all modules. Support can be provided for only building the
>> modules a developer cares about.
>>
>> More input was sought and people are encouraged to voice their preference.
>> I lean towards Option 3.
>>
>> - Wing Yew
>>
>> ps. In the sync, as Steven Wu wrote, the question was raised if the same
>> multi-version support strategy can be adopted across engines. Based on what
>> Steven wrote, currently the Flink developer community's bandwidth makes
>> supporting only a single Flink version (and focusing resources on
>> developing new features on that version) the preferred choice. If so, then
>> no multi-version support strategy for Flink is needed at this time.
>>
>>
>> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <st...@gmail.com> wrote:
>>
>>> During the sync meeting, people talked about if and how we can have the
>>> same version support model across engines like Flink and Spark. I can
>>> provide some input from the Flink side.
>>>
>>> Flink only supports two minor versions. E.g., right now Flink 1.13 is
>>> the latest released version. That means only Flink 1.12 and 1.13 are
>>> supported. Feature changes or bug fixes will only be backported to 1.12 and
>>> 1.13, unless it is a serious bug (like security). With that context,
>>> personally I like option 1 (with one actively supported Flink version in
>>> master branch) for the iceberg-flink module.
>>>
>>> We discussed the idea of supporting multiple Flink versions via shm
>>> layer and multiple modules. While it may be a little better to support
>>> multiple Flink versions, I don't know if there is enough support and
>>> resources from the community to pull it off. Also the ongoing maintenance
>>> burden for each minor version release from Flink, which happens roughly
>>> every 4 months.
>>>
>>>
>>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary <pv...@cloudera.com.invalid>
>>> wrote:
>>>
>>>> Since you mentioned Hive, I chime in with what we do there. You might
>>>> find it useful:
>>>> - metastore module - only small differences - DynConstructor solves for
>>>> us
>>>> - mr module - some bigger differences, but still manageable for Hive
>>>> 2-3. Need some new classes, but most of the code is reused - extra module
>>>> for Hive 3. For Hive 4 we use a different repo as we moved to the Hive
>>>> codebase.
>>>>
>>>> My thoughts based on the above experience:
>>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly
>>>> have problems with backporting changes between repos and we are slacking
>>>> behind which hurts both projects
>>>> - Hive 2-3 model is working better by forcing us to keep the things in
>>>> sync, but with serious differences in the Hive project it still doesn't
>>>> seem like a viable option.
>>>>
>>>> So I think the question is: How stable is the Spark code we are
>>>> integrating to. If I is fairly stable then we are better off with a "one
>>>> repo multiple modules" approach and we should consider the multirepo only
>>>> if the differences become prohibitive.
>>>>
>>>> Thanks, Peter
>>>>
>>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi,
>>>> <ao...@apple.com.invalid> wrote:
>>>>
>>>>> Okay, looks like there is consensus around supporting multiple Spark
>>>>> versions at the same time. There are folks who mentioned this on this
>>>>> thread and there were folks who brought this up during the sync.
>>>>>
>>>>> Let’s think through Option 2 and 3 in more detail then.
>>>>>
>>>>> Option 2
>>>>>
>>>>> In Option 2, there will be a separate repo. I believe the master
>>>>> branch will soon point to Spark 3.2 (the most recent supported version).
>>>>> The main development will happen there and the artifact version will be
>>>>> 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1
>>>>> branches where we will cherry-pick applicable changes. Once we are ready to
>>>>> release 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and cut 3
>>>>> releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump the
>>>>> version in master to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1
>>>>> branches for cherry-picks.
>>>>>
>>>>> I guess we will continue to shade everything in the new repo and will
>>>>> have to release every time the core is released. We will do a maintenance
>>>>> release for each supported Spark version whenever we cut a new maintenance Iceberg
>>>>> release or need to fix any bugs in the Spark integration.
>>>>> Under this model, we will probably need nightly snapshots (or on each
>>>>> commit) for the core format and the Spark integration will depend on
>>>>> snapshots until we are ready to release.
>>>>>
>>>>> Overall, I think this option gives us very simple builds and provides
>>>>> best separation. It will keep the main repo clean. The main downside is
>>>>> that we will have to split a Spark feature into two PRs: one against the
>>>>> core and one against the Spark integration. Certain changes in core can
>>>>> also break the Spark integration too and will require adaptations.
>>>>>
>>>>> Ryan, I am not sure I fully understood the testing part. How will we
>>>>> be able to test the Spark integration in the main repo if certain changes
>>>>> in core may break the Spark integration and require changes there? Will we
>>>>> try to prohibit such changes?
>>>>>
>>>>> Option 3 (modified)
>>>>>
>>>>> If I get correctly, the modified Option 3 sounds very close to
>>>>> the initially suggested approach by Imran but with code duplication instead
>>>>> of extra refactoring and introducing new common modules.
>>>>>
>>>>> Jack, are you suggesting we test only a single Spark version at a
>>>>> time? Or do we expect to test all versions? Will there be any difference
>>>>> compared to just having a module per version? I did not fully
>>>>> understand.
>>>>>
>>>>> My worry with this approach is that our build will be very complicated
>>>>> and we will still have a lot of Spark-related modules in the main repo.
>>>>> Once people start using Flink and Hive more, will we have to do the same?
>>>>>
>>>>> - Anton
>>>>>
>>>>>
>>>>>
>>>>> On 16 Sep 2021, at 08:11, Ryan Blue <bl...@tabular.io> wrote:
>>>>>
>>>>> I'd support the option that Jack suggests if we can set a few
>>>>> expectations for keeping it clean.
>>>>>
>>>>> First, I'd like to avoid refactoring code to share it across Spark
>>>>> versions -- that introduces risk because we're relying on compiling against
>>>>> one version and running in another and both Spark and Scala change rapidly.
>>>>> A big benefit of options 1 and 2 is that we mostly focus on only one Spark
>>>>> version. I think we should duplicate code rather than spend time
>>>>> refactoring to rely on binary compatibility. I propose we start each new
>>>>> Spark version by copying the last one and updating it. And we should build
>>>>> just the latest supported version by default.
>>>>>
>>>>> The drawback to having everything in a single repo is that we wouldn't
>>>>> be able to cherry-pick changes across Spark versions/branches, but I think
>>>>> Jack is right that having a single build is better.
>>>>>
>>>>> Second, we should make CI faster by running the Spark builds in
>>>>> parallel. It sounds like this is what would happen anyway, with a property
>>>>> that selects the Spark version that you want to build against.
>>>>>
>>>>> Overall, this new suggestion sounds like a promising way forward.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <ye...@gmail.com> wrote:
>>>>>
>>>>>> I think in Ryan's proposal we will create a ton of modules anyway, as
>>>>>> Wing listed we are just using git branch as an additional dimension, but my
>>>>>> understanding is that you will still have 1 core, 1 extension, 1 runtime
>>>>>> artifact published for each Spark version in either approach.
>>>>>>
>>>>>> In that case, this is just brainstorming, I wonder if we can explore
>>>>>> a modified option 3 that flattens all the versions in each Spark branch in
>>>>>> option 2 into master. The repository structure would look something like:
>>>>>>
>>>>>> iceberg/api/...
>>>>>>             /bundled-guava/...
>>>>>>             /core/...
>>>>>>             ...
>>>>>>             /spark/2.4/core/...
>>>>>>                             /extension/...
>>>>>>                             /runtime/...
>>>>>>                       /3.1/core/...
>>>>>>                             /extension/...
>>>>>>                             /runtime/...
>>>>>>
>>>>>> The gradle build script in the root is configured to build against
>>>>>> the latest version of Spark by default, unless otherwise specified by the
>>>>>> user.
>>>>>>
>>>>>> Intellij can also be configured to only index files of specific
>>>>>> versions based on the same config used in build.
>>>>>>
>>>>>> In this way, I imagine the CI setup to be much easier to do things
>>>>>> like testing version compatibility for a feature or running only a
>>>>>> specific subset of Spark version builds based on the Spark version
>>>>>> directories touched.
>>>>>>
>>>>>> And the biggest benefit is that we don't have the same difficulty as
>>>>>> option 2 of developing a feature when it's both in core and Spark.
>>>>>>
>>>>>> We can then develop a mechanism to vote to stop support of certain
>>>>>> versions, and archive the corresponding directory to avoid accumulating too
>>>>>> many versions in the long term.
>>>>>>
>>>>>> -Jack Ye
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>
>>>>>>> Sorry, I was thinking about CI integration between Iceberg Java and
>>>>>>> Iceberg Spark, I just didn't mention it and I see how that's a big thing to
>>>>>>> leave out!
>>>>>>>
>>>>>>> I would definitely want to test the projects together. One thing we
>>>>>>> could do is have a nightly build like Russell suggests. I'm also wondering
>>>>>>> if we could have some tighter integration where the Iceberg Spark build can
>>>>>>> be included in the Iceberg Java build using properties. Maybe the github
>>>>>>> action could checkout Iceberg, then checkout the Spark integration's latest
>>>>>>> branch, and then run the gradle build with a property that makes Spark a
>>>>>>> subproject in the build. That way we can continue to have Spark CI run
>>>>>>> regularly.
>>>>>>>
>>>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>
>>>>>>>> I agree that Option 2 is considerably more difficult for
>>>>>>>> development when core API changes need to be picked up by the external
>>>>>>>> Spark module. I also think a monthly release would probably still be
>>>>>>>> prohibitive to actually implementing new features that appear in the API, I
>>>>>>>> would hope we have a much faster process or maybe just have snapshot
>>>>>>>> artifacts published nightly?
>>>>>>>>
>>>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <
>>>>>>>> wypoon@cloudera.com.INVALID> wrote:
>>>>>>>>
>>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a
>>>>>>>> separate repo (subproject of Iceberg). Would we have branches such as
>>>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be
>>>>>>>> supported in all versions or all Spark 3 versions, then we would need to
>>>>>>>> commit the changes to all applicable branches. Basically we are trading
>>>>>>>> more work to commit to multiple branches for simplified build and CI
>>>>>>>> time per branch, which might be an acceptable trade-off. However, the
>>>>>>>> biggest downside is that changes may need to be made in core Iceberg as
>>>>>>>> well as in the engine (in this case Spark) support, and we need to wait for
>>>>>>>> a release of core Iceberg to consume the changes in the subproject. In this
>>>>>>>> case, maybe we should have a monthly release of core Iceberg (no matter how
>>>>>>>> many changes go in, as long as it is non-zero) so that the subproject can
>>>>>>>> consume changes fairly quickly?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the set
>>>>>>>>> of potential solutions well defined.
>>>>>>>>>
>>>>>>>>> Looks like the next step is to decide whether we want to require
>>>>>>>>> people to update Spark versions to pick up newer versions of Iceberg. If we
>>>>>>>>> choose to make people upgrade, then option 1 is clearly the best choice.
>>>>>>>>>
>>>>>>>>> I don’t think that we should make updating Spark a requirement.
>>>>>>>>> Many of the things that we’re working on are orthogonal to Spark versions,
>>>>>>>>> like table maintenance actions, secondary indexes, the 1.0 API, views, ORC
>>>>>>>>> delete files, new storage implementations, etc. Upgrading Spark is time
>>>>>>>>> consuming and untrusted in my experience, so I think we would be setting up
>>>>>>>>> an unnecessary trade-off between spending lots of time to upgrade Spark and
>>>>>>>>> picking up new Iceberg features.
>>>>>>>>>
>>>>>>>>> Another way of thinking about this is that if we went with option
>>>>>>>>> 1, then we could port bug fixes into 0.12.x. But there are many things that
>>>>>>>>> wouldn’t fit this model, like adding a FileIO implementation for ADLS. So
>>>>>>>>> some people in the community would have to maintain branches of newer
>>>>>>>>> Iceberg versions with older versions of Spark outside of the main Iceberg
>>>>>>>>> project — that defeats the purpose of simplifying things with option 1
>>>>>>>>> because we would then have more people maintaining the same 0.13.x with
>>>>>>>>> Spark 3.1 branch. (This reminds me of the Spark community, where we wanted
>>>>>>>>> to release a 2.5 line with DSv2 backported, but the community decided not
>>>>>>>>> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>>>>>>>>>
>>>>>>>>> If the community is going to do the work anyway — and I think some
>>>>>>>>> of us would — we should make it possible to share that work. That’s why I
>>>>>>>>> don’t think that we should go with option 1.
>>>>>>>>>
>>>>>>>>> If we don’t go with option 1, then the choice is how to maintain
>>>>>>>>> multiple Spark versions. I think that the way we’re doing it right now is
>>>>>>>>> not something we want to continue.
>>>>>>>>>
>>>>>>>>> Using multiple modules (option 3) is concerning to me because of
>>>>>>>>> the changes in Spark. We currently structure the library to share as much
>>>>>>>>> code as possible. But that means compiling against different Spark versions
>>>>>>>>> and relying on binary compatibility and reflection in some cases. To me,
>>>>>>>>> this seems unmaintainable in the long run because it requires refactoring
>>>>>>>>> common classes and spending a lot of time deduplicating code. It also
>>>>>>>>> creates a ton of modules, at least one common module, then a module per
>>>>>>>>> version, then an extensions module per version, and finally a runtime
>>>>>>>>> module per version. That’s 3 modules per Spark version, plus any new common
>>>>>>>>> modules. And each module needs to be tested, which is making our CI take a
>>>>>>>>> really long time. We also don’t support multiple Scala versions, which is
>>>>>>>>> another gap that will require even more modules and tests.
>>>>>>>>>
>>>>>>>>> I like option 2 because it would allow us to compile against a
>>>>>>>>> single version of Spark (which will be much more reliable). It would give
>>>>>>>>> us an opportunity to support different Scala versions. It avoids the need
>>>>>>>>> to refactor to share code and allows people to focus on a single version of
>>>>>>>>> Spark, while also creating a way for people to maintain and update the
>>>>>>>>> older versions with newer Iceberg releases. I don’t think that this would
>>>>>>>>> slow down development. I think it would actually speed it up because we’d
>>>>>>>>> be spending less time trying to make multiple versions work in the same
>>>>>>>>> build. And anyone in favor of option 1 would basically get option 1: you
>>>>>>>>> don’t have to care about branches for older Spark versions.
>>>>>>>>>
>>>>>>>>> Jack makes a good point about wanting to keep code in a single
>>>>>>>>> repository, but I think that the need to manage more version combinations
>>>>>>>>> overrides this concern. It’s easier to make this decision in python because
>>>>>>>>> we’re not trying to depend on two projects that change relatively quickly.
>>>>>>>>> We’re just trying to build a library.
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <op...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for bringing this up,  Anton.
>>>>>>>>>>
>>>>>>>>>> Everyone has great pros/cons to support their preferences.
>>>>>>>>>> Before giving my preference, let me raise one question:    what's the top
>>>>>>>>>> priority thing for apache iceberg project at this point in time ?  This
>>>>>>>>>> question will help us to answer the following question: Should we support
>>>>>>>>>> more engine versions more robustly or be a bit more aggressive and
>>>>>>>>>> concentrate on getting the new features that users need most in order to
>>>>>>>>>> keep the project more competitive ?
>>>>>>>>>>
>>>>>>>>>> If people watch the apache iceberg project and check the issues &
>>>>>>>>>> PR frequently,  I guess more than 90% people will answer the priority
>>>>>>>>>> question:   There is no doubt for making the whole v2 story to be
>>>>>>>>>> production-ready.   The current roadmap discussion also proofs the thing :
>>>>>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>>> In order to ensure the highest priority at this point in time, I
>>>>>>>>>> will prefer option-1 to reduce the cost of engine maintenance, so as to
>>>>>>>>>> free up resources to make v2 production-ready.
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <
>>>>>>>>>> sai.sai.shao@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> From Dev's point, it has less burden to always support the
>>>>>>>>>>> latest version of Spark (for example). But from user's point,
>>>>>>>>>>> especially for us who maintain Spark internally, it is not easy to upgrade
>>>>>>>>>>> the Spark version for the first time (since we have many customizations
>>>>>>>>>>> internally), and we're still promoting to upgrade to 3.1.2. If the
>>>>>>>>>>> community ditches the support of old version of Spark3, users have to
>>>>>>>>>>> maintain it themselves unavoidably.
>>>>>>>>>>>
>>>>>>>>>>> So I'm inclined to make this support in community, not by users
>>>>>>>>>>> themselves, as for Option 2 or 3, I'm fine with either. And to relieve the
>>>>>>>>>>> burden, we could support limited versions of Spark (for example 2 versions).
>>>>>>>>>>>
>>>>>>>>>>> Just my two cents.
>>>>>>>>>>>
>>>>>>>>>>> -Saisai
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>>>>>>>>>>>
>>>>>>>>>>>> Hi Wing Yew,
>>>>>>>>>>>>
>>>>>>>>>>>> I think 2.4 is a different story, we will continue to support
>>>>>>>>>>>> Spark 2.4, but as you can see it will continue to have very limited
>>>>>>>>>>>> functionalities comparing to Spark 3. I believe we discussed about option 3
>>>>>>>>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the
>>>>>>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>>>>>>>>>> consistent strategy around this, let's take this chance to make a good
>>>>>>>>>>>> community guideline for all future engine versions, especially for Spark,
>>>>>>>>>>>> Flink and Hive that are in the same repository.
>>>>>>>>>>>>
>>>>>>>>>>>> I can totally understand your point of view Wing, in fact,
>>>>>>>>>>>> speaking from the perspective of AWS EMR, we have to support over 40
>>>>>>>>>>>> versions of the software because there are people who are still using Spark
>>>>>>>>>>>> 1.4, believe it or not. After all, keep backporting changes will become a
>>>>>>>>>>>> liability not only on the user side, but also on the service provider side,
>>>>>>>>>>>> so I believe it's not a bad practice to push for user upgrade, as it will
>>>>>>>>>>>> make the life of both parties easier in the end. New feature is definitely
>>>>>>>>>>>> one of the best incentives to promote an upgrade on user side.
>>>>>>>>>>>>
>>>>>>>>>>>> I think the biggest issue of option 3 is about its scalability,
>>>>>>>>>>>> because we will have an unbounded list of packages to add and compile in
>>>>>>>>>>>> the future, and we probably cannot drop support of that package once
>>>>>>>>>>>> created. If we go with option 1, I think we can still publish a few patch
>>>>>>>>>>>> versions for old Iceberg releases, and committers can control the amount of
>>>>>>>>>>>> patch versions to guard people from abusing the power of patching. I see
>>>>>>>>>>>> this as a consistent strategy also for Flink and Hive. With this strategy,
>>>>>>>>>>>> we can truly have a compatibility matrix for engine versions against
>>>>>>>>>>>> Iceberg versions.
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>>>>>>>> wypoon@cloudera.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I understand and sympathize with the desire to use new DSv2
>>>>>>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest for developers,
>>>>>>>>>>>>> but I don't think it considers the interests of users. I do not think that
>>>>>>>>>>>>> most users will upgrade to Spark 3.2 as soon as it is released. It is a
>>>>>>>>>>>>> "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all
>>>>>>>>>>>>> know that it is not a minor upgrade. There are a lot of changes from 3.0 to
>>>>>>>>>>>>> 3.1 and from 3.1 to 3.2. I think there are even a lot of users running
>>>>>>>>>>>>> Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting
>>>>>>>>>>>>> Spark 2.4?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have
>>>>>>>>>>>>> spoken out in favor of Option 1 all work for the same organization, don't
>>>>>>>>>>>>> they? And they don't have a problem with making their users, all internal,
>>>>>>>>>>>>> simply upgrade to Spark 3.2, do they? (Or they are already running an
>>>>>>>>>>>>> internal fork that is close to 3.2.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I work for an organization with customers running different
>>>>>>>>>>>>> versions of Spark. It is true that we can backport new features to older
>>>>>>>>>>>>> versions if we wanted to. I suppose the people contributing to Iceberg work
>>>>>>>>>>>>> for some organization or other that either use Iceberg in-house, or provide
>>>>>>>>>>>>> software (possibly in the form of a service) to customers, and either way,
>>>>>>>>>>>>> the organizations have the ability to backport features and fixes to
>>>>>>>>>>>>> internal versions. Are there any users out there who simply use Apache
>>>>>>>>>>>>> Iceberg and depend on the community version?
>>>>>>>>>>>>>
>>>>>>>>>>>>> There may be features that are broadly useful that do not
>>>>>>>>>>>>> depend on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even
>>>>>>>>>>>>> 2.4)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I
>>>>>>>>>>>>> would consider Option 3 too. Anton, you said 5 modules are required; what
>>>>>>>>>>>>> are the modules you're thinking of?
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Wing Yew
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <fl...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Both 2 and 3 will slow down the development. Considering
>>>>>>>>>>>>>> the limited resources in the open source community, the upsides of option 2
>>>>>>>>>>>>>> and 3 are probably not worthy.
>>>>>>>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's hard
>>>>>>>>>>>>>> to predict anything, but even if these use cases are legit, users can still
>>>>>>>>>>>>>> get the new feature by backporting it to an older version in case of
>>>>>>>>>>>>>> upgrading to a newer version isn't an option.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3
>>>>>>>>>>>>>>> version)*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The easiest option for us devs, forces the user to upgrade
>>>>>>>>>>>>>>> to the most recent minor Spark version to consume any new
>>>>>>>>>>>>>>> Iceberg features.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can support as many Spark versions as needed and the
>>>>>>>>>>>>>>> codebase is still separate as we can use separate branches.
>>>>>>>>>>>>>>> Impossible to consume any unreleased changes in core, may
>>>>>>>>>>>>>>> slow down the development.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>>>>>>>> Can consume unreleased changes but it will required at least
>>>>>>>>>>>>>>> 5 modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>>>>>>>>>> complicated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark
>>>>>>>>>>>>>>> version (e3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>>>>>>>> We follow Option 1 internally at the moment but I would like
>>>>>>>>>>>>>>> to hear what other people think/need.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think we should go for option 1. I already am not a big
>>>>>>>>>>>>>>> fan of having runtime errors for unsupported things based on versions and I
>>>>>>>>>>>>>>> don't think minor version upgrades are a large issue for users.  I'm
>>>>>>>>>>>>>>> especially not looking forward to supporting interfaces that only exist in
>>>>>>>>>>>>>>> Spark 3.2 in a multiple Spark version support future.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>>>>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this
>>>>>>>>>>>>>>> moment.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>>>>> major version.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to
>>>>>>>>>>>>>>> support both Spark 3.1 and Spark 3.2 with a single module, it means we have
>>>>>>>>>>>>>>> to compile against 3.1. The problem is that we rely on DSv2 that is being
>>>>>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial differences. On top of
>>>>>>>>>>>>>>> that, we have our extensions that are extremely low-level and may break not
>>>>>>>>>>>>>>> only between minor versions but also between patch releases.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> f there are some features requiring a newer version, it
>>>>>>>>>>>>>>> makes sense to move that newer version in master.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Internally, we don’t deliver new features to older Spark
>>>>>>>>>>>>>>> versions as it requires a lot of effort to port things. Personally, I don’t
>>>>>>>>>>>>>>> think it is too bad to require users to upgrade if they want new features.
>>>>>>>>>>>>>>> At the same time, there are valid concerns with this approach too that we
>>>>>>>>>>>>>>> mentioned during the sync. For example, certain new features would also
>>>>>>>>>>>>>>> work fine with older Spark versions. I generally agree with that and that
>>>>>>>>>>>>>>> not supporting recent versions is not ideal. However, I want to find a
>>>>>>>>>>>>>>> balance between the complexity on our side and ease of use for the users.
>>>>>>>>>>>>>>> Ideally, supporting a few recent versions would be sufficient but our Spark
>>>>>>>>>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>>>>> major version. This avoids the problem that some users are unwilling to
>>>>>>>>>>>>>>> move to a newer version and keep patching old Spark version branches. If
>>>>>>>>>>>>>>> there are some features requiring a newer version, it makes sense to move
>>>>>>>>>>>>>>> that newer version in master.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In addition, because currently Spark is considered the most
>>>>>>>>>>>>>>> feature-complete reference implementation compared to all other engines, I
>>>>>>>>>>>>>>> think we should not add artificial barriers that would slow down its
>>>>>>>>>>>>>>> development speed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is
>>>>>>>>>>>>>>>> great to support older versions but because we compile against 3.0, we
>>>>>>>>>>>>>>>> cannot use any Spark features that are offered in newer versions.
>>>>>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of
>>>>>>>>>>>>>>>> important features such dynamic filtering for v2 tables, required
>>>>>>>>>>>>>>>> distribution and ordering for writes, etc. These features are too important
>>>>>>>>>>>>>>>> to ignore them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for
>>>>>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of the 3.2 features.
>>>>>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us internally and would
>>>>>>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Option 1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while
>>>>>>>>>>>>>>>> by releasing minor versions with bug fixes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no
>>>>>>>>>>>>>>>> extra work on our side as just a single Spark version is actively
>>>>>>>>>>>>>>>> maintained.
>>>>>>>>>>>>>>>> Cons: some new features that we will be adding to master
>>>>>>>>>>>>>>>> could also work with older Spark versions but all 0.12 releases will only
>>>>>>>>>>>>>>>> contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2
>>>>>>>>>>>>>>>> to consume any new Spark or format features.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Option 2
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Move our Spark integration into a separate project and
>>>>>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Pros: decouples the format version from Spark, we can
>>>>>>>>>>>>>>>> support as many Spark versions as needed.
>>>>>>>>>>>>>>>> Cons: more work initially to set everything up, more work
>>>>>>>>>>>>>>>> to release, will need a new release of the core format to consume any
>>>>>>>>>>>>>>>> changes in the Spark integration.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user but my
>>>>>>>>>>>>>>>> main worry is that we will have to release the format more frequently
>>>>>>>>>>>>>>>> (which is a good thing but requires more work and time) and the overall
>>>>>>>>>>>>>>>> Spark development may be slower.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Anton
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>>
>>>>>

Re: [DISCUSS] Spark version support strategy

Posted by OpenInx <op...@gmail.com>.

Hi Wing

As we discussed above, we community prefer to choose option.2 or option.3.
So in fact, when we planned to upgrade the flink version from 1.12 to
1.13,  we are doing our best to guarantee the master iceberg repo could
work fine for both flink1.12 & flink1.13. More context please see [1], [2],
[3]

[1] https://github.com/apache/iceberg/pull/3116
[2] https://github.com/apache/iceberg/issues/3183
[3]
https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E


On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon <wy...@cloudera.com.invalid>
wrote:

> In the last community sync, we spent a little time on this topic. For
> Spark support, there are currently two options under consideration:
>
> Option 2: Separate repo for the Spark support. Use branches for supporting
> different Spark versions. Main branch for the latest Spark version (3.2 to
> begin with).
> Tooling needs to be built for producing regular snapshots of core Iceberg
> in a consumable way for this repo. Unclear if commits to core Iceberg will
> be tested pre-commit against Spark support; my impression is that they will
> not be, and the Spark support build can be broken by changes to core.
>
> A variant of option 3 (which we will simply call Option 3 going forward):
> Single repo, separate module (subdirectory) for each Spark version to be
> supported. Code duplication in each Spark module (no attempt to refactor
> out common code). Each module built against the specific version of Spark
> to be supported, producing a runtime jar built against that version. CI
> will test all modules. Support can be provided for only building the
> modules a developer cares about.
>
> More input was sought and people are encouraged to voice their preference.
> I lean towards Option 3.
>
> - Wing Yew
>
> ps. In the sync, as Steven Wu wrote, the question was raised if the same
> multi-version support strategy can be adopted across engines. Based on what
> Steven wrote, currently the Flink developer community's bandwidth makes
> supporting only a single Flink version (and focusing resources on
> developing new features on that version) the preferred choice. If so, then
> no multi-version support strategy for Flink is needed at this time.
>
>
> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <st...@gmail.com> wrote:
>
>> During the sync meeting, people talked about if and how we can have the
>> same version support model across engines like Flink and Spark. I can
>> provide some input from the Flink side.
>>
>> Flink only supports two minor versions. E.g., right now Flink 1.13 is the
>> latest released version. That means only Flink 1.12 and 1.13 are supported.
>> Feature changes or bug fixes will only be backported to 1.12 and 1.13,
>> unless it is a serious bug (like security). With that context, personally I
>> like option 1 (with one actively supported Flink version in master branch)
>> for the iceberg-flink module.
>>
>> We discussed the idea of supporting multiple Flink versions via shm layer
>> and multiple modules. While it may be a little better to support multiple
>> Flink versions, I don't know if there is enough support and resources from
>> the community to pull it off. Also the ongoing maintenance burden for each
>> minor version release from Flink, which happens roughly every 4 months.
>>
>>
>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary <pv...@cloudera.com.invalid>
>> wrote:
>>
>>> Since you mentioned Hive, I chime in with what we do there. You might
>>> find it useful:
>>> - metastore module - only small differences - DynConstructor solves for
>>> us
>>> - mr module - some bigger differences, but still manageable for Hive
>>> 2-3. Need some new classes, but most of the code is reused - extra module
>>> for Hive 3. For Hive 4 we use a different repo as we moved to the Hive
>>> codebase.
>>>
>>> My thoughts based on the above experience:
>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly have
>>> problems with backporting changes between repos and we are slacking behind
>>> which hurts both projects
>>> - Hive 2-3 model is working better by forcing us to keep the things in
>>> sync, but with serious differences in the Hive project it still doesn't
>>> seem like a viable option.
>>>
>>> So I think the question is: How stable is the Spark code we are
>>> integrating to. If I is fairly stable then we are better off with a "one
>>> repo multiple modules" approach and we should consider the multirepo only
>>> if the differences become prohibitive.
>>>
>>> Thanks, Peter
>>>
>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi,
>>> <ao...@apple.com.invalid> wrote:
>>>
>>>> Okay, looks like there is consensus around supporting multiple Spark
>>>> versions at the same time. There are folks who mentioned this on this
>>>> thread and there were folks who brought this up during the sync.
>>>>
>>>> Let’s think through Option 2 and 3 in more detail then.
>>>>
>>>> Option 2
>>>>
>>>> In Option 2, there will be a separate repo. I believe the master branch
>>>> will soon point to Spark 3.2 (the most recent supported version). The main
>>>> development will happen there and the artifact version will be 0.1.0. I
>>>> also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1 branches where
>>>> we will cherry-pick applicable changes. Once we are ready to release 0.1.0
>>>> Spark integration, we will create 0.1.x-spark-3.2 and cut 3 releases: Spark
>>>> 2.4, Spark 3.1, Spark 3.2. After that, we will bump the version in master
>>>> to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1 branches for
>>>> cherry-picks.
>>>>
>>>> I guess we will continue to shade everything in the new repo and will
>>>> have to release every time the core is released. We will do a maintenance
>>>> release for each supported Spark version whenever we cut a new maintenance Iceberg
>>>> release or need to fix any bugs in the Spark integration.
>>>> Under this model, we will probably need nightly snapshots (or on each
>>>> commit) for the core format and the Spark integration will depend on
>>>> snapshots until we are ready to release.
>>>>
>>>> Overall, I think this option gives us very simple builds and provides
>>>> best separation. It will keep the main repo clean. The main downside is
>>>> that we will have to split a Spark feature into two PRs: one against the
>>>> core and one against the Spark integration. Certain changes in core can
>>>> also break the Spark integration too and will require adaptations.
>>>>
>>>> Ryan, I am not sure I fully understood the testing part. How will we be
>>>> able to test the Spark integration in the main repo if certain changes in
>>>> core may break the Spark integration and require changes there? Will we try
>>>> to prohibit such changes?
>>>>
>>>> Option 3 (modified)
>>>>
>>>> If I get correctly, the modified Option 3 sounds very close to
>>>> the initially suggested approach by Imran but with code duplication instead
>>>> of extra refactoring and introducing new common modules.
>>>>
>>>> Jack, are you suggesting we test only a single Spark version at a time?
>>>> Or do we expect to test all versions? Will there be any difference compared
>>>> to just having a module per version? I did not fully understand.
>>>>
>>>> My worry with this approach is that our build will be very complicated
>>>> and we will still have a lot of Spark-related modules in the main repo.
>>>> Once people start using Flink and Hive more, will we have to do the same?
>>>>
>>>> - Anton
>>>>
>>>>
>>>>
>>>> On 16 Sep 2021, at 08:11, Ryan Blue <bl...@tabular.io> wrote:
>>>>
>>>> I'd support the option that Jack suggests if we can set a few
>>>> expectations for keeping it clean.
>>>>
>>>> First, I'd like to avoid refactoring code to share it across Spark
>>>> versions -- that introduces risk because we're relying on compiling against
>>>> one version and running in another and both Spark and Scala change rapidly.
>>>> A big benefit of options 1 and 2 is that we mostly focus on only one Spark
>>>> version. I think we should duplicate code rather than spend time
>>>> refactoring to rely on binary compatibility. I propose we start each new
>>>> Spark version by copying the last one and updating it. And we should build
>>>> just the latest supported version by default.
>>>>
>>>> The drawback to having everything in a single repo is that we wouldn't
>>>> be able to cherry-pick changes across Spark versions/branches, but I think
>>>> Jack is right that having a single build is better.
>>>>
>>>> Second, we should make CI faster by running the Spark builds in
>>>> parallel. It sounds like this is what would happen anyway, with a property
>>>> that selects the Spark version that you want to build against.
>>>>
>>>> Overall, this new suggestion sounds like a promising way forward.
>>>>
>>>> Ryan
>>>>
>>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <ye...@gmail.com> wrote:
>>>>
>>>>> I think in Ryan's proposal we will create a ton of modules anyway, as
>>>>> Wing listed we are just using git branch as an additional dimension, but my
>>>>> understanding is that you will still have 1 core, 1 extension, 1 runtime
>>>>> artifact published for each Spark version in either approach.
>>>>>
>>>>> In that case, this is just brainstorming, I wonder if we can explore a
>>>>> modified option 3 that flattens all the versions in each Spark branch in
>>>>> option 2 into master. The repository structure would look something like:
>>>>>
>>>>> iceberg/api/...
>>>>>             /bundled-guava/...
>>>>>             /core/...
>>>>>             ...
>>>>>             /spark/2.4/core/...
>>>>>                             /extension/...
>>>>>                             /runtime/...
>>>>>                       /3.1/core/...
>>>>>                             /extension/...
>>>>>                             /runtime/...
>>>>>
>>>>> The gradle build script in the root is configured to build against the
>>>>> latest version of Spark by default, unless otherwise specified by the user.
>>>>>
>>>>> Intellij can also be configured to only index files of specific
>>>>> versions based on the same config used in build.
>>>>>
>>>>> In this way, I imagine the CI setup to be much easier to do things
>>>>> like testing version compatibility for a feature or running only a
>>>>> specific subset of Spark version builds based on the Spark version
>>>>> directories touched.
>>>>>
>>>>> And the biggest benefit is that we don't have the same difficulty as
>>>>> option 2 of developing a feature when it's both in core and Spark.
>>>>>
>>>>> We can then develop a mechanism to vote to stop support of certain
>>>>> versions, and archive the corresponding directory to avoid accumulating too
>>>>> many versions in the long term.
>>>>>
>>>>> -Jack Ye
>>>>>
>>>>>
>>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>
>>>>>> Sorry, I was thinking about CI integration between Iceberg Java and
>>>>>> Iceberg Spark, I just didn't mention it and I see how that's a big thing to
>>>>>> leave out!
>>>>>>
>>>>>> I would definitely want to test the projects together. One thing we
>>>>>> could do is have a nightly build like Russell suggests. I'm also wondering
>>>>>> if we could have some tighter integration where the Iceberg Spark build can
>>>>>> be included in the Iceberg Java build using properties. Maybe the github
>>>>>> action could checkout Iceberg, then checkout the Spark integration's latest
>>>>>> branch, and then run the gradle build with a property that makes Spark a
>>>>>> subproject in the build. That way we can continue to have Spark CI run
>>>>>> regularly.
>>>>>>
>>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>
>>>>>>> I agree that Option 2 is considerably more difficult for development
>>>>>>> when core API changes need to be picked up by the external Spark module. I
>>>>>>> also think a monthly release would probably still be prohibitive to
>>>>>>> actually implementing new features that appear in the API, I would hope we
>>>>>>> have a much faster process or maybe just have snapshot artifacts published
>>>>>>> nightly?
>>>>>>>
>>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <
>>>>>>> wypoon@cloudera.com.INVALID> wrote:
>>>>>>>
>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a
>>>>>>> separate repo (subproject of Iceberg). Would we have branches such as
>>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be
>>>>>>> supported in all versions or all Spark 3 versions, then we would need to
>>>>>>> commit the changes to all applicable branches. Basically we are trading
>>>>>>> more work to commit to multiple branches for simplified build and CI
>>>>>>> time per branch, which might be an acceptable trade-off. However, the
>>>>>>> biggest downside is that changes may need to be made in core Iceberg as
>>>>>>> well as in the engine (in this case Spark) support, and we need to wait for
>>>>>>> a release of core Iceberg to consume the changes in the subproject. In this
>>>>>>> case, maybe we should have a monthly release of core Iceberg (no matter how
>>>>>>> many changes go in, as long as it is non-zero) so that the subproject can
>>>>>>> consume changes fairly quickly?
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the set
>>>>>>>> of potential solutions well defined.
>>>>>>>>
>>>>>>>> Looks like the next step is to decide whether we want to require
>>>>>>>> people to update Spark versions to pick up newer versions of Iceberg. If we
>>>>>>>> choose to make people upgrade, then option 1 is clearly the best choice.
>>>>>>>>
>>>>>>>> I don’t think that we should make updating Spark a requirement.
>>>>>>>> Many of the things that we’re working on are orthogonal to Spark versions,
>>>>>>>> like table maintenance actions, secondary indexes, the 1.0 API, views, ORC
>>>>>>>> delete files, new storage implementations, etc. Upgrading Spark is time
>>>>>>>> consuming and untrusted in my experience, so I think we would be setting up
>>>>>>>> an unnecessary trade-off between spending lots of time to upgrade Spark and
>>>>>>>> picking up new Iceberg features.
>>>>>>>>
>>>>>>>> Another way of thinking about this is that if we went with option
>>>>>>>> 1, then we could port bug fixes into 0.12.x. But there are many things that
>>>>>>>> wouldn’t fit this model, like adding a FileIO implementation for ADLS. So
>>>>>>>> some people in the community would have to maintain branches of newer
>>>>>>>> Iceberg versions with older versions of Spark outside of the main Iceberg
>>>>>>>> project — that defeats the purpose of simplifying things with option 1
>>>>>>>> because we would then have more people maintaining the same 0.13.x with
>>>>>>>> Spark 3.1 branch. (This reminds me of the Spark community, where we wanted
>>>>>>>> to release a 2.5 line with DSv2 backported, but the community decided not
>>>>>>>> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>>>>>>>>
>>>>>>>> If the community is going to do the work anyway — and I think some
>>>>>>>> of us would — we should make it possible to share that work. That’s why I
>>>>>>>> don’t think that we should go with option 1.
>>>>>>>>
>>>>>>>> If we don’t go with option 1, then the choice is how to maintain
>>>>>>>> multiple Spark versions. I think that the way we’re doing it right now is
>>>>>>>> not something we want to continue.
>>>>>>>>
>>>>>>>> Using multiple modules (option 3) is concerning to me because of
>>>>>>>> the changes in Spark. We currently structure the library to share as much
>>>>>>>> code as possible. But that means compiling against different Spark versions
>>>>>>>> and relying on binary compatibility and reflection in some cases. To me,
>>>>>>>> this seems unmaintainable in the long run because it requires refactoring
>>>>>>>> common classes and spending a lot of time deduplicating code. It also
>>>>>>>> creates a ton of modules, at least one common module, then a module per
>>>>>>>> version, then an extensions module per version, and finally a runtime
>>>>>>>> module per version. That’s 3 modules per Spark version, plus any new common
>>>>>>>> modules. And each module needs to be tested, which is making our CI take a
>>>>>>>> really long time. We also don’t support multiple Scala versions, which is
>>>>>>>> another gap that will require even more modules and tests.
>>>>>>>>
>>>>>>>> I like option 2 because it would allow us to compile against a
>>>>>>>> single version of Spark (which will be much more reliable). It would give
>>>>>>>> us an opportunity to support different Scala versions. It avoids the need
>>>>>>>> to refactor to share code and allows people to focus on a single version of
>>>>>>>> Spark, while also creating a way for people to maintain and update the
>>>>>>>> older versions with newer Iceberg releases. I don’t think that this would
>>>>>>>> slow down development. I think it would actually speed it up because we’d
>>>>>>>> be spending less time trying to make multiple versions work in the same
>>>>>>>> build. And anyone in favor of option 1 would basically get option 1: you
>>>>>>>> don’t have to care about branches for older Spark versions.
>>>>>>>>
>>>>>>>> Jack makes a good point about wanting to keep code in a single
>>>>>>>> repository, but I think that the need to manage more version combinations
>>>>>>>> overrides this concern. It’s easier to make this decision in python because
>>>>>>>> we’re not trying to depend on two projects that change relatively quickly.
>>>>>>>> We’re just trying to build a library.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <op...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for bringing this up,  Anton.
>>>>>>>>>
>>>>>>>>> Everyone has great pros/cons to support their preferences.  Before
>>>>>>>>> giving my preference, let me raise one question:    what's the top priority
>>>>>>>>> thing for apache iceberg project at this point in time ?  This question
>>>>>>>>> will help us to answer the following question: Should we support more
>>>>>>>>> engine versions more robustly or be a bit more aggressive and concentrate
>>>>>>>>> on getting the new features that users need most in order to keep the
>>>>>>>>> project more competitive ?
>>>>>>>>>
>>>>>>>>> If people watch the apache iceberg project and check the issues &
>>>>>>>>> PR frequently,  I guess more than 90% people will answer the priority
>>>>>>>>> question:   There is no doubt for making the whole v2 story to be
>>>>>>>>> production-ready.   The current roadmap discussion also proofs the thing :
>>>>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> In order to ensure the highest priority at this point in time, I
>>>>>>>>> will prefer option-1 to reduce the cost of engine maintenance, so as to
>>>>>>>>> free up resources to make v2 production-ready.
>>>>>>>>>
>>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <
>>>>>>>>> sai.sai.shao@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> From Dev's point, it has less burden to always support the latest
>>>>>>>>>> version of Spark (for example). But from user's point, especially for us
>>>>>>>>>> who maintain Spark internally, it is not easy to upgrade the Spark version
>>>>>>>>>> for the first time (since we have many customizations internally), and
>>>>>>>>>> we're still promoting to upgrade to 3.1.2. If the community ditches the
>>>>>>>>>> support of old version of Spark3, users have to maintain it themselves
>>>>>>>>>> unavoidably.
>>>>>>>>>>
>>>>>>>>>> So I'm inclined to make this support in community, not by users
>>>>>>>>>> themselves, as for Option 2 or 3, I'm fine with either. And to relieve the
>>>>>>>>>> burden, we could support limited versions of Spark (for example 2 versions).
>>>>>>>>>>
>>>>>>>>>> Just my two cents.
>>>>>>>>>>
>>>>>>>>>> -Saisai
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>>>>>>>>>>
>>>>>>>>>>> Hi Wing Yew,
>>>>>>>>>>>
>>>>>>>>>>> I think 2.4 is a different story, we will continue to support
>>>>>>>>>>> Spark 2.4, but as you can see it will continue to have very limited
>>>>>>>>>>> functionalities comparing to Spark 3. I believe we discussed about option 3
>>>>>>>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the
>>>>>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>>>>>>>>> consistent strategy around this, let's take this chance to make a good
>>>>>>>>>>> community guideline for all future engine versions, especially for Spark,
>>>>>>>>>>> Flink and Hive that are in the same repository.
>>>>>>>>>>>
>>>>>>>>>>> I can totally understand your point of view Wing, in fact,
>>>>>>>>>>> speaking from the perspective of AWS EMR, we have to support over 40
>>>>>>>>>>> versions of the software because there are people who are still using Spark
>>>>>>>>>>> 1.4, believe it or not. After all, keep backporting changes will become a
>>>>>>>>>>> liability not only on the user side, but also on the service provider side,
>>>>>>>>>>> so I believe it's not a bad practice to push for user upgrade, as it will
>>>>>>>>>>> make the life of both parties easier in the end. New feature is definitely
>>>>>>>>>>> one of the best incentives to promote an upgrade on user side.
>>>>>>>>>>>
>>>>>>>>>>> I think the biggest issue of option 3 is about its scalability,
>>>>>>>>>>> because we will have an unbounded list of packages to add and compile in
>>>>>>>>>>> the future, and we probably cannot drop support of that package once
>>>>>>>>>>> created. If we go with option 1, I think we can still publish a few patch
>>>>>>>>>>> versions for old Iceberg releases, and committers can control the amount of
>>>>>>>>>>> patch versions to guard people from abusing the power of patching. I see
>>>>>>>>>>> this as a consistent strategy also for Flink and Hive. With this strategy,
>>>>>>>>>>> we can truly have a compatibility matrix for engine versions against
>>>>>>>>>>> Iceberg versions.
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>>>>>>> wypoon@cloudera.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I understand and sympathize with the desire to use new DSv2
>>>>>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest for developers,
>>>>>>>>>>>> but I don't think it considers the interests of users. I do not think that
>>>>>>>>>>>> most users will upgrade to Spark 3.2 as soon as it is released. It is a
>>>>>>>>>>>> "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all
>>>>>>>>>>>> know that it is not a minor upgrade. There are a lot of changes from 3.0 to
>>>>>>>>>>>> 3.1 and from 3.1 to 3.2. I think there are even a lot of users running
>>>>>>>>>>>> Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting
>>>>>>>>>>>> Spark 2.4?
>>>>>>>>>>>>
>>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have
>>>>>>>>>>>> spoken out in favor of Option 1 all work for the same organization, don't
>>>>>>>>>>>> they? And they don't have a problem with making their users, all internal,
>>>>>>>>>>>> simply upgrade to Spark 3.2, do they? (Or they are already running an
>>>>>>>>>>>> internal fork that is close to 3.2.)
>>>>>>>>>>>>
>>>>>>>>>>>> I work for an organization with customers running different
>>>>>>>>>>>> versions of Spark. It is true that we can backport new features to older
>>>>>>>>>>>> versions if we wanted to. I suppose the people contributing to Iceberg work
>>>>>>>>>>>> for some organization or other that either use Iceberg in-house, or provide
>>>>>>>>>>>> software (possibly in the form of a service) to customers, and either way,
>>>>>>>>>>>> the organizations have the ability to backport features and fixes to
>>>>>>>>>>>> internal versions. Are there any users out there who simply use Apache
>>>>>>>>>>>> Iceberg and depend on the community version?
>>>>>>>>>>>>
>>>>>>>>>>>> There may be features that are broadly useful that do not
>>>>>>>>>>>> depend on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even
>>>>>>>>>>>> 2.4)?
>>>>>>>>>>>>
>>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I
>>>>>>>>>>>> would consider Option 3 too. Anton, you said 5 modules are required; what
>>>>>>>>>>>> are the modules you're thinking of?
>>>>>>>>>>>>
>>>>>>>>>>>> - Wing Yew
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <fl...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Both 2 and 3 will slow down the development. Considering
>>>>>>>>>>>>> the limited resources in the open source community, the upsides of option 2
>>>>>>>>>>>>> and 3 are probably not worthy.
>>>>>>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's hard
>>>>>>>>>>>>> to predict anything, but even if these use cases are legit, users can still
>>>>>>>>>>>>> get the new feature by backporting it to an older version in case of
>>>>>>>>>>>>> upgrading to a newer version isn't an option.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yufei
>>>>>>>>>>>>>
>>>>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3
>>>>>>>>>>>>>> version)*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The easiest option for us devs, forces the user to upgrade to
>>>>>>>>>>>>>> the most recent minor Spark version to consume any new
>>>>>>>>>>>>>> Iceberg features.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can support as many Spark versions as needed and the codebase
>>>>>>>>>>>>>> is still separate as we can use separate branches.
>>>>>>>>>>>>>> Impossible to consume any unreleased changes in core, may
>>>>>>>>>>>>>> slow down the development.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>>>>>>> Can consume unreleased changes but it will required at least
>>>>>>>>>>>>>> 5 modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>>>>>>>>> complicated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark
>>>>>>>>>>>>>> version (e3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>>>>>>> We follow Option 1 internally at the moment but I would like
>>>>>>>>>>>>>> to hear what other people think/need.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think we should go for option 1. I already am not a big fan
>>>>>>>>>>>>>> of having runtime errors for unsupported things based on versions and I
>>>>>>>>>>>>>> don't think minor version upgrades are a large issue for users.  I'm
>>>>>>>>>>>>>> especially not looking forward to supporting interfaces that only exist in
>>>>>>>>>>>>>> Spark 3.2 in a multiple Spark version support future.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>>>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this
>>>>>>>>>>>>>> moment.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>>>> major version.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to support
>>>>>>>>>>>>>> both Spark 3.1 and Spark 3.2 with a single module, it means we have to
>>>>>>>>>>>>>> compile against 3.1. The problem is that we rely on DSv2 that is being
>>>>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial differences. On top of
>>>>>>>>>>>>>> that, we have our extensions that are extremely low-level and may break not
>>>>>>>>>>>>>> only between minor versions but also between patch releases.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> f there are some features requiring a newer version, it makes
>>>>>>>>>>>>>> sense to move that newer version in master.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Internally, we don’t deliver new features to older Spark
>>>>>>>>>>>>>> versions as it requires a lot of effort to port things. Personally, I don’t
>>>>>>>>>>>>>> think it is too bad to require users to upgrade if they want new features.
>>>>>>>>>>>>>> At the same time, there are valid concerns with this approach too that we
>>>>>>>>>>>>>> mentioned during the sync. For example, certain new features would also
>>>>>>>>>>>>>> work fine with older Spark versions. I generally agree with that and that
>>>>>>>>>>>>>> not supporting recent versions is not ideal. However, I want to find a
>>>>>>>>>>>>>> balance between the complexity on our side and ease of use for the users.
>>>>>>>>>>>>>> Ideally, supporting a few recent versions would be sufficient but our Spark
>>>>>>>>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>>>> major version. This avoids the problem that some users are unwilling to
>>>>>>>>>>>>>> move to a newer version and keep patching old Spark version branches. If
>>>>>>>>>>>>>> there are some features requiring a newer version, it makes sense to move
>>>>>>>>>>>>>> that newer version in master.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In addition, because currently Spark is considered the most
>>>>>>>>>>>>>> feature-complete reference implementation compared to all other engines, I
>>>>>>>>>>>>>> think we should not add artificial barriers that would slow down its
>>>>>>>>>>>>>> development speed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is
>>>>>>>>>>>>>>> great to support older versions but because we compile against 3.0, we
>>>>>>>>>>>>>>> cannot use any Spark features that are offered in newer versions.
>>>>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of
>>>>>>>>>>>>>>> important features such dynamic filtering for v2 tables, required
>>>>>>>>>>>>>>> distribution and ordering for writes, etc. These features are too important
>>>>>>>>>>>>>>> to ignore them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for
>>>>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of the 3.2 features.
>>>>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us internally and would
>>>>>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Option 1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by
>>>>>>>>>>>>>>> releasing minor versions with bug fixes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no extra
>>>>>>>>>>>>>>> work on our side as just a single Spark version is actively maintained.
>>>>>>>>>>>>>>> Cons: some new features that we will be adding to master
>>>>>>>>>>>>>>> could also work with older Spark versions but all 0.12 releases will only
>>>>>>>>>>>>>>> contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2
>>>>>>>>>>>>>>> to consume any new Spark or format features.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Option 2
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Move our Spark integration into a separate project and
>>>>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Pros: decouples the format version from Spark, we can
>>>>>>>>>>>>>>> support as many Spark versions as needed.
>>>>>>>>>>>>>>> Cons: more work initially to set everything up, more work to
>>>>>>>>>>>>>>> release, will need a new release of the core format to consume any changes
>>>>>>>>>>>>>>> in the Spark integration.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user but my
>>>>>>>>>>>>>>> main worry is that we will have to release the format more frequently
>>>>>>>>>>>>>>> (which is a good thing but requires more work and time) and the overall
>>>>>>>>>>>>>>> Spark development may be slower.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Anton
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>>
>>>>

Re: [DISCUSS] Spark version support strategy

Posted by Wing Yew Poon <wy...@cloudera.com.INVALID>.

In the last community sync, we spent a little time on this topic. For Spark
support, there are currently two options under consideration:

Option 2: Separate repo for the Spark support. Use branches for supporting
different Spark versions. Main branch for the latest Spark version (3.2 to
begin with).
Tooling needs to be built for producing regular snapshots of core Iceberg
in a consumable way for this repo. Unclear if commits to core Iceberg will
be tested pre-commit against Spark support; my impression is that they will
not be, and the Spark support build can be broken by changes to core.

A variant of option 3 (which we will simply call Option 3 going forward):
Single repo, separate module (subdirectory) for each Spark version to be
supported. Code duplication in each Spark module (no attempt to refactor
out common code). Each module built against the specific version of Spark
to be supported, producing a runtime jar built against that version. CI
will test all modules. Support can be provided for only building the
modules a developer cares about.

More input was sought and people are encouraged to voice their preference.
I lean towards Option 3.

- Wing Yew

ps. In the sync, as Steven Wu wrote, the question was raised if the same
multi-version support strategy can be adopted across engines. Based on what
Steven wrote, currently the Flink developer community's bandwidth makes
supporting only a single Flink version (and focusing resources on
developing new features on that version) the preferred choice. If so, then
no multi-version support strategy for Flink is needed at this time.


On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <st...@gmail.com> wrote:

> During the sync meeting, people talked about if and how we can have the
> same version support model across engines like Flink and Spark. I can
> provide some input from the Flink side.
>
> Flink only supports two minor versions. E.g., right now Flink 1.13 is the
> latest released version. That means only Flink 1.12 and 1.13 are supported.
> Feature changes or bug fixes will only be backported to 1.12 and 1.13,
> unless it is a serious bug (like security). With that context, personally I
> like option 1 (with one actively supported Flink version in master branch)
> for the iceberg-flink module.
>
> We discussed the idea of supporting multiple Flink versions via shm layer
> and multiple modules. While it may be a little better to support multiple
> Flink versions, I don't know if there is enough support and resources from
> the community to pull it off. Also the ongoing maintenance burden for each
> minor version release from Flink, which happens roughly every 4 months.
>
>
> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> Since you mentioned Hive, I chime in with what we do there. You might
>> find it useful:
>> - metastore module - only small differences - DynConstructor solves for us
>> - mr module - some bigger differences, but still manageable for Hive 2-3.
>> Need some new classes, but most of the code is reused - extra module for
>> Hive 3. For Hive 4 we use a different repo as we moved to the Hive
>> codebase.
>>
>> My thoughts based on the above experience:
>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly have
>> problems with backporting changes between repos and we are slacking behind
>> which hurts both projects
>> - Hive 2-3 model is working better by forcing us to keep the things in
>> sync, but with serious differences in the Hive project it still doesn't
>> seem like a viable option.
>>
>> So I think the question is: How stable is the Spark code we are
>> integrating to. If I is fairly stable then we are better off with a "one
>> repo multiple modules" approach and we should consider the multirepo only
>> if the differences become prohibitive.
>>
>> Thanks, Peter
>>
>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi,
>> <ao...@apple.com.invalid> wrote:
>>
>>> Okay, looks like there is consensus around supporting multiple Spark
>>> versions at the same time. There are folks who mentioned this on this
>>> thread and there were folks who brought this up during the sync.
>>>
>>> Let’s think through Option 2 and 3 in more detail then.
>>>
>>> Option 2
>>>
>>> In Option 2, there will be a separate repo. I believe the master branch
>>> will soon point to Spark 3.2 (the most recent supported version). The main
>>> development will happen there and the artifact version will be 0.1.0. I
>>> also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1 branches where
>>> we will cherry-pick applicable changes. Once we are ready to release 0.1.0
>>> Spark integration, we will create 0.1.x-spark-3.2 and cut 3 releases: Spark
>>> 2.4, Spark 3.1, Spark 3.2. After that, we will bump the version in master
>>> to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1 branches for
>>> cherry-picks.
>>>
>>> I guess we will continue to shade everything in the new repo and will
>>> have to release every time the core is released. We will do a maintenance
>>> release for each supported Spark version whenever we cut a new maintenance Iceberg
>>> release or need to fix any bugs in the Spark integration.
>>> Under this model, we will probably need nightly snapshots (or on each
>>> commit) for the core format and the Spark integration will depend on
>>> snapshots until we are ready to release.
>>>
>>> Overall, I think this option gives us very simple builds and provides
>>> best separation. It will keep the main repo clean. The main downside is
>>> that we will have to split a Spark feature into two PRs: one against the
>>> core and one against the Spark integration. Certain changes in core can
>>> also break the Spark integration too and will require adaptations.
>>>
>>> Ryan, I am not sure I fully understood the testing part. How will we be
>>> able to test the Spark integration in the main repo if certain changes in
>>> core may break the Spark integration and require changes there? Will we try
>>> to prohibit such changes?
>>>
>>> Option 3 (modified)
>>>
>>> If I get correctly, the modified Option 3 sounds very close to
>>> the initially suggested approach by Imran but with code duplication instead
>>> of extra refactoring and introducing new common modules.
>>>
>>> Jack, are you suggesting we test only a single Spark version at a time?
>>> Or do we expect to test all versions? Will there be any difference compared
>>> to just having a module per version? I did not fully understand.
>>>
>>> My worry with this approach is that our build will be very complicated
>>> and we will still have a lot of Spark-related modules in the main repo.
>>> Once people start using Flink and Hive more, will we have to do the same?
>>>
>>> - Anton
>>>
>>>
>>>
>>> On 16 Sep 2021, at 08:11, Ryan Blue <bl...@tabular.io> wrote:
>>>
>>> I'd support the option that Jack suggests if we can set a few
>>> expectations for keeping it clean.
>>>
>>> First, I'd like to avoid refactoring code to share it across Spark
>>> versions -- that introduces risk because we're relying on compiling against
>>> one version and running in another and both Spark and Scala change rapidly.
>>> A big benefit of options 1 and 2 is that we mostly focus on only one Spark
>>> version. I think we should duplicate code rather than spend time
>>> refactoring to rely on binary compatibility. I propose we start each new
>>> Spark version by copying the last one and updating it. And we should build
>>> just the latest supported version by default.
>>>
>>> The drawback to having everything in a single repo is that we wouldn't
>>> be able to cherry-pick changes across Spark versions/branches, but I think
>>> Jack is right that having a single build is better.
>>>
>>> Second, we should make CI faster by running the Spark builds in
>>> parallel. It sounds like this is what would happen anyway, with a property
>>> that selects the Spark version that you want to build against.
>>>
>>> Overall, this new suggestion sounds like a promising way forward.
>>>
>>> Ryan
>>>
>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <ye...@gmail.com> wrote:
>>>
>>>> I think in Ryan's proposal we will create a ton of modules anyway, as
>>>> Wing listed we are just using git branch as an additional dimension, but my
>>>> understanding is that you will still have 1 core, 1 extension, 1 runtime
>>>> artifact published for each Spark version in either approach.
>>>>
>>>> In that case, this is just brainstorming, I wonder if we can explore a
>>>> modified option 3 that flattens all the versions in each Spark branch in
>>>> option 2 into master. The repository structure would look something like:
>>>>
>>>> iceberg/api/...
>>>>             /bundled-guava/...
>>>>             /core/...
>>>>             ...
>>>>             /spark/2.4/core/...
>>>>                             /extension/...
>>>>                             /runtime/...
>>>>                       /3.1/core/...
>>>>                             /extension/...
>>>>                             /runtime/...
>>>>
>>>> The gradle build script in the root is configured to build against the
>>>> latest version of Spark by default, unless otherwise specified by the user.
>>>>
>>>> Intellij can also be configured to only index files of specific
>>>> versions based on the same config used in build.
>>>>
>>>> In this way, I imagine the CI setup to be much easier to do things like
>>>> testing version compatibility for a feature or running only a
>>>> specific subset of Spark version builds based on the Spark version
>>>> directories touched.
>>>>
>>>> And the biggest benefit is that we don't have the same difficulty as
>>>> option 2 of developing a feature when it's both in core and Spark.
>>>>
>>>> We can then develop a mechanism to vote to stop support of certain
>>>> versions, and archive the corresponding directory to avoid accumulating too
>>>> many versions in the long term.
>>>>
>>>> -Jack Ye
>>>>
>>>>
>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>
>>>>> Sorry, I was thinking about CI integration between Iceberg Java and
>>>>> Iceberg Spark, I just didn't mention it and I see how that's a big thing to
>>>>> leave out!
>>>>>
>>>>> I would definitely want to test the projects together. One thing we
>>>>> could do is have a nightly build like Russell suggests. I'm also wondering
>>>>> if we could have some tighter integration where the Iceberg Spark build can
>>>>> be included in the Iceberg Java build using properties. Maybe the github
>>>>> action could checkout Iceberg, then checkout the Spark integration's latest
>>>>> branch, and then run the gradle build with a property that makes Spark a
>>>>> subproject in the build. That way we can continue to have Spark CI run
>>>>> regularly.
>>>>>
>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>>>>> russell.spitzer@gmail.com> wrote:
>>>>>
>>>>>> I agree that Option 2 is considerably more difficult for development
>>>>>> when core API changes need to be picked up by the external Spark module. I
>>>>>> also think a monthly release would probably still be prohibitive to
>>>>>> actually implementing new features that appear in the API, I would hope we
>>>>>> have a much faster process or maybe just have snapshot artifacts published
>>>>>> nightly?
>>>>>>
>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <
>>>>>> wypoon@cloudera.com.INVALID> wrote:
>>>>>>
>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a
>>>>>> separate repo (subproject of Iceberg). Would we have branches such as
>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be
>>>>>> supported in all versions or all Spark 3 versions, then we would need to
>>>>>> commit the changes to all applicable branches. Basically we are trading
>>>>>> more work to commit to multiple branches for simplified build and CI
>>>>>> time per branch, which might be an acceptable trade-off. However, the
>>>>>> biggest downside is that changes may need to be made in core Iceberg as
>>>>>> well as in the engine (in this case Spark) support, and we need to wait for
>>>>>> a release of core Iceberg to consume the changes in the subproject. In this
>>>>>> case, maybe we should have a monthly release of core Iceberg (no matter how
>>>>>> many changes go in, as long as it is non-zero) so that the subproject can
>>>>>> consume changes fairly quickly?
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>
>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the set of
>>>>>>> potential solutions well defined.
>>>>>>>
>>>>>>> Looks like the next step is to decide whether we want to require
>>>>>>> people to update Spark versions to pick up newer versions of Iceberg. If we
>>>>>>> choose to make people upgrade, then option 1 is clearly the best choice.
>>>>>>>
>>>>>>> I don’t think that we should make updating Spark a requirement. Many
>>>>>>> of the things that we’re working on are orthogonal to Spark versions, like
>>>>>>> table maintenance actions, secondary indexes, the 1.0 API, views, ORC
>>>>>>> delete files, new storage implementations, etc. Upgrading Spark is time
>>>>>>> consuming and untrusted in my experience, so I think we would be setting up
>>>>>>> an unnecessary trade-off between spending lots of time to upgrade Spark and
>>>>>>> picking up new Iceberg features.
>>>>>>>
>>>>>>> Another way of thinking about this is that if we went with option 1,
>>>>>>> then we could port bug fixes into 0.12.x. But there are many things that
>>>>>>> wouldn’t fit this model, like adding a FileIO implementation for ADLS. So
>>>>>>> some people in the community would have to maintain branches of newer
>>>>>>> Iceberg versions with older versions of Spark outside of the main Iceberg
>>>>>>> project — that defeats the purpose of simplifying things with option 1
>>>>>>> because we would then have more people maintaining the same 0.13.x with
>>>>>>> Spark 3.1 branch. (This reminds me of the Spark community, where we wanted
>>>>>>> to release a 2.5 line with DSv2 backported, but the community decided not
>>>>>>> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>>>>>>>
>>>>>>> If the community is going to do the work anyway — and I think some
>>>>>>> of us would — we should make it possible to share that work. That’s why I
>>>>>>> don’t think that we should go with option 1.
>>>>>>>
>>>>>>> If we don’t go with option 1, then the choice is how to maintain
>>>>>>> multiple Spark versions. I think that the way we’re doing it right now is
>>>>>>> not something we want to continue.
>>>>>>>
>>>>>>> Using multiple modules (option 3) is concerning to me because of the
>>>>>>> changes in Spark. We currently structure the library to share as much code
>>>>>>> as possible. But that means compiling against different Spark versions and
>>>>>>> relying on binary compatibility and reflection in some cases. To me, this
>>>>>>> seems unmaintainable in the long run because it requires refactoring common
>>>>>>> classes and spending a lot of time deduplicating code. It also creates a
>>>>>>> ton of modules, at least one common module, then a module per version, then
>>>>>>> an extensions module per version, and finally a runtime module per version.
>>>>>>> That’s 3 modules per Spark version, plus any new common modules. And each
>>>>>>> module needs to be tested, which is making our CI take a really long time.
>>>>>>> We also don’t support multiple Scala versions, which is another gap that
>>>>>>> will require even more modules and tests.
>>>>>>>
>>>>>>> I like option 2 because it would allow us to compile against a
>>>>>>> single version of Spark (which will be much more reliable). It would give
>>>>>>> us an opportunity to support different Scala versions. It avoids the need
>>>>>>> to refactor to share code and allows people to focus on a single version of
>>>>>>> Spark, while also creating a way for people to maintain and update the
>>>>>>> older versions with newer Iceberg releases. I don’t think that this would
>>>>>>> slow down development. I think it would actually speed it up because we’d
>>>>>>> be spending less time trying to make multiple versions work in the same
>>>>>>> build. And anyone in favor of option 1 would basically get option 1: you
>>>>>>> don’t have to care about branches for older Spark versions.
>>>>>>>
>>>>>>> Jack makes a good point about wanting to keep code in a single
>>>>>>> repository, but I think that the need to manage more version combinations
>>>>>>> overrides this concern. It’s easier to make this decision in python because
>>>>>>> we’re not trying to depend on two projects that change relatively quickly.
>>>>>>> We’re just trying to build a library.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <op...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks for bringing this up,  Anton.
>>>>>>>>
>>>>>>>> Everyone has great pros/cons to support their preferences.  Before
>>>>>>>> giving my preference, let me raise one question:    what's the top priority
>>>>>>>> thing for apache iceberg project at this point in time ?  This question
>>>>>>>> will help us to answer the following question: Should we support more
>>>>>>>> engine versions more robustly or be a bit more aggressive and concentrate
>>>>>>>> on getting the new features that users need most in order to keep the
>>>>>>>> project more competitive ?
>>>>>>>>
>>>>>>>> If people watch the apache iceberg project and check the issues &
>>>>>>>> PR frequently,  I guess more than 90% people will answer the priority
>>>>>>>> question:   There is no doubt for making the whole v2 story to be
>>>>>>>> production-ready.   The current roadmap discussion also proofs the thing :
>>>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>>>> .
>>>>>>>>
>>>>>>>> In order to ensure the highest priority at this point in time, I
>>>>>>>> will prefer option-1 to reduce the cost of engine maintenance, so as to
>>>>>>>> free up resources to make v2 production-ready.
>>>>>>>>
>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> From Dev's point, it has less burden to always support the latest
>>>>>>>>> version of Spark (for example). But from user's point, especially for us
>>>>>>>>> who maintain Spark internally, it is not easy to upgrade the Spark version
>>>>>>>>> for the first time (since we have many customizations internally), and
>>>>>>>>> we're still promoting to upgrade to 3.1.2. If the community ditches the
>>>>>>>>> support of old version of Spark3, users have to maintain it themselves
>>>>>>>>> unavoidably.
>>>>>>>>>
>>>>>>>>> So I'm inclined to make this support in community, not by users
>>>>>>>>> themselves, as for Option 2 or 3, I'm fine with either. And to relieve the
>>>>>>>>> burden, we could support limited versions of Spark (for example 2 versions).
>>>>>>>>>
>>>>>>>>> Just my two cents.
>>>>>>>>>
>>>>>>>>> -Saisai
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>>>>>>>>>
>>>>>>>>>> Hi Wing Yew,
>>>>>>>>>>
>>>>>>>>>> I think 2.4 is a different story, we will continue to support
>>>>>>>>>> Spark 2.4, but as you can see it will continue to have very limited
>>>>>>>>>> functionalities comparing to Spark 3. I believe we discussed about option 3
>>>>>>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the
>>>>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>>>>>>>> consistent strategy around this, let's take this chance to make a good
>>>>>>>>>> community guideline for all future engine versions, especially for Spark,
>>>>>>>>>> Flink and Hive that are in the same repository.
>>>>>>>>>>
>>>>>>>>>> I can totally understand your point of view Wing, in fact,
>>>>>>>>>> speaking from the perspective of AWS EMR, we have to support over 40
>>>>>>>>>> versions of the software because there are people who are still using Spark
>>>>>>>>>> 1.4, believe it or not. After all, keep backporting changes will become a
>>>>>>>>>> liability not only on the user side, but also on the service provider side,
>>>>>>>>>> so I believe it's not a bad practice to push for user upgrade, as it will
>>>>>>>>>> make the life of both parties easier in the end. New feature is definitely
>>>>>>>>>> one of the best incentives to promote an upgrade on user side.
>>>>>>>>>>
>>>>>>>>>> I think the biggest issue of option 3 is about its scalability,
>>>>>>>>>> because we will have an unbounded list of packages to add and compile in
>>>>>>>>>> the future, and we probably cannot drop support of that package once
>>>>>>>>>> created. If we go with option 1, I think we can still publish a few patch
>>>>>>>>>> versions for old Iceberg releases, and committers can control the amount of
>>>>>>>>>> patch versions to guard people from abusing the power of patching. I see
>>>>>>>>>> this as a consistent strategy also for Flink and Hive. With this strategy,
>>>>>>>>>> we can truly have a compatibility matrix for engine versions against
>>>>>>>>>> Iceberg versions.
>>>>>>>>>>
>>>>>>>>>> -Jack
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>>>>>> wypoon@cloudera.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> I understand and sympathize with the desire to use new DSv2
>>>>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest for developers,
>>>>>>>>>>> but I don't think it considers the interests of users. I do not think that
>>>>>>>>>>> most users will upgrade to Spark 3.2 as soon as it is released. It is a
>>>>>>>>>>> "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all
>>>>>>>>>>> know that it is not a minor upgrade. There are a lot of changes from 3.0 to
>>>>>>>>>>> 3.1 and from 3.1 to 3.2. I think there are even a lot of users running
>>>>>>>>>>> Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting
>>>>>>>>>>> Spark 2.4?
>>>>>>>>>>>
>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have spoken
>>>>>>>>>>> out in favor of Option 1 all work for the same organization, don't they?
>>>>>>>>>>> And they don't have a problem with making their users, all internal, simply
>>>>>>>>>>> upgrade to Spark 3.2, do they? (Or they are already running an internal
>>>>>>>>>>> fork that is close to 3.2.)
>>>>>>>>>>>
>>>>>>>>>>> I work for an organization with customers running different
>>>>>>>>>>> versions of Spark. It is true that we can backport new features to older
>>>>>>>>>>> versions if we wanted to. I suppose the people contributing to Iceberg work
>>>>>>>>>>> for some organization or other that either use Iceberg in-house, or provide
>>>>>>>>>>> software (possibly in the form of a service) to customers, and either way,
>>>>>>>>>>> the organizations have the ability to backport features and fixes to
>>>>>>>>>>> internal versions. Are there any users out there who simply use Apache
>>>>>>>>>>> Iceberg and depend on the community version?
>>>>>>>>>>>
>>>>>>>>>>> There may be features that are broadly useful that do not depend
>>>>>>>>>>> on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>>>>>>>>>>
>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I
>>>>>>>>>>> would consider Option 3 too. Anton, you said 5 modules are required; what
>>>>>>>>>>> are the modules you're thinking of?
>>>>>>>>>>>
>>>>>>>>>>> - Wing Yew
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <fl...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Both 2 and 3 will slow down the development. Considering the
>>>>>>>>>>>> limited resources in the open source community, the upsides of option 2 and
>>>>>>>>>>>> 3 are probably not worthy.
>>>>>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's hard
>>>>>>>>>>>> to predict anything, but even if these use cases are legit, users can still
>>>>>>>>>>>> get the new feature by backporting it to an older version in case of
>>>>>>>>>>>> upgrading to a newer version isn't an option.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> Yufei
>>>>>>>>>>>>
>>>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>>>>>>>>>>
>>>>>>>>>>>>> The easiest option for us devs, forces the user to upgrade to
>>>>>>>>>>>>> the most recent minor Spark version to consume any new
>>>>>>>>>>>>> Iceberg features.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can support as many Spark versions as needed and the codebase
>>>>>>>>>>>>> is still separate as we can use separate branches.
>>>>>>>>>>>>> Impossible to consume any unreleased changes in core, may slow
>>>>>>>>>>>>> down the development.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>>>
>>>>>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>>>>>> Can consume unreleased changes but it will required at least 5
>>>>>>>>>>>>> modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>>>>>>>> complicated.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark version
>>>>>>>>>>>>> (e3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>>>>>> We follow Option 1 internally at the moment but I would like
>>>>>>>>>>>>> to hear what other people think/need.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think we should go for option 1. I already am not a big fan
>>>>>>>>>>>>> of having runtime errors for unsupported things based on versions and I
>>>>>>>>>>>>> don't think minor version upgrades are a large issue for users.  I'm
>>>>>>>>>>>>> especially not looking forward to supporting interfaces that only exist in
>>>>>>>>>>>>> Spark 3.2 in a multiple Spark version support future.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this
>>>>>>>>>>>>> moment.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>>> major version.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to support
>>>>>>>>>>>>> both Spark 3.1 and Spark 3.2 with a single module, it means we have to
>>>>>>>>>>>>> compile against 3.1. The problem is that we rely on DSv2 that is being
>>>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial differences. On top of
>>>>>>>>>>>>> that, we have our extensions that are extremely low-level and may break not
>>>>>>>>>>>>> only between minor versions but also between patch releases.
>>>>>>>>>>>>>
>>>>>>>>>>>>> f there are some features requiring a newer version, it makes
>>>>>>>>>>>>> sense to move that newer version in master.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Internally, we don’t deliver new features to older Spark
>>>>>>>>>>>>> versions as it requires a lot of effort to port things. Personally, I don’t
>>>>>>>>>>>>> think it is too bad to require users to upgrade if they want new features.
>>>>>>>>>>>>> At the same time, there are valid concerns with this approach too that we
>>>>>>>>>>>>> mentioned during the sync. For example, certain new features would also
>>>>>>>>>>>>> work fine with older Spark versions. I generally agree with that and that
>>>>>>>>>>>>> not supporting recent versions is not ideal. However, I want to find a
>>>>>>>>>>>>> balance between the complexity on our side and ease of use for the users.
>>>>>>>>>>>>> Ideally, supporting a few recent versions would be sufficient but our Spark
>>>>>>>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>>> major version. This avoids the problem that some users are unwilling to
>>>>>>>>>>>>> move to a newer version and keep patching old Spark version branches. If
>>>>>>>>>>>>> there are some features requiring a newer version, it makes sense to move
>>>>>>>>>>>>> that newer version in master.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In addition, because currently Spark is considered the most
>>>>>>>>>>>>> feature-complete reference implementation compared to all other engines, I
>>>>>>>>>>>>> think we should not add artificial barriers that would slow down its
>>>>>>>>>>>>> development speed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is
>>>>>>>>>>>>>> great to support older versions but because we compile against 3.0, we
>>>>>>>>>>>>>> cannot use any Spark features that are offered in newer versions.
>>>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of
>>>>>>>>>>>>>> important features such dynamic filtering for v2 tables, required
>>>>>>>>>>>>>> distribution and ordering for writes, etc. These features are too important
>>>>>>>>>>>>>> to ignore them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for
>>>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of the 3.2 features.
>>>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us internally and would
>>>>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Option 1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by
>>>>>>>>>>>>>> releasing minor versions with bug fixes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no extra
>>>>>>>>>>>>>> work on our side as just a single Spark version is actively maintained.
>>>>>>>>>>>>>> Cons: some new features that we will be adding to master
>>>>>>>>>>>>>> could also work with older Spark versions but all 0.12 releases will only
>>>>>>>>>>>>>> contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2
>>>>>>>>>>>>>> to consume any new Spark or format features.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Option 2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Move our Spark integration into a separate project and
>>>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Pros: decouples the format version from Spark, we can support
>>>>>>>>>>>>>> as many Spark versions as needed.
>>>>>>>>>>>>>> Cons: more work initially to set everything up, more work to
>>>>>>>>>>>>>> release, will need a new release of the core format to consume any changes
>>>>>>>>>>>>>> in the Spark integration.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user but my
>>>>>>>>>>>>>> main worry is that we will have to release the format more frequently
>>>>>>>>>>>>>> (which is a good thing but requires more work and time) and the overall
>>>>>>>>>>>>>> Spark development may be slower.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Anton
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>>
>>>

Re: [DISCUSS] Spark version support strategy

Posted by Steven Wu <st...@gmail.com>.

During the sync meeting, people talked about if and how we can have the
same version support model across engines like Flink and Spark. I can
provide some input from the Flink side.

Flink only supports two minor versions. E.g., right now Flink 1.13 is the
latest released version. That means only Flink 1.12 and 1.13 are supported.
Feature changes or bug fixes will only be backported to 1.12 and 1.13,
unless it is a serious bug (like security). With that context, personally I
like option 1 (with one actively supported Flink version in master branch)
for the iceberg-flink module.

We discussed the idea of supporting multiple Flink versions via shm layer
and multiple modules. While it may be a little better to support multiple
Flink versions, I don't know if there is enough support and resources from
the community to pull it off. Also the ongoing maintenance burden for each
minor version release from Flink, which happens roughly every 4 months.


On Thu, Sep 16, 2021 at 10:25 PM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> Since you mentioned Hive, I chime in with what we do there. You might find
> it useful:
> - metastore module - only small differences - DynConstructor solves for us
> - mr module - some bigger differences, but still manageable for Hive 2-3.
> Need some new classes, but most of the code is reused - extra module for
> Hive 3. For Hive 4 we use a different repo as we moved to the Hive
> codebase.
>
> My thoughts based on the above experience:
> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly have
> problems with backporting changes between repos and we are slacking behind
> which hurts both projects
> - Hive 2-3 model is working better by forcing us to keep the things in
> sync, but with serious differences in the Hive project it still doesn't
> seem like a viable option.
>
> So I think the question is: How stable is the Spark code we are
> integrating to. If I is fairly stable then we are better off with a "one
> repo multiple modules" approach and we should consider the multirepo only
> if the differences become prohibitive.
>
> Thanks, Peter
>
> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi, <ao...@apple.com.invalid>
> wrote:
>
>> Okay, looks like there is consensus around supporting multiple Spark
>> versions at the same time. There are folks who mentioned this on this
>> thread and there were folks who brought this up during the sync.
>>
>> Let’s think through Option 2 and 3 in more detail then.
>>
>> Option 2
>>
>> In Option 2, there will be a separate repo. I believe the master branch
>> will soon point to Spark 3.2 (the most recent supported version). The main
>> development will happen there and the artifact version will be 0.1.0. I
>> also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1 branches where
>> we will cherry-pick applicable changes. Once we are ready to release 0.1.0
>> Spark integration, we will create 0.1.x-spark-3.2 and cut 3 releases: Spark
>> 2.4, Spark 3.1, Spark 3.2. After that, we will bump the version in master
>> to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1 branches for
>> cherry-picks.
>>
>> I guess we will continue to shade everything in the new repo and will
>> have to release every time the core is released. We will do a maintenance
>> release for each supported Spark version whenever we cut a new maintenance Iceberg
>> release or need to fix any bugs in the Spark integration.
>> Under this model, we will probably need nightly snapshots (or on each
>> commit) for the core format and the Spark integration will depend on
>> snapshots until we are ready to release.
>>
>> Overall, I think this option gives us very simple builds and provides
>> best separation. It will keep the main repo clean. The main downside is
>> that we will have to split a Spark feature into two PRs: one against the
>> core and one against the Spark integration. Certain changes in core can
>> also break the Spark integration too and will require adaptations.
>>
>> Ryan, I am not sure I fully understood the testing part. How will we be
>> able to test the Spark integration in the main repo if certain changes in
>> core may break the Spark integration and require changes there? Will we try
>> to prohibit such changes?
>>
>> Option 3 (modified)
>>
>> If I get correctly, the modified Option 3 sounds very close to
>> the initially suggested approach by Imran but with code duplication instead
>> of extra refactoring and introducing new common modules.
>>
>> Jack, are you suggesting we test only a single Spark version at a time?
>> Or do we expect to test all versions? Will there be any difference compared
>> to just having a module per version? I did not fully understand.
>>
>> My worry with this approach is that our build will be very complicated
>> and we will still have a lot of Spark-related modules in the main repo.
>> Once people start using Flink and Hive more, will we have to do the same?
>>
>> - Anton
>>
>>
>>
>> On 16 Sep 2021, at 08:11, Ryan Blue <bl...@tabular.io> wrote:
>>
>> I'd support the option that Jack suggests if we can set a few
>> expectations for keeping it clean.
>>
>> First, I'd like to avoid refactoring code to share it across Spark
>> versions -- that introduces risk because we're relying on compiling against
>> one version and running in another and both Spark and Scala change rapidly.
>> A big benefit of options 1 and 2 is that we mostly focus on only one Spark
>> version. I think we should duplicate code rather than spend time
>> refactoring to rely on binary compatibility. I propose we start each new
>> Spark version by copying the last one and updating it. And we should build
>> just the latest supported version by default.
>>
>> The drawback to having everything in a single repo is that we wouldn't be
>> able to cherry-pick changes across Spark versions/branches, but I think
>> Jack is right that having a single build is better.
>>
>> Second, we should make CI faster by running the Spark builds in parallel.
>> It sounds like this is what would happen anyway, with a property that
>> selects the Spark version that you want to build against.
>>
>> Overall, this new suggestion sounds like a promising way forward.
>>
>> Ryan
>>
>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <ye...@gmail.com> wrote:
>>
>>> I think in Ryan's proposal we will create a ton of modules anyway, as
>>> Wing listed we are just using git branch as an additional dimension, but my
>>> understanding is that you will still have 1 core, 1 extension, 1 runtime
>>> artifact published for each Spark version in either approach.
>>>
>>> In that case, this is just brainstorming, I wonder if we can explore a
>>> modified option 3 that flattens all the versions in each Spark branch in
>>> option 2 into master. The repository structure would look something like:
>>>
>>> iceberg/api/...
>>>             /bundled-guava/...
>>>             /core/...
>>>             ...
>>>             /spark/2.4/core/...
>>>                             /extension/...
>>>                             /runtime/...
>>>                       /3.1/core/...
>>>                             /extension/...
>>>                             /runtime/...
>>>
>>> The gradle build script in the root is configured to build against the
>>> latest version of Spark by default, unless otherwise specified by the user.
>>>
>>> Intellij can also be configured to only index files of specific versions
>>> based on the same config used in build.
>>>
>>> In this way, I imagine the CI setup to be much easier to do things like
>>> testing version compatibility for a feature or running only a
>>> specific subset of Spark version builds based on the Spark version
>>> directories touched.
>>>
>>> And the biggest benefit is that we don't have the same difficulty as
>>> option 2 of developing a feature when it's both in core and Spark.
>>>
>>> We can then develop a mechanism to vote to stop support of certain
>>> versions, and archive the corresponding directory to avoid accumulating too
>>> many versions in the long term.
>>>
>>> -Jack Ye
>>>
>>>
>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <bl...@tabular.io> wrote:
>>>
>>>> Sorry, I was thinking about CI integration between Iceberg Java and
>>>> Iceberg Spark, I just didn't mention it and I see how that's a big thing to
>>>> leave out!
>>>>
>>>> I would definitely want to test the projects together. One thing we
>>>> could do is have a nightly build like Russell suggests. I'm also wondering
>>>> if we could have some tighter integration where the Iceberg Spark build can
>>>> be included in the Iceberg Java build using properties. Maybe the github
>>>> action could checkout Iceberg, then checkout the Spark integration's latest
>>>> branch, and then run the gradle build with a property that makes Spark a
>>>> subproject in the build. That way we can continue to have Spark CI run
>>>> regularly.
>>>>
>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>>>> russell.spitzer@gmail.com> wrote:
>>>>
>>>>> I agree that Option 2 is considerably more difficult for development
>>>>> when core API changes need to be picked up by the external Spark module. I
>>>>> also think a monthly release would probably still be prohibitive to
>>>>> actually implementing new features that appear in the API, I would hope we
>>>>> have a much faster process or maybe just have snapshot artifacts published
>>>>> nightly?
>>>>>
>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <
>>>>> wypoon@cloudera.com.INVALID> wrote:
>>>>>
>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a
>>>>> separate repo (subproject of Iceberg). Would we have branches such as
>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be
>>>>> supported in all versions or all Spark 3 versions, then we would need to
>>>>> commit the changes to all applicable branches. Basically we are trading
>>>>> more work to commit to multiple branches for simplified build and CI
>>>>> time per branch, which might be an acceptable trade-off. However, the
>>>>> biggest downside is that changes may need to be made in core Iceberg as
>>>>> well as in the engine (in this case Spark) support, and we need to wait for
>>>>> a release of core Iceberg to consume the changes in the subproject. In this
>>>>> case, maybe we should have a monthly release of core Iceberg (no matter how
>>>>> many changes go in, as long as it is non-zero) so that the subproject can
>>>>> consume changes fairly quickly?
>>>>>
>>>>>
>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>
>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the set of
>>>>>> potential solutions well defined.
>>>>>>
>>>>>> Looks like the next step is to decide whether we want to require
>>>>>> people to update Spark versions to pick up newer versions of Iceberg. If we
>>>>>> choose to make people upgrade, then option 1 is clearly the best choice.
>>>>>>
>>>>>> I don’t think that we should make updating Spark a requirement. Many
>>>>>> of the things that we’re working on are orthogonal to Spark versions, like
>>>>>> table maintenance actions, secondary indexes, the 1.0 API, views, ORC
>>>>>> delete files, new storage implementations, etc. Upgrading Spark is time
>>>>>> consuming and untrusted in my experience, so I think we would be setting up
>>>>>> an unnecessary trade-off between spending lots of time to upgrade Spark and
>>>>>> picking up new Iceberg features.
>>>>>>
>>>>>> Another way of thinking about this is that if we went with option 1,
>>>>>> then we could port bug fixes into 0.12.x. But there are many things that
>>>>>> wouldn’t fit this model, like adding a FileIO implementation for ADLS. So
>>>>>> some people in the community would have to maintain branches of newer
>>>>>> Iceberg versions with older versions of Spark outside of the main Iceberg
>>>>>> project — that defeats the purpose of simplifying things with option 1
>>>>>> because we would then have more people maintaining the same 0.13.x with
>>>>>> Spark 3.1 branch. (This reminds me of the Spark community, where we wanted
>>>>>> to release a 2.5 line with DSv2 backported, but the community decided not
>>>>>> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>>>>>>
>>>>>> If the community is going to do the work anyway — and I think some of
>>>>>> us would — we should make it possible to share that work. That’s why I
>>>>>> don’t think that we should go with option 1.
>>>>>>
>>>>>> If we don’t go with option 1, then the choice is how to maintain
>>>>>> multiple Spark versions. I think that the way we’re doing it right now is
>>>>>> not something we want to continue.
>>>>>>
>>>>>> Using multiple modules (option 3) is concerning to me because of the
>>>>>> changes in Spark. We currently structure the library to share as much code
>>>>>> as possible. But that means compiling against different Spark versions and
>>>>>> relying on binary compatibility and reflection in some cases. To me, this
>>>>>> seems unmaintainable in the long run because it requires refactoring common
>>>>>> classes and spending a lot of time deduplicating code. It also creates a
>>>>>> ton of modules, at least one common module, then a module per version, then
>>>>>> an extensions module per version, and finally a runtime module per version.
>>>>>> That’s 3 modules per Spark version, plus any new common modules. And each
>>>>>> module needs to be tested, which is making our CI take a really long time.
>>>>>> We also don’t support multiple Scala versions, which is another gap that
>>>>>> will require even more modules and tests.
>>>>>>
>>>>>> I like option 2 because it would allow us to compile against a single
>>>>>> version of Spark (which will be much more reliable). It would give us an
>>>>>> opportunity to support different Scala versions. It avoids the need to
>>>>>> refactor to share code and allows people to focus on a single version of
>>>>>> Spark, while also creating a way for people to maintain and update the
>>>>>> older versions with newer Iceberg releases. I don’t think that this would
>>>>>> slow down development. I think it would actually speed it up because we’d
>>>>>> be spending less time trying to make multiple versions work in the same
>>>>>> build. And anyone in favor of option 1 would basically get option 1: you
>>>>>> don’t have to care about branches for older Spark versions.
>>>>>>
>>>>>> Jack makes a good point about wanting to keep code in a single
>>>>>> repository, but I think that the need to manage more version combinations
>>>>>> overrides this concern. It’s easier to make this decision in python because
>>>>>> we’re not trying to depend on two projects that change relatively quickly.
>>>>>> We’re just trying to build a library.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <op...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for bringing this up,  Anton.
>>>>>>>
>>>>>>> Everyone has great pros/cons to support their preferences.  Before
>>>>>>> giving my preference, let me raise one question:    what's the top priority
>>>>>>> thing for apache iceberg project at this point in time ?  This question
>>>>>>> will help us to answer the following question: Should we support more
>>>>>>> engine versions more robustly or be a bit more aggressive and concentrate
>>>>>>> on getting the new features that users need most in order to keep the
>>>>>>> project more competitive ?
>>>>>>>
>>>>>>> If people watch the apache iceberg project and check the issues &
>>>>>>> PR frequently,  I guess more than 90% people will answer the priority
>>>>>>> question:   There is no doubt for making the whole v2 story to be
>>>>>>> production-ready.   The current roadmap discussion also proofs the thing :
>>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>>> .
>>>>>>>
>>>>>>> In order to ensure the highest priority at this point in time, I
>>>>>>> will prefer option-1 to reduce the cost of engine maintenance, so as to
>>>>>>> free up resources to make v2 production-ready.
>>>>>>>
>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> From Dev's point, it has less burden to always support the latest
>>>>>>>> version of Spark (for example). But from user's point, especially for us
>>>>>>>> who maintain Spark internally, it is not easy to upgrade the Spark version
>>>>>>>> for the first time (since we have many customizations internally), and
>>>>>>>> we're still promoting to upgrade to 3.1.2. If the community ditches the
>>>>>>>> support of old version of Spark3, users have to maintain it themselves
>>>>>>>> unavoidably.
>>>>>>>>
>>>>>>>> So I'm inclined to make this support in community, not by users
>>>>>>>> themselves, as for Option 2 or 3, I'm fine with either. And to relieve the
>>>>>>>> burden, we could support limited versions of Spark (for example 2 versions).
>>>>>>>>
>>>>>>>> Just my two cents.
>>>>>>>>
>>>>>>>> -Saisai
>>>>>>>>
>>>>>>>>
>>>>>>>> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>>>>>>>>
>>>>>>>>> Hi Wing Yew,
>>>>>>>>>
>>>>>>>>> I think 2.4 is a different story, we will continue to support
>>>>>>>>> Spark 2.4, but as you can see it will continue to have very limited
>>>>>>>>> functionalities comparing to Spark 3. I believe we discussed about option 3
>>>>>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the
>>>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>>>>>>> consistent strategy around this, let's take this chance to make a good
>>>>>>>>> community guideline for all future engine versions, especially for Spark,
>>>>>>>>> Flink and Hive that are in the same repository.
>>>>>>>>>
>>>>>>>>> I can totally understand your point of view Wing, in fact,
>>>>>>>>> speaking from the perspective of AWS EMR, we have to support over 40
>>>>>>>>> versions of the software because there are people who are still using Spark
>>>>>>>>> 1.4, believe it or not. After all, keep backporting changes will become a
>>>>>>>>> liability not only on the user side, but also on the service provider side,
>>>>>>>>> so I believe it's not a bad practice to push for user upgrade, as it will
>>>>>>>>> make the life of both parties easier in the end. New feature is definitely
>>>>>>>>> one of the best incentives to promote an upgrade on user side.
>>>>>>>>>
>>>>>>>>> I think the biggest issue of option 3 is about its scalability,
>>>>>>>>> because we will have an unbounded list of packages to add and compile in
>>>>>>>>> the future, and we probably cannot drop support of that package once
>>>>>>>>> created. If we go with option 1, I think we can still publish a few patch
>>>>>>>>> versions for old Iceberg releases, and committers can control the amount of
>>>>>>>>> patch versions to guard people from abusing the power of patching. I see
>>>>>>>>> this as a consistent strategy also for Flink and Hive. With this strategy,
>>>>>>>>> we can truly have a compatibility matrix for engine versions against
>>>>>>>>> Iceberg versions.
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>>>>> wypoon@cloudera.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> I understand and sympathize with the desire to use new DSv2
>>>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest for developers,
>>>>>>>>>> but I don't think it considers the interests of users. I do not think that
>>>>>>>>>> most users will upgrade to Spark 3.2 as soon as it is released. It is a
>>>>>>>>>> "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all
>>>>>>>>>> know that it is not a minor upgrade. There are a lot of changes from 3.0 to
>>>>>>>>>> 3.1 and from 3.1 to 3.2. I think there are even a lot of users running
>>>>>>>>>> Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting
>>>>>>>>>> Spark 2.4?
>>>>>>>>>>
>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have spoken
>>>>>>>>>> out in favor of Option 1 all work for the same organization, don't they?
>>>>>>>>>> And they don't have a problem with making their users, all internal, simply
>>>>>>>>>> upgrade to Spark 3.2, do they? (Or they are already running an internal
>>>>>>>>>> fork that is close to 3.2.)
>>>>>>>>>>
>>>>>>>>>> I work for an organization with customers running different
>>>>>>>>>> versions of Spark. It is true that we can backport new features to older
>>>>>>>>>> versions if we wanted to. I suppose the people contributing to Iceberg work
>>>>>>>>>> for some organization or other that either use Iceberg in-house, or provide
>>>>>>>>>> software (possibly in the form of a service) to customers, and either way,
>>>>>>>>>> the organizations have the ability to backport features and fixes to
>>>>>>>>>> internal versions. Are there any users out there who simply use Apache
>>>>>>>>>> Iceberg and depend on the community version?
>>>>>>>>>>
>>>>>>>>>> There may be features that are broadly useful that do not depend
>>>>>>>>>> on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>>>>>>>>>
>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I
>>>>>>>>>> would consider Option 3 too. Anton, you said 5 modules are required; what
>>>>>>>>>> are the modules you're thinking of?
>>>>>>>>>>
>>>>>>>>>> - Wing Yew
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <fl...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>>
>>>>>>>>>>> 1. Both 2 and 3 will slow down the development. Considering the
>>>>>>>>>>> limited resources in the open source community, the upsides of option 2 and
>>>>>>>>>>> 3 are probably not worthy.
>>>>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's hard to
>>>>>>>>>>> predict anything, but even if these use cases are legit, users can still
>>>>>>>>>>> get the new feature by backporting it to an older version in case of
>>>>>>>>>>> upgrading to a newer version isn't an option.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Yufei
>>>>>>>>>>>
>>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>>>>>>>>>
>>>>>>>>>>>> The easiest option for us devs, forces the user to upgrade to
>>>>>>>>>>>> the most recent minor Spark version to consume any new Iceberg
>>>>>>>>>>>> features.
>>>>>>>>>>>>
>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>>
>>>>>>>>>>>> Can support as many Spark versions as needed and the codebase
>>>>>>>>>>>> is still separate as we can use separate branches.
>>>>>>>>>>>> Impossible to consume any unreleased changes in core, may slow
>>>>>>>>>>>> down the development.
>>>>>>>>>>>>
>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>>
>>>>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>>>>> Can consume unreleased changes but it will required at least 5
>>>>>>>>>>>> modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>>>>>>> complicated.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark version
>>>>>>>>>>>> (e3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>>>>> We follow Option 1 internally at the moment but I would like to
>>>>>>>>>>>> hear what other people think/need.
>>>>>>>>>>>>
>>>>>>>>>>>> - Anton
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I think we should go for option 1. I already am not a big fan
>>>>>>>>>>>> of having runtime errors for unsupported things based on versions and I
>>>>>>>>>>>> don't think minor version upgrades are a large issue for users.  I'm
>>>>>>>>>>>> especially not looking forward to supporting interfaces that only exist in
>>>>>>>>>>>> Spark 3.2 in a multiple Spark version support future.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this moment.
>>>>>>>>>>>>
>>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>> major version.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> This is when it gets a bit complicated. If we want to support
>>>>>>>>>>>> both Spark 3.1 and Spark 3.2 with a single module, it means we have to
>>>>>>>>>>>> compile against 3.1. The problem is that we rely on DSv2 that is being
>>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial differences. On top of
>>>>>>>>>>>> that, we have our extensions that are extremely low-level and may break not
>>>>>>>>>>>> only between minor versions but also between patch releases.
>>>>>>>>>>>>
>>>>>>>>>>>> f there are some features requiring a newer version, it makes
>>>>>>>>>>>> sense to move that newer version in master.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Internally, we don’t deliver new features to older Spark
>>>>>>>>>>>> versions as it requires a lot of effort to port things. Personally, I don’t
>>>>>>>>>>>> think it is too bad to require users to upgrade if they want new features.
>>>>>>>>>>>> At the same time, there are valid concerns with this approach too that we
>>>>>>>>>>>> mentioned during the sync. For example, certain new features would also
>>>>>>>>>>>> work fine with older Spark versions. I generally agree with that and that
>>>>>>>>>>>> not supporting recent versions is not ideal. However, I want to find a
>>>>>>>>>>>> balance between the complexity on our side and ease of use for the users.
>>>>>>>>>>>> Ideally, supporting a few recent versions would be sufficient but our Spark
>>>>>>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>>
>>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>>> major version. This avoids the problem that some users are unwilling to
>>>>>>>>>>>> move to a newer version and keep patching old Spark version branches. If
>>>>>>>>>>>> there are some features requiring a newer version, it makes sense to move
>>>>>>>>>>>> that newer version in master.
>>>>>>>>>>>>
>>>>>>>>>>>> In addition, because currently Spark is considered the most
>>>>>>>>>>>> feature-complete reference implementation compared to all other engines, I
>>>>>>>>>>>> think we should not add artificial barriers that would slow down its
>>>>>>>>>>>> development speed.
>>>>>>>>>>>>
>>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is great
>>>>>>>>>>>>> to support older versions but because we compile against 3.0, we cannot use
>>>>>>>>>>>>> any Spark features that are offered in newer versions.
>>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of
>>>>>>>>>>>>> important features such dynamic filtering for v2 tables, required
>>>>>>>>>>>>> distribution and ordering for writes, etc. These features are too important
>>>>>>>>>>>>> to ignore them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for
>>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of the 3.2 features.
>>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us internally and would
>>>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Option 1
>>>>>>>>>>>>>
>>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by
>>>>>>>>>>>>> releasing minor versions with bug fixes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no extra
>>>>>>>>>>>>> work on our side as just a single Spark version is actively maintained.
>>>>>>>>>>>>> Cons: some new features that we will be adding to master could
>>>>>>>>>>>>> also work with older Spark versions but all 0.12 releases will only contain
>>>>>>>>>>>>> bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to
>>>>>>>>>>>>> consume any new Spark or format features.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Option 2
>>>>>>>>>>>>>
>>>>>>>>>>>>> Move our Spark integration into a separate project and
>>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Pros: decouples the format version from Spark, we can support
>>>>>>>>>>>>> as many Spark versions as needed.
>>>>>>>>>>>>> Cons: more work initially to set everything up, more work to
>>>>>>>>>>>>> release, will need a new release of the core format to consume any changes
>>>>>>>>>>>>> in the Spark integration.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Overall, I think option 2 seems better for the user but my
>>>>>>>>>>>>> main worry is that we will have to release the format more frequently
>>>>>>>>>>>>> (which is a good thing but requires more work and time) and the overall
>>>>>>>>>>>>> Spark development may be slower.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Anton
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>>
>>

Re: [DISCUSS] Spark version support strategy

Posted by Peter Vary <pv...@cloudera.com.INVALID>.

Since you mentioned Hive, I chime in with what we do there. You might find
it useful:
- metastore module - only small differences - DynConstructor solves for us
- mr module - some bigger differences, but still manageable for Hive 2-3.
Need some new classes, but most of the code is reused - extra module for
Hive 3. For Hive 4 we use a different repo as we moved to the Hive
codebase.

My thoughts based on the above experience:
- Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly have
problems with backporting changes between repos and we are slacking behind
which hurts both projects
- Hive 2-3 model is working better by forcing us to keep the things in
sync, but with serious differences in the Hive project it still doesn't
seem like a viable option.

So I think the question is: How stable is the Spark code we are integrating
to. If I is fairly stable then we are better off with a "one repo multiple
modules" approach and we should consider the multirepo only if the
differences become prohibitive.

Thanks, Peter

On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi, <ao...@apple.com.invalid>
wrote:

> Okay, looks like there is consensus around supporting multiple Spark
> versions at the same time. There are folks who mentioned this on this
> thread and there were folks who brought this up during the sync.
>
> Let’s think through Option 2 and 3 in more detail then.
>
> Option 2
>
> In Option 2, there will be a separate repo. I believe the master branch
> will soon point to Spark 3.2 (the most recent supported version). The main
> development will happen there and the artifact version will be 0.1.0. I
> also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1 branches where
> we will cherry-pick applicable changes. Once we are ready to release 0.1.0
> Spark integration, we will create 0.1.x-spark-3.2 and cut 3 releases: Spark
> 2.4, Spark 3.1, Spark 3.2. After that, we will bump the version in master
> to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1 branches for
> cherry-picks.
>
> I guess we will continue to shade everything in the new repo and will have
> to release every time the core is released. We will do a maintenance
> release for each supported Spark version whenever we cut a new maintenance Iceberg
> release or need to fix any bugs in the Spark integration.
> Under this model, we will probably need nightly snapshots (or on each
> commit) for the core format and the Spark integration will depend on
> snapshots until we are ready to release.
>
> Overall, I think this option gives us very simple builds and provides best
> separation. It will keep the main repo clean. The main downside is that we
> will have to split a Spark feature into two PRs: one against the core and
> one against the Spark integration. Certain changes in core can also break
> the Spark integration too and will require adaptations.
>
> Ryan, I am not sure I fully understood the testing part. How will we be
> able to test the Spark integration in the main repo if certain changes in
> core may break the Spark integration and require changes there? Will we try
> to prohibit such changes?
>
> Option 3 (modified)
>
> If I get correctly, the modified Option 3 sounds very close to
> the initially suggested approach by Imran but with code duplication instead
> of extra refactoring and introducing new common modules.
>
> Jack, are you suggesting we test only a single Spark version at a time? Or
> do we expect to test all versions? Will there be any difference compared to
> just having a module per version? I did not fully understand.
>
> My worry with this approach is that our build will be very complicated and
> we will still have a lot of Spark-related modules in the main repo. Once
> people start using Flink and Hive more, will we have to do the same?
>
> - Anton
>
>
>
> On 16 Sep 2021, at 08:11, Ryan Blue <bl...@tabular.io> wrote:
>
> I'd support the option that Jack suggests if we can set a few expectations
> for keeping it clean.
>
> First, I'd like to avoid refactoring code to share it across Spark
> versions -- that introduces risk because we're relying on compiling against
> one version and running in another and both Spark and Scala change rapidly.
> A big benefit of options 1 and 2 is that we mostly focus on only one Spark
> version. I think we should duplicate code rather than spend time
> refactoring to rely on binary compatibility. I propose we start each new
> Spark version by copying the last one and updating it. And we should build
> just the latest supported version by default.
>
> The drawback to having everything in a single repo is that we wouldn't be
> able to cherry-pick changes across Spark versions/branches, but I think
> Jack is right that having a single build is better.
>
> Second, we should make CI faster by running the Spark builds in parallel.
> It sounds like this is what would happen anyway, with a property that
> selects the Spark version that you want to build against.
>
> Overall, this new suggestion sounds like a promising way forward.
>
> Ryan
>
> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <ye...@gmail.com> wrote:
>
>> I think in Ryan's proposal we will create a ton of modules anyway, as
>> Wing listed we are just using git branch as an additional dimension, but my
>> understanding is that you will still have 1 core, 1 extension, 1 runtime
>> artifact published for each Spark version in either approach.
>>
>> In that case, this is just brainstorming, I wonder if we can explore a
>> modified option 3 that flattens all the versions in each Spark branch in
>> option 2 into master. The repository structure would look something like:
>>
>> iceberg/api/...
>>             /bundled-guava/...
>>             /core/...
>>             ...
>>             /spark/2.4/core/...
>>                             /extension/...
>>                             /runtime/...
>>                       /3.1/core/...
>>                             /extension/...
>>                             /runtime/...
>>
>> The gradle build script in the root is configured to build against the
>> latest version of Spark by default, unless otherwise specified by the user.
>>
>> Intellij can also be configured to only index files of specific versions
>> based on the same config used in build.
>>
>> In this way, I imagine the CI setup to be much easier to do things like
>> testing version compatibility for a feature or running only a
>> specific subset of Spark version builds based on the Spark version
>> directories touched.
>>
>> And the biggest benefit is that we don't have the same difficulty as
>> option 2 of developing a feature when it's both in core and Spark.
>>
>> We can then develop a mechanism to vote to stop support of certain
>> versions, and archive the corresponding directory to avoid accumulating too
>> many versions in the long term.
>>
>> -Jack Ye
>>
>>
>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> Sorry, I was thinking about CI integration between Iceberg Java and
>>> Iceberg Spark, I just didn't mention it and I see how that's a big thing to
>>> leave out!
>>>
>>> I would definitely want to test the projects together. One thing we
>>> could do is have a nightly build like Russell suggests. I'm also wondering
>>> if we could have some tighter integration where the Iceberg Spark build can
>>> be included in the Iceberg Java build using properties. Maybe the github
>>> action could checkout Iceberg, then checkout the Spark integration's latest
>>> branch, and then run the gradle build with a property that makes Spark a
>>> subproject in the build. That way we can continue to have Spark CI run
>>> regularly.
>>>
>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>>> russell.spitzer@gmail.com> wrote:
>>>
>>>> I agree that Option 2 is considerably more difficult for development
>>>> when core API changes need to be picked up by the external Spark module. I
>>>> also think a monthly release would probably still be prohibitive to
>>>> actually implementing new features that appear in the API, I would hope we
>>>> have a much faster process or maybe just have snapshot artifacts published
>>>> nightly?
>>>>
>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <wy...@cloudera.com.INVALID>
>>>> wrote:
>>>>
>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a separate
>>>> repo (subproject of Iceberg). Would we have branches such as 0.13-2.4,
>>>> 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be supported in all
>>>> versions or all Spark 3 versions, then we would need to commit the changes
>>>> to all applicable branches. Basically we are trading more work to commit to
>>>> multiple branches for simplified build and CI time per branch, which might
>>>> be an acceptable trade-off. However, the biggest downside is that changes
>>>> may need to be made in core Iceberg as well as in the engine (in this case
>>>> Spark) support, and we need to wait for a release of core Iceberg to
>>>> consume the changes in the subproject. In this case, maybe we should have a
>>>> monthly release of core Iceberg (no matter how many changes go in, as long
>>>> as it is non-zero) so that the subproject can consume changes fairly
>>>> quickly?
>>>>
>>>>
>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>
>>>>> Thanks for bringing this up, Anton. I’m glad that we have the set of
>>>>> potential solutions well defined.
>>>>>
>>>>> Looks like the next step is to decide whether we want to require
>>>>> people to update Spark versions to pick up newer versions of Iceberg. If we
>>>>> choose to make people upgrade, then option 1 is clearly the best choice.
>>>>>
>>>>> I don’t think that we should make updating Spark a requirement. Many
>>>>> of the things that we’re working on are orthogonal to Spark versions, like
>>>>> table maintenance actions, secondary indexes, the 1.0 API, views, ORC
>>>>> delete files, new storage implementations, etc. Upgrading Spark is time
>>>>> consuming and untrusted in my experience, so I think we would be setting up
>>>>> an unnecessary trade-off between spending lots of time to upgrade Spark and
>>>>> picking up new Iceberg features.
>>>>>
>>>>> Another way of thinking about this is that if we went with option 1,
>>>>> then we could port bug fixes into 0.12.x. But there are many things that
>>>>> wouldn’t fit this model, like adding a FileIO implementation for ADLS. So
>>>>> some people in the community would have to maintain branches of newer
>>>>> Iceberg versions with older versions of Spark outside of the main Iceberg
>>>>> project — that defeats the purpose of simplifying things with option 1
>>>>> because we would then have more people maintaining the same 0.13.x with
>>>>> Spark 3.1 branch. (This reminds me of the Spark community, where we wanted
>>>>> to release a 2.5 line with DSv2 backported, but the community decided not
>>>>> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>>>>>
>>>>> If the community is going to do the work anyway — and I think some of
>>>>> us would — we should make it possible to share that work. That’s why I
>>>>> don’t think that we should go with option 1.
>>>>>
>>>>> If we don’t go with option 1, then the choice is how to maintain
>>>>> multiple Spark versions. I think that the way we’re doing it right now is
>>>>> not something we want to continue.
>>>>>
>>>>> Using multiple modules (option 3) is concerning to me because of the
>>>>> changes in Spark. We currently structure the library to share as much code
>>>>> as possible. But that means compiling against different Spark versions and
>>>>> relying on binary compatibility and reflection in some cases. To me, this
>>>>> seems unmaintainable in the long run because it requires refactoring common
>>>>> classes and spending a lot of time deduplicating code. It also creates a
>>>>> ton of modules, at least one common module, then a module per version, then
>>>>> an extensions module per version, and finally a runtime module per version.
>>>>> That’s 3 modules per Spark version, plus any new common modules. And each
>>>>> module needs to be tested, which is making our CI take a really long time.
>>>>> We also don’t support multiple Scala versions, which is another gap that
>>>>> will require even more modules and tests.
>>>>>
>>>>> I like option 2 because it would allow us to compile against a single
>>>>> version of Spark (which will be much more reliable). It would give us an
>>>>> opportunity to support different Scala versions. It avoids the need to
>>>>> refactor to share code and allows people to focus on a single version of
>>>>> Spark, while also creating a way for people to maintain and update the
>>>>> older versions with newer Iceberg releases. I don’t think that this would
>>>>> slow down development. I think it would actually speed it up because we’d
>>>>> be spending less time trying to make multiple versions work in the same
>>>>> build. And anyone in favor of option 1 would basically get option 1: you
>>>>> don’t have to care about branches for older Spark versions.
>>>>>
>>>>> Jack makes a good point about wanting to keep code in a single
>>>>> repository, but I think that the need to manage more version combinations
>>>>> overrides this concern. It’s easier to make this decision in python because
>>>>> we’re not trying to depend on two projects that change relatively quickly.
>>>>> We’re just trying to build a library.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <op...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for bringing this up,  Anton.
>>>>>>
>>>>>> Everyone has great pros/cons to support their preferences.  Before
>>>>>> giving my preference, let me raise one question:    what's the top priority
>>>>>> thing for apache iceberg project at this point in time ?  This question
>>>>>> will help us to answer the following question: Should we support more
>>>>>> engine versions more robustly or be a bit more aggressive and concentrate
>>>>>> on getting the new features that users need most in order to keep the
>>>>>> project more competitive ?
>>>>>>
>>>>>> If people watch the apache iceberg project and check the issues &
>>>>>> PR frequently,  I guess more than 90% people will answer the priority
>>>>>> question:   There is no doubt for making the whole v2 story to be
>>>>>> production-ready.   The current roadmap discussion also proofs the thing :
>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>>> .
>>>>>>
>>>>>> In order to ensure the highest priority at this point in time, I will
>>>>>> prefer option-1 to reduce the cost of engine maintenance, so as to free up
>>>>>> resources to make v2 production-ready.
>>>>>>
>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> From Dev's point, it has less burden to always support the latest
>>>>>>> version of Spark (for example). But from user's point, especially for us
>>>>>>> who maintain Spark internally, it is not easy to upgrade the Spark version
>>>>>>> for the first time (since we have many customizations internally), and
>>>>>>> we're still promoting to upgrade to 3.1.2. If the community ditches the
>>>>>>> support of old version of Spark3, users have to maintain it themselves
>>>>>>> unavoidably.
>>>>>>>
>>>>>>> So I'm inclined to make this support in community, not by users
>>>>>>> themselves, as for Option 2 or 3, I'm fine with either. And to relieve the
>>>>>>> burden, we could support limited versions of Spark (for example 2 versions).
>>>>>>>
>>>>>>> Just my two cents.
>>>>>>>
>>>>>>> -Saisai
>>>>>>>
>>>>>>>
>>>>>>> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>>>>>>>
>>>>>>>> Hi Wing Yew,
>>>>>>>>
>>>>>>>> I think 2.4 is a different story, we will continue to support Spark
>>>>>>>> 2.4, but as you can see it will continue to have very limited
>>>>>>>> functionalities comparing to Spark 3. I believe we discussed about option 3
>>>>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the
>>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>>>>>> consistent strategy around this, let's take this chance to make a good
>>>>>>>> community guideline for all future engine versions, especially for Spark,
>>>>>>>> Flink and Hive that are in the same repository.
>>>>>>>>
>>>>>>>> I can totally understand your point of view Wing, in fact, speaking
>>>>>>>> from the perspective of AWS EMR, we have to support over 40 versions of the
>>>>>>>> software because there are people who are still using Spark 1.4, believe it
>>>>>>>> or not. After all, keep backporting changes will become a liability not
>>>>>>>> only on the user side, but also on the service provider side, so I believe
>>>>>>>> it's not a bad practice to push for user upgrade, as it will make the life
>>>>>>>> of both parties easier in the end. New feature is definitely one of the
>>>>>>>> best incentives to promote an upgrade on user side.
>>>>>>>>
>>>>>>>> I think the biggest issue of option 3 is about its scalability,
>>>>>>>> because we will have an unbounded list of packages to add and compile in
>>>>>>>> the future, and we probably cannot drop support of that package once
>>>>>>>> created. If we go with option 1, I think we can still publish a few patch
>>>>>>>> versions for old Iceberg releases, and committers can control the amount of
>>>>>>>> patch versions to guard people from abusing the power of patching. I see
>>>>>>>> this as a consistent strategy also for Flink and Hive. With this strategy,
>>>>>>>> we can truly have a compatibility matrix for engine versions against
>>>>>>>> Iceberg versions.
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>>>> wypoon@cloudera.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> I understand and sympathize with the desire to use new DSv2
>>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest for developers,
>>>>>>>>> but I don't think it considers the interests of users. I do not think that
>>>>>>>>> most users will upgrade to Spark 3.2 as soon as it is released. It is a
>>>>>>>>> "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all
>>>>>>>>> know that it is not a minor upgrade. There are a lot of changes from 3.0 to
>>>>>>>>> 3.1 and from 3.1 to 3.2. I think there are even a lot of users running
>>>>>>>>> Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting
>>>>>>>>> Spark 2.4?
>>>>>>>>>
>>>>>>>>> Please correct me if I'm mistaken, but the folks who have spoken
>>>>>>>>> out in favor of Option 1 all work for the same organization, don't they?
>>>>>>>>> And they don't have a problem with making their users, all internal, simply
>>>>>>>>> upgrade to Spark 3.2, do they? (Or they are already running an internal
>>>>>>>>> fork that is close to 3.2.)
>>>>>>>>>
>>>>>>>>> I work for an organization with customers running different
>>>>>>>>> versions of Spark. It is true that we can backport new features to older
>>>>>>>>> versions if we wanted to. I suppose the people contributing to Iceberg work
>>>>>>>>> for some organization or other that either use Iceberg in-house, or provide
>>>>>>>>> software (possibly in the form of a service) to customers, and either way,
>>>>>>>>> the organizations have the ability to backport features and fixes to
>>>>>>>>> internal versions. Are there any users out there who simply use Apache
>>>>>>>>> Iceberg and depend on the community version?
>>>>>>>>>
>>>>>>>>> There may be features that are broadly useful that do not depend
>>>>>>>>> on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>>>>>>>>
>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I
>>>>>>>>> would consider Option 3 too. Anton, you said 5 modules are required; what
>>>>>>>>> are the modules you're thinking of?
>>>>>>>>>
>>>>>>>>> - Wing Yew
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <fl...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>>
>>>>>>>>>> 1. Both 2 and 3 will slow down the development. Considering the
>>>>>>>>>> limited resources in the open source community, the upsides of option 2 and
>>>>>>>>>> 3 are probably not worthy.
>>>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's hard to
>>>>>>>>>> predict anything, but even if these use cases are legit, users can still
>>>>>>>>>> get the new feature by backporting it to an older version in case of
>>>>>>>>>> upgrading to a newer version isn't an option.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Yufei
>>>>>>>>>>
>>>>>>>>>> `This is not a contribution`
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>>>>>>>>
>>>>>>>>>>> The easiest option for us devs, forces the user to upgrade to
>>>>>>>>>>> the most recent minor Spark version to consume any new Iceberg
>>>>>>>>>>> features.
>>>>>>>>>>>
>>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>>
>>>>>>>>>>> Can support as many Spark versions as needed and the codebase is
>>>>>>>>>>> still separate as we can use separate branches.
>>>>>>>>>>> Impossible to consume any unreleased changes in core, may slow
>>>>>>>>>>> down the development.
>>>>>>>>>>>
>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>>
>>>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>>>> Can consume unreleased changes but it will required at least 5
>>>>>>>>>>> modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>>>>>> complicated.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Are there any users for whom upgrading the minor Spark version
>>>>>>>>>>> (e3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>>>> We follow Option 1 internally at the moment but I would like to
>>>>>>>>>>> hear what other people think/need.
>>>>>>>>>>>
>>>>>>>>>>> - Anton
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I think we should go for option 1. I already am not a big fan of
>>>>>>>>>>> having runtime errors for unsupported things based on versions and I don't
>>>>>>>>>>> think minor version upgrades are a large issue for users.  I'm especially
>>>>>>>>>>> not looking forward to supporting interfaces that only exist in Spark 3.2
>>>>>>>>>>> in a multiple Spark version support future.
>>>>>>>>>>>
>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>>>>>>>
>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> That’s exactly the concern I have about Option 2 at this moment.
>>>>>>>>>>>
>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>> major version.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This is when it gets a bit complicated. If we want to support
>>>>>>>>>>> both Spark 3.1 and Spark 3.2 with a single module, it means we have to
>>>>>>>>>>> compile against 3.1. The problem is that we rely on DSv2 that is being
>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial differences. On top of
>>>>>>>>>>> that, we have our extensions that are extremely low-level and may break not
>>>>>>>>>>> only between minor versions but also between patch releases.
>>>>>>>>>>>
>>>>>>>>>>> f there are some features requiring a newer version, it makes
>>>>>>>>>>> sense to move that newer version in master.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Internally, we don’t deliver new features to older Spark
>>>>>>>>>>> versions as it requires a lot of effort to port things. Personally, I don’t
>>>>>>>>>>> think it is too bad to require users to upgrade if they want new features.
>>>>>>>>>>> At the same time, there are valid concerns with this approach too that we
>>>>>>>>>>> mentioned during the sync. For example, certain new features would also
>>>>>>>>>>> work fine with older Spark versions. I generally agree with that and that
>>>>>>>>>>> not supporting recent versions is not ideal. However, I want to find a
>>>>>>>>>>> balance between the complexity on our side and ease of use for the users.
>>>>>>>>>>> Ideally, supporting a few recent versions would be sufficient but our Spark
>>>>>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>>
>>>>>>>>>>> Overall I would personally prefer us to not support all the
>>>>>>>>>>> minor versions, but instead support maybe just 2-3 latest versions in a
>>>>>>>>>>> major version. This avoids the problem that some users are unwilling to
>>>>>>>>>>> move to a newer version and keep patching old Spark version branches. If
>>>>>>>>>>> there are some features requiring a newer version, it makes sense to move
>>>>>>>>>>> that newer version in master.
>>>>>>>>>>>
>>>>>>>>>>> In addition, because currently Spark is considered the most
>>>>>>>>>>> feature-complete reference implementation compared to all other engines, I
>>>>>>>>>>> think we should not add artificial barriers that would slow down its
>>>>>>>>>>> development speed.
>>>>>>>>>>>
>>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jack Ye
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey folks,
>>>>>>>>>>>>
>>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>>
>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is great
>>>>>>>>>>>> to support older versions but because we compile against 3.0, we cannot use
>>>>>>>>>>>> any Spark features that are offered in newer versions.
>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of
>>>>>>>>>>>> important features such dynamic filtering for v2 tables, required
>>>>>>>>>>>> distribution and ordering for writes, etc. These features are too important
>>>>>>>>>>>> to ignore them.
>>>>>>>>>>>>
>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for
>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of the 3.2 features.
>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us internally and would
>>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>>
>>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>>
>>>>>>>>>>>> Option 1
>>>>>>>>>>>>
>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by
>>>>>>>>>>>> releasing minor versions with bug fixes.
>>>>>>>>>>>>
>>>>>>>>>>>> Pros: almost no changes to the build configuration, no extra
>>>>>>>>>>>> work on our side as just a single Spark version is actively maintained.
>>>>>>>>>>>> Cons: some new features that we will be adding to master could
>>>>>>>>>>>> also work with older Spark versions but all 0.12 releases will only contain
>>>>>>>>>>>> bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to
>>>>>>>>>>>> consume any new Spark or format features.
>>>>>>>>>>>>
>>>>>>>>>>>> Option 2
>>>>>>>>>>>>
>>>>>>>>>>>> Move our Spark integration into a separate project and
>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>
>>>>>>>>>>>> Pros: decouples the format version from Spark, we can support
>>>>>>>>>>>> as many Spark versions as needed.
>>>>>>>>>>>> Cons: more work initially to set everything up, more work to
>>>>>>>>>>>> release, will need a new release of the core format to consume any changes
>>>>>>>>>>>> in the Spark integration.
>>>>>>>>>>>>
>>>>>>>>>>>> Overall, I think option 2 seems better for the user but my main
>>>>>>>>>>>> worry is that we will have to release the format more frequently (which is
>>>>>>>>>>>> a good thing but requires more work and time) and the overall Spark
>>>>>>>>>>>> development may be slower.
>>>>>>>>>>>>
>>>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Anton
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
>
>

Re: [DISCUSS] Spark version support strategy

Posted by Anton Okolnychyi <ao...@apple.com.INVALID>.

Okay, looks like there is consensus around supporting multiple Spark versions at the same time. There are folks who mentioned this on this thread and there were folks who brought this up during the sync.

Let’s think through Option 2 and 3 in more detail then.

Option 2

In Option 2, there will be a separate repo. I believe the master branch will soon point to Spark 3.2 (the most recent supported version). The main development will happen there and the artifact version will be 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1 branches where we will cherry-pick applicable changes. Once we are ready to release 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and cut 3 releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump the version in master to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1 branches for cherry-picks.

I guess we will continue to shade everything in the new repo and will have to release every time the core is released. We will do a maintenance release for each supported Spark version whenever we cut a new maintenance Iceberg release or need to fix any bugs in the Spark integration.
Under this model, we will probably need nightly snapshots (or on each commit) for the core format and the Spark integration will depend on snapshots until we are ready to release.

Overall, I think this option gives us very simple builds and provides best separation. It will keep the main repo clean. The main downside is that we will have to split a Spark feature into two PRs: one against the core and one against the Spark integration. Certain changes in core can also break the Spark integration too and will require adaptations.

Ryan, I am not sure I fully understood the testing part. How will we be able to test the Spark integration in the main repo if certain changes in core may break the Spark integration and require changes there? Will we try to prohibit such changes?

Option 3 (modified)

If I get correctly, the modified Option 3 sounds very close to the initially suggested approach by Imran but with code duplication instead of extra refactoring and introducing new common modules.

Jack, are you suggesting we test only a single Spark version at a time? Or do we expect to test all versions? Will there be any difference compared to just having a module per version? I did not fully understand.

My worry with this approach is that our build will be very complicated and we will still have a lot of Spark-related modules in the main repo. Once people start using Flink and Hive more, will we have to do the same?

- Anton



> On 16 Sep 2021, at 08:11, Ryan Blue <bl...@tabular.io> wrote:
> 
> I'd support the option that Jack suggests if we can set a few expectations for keeping it clean.
> 
> First, I'd like to avoid refactoring code to share it across Spark versions -- that introduces risk because we're relying on compiling against one version and running in another and both Spark and Scala change rapidly. A big benefit of options 1 and 2 is that we mostly focus on only one Spark version. I think we should duplicate code rather than spend time refactoring to rely on binary compatibility. I propose we start each new Spark version by copying the last one and updating it. And we should build just the latest supported version by default.
> 
> The drawback to having everything in a single repo is that we wouldn't be able to cherry-pick changes across Spark versions/branches, but I think Jack is right that having a single build is better.
> 
> Second, we should make CI faster by running the Spark builds in parallel. It sounds like this is what would happen anyway, with a property that selects the Spark version that you want to build against.
> 
> Overall, this new suggestion sounds like a promising way forward.
> 
> Ryan
> 
> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <yezhaoqin@gmail.com <ma...@gmail.com>> wrote:
> I think in Ryan's proposal we will create a ton of modules anyway, as Wing listed we are just using git branch as an additional dimension, but my understanding is that you will still have 1 core, 1 extension, 1 runtime artifact published for each Spark version in either approach.
> 
> In that case, this is just brainstorming, I wonder if we can explore a modified option 3 that flattens all the versions in each Spark branch in option 2 into master. The repository structure would look something like:
> 
> iceberg/api/...
>             /bundled-guava/...
>             /core/...
>             ...
>             /spark/2.4/core/...
>                             /extension/...
>                             /runtime/...
>                       /3.1/core/...
>                             /extension/...
>                             /runtime/...
> 
> The gradle build script in the root is configured to build against the latest version of Spark by default, unless otherwise specified by the user. 
> 
> Intellij can also be configured to only index files of specific versions based on the same config used in build.
> 
> In this way, I imagine the CI setup to be much easier to do things like testing version compatibility for a feature or running only a specific subset of Spark version builds based on the Spark version directories touched. 
> 
> And the biggest benefit is that we don't have the same difficulty as option 2 of developing a feature when it's both in core and Spark.
> 
> We can then develop a mechanism to vote to stop support of certain versions, and archive the corresponding directory to avoid accumulating too many versions in the long term.
> 
> -Jack Ye
> 
> 
> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <blue@tabular.io <ma...@tabular.io>> wrote:
> Sorry, I was thinking about CI integration between Iceberg Java and Iceberg Spark, I just didn't mention it and I see how that's a big thing to leave out!
> 
> I would definitely want to test the projects together. One thing we could do is have a nightly build like Russell suggests. I'm also wondering if we could have some tighter integration where the Iceberg Spark build can be included in the Iceberg Java build using properties. Maybe the github action could checkout Iceberg, then checkout the Spark integration's latest branch, and then run the gradle build with a property that makes Spark a subproject in the build. That way we can continue to have Spark CI run regularly.
> 
> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <russell.spitzer@gmail.com <ma...@gmail.com>> wrote:
> I agree that Option 2 is considerably more difficult for development when core API changes need to be picked up by the external Spark module. I also think a monthly release would probably still be prohibitive to actually implementing new features that appear in the API, I would hope we have a much faster process or maybe just have snapshot artifacts published nightly?
> 
>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <wypoon@cloudera.com.INVALID <ma...@cloudera.com.INVALID>> wrote:
>> 
>> IIUC, Option 2 is to move the Spark support for Iceberg into a separate repo (subproject of Iceberg). Would we have branches such as 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be supported in all versions or all Spark 3 versions, then we would need to commit the changes to all applicable branches. Basically we are trading more work to commit to multiple branches for simplified build and CI time per branch, which might be an acceptable trade-off. However, the biggest downside is that changes may need to be made in core Iceberg as well as in the engine (in this case Spark) support, and we need to wait for a release of core Iceberg to consume the changes in the subproject. In this case, maybe we should have a monthly release of core Iceberg (no matter how many changes go in, as long as it is non-zero) so that the subproject can consume changes fairly quickly?
>> 
>> 
>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <blue@tabular.io <ma...@tabular.io>> wrote:
>> Thanks for bringing this up, Anton. I’m glad that we have the set of potential solutions well defined.
>> 
>> Looks like the next step is to decide whether we want to require people to update Spark versions to pick up newer versions of Iceberg. If we choose to make people upgrade, then option 1 is clearly the best choice.
>> 
>> I don’t think that we should make updating Spark a requirement. Many of the things that we’re working on are orthogonal to Spark versions, like table maintenance actions, secondary indexes, the 1.0 API, views, ORC delete files, new storage implementations, etc. Upgrading Spark is time consuming and untrusted in my experience, so I think we would be setting up an unnecessary trade-off between spending lots of time to upgrade Spark and picking up new Iceberg features.
>> 
>> Another way of thinking about this is that if we went with option 1, then we could port bug fixes into 0.12.x. But there are many things that wouldn’t fit this model, like adding a FileIO implementation for ADLS. So some people in the community would have to maintain branches of newer Iceberg versions with older versions of Spark outside of the main Iceberg project — that defeats the purpose of simplifying things with option 1 because we would then have more people maintaining the same 0.13.x with Spark 3.1 branch. (This reminds me of the Spark community, where we wanted to release a 2.5 line with DSv2 backported, but the community decided not to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>> 
>> If the community is going to do the work anyway — and I think some of us would — we should make it possible to share that work. That’s why I don’t think that we should go with option 1.
>> 
>> If we don’t go with option 1, then the choice is how to maintain multiple Spark versions. I think that the way we’re doing it right now is not something we want to continue.
>> 
>> Using multiple modules (option 3) is concerning to me because of the changes in Spark. We currently structure the library to share as much code as possible. But that means compiling against different Spark versions and relying on binary compatibility and reflection in some cases. To me, this seems unmaintainable in the long run because it requires refactoring common classes and spending a lot of time deduplicating code. It also creates a ton of modules, at least one common module, then a module per version, then an extensions module per version, and finally a runtime module per version. That’s 3 modules per Spark version, plus any new common modules. And each module needs to be tested, which is making our CI take a really long time. We also don’t support multiple Scala versions, which is another gap that will require even more modules and tests.
>> 
>> I like option 2 because it would allow us to compile against a single version of Spark (which will be much more reliable). It would give us an opportunity to support different Scala versions. It avoids the need to refactor to share code and allows people to focus on a single version of Spark, while also creating a way for people to maintain and update the older versions with newer Iceberg releases. I don’t think that this would slow down development. I think it would actually speed it up because we’d be spending less time trying to make multiple versions work in the same build. And anyone in favor of option 1 would basically get option 1: you don’t have to care about branches for older Spark versions.
>> 
>> Jack makes a good point about wanting to keep code in a single repository, but I think that the need to manage more version combinations overrides this concern. It’s easier to make this decision in python because we’re not trying to depend on two projects that change relatively quickly. We’re just trying to build a library.
>> 
>> Ryan
>> 
>> 
>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <openinx@gmail.com <ma...@gmail.com>> wrote:
>> Thanks for bringing this up,  Anton. 
>> 
>> Everyone has great pros/cons to support their preferences.  Before giving my preference, let me raise one question:    what's the top priority thing for apache iceberg project at this point in time ?  This question will help us to answer the following question: Should we support more engine versions more robustly or be a bit more aggressive and concentrate on getting the new features that users need most in order to keep the project more competitive ? 
>> 
>> If people watch the apache iceberg project and check the issues & PR frequently,  I guess more than 90% people will answer the priority question:   There is no doubt for making the whole v2 story to be production-ready.   The current roadmap discussion also proofs the thing : https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E <https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E> .   
>> 
>> In order to ensure the highest priority at this point in time, I will prefer option-1 to reduce the cost of engine maintenance, so as to free up resources to make v2 production-ready. 
>> 
>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sai.sai.shao@gmail.com <ma...@gmail.com>> wrote:
>> From Dev's point, it has less burden to always support the latest version of Spark (for example). But from user's point, especially for us who maintain Spark internally, it is not easy to upgrade the Spark version for the first time (since we have many customizations internally), and we're still promoting to upgrade to 3.1.2. If the community ditches the support of old version of Spark3, users have to maintain it themselves unavoidably. 
>> 
>> So I'm inclined to make this support in community, not by users themselves, as for Option 2 or 3, I'm fine with either. And to relieve the burden, we could support limited versions of Spark (for example 2 versions).
>> 
>> Just my two cents.
>> 
>> -Saisai
>> 
>> 
>> Jack Ye <yezhaoqin@gmail.com <ma...@gmail.com>> 于2021年9月15日周三 下午1:35写道：
>> Hi Wing Yew,
>> 
>> I think 2.4 is a different story, we will continue to support Spark 2.4, but as you can see it will continue to have very limited functionalities comparing to Spark 3. I believe we discussed about option 3 when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a consistent strategy around this, let's take this chance to make a good community guideline for all future engine versions, especially for Spark, Flink and Hive that are in the same repository.
>> 
>> I can totally understand your point of view Wing, in fact, speaking from the perspective of AWS EMR, we have to support over 40 versions of the software because there are people who are still using Spark 1.4, believe it or not. After all, keep backporting changes will become a liability not only on the user side, but also on the service provider side, so I believe it's not a bad practice to push for user upgrade, as it will make the life of both parties easier in the end. New feature is definitely one of the best incentives to promote an upgrade on user side.
>> 
>> I think the biggest issue of option 3 is about its scalability, because we will have an unbounded list of packages to add and compile in the future, and we probably cannot drop support of that package once created. If we go with option 1, I think we can still publish a few patch versions for old Iceberg releases, and committers can control the amount of patch versions to guard people from abusing the power of patching. I see this as a consistent strategy also for Flink and Hive. With this strategy, we can truly have a compatibility matrix for engine versions against Iceberg versions.
>> 
>> -Jack
>> 
>> 
>> 
>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <wypoon@cloudera.com.invalid <ma...@cloudera.com.invalid>> wrote:
>> I understand and sympathize with the desire to use new DSv2 features in Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't think it considers the interests of users. I do not think that most users will upgrade to Spark 3.2 as soon as it is released. It is a "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all know that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1 and from 3.1 to 3.2. I think there are even a lot of users running Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark 2.4?
>> 
>> Please correct me if I'm mistaken, but the folks who have spoken out in favor of Option 1 all work for the same organization, don't they? And they don't have a problem with making their users, all internal, simply upgrade to Spark 3.2, do they? (Or they are already running an internal fork that is close to 3.2.)
>> 
>> I work for an organization with customers running different versions of Spark. It is true that we can backport new features to older versions if we wanted to. I suppose the people contributing to Iceberg work for some organization or other that either use Iceberg in-house, or provide software (possibly in the form of a service) to customers, and either way, the organizations have the ability to backport features and fixes to internal versions. Are there any users out there who simply use Apache Iceberg and depend on the community version?
>> 
>> There may be features that are broadly useful that do not depend on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>> 
>> I am not in favor of Option 2. I do not oppose Option 1, but I would consider Option 3 too. Anton, you said 5 modules are required; what are the modules you're thinking of?
>> 
>> - Wing Yew
>> 
>> 
>> 
>> 
>> 
>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <flyrain000@gmail.com <ma...@gmail.com>> wrote:
>> Option 1 sounds good to me. Here are my reasons:
>> 
>> 1. Both 2 and 3 will slow down the development. Considering the limited resources in the open source community, the upsides of option 2 and 3 are probably not worthy.
>> 2. Both 2 and 3 assume the use cases may not exist. It's hard to predict anything, but even if these use cases are legit, users can still get the new feature by backporting it to an older version in case of upgrading to a newer version isn't an option.
>> 
>> Best,
>> 
>> Yufei
>> 
>> `This is not a contribution`
>> 
>> 
>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <aokolnychyi@apple.com.invalid <ma...@apple.com.invalid>> wrote:
>> To sum up what we have so far:
>> 
>> 
>> Option 1 (support just the most recent minor Spark 3 version)
>> 
>> The easiest option for us devs, forces the user to upgrade to the most recent minor Spark version to consume any new Iceberg features.
>> 
>> Option 2 (a separate project under Iceberg)
>> 
>> Can support as many Spark versions as needed and the codebase is still separate as we can use separate branches.
>> Impossible to consume any unreleased changes in core, may slow down the development.
>> 
>> Option 3 (separate modules for Spark 3.1/3.2)
>> 
>> Introduce more modules in the same project.
>> Can consume unreleased changes but it will required at least 5 modules to support 2.4, 3.1 and 3.2, making the build and testing complicated.
>> 
>> 
>> Are there any users for whom upgrading the minor Spark version (e3.1 to 3.2) to consume new features is a blocker?
>> We follow Option 1 internally at the moment but I would like to hear what other people think/need.
>> 
>> - Anton
>> 
>> 
>>> On 14 Sep 2021, at 09:44, Russell Spitzer <russell.spitzer@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> I think we should go for option 1. I already am not a big fan of having runtime errors for unsupported things based on versions and I don't think minor version upgrades are a large issue for users.  I'm especially not looking forward to supporting interfaces that only exist in Spark 3.2 in a multiple Spark version support future.
>>> 
>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <aokolnychyi@apple.com.INVALID <ma...@apple.com.INVALID>> wrote:
>>>> 
>>>>> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here. 
>>>> 
>>>> That’s exactly the concern I have about Option 2 at this moment.
>>>> 
>>>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just 2-3 latest versions in a major version. 
>>>> 
>>>> This is when it gets a bit complicated. If we want to support both Spark 3.1 and Spark 3.2 with a single module, it means we have to compile against 3.1. The problem is that we rely on DSv2 that is being actively developed. 3.2 and 3.1 have substantial differences. On top of that, we have our extensions that are extremely low-level and may break not only between minor versions but also between patch releases.
>>>> 
>>>>> f there are some features requiring a newer version, it makes sense to move that newer version in master.
>>>> 
>>>> Internally, we don’t deliver new features to older Spark versions as it requires a lot of effort to port things. Personally, I don’t think it is too bad to require users to upgrade if they want new features. At the same time, there are valid concerns with this approach too that we mentioned during the sync. For example, certain new features would also work fine with older Spark versions. I generally agree with that and that not supporting recent versions is not ideal. However, I want to find a balance between the complexity on our side and ease of use for the users. Ideally, supporting a few recent versions would be sufficient but our Spark integration is too low-level to do that with a single module.
>>>>  
>>>> 
>>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhaoqin@gmail.com <ma...@gmail.com>> wrote:
>>>>> 
>>>>> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here. 
>>>>> 
>>>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just 2-3 latest versions in a major version. This avoids the problem that some users are unwilling to move to a newer version and keep patching old Spark version branches. If there are some features requiring a newer version, it makes sense to move that newer version in master.
>>>>> 
>>>>> In addition, because currently Spark is considered the most feature-complete reference implementation compared to all other engines, I think we should not add artificial barriers that would slow down its development speed.
>>>>> 
>>>>> So my thinking is closer to option 1.
>>>>> 
>>>>> Best,
>>>>> Jack Ye
>>>>> 
>>>>> 
>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <aokolnychyi@apple.com.invalid <ma...@apple.com.invalid>> wrote:
>>>>> Hey folks,
>>>>> 
>>>>> I want to discuss our Spark version support strategy.
>>>>> 
>>>>> So far, we have tried to support both 3.0 and 3.1. It is great to support older versions but because we compile against 3.0, we cannot use any Spark features that are offered in newer versions.
>>>>> Spark 3.2 is just around the corner and it brings a lot of important features such dynamic filtering for v2 tables, required distribution and ordering for writes, etc. These features are too important to ignore them.
>>>>> 
>>>>> Apart from that, I have an end-to-end prototype for merge-on-read with Spark that actually leverages some of the 3.2 features. I’ll be implementing all new Spark DSv2 APIs for us internally and would love to share that with the rest of the community.
>>>>> 
>>>>> I see two options to move forward:
>>>>> 
>>>>> Option 1
>>>>> 
>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing minor versions with bug fixes.
>>>>> 
>>>>> Pros: almost no changes to the build configuration, no extra work on our side as just a single Spark version is actively maintained.
>>>>> Cons: some new features that we will be adding to master could also work with older Spark versions but all 0.12 releases will only contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume any new Spark or format features.
>>>>> 
>>>>> Option 2
>>>>> 
>>>>> Move our Spark integration into a separate project and introduce branches for 3.0, 3.1 and 3.2.
>>>>> 
>>>>> Pros: decouples the format version from Spark, we can support as many Spark versions as needed.
>>>>> Cons: more work initially to set everything up, more work to release, will need a new release of the core format to consume any changes in the Spark integration.
>>>>> 
>>>>> Overall, I think option 2 seems better for the user but my main worry is that we will have to release the format more frequently (which is a good thing but requires more work and time) and the overall Spark development may be slower.
>>>>> 
>>>>> I’d love to hear what everybody thinks about this matter.
>>>>> 
>>>>> Thanks,
>>>>> Anton
>>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> Ryan Blue
>> Tabular
> 
> 
> 
> -- 
> Ryan Blue
> Tabular
> 
> 
> -- 
> Ryan Blue
> Tabular

Re: [DISCUSS] Spark version support strategy

Posted by Ryan Blue <bl...@tabular.io>.

I'd support the option that Jack suggests if we can set a few expectations
for keeping it clean.

First, I'd like to avoid refactoring code to share it across Spark versions
-- that introduces risk because we're relying on compiling against one
version and running in another and both Spark and Scala change rapidly. A
big benefit of options 1 and 2 is that we mostly focus on only one Spark
version. I think we should duplicate code rather than spend time
refactoring to rely on binary compatibility. I propose we start each new
Spark version by copying the last one and updating it. And we should build
just the latest supported version by default.

The drawback to having everything in a single repo is that we wouldn't be
able to cherry-pick changes across Spark versions/branches, but I think
Jack is right that having a single build is better.

Second, we should make CI faster by running the Spark builds in parallel.
It sounds like this is what would happen anyway, with a property that
selects the Spark version that you want to build against.

Overall, this new suggestion sounds like a promising way forward.

Ryan

On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <ye...@gmail.com> wrote:

> I think in Ryan's proposal we will create a ton of modules anyway, as Wing
> listed we are just using git branch as an additional dimension, but my
> understanding is that you will still have 1 core, 1 extension, 1 runtime
> artifact published for each Spark version in either approach.
>
> In that case, this is just brainstorming, I wonder if we can explore a
> modified option 3 that flattens all the versions in each Spark branch in
> option 2 into master. The repository structure would look something like:
>
> iceberg/api/...
>             /bundled-guava/...
>             /core/...
>             ...
>             /spark/2.4/core/...
>                             /extension/...
>                             /runtime/...
>                       /3.1/core/...
>                             /extension/...
>                             /runtime/...
>
> The gradle build script in the root is configured to build against the
> latest version of Spark by default, unless otherwise specified by the user.
>
> Intellij can also be configured to only index files of specific versions
> based on the same config used in build.
>
> In this way, I imagine the CI setup to be much easier to do things like
> testing version compatibility for a feature or running only a
> specific subset of Spark version builds based on the Spark version
> directories touched.
>
> And the biggest benefit is that we don't have the same difficulty as
> option 2 of developing a feature when it's both in core and Spark.
>
> We can then develop a mechanism to vote to stop support of certain
> versions, and archive the corresponding directory to avoid accumulating too
> many versions in the long term.
>
> -Jack Ye
>
>
> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <bl...@tabular.io> wrote:
>
>> Sorry, I was thinking about CI integration between Iceberg Java and
>> Iceberg Spark, I just didn't mention it and I see how that's a big thing to
>> leave out!
>>
>> I would definitely want to test the projects together. One thing we could
>> do is have a nightly build like Russell suggests. I'm also wondering if we
>> could have some tighter integration where the Iceberg Spark build can be
>> included in the Iceberg Java build using properties. Maybe the github
>> action could checkout Iceberg, then checkout the Spark integration's latest
>> branch, and then run the gradle build with a property that makes Spark a
>> subproject in the build. That way we can continue to have Spark CI run
>> regularly.
>>
>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <
>> russell.spitzer@gmail.com> wrote:
>>
>>> I agree that Option 2 is considerably more difficult for development
>>> when core API changes need to be picked up by the external Spark module. I
>>> also think a monthly release would probably still be prohibitive to
>>> actually implementing new features that appear in the API, I would hope we
>>> have a much faster process or maybe just have snapshot artifacts published
>>> nightly?
>>>
>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <wy...@cloudera.com.INVALID>
>>> wrote:
>>>
>>> IIUC, Option 2 is to move the Spark support for Iceberg into a separate
>>> repo (subproject of Iceberg). Would we have branches such as 0.13-2.4,
>>> 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be supported in all
>>> versions or all Spark 3 versions, then we would need to commit the changes
>>> to all applicable branches. Basically we are trading more work to commit to
>>> multiple branches for simplified build and CI time per branch, which might
>>> be an acceptable trade-off. However, the biggest downside is that changes
>>> may need to be made in core Iceberg as well as in the engine (in this case
>>> Spark) support, and we need to wait for a release of core Iceberg to
>>> consume the changes in the subproject. In this case, maybe we should have a
>>> monthly release of core Iceberg (no matter how many changes go in, as long
>>> as it is non-zero) so that the subproject can consume changes fairly
>>> quickly?
>>>
>>>
>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <bl...@tabular.io> wrote:
>>>
>>>> Thanks for bringing this up, Anton. I’m glad that we have the set of
>>>> potential solutions well defined.
>>>>
>>>> Looks like the next step is to decide whether we want to require people
>>>> to update Spark versions to pick up newer versions of Iceberg. If we choose
>>>> to make people upgrade, then option 1 is clearly the best choice.
>>>>
>>>> I don’t think that we should make updating Spark a requirement. Many of
>>>> the things that we’re working on are orthogonal to Spark versions, like
>>>> table maintenance actions, secondary indexes, the 1.0 API, views, ORC
>>>> delete files, new storage implementations, etc. Upgrading Spark is time
>>>> consuming and untrusted in my experience, so I think we would be setting up
>>>> an unnecessary trade-off between spending lots of time to upgrade Spark and
>>>> picking up new Iceberg features.
>>>>
>>>> Another way of thinking about this is that if we went with option 1,
>>>> then we could port bug fixes into 0.12.x. But there are many things that
>>>> wouldn’t fit this model, like adding a FileIO implementation for ADLS. So
>>>> some people in the community would have to maintain branches of newer
>>>> Iceberg versions with older versions of Spark outside of the main Iceberg
>>>> project — that defeats the purpose of simplifying things with option 1
>>>> because we would then have more people maintaining the same 0.13.x with
>>>> Spark 3.1 branch. (This reminds me of the Spark community, where we wanted
>>>> to release a 2.5 line with DSv2 backported, but the community decided not
>>>> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>>>>
>>>> If the community is going to do the work anyway — and I think some of
>>>> us would — we should make it possible to share that work. That’s why I
>>>> don’t think that we should go with option 1.
>>>>
>>>> If we don’t go with option 1, then the choice is how to maintain
>>>> multiple Spark versions. I think that the way we’re doing it right now is
>>>> not something we want to continue.
>>>>
>>>> Using multiple modules (option 3) is concerning to me because of the
>>>> changes in Spark. We currently structure the library to share as much code
>>>> as possible. But that means compiling against different Spark versions and
>>>> relying on binary compatibility and reflection in some cases. To me, this
>>>> seems unmaintainable in the long run because it requires refactoring common
>>>> classes and spending a lot of time deduplicating code. It also creates a
>>>> ton of modules, at least one common module, then a module per version, then
>>>> an extensions module per version, and finally a runtime module per version.
>>>> That’s 3 modules per Spark version, plus any new common modules. And each
>>>> module needs to be tested, which is making our CI take a really long time.
>>>> We also don’t support multiple Scala versions, which is another gap that
>>>> will require even more modules and tests.
>>>>
>>>> I like option 2 because it would allow us to compile against a single
>>>> version of Spark (which will be much more reliable). It would give us an
>>>> opportunity to support different Scala versions. It avoids the need to
>>>> refactor to share code and allows people to focus on a single version of
>>>> Spark, while also creating a way for people to maintain and update the
>>>> older versions with newer Iceberg releases. I don’t think that this would
>>>> slow down development. I think it would actually speed it up because we’d
>>>> be spending less time trying to make multiple versions work in the same
>>>> build. And anyone in favor of option 1 would basically get option 1: you
>>>> don’t have to care about branches for older Spark versions.
>>>>
>>>> Jack makes a good point about wanting to keep code in a single
>>>> repository, but I think that the need to manage more version combinations
>>>> overrides this concern. It’s easier to make this decision in python because
>>>> we’re not trying to depend on two projects that change relatively quickly.
>>>> We’re just trying to build a library.
>>>>
>>>> Ryan
>>>>
>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <op...@gmail.com> wrote:
>>>>
>>>>> Thanks for bringing this up,  Anton.
>>>>>
>>>>> Everyone has great pros/cons to support their preferences.  Before
>>>>> giving my preference, let me raise one question:    what's the top priority
>>>>> thing for apache iceberg project at this point in time ?  This question
>>>>> will help us to answer the following question: Should we support more
>>>>> engine versions more robustly or be a bit more aggressive and concentrate
>>>>> on getting the new features that users need most in order to keep the
>>>>> project more competitive ?
>>>>>
>>>>> If people watch the apache iceberg project and check the issues &
>>>>> PR frequently,  I guess more than 90% people will answer the priority
>>>>> question:   There is no doubt for making the whole v2 story to be
>>>>> production-ready.   The current roadmap discussion also proofs the thing :
>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>>> .
>>>>>
>>>>> In order to ensure the highest priority at this point in time, I will
>>>>> prefer option-1 to reduce the cost of engine maintenance, so as to free up
>>>>> resources to make v2 production-ready.
>>>>>
>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> From Dev's point, it has less burden to always support the latest
>>>>>> version of Spark (for example). But from user's point, especially for us
>>>>>> who maintain Spark internally, it is not easy to upgrade the Spark version
>>>>>> for the first time (since we have many customizations internally), and
>>>>>> we're still promoting to upgrade to 3.1.2. If the community ditches the
>>>>>> support of old version of Spark3, users have to maintain it themselves
>>>>>> unavoidably.
>>>>>>
>>>>>> So I'm inclined to make this support in community, not by users
>>>>>> themselves, as for Option 2 or 3, I'm fine with either. And to relieve the
>>>>>> burden, we could support limited versions of Spark (for example 2 versions).
>>>>>>
>>>>>> Just my two cents.
>>>>>>
>>>>>> -Saisai
>>>>>>
>>>>>>
>>>>>> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>>>>>>
>>>>>>> Hi Wing Yew,
>>>>>>>
>>>>>>> I think 2.4 is a different story, we will continue to support Spark
>>>>>>> 2.4, but as you can see it will continue to have very limited
>>>>>>> functionalities comparing to Spark 3. I believe we discussed about option 3
>>>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the
>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>>>>> consistent strategy around this, let's take this chance to make a good
>>>>>>> community guideline for all future engine versions, especially for Spark,
>>>>>>> Flink and Hive that are in the same repository.
>>>>>>>
>>>>>>> I can totally understand your point of view Wing, in fact, speaking
>>>>>>> from the perspective of AWS EMR, we have to support over 40 versions of the
>>>>>>> software because there are people who are still using Spark 1.4, believe it
>>>>>>> or not. After all, keep backporting changes will become a liability not
>>>>>>> only on the user side, but also on the service provider side, so I believe
>>>>>>> it's not a bad practice to push for user upgrade, as it will make the life
>>>>>>> of both parties easier in the end. New feature is definitely one of the
>>>>>>> best incentives to promote an upgrade on user side.
>>>>>>>
>>>>>>> I think the biggest issue of option 3 is about its scalability,
>>>>>>> because we will have an unbounded list of packages to add and compile in
>>>>>>> the future, and we probably cannot drop support of that package once
>>>>>>> created. If we go with option 1, I think we can still publish a few patch
>>>>>>> versions for old Iceberg releases, and committers can control the amount of
>>>>>>> patch versions to guard people from abusing the power of patching. I see
>>>>>>> this as a consistent strategy also for Flink and Hive. With this strategy,
>>>>>>> we can truly have a compatibility matrix for engine versions against
>>>>>>> Iceberg versions.
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>>> wypoon@cloudera.com.invalid> wrote:
>>>>>>>
>>>>>>>> I understand and sympathize with the desire to use new DSv2
>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest for developers,
>>>>>>>> but I don't think it considers the interests of users. I do not think that
>>>>>>>> most users will upgrade to Spark 3.2 as soon as it is released. It is a
>>>>>>>> "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all
>>>>>>>> know that it is not a minor upgrade. There are a lot of changes from 3.0 to
>>>>>>>> 3.1 and from 3.1 to 3.2. I think there are even a lot of users running
>>>>>>>> Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting
>>>>>>>> Spark 2.4?
>>>>>>>>
>>>>>>>> Please correct me if I'm mistaken, but the folks who have spoken
>>>>>>>> out in favor of Option 1 all work for the same organization, don't they?
>>>>>>>> And they don't have a problem with making their users, all internal, simply
>>>>>>>> upgrade to Spark 3.2, do they? (Or they are already running an internal
>>>>>>>> fork that is close to 3.2.)
>>>>>>>>
>>>>>>>> I work for an organization with customers running different
>>>>>>>> versions of Spark. It is true that we can backport new features to older
>>>>>>>> versions if we wanted to. I suppose the people contributing to Iceberg work
>>>>>>>> for some organization or other that either use Iceberg in-house, or provide
>>>>>>>> software (possibly in the form of a service) to customers, and either way,
>>>>>>>> the organizations have the ability to backport features and fixes to
>>>>>>>> internal versions. Are there any users out there who simply use Apache
>>>>>>>> Iceberg and depend on the community version?
>>>>>>>>
>>>>>>>> There may be features that are broadly useful that do not depend on
>>>>>>>> Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>>>>>>>
>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I
>>>>>>>> would consider Option 3 too. Anton, you said 5 modules are required; what
>>>>>>>> are the modules you're thinking of?
>>>>>>>>
>>>>>>>> - Wing Yew
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <fl...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>>
>>>>>>>>> 1. Both 2 and 3 will slow down the development. Considering the
>>>>>>>>> limited resources in the open source community, the upsides of option 2 and
>>>>>>>>> 3 are probably not worthy.
>>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's hard to
>>>>>>>>> predict anything, but even if these use cases are legit, users can still
>>>>>>>>> get the new feature by backporting it to an older version in case of
>>>>>>>>> upgrading to a newer version isn't an option.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Yufei
>>>>>>>>>
>>>>>>>>> `This is not a contribution`
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> To sum up what we have so far:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>>>>>>>
>>>>>>>>>> The easiest option for us devs, forces the user to upgrade to the
>>>>>>>>>> most recent minor Spark version to consume any new Iceberg
>>>>>>>>>> features.
>>>>>>>>>>
>>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>>
>>>>>>>>>> Can support as many Spark versions as needed and the codebase is
>>>>>>>>>> still separate as we can use separate branches.
>>>>>>>>>> Impossible to consume any unreleased changes in core, may slow
>>>>>>>>>> down the development.
>>>>>>>>>>
>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>>
>>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>>> Can consume unreleased changes but it will required at least 5
>>>>>>>>>> modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>>>>> complicated.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Are there any users for whom upgrading the minor Spark version
>>>>>>>>>> (e3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>>> We follow Option 1 internally at the moment but I would like to
>>>>>>>>>> hear what other people think/need.
>>>>>>>>>>
>>>>>>>>>> - Anton
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I think we should go for option 1. I already am not a big fan of
>>>>>>>>>> having runtime errors for unsupported things based on versions and I don't
>>>>>>>>>> think minor version upgrades are a large issue for users.  I'm especially
>>>>>>>>>> not looking forward to supporting interfaces that only exist in Spark 3.2
>>>>>>>>>> in a multiple Spark version support future.
>>>>>>>>>>
>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>>>>>>
>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> That’s exactly the concern I have about Option 2 at this moment.
>>>>>>>>>>
>>>>>>>>>> Overall I would personally prefer us to not support all the minor
>>>>>>>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>>>>>>>> version.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is when it gets a bit complicated. If we want to support
>>>>>>>>>> both Spark 3.1 and Spark 3.2 with a single module, it means we have to
>>>>>>>>>> compile against 3.1. The problem is that we rely on DSv2 that is being
>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial differences. On top of
>>>>>>>>>> that, we have our extensions that are extremely low-level and may break not
>>>>>>>>>> only between minor versions but also between patch releases.
>>>>>>>>>>
>>>>>>>>>> f there are some features requiring a newer version, it makes
>>>>>>>>>> sense to move that newer version in master.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Internally, we don’t deliver new features to older Spark versions
>>>>>>>>>> as it requires a lot of effort to port things. Personally, I don’t think it
>>>>>>>>>> is too bad to require users to upgrade if they want new features. At the
>>>>>>>>>> same time, there are valid concerns with this approach too that we
>>>>>>>>>> mentioned during the sync. For example, certain new features would also
>>>>>>>>>> work fine with older Spark versions. I generally agree with that and that
>>>>>>>>>> not supporting recent versions is not ideal. However, I want to find a
>>>>>>>>>> balance between the complexity on our side and ease of use for the users.
>>>>>>>>>> Ideally, supporting a few recent versions would be sufficient but our Spark
>>>>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> First of all, is option 2 a viable option? We discussed
>>>>>>>>>> separating the python module outside of the project a few weeks ago, and
>>>>>>>>>> decided to not do that because it's beneficial for code cross reference and
>>>>>>>>>> more intuitive for new developers to see everything in the same repository.
>>>>>>>>>> I would expect the same argument to also hold here.
>>>>>>>>>>
>>>>>>>>>> Overall I would personally prefer us to not support all the minor
>>>>>>>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>>>>>>>> version. This avoids the problem that some users are unwilling to move to a
>>>>>>>>>> newer version and keep patching old Spark version branches. If there are
>>>>>>>>>> some features requiring a newer version, it makes sense to move that newer
>>>>>>>>>> version in master.
>>>>>>>>>>
>>>>>>>>>> In addition, because currently Spark is considered the most
>>>>>>>>>> feature-complete reference implementation compared to all other engines, I
>>>>>>>>>> think we should not add artificial barriers that would slow down its
>>>>>>>>>> development speed.
>>>>>>>>>>
>>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Jack Ye
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey folks,
>>>>>>>>>>>
>>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>>
>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is great
>>>>>>>>>>> to support older versions but because we compile against 3.0, we cannot use
>>>>>>>>>>> any Spark features that are offered in newer versions.
>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of
>>>>>>>>>>> important features such dynamic filtering for v2 tables, required
>>>>>>>>>>> distribution and ordering for writes, etc. These features are too important
>>>>>>>>>>> to ignore them.
>>>>>>>>>>>
>>>>>>>>>>> Apart from that, I have an end-to-end prototype for
>>>>>>>>>>> merge-on-read with Spark that actually leverages some of the 3.2 features.
>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us internally and would
>>>>>>>>>>> love to share that with the rest of the community.
>>>>>>>>>>>
>>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>>
>>>>>>>>>>> Option 1
>>>>>>>>>>>
>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by
>>>>>>>>>>> releasing minor versions with bug fixes.
>>>>>>>>>>>
>>>>>>>>>>> Pros: almost no changes to the build configuration, no extra
>>>>>>>>>>> work on our side as just a single Spark version is actively maintained.
>>>>>>>>>>> Cons: some new features that we will be adding to master could
>>>>>>>>>>> also work with older Spark versions but all 0.12 releases will only contain
>>>>>>>>>>> bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to
>>>>>>>>>>> consume any new Spark or format features.
>>>>>>>>>>>
>>>>>>>>>>> Option 2
>>>>>>>>>>>
>>>>>>>>>>> Move our Spark integration into a separate project and introduce
>>>>>>>>>>> branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>
>>>>>>>>>>> Pros: decouples the format version from Spark, we can support as
>>>>>>>>>>> many Spark versions as needed.
>>>>>>>>>>> Cons: more work initially to set everything up, more work to
>>>>>>>>>>> release, will need a new release of the core format to consume any changes
>>>>>>>>>>> in the Spark integration.
>>>>>>>>>>>
>>>>>>>>>>> Overall, I think option 2 seems better for the user but my main
>>>>>>>>>>> worry is that we will have to release the format more frequently (which is
>>>>>>>>>>> a good thing but requires more work and time) and the overall Spark
>>>>>>>>>>> development may be slower.
>>>>>>>>>>>
>>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Anton
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular

Re: [DISCUSS] Spark version support strategy

Posted by Jack Ye <ye...@gmail.com>.

I think in Ryan's proposal we will create a ton of modules anyway, as Wing
listed we are just using git branch as an additional dimension, but my
understanding is that you will still have 1 core, 1 extension, 1 runtime
artifact published for each Spark version in either approach.

In that case, this is just brainstorming, I wonder if we can explore a
modified option 3 that flattens all the versions in each Spark branch in
option 2 into master. The repository structure would look something like:

iceberg/api/...
            /bundled-guava/...
            /core/...
            ...
            /spark/2.4/core/...
                            /extension/...
                            /runtime/...
                      /3.1/core/...
                            /extension/...
                            /runtime/...

The gradle build script in the root is configured to build against the
latest version of Spark by default, unless otherwise specified by the user.

Intellij can also be configured to only index files of specific versions
based on the same config used in build.

In this way, I imagine the CI setup to be much easier to do things like
testing version compatibility for a feature or running only a
specific subset of Spark version builds based on the Spark version
directories touched.

And the biggest benefit is that we don't have the same difficulty as option
2 of developing a feature when it's both in core and Spark.

We can then develop a mechanism to vote to stop support of certain
versions, and archive the corresponding directory to avoid accumulating too
many versions in the long term.

-Jack Ye


On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <bl...@tabular.io> wrote:

> Sorry, I was thinking about CI integration between Iceberg Java and
> Iceberg Spark, I just didn't mention it and I see how that's a big thing to
> leave out!
>
> I would definitely want to test the projects together. One thing we could
> do is have a nightly build like Russell suggests. I'm also wondering if we
> could have some tighter integration where the Iceberg Spark build can be
> included in the Iceberg Java build using properties. Maybe the github
> action could checkout Iceberg, then checkout the Spark integration's latest
> branch, and then run the gradle build with a property that makes Spark a
> subproject in the build. That way we can continue to have Spark CI run
> regularly.
>
> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <ru...@gmail.com>
> wrote:
>
>> I agree that Option 2 is considerably more difficult for development when
>> core API changes need to be picked up by the external Spark module. I also
>> think a monthly release would probably still be prohibitive to actually
>> implementing new features that appear in the API, I would hope we have a
>> much faster process or maybe just have snapshot artifacts published nightly?
>>
>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <wy...@cloudera.com.INVALID>
>> wrote:
>>
>> IIUC, Option 2 is to move the Spark support for Iceberg into a separate
>> repo (subproject of Iceberg). Would we have branches such as 0.13-2.4,
>> 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be supported in all
>> versions or all Spark 3 versions, then we would need to commit the changes
>> to all applicable branches. Basically we are trading more work to commit to
>> multiple branches for simplified build and CI time per branch, which might
>> be an acceptable trade-off. However, the biggest downside is that changes
>> may need to be made in core Iceberg as well as in the engine (in this case
>> Spark) support, and we need to wait for a release of core Iceberg to
>> consume the changes in the subproject. In this case, maybe we should have a
>> monthly release of core Iceberg (no matter how many changes go in, as long
>> as it is non-zero) so that the subproject can consume changes fairly
>> quickly?
>>
>>
>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> Thanks for bringing this up, Anton. I’m glad that we have the set of
>>> potential solutions well defined.
>>>
>>> Looks like the next step is to decide whether we want to require people
>>> to update Spark versions to pick up newer versions of Iceberg. If we choose
>>> to make people upgrade, then option 1 is clearly the best choice.
>>>
>>> I don’t think that we should make updating Spark a requirement. Many of
>>> the things that we’re working on are orthogonal to Spark versions, like
>>> table maintenance actions, secondary indexes, the 1.0 API, views, ORC
>>> delete files, new storage implementations, etc. Upgrading Spark is time
>>> consuming and untrusted in my experience, so I think we would be setting up
>>> an unnecessary trade-off between spending lots of time to upgrade Spark and
>>> picking up new Iceberg features.
>>>
>>> Another way of thinking about this is that if we went with option 1,
>>> then we could port bug fixes into 0.12.x. But there are many things that
>>> wouldn’t fit this model, like adding a FileIO implementation for ADLS. So
>>> some people in the community would have to maintain branches of newer
>>> Iceberg versions with older versions of Spark outside of the main Iceberg
>>> project — that defeats the purpose of simplifying things with option 1
>>> because we would then have more people maintaining the same 0.13.x with
>>> Spark 3.1 branch. (This reminds me of the Spark community, where we wanted
>>> to release a 2.5 line with DSv2 backported, but the community decided not
>>> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>>>
>>> If the community is going to do the work anyway — and I think some of us
>>> would — we should make it possible to share that work. That’s why I don’t
>>> think that we should go with option 1.
>>>
>>> If we don’t go with option 1, then the choice is how to maintain
>>> multiple Spark versions. I think that the way we’re doing it right now is
>>> not something we want to continue.
>>>
>>> Using multiple modules (option 3) is concerning to me because of the
>>> changes in Spark. We currently structure the library to share as much code
>>> as possible. But that means compiling against different Spark versions and
>>> relying on binary compatibility and reflection in some cases. To me, this
>>> seems unmaintainable in the long run because it requires refactoring common
>>> classes and spending a lot of time deduplicating code. It also creates a
>>> ton of modules, at least one common module, then a module per version, then
>>> an extensions module per version, and finally a runtime module per version.
>>> That’s 3 modules per Spark version, plus any new common modules. And each
>>> module needs to be tested, which is making our CI take a really long time.
>>> We also don’t support multiple Scala versions, which is another gap that
>>> will require even more modules and tests.
>>>
>>> I like option 2 because it would allow us to compile against a single
>>> version of Spark (which will be much more reliable). It would give us an
>>> opportunity to support different Scala versions. It avoids the need to
>>> refactor to share code and allows people to focus on a single version of
>>> Spark, while also creating a way for people to maintain and update the
>>> older versions with newer Iceberg releases. I don’t think that this would
>>> slow down development. I think it would actually speed it up because we’d
>>> be spending less time trying to make multiple versions work in the same
>>> build. And anyone in favor of option 1 would basically get option 1: you
>>> don’t have to care about branches for older Spark versions.
>>>
>>> Jack makes a good point about wanting to keep code in a single
>>> repository, but I think that the need to manage more version combinations
>>> overrides this concern. It’s easier to make this decision in python because
>>> we’re not trying to depend on two projects that change relatively quickly.
>>> We’re just trying to build a library.
>>>
>>> Ryan
>>>
>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <op...@gmail.com> wrote:
>>>
>>>> Thanks for bringing this up,  Anton.
>>>>
>>>> Everyone has great pros/cons to support their preferences.  Before
>>>> giving my preference, let me raise one question:    what's the top priority
>>>> thing for apache iceberg project at this point in time ?  This question
>>>> will help us to answer the following question: Should we support more
>>>> engine versions more robustly or be a bit more aggressive and concentrate
>>>> on getting the new features that users need most in order to keep the
>>>> project more competitive ?
>>>>
>>>> If people watch the apache iceberg project and check the issues &
>>>> PR frequently,  I guess more than 90% people will answer the priority
>>>> question:   There is no doubt for making the whole v2 story to be
>>>> production-ready.   The current roadmap discussion also proofs the thing :
>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>>> .
>>>>
>>>> In order to ensure the highest priority at this point in time, I will
>>>> prefer option-1 to reduce the cost of engine maintenance, so as to free up
>>>> resources to make v2 production-ready.
>>>>
>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sa...@gmail.com>
>>>> wrote:
>>>>
>>>>> From Dev's point, it has less burden to always support the latest
>>>>> version of Spark (for example). But from user's point, especially for us
>>>>> who maintain Spark internally, it is not easy to upgrade the Spark version
>>>>> for the first time (since we have many customizations internally), and
>>>>> we're still promoting to upgrade to 3.1.2. If the community ditches the
>>>>> support of old version of Spark3, users have to maintain it themselves
>>>>> unavoidably.
>>>>>
>>>>> So I'm inclined to make this support in community, not by users
>>>>> themselves, as for Option 2 or 3, I'm fine with either. And to relieve the
>>>>> burden, we could support limited versions of Spark (for example 2 versions).
>>>>>
>>>>> Just my two cents.
>>>>>
>>>>> -Saisai
>>>>>
>>>>>
>>>>> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>>>>>
>>>>>> Hi Wing Yew,
>>>>>>
>>>>>> I think 2.4 is a different story, we will continue to support Spark
>>>>>> 2.4, but as you can see it will continue to have very limited
>>>>>> functionalities comparing to Spark 3. I believe we discussed about option 3
>>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the
>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>>>> consistent strategy around this, let's take this chance to make a good
>>>>>> community guideline for all future engine versions, especially for Spark,
>>>>>> Flink and Hive that are in the same repository.
>>>>>>
>>>>>> I can totally understand your point of view Wing, in fact, speaking
>>>>>> from the perspective of AWS EMR, we have to support over 40 versions of the
>>>>>> software because there are people who are still using Spark 1.4, believe it
>>>>>> or not. After all, keep backporting changes will become a liability not
>>>>>> only on the user side, but also on the service provider side, so I believe
>>>>>> it's not a bad practice to push for user upgrade, as it will make the life
>>>>>> of both parties easier in the end. New feature is definitely one of the
>>>>>> best incentives to promote an upgrade on user side.
>>>>>>
>>>>>> I think the biggest issue of option 3 is about its scalability,
>>>>>> because we will have an unbounded list of packages to add and compile in
>>>>>> the future, and we probably cannot drop support of that package once
>>>>>> created. If we go with option 1, I think we can still publish a few patch
>>>>>> versions for old Iceberg releases, and committers can control the amount of
>>>>>> patch versions to guard people from abusing the power of patching. I see
>>>>>> this as a consistent strategy also for Flink and Hive. With this strategy,
>>>>>> we can truly have a compatibility matrix for engine versions against
>>>>>> Iceberg versions.
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>>> wypoon@cloudera.com.invalid> wrote:
>>>>>>
>>>>>>> I understand and sympathize with the desire to use new DSv2 features
>>>>>>> in Spark 3.2. I agree that Option 1 is the easiest for developers, but I
>>>>>>> don't think it considers the interests of users. I do not think that most
>>>>>>> users will upgrade to Spark 3.2 as soon as it is released. It is a "minor
>>>>>>> version" upgrade in name from 3.1 (or from 3.0), but I think we all know
>>>>>>> that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1
>>>>>>> and from 3.1 to 3.2. I think there are even a lot of users running Spark
>>>>>>> 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark
>>>>>>> 2.4?
>>>>>>>
>>>>>>> Please correct me if I'm mistaken, but the folks who have spoken out
>>>>>>> in favor of Option 1 all work for the same organization, don't they? And
>>>>>>> they don't have a problem with making their users, all internal, simply
>>>>>>> upgrade to Spark 3.2, do they? (Or they are already running an internal
>>>>>>> fork that is close to 3.2.)
>>>>>>>
>>>>>>> I work for an organization with customers running different versions
>>>>>>> of Spark. It is true that we can backport new features to older versions if
>>>>>>> we wanted to. I suppose the people contributing to Iceberg work for some
>>>>>>> organization or other that either use Iceberg in-house, or provide software
>>>>>>> (possibly in the form of a service) to customers, and either way, the
>>>>>>> organizations have the ability to backport features and fixes to internal
>>>>>>> versions. Are there any users out there who simply use Apache Iceberg and
>>>>>>> depend on the community version?
>>>>>>>
>>>>>>> There may be features that are broadly useful that do not depend on
>>>>>>> Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>>>>>>
>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I would
>>>>>>> consider Option 3 too. Anton, you said 5 modules are required; what are the
>>>>>>> modules you're thinking of?
>>>>>>>
>>>>>>> - Wing Yew
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <fl...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>>
>>>>>>>> 1. Both 2 and 3 will slow down the development. Considering the
>>>>>>>> limited resources in the open source community, the upsides of option 2 and
>>>>>>>> 3 are probably not worthy.
>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's hard to
>>>>>>>> predict anything, but even if these use cases are legit, users can still
>>>>>>>> get the new feature by backporting it to an older version in case of
>>>>>>>> upgrading to a newer version isn't an option.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Yufei
>>>>>>>>
>>>>>>>> `This is not a contribution`
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> To sum up what we have so far:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>>>>>>
>>>>>>>>> The easiest option for us devs, forces the user to upgrade to the
>>>>>>>>> most recent minor Spark version to consume any new Iceberg
>>>>>>>>> features.
>>>>>>>>>
>>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>>
>>>>>>>>> Can support as many Spark versions as needed and the codebase is
>>>>>>>>> still separate as we can use separate branches.
>>>>>>>>> Impossible to consume any unreleased changes in core, may slow
>>>>>>>>> down the development.
>>>>>>>>>
>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>>
>>>>>>>>> Introduce more modules in the same project.
>>>>>>>>> Can consume unreleased changes but it will required at least 5
>>>>>>>>> modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>>>> complicated.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Are there any users for whom upgrading the minor Spark version
>>>>>>>>> (e3.1 to 3.2) to consume new features is a blocker?
>>>>>>>>> We follow Option 1 internally at the moment but I would like to
>>>>>>>>> hear what other people think/need.
>>>>>>>>>
>>>>>>>>> - Anton
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I think we should go for option 1. I already am not a big fan of
>>>>>>>>> having runtime errors for unsupported things based on versions and I don't
>>>>>>>>> think minor version upgrades are a large issue for users.  I'm especially
>>>>>>>>> not looking forward to supporting interfaces that only exist in Spark 3.2
>>>>>>>>> in a multiple Spark version support future.
>>>>>>>>>
>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>>>>>
>>>>>>>>> First of all, is option 2 a viable option? We discussed separating
>>>>>>>>> the python module outside of the project a few weeks ago, and decided to
>>>>>>>>> not do that because it's beneficial for code cross reference and more
>>>>>>>>> intuitive for new developers to see everything in the same repository. I
>>>>>>>>> would expect the same argument to also hold here.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> That’s exactly the concern I have about Option 2 at this moment.
>>>>>>>>>
>>>>>>>>> Overall I would personally prefer us to not support all the minor
>>>>>>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>>>>>>> version.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is when it gets a bit complicated. If we want to support both
>>>>>>>>> Spark 3.1 and Spark 3.2 with a single module, it means we have to compile
>>>>>>>>> against 3.1. The problem is that we rely on DSv2 that is being actively
>>>>>>>>> developed. 3.2 and 3.1 have substantial differences. On top of that, we
>>>>>>>>> have our extensions that are extremely low-level and may break not only
>>>>>>>>> between minor versions but also between patch releases.
>>>>>>>>>
>>>>>>>>> f there are some features requiring a newer version, it makes
>>>>>>>>> sense to move that newer version in master.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Internally, we don’t deliver new features to older Spark versions
>>>>>>>>> as it requires a lot of effort to port things. Personally, I don’t think it
>>>>>>>>> is too bad to require users to upgrade if they want new features. At the
>>>>>>>>> same time, there are valid concerns with this approach too that we
>>>>>>>>> mentioned during the sync. For example, certain new features would also
>>>>>>>>> work fine with older Spark versions. I generally agree with that and that
>>>>>>>>> not supporting recent versions is not ideal. However, I want to find a
>>>>>>>>> balance between the complexity on our side and ease of use for the users.
>>>>>>>>> Ideally, supporting a few recent versions would be sufficient but our Spark
>>>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> First of all, is option 2 a viable option? We discussed separating
>>>>>>>>> the python module outside of the project a few weeks ago, and decided to
>>>>>>>>> not do that because it's beneficial for code cross reference and more
>>>>>>>>> intuitive for new developers to see everything in the same repository. I
>>>>>>>>> would expect the same argument to also hold here.
>>>>>>>>>
>>>>>>>>> Overall I would personally prefer us to not support all the minor
>>>>>>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>>>>>>> version. This avoids the problem that some users are unwilling to move to a
>>>>>>>>> newer version and keep patching old Spark version branches. If there are
>>>>>>>>> some features requiring a newer version, it makes sense to move that newer
>>>>>>>>> version in master.
>>>>>>>>>
>>>>>>>>> In addition, because currently Spark is considered the most
>>>>>>>>> feature-complete reference implementation compared to all other engines, I
>>>>>>>>> think we should not add artificial barriers that would slow down its
>>>>>>>>> development speed.
>>>>>>>>>
>>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jack Ye
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hey folks,
>>>>>>>>>>
>>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>>
>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is great to
>>>>>>>>>> support older versions but because we compile against 3.0, we cannot use
>>>>>>>>>> any Spark features that are offered in newer versions.
>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of
>>>>>>>>>> important features such dynamic filtering for v2 tables, required
>>>>>>>>>> distribution and ordering for writes, etc. These features are too important
>>>>>>>>>> to ignore them.
>>>>>>>>>>
>>>>>>>>>> Apart from that, I have an end-to-end prototype for merge-on-read
>>>>>>>>>> with Spark that actually leverages some of the 3.2 features. I’ll be
>>>>>>>>>> implementing all new Spark DSv2 APIs for us internally and would love to
>>>>>>>>>> share that with the rest of the community.
>>>>>>>>>>
>>>>>>>>>> I see two options to move forward:
>>>>>>>>>>
>>>>>>>>>> Option 1
>>>>>>>>>>
>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by
>>>>>>>>>> releasing minor versions with bug fixes.
>>>>>>>>>>
>>>>>>>>>> Pros: almost no changes to the build configuration, no extra work
>>>>>>>>>> on our side as just a single Spark version is actively maintained.
>>>>>>>>>> Cons: some new features that we will be adding to master could
>>>>>>>>>> also work with older Spark versions but all 0.12 releases will only contain
>>>>>>>>>> bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to
>>>>>>>>>> consume any new Spark or format features.
>>>>>>>>>>
>>>>>>>>>> Option 2
>>>>>>>>>>
>>>>>>>>>> Move our Spark integration into a separate project and introduce
>>>>>>>>>> branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>
>>>>>>>>>> Pros: decouples the format version from Spark, we can support as
>>>>>>>>>> many Spark versions as needed.
>>>>>>>>>> Cons: more work initially to set everything up, more work to
>>>>>>>>>> release, will need a new release of the core format to consume any changes
>>>>>>>>>> in the Spark integration.
>>>>>>>>>>
>>>>>>>>>> Overall, I think option 2 seems better for the user but my main
>>>>>>>>>> worry is that we will have to release the format more frequently (which is
>>>>>>>>>> a good thing but requires more work and time) and the overall Spark
>>>>>>>>>> development may be slower.
>>>>>>>>>>
>>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Anton
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>>
>
> --
> Ryan Blue
> Tabular
>

Re: [DISCUSS] Spark version support strategy

Posted by Ryan Blue <bl...@tabular.io>.

Sorry, I was thinking about CI integration between Iceberg Java and Iceberg
Spark, I just didn't mention it and I see how that's a big thing to leave
out!

I would definitely want to test the projects together. One thing we could
do is have a nightly build like Russell suggests. I'm also wondering if we
could have some tighter integration where the Iceberg Spark build can be
included in the Iceberg Java build using properties. Maybe the github
action could checkout Iceberg, then checkout the Spark integration's latest
branch, and then run the gradle build with a property that makes Spark a
subproject in the build. That way we can continue to have Spark CI run
regularly.

On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer <ru...@gmail.com>
wrote:

> I agree that Option 2 is considerably more difficult for development when
> core API changes need to be picked up by the external Spark module. I also
> think a monthly release would probably still be prohibitive to actually
> implementing new features that appear in the API, I would hope we have a
> much faster process or maybe just have snapshot artifacts published nightly?
>
> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <wy...@cloudera.com.INVALID>
> wrote:
>
> IIUC, Option 2 is to move the Spark support for Iceberg into a separate
> repo (subproject of Iceberg). Would we have branches such as 0.13-2.4,
> 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be supported in all
> versions or all Spark 3 versions, then we would need to commit the changes
> to all applicable branches. Basically we are trading more work to commit to
> multiple branches for simplified build and CI time per branch, which might
> be an acceptable trade-off. However, the biggest downside is that changes
> may need to be made in core Iceberg as well as in the engine (in this case
> Spark) support, and we need to wait for a release of core Iceberg to
> consume the changes in the subproject. In this case, maybe we should have a
> monthly release of core Iceberg (no matter how many changes go in, as long
> as it is non-zero) so that the subproject can consume changes fairly
> quickly?
>
>
> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <bl...@tabular.io> wrote:
>
>> Thanks for bringing this up, Anton. I’m glad that we have the set of
>> potential solutions well defined.
>>
>> Looks like the next step is to decide whether we want to require people
>> to update Spark versions to pick up newer versions of Iceberg. If we choose
>> to make people upgrade, then option 1 is clearly the best choice.
>>
>> I don’t think that we should make updating Spark a requirement. Many of
>> the things that we’re working on are orthogonal to Spark versions, like
>> table maintenance actions, secondary indexes, the 1.0 API, views, ORC
>> delete files, new storage implementations, etc. Upgrading Spark is time
>> consuming and untrusted in my experience, so I think we would be setting up
>> an unnecessary trade-off between spending lots of time to upgrade Spark and
>> picking up new Iceberg features.
>>
>> Another way of thinking about this is that if we went with option 1, then
>> we could port bug fixes into 0.12.x. But there are many things that
>> wouldn’t fit this model, like adding a FileIO implementation for ADLS. So
>> some people in the community would have to maintain branches of newer
>> Iceberg versions with older versions of Spark outside of the main Iceberg
>> project — that defeats the purpose of simplifying things with option 1
>> because we would then have more people maintaining the same 0.13.x with
>> Spark 3.1 branch. (This reminds me of the Spark community, where we wanted
>> to release a 2.5 line with DSv2 backported, but the community decided not
>> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>>
>> If the community is going to do the work anyway — and I think some of us
>> would — we should make it possible to share that work. That’s why I don’t
>> think that we should go with option 1.
>>
>> If we don’t go with option 1, then the choice is how to maintain multiple
>> Spark versions. I think that the way we’re doing it right now is not
>> something we want to continue.
>>
>> Using multiple modules (option 3) is concerning to me because of the
>> changes in Spark. We currently structure the library to share as much code
>> as possible. But that means compiling against different Spark versions and
>> relying on binary compatibility and reflection in some cases. To me, this
>> seems unmaintainable in the long run because it requires refactoring common
>> classes and spending a lot of time deduplicating code. It also creates a
>> ton of modules, at least one common module, then a module per version, then
>> an extensions module per version, and finally a runtime module per version.
>> That’s 3 modules per Spark version, plus any new common modules. And each
>> module needs to be tested, which is making our CI take a really long time.
>> We also don’t support multiple Scala versions, which is another gap that
>> will require even more modules and tests.
>>
>> I like option 2 because it would allow us to compile against a single
>> version of Spark (which will be much more reliable). It would give us an
>> opportunity to support different Scala versions. It avoids the need to
>> refactor to share code and allows people to focus on a single version of
>> Spark, while also creating a way for people to maintain and update the
>> older versions with newer Iceberg releases. I don’t think that this would
>> slow down development. I think it would actually speed it up because we’d
>> be spending less time trying to make multiple versions work in the same
>> build. And anyone in favor of option 1 would basically get option 1: you
>> don’t have to care about branches for older Spark versions.
>>
>> Jack makes a good point about wanting to keep code in a single
>> repository, but I think that the need to manage more version combinations
>> overrides this concern. It’s easier to make this decision in python because
>> we’re not trying to depend on two projects that change relatively quickly.
>> We’re just trying to build a library.
>>
>> Ryan
>>
>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <op...@gmail.com> wrote:
>>
>>> Thanks for bringing this up,  Anton.
>>>
>>> Everyone has great pros/cons to support their preferences.  Before
>>> giving my preference, let me raise one question:    what's the top priority
>>> thing for apache iceberg project at this point in time ?  This question
>>> will help us to answer the following question: Should we support more
>>> engine versions more robustly or be a bit more aggressive and concentrate
>>> on getting the new features that users need most in order to keep the
>>> project more competitive ?
>>>
>>> If people watch the apache iceberg project and check the issues &
>>> PR frequently,  I guess more than 90% people will answer the priority
>>> question:   There is no doubt for making the whole v2 story to be
>>> production-ready.   The current roadmap discussion also proofs the thing :
>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>> .
>>>
>>> In order to ensure the highest priority at this point in time, I will
>>> prefer option-1 to reduce the cost of engine maintenance, so as to free up
>>> resources to make v2 production-ready.
>>>
>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sa...@gmail.com>
>>> wrote:
>>>
>>>> From Dev's point, it has less burden to always support the latest
>>>> version of Spark (for example). But from user's point, especially for us
>>>> who maintain Spark internally, it is not easy to upgrade the Spark version
>>>> for the first time (since we have many customizations internally), and
>>>> we're still promoting to upgrade to 3.1.2. If the community ditches the
>>>> support of old version of Spark3, users have to maintain it themselves
>>>> unavoidably.
>>>>
>>>> So I'm inclined to make this support in community, not by users
>>>> themselves, as for Option 2 or 3, I'm fine with either. And to relieve the
>>>> burden, we could support limited versions of Spark (for example 2 versions).
>>>>
>>>> Just my two cents.
>>>>
>>>> -Saisai
>>>>
>>>>
>>>> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>>>>
>>>>> Hi Wing Yew,
>>>>>
>>>>> I think 2.4 is a different story, we will continue to support Spark
>>>>> 2.4, but as you can see it will continue to have very limited
>>>>> functionalities comparing to Spark 3. I believe we discussed about option 3
>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the
>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>>> consistent strategy around this, let's take this chance to make a good
>>>>> community guideline for all future engine versions, especially for Spark,
>>>>> Flink and Hive that are in the same repository.
>>>>>
>>>>> I can totally understand your point of view Wing, in fact, speaking
>>>>> from the perspective of AWS EMR, we have to support over 40 versions of the
>>>>> software because there are people who are still using Spark 1.4, believe it
>>>>> or not. After all, keep backporting changes will become a liability not
>>>>> only on the user side, but also on the service provider side, so I believe
>>>>> it's not a bad practice to push for user upgrade, as it will make the life
>>>>> of both parties easier in the end. New feature is definitely one of the
>>>>> best incentives to promote an upgrade on user side.
>>>>>
>>>>> I think the biggest issue of option 3 is about its scalability,
>>>>> because we will have an unbounded list of packages to add and compile in
>>>>> the future, and we probably cannot drop support of that package once
>>>>> created. If we go with option 1, I think we can still publish a few patch
>>>>> versions for old Iceberg releases, and committers can control the amount of
>>>>> patch versions to guard people from abusing the power of patching. I see
>>>>> this as a consistent strategy also for Flink and Hive. With this strategy,
>>>>> we can truly have a compatibility matrix for engine versions against
>>>>> Iceberg versions.
>>>>>
>>>>> -Jack
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <
>>>>> wypoon@cloudera.com.invalid> wrote:
>>>>>
>>>>>> I understand and sympathize with the desire to use new DSv2 features
>>>>>> in Spark 3.2. I agree that Option 1 is the easiest for developers, but I
>>>>>> don't think it considers the interests of users. I do not think that most
>>>>>> users will upgrade to Spark 3.2 as soon as it is released. It is a "minor
>>>>>> version" upgrade in name from 3.1 (or from 3.0), but I think we all know
>>>>>> that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1
>>>>>> and from 3.1 to 3.2. I think there are even a lot of users running Spark
>>>>>> 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark
>>>>>> 2.4?
>>>>>>
>>>>>> Please correct me if I'm mistaken, but the folks who have spoken out
>>>>>> in favor of Option 1 all work for the same organization, don't they? And
>>>>>> they don't have a problem with making their users, all internal, simply
>>>>>> upgrade to Spark 3.2, do they? (Or they are already running an internal
>>>>>> fork that is close to 3.2.)
>>>>>>
>>>>>> I work for an organization with customers running different versions
>>>>>> of Spark. It is true that we can backport new features to older versions if
>>>>>> we wanted to. I suppose the people contributing to Iceberg work for some
>>>>>> organization or other that either use Iceberg in-house, or provide software
>>>>>> (possibly in the form of a service) to customers, and either way, the
>>>>>> organizations have the ability to backport features and fixes to internal
>>>>>> versions. Are there any users out there who simply use Apache Iceberg and
>>>>>> depend on the community version?
>>>>>>
>>>>>> There may be features that are broadly useful that do not depend on
>>>>>> Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>>>>>
>>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I would
>>>>>> consider Option 3 too. Anton, you said 5 modules are required; what are the
>>>>>> modules you're thinking of?
>>>>>>
>>>>>> - Wing Yew
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <fl...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>>
>>>>>>> 1. Both 2 and 3 will slow down the development. Considering the
>>>>>>> limited resources in the open source community, the upsides of option 2 and
>>>>>>> 3 are probably not worthy.
>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's hard to
>>>>>>> predict anything, but even if these use cases are legit, users can still
>>>>>>> get the new feature by backporting it to an older version in case of
>>>>>>> upgrading to a newer version isn't an option.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Yufei
>>>>>>>
>>>>>>> `This is not a contribution`
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <
>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>
>>>>>>>> To sum up what we have so far:
>>>>>>>>
>>>>>>>>
>>>>>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>>>>>
>>>>>>>> The easiest option for us devs, forces the user to upgrade to the
>>>>>>>> most recent minor Spark version to consume any new Iceberg features
>>>>>>>> .
>>>>>>>>
>>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>>
>>>>>>>> Can support as many Spark versions as needed and the codebase is
>>>>>>>> still separate as we can use separate branches.
>>>>>>>> Impossible to consume any unreleased changes in core, may slow down
>>>>>>>> the development.
>>>>>>>>
>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>>
>>>>>>>> Introduce more modules in the same project.
>>>>>>>> Can consume unreleased changes but it will required at least 5
>>>>>>>> modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>>> complicated.
>>>>>>>>
>>>>>>>>
>>>>>>>> Are there any users for whom upgrading the minor Spark version
>>>>>>>> (e3.1 to 3.2) to consume new features is a blocker?
>>>>>>>> We follow Option 1 internally at the moment but I would like to
>>>>>>>> hear what other people think/need.
>>>>>>>>
>>>>>>>> - Anton
>>>>>>>>
>>>>>>>>
>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <
>>>>>>>> russell.spitzer@gmail.com> wrote:
>>>>>>>>
>>>>>>>> I think we should go for option 1. I already am not a big fan of
>>>>>>>> having runtime errors for unsupported things based on versions and I don't
>>>>>>>> think minor version upgrades are a large issue for users.  I'm especially
>>>>>>>> not looking forward to supporting interfaces that only exist in Spark 3.2
>>>>>>>> in a multiple Spark version support future.
>>>>>>>>
>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>>>>
>>>>>>>> First of all, is option 2 a viable option? We discussed separating
>>>>>>>> the python module outside of the project a few weeks ago, and decided to
>>>>>>>> not do that because it's beneficial for code cross reference and more
>>>>>>>> intuitive for new developers to see everything in the same repository. I
>>>>>>>> would expect the same argument to also hold here.
>>>>>>>>
>>>>>>>>
>>>>>>>> That’s exactly the concern I have about Option 2 at this moment.
>>>>>>>>
>>>>>>>> Overall I would personally prefer us to not support all the minor
>>>>>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>>>>>> version.
>>>>>>>>
>>>>>>>>
>>>>>>>> This is when it gets a bit complicated. If we want to support both
>>>>>>>> Spark 3.1 and Spark 3.2 with a single module, it means we have to compile
>>>>>>>> against 3.1. The problem is that we rely on DSv2 that is being actively
>>>>>>>> developed. 3.2 and 3.1 have substantial differences. On top of that, we
>>>>>>>> have our extensions that are extremely low-level and may break not only
>>>>>>>> between minor versions but also between patch releases.
>>>>>>>>
>>>>>>>> f there are some features requiring a newer version, it makes sense
>>>>>>>> to move that newer version in master.
>>>>>>>>
>>>>>>>>
>>>>>>>> Internally, we don’t deliver new features to older Spark versions
>>>>>>>> as it requires a lot of effort to port things. Personally, I don’t think it
>>>>>>>> is too bad to require users to upgrade if they want new features. At the
>>>>>>>> same time, there are valid concerns with this approach too that we
>>>>>>>> mentioned during the sync. For example, certain new features would also
>>>>>>>> work fine with older Spark versions. I generally agree with that and that
>>>>>>>> not supporting recent versions is not ideal. However, I want to find a
>>>>>>>> balance between the complexity on our side and ease of use for the users.
>>>>>>>> Ideally, supporting a few recent versions would be sufficient but our Spark
>>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> First of all, is option 2 a viable option? We discussed separating
>>>>>>>> the python module outside of the project a few weeks ago, and decided to
>>>>>>>> not do that because it's beneficial for code cross reference and more
>>>>>>>> intuitive for new developers to see everything in the same repository. I
>>>>>>>> would expect the same argument to also hold here.
>>>>>>>>
>>>>>>>> Overall I would personally prefer us to not support all the minor
>>>>>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>>>>>> version. This avoids the problem that some users are unwilling to move to a
>>>>>>>> newer version and keep patching old Spark version branches. If there are
>>>>>>>> some features requiring a newer version, it makes sense to move that newer
>>>>>>>> version in master.
>>>>>>>>
>>>>>>>> In addition, because currently Spark is considered the most
>>>>>>>> feature-complete reference implementation compared to all other engines, I
>>>>>>>> think we should not add artificial barriers that would slow down its
>>>>>>>> development speed.
>>>>>>>>
>>>>>>>> So my thinking is closer to option 1.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jack Ye
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hey folks,
>>>>>>>>>
>>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>>
>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is great to
>>>>>>>>> support older versions but because we compile against 3.0, we cannot use
>>>>>>>>> any Spark features that are offered in newer versions.
>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of
>>>>>>>>> important features such dynamic filtering for v2 tables, required
>>>>>>>>> distribution and ordering for writes, etc. These features are too important
>>>>>>>>> to ignore them.
>>>>>>>>>
>>>>>>>>> Apart from that, I have an end-to-end prototype for merge-on-read
>>>>>>>>> with Spark that actually leverages some of the 3.2 features. I’ll be
>>>>>>>>> implementing all new Spark DSv2 APIs for us internally and would love to
>>>>>>>>> share that with the rest of the community.
>>>>>>>>>
>>>>>>>>> I see two options to move forward:
>>>>>>>>>
>>>>>>>>> Option 1
>>>>>>>>>
>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by
>>>>>>>>> releasing minor versions with bug fixes.
>>>>>>>>>
>>>>>>>>> Pros: almost no changes to the build configuration, no extra work
>>>>>>>>> on our side as just a single Spark version is actively maintained.
>>>>>>>>> Cons: some new features that we will be adding to master could
>>>>>>>>> also work with older Spark versions but all 0.12 releases will only contain
>>>>>>>>> bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to
>>>>>>>>> consume any new Spark or format features.
>>>>>>>>>
>>>>>>>>> Option 2
>>>>>>>>>
>>>>>>>>> Move our Spark integration into a separate project and introduce
>>>>>>>>> branches for 3.0, 3.1 and 3.2.
>>>>>>>>>
>>>>>>>>> Pros: decouples the format version from Spark, we can support as
>>>>>>>>> many Spark versions as needed.
>>>>>>>>> Cons: more work initially to set everything up, more work to
>>>>>>>>> release, will need a new release of the core format to consume any changes
>>>>>>>>> in the Spark integration.
>>>>>>>>>
>>>>>>>>> Overall, I think option 2 seems better for the user but my main
>>>>>>>>> worry is that we will have to release the format more frequently (which is
>>>>>>>>> a good thing but requires more work and time) and the overall Spark
>>>>>>>>> development may be slower.
>>>>>>>>>
>>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Anton
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
>

-- 
Ryan Blue
Tabular

Re: [DISCUSS] Spark version support strategy

Posted by Russell Spitzer <ru...@gmail.com>.

I agree that Option 2 is considerably more difficult for development when core API changes need to be picked up by the external Spark module. I also think a monthly release would probably still be prohibitive to actually implementing new features that appear in the API, I would hope we have a much faster process or maybe just have snapshot artifacts published nightly?

> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon <wy...@cloudera.com.INVALID> wrote:
> 
> IIUC, Option 2 is to move the Spark support for Iceberg into a separate repo (subproject of Iceberg). Would we have branches such as 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be supported in all versions or all Spark 3 versions, then we would need to commit the changes to all applicable branches. Basically we are trading more work to commit to multiple branches for simplified build and CI time per branch, which might be an acceptable trade-off. However, the biggest downside is that changes may need to be made in core Iceberg as well as in the engine (in this case Spark) support, and we need to wait for a release of core Iceberg to consume the changes in the subproject. In this case, maybe we should have a monthly release of core Iceberg (no matter how many changes go in, as long as it is non-zero) so that the subproject can consume changes fairly quickly?
> 
> 
> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <blue@tabular.io <ma...@tabular.io>> wrote:
> Thanks for bringing this up, Anton. I’m glad that we have the set of potential solutions well defined.
> 
> Looks like the next step is to decide whether we want to require people to update Spark versions to pick up newer versions of Iceberg. If we choose to make people upgrade, then option 1 is clearly the best choice.
> 
> I don’t think that we should make updating Spark a requirement. Many of the things that we’re working on are orthogonal to Spark versions, like table maintenance actions, secondary indexes, the 1.0 API, views, ORC delete files, new storage implementations, etc. Upgrading Spark is time consuming and untrusted in my experience, so I think we would be setting up an unnecessary trade-off between spending lots of time to upgrade Spark and picking up new Iceberg features.
> 
> Another way of thinking about this is that if we went with option 1, then we could port bug fixes into 0.12.x. But there are many things that wouldn’t fit this model, like adding a FileIO implementation for ADLS. So some people in the community would have to maintain branches of newer Iceberg versions with older versions of Spark outside of the main Iceberg project — that defeats the purpose of simplifying things with option 1 because we would then have more people maintaining the same 0.13.x with Spark 3.1 branch. (This reminds me of the Spark community, where we wanted to release a 2.5 line with DSv2 backported, but the community decided not to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
> 
> If the community is going to do the work anyway — and I think some of us would — we should make it possible to share that work. That’s why I don’t think that we should go with option 1.
> 
> If we don’t go with option 1, then the choice is how to maintain multiple Spark versions. I think that the way we’re doing it right now is not something we want to continue.
> 
> Using multiple modules (option 3) is concerning to me because of the changes in Spark. We currently structure the library to share as much code as possible. But that means compiling against different Spark versions and relying on binary compatibility and reflection in some cases. To me, this seems unmaintainable in the long run because it requires refactoring common classes and spending a lot of time deduplicating code. It also creates a ton of modules, at least one common module, then a module per version, then an extensions module per version, and finally a runtime module per version. That’s 3 modules per Spark version, plus any new common modules. And each module needs to be tested, which is making our CI take a really long time. We also don’t support multiple Scala versions, which is another gap that will require even more modules and tests.
> 
> I like option 2 because it would allow us to compile against a single version of Spark (which will be much more reliable). It would give us an opportunity to support different Scala versions. It avoids the need to refactor to share code and allows people to focus on a single version of Spark, while also creating a way for people to maintain and update the older versions with newer Iceberg releases. I don’t think that this would slow down development. I think it would actually speed it up because we’d be spending less time trying to make multiple versions work in the same build. And anyone in favor of option 1 would basically get option 1: you don’t have to care about branches for older Spark versions.
> 
> Jack makes a good point about wanting to keep code in a single repository, but I think that the need to manage more version combinations overrides this concern. It’s easier to make this decision in python because we’re not trying to depend on two projects that change relatively quickly. We’re just trying to build a library.
> 
> Ryan
> 
> 
> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <openinx@gmail.com <ma...@gmail.com>> wrote:
> Thanks for bringing this up,  Anton. 
> 
> Everyone has great pros/cons to support their preferences.  Before giving my preference, let me raise one question:    what's the top priority thing for apache iceberg project at this point in time ?  This question will help us to answer the following question: Should we support more engine versions more robustly or be a bit more aggressive and concentrate on getting the new features that users need most in order to keep the project more competitive ? 
> 
> If people watch the apache iceberg project and check the issues & PR frequently,  I guess more than 90% people will answer the priority question:   There is no doubt for making the whole v2 story to be production-ready.   The current roadmap discussion also proofs the thing : https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E <https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E> .   
> 
> In order to ensure the highest priority at this point in time, I will prefer option-1 to reduce the cost of engine maintenance, so as to free up resources to make v2 production-ready. 
> 
> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sai.sai.shao@gmail.com <ma...@gmail.com>> wrote:
> From Dev's point, it has less burden to always support the latest version of Spark (for example). But from user's point, especially for us who maintain Spark internally, it is not easy to upgrade the Spark version for the first time (since we have many customizations internally), and we're still promoting to upgrade to 3.1.2. If the community ditches the support of old version of Spark3, users have to maintain it themselves unavoidably. 
> 
> So I'm inclined to make this support in community, not by users themselves, as for Option 2 or 3, I'm fine with either. And to relieve the burden, we could support limited versions of Spark (for example 2 versions).
> 
> Just my two cents.
> 
> -Saisai
> 
> 
> Jack Ye <yezhaoqin@gmail.com <ma...@gmail.com>> 于2021年9月15日周三 下午1:35写道：
> Hi Wing Yew,
> 
> I think 2.4 is a different story, we will continue to support Spark 2.4, but as you can see it will continue to have very limited functionalities comparing to Spark 3. I believe we discussed about option 3 when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a consistent strategy around this, let's take this chance to make a good community guideline for all future engine versions, especially for Spark, Flink and Hive that are in the same repository.
> 
> I can totally understand your point of view Wing, in fact, speaking from the perspective of AWS EMR, we have to support over 40 versions of the software because there are people who are still using Spark 1.4, believe it or not. After all, keep backporting changes will become a liability not only on the user side, but also on the service provider side, so I believe it's not a bad practice to push for user upgrade, as it will make the life of both parties easier in the end. New feature is definitely one of the best incentives to promote an upgrade on user side.
> 
> I think the biggest issue of option 3 is about its scalability, because we will have an unbounded list of packages to add and compile in the future, and we probably cannot drop support of that package once created. If we go with option 1, I think we can still publish a few patch versions for old Iceberg releases, and committers can control the amount of patch versions to guard people from abusing the power of patching. I see this as a consistent strategy also for Flink and Hive. With this strategy, we can truly have a compatibility matrix for engine versions against Iceberg versions.
> 
> -Jack
> 
> 
> 
> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <wy...@cloudera.com.invalid> wrote:
> I understand and sympathize with the desire to use new DSv2 features in Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't think it considers the interests of users. I do not think that most users will upgrade to Spark 3.2 as soon as it is released. It is a "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all know that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1 and from 3.1 to 3.2. I think there are even a lot of users running Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark 2.4?
> 
> Please correct me if I'm mistaken, but the folks who have spoken out in favor of Option 1 all work for the same organization, don't they? And they don't have a problem with making their users, all internal, simply upgrade to Spark 3.2, do they? (Or they are already running an internal fork that is close to 3.2.)
> 
> I work for an organization with customers running different versions of Spark. It is true that we can backport new features to older versions if we wanted to. I suppose the people contributing to Iceberg work for some organization or other that either use Iceberg in-house, or provide software (possibly in the form of a service) to customers, and either way, the organizations have the ability to backport features and fixes to internal versions. Are there any users out there who simply use Apache Iceberg and depend on the community version?
> 
> There may be features that are broadly useful that do not depend on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
> 
> I am not in favor of Option 2. I do not oppose Option 1, but I would consider Option 3 too. Anton, you said 5 modules are required; what are the modules you're thinking of?
> 
> - Wing Yew
> 
> 
> 
> 
> 
> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <flyrain000@gmail.com <ma...@gmail.com>> wrote:
> Option 1 sounds good to me. Here are my reasons:
> 
> 1. Both 2 and 3 will slow down the development. Considering the limited resources in the open source community, the upsides of option 2 and 3 are probably not worthy.
> 2. Both 2 and 3 assume the use cases may not exist. It's hard to predict anything, but even if these use cases are legit, users can still get the new feature by backporting it to an older version in case of upgrading to a newer version isn't an option.
> 
> Best,
> 
> Yufei
> 
> `This is not a contribution`
> 
> 
> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <ao...@apple.com.invalid> wrote:
> To sum up what we have so far:
> 
> 
> Option 1 (support just the most recent minor Spark 3 version)
> 
> The easiest option for us devs, forces the user to upgrade to the most recent minor Spark version to consume any new Iceberg features.
> 
> Option 2 (a separate project under Iceberg)
> 
> Can support as many Spark versions as needed and the codebase is still separate as we can use separate branches.
> Impossible to consume any unreleased changes in core, may slow down the development.
> 
> Option 3 (separate modules for Spark 3.1/3.2)
> 
> Introduce more modules in the same project.
> Can consume unreleased changes but it will required at least 5 modules to support 2.4, 3.1 and 3.2, making the build and testing complicated.
> 
> 
> Are there any users for whom upgrading the minor Spark version (e3.1 to 3.2) to consume new features is a blocker?
> We follow Option 1 internally at the moment but I would like to hear what other people think/need.
> 
> - Anton
> 
> 
>> On 14 Sep 2021, at 09:44, Russell Spitzer <russell.spitzer@gmail.com <ma...@gmail.com>> wrote:
>> 
>> I think we should go for option 1. I already am not a big fan of having runtime errors for unsupported things based on versions and I don't think minor version upgrades are a large issue for users.  I'm especially not looking forward to supporting interfaces that only exist in Spark 3.2 in a multiple Spark version support future.
>> 
>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <aokolnychyi@apple.com.INVALID <ma...@apple.com.INVALID>> wrote:
>>> 
>>>> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here. 
>>> 
>>> That’s exactly the concern I have about Option 2 at this moment.
>>> 
>>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just 2-3 latest versions in a major version. 
>>> 
>>> This is when it gets a bit complicated. If we want to support both Spark 3.1 and Spark 3.2 with a single module, it means we have to compile against 3.1. The problem is that we rely on DSv2 that is being actively developed. 3.2 and 3.1 have substantial differences. On top of that, we have our extensions that are extremely low-level and may break not only between minor versions but also between patch releases.
>>> 
>>>> f there are some features requiring a newer version, it makes sense to move that newer version in master.
>>> 
>>> Internally, we don’t deliver new features to older Spark versions as it requires a lot of effort to port things. Personally, I don’t think it is too bad to require users to upgrade if they want new features. At the same time, there are valid concerns with this approach too that we mentioned during the sync. For example, certain new features would also work fine with older Spark versions. I generally agree with that and that not supporting recent versions is not ideal. However, I want to find a balance between the complexity on our side and ease of use for the users. Ideally, supporting a few recent versions would be sufficient but our Spark integration is too low-level to do that with a single module.
>>>  
>>> 
>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhaoqin@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here. 
>>>> 
>>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just 2-3 latest versions in a major version. This avoids the problem that some users are unwilling to move to a newer version and keep patching old Spark version branches. If there are some features requiring a newer version, it makes sense to move that newer version in master.
>>>> 
>>>> In addition, because currently Spark is considered the most feature-complete reference implementation compared to all other engines, I think we should not add artificial barriers that would slow down its development speed.
>>>> 
>>>> So my thinking is closer to option 1.
>>>> 
>>>> Best,
>>>> Jack Ye
>>>> 
>>>> 
>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <aokolnychyi@apple.com.invalid <ma...@apple.com.invalid>> wrote:
>>>> Hey folks,
>>>> 
>>>> I want to discuss our Spark version support strategy.
>>>> 
>>>> So far, we have tried to support both 3.0 and 3.1. It is great to support older versions but because we compile against 3.0, we cannot use any Spark features that are offered in newer versions.
>>>> Spark 3.2 is just around the corner and it brings a lot of important features such dynamic filtering for v2 tables, required distribution and ordering for writes, etc. These features are too important to ignore them.
>>>> 
>>>> Apart from that, I have an end-to-end prototype for merge-on-read with Spark that actually leverages some of the 3.2 features. I’ll be implementing all new Spark DSv2 APIs for us internally and would love to share that with the rest of the community.
>>>> 
>>>> I see two options to move forward:
>>>> 
>>>> Option 1
>>>> 
>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing minor versions with bug fixes.
>>>> 
>>>> Pros: almost no changes to the build configuration, no extra work on our side as just a single Spark version is actively maintained.
>>>> Cons: some new features that we will be adding to master could also work with older Spark versions but all 0.12 releases will only contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume any new Spark or format features.
>>>> 
>>>> Option 2
>>>> 
>>>> Move our Spark integration into a separate project and introduce branches for 3.0, 3.1 and 3.2.
>>>> 
>>>> Pros: decouples the format version from Spark, we can support as many Spark versions as needed.
>>>> Cons: more work initially to set everything up, more work to release, will need a new release of the core format to consume any changes in the Spark integration.
>>>> 
>>>> Overall, I think option 2 seems better for the user but my main worry is that we will have to release the format more frequently (which is a good thing but requires more work and time) and the overall Spark development may be slower.
>>>> 
>>>> I’d love to hear what everybody thinks about this matter.
>>>> 
>>>> Thanks,
>>>> Anton
>>> 
>> 
> 
> 
> 
> -- 
> Ryan Blue
> Tabular

Re: [DISCUSS] Spark version support strategy

Posted by Wing Yew Poon <wy...@cloudera.com.INVALID>.

IIUC, Option 2 is to move the Spark support for Iceberg into a separate
repo (subproject of Iceberg). Would we have branches such as 0.13-2.4,
0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be supported in all
versions or all Spark 3 versions, then we would need to commit the changes
to all applicable branches. Basically we are trading more work to commit to
multiple branches for simplified build and CI time per branch, which might
be an acceptable trade-off. However, the biggest downside is that changes
may need to be made in core Iceberg as well as in the engine (in this case
Spark) support, and we need to wait for a release of core Iceberg to
consume the changes in the subproject. In this case, maybe we should have a
monthly release of core Iceberg (no matter how many changes go in, as long
as it is non-zero) so that the subproject can consume changes fairly
quickly?


On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <bl...@tabular.io> wrote:

> Thanks for bringing this up, Anton. I’m glad that we have the set of
> potential solutions well defined.
>
> Looks like the next step is to decide whether we want to require people to
> update Spark versions to pick up newer versions of Iceberg. If we choose to
> make people upgrade, then option 1 is clearly the best choice.
>
> I don’t think that we should make updating Spark a requirement. Many of
> the things that we’re working on are orthogonal to Spark versions, like
> table maintenance actions, secondary indexes, the 1.0 API, views, ORC
> delete files, new storage implementations, etc. Upgrading Spark is time
> consuming and untrusted in my experience, so I think we would be setting up
> an unnecessary trade-off between spending lots of time to upgrade Spark and
> picking up new Iceberg features.
>
> Another way of thinking about this is that if we went with option 1, then
> we could port bug fixes into 0.12.x. But there are many things that
> wouldn’t fit this model, like adding a FileIO implementation for ADLS. So
> some people in the community would have to maintain branches of newer
> Iceberg versions with older versions of Spark outside of the main Iceberg
> project — that defeats the purpose of simplifying things with option 1
> because we would then have more people maintaining the same 0.13.x with
> Spark 3.1 branch. (This reminds me of the Spark community, where we wanted
> to release a 2.5 line with DSv2 backported, but the community decided not
> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>
> If the community is going to do the work anyway — and I think some of us
> would — we should make it possible to share that work. That’s why I don’t
> think that we should go with option 1.
>
> If we don’t go with option 1, then the choice is how to maintain multiple
> Spark versions. I think that the way we’re doing it right now is not
> something we want to continue.
>
> Using multiple modules (option 3) is concerning to me because of the
> changes in Spark. We currently structure the library to share as much code
> as possible. But that means compiling against different Spark versions and
> relying on binary compatibility and reflection in some cases. To me, this
> seems unmaintainable in the long run because it requires refactoring common
> classes and spending a lot of time deduplicating code. It also creates a
> ton of modules, at least one common module, then a module per version, then
> an extensions module per version, and finally a runtime module per version.
> That’s 3 modules per Spark version, plus any new common modules. And each
> module needs to be tested, which is making our CI take a really long time.
> We also don’t support multiple Scala versions, which is another gap that
> will require even more modules and tests.
>
> I like option 2 because it would allow us to compile against a single
> version of Spark (which will be much more reliable). It would give us an
> opportunity to support different Scala versions. It avoids the need to
> refactor to share code and allows people to focus on a single version of
> Spark, while also creating a way for people to maintain and update the
> older versions with newer Iceberg releases. I don’t think that this would
> slow down development. I think it would actually speed it up because we’d
> be spending less time trying to make multiple versions work in the same
> build. And anyone in favor of option 1 would basically get option 1: you
> don’t have to care about branches for older Spark versions.
>
> Jack makes a good point about wanting to keep code in a single repository,
> but I think that the need to manage more version combinations overrides
> this concern. It’s easier to make this decision in python because we’re not
> trying to depend on two projects that change relatively quickly. We’re just
> trying to build a library.
>
> Ryan
>
> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <op...@gmail.com> wrote:
>
>> Thanks for bringing this up,  Anton.
>>
>> Everyone has great pros/cons to support their preferences.  Before giving
>> my preference, let me raise one question:    what's the top priority thing
>> for apache iceberg project at this point in time ?  This question will help
>> us to answer the following question: Should we support more engine versions
>> more robustly or be a bit more aggressive and concentrate on getting the
>> new features that users need most in order to keep the project more
>> competitive ?
>>
>> If people watch the apache iceberg project and check the issues &
>> PR frequently,  I guess more than 90% people will answer the priority
>> question:   There is no doubt for making the whole v2 story to be
>> production-ready.   The current roadmap discussion also proofs the thing :
>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>> .
>>
>> In order to ensure the highest priority at this point in time, I will
>> prefer option-1 to reduce the cost of engine maintenance, so as to free up
>> resources to make v2 production-ready.
>>
>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sa...@gmail.com>
>> wrote:
>>
>>> From Dev's point, it has less burden to always support the latest
>>> version of Spark (for example). But from user's point, especially for us
>>> who maintain Spark internally, it is not easy to upgrade the Spark version
>>> for the first time (since we have many customizations internally), and
>>> we're still promoting to upgrade to 3.1.2. If the community ditches the
>>> support of old version of Spark3, users have to maintain it themselves
>>> unavoidably.
>>>
>>> So I'm inclined to make this support in community, not by users
>>> themselves, as for Option 2 or 3, I'm fine with either. And to relieve the
>>> burden, we could support limited versions of Spark (for example 2 versions).
>>>
>>> Just my two cents.
>>>
>>> -Saisai
>>>
>>>
>>> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>>>
>>>> Hi Wing Yew,
>>>>
>>>> I think 2.4 is a different story, we will continue to support Spark
>>>> 2.4, but as you can see it will continue to have very limited
>>>> functionalities comparing to Spark 3. I believe we discussed about option 3
>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the
>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a
>>>> consistent strategy around this, let's take this chance to make a good
>>>> community guideline for all future engine versions, especially for Spark,
>>>> Flink and Hive that are in the same repository.
>>>>
>>>> I can totally understand your point of view Wing, in fact, speaking
>>>> from the perspective of AWS EMR, we have to support over 40 versions of the
>>>> software because there are people who are still using Spark 1.4, believe it
>>>> or not. After all, keep backporting changes will become a liability not
>>>> only on the user side, but also on the service provider side, so I believe
>>>> it's not a bad practice to push for user upgrade, as it will make the life
>>>> of both parties easier in the end. New feature is definitely one of the
>>>> best incentives to promote an upgrade on user side.
>>>>
>>>> I think the biggest issue of option 3 is about its scalability, because
>>>> we will have an unbounded list of packages to add and compile in the
>>>> future, and we probably cannot drop support of that package once created.
>>>> If we go with option 1, I think we can still publish a few patch versions
>>>> for old Iceberg releases, and committers can control the amount of patch
>>>> versions to guard people from abusing the power of patching. I see this as
>>>> a consistent strategy also for Flink and Hive. With this strategy, we can
>>>> truly have a compatibility matrix for engine versions against Iceberg
>>>> versions.
>>>>
>>>> -Jack
>>>>
>>>>
>>>>
>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon
>>>> <wy...@cloudera.com.invalid> wrote:
>>>>
>>>>> I understand and sympathize with the desire to use new DSv2 features
>>>>> in Spark 3.2. I agree that Option 1 is the easiest for developers, but I
>>>>> don't think it considers the interests of users. I do not think that most
>>>>> users will upgrade to Spark 3.2 as soon as it is released. It is a "minor
>>>>> version" upgrade in name from 3.1 (or from 3.0), but I think we all know
>>>>> that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1
>>>>> and from 3.1 to 3.2. I think there are even a lot of users running Spark
>>>>> 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark
>>>>> 2.4?
>>>>>
>>>>> Please correct me if I'm mistaken, but the folks who have spoken out
>>>>> in favor of Option 1 all work for the same organization, don't they? And
>>>>> they don't have a problem with making their users, all internal, simply
>>>>> upgrade to Spark 3.2, do they? (Or they are already running an internal
>>>>> fork that is close to 3.2.)
>>>>>
>>>>> I work for an organization with customers running different versions
>>>>> of Spark. It is true that we can backport new features to older versions if
>>>>> we wanted to. I suppose the people contributing to Iceberg work for some
>>>>> organization or other that either use Iceberg in-house, or provide software
>>>>> (possibly in the form of a service) to customers, and either way, the
>>>>> organizations have the ability to backport features and fixes to internal
>>>>> versions. Are there any users out there who simply use Apache Iceberg and
>>>>> depend on the community version?
>>>>>
>>>>> There may be features that are broadly useful that do not depend on
>>>>> Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>>>>
>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I would
>>>>> consider Option 3 too. Anton, you said 5 modules are required; what are the
>>>>> modules you're thinking of?
>>>>>
>>>>> - Wing Yew
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <fl...@gmail.com> wrote:
>>>>>
>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>
>>>>>> 1. Both 2 and 3 will slow down the development. Considering the
>>>>>> limited resources in the open source community, the upsides of option 2 and
>>>>>> 3 are probably not worthy.
>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's hard to
>>>>>> predict anything, but even if these use cases are legit, users can still
>>>>>> get the new feature by backporting it to an older version in case of
>>>>>> upgrading to a newer version isn't an option.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Yufei
>>>>>>
>>>>>> `This is not a contribution`
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi
>>>>>> <ao...@apple.com.invalid> wrote:
>>>>>>
>>>>>>> To sum up what we have so far:
>>>>>>>
>>>>>>>
>>>>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>>>>
>>>>>>> The easiest option for us devs, forces the user to upgrade to the
>>>>>>> most recent minor Spark version to consume any new Iceberg features.
>>>>>>>
>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>
>>>>>>> Can support as many Spark versions as needed and the codebase is
>>>>>>> still separate as we can use separate branches.
>>>>>>> Impossible to consume any unreleased changes in core, may slow down
>>>>>>> the development.
>>>>>>>
>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>
>>>>>>> Introduce more modules in the same project.
>>>>>>> Can consume unreleased changes but it will required at least 5
>>>>>>> modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>> complicated.
>>>>>>>
>>>>>>>
>>>>>>> Are there any users for whom upgrading the minor Spark version (e3.1
>>>>>>> to 3.2) to consume new features is a blocker?
>>>>>>> We follow Option 1 internally at the moment but I would like to hear
>>>>>>> what other people think/need.
>>>>>>>
>>>>>>> - Anton
>>>>>>>
>>>>>>>
>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <ru...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I think we should go for option 1. I already am not a big fan of
>>>>>>> having runtime errors for unsupported things based on versions and I don't
>>>>>>> think minor version upgrades are a large issue for users.  I'm especially
>>>>>>> not looking forward to supporting interfaces that only exist in Spark 3.2
>>>>>>> in a multiple Spark version support future.
>>>>>>>
>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>>>
>>>>>>> First of all, is option 2 a viable option? We discussed separating
>>>>>>> the python module outside of the project a few weeks ago, and decided to
>>>>>>> not do that because it's beneficial for code cross reference and more
>>>>>>> intuitive for new developers to see everything in the same repository. I
>>>>>>> would expect the same argument to also hold here.
>>>>>>>
>>>>>>>
>>>>>>> That’s exactly the concern I have about Option 2 at this moment.
>>>>>>>
>>>>>>> Overall I would personally prefer us to not support all the minor
>>>>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>>>>> version.
>>>>>>>
>>>>>>>
>>>>>>> This is when it gets a bit complicated. If we want to support both
>>>>>>> Spark 3.1 and Spark 3.2 with a single module, it means we have to compile
>>>>>>> against 3.1. The problem is that we rely on DSv2 that is being actively
>>>>>>> developed. 3.2 and 3.1 have substantial differences. On top of that, we
>>>>>>> have our extensions that are extremely low-level and may break not only
>>>>>>> between minor versions but also between patch releases.
>>>>>>>
>>>>>>> f there are some features requiring a newer version, it makes sense
>>>>>>> to move that newer version in master.
>>>>>>>
>>>>>>>
>>>>>>> Internally, we don’t deliver new features to older Spark versions as
>>>>>>> it requires a lot of effort to port things. Personally, I don’t think it is
>>>>>>> too bad to require users to upgrade if they want new features. At the same
>>>>>>> time, there are valid concerns with this approach too that we mentioned
>>>>>>> during the sync. For example, certain new features would also work fine
>>>>>>> with older Spark versions. I generally agree with that and that not
>>>>>>> supporting recent versions is not ideal. However, I want to find a balance
>>>>>>> between the complexity on our side and ease of use for the users. Ideally,
>>>>>>> supporting a few recent versions would be sufficient but our Spark
>>>>>>> integration is too low-level to do that with a single module.
>>>>>>>
>>>>>>>
>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com> wrote:
>>>>>>>
>>>>>>> First of all, is option 2 a viable option? We discussed separating
>>>>>>> the python module outside of the project a few weeks ago, and decided to
>>>>>>> not do that because it's beneficial for code cross reference and more
>>>>>>> intuitive for new developers to see everything in the same repository. I
>>>>>>> would expect the same argument to also hold here.
>>>>>>>
>>>>>>> Overall I would personally prefer us to not support all the minor
>>>>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>>>>> version. This avoids the problem that some users are unwilling to move to a
>>>>>>> newer version and keep patching old Spark version branches. If there are
>>>>>>> some features requiring a newer version, it makes sense to move that newer
>>>>>>> version in master.
>>>>>>>
>>>>>>> In addition, because currently Spark is considered the most
>>>>>>> feature-complete reference implementation compared to all other engines, I
>>>>>>> think we should not add artificial barriers that would slow down its
>>>>>>> development speed.
>>>>>>>
>>>>>>> So my thinking is closer to option 1.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jack Ye
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hey folks,
>>>>>>>>
>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>
>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is great to
>>>>>>>> support older versions but because we compile against 3.0, we cannot use
>>>>>>>> any Spark features that are offered in newer versions.
>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of
>>>>>>>> important features such dynamic filtering for v2 tables, required
>>>>>>>> distribution and ordering for writes, etc. These features are too important
>>>>>>>> to ignore them.
>>>>>>>>
>>>>>>>> Apart from that, I have an end-to-end prototype for merge-on-read
>>>>>>>> with Spark that actually leverages some of the 3.2 features. I’ll be
>>>>>>>> implementing all new Spark DSv2 APIs for us internally and would love to
>>>>>>>> share that with the rest of the community.
>>>>>>>>
>>>>>>>> I see two options to move forward:
>>>>>>>>
>>>>>>>> Option 1
>>>>>>>>
>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by
>>>>>>>> releasing minor versions with bug fixes.
>>>>>>>>
>>>>>>>> Pros: almost no changes to the build configuration, no extra work
>>>>>>>> on our side as just a single Spark version is actively maintained.
>>>>>>>> Cons: some new features that we will be adding to master could also
>>>>>>>> work with older Spark versions but all 0.12 releases will only contain bug
>>>>>>>> fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume
>>>>>>>> any new Spark or format features.
>>>>>>>>
>>>>>>>> Option 2
>>>>>>>>
>>>>>>>> Move our Spark integration into a separate project and introduce
>>>>>>>> branches for 3.0, 3.1 and 3.2.
>>>>>>>>
>>>>>>>> Pros: decouples the format version from Spark, we can support as
>>>>>>>> many Spark versions as needed.
>>>>>>>> Cons: more work initially to set everything up, more work to
>>>>>>>> release, will need a new release of the core format to consume any changes
>>>>>>>> in the Spark integration.
>>>>>>>>
>>>>>>>> Overall, I think option 2 seems better for the user but my main
>>>>>>>> worry is that we will have to release the format more frequently (which is
>>>>>>>> a good thing but requires more work and time) and the overall Spark
>>>>>>>> development may be slower.
>>>>>>>>
>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anton
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>
> --
> Ryan Blue
> Tabular
>

Re: [DISCUSS] Spark version support strategy

Posted by Ryan Blue <bl...@tabular.io>.

Thanks for bringing this up, Anton. I’m glad that we have the set of
potential solutions well defined.

Looks like the next step is to decide whether we want to require people to
update Spark versions to pick up newer versions of Iceberg. If we choose to
make people upgrade, then option 1 is clearly the best choice.

I don’t think that we should make updating Spark a requirement. Many of the
things that we’re working on are orthogonal to Spark versions, like table
maintenance actions, secondary indexes, the 1.0 API, views, ORC delete
files, new storage implementations, etc. Upgrading Spark is time consuming
and untrusted in my experience, so I think we would be setting up an
unnecessary trade-off between spending lots of time to upgrade Spark and
picking up new Iceberg features.

Another way of thinking about this is that if we went with option 1, then
we could port bug fixes into 0.12.x. But there are many things that
wouldn’t fit this model, like adding a FileIO implementation for ADLS. So
some people in the community would have to maintain branches of newer
Iceberg versions with older versions of Spark outside of the main Iceberg
project — that defeats the purpose of simplifying things with option 1
because we would then have more people maintaining the same 0.13.x with
Spark 3.1 branch. (This reminds me of the Spark community, where we wanted
to release a 2.5 line with DSv2 backported, but the community decided not
to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)

If the community is going to do the work anyway — and I think some of us
would — we should make it possible to share that work. That’s why I don’t
think that we should go with option 1.

If we don’t go with option 1, then the choice is how to maintain multiple
Spark versions. I think that the way we’re doing it right now is not
something we want to continue.

Using multiple modules (option 3) is concerning to me because of the
changes in Spark. We currently structure the library to share as much code
as possible. But that means compiling against different Spark versions and
relying on binary compatibility and reflection in some cases. To me, this
seems unmaintainable in the long run because it requires refactoring common
classes and spending a lot of time deduplicating code. It also creates a
ton of modules, at least one common module, then a module per version, then
an extensions module per version, and finally a runtime module per version.
That’s 3 modules per Spark version, plus any new common modules. And each
module needs to be tested, which is making our CI take a really long time.
We also don’t support multiple Scala versions, which is another gap that
will require even more modules and tests.

I like option 2 because it would allow us to compile against a single
version of Spark (which will be much more reliable). It would give us an
opportunity to support different Scala versions. It avoids the need to
refactor to share code and allows people to focus on a single version of
Spark, while also creating a way for people to maintain and update the
older versions with newer Iceberg releases. I don’t think that this would
slow down development. I think it would actually speed it up because we’d
be spending less time trying to make multiple versions work in the same
build. And anyone in favor of option 1 would basically get option 1: you
don’t have to care about branches for older Spark versions.

Jack makes a good point about wanting to keep code in a single repository,
but I think that the need to manage more version combinations overrides
this concern. It’s easier to make this decision in python because we’re not
trying to depend on two projects that change relatively quickly. We’re just
trying to build a library.

Ryan

On Wed, Sep 15, 2021 at 2:58 AM OpenInx <op...@gmail.com> wrote:

> Thanks for bringing this up,  Anton.
>
> Everyone has great pros/cons to support their preferences.  Before giving
> my preference, let me raise one question:    what's the top priority thing
> for apache iceberg project at this point in time ?  This question will help
> us to answer the following question: Should we support more engine versions
> more robustly or be a bit more aggressive and concentrate on getting the
> new features that users need most in order to keep the project more
> competitive ?
>
> If people watch the apache iceberg project and check the issues &
> PR frequently,  I guess more than 90% people will answer the priority
> question:   There is no doubt for making the whole v2 story to be
> production-ready.   The current roadmap discussion also proofs the thing :
> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
> .
>
> In order to ensure the highest priority at this point in time, I will
> prefer option-1 to reduce the cost of engine maintenance, so as to free up
> resources to make v2 production-ready.
>
> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sa...@gmail.com>
> wrote:
>
>> From Dev's point, it has less burden to always support the latest version
>> of Spark (for example). But from user's point, especially for us who
>> maintain Spark internally, it is not easy to upgrade the Spark version for
>> the first time (since we have many customizations internally), and we're
>> still promoting to upgrade to 3.1.2. If the community ditches the support
>> of old version of Spark3, users have to maintain it themselves unavoidably.
>>
>> So I'm inclined to make this support in community, not by users
>> themselves, as for Option 2 or 3, I'm fine with either. And to relieve the
>> burden, we could support limited versions of Spark (for example 2 versions).
>>
>> Just my two cents.
>>
>> -Saisai
>>
>>
>> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>>
>>> Hi Wing Yew,
>>>
>>> I think 2.4 is a different story, we will continue to support Spark 2.4,
>>> but as you can see it will continue to have very limited functionalities
>>> comparing to Spark 3. I believe we discussed about option 3 when we were
>>> doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the same issue for
>>> Flink 1.11, 1.12 and 1.13 as well. I feel we need a consistent strategy
>>> around this, let's take this chance to make a good community guideline for
>>> all future engine versions, especially for Spark, Flink and Hive that are
>>> in the same repository.
>>>
>>> I can totally understand your point of view Wing, in fact, speaking from
>>> the perspective of AWS EMR, we have to support over 40 versions of the
>>> software because there are people who are still using Spark 1.4, believe it
>>> or not. After all, keep backporting changes will become a liability not
>>> only on the user side, but also on the service provider side, so I believe
>>> it's not a bad practice to push for user upgrade, as it will make the life
>>> of both parties easier in the end. New feature is definitely one of the
>>> best incentives to promote an upgrade on user side.
>>>
>>> I think the biggest issue of option 3 is about its scalability, because
>>> we will have an unbounded list of packages to add and compile in the
>>> future, and we probably cannot drop support of that package once created.
>>> If we go with option 1, I think we can still publish a few patch versions
>>> for old Iceberg releases, and committers can control the amount of patch
>>> versions to guard people from abusing the power of patching. I see this as
>>> a consistent strategy also for Flink and Hive. With this strategy, we can
>>> truly have a compatibility matrix for engine versions against Iceberg
>>> versions.
>>>
>>> -Jack
>>>
>>>
>>>
>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon
>>> <wy...@cloudera.com.invalid> wrote:
>>>
>>>> I understand and sympathize with the desire to use new DSv2 features in
>>>> Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't
>>>> think it considers the interests of users. I do not think that most users
>>>> will upgrade to Spark 3.2 as soon as it is released. It is a "minor
>>>> version" upgrade in name from 3.1 (or from 3.0), but I think we all know
>>>> that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1
>>>> and from 3.1 to 3.2. I think there are even a lot of users running Spark
>>>> 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark
>>>> 2.4?
>>>>
>>>> Please correct me if I'm mistaken, but the folks who have spoken out in
>>>> favor of Option 1 all work for the same organization, don't they? And they
>>>> don't have a problem with making their users, all internal, simply upgrade
>>>> to Spark 3.2, do they? (Or they are already running an internal fork that
>>>> is close to 3.2.)
>>>>
>>>> I work for an organization with customers running different versions of
>>>> Spark. It is true that we can backport new features to older versions if we
>>>> wanted to. I suppose the people contributing to Iceberg work for some
>>>> organization or other that either use Iceberg in-house, or provide software
>>>> (possibly in the form of a service) to customers, and either way, the
>>>> organizations have the ability to backport features and fixes to internal
>>>> versions. Are there any users out there who simply use Apache Iceberg and
>>>> depend on the community version?
>>>>
>>>> There may be features that are broadly useful that do not depend on
>>>> Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>>>
>>>> I am not in favor of Option 2. I do not oppose Option 1, but I would
>>>> consider Option 3 too. Anton, you said 5 modules are required; what are the
>>>> modules you're thinking of?
>>>>
>>>> - Wing Yew
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <fl...@gmail.com> wrote:
>>>>
>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>
>>>>> 1. Both 2 and 3 will slow down the development. Considering the
>>>>> limited resources in the open source community, the upsides of option 2 and
>>>>> 3 are probably not worthy.
>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's hard to
>>>>> predict anything, but even if these use cases are legit, users can still
>>>>> get the new feature by backporting it to an older version in case of
>>>>> upgrading to a newer version isn't an option.
>>>>>
>>>>> Best,
>>>>>
>>>>> Yufei
>>>>>
>>>>> `This is not a contribution`
>>>>>
>>>>>
>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi
>>>>> <ao...@apple.com.invalid> wrote:
>>>>>
>>>>>> To sum up what we have so far:
>>>>>>
>>>>>>
>>>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>>>
>>>>>> The easiest option for us devs, forces the user to upgrade to the
>>>>>> most recent minor Spark version to consume any new Iceberg features.
>>>>>>
>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>
>>>>>> Can support as many Spark versions as needed and the codebase is
>>>>>> still separate as we can use separate branches.
>>>>>> Impossible to consume any unreleased changes in core, may slow down
>>>>>> the development.
>>>>>>
>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>
>>>>>> Introduce more modules in the same project.
>>>>>> Can consume unreleased changes but it will required at least 5
>>>>>> modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>> complicated.
>>>>>>
>>>>>>
>>>>>> Are there any users for whom upgrading the minor Spark version (e3.1
>>>>>> to 3.2) to consume new features is a blocker?
>>>>>> We follow Option 1 internally at the moment but I would like to hear
>>>>>> what other people think/need.
>>>>>>
>>>>>> - Anton
>>>>>>
>>>>>>
>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <ru...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> I think we should go for option 1. I already am not a big fan of
>>>>>> having runtime errors for unsupported things based on versions and I don't
>>>>>> think minor version upgrades are a large issue for users.  I'm especially
>>>>>> not looking forward to supporting interfaces that only exist in Spark 3.2
>>>>>> in a multiple Spark version support future.
>>>>>>
>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>>
>>>>>> First of all, is option 2 a viable option? We discussed separating
>>>>>> the python module outside of the project a few weeks ago, and decided to
>>>>>> not do that because it's beneficial for code cross reference and more
>>>>>> intuitive for new developers to see everything in the same repository. I
>>>>>> would expect the same argument to also hold here.
>>>>>>
>>>>>>
>>>>>> That’s exactly the concern I have about Option 2 at this moment.
>>>>>>
>>>>>> Overall I would personally prefer us to not support all the minor
>>>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>>>> version.
>>>>>>
>>>>>>
>>>>>> This is when it gets a bit complicated. If we want to support both
>>>>>> Spark 3.1 and Spark 3.2 with a single module, it means we have to compile
>>>>>> against 3.1. The problem is that we rely on DSv2 that is being actively
>>>>>> developed. 3.2 and 3.1 have substantial differences. On top of that, we
>>>>>> have our extensions that are extremely low-level and may break not only
>>>>>> between minor versions but also between patch releases.
>>>>>>
>>>>>> f there are some features requiring a newer version, it makes sense
>>>>>> to move that newer version in master.
>>>>>>
>>>>>>
>>>>>> Internally, we don’t deliver new features to older Spark versions as
>>>>>> it requires a lot of effort to port things. Personally, I don’t think it is
>>>>>> too bad to require users to upgrade if they want new features. At the same
>>>>>> time, there are valid concerns with this approach too that we mentioned
>>>>>> during the sync. For example, certain new features would also work fine
>>>>>> with older Spark versions. I generally agree with that and that not
>>>>>> supporting recent versions is not ideal. However, I want to find a balance
>>>>>> between the complexity on our side and ease of use for the users. Ideally,
>>>>>> supporting a few recent versions would be sufficient but our Spark
>>>>>> integration is too low-level to do that with a single module.
>>>>>>
>>>>>>
>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com> wrote:
>>>>>>
>>>>>> First of all, is option 2 a viable option? We discussed separating
>>>>>> the python module outside of the project a few weeks ago, and decided to
>>>>>> not do that because it's beneficial for code cross reference and more
>>>>>> intuitive for new developers to see everything in the same repository. I
>>>>>> would expect the same argument to also hold here.
>>>>>>
>>>>>> Overall I would personally prefer us to not support all the minor
>>>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>>>> version. This avoids the problem that some users are unwilling to move to a
>>>>>> newer version and keep patching old Spark version branches. If there are
>>>>>> some features requiring a newer version, it makes sense to move that newer
>>>>>> version in master.
>>>>>>
>>>>>> In addition, because currently Spark is considered the most
>>>>>> feature-complete reference implementation compared to all other engines, I
>>>>>> think we should not add artificial barriers that would slow down its
>>>>>> development speed.
>>>>>>
>>>>>> So my thinking is closer to option 1.
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>>
>>>>>>> Hey folks,
>>>>>>>
>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>
>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is great to
>>>>>>> support older versions but because we compile against 3.0, we cannot use
>>>>>>> any Spark features that are offered in newer versions.
>>>>>>> Spark 3.2 is just around the corner and it brings a lot of important
>>>>>>> features such dynamic filtering for v2 tables, required distribution and
>>>>>>> ordering for writes, etc. These features are too important to ignore them.
>>>>>>>
>>>>>>> Apart from that, I have an end-to-end prototype for merge-on-read
>>>>>>> with Spark that actually leverages some of the 3.2 features. I’ll be
>>>>>>> implementing all new Spark DSv2 APIs for us internally and would love to
>>>>>>> share that with the rest of the community.
>>>>>>>
>>>>>>> I see two options to move forward:
>>>>>>>
>>>>>>> Option 1
>>>>>>>
>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by
>>>>>>> releasing minor versions with bug fixes.
>>>>>>>
>>>>>>> Pros: almost no changes to the build configuration, no extra work on
>>>>>>> our side as just a single Spark version is actively maintained.
>>>>>>> Cons: some new features that we will be adding to master could also
>>>>>>> work with older Spark versions but all 0.12 releases will only contain bug
>>>>>>> fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume
>>>>>>> any new Spark or format features.
>>>>>>>
>>>>>>> Option 2
>>>>>>>
>>>>>>> Move our Spark integration into a separate project and introduce
>>>>>>> branches for 3.0, 3.1 and 3.2.
>>>>>>>
>>>>>>> Pros: decouples the format version from Spark, we can support as
>>>>>>> many Spark versions as needed.
>>>>>>> Cons: more work initially to set everything up, more work to
>>>>>>> release, will need a new release of the core format to consume any changes
>>>>>>> in the Spark integration.
>>>>>>>
>>>>>>> Overall, I think option 2 seems better for the user but my main
>>>>>>> worry is that we will have to release the format more frequently (which is
>>>>>>> a good thing but requires more work and time) and the overall Spark
>>>>>>> development may be slower.
>>>>>>>
>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anton
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>

-- 
Ryan Blue
Tabular

Re: [DISCUSS] Spark version support strategy

Posted by OpenInx <op...@gmail.com>.

Thanks for bringing this up,  Anton.

Everyone has great pros/cons to support their preferences.  Before giving
my preference, let me raise one question:    what's the top priority thing
for apache iceberg project at this point in time ?  This question will help
us to answer the following question: Should we support more engine versions
more robustly or be a bit more aggressive and concentrate on getting the
new features that users need most in order to keep the project more
competitive ?

If people watch the apache iceberg project and check the issues &
PR frequently,  I guess more than 90% people will answer the priority
question:   There is no doubt for making the whole v2 story to be
production-ready.   The current roadmap discussion also proofs the thing :
https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
.

In order to ensure the highest priority at this point in time, I will
prefer option-1 to reduce the cost of engine maintenance, so as to free up
resources to make v2 production-ready.

On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sa...@gmail.com> wrote:

> From Dev's point, it has less burden to always support the latest version
> of Spark (for example). But from user's point, especially for us who
> maintain Spark internally, it is not easy to upgrade the Spark version for
> the first time (since we have many customizations internally), and we're
> still promoting to upgrade to 3.1.2. If the community ditches the support
> of old version of Spark3, users have to maintain it themselves unavoidably.
>
> So I'm inclined to make this support in community, not by users
> themselves, as for Option 2 or 3, I'm fine with either. And to relieve the
> burden, we could support limited versions of Spark (for example 2 versions).
>
> Just my two cents.
>
> -Saisai
>
>
> Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：
>
>> Hi Wing Yew,
>>
>> I think 2.4 is a different story, we will continue to support Spark 2.4,
>> but as you can see it will continue to have very limited functionalities
>> comparing to Spark 3. I believe we discussed about option 3 when we were
>> doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the same issue for
>> Flink 1.11, 1.12 and 1.13 as well. I feel we need a consistent strategy
>> around this, let's take this chance to make a good community guideline for
>> all future engine versions, especially for Spark, Flink and Hive that are
>> in the same repository.
>>
>> I can totally understand your point of view Wing, in fact, speaking from
>> the perspective of AWS EMR, we have to support over 40 versions of the
>> software because there are people who are still using Spark 1.4, believe it
>> or not. After all, keep backporting changes will become a liability not
>> only on the user side, but also on the service provider side, so I believe
>> it's not a bad practice to push for user upgrade, as it will make the life
>> of both parties easier in the end. New feature is definitely one of the
>> best incentives to promote an upgrade on user side.
>>
>> I think the biggest issue of option 3 is about its scalability, because
>> we will have an unbounded list of packages to add and compile in the
>> future, and we probably cannot drop support of that package once created.
>> If we go with option 1, I think we can still publish a few patch versions
>> for old Iceberg releases, and committers can control the amount of patch
>> versions to guard people from abusing the power of patching. I see this as
>> a consistent strategy also for Flink and Hive. With this strategy, we can
>> truly have a compatibility matrix for engine versions against Iceberg
>> versions.
>>
>> -Jack
>>
>>
>>
>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon
>> <wy...@cloudera.com.invalid> wrote:
>>
>>> I understand and sympathize with the desire to use new DSv2 features in
>>> Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't
>>> think it considers the interests of users. I do not think that most users
>>> will upgrade to Spark 3.2 as soon as it is released. It is a "minor
>>> version" upgrade in name from 3.1 (or from 3.0), but I think we all know
>>> that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1
>>> and from 3.1 to 3.2. I think there are even a lot of users running Spark
>>> 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark
>>> 2.4?
>>>
>>> Please correct me if I'm mistaken, but the folks who have spoken out in
>>> favor of Option 1 all work for the same organization, don't they? And they
>>> don't have a problem with making their users, all internal, simply upgrade
>>> to Spark 3.2, do they? (Or they are already running an internal fork that
>>> is close to 3.2.)
>>>
>>> I work for an organization with customers running different versions of
>>> Spark. It is true that we can backport new features to older versions if we
>>> wanted to. I suppose the people contributing to Iceberg work for some
>>> organization or other that either use Iceberg in-house, or provide software
>>> (possibly in the form of a service) to customers, and either way, the
>>> organizations have the ability to backport features and fixes to internal
>>> versions. Are there any users out there who simply use Apache Iceberg and
>>> depend on the community version?
>>>
>>> There may be features that are broadly useful that do not depend on
>>> Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>>
>>> I am not in favor of Option 2. I do not oppose Option 1, but I would
>>> consider Option 3 too. Anton, you said 5 modules are required; what are the
>>> modules you're thinking of?
>>>
>>> - Wing Yew
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <fl...@gmail.com> wrote:
>>>
>>>> Option 1 sounds good to me. Here are my reasons:
>>>>
>>>> 1. Both 2 and 3 will slow down the development. Considering the limited
>>>> resources in the open source community, the upsides of option 2 and 3 are
>>>> probably not worthy.
>>>> 2. Both 2 and 3 assume the use cases may not exist. It's hard to
>>>> predict anything, but even if these use cases are legit, users can still
>>>> get the new feature by backporting it to an older version in case of
>>>> upgrading to a newer version isn't an option.
>>>>
>>>> Best,
>>>>
>>>> Yufei
>>>>
>>>> `This is not a contribution`
>>>>
>>>>
>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi
>>>> <ao...@apple.com.invalid> wrote:
>>>>
>>>>> To sum up what we have so far:
>>>>>
>>>>>
>>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>>
>>>>> The easiest option for us devs, forces the user to upgrade to the most
>>>>> recent minor Spark version to consume any new Iceberg features.
>>>>>
>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>
>>>>> Can support as many Spark versions as needed and the codebase is still
>>>>> separate as we can use separate branches.
>>>>> Impossible to consume any unreleased changes in core, may slow down
>>>>> the development.
>>>>>
>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>
>>>>> Introduce more modules in the same project.
>>>>> Can consume unreleased changes but it will required at least 5 modules
>>>>> to support 2.4, 3.1 and 3.2, making the build and testing complicated.
>>>>>
>>>>>
>>>>> Are there any users for whom upgrading the minor Spark version (e3.1
>>>>> to 3.2) to consume new features is a blocker?
>>>>> We follow Option 1 internally at the moment but I would like to hear
>>>>> what other people think/need.
>>>>>
>>>>> - Anton
>>>>>
>>>>>
>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <ru...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> I think we should go for option 1. I already am not a big fan of
>>>>> having runtime errors for unsupported things based on versions and I don't
>>>>> think minor version upgrades are a large issue for users.  I'm especially
>>>>> not looking forward to supporting interfaces that only exist in Spark 3.2
>>>>> in a multiple Spark version support future.
>>>>>
>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>>
>>>>> First of all, is option 2 a viable option? We discussed separating the
>>>>> python module outside of the project a few weeks ago, and decided to not do
>>>>> that because it's beneficial for code cross reference and more intuitive
>>>>> for new developers to see everything in the same repository. I would expect
>>>>> the same argument to also hold here.
>>>>>
>>>>>
>>>>> That’s exactly the concern I have about Option 2 at this moment.
>>>>>
>>>>> Overall I would personally prefer us to not support all the minor
>>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>>> version.
>>>>>
>>>>>
>>>>> This is when it gets a bit complicated. If we want to support both
>>>>> Spark 3.1 and Spark 3.2 with a single module, it means we have to compile
>>>>> against 3.1. The problem is that we rely on DSv2 that is being actively
>>>>> developed. 3.2 and 3.1 have substantial differences. On top of that, we
>>>>> have our extensions that are extremely low-level and may break not only
>>>>> between minor versions but also between patch releases.
>>>>>
>>>>> f there are some features requiring a newer version, it makes sense to
>>>>> move that newer version in master.
>>>>>
>>>>>
>>>>> Internally, we don’t deliver new features to older Spark versions as
>>>>> it requires a lot of effort to port things. Personally, I don’t think it is
>>>>> too bad to require users to upgrade if they want new features. At the same
>>>>> time, there are valid concerns with this approach too that we mentioned
>>>>> during the sync. For example, certain new features would also work fine
>>>>> with older Spark versions. I generally agree with that and that not
>>>>> supporting recent versions is not ideal. However, I want to find a balance
>>>>> between the complexity on our side and ease of use for the users. Ideally,
>>>>> supporting a few recent versions would be sufficient but our Spark
>>>>> integration is too low-level to do that with a single module.
>>>>>
>>>>>
>>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com> wrote:
>>>>>
>>>>> First of all, is option 2 a viable option? We discussed separating the
>>>>> python module outside of the project a few weeks ago, and decided to not do
>>>>> that because it's beneficial for code cross reference and more intuitive
>>>>> for new developers to see everything in the same repository. I would expect
>>>>> the same argument to also hold here.
>>>>>
>>>>> Overall I would personally prefer us to not support all the minor
>>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>>> version. This avoids the problem that some users are unwilling to move to a
>>>>> newer version and keep patching old Spark version branches. If there are
>>>>> some features requiring a newer version, it makes sense to move that newer
>>>>> version in master.
>>>>>
>>>>> In addition, because currently Spark is considered the most
>>>>> feature-complete reference implementation compared to all other engines, I
>>>>> think we should not add artificial barriers that would slow down its
>>>>> development speed.
>>>>>
>>>>> So my thinking is closer to option 1.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>>
>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> I want to discuss our Spark version support strategy.
>>>>>>
>>>>>> So far, we have tried to support both 3.0 and 3.1. It is great to
>>>>>> support older versions but because we compile against 3.0, we cannot use
>>>>>> any Spark features that are offered in newer versions.
>>>>>> Spark 3.2 is just around the corner and it brings a lot of important
>>>>>> features such dynamic filtering for v2 tables, required distribution and
>>>>>> ordering for writes, etc. These features are too important to ignore them.
>>>>>>
>>>>>> Apart from that, I have an end-to-end prototype for merge-on-read
>>>>>> with Spark that actually leverages some of the 3.2 features. I’ll be
>>>>>> implementing all new Spark DSv2 APIs for us internally and would love to
>>>>>> share that with the rest of the community.
>>>>>>
>>>>>> I see two options to move forward:
>>>>>>
>>>>>> Option 1
>>>>>>
>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by
>>>>>> releasing minor versions with bug fixes.
>>>>>>
>>>>>> Pros: almost no changes to the build configuration, no extra work on
>>>>>> our side as just a single Spark version is actively maintained.
>>>>>> Cons: some new features that we will be adding to master could also
>>>>>> work with older Spark versions but all 0.12 releases will only contain bug
>>>>>> fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume
>>>>>> any new Spark or format features.
>>>>>>
>>>>>> Option 2
>>>>>>
>>>>>> Move our Spark integration into a separate project and introduce
>>>>>> branches for 3.0, 3.1 and 3.2.
>>>>>>
>>>>>> Pros: decouples the format version from Spark, we can support as many
>>>>>> Spark versions as needed.
>>>>>> Cons: more work initially to set everything up, more work to release,
>>>>>> will need a new release of the core format to consume any changes in the
>>>>>> Spark integration.
>>>>>>
>>>>>> Overall, I think option 2 seems better for the user but my main worry
>>>>>> is that we will have to release the format more frequently (which is a good
>>>>>> thing but requires more work and time) and the overall Spark development
>>>>>> may be slower.
>>>>>>
>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>
>>>>>> Thanks,
>>>>>> Anton
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>

Re: [DISCUSS] Spark version support strategy

Posted by Saisai Shao <sa...@gmail.com>.

From Dev's point, it has less burden to always support the latest version
of Spark (for example). But from user's point, especially for us who
maintain Spark internally, it is not easy to upgrade the Spark version for
the first time (since we have many customizations internally), and we're
still promoting to upgrade to 3.1.2. If the community ditches the support
of old version of Spark3, users have to maintain it themselves unavoidably.

So I'm inclined to make this support in community, not by users themselves,
as for Option 2 or 3, I'm fine with either. And to relieve the burden, we
could support limited versions of Spark (for example 2 versions).

Just my two cents.

-Saisai


Jack Ye <ye...@gmail.com> 于2021年9月15日周三 下午1:35写道：

> Hi Wing Yew,
>
> I think 2.4 is a different story, we will continue to support Spark 2.4,
> but as you can see it will continue to have very limited functionalities
> comparing to Spark 3. I believe we discussed about option 3 when we were
> doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the same issue for
> Flink 1.11, 1.12 and 1.13 as well. I feel we need a consistent strategy
> around this, let's take this chance to make a good community guideline for
> all future engine versions, especially for Spark, Flink and Hive that are
> in the same repository.
>
> I can totally understand your point of view Wing, in fact, speaking from
> the perspective of AWS EMR, we have to support over 40 versions of the
> software because there are people who are still using Spark 1.4, believe it
> or not. After all, keep backporting changes will become a liability not
> only on the user side, but also on the service provider side, so I believe
> it's not a bad practice to push for user upgrade, as it will make the life
> of both parties easier in the end. New feature is definitely one of the
> best incentives to promote an upgrade on user side.
>
> I think the biggest issue of option 3 is about its scalability, because we
> will have an unbounded list of packages to add and compile in the future,
> and we probably cannot drop support of that package once created. If we go
> with option 1, I think we can still publish a few patch versions for old
> Iceberg releases, and committers can control the amount of patch versions
> to guard people from abusing the power of patching. I see this as a
> consistent strategy also for Flink and Hive. With this strategy, we can
> truly have a compatibility matrix for engine versions against Iceberg
> versions.
>
> -Jack
>
>
>
> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <wy...@cloudera.com.invalid>
> wrote:
>
>> I understand and sympathize with the desire to use new DSv2 features in
>> Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't
>> think it considers the interests of users. I do not think that most users
>> will upgrade to Spark 3.2 as soon as it is released. It is a "minor
>> version" upgrade in name from 3.1 (or from 3.0), but I think we all know
>> that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1
>> and from 3.1 to 3.2. I think there are even a lot of users running Spark
>> 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark
>> 2.4?
>>
>> Please correct me if I'm mistaken, but the folks who have spoken out in
>> favor of Option 1 all work for the same organization, don't they? And they
>> don't have a problem with making their users, all internal, simply upgrade
>> to Spark 3.2, do they? (Or they are already running an internal fork that
>> is close to 3.2.)
>>
>> I work for an organization with customers running different versions of
>> Spark. It is true that we can backport new features to older versions if we
>> wanted to. I suppose the people contributing to Iceberg work for some
>> organization or other that either use Iceberg in-house, or provide software
>> (possibly in the form of a service) to customers, and either way, the
>> organizations have the ability to backport features and fixes to internal
>> versions. Are there any users out there who simply use Apache Iceberg and
>> depend on the community version?
>>
>> There may be features that are broadly useful that do not depend on Spark
>> 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>
>> I am not in favor of Option 2. I do not oppose Option 1, but I would
>> consider Option 3 too. Anton, you said 5 modules are required; what are the
>> modules you're thinking of?
>>
>> - Wing Yew
>>
>>
>>
>>
>>
>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <fl...@gmail.com> wrote:
>>
>>> Option 1 sounds good to me. Here are my reasons:
>>>
>>> 1. Both 2 and 3 will slow down the development. Considering the limited
>>> resources in the open source community, the upsides of option 2 and 3 are
>>> probably not worthy.
>>> 2. Both 2 and 3 assume the use cases may not exist. It's hard to predict
>>> anything, but even if these use cases are legit, users can still get the
>>> new feature by backporting it to an older version in case of upgrading to a
>>> newer version isn't an option.
>>>
>>> Best,
>>>
>>> Yufei
>>>
>>> `This is not a contribution`
>>>
>>>
>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi
>>> <ao...@apple.com.invalid> wrote:
>>>
>>>> To sum up what we have so far:
>>>>
>>>>
>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>
>>>> The easiest option for us devs, forces the user to upgrade to the most
>>>> recent minor Spark version to consume any new Iceberg features.
>>>>
>>>> *Option 2 (a separate project under Iceberg)*
>>>>
>>>> Can support as many Spark versions as needed and the codebase is still
>>>> separate as we can use separate branches.
>>>> Impossible to consume any unreleased changes in core, may slow down the
>>>> development.
>>>>
>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>
>>>> Introduce more modules in the same project.
>>>> Can consume unreleased changes but it will required at least 5 modules
>>>> to support 2.4, 3.1 and 3.2, making the build and testing complicated.
>>>>
>>>>
>>>> Are there any users for whom upgrading the minor Spark version (e3.1 to
>>>> 3.2) to consume new features is a blocker?
>>>> We follow Option 1 internally at the moment but I would like to hear
>>>> what other people think/need.
>>>>
>>>> - Anton
>>>>
>>>>
>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <ru...@gmail.com>
>>>> wrote:
>>>>
>>>> I think we should go for option 1. I already am not a big fan of having
>>>> runtime errors for unsupported things based on versions and I don't think
>>>> minor version upgrades are a large issue for users.  I'm especially not
>>>> looking forward to supporting interfaces that only exist in Spark 3.2 in a
>>>> multiple Spark version support future.
>>>>
>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>> aokolnychyi@apple.com.INVALID> wrote:
>>>>
>>>> First of all, is option 2 a viable option? We discussed separating the
>>>> python module outside of the project a few weeks ago, and decided to not do
>>>> that because it's beneficial for code cross reference and more intuitive
>>>> for new developers to see everything in the same repository. I would expect
>>>> the same argument to also hold here.
>>>>
>>>>
>>>> That’s exactly the concern I have about Option 2 at this moment.
>>>>
>>>> Overall I would personally prefer us to not support all the minor
>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>> version.
>>>>
>>>>
>>>> This is when it gets a bit complicated. If we want to support both
>>>> Spark 3.1 and Spark 3.2 with a single module, it means we have to compile
>>>> against 3.1. The problem is that we rely on DSv2 that is being actively
>>>> developed. 3.2 and 3.1 have substantial differences. On top of that, we
>>>> have our extensions that are extremely low-level and may break not only
>>>> between minor versions but also between patch releases.
>>>>
>>>> f there are some features requiring a newer version, it makes sense to
>>>> move that newer version in master.
>>>>
>>>>
>>>> Internally, we don’t deliver new features to older Spark versions as it
>>>> requires a lot of effort to port things. Personally, I don’t think it is
>>>> too bad to require users to upgrade if they want new features. At the same
>>>> time, there are valid concerns with this approach too that we mentioned
>>>> during the sync. For example, certain new features would also work fine
>>>> with older Spark versions. I generally agree with that and that not
>>>> supporting recent versions is not ideal. However, I want to find a balance
>>>> between the complexity on our side and ease of use for the users. Ideally,
>>>> supporting a few recent versions would be sufficient but our Spark
>>>> integration is too low-level to do that with a single module.
>>>>
>>>>
>>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com> wrote:
>>>>
>>>> First of all, is option 2 a viable option? We discussed separating the
>>>> python module outside of the project a few weeks ago, and decided to not do
>>>> that because it's beneficial for code cross reference and more intuitive
>>>> for new developers to see everything in the same repository. I would expect
>>>> the same argument to also hold here.
>>>>
>>>> Overall I would personally prefer us to not support all the minor
>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>> version. This avoids the problem that some users are unwilling to move to a
>>>> newer version and keep patching old Spark version branches. If there are
>>>> some features requiring a newer version, it makes sense to move that newer
>>>> version in master.
>>>>
>>>> In addition, because currently Spark is considered the most
>>>> feature-complete reference implementation compared to all other engines, I
>>>> think we should not add artificial barriers that would slow down its
>>>> development speed.
>>>>
>>>> So my thinking is closer to option 1.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>>
>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>> aokolnychyi@apple.com.invalid> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> I want to discuss our Spark version support strategy.
>>>>>
>>>>> So far, we have tried to support both 3.0 and 3.1. It is great to
>>>>> support older versions but because we compile against 3.0, we cannot use
>>>>> any Spark features that are offered in newer versions.
>>>>> Spark 3.2 is just around the corner and it brings a lot of important
>>>>> features such dynamic filtering for v2 tables, required distribution and
>>>>> ordering for writes, etc. These features are too important to ignore them.
>>>>>
>>>>> Apart from that, I have an end-to-end prototype for merge-on-read with
>>>>> Spark that actually leverages some of the 3.2 features. I’ll be
>>>>> implementing all new Spark DSv2 APIs for us internally and would love to
>>>>> share that with the rest of the community.
>>>>>
>>>>> I see two options to move forward:
>>>>>
>>>>> Option 1
>>>>>
>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing
>>>>> minor versions with bug fixes.
>>>>>
>>>>> Pros: almost no changes to the build configuration, no extra work on
>>>>> our side as just a single Spark version is actively maintained.
>>>>> Cons: some new features that we will be adding to master could also
>>>>> work with older Spark versions but all 0.12 releases will only contain bug
>>>>> fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume
>>>>> any new Spark or format features.
>>>>>
>>>>> Option 2
>>>>>
>>>>> Move our Spark integration into a separate project and introduce
>>>>> branches for 3.0, 3.1 and 3.2.
>>>>>
>>>>> Pros: decouples the format version from Spark, we can support as many
>>>>> Spark versions as needed.
>>>>> Cons: more work initially to set everything up, more work to release,
>>>>> will need a new release of the core format to consume any changes in the
>>>>> Spark integration.
>>>>>
>>>>> Overall, I think option 2 seems better for the user but my main worry
>>>>> is that we will have to release the format more frequently (which is a good
>>>>> thing but requires more work and time) and the overall Spark development
>>>>> may be slower.
>>>>>
>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>
>>>>> Thanks,
>>>>> Anton
>>>>
>>>>
>>>>
>>>>
>>>>

Re: [DISCUSS] Spark version support strategy

Posted by Jack Ye <ye...@gmail.com>.

Hi Wing Yew,

I think 2.4 is a different story, we will continue to support Spark 2.4,
but as you can see it will continue to have very limited functionalities
comparing to Spark 3. I believe we discussed about option 3 when we were
doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the same issue for
Flink 1.11, 1.12 and 1.13 as well. I feel we need a consistent strategy
around this, let's take this chance to make a good community guideline for
all future engine versions, especially for Spark, Flink and Hive that are
in the same repository.

I can totally understand your point of view Wing, in fact, speaking from
the perspective of AWS EMR, we have to support over 40 versions of the
software because there are people who are still using Spark 1.4, believe it
or not. After all, keep backporting changes will become a liability not
only on the user side, but also on the service provider side, so I believe
it's not a bad practice to push for user upgrade, as it will make the life
of both parties easier in the end. New feature is definitely one of the
best incentives to promote an upgrade on user side.

I think the biggest issue of option 3 is about its scalability, because we
will have an unbounded list of packages to add and compile in the future,
and we probably cannot drop support of that package once created. If we go
with option 1, I think we can still publish a few patch versions for old
Iceberg releases, and committers can control the amount of patch versions
to guard people from abusing the power of patching. I see this as a
consistent strategy also for Flink and Hive. With this strategy, we can
truly have a compatibility matrix for engine versions against Iceberg
versions.

-Jack



On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <wy...@cloudera.com.invalid>
wrote:

> I understand and sympathize with the desire to use new DSv2 features in
> Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't
> think it considers the interests of users. I do not think that most users
> will upgrade to Spark 3.2 as soon as it is released. It is a "minor
> version" upgrade in name from 3.1 (or from 3.0), but I think we all know
> that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1
> and from 3.1 to 3.2. I think there are even a lot of users running Spark
> 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark
> 2.4?
>
> Please correct me if I'm mistaken, but the folks who have spoken out in
> favor of Option 1 all work for the same organization, don't they? And they
> don't have a problem with making their users, all internal, simply upgrade
> to Spark 3.2, do they? (Or they are already running an internal fork that
> is close to 3.2.)
>
> I work for an organization with customers running different versions of
> Spark. It is true that we can backport new features to older versions if we
> wanted to. I suppose the people contributing to Iceberg work for some
> organization or other that either use Iceberg in-house, or provide software
> (possibly in the form of a service) to customers, and either way, the
> organizations have the ability to backport features and fixes to internal
> versions. Are there any users out there who simply use Apache Iceberg and
> depend on the community version?
>
> There may be features that are broadly useful that do not depend on Spark
> 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>
> I am not in favor of Option 2. I do not oppose Option 1, but I would
> consider Option 3 too. Anton, you said 5 modules are required; what are the
> modules you're thinking of?
>
> - Wing Yew
>
>
>
>
>
> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <fl...@gmail.com> wrote:
>
>> Option 1 sounds good to me. Here are my reasons:
>>
>> 1. Both 2 and 3 will slow down the development. Considering the limited
>> resources in the open source community, the upsides of option 2 and 3 are
>> probably not worthy.
>> 2. Both 2 and 3 assume the use cases may not exist. It's hard to predict
>> anything, but even if these use cases are legit, users can still get the
>> new feature by backporting it to an older version in case of upgrading to a
>> newer version isn't an option.
>>
>> Best,
>>
>> Yufei
>>
>> `This is not a contribution`
>>
>>
>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi
>> <ao...@apple.com.invalid> wrote:
>>
>>> To sum up what we have so far:
>>>
>>>
>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>
>>> The easiest option for us devs, forces the user to upgrade to the most
>>> recent minor Spark version to consume any new Iceberg features.
>>>
>>> *Option 2 (a separate project under Iceberg)*
>>>
>>> Can support as many Spark versions as needed and the codebase is still
>>> separate as we can use separate branches.
>>> Impossible to consume any unreleased changes in core, may slow down the
>>> development.
>>>
>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>
>>> Introduce more modules in the same project.
>>> Can consume unreleased changes but it will required at least 5 modules
>>> to support 2.4, 3.1 and 3.2, making the build and testing complicated.
>>>
>>>
>>> Are there any users for whom upgrading the minor Spark version (e3.1 to
>>> 3.2) to consume new features is a blocker?
>>> We follow Option 1 internally at the moment but I would like to hear
>>> what other people think/need.
>>>
>>> - Anton
>>>
>>>
>>> On 14 Sep 2021, at 09:44, Russell Spitzer <ru...@gmail.com>
>>> wrote:
>>>
>>> I think we should go for option 1. I already am not a big fan of having
>>> runtime errors for unsupported things based on versions and I don't think
>>> minor version upgrades are a large issue for users.  I'm especially not
>>> looking forward to supporting interfaces that only exist in Spark 3.2 in a
>>> multiple Spark version support future.
>>>
>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>> aokolnychyi@apple.com.INVALID> wrote:
>>>
>>> First of all, is option 2 a viable option? We discussed separating the
>>> python module outside of the project a few weeks ago, and decided to not do
>>> that because it's beneficial for code cross reference and more intuitive
>>> for new developers to see everything in the same repository. I would expect
>>> the same argument to also hold here.
>>>
>>>
>>> That’s exactly the concern I have about Option 2 at this moment.
>>>
>>> Overall I would personally prefer us to not support all the minor
>>> versions, but instead support maybe just 2-3 latest versions in a major
>>> version.
>>>
>>>
>>> This is when it gets a bit complicated. If we want to support both Spark
>>> 3.1 and Spark 3.2 with a single module, it means we have to compile against
>>> 3.1. The problem is that we rely on DSv2 that is being actively developed.
>>> 3.2 and 3.1 have substantial differences. On top of that, we have our
>>> extensions that are extremely low-level and may break not only between
>>> minor versions but also between patch releases.
>>>
>>> f there are some features requiring a newer version, it makes sense to
>>> move that newer version in master.
>>>
>>>
>>> Internally, we don’t deliver new features to older Spark versions as it
>>> requires a lot of effort to port things. Personally, I don’t think it is
>>> too bad to require users to upgrade if they want new features. At the same
>>> time, there are valid concerns with this approach too that we mentioned
>>> during the sync. For example, certain new features would also work fine
>>> with older Spark versions. I generally agree with that and that not
>>> supporting recent versions is not ideal. However, I want to find a balance
>>> between the complexity on our side and ease of use for the users. Ideally,
>>> supporting a few recent versions would be sufficient but our Spark
>>> integration is too low-level to do that with a single module.
>>>
>>>
>>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com> wrote:
>>>
>>> First of all, is option 2 a viable option? We discussed separating the
>>> python module outside of the project a few weeks ago, and decided to not do
>>> that because it's beneficial for code cross reference and more intuitive
>>> for new developers to see everything in the same repository. I would expect
>>> the same argument to also hold here.
>>>
>>> Overall I would personally prefer us to not support all the minor
>>> versions, but instead support maybe just 2-3 latest versions in a major
>>> version. This avoids the problem that some users are unwilling to move to a
>>> newer version and keep patching old Spark version branches. If there are
>>> some features requiring a newer version, it makes sense to move that newer
>>> version in master.
>>>
>>> In addition, because currently Spark is considered the most
>>> feature-complete reference implementation compared to all other engines, I
>>> think we should not add artificial barriers that would slow down its
>>> development speed.
>>>
>>> So my thinking is closer to option 1.
>>>
>>> Best,
>>> Jack Ye
>>>
>>>
>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>> aokolnychyi@apple.com.invalid> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> I want to discuss our Spark version support strategy.
>>>>
>>>> So far, we have tried to support both 3.0 and 3.1. It is great to
>>>> support older versions but because we compile against 3.0, we cannot use
>>>> any Spark features that are offered in newer versions.
>>>> Spark 3.2 is just around the corner and it brings a lot of important
>>>> features such dynamic filtering for v2 tables, required distribution and
>>>> ordering for writes, etc. These features are too important to ignore them.
>>>>
>>>> Apart from that, I have an end-to-end prototype for merge-on-read with
>>>> Spark that actually leverages some of the 3.2 features. I’ll be
>>>> implementing all new Spark DSv2 APIs for us internally and would love to
>>>> share that with the rest of the community.
>>>>
>>>> I see two options to move forward:
>>>>
>>>> Option 1
>>>>
>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing
>>>> minor versions with bug fixes.
>>>>
>>>> Pros: almost no changes to the build configuration, no extra work on
>>>> our side as just a single Spark version is actively maintained.
>>>> Cons: some new features that we will be adding to master could also
>>>> work with older Spark versions but all 0.12 releases will only contain bug
>>>> fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume
>>>> any new Spark or format features.
>>>>
>>>> Option 2
>>>>
>>>> Move our Spark integration into a separate project and introduce
>>>> branches for 3.0, 3.1 and 3.2.
>>>>
>>>> Pros: decouples the format version from Spark, we can support as many
>>>> Spark versions as needed.
>>>> Cons: more work initially to set everything up, more work to release,
>>>> will need a new release of the core format to consume any changes in the
>>>> Spark integration.
>>>>
>>>> Overall, I think option 2 seems better for the user but my main worry
>>>> is that we will have to release the format more frequently (which is a good
>>>> thing but requires more work and time) and the overall Spark development
>>>> may be slower.
>>>>
>>>> I’d love to hear what everybody thinks about this matter.
>>>>
>>>> Thanks,
>>>> Anton
>>>
>>>
>>>
>>>
>>>

Re: [DISCUSS] Spark version support strategy

Posted by Wing Yew Poon <wy...@cloudera.com.INVALID>.

I understand and sympathize with the desire to use new DSv2 features in
Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't
think it considers the interests of users. I do not think that most users
will upgrade to Spark 3.2 as soon as it is released. It is a "minor
version" upgrade in name from 3.1 (or from 3.0), but I think we all know
that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1
and from 3.1 to 3.2. I think there are even a lot of users running Spark
2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark
2.4?

Please correct me if I'm mistaken, but the folks who have spoken out in
favor of Option 1 all work for the same organization, don't they? And they
don't have a problem with making their users, all internal, simply upgrade
to Spark 3.2, do they? (Or they are already running an internal fork that
is close to 3.2.)

I work for an organization with customers running different versions of
Spark. It is true that we can backport new features to older versions if we
wanted to. I suppose the people contributing to Iceberg work for some
organization or other that either use Iceberg in-house, or provide software
(possibly in the form of a service) to customers, and either way, the
organizations have the ability to backport features and fixes to internal
versions. Are there any users out there who simply use Apache Iceberg and
depend on the community version?

There may be features that are broadly useful that do not depend on Spark
3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?

I am not in favor of Option 2. I do not oppose Option 1, but I would
consider Option 3 too. Anton, you said 5 modules are required; what are the
modules you're thinking of?

- Wing Yew





On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <fl...@gmail.com> wrote:

> Option 1 sounds good to me. Here are my reasons:
>
> 1. Both 2 and 3 will slow down the development. Considering the limited
> resources in the open source community, the upsides of option 2 and 3 are
> probably not worthy.
> 2. Both 2 and 3 assume the use cases may not exist. It's hard to predict
> anything, but even if these use cases are legit, users can still get the
> new feature by backporting it to an older version in case of upgrading to a
> newer version isn't an option.
>
> Best,
>
> Yufei
>
> `This is not a contribution`
>
>
> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi
> <ao...@apple.com.invalid> wrote:
>
>> To sum up what we have so far:
>>
>>
>> *Option 1 (support just the most recent minor Spark 3 version)*
>>
>> The easiest option for us devs, forces the user to upgrade to the most
>> recent minor Spark version to consume any new Iceberg features.
>>
>> *Option 2 (a separate project under Iceberg)*
>>
>> Can support as many Spark versions as needed and the codebase is still
>> separate as we can use separate branches.
>> Impossible to consume any unreleased changes in core, may slow down the
>> development.
>>
>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>
>> Introduce more modules in the same project.
>> Can consume unreleased changes but it will required at least 5 modules to
>> support 2.4, 3.1 and 3.2, making the build and testing complicated.
>>
>>
>> Are there any users for whom upgrading the minor Spark version (e3.1 to
>> 3.2) to consume new features is a blocker?
>> We follow Option 1 internally at the moment but I would like to hear what
>> other people think/need.
>>
>> - Anton
>>
>>
>> On 14 Sep 2021, at 09:44, Russell Spitzer <ru...@gmail.com>
>> wrote:
>>
>> I think we should go for option 1. I already am not a big fan of having
>> runtime errors for unsupported things based on versions and I don't think
>> minor version upgrades are a large issue for users.  I'm especially not
>> looking forward to supporting interfaces that only exist in Spark 3.2 in a
>> multiple Spark version support future.
>>
>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>> aokolnychyi@apple.com.INVALID> wrote:
>>
>> First of all, is option 2 a viable option? We discussed separating the
>> python module outside of the project a few weeks ago, and decided to not do
>> that because it's beneficial for code cross reference and more intuitive
>> for new developers to see everything in the same repository. I would expect
>> the same argument to also hold here.
>>
>>
>> That’s exactly the concern I have about Option 2 at this moment.
>>
>> Overall I would personally prefer us to not support all the minor
>> versions, but instead support maybe just 2-3 latest versions in a major
>> version.
>>
>>
>> This is when it gets a bit complicated. If we want to support both Spark
>> 3.1 and Spark 3.2 with a single module, it means we have to compile against
>> 3.1. The problem is that we rely on DSv2 that is being actively developed.
>> 3.2 and 3.1 have substantial differences. On top of that, we have our
>> extensions that are extremely low-level and may break not only between
>> minor versions but also between patch releases.
>>
>> f there are some features requiring a newer version, it makes sense to
>> move that newer version in master.
>>
>>
>> Internally, we don’t deliver new features to older Spark versions as it
>> requires a lot of effort to port things. Personally, I don’t think it is
>> too bad to require users to upgrade if they want new features. At the same
>> time, there are valid concerns with this approach too that we mentioned
>> during the sync. For example, certain new features would also work fine
>> with older Spark versions. I generally agree with that and that not
>> supporting recent versions is not ideal. However, I want to find a balance
>> between the complexity on our side and ease of use for the users. Ideally,
>> supporting a few recent versions would be sufficient but our Spark
>> integration is too low-level to do that with a single module.
>>
>>
>> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com> wrote:
>>
>> First of all, is option 2 a viable option? We discussed separating the
>> python module outside of the project a few weeks ago, and decided to not do
>> that because it's beneficial for code cross reference and more intuitive
>> for new developers to see everything in the same repository. I would expect
>> the same argument to also hold here.
>>
>> Overall I would personally prefer us to not support all the minor
>> versions, but instead support maybe just 2-3 latest versions in a major
>> version. This avoids the problem that some users are unwilling to move to a
>> newer version and keep patching old Spark version branches. If there are
>> some features requiring a newer version, it makes sense to move that newer
>> version in master.
>>
>> In addition, because currently Spark is considered the most
>> feature-complete reference implementation compared to all other engines, I
>> think we should not add artificial barriers that would slow down its
>> development speed.
>>
>> So my thinking is closer to option 1.
>>
>> Best,
>> Jack Ye
>>
>>
>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>> aokolnychyi@apple.com.invalid> wrote:
>>
>>> Hey folks,
>>>
>>> I want to discuss our Spark version support strategy.
>>>
>>> So far, we have tried to support both 3.0 and 3.1. It is great to
>>> support older versions but because we compile against 3.0, we cannot use
>>> any Spark features that are offered in newer versions.
>>> Spark 3.2 is just around the corner and it brings a lot of important
>>> features such dynamic filtering for v2 tables, required distribution and
>>> ordering for writes, etc. These features are too important to ignore them.
>>>
>>> Apart from that, I have an end-to-end prototype for merge-on-read with
>>> Spark that actually leverages some of the 3.2 features. I’ll be
>>> implementing all new Spark DSv2 APIs for us internally and would love to
>>> share that with the rest of the community.
>>>
>>> I see two options to move forward:
>>>
>>> Option 1
>>>
>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing
>>> minor versions with bug fixes.
>>>
>>> Pros: almost no changes to the build configuration, no extra work on our
>>> side as just a single Spark version is actively maintained.
>>> Cons: some new features that we will be adding to master could also work
>>> with older Spark versions but all 0.12 releases will only contain bug
>>> fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume
>>> any new Spark or format features.
>>>
>>> Option 2
>>>
>>> Move our Spark integration into a separate project and introduce
>>> branches for 3.0, 3.1 and 3.2.
>>>
>>> Pros: decouples the format version from Spark, we can support as many
>>> Spark versions as needed.
>>> Cons: more work initially to set everything up, more work to release,
>>> will need a new release of the core format to consume any changes in the
>>> Spark integration.
>>>
>>> Overall, I think option 2 seems better for the user but my main worry is
>>> that we will have to release the format more frequently (which is a good
>>> thing but requires more work and time) and the overall Spark development
>>> may be slower.
>>>
>>> I’d love to hear what everybody thinks about this matter.
>>>
>>> Thanks,
>>> Anton
>>
>>
>>
>>
>>

Re: [DISCUSS] Spark version support strategy

Posted by Yufei Gu <fl...@gmail.com>.

Option 1 sounds good to me. Here are my reasons:

1. Both 2 and 3 will slow down the development. Considering the limited
resources in the open source community, the upsides of option 2 and 3 are
probably not worthy.
2. Both 2 and 3 assume the use cases may not exist. It's hard to predict
anything, but even if these use cases are legit, users can still get the
new feature by backporting it to an older version in case of upgrading to a
newer version isn't an option.

Best,

Yufei

`This is not a contribution`


On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi
<ao...@apple.com.invalid> wrote:

> To sum up what we have so far:
>
>
> *Option 1 (support just the most recent minor Spark 3 version)*
>
> The easiest option for us devs, forces the user to upgrade to the most
> recent minor Spark version to consume any new Iceberg features.
>
> *Option 2 (a separate project under Iceberg)*
>
> Can support as many Spark versions as needed and the codebase is still
> separate as we can use separate branches.
> Impossible to consume any unreleased changes in core, may slow down the
> development.
>
> *Option 3 (separate modules for Spark 3.1/3.2)*
>
> Introduce more modules in the same project.
> Can consume unreleased changes but it will required at least 5 modules to
> support 2.4, 3.1 and 3.2, making the build and testing complicated.
>
>
> Are there any users for whom upgrading the minor Spark version (e3.1 to
> 3.2) to consume new features is a blocker?
> We follow Option 1 internally at the moment but I would like to hear what
> other people think/need.
>
> - Anton
>
>
> On 14 Sep 2021, at 09:44, Russell Spitzer <ru...@gmail.com>
> wrote:
>
> I think we should go for option 1. I already am not a big fan of having
> runtime errors for unsupported things based on versions and I don't think
> minor version upgrades are a large issue for users.  I'm especially not
> looking forward to supporting interfaces that only exist in Spark 3.2 in a
> multiple Spark version support future.
>
> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
> aokolnychyi@apple.com.INVALID> wrote:
>
> First of all, is option 2 a viable option? We discussed separating the
> python module outside of the project a few weeks ago, and decided to not do
> that because it's beneficial for code cross reference and more intuitive
> for new developers to see everything in the same repository. I would expect
> the same argument to also hold here.
>
>
> That’s exactly the concern I have about Option 2 at this moment.
>
> Overall I would personally prefer us to not support all the minor
> versions, but instead support maybe just 2-3 latest versions in a major
> version.
>
>
> This is when it gets a bit complicated. If we want to support both Spark
> 3.1 and Spark 3.2 with a single module, it means we have to compile against
> 3.1. The problem is that we rely on DSv2 that is being actively developed.
> 3.2 and 3.1 have substantial differences. On top of that, we have our
> extensions that are extremely low-level and may break not only between
> minor versions but also between patch releases.
>
> f there are some features requiring a newer version, it makes sense to
> move that newer version in master.
>
>
> Internally, we don’t deliver new features to older Spark versions as it
> requires a lot of effort to port things. Personally, I don’t think it is
> too bad to require users to upgrade if they want new features. At the same
> time, there are valid concerns with this approach too that we mentioned
> during the sync. For example, certain new features would also work fine
> with older Spark versions. I generally agree with that and that not
> supporting recent versions is not ideal. However, I want to find a balance
> between the complexity on our side and ease of use for the users. Ideally,
> supporting a few recent versions would be sufficient but our Spark
> integration is too low-level to do that with a single module.
>
>
> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com> wrote:
>
> First of all, is option 2 a viable option? We discussed separating the
> python module outside of the project a few weeks ago, and decided to not do
> that because it's beneficial for code cross reference and more intuitive
> for new developers to see everything in the same repository. I would expect
> the same argument to also hold here.
>
> Overall I would personally prefer us to not support all the minor
> versions, but instead support maybe just 2-3 latest versions in a major
> version. This avoids the problem that some users are unwilling to move to a
> newer version and keep patching old Spark version branches. If there are
> some features requiring a newer version, it makes sense to move that newer
> version in master.
>
> In addition, because currently Spark is considered the most
> feature-complete reference implementation compared to all other engines, I
> think we should not add artificial barriers that would slow down its
> development speed.
>
> So my thinking is closer to option 1.
>
> Best,
> Jack Ye
>
>
> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
> aokolnychyi@apple.com.invalid> wrote:
>
>> Hey folks,
>>
>> I want to discuss our Spark version support strategy.
>>
>> So far, we have tried to support both 3.0 and 3.1. It is great to support
>> older versions but because we compile against 3.0, we cannot use any Spark
>> features that are offered in newer versions.
>> Spark 3.2 is just around the corner and it brings a lot of important
>> features such dynamic filtering for v2 tables, required distribution and
>> ordering for writes, etc. These features are too important to ignore them.
>>
>> Apart from that, I have an end-to-end prototype for merge-on-read with
>> Spark that actually leverages some of the 3.2 features. I’ll be
>> implementing all new Spark DSv2 APIs for us internally and would love to
>> share that with the rest of the community.
>>
>> I see two options to move forward:
>>
>> Option 1
>>
>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing
>> minor versions with bug fixes.
>>
>> Pros: almost no changes to the build configuration, no extra work on our
>> side as just a single Spark version is actively maintained.
>> Cons: some new features that we will be adding to master could also work
>> with older Spark versions but all 0.12 releases will only contain bug
>> fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume
>> any new Spark or format features.
>>
>> Option 2
>>
>> Move our Spark integration into a separate project and introduce branches
>> for 3.0, 3.1 and 3.2.
>>
>> Pros: decouples the format version from Spark, we can support as many
>> Spark versions as needed.
>> Cons: more work initially to set everything up, more work to release,
>> will need a new release of the core format to consume any changes in the
>> Spark integration.
>>
>> Overall, I think option 2 seems better for the user but my main worry is
>> that we will have to release the format more frequently (which is a good
>> thing but requires more work and time) and the overall Spark development
>> may be slower.
>>
>> I’d love to hear what everybody thinks about this matter.
>>
>> Thanks,
>> Anton
>
>
>
>
>

Re: [DISCUSS] Spark version support strategy

Posted by Anton Okolnychyi <ao...@apple.com.INVALID>.

To sum up what we have so far:


Option 1 (support just the most recent minor Spark 3 version)

The easiest option for us devs, forces the user to upgrade to the most recent minor Spark version to consume any new Iceberg features.

Option 2 (a separate project under Iceberg)

Can support as many Spark versions as needed and the codebase is still separate as we can use separate branches.
Impossible to consume any unreleased changes in core, may slow down the development.

Option 3 (separate modules for Spark 3.1/3.2)

Introduce more modules in the same project.
Can consume unreleased changes but it will required at least 5 modules to support 2.4, 3.1 and 3.2, making the build and testing complicated.


Are there any users for whom upgrading the minor Spark version (e3.1 to 3.2) to consume new features is a blocker?
We follow Option 1 internally at the moment but I would like to hear what other people think/need.

- Anton


> On 14 Sep 2021, at 09:44, Russell Spitzer <ru...@gmail.com> wrote:
> 
> I think we should go for option 1. I already am not a big fan of having runtime errors for unsupported things based on versions and I don't think minor version upgrades are a large issue for users.  I'm especially not looking forward to supporting interfaces that only exist in Spark 3.2 in a multiple Spark version support future.
> 
>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <aokolnychyi@apple.com.INVALID <ma...@apple.com.INVALID>> wrote:
>> 
>>> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here. 
>> 
>> That’s exactly the concern I have about Option 2 at this moment.
>> 
>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just 2-3 latest versions in a major version. 
>> 
>> This is when it gets a bit complicated. If we want to support both Spark 3.1 and Spark 3.2 with a single module, it means we have to compile against 3.1. The problem is that we rely on DSv2 that is being actively developed. 3.2 and 3.1 have substantial differences. On top of that, we have our extensions that are extremely low-level and may break not only between minor versions but also between patch releases.
>> 
>>> f there are some features requiring a newer version, it makes sense to move that newer version in master.
>> 
>> Internally, we don’t deliver new features to older Spark versions as it requires a lot of effort to port things. Personally, I don’t think it is too bad to require users to upgrade if they want new features. At the same time, there are valid concerns with this approach too that we mentioned during the sync. For example, certain new features would also work fine with older Spark versions. I generally agree with that and that not supporting recent versions is not ideal. However, I want to find a balance between the complexity on our side and ease of use for the users. Ideally, supporting a few recent versions would be sufficient but our Spark integration is too low-level to do that with a single module.
>>  
>> 
>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhaoqin@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here. 
>>> 
>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just 2-3 latest versions in a major version. This avoids the problem that some users are unwilling to move to a newer version and keep patching old Spark version branches. If there are some features requiring a newer version, it makes sense to move that newer version in master.
>>> 
>>> In addition, because currently Spark is considered the most feature-complete reference implementation compared to all other engines, I think we should not add artificial barriers that would slow down its development speed.
>>> 
>>> So my thinking is closer to option 1.
>>> 
>>> Best,
>>> Jack Ye
>>> 
>>> 
>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <aokolnychyi@apple.com.invalid <ma...@apple.com.invalid>> wrote:
>>> Hey folks,
>>> 
>>> I want to discuss our Spark version support strategy.
>>> 
>>> So far, we have tried to support both 3.0 and 3.1. It is great to support older versions but because we compile against 3.0, we cannot use any Spark features that are offered in newer versions.
>>> Spark 3.2 is just around the corner and it brings a lot of important features such dynamic filtering for v2 tables, required distribution and ordering for writes, etc. These features are too important to ignore them.
>>> 
>>> Apart from that, I have an end-to-end prototype for merge-on-read with Spark that actually leverages some of the 3.2 features. I’ll be implementing all new Spark DSv2 APIs for us internally and would love to share that with the rest of the community.
>>> 
>>> I see two options to move forward:
>>> 
>>> Option 1
>>> 
>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing minor versions with bug fixes.
>>> 
>>> Pros: almost no changes to the build configuration, no extra work on our side as just a single Spark version is actively maintained.
>>> Cons: some new features that we will be adding to master could also work with older Spark versions but all 0.12 releases will only contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume any new Spark or format features.
>>> 
>>> Option 2
>>> 
>>> Move our Spark integration into a separate project and introduce branches for 3.0, 3.1 and 3.2.
>>> 
>>> Pros: decouples the format version from Spark, we can support as many Spark versions as needed.
>>> Cons: more work initially to set everything up, more work to release, will need a new release of the core format to consume any changes in the Spark integration.
>>> 
>>> Overall, I think option 2 seems better for the user but my main worry is that we will have to release the format more frequently (which is a good thing but requires more work and time) and the overall Spark development may be slower.
>>> 
>>> I’d love to hear what everybody thinks about this matter.
>>> 
>>> Thanks,
>>> Anton
>> 
>

Re: [DISCUSS] Spark version support strategy

Posted by Russell Spitzer <ru...@gmail.com>.

I think we should go for option 1. I already am not a big fan of having runtime errors for unsupported things based on versions and I don't think minor version upgrades are a large issue for users.  I'm especially not looking forward to supporting interfaces that only exist in Spark 3.2 in a multiple Spark version support future.

> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <ao...@apple.com.INVALID> wrote:
> 
>> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here. 
> 
> That’s exactly the concern I have about Option 2 at this moment.
> 
>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just 2-3 latest versions in a major version. 
> 
> This is when it gets a bit complicated. If we want to support both Spark 3.1 and Spark 3.2 with a single module, it means we have to compile against 3.1. The problem is that we rely on DSv2 that is being actively developed. 3.2 and 3.1 have substantial differences. On top of that, we have our extensions that are extremely low-level and may break not only between minor versions but also between patch releases.
> 
>> f there are some features requiring a newer version, it makes sense to move that newer version in master.
> 
> Internally, we don’t deliver new features to older Spark versions as it requires a lot of effort to port things. Personally, I don’t think it is too bad to require users to upgrade if they want new features. At the same time, there are valid concerns with this approach too that we mentioned during the sync. For example, certain new features would also work fine with older Spark versions. I generally agree with that and that not supporting recent versions is not ideal. However, I want to find a balance between the complexity on our side and ease of use for the users. Ideally, supporting a few recent versions would be sufficient but our Spark integration is too low-level to do that with a single module.
>  
> 
>> On 13 Sep 2021, at 20:53, Jack Ye <yezhaoqin@gmail.com <ma...@gmail.com>> wrote:
>> 
>> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here. 
>> 
>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just 2-3 latest versions in a major version. This avoids the problem that some users are unwilling to move to a newer version and keep patching old Spark version branches. If there are some features requiring a newer version, it makes sense to move that newer version in master.
>> 
>> In addition, because currently Spark is considered the most feature-complete reference implementation compared to all other engines, I think we should not add artificial barriers that would slow down its development speed.
>> 
>> So my thinking is closer to option 1.
>> 
>> Best,
>> Jack Ye
>> 
>> 
>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <aokolnychyi@apple.com.invalid <ma...@apple.com.invalid>> wrote:
>> Hey folks,
>> 
>> I want to discuss our Spark version support strategy.
>> 
>> So far, we have tried to support both 3.0 and 3.1. It is great to support older versions but because we compile against 3.0, we cannot use any Spark features that are offered in newer versions.
>> Spark 3.2 is just around the corner and it brings a lot of important features such dynamic filtering for v2 tables, required distribution and ordering for writes, etc. These features are too important to ignore them.
>> 
>> Apart from that, I have an end-to-end prototype for merge-on-read with Spark that actually leverages some of the 3.2 features. I’ll be implementing all new Spark DSv2 APIs for us internally and would love to share that with the rest of the community.
>> 
>> I see two options to move forward:
>> 
>> Option 1
>> 
>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing minor versions with bug fixes.
>> 
>> Pros: almost no changes to the build configuration, no extra work on our side as just a single Spark version is actively maintained.
>> Cons: some new features that we will be adding to master could also work with older Spark versions but all 0.12 releases will only contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume any new Spark or format features.
>> 
>> Option 2
>> 
>> Move our Spark integration into a separate project and introduce branches for 3.0, 3.1 and 3.2.
>> 
>> Pros: decouples the format version from Spark, we can support as many Spark versions as needed.
>> Cons: more work initially to set everything up, more work to release, will need a new release of the core format to consume any changes in the Spark integration.
>> 
>> Overall, I think option 2 seems better for the user but my main worry is that we will have to release the format more frequently (which is a good thing but requires more work and time) and the overall Spark development may be slower.
>> 
>> I’d love to hear what everybody thinks about this matter.
>> 
>> Thanks,
>> Anton
>

Re: [DISCUSS] Spark version support strategy

Posted by Anton Okolnychyi <ao...@apple.com.INVALID>.

> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here. 

That’s exactly the concern I have about Option 2 at this moment.

> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just 2-3 latest versions in a major version. 

This is when it gets a bit complicated. If we want to support both Spark 3.1 and Spark 3.2 with a single module, it means we have to compile against 3.1. The problem is that we rely on DSv2 that is being actively developed. 3.2 and 3.1 have substantial differences. On top of that, we have our extensions that are extremely low-level and may break not only between minor versions but also between patch releases.

> f there are some features requiring a newer version, it makes sense to move that newer version in master.

Internally, we don’t deliver new features to older Spark versions as it requires a lot of effort to port things. Personally, I don’t think it is too bad to require users to upgrade if they want new features. At the same time, there are valid concerns with this approach too that we mentioned during the sync. For example, certain new features would also work fine with older Spark versions. I generally agree with that and that not supporting recent versions is not ideal. However, I want to find a balance between the complexity on our side and ease of use for the users. Ideally, supporting a few recent versions would be sufficient but our Spark integration is too low-level to do that with a single module.
 

> On 13 Sep 2021, at 20:53, Jack Ye <ye...@gmail.com> wrote:
> 
> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here. 
> 
> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just 2-3 latest versions in a major version. This avoids the problem that some users are unwilling to move to a newer version and keep patching old Spark version branches. If there are some features requiring a newer version, it makes sense to move that newer version in master.
> 
> In addition, because currently Spark is considered the most feature-complete reference implementation compared to all other engines, I think we should not add artificial barriers that would slow down its development speed.
> 
> So my thinking is closer to option 1.
> 
> Best,
> Jack Ye
> 
> 
> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <ao...@apple.com.invalid> wrote:
> Hey folks,
> 
> I want to discuss our Spark version support strategy.
> 
> So far, we have tried to support both 3.0 and 3.1. It is great to support older versions but because we compile against 3.0, we cannot use any Spark features that are offered in newer versions.
> Spark 3.2 is just around the corner and it brings a lot of important features such dynamic filtering for v2 tables, required distribution and ordering for writes, etc. These features are too important to ignore them.
> 
> Apart from that, I have an end-to-end prototype for merge-on-read with Spark that actually leverages some of the 3.2 features. I’ll be implementing all new Spark DSv2 APIs for us internally and would love to share that with the rest of the community.
> 
> I see two options to move forward:
> 
> Option 1
> 
> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing minor versions with bug fixes.
> 
> Pros: almost no changes to the build configuration, no extra work on our side as just a single Spark version is actively maintained.
> Cons: some new features that we will be adding to master could also work with older Spark versions but all 0.12 releases will only contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume any new Spark or format features.
> 
> Option 2
> 
> Move our Spark integration into a separate project and introduce branches for 3.0, 3.1 and 3.2.
> 
> Pros: decouples the format version from Spark, we can support as many Spark versions as needed.
> Cons: more work initially to set everything up, more work to release, will need a new release of the core format to consume any changes in the Spark integration.
> 
> Overall, I think option 2 seems better for the user but my main worry is that we will have to release the format more frequently (which is a good thing but requires more work and time) and the overall Spark development may be slower.
> 
> I’d love to hear what everybody thinks about this matter.
> 
> Thanks,
> Anton

Re: [DISCUSS] Spark version support strategy

Posted by Jack Ye <ye...@gmail.com>.

First of all, is option 2 a viable option? We discussed separating the
python module outside of the project a few weeks ago, and decided to not do
that because it's beneficial for code cross reference and more intuitive
for new developers to see everything in the same repository. I would expect
the same argument to also hold here.

Overall I would personally prefer us to not support all the minor versions,
but instead support maybe just 2-3 latest versions in a major version. This
avoids the problem that some users are unwilling to move to a newer version
and keep patching old Spark version branches. If there are some features
requiring a newer version, it makes sense to move that newer version in
master.

In addition, because currently Spark is considered the most
feature-complete reference implementation compared to all other engines, I
think we should not add artificial barriers that would slow down its
development speed.

So my thinking is closer to option 1.

Best,
Jack Ye

On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi
<ao...@apple.com.invalid> wrote:

> Hey folks,
>
> I want to discuss our Spark version support strategy.
>
> So far, we have tried to support both 3.0 and 3.1. It is great to support
> older versions but because we compile against 3.0, we cannot use any Spark
> features that are offered in newer versions.
> Spark 3.2 is just around the corner and it brings a lot of important
> features such dynamic filtering for v2 tables, required distribution and
> ordering for writes, etc. These features are too important to ignore them.
>
> Apart from that, I have an end-to-end prototype for merge-on-read with
> Spark that actually leverages some of the 3.2 features. I’ll be
> implementing all new Spark DSv2 APIs for us internally and would love to
> share that with the rest of the community.
>
> I see two options to move forward:
>
> Option 1
>
> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing
> minor versions with bug fixes.
>
> Pros: almost no changes to the build configuration, no extra work on our
> side as just a single Spark version is actively maintained.
> Cons: some new features that we will be adding to master could also work
> with older Spark versions but all 0.12 releases will only contain bug
> fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume
> any new Spark or format features.
>
> Option 2
>
> Move our Spark integration into a separate project and introduce branches
> for 3.0, 3.1 and 3.2.
>
> Pros: decouples the format version from Spark, we can support as many
> Spark versions as needed.
> Cons: more work initially to set everything up, more work to release, will
> need a new release of the core format to consume any changes in the Spark
> integration.
>
> Overall, I think option 2 seems better for the user but my main worry is
> that we will have to release the format more frequently (which is a good
> thing but requires more work and time) and the overall Spark development
> may be slower.
>
> I’d love to hear what everybody thinks about this matter.
>
> Thanks,
> Anton

Re: [DISCUSS] Spark version support strategy

Posted by Anton Okolnychyi <ao...@apple.com.INVALID>.

Hey Imran,

I don’t know why I forgot to mention this option too. It is definitely a solution to consider. We used this approach to support Spark 2 and Spark 3.
Right now, this would mean having iceberg-spark (common code for all versions), iceberg-spark2, iceberg-spark-3 (common code for all Spark 3 versions), and having iceberg-spark-3.1 and iceberg-spark-3.2.
We would also need to move our extensions into each module respectively as they differ.

The main reason to even consider Option 2 is the number of modules we will generate inside the main repo and the extra testing time for non-Spark PRs. If we decide in the future to support multiple versions for Hive/Flink/etc, this may get out of hand.

But like we all agree, Option 2 has a substantial limitation that will require us to release the core before it can be consumed in engine integrations. That’s why Option 3 you mention could be a way to go.

- Anton


> On 13 Sep 2021, at 21:04, Imran Rashid <ir...@cloudera.com.INVALID> wrote:
> 
> Thanks for bringing this up, Anton.
> 
> I am not entirely certain if your option 2 meant "project" in the "Apache project" sense or the "gradle project" sense -- it sounds like you mean "apache project".
> 
> If so, I'd propose Option 3:
> 
> Create a "spark-common" gradle project, which builds against the lowest spark version we plan to support (3.0 for now, I guess) and also creates interfaces for everything specific to different spark versions.  Also create "spark-3.x" gradle projects, which only build against specific gradle versions, and contain implementations for the interface in "spark-common"
> 
> Pros:
> * Can support as many Spark versions as needed, with each version getting as much as it can from its spark version
> * Spark support still integrated into the existing build & release process (I guess this could also be a con)
> 
> Cons:
> * work to setup the builds
> * multiple binaries, setup becomes more complicated for users
> * testing becomes tough as we increase the mix of supported versions
> 
> 
> 
> The "multiple binaries" could be solved with an "Option 4: put it all in one binary and use reflection", though imo this is really painful.
> 
> On Mon, Sep 13, 2021 at 9:39 PM Anton Okolnychyi <ao...@apple.com.invalid> wrote:
> Hey folks,
> 
> I want to discuss our Spark version support strategy.
> 
> So far, we have tried to support both 3.0 and 3.1. It is great to support older versions but because we compile against 3.0, we cannot use any Spark features that are offered in newer versions.
> Spark 3.2 is just around the corner and it brings a lot of important features such dynamic filtering for v2 tables, required distribution and ordering for writes, etc. These features are too important to ignore them.
> 
> Apart from that, I have an end-to-end prototype for merge-on-read with Spark that actually leverages some of the 3.2 features. I’ll be implementing all new Spark DSv2 APIs for us internally and would love to share that with the rest of the community.
> 
> I see two options to move forward:
> 
> Option 1
> 
> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing minor versions with bug fixes.
> 
> Pros: almost no changes to the build configuration, no extra work on our side as just a single Spark version is actively maintained.
> Cons: some new features that we will be adding to master could also work with older Spark versions but all 0.12 releases will only contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume any new Spark or format features.
> 
> Option 2
> 
> Move our Spark integration into a separate project and introduce branches for 3.0, 3.1 and 3.2.
> 
> Pros: decouples the format version from Spark, we can support as many Spark versions as needed.
> Cons: more work initially to set everything up, more work to release, will need a new release of the core format to consume any changes in the Spark integration.
> 
> Overall, I think option 2 seems better for the user but my main worry is that we will have to release the format more frequently (which is a good thing but requires more work and time) and the overall Spark development may be slower.
> 
> I’d love to hear what everybody thinks about this matter.
> 
> Thanks,
> Anton

Re: [DISCUSS] Spark version support strategy

Posted by Imran Rashid <ir...@cloudera.com.INVALID>.

Thanks for bringing this up, Anton.

I am not entirely certain if your option 2 meant "project" in the "Apache
project" sense or the "gradle project" sense -- it sounds like you mean
"apache project".

If so, I'd propose Option 3:

Create a "spark-common" gradle project, which builds against the lowest
spark version we plan to support (3.0 for now, I guess) and also creates
interfaces for everything specific to different spark versions.  Also
create "spark-3.x" gradle projects, which only build against specific
gradle versions, and contain implementations for the interface in
"spark-common"

Pros:
* Can support as many Spark versions as needed, with each version getting
as much as it can from its spark version
* Spark support still integrated into the existing build & release process
(I guess this could also be a con)

Cons:
* work to setup the builds
* multiple binaries, setup becomes more complicated for users
* testing becomes tough as we increase the mix of supported versions



The "multiple binaries" could be solved with an "Option 4: put it all in
one binary and use reflection", though imo this is really painful.

On Mon, Sep 13, 2021 at 9:39 PM Anton Okolnychyi
<ao...@apple.com.invalid> wrote:

> Hey folks,
>
> I want to discuss our Spark version support strategy.
>
> So far, we have tried to support both 3.0 and 3.1. It is great to support
> older versions but because we compile against 3.0, we cannot use any Spark
> features that are offered in newer versions.
> Spark 3.2 is just around the corner and it brings a lot of important
> features such dynamic filtering for v2 tables, required distribution and
> ordering for writes, etc. These features are too important to ignore them.
>
> Apart from that, I have an end-to-end prototype for merge-on-read with
> Spark that actually leverages some of the 3.2 features. I’ll be
> implementing all new Spark DSv2 APIs for us internally and would love to
> share that with the rest of the community.
>
> I see two options to move forward:
>
> Option 1
>
> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing
> minor versions with bug fixes.
>
> Pros: almost no changes to the build configuration, no extra work on our
> side as just a single Spark version is actively maintained.
> Cons: some new features that we will be adding to master could also work
> with older Spark versions but all 0.12 releases will only contain bug
> fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume
> any new Spark or format features.
>
> Option 2
>
> Move our Spark integration into a separate project and introduce branches
> for 3.0, 3.1 and 3.2.
>
> Pros: decouples the format version from Spark, we can support as many
> Spark versions as needed.
> Cons: more work initially to set everything up, more work to release, will
> need a new release of the core format to consume any changes in the Spark
> integration.
>
> Overall, I think option 2 seems better for the user but my main worry is
> that we will have to release the format more frequently (which is a good
> thing but requires more work and time) and the overall Spark development
> may be slower.
>
> I’d love to hear what everybody thinks about this matter.
>
> Thanks,
> Anton