Posted to dev@flume.apache.org by Ralph Goers <ra...@dslextreme.com> on 2022/03/27 09:19:29 UTC

Breaking up Flume

Sean, (and everyone else)

You mentioned that you want to create separate Maven modules to upgrade Hive and HBase. The Flume build is already very large. In addition, upgrading to Hive 3 looks like it will require Hadoop 3, while Hive 2 runs with Hadoop 2. This means both dependencies would need to be in the parent pom. I find this problematic for the following reasons:
- Flume contains a ton of dependencies and even more transitive dependencies that are not declared. This makes creating new releases really hard given how many dependencies have to be checked and upgraded.
- As more modules are added the build is just going to get slower.
- Some modules have dependencies on things that are no longer supported. Again, that makes creating a full Flume release hard.

I would suggest that, unless security fixes require it, we hold off on creating upgrades in 1.10.0 for HBase and Hive beyond what you have already done. Instead, we should create new repositories for the parts of Flume we want to separate and maintain independently. The HBase and Hive upgrades would end up going there.

I believe this will speed up development since builds will no longer take so long. It also means that PRs will go against the target repo, which should simplify things. Jira would remain the same as it is today. The component would be used to identify the target repo.

I would suggest that what remains in the main Flume build be primarily configuration, core, node, sdk, and some of configfilters. I would expect we would have separate repos for hbase, hdfs, hive, Kafka, embedded-agent, tools, and legacy to start.

Thoughts?

Ralph

Re: Breaking up Flume

Posted by Ralph Goers <ra...@dslextreme.com>.

Well, as always, if someone wants to code something to support an easier way for users to deploy Flume, I’m all for it. I know Tristan has been working on creating a Docker image, but I suspect it is going to need something like what you suggest, since the default Docker image should contain only the minimum needed for a working Flume instance. For example, I only use the file channel, Avro sources and sinks, and the Kafka sink. Everything else packaged by default would just be stuff I would want to remove.

Ralph

> On Mar 28, 2022, at 9:44 PM, Bessenyei Balázs Donát <be...@apache.org> wrote:
> 
>> I assume you mean the released source
> 
> I was thinking of a git reference (just like you can do with pip or
> npm) so that people can more easily mix and match, but I don't have
> strong opinions about this.
> 
> 
> Donat

Re: Breaking up Flume

Posted by Bessenyei Balázs Donát <be...@apache.org>.

> I assume you mean the released source

I was thinking of a git reference (just like you can do with pip or
npm) so that people can more easily mix and match, but I don't have
strong opinions about this.


Donat

On Mon, Mar 28, 2022 at 7:31 PM Ralph Goers <ra...@dslextreme.com> wrote:
>
> Every release a project does requires a vote and has to meet the ASF’s requirements for a release. That said, Apache Maven has seemingly dozens of plugins that are all independently managed and released. If you look at the Maven dev list you will see release votes happening for various things several times a month. But their process used to include something that each release manager updated to track the releases so they could be included in the board report. Now I believe that is all handled by the Apache Reporter service.
>
> I don’t believe our process would be quite that loose. For one, I really don’t consider the way Flume allows new components to be a true plugin architecture. I would still anticipate we would group releases of things, but nothing says it has to be like that.
>
> Downloading the source? I assume you mean the released source. That would be available either by downloading the release zip/tar from the ASF distribution site or by checking out the release tag from git. But I don’t understand why you would do that.
> I use a customized version of Flume but I build it from the flume zip. This is a bit painful as I have to then delete stuff I don’t want or want to override. It would actually be easier for me if I were to reference the various Flume artifacts I need as dependencies and use the dependency plugin to add them to the Flume application I am building.
>
> Ralph

Re: Breaking up Flume

Posted by Ralph Goers <ra...@dslextreme.com>.

Every release a project does requires a vote and has to meet the ASF’s requirements for a release. That said, Apache Maven has seemingly dozens of plugins that are all independently managed and released. If you look at the Maven dev list you will see release votes happening for various things several times a month. But their process used to include something that each release manager updated to track the releases so they could be included in the board report. Now I believe that is all handled by the Apache Reporter service.

I don’t believe our process would be quite that loose. For one, I really don’t consider the way Flume allows new components to be a true plugin architecture. I would still anticipate we would group releases of things, but nothing says it has to be like that.

Downloading the source? I assume you mean the released source. That would be available either by downloading the release zip/tar from the ASF distribution site or by checking out the release tag from git. But I don’t understand why you would do that. 
I use a customized version of Flume but I build it from the flume zip. This is a bit painful as I have to then delete stuff I don’t want or want to override. It would actually be easier for me if I were to reference the various Flume artifacts I need as dependencies and use the dependency plugin to add them to the Flume application I am building.
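
For illustration only, the approach described above (declaring the needed Flume artifacts as dependencies and letting the dependency plugin collect them) might be sketched as the pom.xml below. The project coordinates (com.example:custom-flume-agent), the output directory, and the plugin version are assumptions made for this example. The Flume coordinates follow the 1.9.0 artifact layout as best understood and should be checked against the repository before use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical build for a trimmed-down Flume agent; not an official Flume layout. -->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <!-- Assumed coordinates for the custom application -->
  <groupId>com.example</groupId>
  <artifactId>custom-flume-agent</artifactId>
  <version>1.0.0-SNAPSHOT</version>
  <packaging>pom</packaging>

  <properties>
    <!-- Assumed Flume version; adjust to the release actually in use -->
    <flume.version>1.9.0</flume.version>
  </properties>

  <!-- Only the pieces this agent needs, instead of the full binary distribution -->
  <dependencies>
    <dependency>
      <groupId>org.apache.flume</groupId>
      <artifactId>flume-ng-node</artifactId>
      <version>${flume.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flume.flume-ng-channels</groupId>
      <artifactId>flume-file-channel</artifactId>
      <version>${flume.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flume.flume-ng-sinks</groupId>
      <artifactId>flume-ng-kafka-sink</artifactId>
      <version>${flume.version}</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <version>3.3.0</version>
        <executions>
          <execution>
            <!-- Copy the declared dependencies and their transitive runtime
                 dependencies into a lib directory for the agent -->
            <id>copy-flume-libs</id>
            <phase>package</phase>
            <goals>
              <goal>copy-dependencies</goal>
            </goals>
            <configuration>
              <outputDirectory>${project.build.directory}/flume/lib</outputDirectory>
              <includeScope>runtime</includeScope>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
```

With a build along these lines, running `mvn package` would place only the selected jars under target/flume/lib, so nothing has to be deleted from an unpacked distribution afterwards.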

Ralph 

> On Mar 28, 2022, at 9:57 AM, Bessenyei Balázs Donát <be...@apache.org> wrote:
> 
> What does the "module releases" thing look like from an ASF release
> (process - voting, etc.) perspective?
> 
> Alternatively, do we want a mechanism to be able to add modules
> directly from source? (Homebrew-style)
> 
> 
> Donat

Re: Breaking up Flume

Posted by Bessenyei Balázs Donát <be...@apache.org>.

What does the "module releases" thing look like from an ASF release
(process - voting, etc.) perspective?

Alternatively, do we want a mechanism to be able to add modules
directly from source? (Homebrew-style)


Donat

On Mon, Mar 28, 2022 at 6:43 PM Ralph Goers <ra...@dslextreme.com> wrote:
>
> Thanks for the reply!
>
> In general I agree with what you are proposing. I’d probably suggest once a quarter instead of every 2 months. I also wouldn’t necessarily have a release of every component every quarter. If there have been no changes there isn’t much of a point. And requiring that everything be released together doesn’t really help. I would suggest that Flume would have a flume-parent module that includes a parent pom.xml that all projects would inherit from. It would include a dependency management section that declares the version of dependencies that are used across projects. In addition we would want a flume-bom that contains a pom.xml that includes a dependency management section declaring all the versions of all components for a specific Flume quarterly release.
>
> As for the versions, I am not sure why you wouldn’t just go with 2.0.0-alpha, 2.0.0-beta or 2.0.0-beta1, 2.0.0-beta2 if you aren’t comfortable labeling them as GA. Once things are stable you would then release 2.0.0.
>
> Ralph

Re: Breaking up Flume

Posted by Ralph Goers <ra...@dslextreme.com>.

Thanks for the reply! 

In general I agree with what you are proposing. I’d probably suggest once a quarter instead of every 2 months. I also wouldn’t necessarily have a release of every component every quarter. If there have been no changes there isn’t much of a point. And requiring that everything be released together doesn’t really help. I would suggest that Flume would have a flume-parent module that includes a parent pom.xml that all projects would inherit from. It would include a dependency management section that declares the version of dependencies that are used across projects. In addition we would want a flume-bom that contains a pom.xml that includes a dependency management section declaring all the versions of all components for a specific Flume quarterly release.
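
As a rough sketch of the flume-bom idea, a BOM is simply a pom.xml with pom packaging whose dependency management section pins one version per component. The artifactId flume-bom, the 2.0.0-alpha version label, the choice of components, and their individual version numbers are all hypothetical here; nothing below reflects a published Flume artifact.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical flume-bom/pom.xml: pins the component versions that make up
     one coordinated Flume release. All versions below are illustrative. -->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.apache.flume</groupId>
  <artifactId>flume-bom</artifactId>
  <version>2.0.0-alpha</version>
  <packaging>pom</packaging>

  <dependencyManagement>
    <dependencies>
      <!-- Components that would stay in the main Flume build -->
      <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>2.0.0-alpha</version>
      </dependency>
      <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-node</artifactId>
        <version>2.0.0-alpha</version>
      </dependency>

      <!-- Components released from their own repositories may carry their
           own version numbers; the BOM records which ones belong together -->
      <dependency>
        <groupId>org.apache.flume.flume-ng-sinks</groupId>
        <artifactId>flume-hdfs-sink</artifactId>
        <version>2.0.1</version>
      </dependency>
      <dependency>
        <groupId>org.apache.flume.flume-ng-sinks</groupId>
        <artifactId>flume-ng-kafka-sink</artifactId>
        <version>2.1.0</version>
      </dependency>
    </dependencies>
  </dependencyManagement>
</project>
```

A downstream project would then import this BOM in its own dependency management section, using `<type>pom</type>` and `<scope>import</scope>`, and declare the Flume components it needs without version numbers.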

As for the versions, I am not sure why you wouldn’t just go with 2.0.0-alpha, 2.0.0-beta or 2.0.0-beta1, 2.0.0-beta2 if you aren’t comfortable labeling them as GA. Once things are stable you would then release 2.0.0.

Ralph

> On Mar 28, 2022, at 7:24 AM, Sean Busbey <sb...@apple.com.INVALID> wrote:
> 
> That’s a really interesting possibility.
> 
> For the 1.10 release I think we should still upgrade the Hive 1 version to the latest 1.y available, but I agree we’d be well served to get a handle on the increasing set of possible dependencies. A 2.0 release would be a great time to change around how deployment works so that folks don’t expect everything to show up in a single omnibus tarball from a single build as they do now.
> 
> There are a lot of things to take care of to make that transition less painful, so I’d suggest we get an overall approach described but try to address it incrementally so we’re not facing a very long delay for further project releases.
> 
> How about something like this?
> 
> - Release 1.10.0 soon, only backward compatible releases
> - Release 1.y.0 - every other month, backward compatible dependency updates and bug fixes
> - Release 2.0 alpha - break up project into multiple repos, establish release cadence(s) w/o binary artifacts
> - Release 2.1 beta - have an “easy path” convenience binary
> - Release 2.2 expected to be production ready
> 
> For at least those parts of the process that don’t require project svn access I can help with keeping regular 1.y maintenance releases going. We could decide ahead of time on when to stop them; e.g. 6 months after the first “production ready” flume 2.y release.
> 
> For the 2.y releases, I think we’re going to have some growing pains in managing how we get from multiple repositories to PMC-blessed releases and from there to artifacts someone could use to run Flume if they’re used to our current deployment model. Setting expectations via alpha/beta labels and stated packaging goals means we should be able to work out friction points while still walking before we try to run with a long-term sustainable path for the project. We could try to put some goal dates on those milestones once we have spent some time discussing details and trying to move things forward.
> 
>> On Mar 27, 2022, at 4:19 AM, Ralph Goers <ra...@dslextreme.com> wrote:
>> 
>> Sean, (and everyone else)
>> 
>> You mentioned that you want to create separate Maven modules to upgrade Hive and HBase. The Flume build is already very large. In addition, upgrading to Hive 3 looks like it will require Hadoop 3 while Hive 2 runs with Hadoop 2. This means both dependencies would need to be in the parent pom. I find this problematic for the following reasons:
>> - Flume contains a ton of dependencies and even more transitive dependencies that are not declared. This makes creating new releases really hard given how many dependencies have to be checked and upgraded.
>> - As more modules are added the build is just going to get slower.
>> - Some modules have dependencies on things that are no longer supported. Again, that makes creating a full Flume release hard.
>> 
>> I would suggest that, unless security fixes require it, we hold off on creating upgrades in 1.10.0 for HBase and Hive beyond what you have already done. Instead, we should create new repositories for the parts of Flume we want to separate and maintain independently. The HBase and Hive upgrades would end up going there.
>> 
>> I believe this will speed up development since builds will no longer take so long. It also means that PRs will go against the target repo, which should simplify things. Jira would remain the same as it is today. The component would be used to identify the target repo.
>> 
>> I would suggest that what should remain in the main Flume build would be primarily configuration, core, node, sdk, and some of configfilters. I would expect we would have separate repos for hbase, hdfs, hive, Kafka, embedded-agent, tools, and legacy to start.
>> 
>> Thoughts?
>> 
>> Ralph

Re: Breaking up Flume

Posted by Sean Busbey <sb...@apple.com.INVALID>.

That’s a really interesting possibility.

For the 1.10 release I think we should still upgrade the Hive 1 version to the latest 1.y available, but I agree we’d be well served to get a handle on the increasing set of possible dependencies. A 2.0 release would be a great time to change around how deployment works so that folks don’t expect everything to show up in a single omnibus tarball from a single build as they do now.

There are a lot of things to take care of to make that transition less painful, so I’d suggest we get an overall approach described but try to address it incrementally so we’re not facing a very long delay for further project releases.

How about something like this?

- Release 1.10.0 soon, only backward compatible releases
- Release 1.y.0 - every other month, backward compatible dependency updates and bug fixes
- Release 2.0 alpha - break up project into multiple repos, establish release cadence(s) w/o binary artifacts
- Release 2.1 beta - have an “easy path” convenience binary
- Release 2.2 expected to be production ready

For at least those parts of the process that don’t require project svn access I can help with keeping regular 1.y maintenance releases going. We could decide ahead of time on when to stop them; e.g. 6 months after the first “production ready” flume 2.y release.

For the 2.y releases, I think we’re going to have some growing pains in managing how we get from multiple repositories to PMC-blessed releases and from there to artifacts someone could use to run Flume if they’re used to our current deployment model. Setting expectations via alpha/beta labels and stated packaging goals means we should be able to work out friction points while still walking before we try to run with a long-term sustainable path for the project. We could try to put some goal dates on those milestones once we have spent some time discussing details and trying to move things forward.

> On Mar 27, 2022, at 4:19 AM, Ralph Goers <ra...@dslextreme.com> wrote:
> 
> Sean, (and everyone else)
> 
> You mentioned that you want to create separate Maven modules to upgrade Hive and HBase. The Flume build is already very large. In addition, upgrading to Hive 3 looks like it will require Hadoop 3 while Hive 2 runs with Hadoop 2. This means both dependencies would need to be in the parent pom. I find this problematic for the following reasons:
> - Flume contains a ton of dependencies and even more transitive dependencies that are not declared. This makes creating new releases really hard given how many dependencies have to be checked and upgraded.
> - As more modules are added the build is just going to get slower.
> - Some modules have dependencies on things that are no longer supported. Again, that makes creating a full Flume release hard.
> 
> I would suggest that, unless security fixes require it, we hold off on creating upgrades in 1.10.0 for HBase and Hive beyond what you have already done. Instead, we should create new repositories for the parts of Flume we want to separate and maintain independently. The HBase and Hive upgrades would end up going there.
> 
> I believe this will speed up development since builds will no longer take so long. It also means that PRs will go against the target repo, which should simplify things. Jira would remain the same as it is today. The component would be used to identify the target repo.
> 
> I would suggest that what should remain in the main Flume build would be primarily configuration, core, node, sdk, and some of configfilters. I would expect we would have separate repos for hbase, hdfs, hive, Kafka, embedded-agent, tools, and legacy to start.
> 
> Thoughts?
> 
> Ralph