Posted to dev@flume.apache.org by Ralph Goers <ra...@dslextreme.com> on 2023/02/26 07:08:18 UTC

Breaking up Flume (again)

As I mentioned last year, I would like to start breaking up Flume into separate repos. There are a few reasons for this:
1. Flume has grown so large that the CI system can no longer build it. The jobs run out of disk space due to the large logs.
2. The build takes a very long time to run.
3. There are several components that can no longer practically be supported.

To this end I am planning on creating the following Git repos:
flume-hadoop
flume-http
flume-irc
flume-jdbc
flume-jms
flume-kafka
flume-kudu
flume-legacy
flume-morphline
flume-scribe
flume-search
flume-spring-boot
flume-twitter

For the time being I would propose everything else remain in the current Flume repo.

Note that as each of these is populated they will each need to be released. However, most of these are fairly inactive, so after the initial release they may not need to be touched very often.

Also, since Jira now requires new users to ask us to create accounts for them, I would propose that each of these repos be configured to enable GitHub Issues as it is set up.

I am looking for feedback on this but if I don’t get any I plan to start work on this within a week or so.

Ralph

Re: Breaking up Flume (again)

Posted by Ralph Goers <ra...@dslextreme.com>.

See below

> On Mar 2, 2023, at 3:01 PM, Sean Busbey <sb...@apple.com.INVALID> wrote:
> 
> Yay! I am very enthusiastic for this progress. Big +1 from me to use this as a chance to also transition off of jira.
> 
> A couple of questions / concerns:
> 
> a) when / how are we going to add additional repositories? Are we adding one whenever someone comes with a new source/sink?

I’d say it depends. For example, flume-search was added for the out-of-date Elasticsearch stuff, but I would also add support for Amazon’s OpenSearch there if we want to support it. So I would say that the guiding principle should be that everything in the repo is related in some way and is small enough that a group of individuals could support it easily.

> 
> Do we want to have a flume-contrib or the like that things could land in initially? If we do this, can some of these components that are primarily examples live there? I’m thinking of things like flume-twitter, flume-morphline, flume-kudu, flume-legacy.

I wouldn’t create a flume-contrib as that is just going to be a catch-all for stuff. I also see no relationship between twitter, morphline and legacy.

> 
> b) how broad is the flume-hadoop meant to be? I presume the hbase sink(s) won’t be staying in flume-core. I would argue for an independent flume-hbase so we can use the current hbase client libraries without having to worry about Hadoop specifics.

I would add the components that are impacted whenever a Hadoop major release occurs. So I would think that would be Hive, HBase, and HDFS. Possibly Kudu, but I am not familiar enough with Kudu to know if it makes sense to bundle it in.

> 
> c) I share some of Tristan’s concerns on downstream consumption. Can we add in a packaging repo that initially provides some kind of minimal downstream consumable flume to deploy as well as an omnibus deploy that contains everything like we have today?

Well, at the very least we will want to have a BOM pom. I’m certainly open to providing variations of a “packaged and deployable” Flume.
> 
> d) when talking about components that can’t practically be supported, I’d like to flag up the current twitter source. It’s great for flume demos, but our current use relies on an API that is dropping out of support. Additionally, there is the newly looming possibility that there won’t be a free-for-use tier of the API to test against.

I was under the impression that a PR recently got merged to upgrade the Twitter dependency. If it still uses the old API then yeah, it fits in that same category.

To be clear, my main goals for doing this are: a) to be able to build and test Flume in a reasonable amount of time (flume-ng-core is by far the slowest module, so this change is only going to help marginally with that), and b) to reduce the size of the build so that the CI system can actually run builds and tests for us automatically. Right now the CI system is useless.

Ralph

Re: Breaking up Flume (again)

Posted by Sean Busbey <sb...@apple.com.INVALID>.

Yay! I am very enthusiastic for this progress. Big +1 from me to use this as a chance to also transition off of jira.

A couple of questions / concerns:

a) when / how are we going to add additional repositories? Are we adding one whenever someone comes with a new source/sink?

Do we want to have a flume-contrib or the like that things could land in initially? If we do this, can some of these components that are primarily examples live there? I’m thinking of things like flume-twitter, flume-morphline, flume-kudu, flume-legacy.

b) how broad is the flume-hadoop meant to be? I presume the hbase sink(s) won’t be staying in flume-core. I would argue for an independent flume-hbase so we can use the current hbase client libraries without having to worry about Hadoop specifics.

c) I share some of Tristan’s concerns on downstream consumption. Can we add in a packaging repo that initially provides some kind of minimal downstream consumable flume to deploy as well as an omnibus deploy that contains everything like we have today?

d) when talking about components that can’t practically be supported, I’d like to flag up the current twitter source. It’s great for flume demos, but our current use relies on an API that is dropping out of support. Additionally, there is the newly looming possibility that there won’t be a free-for-use tier of the API to test against.


Re: Breaking up Flume (again)

Posted by Ralph Goers <ra...@dslextreme.com>.

I can empathize with that. The way Flume has been packaged as a deployable zip makes it seem like adding stuff would be a problem. However, I realized what I was doing previously was completely ridiculous.

In my use of Flume I have some custom components, so I was using the Maven dependency plugin to unpack the Flume zip. I then deleted or replaced various jars and added my own before repackaging it. This was painful, and the build had to be hand-modified for every Flume release.

In moving to leverage Spring Boot I realized it should be treated as a “normal” Java application where my pom actually specifies all the dependencies I want. This means I don’t use the distribution zip at all any more, and my build makes much more sense. It also means I don’t have nearly as many potential security vulnerabilities, since I am not bringing in all the Flume modules I don’t use.

So I would suggest that thinking of Flume as a monolithic tool to be deployed, much like Fluentd or Logstash are, is probably not the best way to view it.

Ralph

> On Mar 2, 2023, at 7:53 AM, Tristan Stevens <tr...@apache.org> wrote:
> 
> It's a non-binding -1 from me. My concern is that we actually increase the complexity of the deployment and end-user experience by doing this. All of the separate modules are built into separate maven artifacts anyway, so if people do want to package it up then they can.
> 
> My fear is that whatever we gain by splitting it up, we lose in terms of making it harder for people to deploy and use.
> 
> Tristan

Re: Breaking up Flume (again)

Posted by Tristan Stevens <tr...@apache.org>.

It's a non-binding -1 from me. My concern is that we actually increase the complexity of the deployment and end-user experience by doing this. All of the separate modules are built into separate maven artifacts anyway, so if people do want to package it up then they can.

My fear is that whatever we gain by splitting it up, we lose in terms of making it harder for people to deploy and use.

Tristan
________________________________
From: Ralph Goers <ra...@dslextreme.com>
Sent: 26 February 2023 18:50
To: dev@flume.apache.org <de...@flume.apache.org>
Subject: Re: Breaking up Flume (again)

The morphline solr sink has a dependency on Kite, which is a project abandoned by Cloudera. Someone would have to copy the relevant parts into the morphline repo and maintain them there. I have no interest myself in doing that.

I already split the Elasticsearch sink into the flume-search repo. As I recall I had problems building it. We have discussed that in other emails. It needs to be upgraded. I suspect the API we would have to use has an acceptable license but I believe ES itself has licensing problems.

To be honest, I don’t know what the deal is with the legacy sources and why we even have them. We have an Avro source and Thrift source in Flume Core so I don’t know why we even keep them around.

I personally don’t use Hadoop or any of its related technology. While I know those are important, it is likely I personally will only apply PRs to any of them.

Ralph


Re: Breaking up Flume (again)

Posted by Ralph Goers <ra...@dslextreme.com>.

The morphline solr sink has a dependency on Kite, which is a project abandoned by Cloudera. Someone would have to copy the relevant parts into the morphline repo and maintain them there. I have no interest myself in doing that.

I already split the Elasticsearch sink into the flume-search repo. As I recall I had problems building it. We have discussed that in other emails. It needs to be upgraded. I suspect the API we would have to use has an acceptable license but I believe ES itself has licensing problems. 

To be honest, I don’t know what the deal is with the legacy sources and why we even have them. We have an Avro source and Thrift source in Flume Core so I don’t know why we even keep them around.

I personally don’t use Hadoop or any of its related technology. While I know those are important, it is likely I personally will only apply PRs to any of them.

Ralph

> On Feb 26, 2023, at 10:29 AM, Bessenyei Balázs Donát <be...@apache.org> wrote:
> 
> +1.
> 
> For #3, which ones do you think can no longer be practically supported?
> 
> 
> Donat

Re: Breaking up Flume (again)

Posted by Bessenyei Balázs Donát <be...@apache.org>.

+1.

For #3, which ones do you think can no longer be practically supported?


Donat
