
Posted to dev@bigtop.apache.org by Evans Ye <ev...@apache.org> on 2021/11/02 16:34:20 UTC

[DISCUSS] New Features post Bigtop 3.0

Hi folks,

With Bigtop 3.0 released, I think it's time to discuss our next steps.
Of course, the open source version of a unified, compatible Hadoop distro
is still our core product going forward. But the surrounding value-added
features might be what takes us beyond where we are today. Let me post
some ideas to start the brainstorming.

1. Deployment on K8S: Ambari or Bigtop Puppet as K8S operators.
2. MLOps integrations: MLFlow, Submarine.
3. Data Lake integrations: Hudi, Iceberg, Delta.

On the software engineering side, I think we can clean up outdated
features such as:
1. vagrant provisioner
2. docker sandbox
3. bigtop-ci
4. bigtop-data-generators
5. bigtop-bigpetstore

Any thoughts? Would love to hear from all of you.

Re: [DISCUSS] New Features post Bigtop 3.0

Posted by Ganesh Raju <ga...@linaro.org>.

+1 for K8S
+1 for MLFlow
+1 Pulsar



-- 
IRC: ganeshraju@#linaro on irc.freenode.net

Re: [DISCUSS] New Features post Bigtop 3.0

Posted by "Youngwoo Kim (김영우)" <yw...@apache.org>.

Hey Evans,

My comments inline.

> Could you elaborate on which component is needed for real-time event
> streaming? Kafka + Flink in the current stack are the solution for it. Pulsar
> can be an addition.

I believe Kafka is the most common choice for an OSS message broker / event
hub. As you know, Flink (or Spark / Kafka Streams itself) + Kafka are
everywhere. Apache Pulsar would be an alternative to Kafka.

> Regarding query processing, could you share more insight on the difference
> between Pinot and Presto? Is Pinot suitable for having Looker plugged in
> front of it for analytical purposes?

Apache Pinot and Druid are designed for real-time and time-series data
analytics, whereas PrestoDB (or Trino / MPP databases) is used to crunch
big data from a general-purpose data lake or 'lakehouse'.
So I think it's not a replacement but rather a complement
for data platforms.

Thanks,
Youngwoo


Re: [DISCUSS] New Features post Bigtop 3.0

Posted by Yuqi Gu <gu...@apache.org>.

> 1. Deployment on K8S: Ambari or Bigtop Puppet as K8S operators.

Agreed with Youngwoo that a k8s operator would be a good candidate to
automate the management of the entire lifecycle of our stack.
And just as Luca mentioned, it's not easy to write a new operator.
IMO, we'd better first check whether there is already a good operator out
there doing the job (https://operatorhub.io/).
If not, we could make use of the k8s Operator SDK
<https://github.com/operator-framework/operator-sdk> (or the Java-based
Operator SDK <https://github.com/java-operator-sdk/java-operator-sdk>) to
implement our own k8s operator.
It is challenging but interesting work for us. :)

BRs,
Yuqi




Masatake Iwasaki <iw...@oss.nttdata.co.jp> wrote on Tue, Nov 9, 2021 at 10:07 PM:

> > Hi Kengo/Masatake,
> > Do you have any features that your company needs to move its business
> > forward?
>
> Maintaining "the open source ver. of unified compatible Hadoop Distro"
> over the long term, with more frequent releases, is my primary concern.
> Cleaning up obsolete features/products would help with that.
> Adding Ozone (BIGTOP-3445) could add value for ML workloads too.
>
> Thanks,
> Masatake Iwasaki
>

Re: [DISCUSS] New Features post Bigtop 3.0

Posted by Masatake Iwasaki <iw...@oss.nttdata.co.jp>.

> Hi Kengo/Masatake,
> Do you have any features that your company needs to move its business
> forward?

Maintaining "the open source ver. of unified compatible Hadoop Distro" over the long term,
with more frequent releases, is my primary concern.
Cleaning up obsolete features/products would help with that.
Adding Ozone (BIGTOP-3445) could add value for ML workloads too.

Thanks,
Masatake Iwasaki

On 2021/11/05 22:53, Evans Ye wrote:
> Hi Matt Andruff,
> Agree.
> 
> Hi Youngwoo,
> Could you elaborate on which component is needed for real-time event
> streaming? Kafka + Flink in the current stack are the solution for it. Pulsar
> can be an addition.
> Regarding query processing, could you share more insight on the difference
> between Pinot and Presto? Is Pinot suitable for having Looker plugged in
> front of it for analytical purposes?
>
> Hi Kengo/Masatake,
> Do you have any features that your company needs to move its business
> forward?
>
>
> BTW, we had some discussion on this topic in the past that you can take as a
> reference [1].
> 
> [1]
> https://docs.google.com/document/d/1F2Gxu8GARQDZXgqHn12LKkQ5wCV_AF4b_tVmjYB6YfA/edit#
> 

Re: [DISCUSS] New Features post Bigtop 3.0

Posted by Evans Ye <ev...@apache.org>.

Hi Matt Andruff,
Agree.

Hi Youngwoo,
Could you elaborate on which component is needed for real-time event
streaming? Kafka + Flink in the current stack are the solution for it. Pulsar
can be an addition.
Regarding query processing, could you share more insight on the difference
between Pinot and Presto? Is Pinot suitable for having Looker plugged in
front of it for analytical purposes?

Hi Kengo/Masatake,
Do you have any features that your company needs to move its business
forward?


BTW, we had some discussion on this topic in the past that you can take as a
reference [1].

[1]
https://docs.google.com/document/d/1F2Gxu8GARQDZXgqHn12LKkQ5wCV_AF4b_tVmjYB6YfA/edit#

Youngwoo Kim (김영우) <yw...@apache.org> wrote on Wed, Nov 3, 2021 at 4:02 PM:

> Evans,
> Thanks for starting this discussion.
>
> Hopefully it would be valuable to integrate a real-time event streaming
> and query processing stack, e.g., Apache Druid, Pinot, Pulsar, etc.
>
> And a 'k8s operator for Bigtop' looks promising to me!
>
> Thanks,
> Youngwoo
>

Re: [DISCUSS] New Features post Bigtop 3.0

Posted by "Youngwoo Kim (김영우)" <yw...@apache.org>.

Evans,
Thanks for starting this discussion.

Hopefully it would be valuable to integrate a real-time event streaming
and query processing stack, e.g., Apache Druid, Pinot, Pulsar, etc.

And a 'k8s operator for Bigtop' looks promising to me!

Thanks,
Youngwoo


Re: [DISCUSS] New Features post Bigtop 3.0

Posted by Matt Andruff <ma...@andruffsolutions.com>.

I'd love to see the Ambari MPack for Bigtop 3.0. I think work on it is
already starting, and it would be really nice to have a GUI to manage Bigtop.


Re: [DISCUSS] New Features post Bigtop 3.0

Posted by "Youngwoo Kim (김영우)" <yw...@apache.org>.

Hey Luca,

Good point!
Helm itself is just a package manager for containerized applications on
k8s. I agree that Helm with k8s objects and controllers is sufficient in
most cases. But if we *really* need complex operations or reconciliation
for our stack, I believe a k8s operator would be a good candidate. Also,
customizing Custom Resources for multiple environments does not look
trivial, but tools like Kustomize would be useful for customizing
manifests.
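
To make the "reconciliation" idea a bit more concrete, here is a minimal
sketch in plain Python. It does not use any Kubernetes library, and the
component names and the `replicas` field are made up for illustration. It
only shows the core comparison an operator repeats: diff the desired state
against the observed state and derive the actions needed to converge.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    verb: str  # "create", "update", or "delete"
    name: str  # component name (hypothetical, for illustration only)


def reconcile(desired: dict[str, dict], observed: dict[str, dict]) -> list[Action]:
    """Return the actions needed to move `observed` toward `desired`.

    Both arguments map a component name to its spec. A real operator would
    read `desired` from a Custom Resource and `observed` from the cluster,
    and would run this comparison again whenever either one changes.
    """
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in observed:
            actions.append(Action("create", name))
        elif observed[name] != spec:
            actions.append(Action("update", name))
    for name in sorted(observed):
        if name not in desired:
            actions.append(Action("delete", name))
    return actions


if __name__ == "__main__":
    # Hypothetical stack description, not an actual Bigtop resource schema.
    desired = {
        "hdfs-namenode": {"replicas": 1},
        "hdfs-datanode": {"replicas": 3},
    }
    observed = {
        "hdfs-datanode": {"replicas": 2},
        "old-sandbox": {"replicas": 1},
    }
    for action in reconcile(desired, observed):
        print(action.verb, action.name)
```

A Helm chart alone applies a rendered manifest once per install/upgrade,
while an operator keeps running a loop like this, which is where the extra
complexity (and the extra capability) comes from.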

Thanks,
Youngwoo

On Sat, Nov 6, 2021 at 6:37 PM Luca Toscano <to...@gmail.com> wrote:

> > 1. Deployment on K8S: Ambari or Bigtop Puppet as K8S operators.
>
> I am wondering how complex it is to write a Kubernetes Operator (which
> I assume would be a Go-based application that talks to the
> Kubernetes API) vs. writing Helm charts (or similar). We use the latter
> extensively at Wikimedia (but not for any Hadoop-related configs) and
> it works really well.
> Tools like Helmfile (https://github.com/roboll/helmfile) are also very
> nice for bootstrapping and managing different
> environments/clusters/configurations. The Helm+Helmfile pair seems
> to be closer to what Bigtop currently does with Puppet, so it may
> be an alternative (before writing an Operator) for figuring out how to
> handle configs.
> For example, how is the Operator going to apply/create/etc.
> configurations? I worked with Istio recently (https://istio.io/), and
> they offer tools that basically wrap Helm configurations (via a binary
> client-side tool or a K8s Operator) under the hood. I've never written a
> K8s operator, so my understanding could be completely wrong!
>

Re: [DISCUSS] New Features post Bigtop 3.0

Posted by Luca Toscano <to...@gmail.com>.

Hi Evans!

On Tue, Nov 2, 2021 at 5:35 PM Evans Ye <ev...@apache.org> wrote:
>
> Hi folks,
>
> With Bigtop 3.0 released, I think it's time to discuss our next steps.
> Of course, the open source version of a unified, compatible Hadoop distro
> is still our core product going forward. But the surrounding value-added
> features might be what takes us beyond where we are today. Let me post
> some ideas to start the brainstorming.
>
> 1. Deployment on K8S: Ambari or Bigtop Puppet as K8S operators.

I am wondering how complex it is to write a Kubernetes Operator (which
I assume would be a Go-based application that talks to the
Kubernetes API) vs. writing Helm charts (or similar). We use the latter
extensively at Wikimedia (but not for any Hadoop-related configs) and
it works really well.
Tools like Helmfile (https://github.com/roboll/helmfile) are also very
nice for bootstrapping and managing different
environments/clusters/configurations. The Helm+Helmfile pair seems
to be closer to what Bigtop currently does with Puppet, so it may
be an alternative (before writing an Operator) for figuring out how to
handle configs.
For example, how is the Operator going to apply/create/etc.
configurations? I worked with Istio recently (https://istio.io/), and
they offer tools that basically wrap Helm configurations (via a binary
client-side tool or a K8s Operator) under the hood. I've never written a
K8s operator, so my understanding could be completely wrong!
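
As a rough illustration of the "different environments/configurations"
point, the layering idea can be sketched in plain Python: a base set of
values plus a small per-environment override, merged recursively. This is
only a sketch of the concept, not how Helm or Helmfile are implemented, and
all keys and values below are invented for the example.

```python
from __future__ import annotations

import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` is layered on top of `base`.

    Nested dicts are merged key by key; any other value in `override`
    replaces the value from `base`. Neither input is modified.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    # Hypothetical defaults shared by every environment.
    base = {"hdfs": {"replication": 3, "datanodes": 3}, "monitoring": False}
    # Hypothetical production override: only the differences are listed.
    production = {"hdfs": {"datanodes": 12}, "monitoring": True}
    print(deep_merge(base, production))
```

The appeal is that each environment file stays small and reviewable, which
is similar in spirit to how Puppet/Hiera data is layered today.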

> 2. MLOps integrations: MLFlow, Submarine.

At Wikimedia we are using KServe/Kubeflow; it may be a good addition
to the list. We are using OpenStack's Swift as object storage for
models since it offers an S3 API. Apache Ozone could represent a very
nice alternative (I saw some traction in the Jira, I'll try to
help/review if needed!).

> 3. Data Lake integrations: Hudi, Iceberg, Delta.
+1, our plan is to experiment with Apache Iceberg very soon :)

> On the software engineering side, I think we can clean up outdated
> features such as:
> 1. vagrant provisioner
> 2. docker sandbox
> 3. bigtop-ci
> 4. bigtop-data-generators
> 5. bigtop-bigpetstore

Something else that would be nice:
1) Upgrade the Puppet version where needed (I know that Bigtop needs
to keep compatibility with OS distros that offer older versions of
Puppet, etc.)
2) Migrate init.d scripts to systemd units where possible (for
example, in distros like Debian where it is fully supported).
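
For item 2, a minimal sketch of what the target could look like: a small
Python helper that renders a basic systemd service unit. The directives
used are standard systemd ones, but the description, command path, and user
below are assumptions for illustration and are not taken from Bigtop's
packaging.

```python
def render_unit(description: str, exec_start: str, user: str) -> str:
    """Render a minimal systemd service unit as text.

    Only a handful of common directives are emitted; a real migration
    would also need to carry over environment files, limits, and so on
    from the existing init.d script.
    """
    lines = [
        "[Unit]",
        f"Description={description}",
        "After=network.target",
        "",
        "[Service]",
        "Type=simple",
        f"User={user}",
        f"ExecStart={exec_start}",
        "Restart=on-failure",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values: the binary path and user are assumptions.
    print(render_unit(
        description="Hadoop HDFS DataNode (example)",
        exec_start="/usr/bin/hdfs datanode",
        user="hdfs",
    ))
```

Running the service in the foreground with Type=simple lets systemd
supervise the process directly, instead of relying on the PID-file handling
that init.d scripts typically need.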

I understand that the above tasks are very complex and require a
lot of work :) They may not be super important given the
Kubernetes work above to focus on, but I thought it was good to mention
them!

Thanks a lot for all the work!

Luca